copy
Copying a dataset from one location to another can be error-prone and time-consuming. This command-line script allows incremental copying: if the copying process fails, it can be resumed. It can be used to copy files from a local directory to a remote server, or from a remote server to a local directory, as long as a Zarr backend is available to read and write the data.
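For example, a local Zarr dataset could be copied to an object store as follows (both locations below are hypothetical placeholders):

   anemoi-datasets copy /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr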
The script uses multiple threads to speed up the process. However, keep in mind that making parallel requests to the same server may not be ideal, for instance if the server internally uses a limited number of threads to handle requests.
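If the server is sensitive to concurrent requests, the number of parallel transfers can be reduced from its default of 8, for example (the locations are again placeholders):

   anemoi-datasets copy --transfers 2 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr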
The option to rechunk the data is available, which can be useful when the data is stored on a platform that does not support many small files or many files in the same directory. However, keep in mind that rechunking has a large impact on performance when reading the data: the chunking pattern of the source dataset was defined for good reasons, and changing it is very likely to degrade performance.
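If rechunking is nevertheless needed, a sketch of the invocation is shown below, where <chunk-spec> is a placeholder for whatever chunk specification the --rechunk option accepts and the locations are hypothetical:

   anemoi-datasets copy --rechunk <chunk-spec> /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr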
Warning
When resuming the copying process (using --resume), it is recommended to call the script with the same values for --block-size and --rechunk. Using different values for these arguments when resuming the copy of the same dataset may lead to unexpected behavior.
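For example, if the initial copy was started with an explicit block size, resuming it keeps the same value (the locations are placeholders):

   anemoi-datasets copy --block-size 100 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr
   anemoi-datasets copy --resume --block-size 100 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr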
Copy a dataset from one location to another.
usage: anemoi-datasets copy [-h] [--overwrite | --resume]
[--transfers TRANSFERS] [--verbosity VERBOSITY]
[--nested] [--rechunk RECHUNK]
[--block-size BLOCK_SIZE]
source target
Positional Arguments
- source
Source location.
- target
Target location.
Named Arguments
- --overwrite
Overwrite existing dataset. This will delete the target dataset if it already exists. Cannot be used with --resume.
Default:
False
- --resume
Resume copying an existing dataset. Cannot be used with --overwrite.
Default:
False
- --transfers
Number of parallel transfers.
Default:
8
- --verbosity
Verbosity level. 0 is silent, 1 is normal, 2 is verbose.
Default:
1
- --nested
Use Zarr's nested directory backend.
Default:
False
- --rechunk
Rechunk the target data array. The rechunk size should be a divisor of the block size.
- --block-size
For optimisation purposes, data is transferred in blocks. Default is 100.
Default:
100
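As a final sketch combining several of the options above, a verbose copy that replaces an existing target and uses fewer parallel transfers could look like this (the locations are hypothetical):

   anemoi-datasets copy --overwrite --verbosity 2 --transfers 4 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr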