copy
Copying a dataset from one location to another can be error-prone and time-consuming. This command-line script allows incremental copying: if the copying process fails, it can be resumed. It can be used to copy files from a local directory to a remote server, or from a remote server to a local directory, as long as a Zarr backend is available to read and write the data.
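For example, a local Zarr dataset could be copied to an object store as follows (both locations below are hypothetical placeholders):

   anemoi-datasets copy /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr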
The script uses multiple threads to speed up the process. However, keep in mind that making parallel requests to the same server may not be ideal, for instance if the server internally uses a limited number of threads to handle requests.
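If the server is sensitive to concurrent requests, the number of parallel transfers can be reduced from its default of 8, for example (the locations are again placeholders):

   anemoi-datasets copy --transfers 2 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr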
The option to rechunk the data is available, which can be useful when the data is stored on a platform that does not support many small files or many files in the same directory. However, keep in mind that rechunking has a large impact on performance when reading the data: the chunking pattern of the source dataset was defined for good reasons, and changing it is very likely to degrade performance.
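If rechunking is nevertheless needed, a sketch of the invocation is shown below, where <chunk-spec> is a placeholder for whatever chunk specification the --rechunk option accepts and the locations are hypothetical:

   anemoi-datasets copy --rechunk <chunk-spec> /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr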
Warning
When resuming the copying process (using --resume), it is recommended to call the script with the same values for --block-size and --rechunk. Using different values for these arguments when resuming the copy of the same dataset may lead to unexpected behavior.
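For example, if the initial copy was started with an explicit block size, resuming it keeps the same value (the locations are placeholders):

   anemoi-datasets copy --block-size 100 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr
   anemoi-datasets copy --resume --block-size 100 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr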
Copy a dataset from one location to another.
usage: anemoi-datasets copy [-h] [--overwrite | --resume]
[--transfers TRANSFERS] [--verbosity VERBOSITY]
[--nested] [--rechunk RECHUNK]
[--block-size BLOCK_SIZE]
source target
Positional Arguments
- source
Source location.
- target
Target location.
Named Arguments
- --overwrite
Overwrite existing dataset. This will delete the target dataset if it already exists. Cannot be used with --resume.
Default:
False
- --resume
Resume copying an existing dataset. Cannot be used with --overwrite.
Default:
False
- --transfers
Number of parallel transfers.
Default:
8
- --verbosity
Verbosity level. 0 is silent, 1 is normal, 2 is verbose.
Default:
1
- --nested
Use Zarr's nested directory backend.
Default:
False
- --rechunk
Rechunk the target data array. The rechunk size should be a divisor of the block size.
- --block-size
For optimisation purposes, data is transferred in blocks. Default is 100.
Default:
100
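As a final sketch combining several of the options above, a verbose copy that replaces an existing target and uses fewer parallel transfers could look like this (the locations are hypothetical):

   anemoi-datasets copy --overwrite --verbosity 2 --transfers 4 /data/my-dataset.zarr s3://my-bucket/my-dataset.zarr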