Downloading multiple files with wget

wget is a command-line tool well suited to downloading multiple files from the CEDA archive. It can be run interactively or from inside scripts, and it has many options for controlling which files are selected for download and how they are stored on your local computer. It is an alternative for users who previously downloaded data from CEDA using FTP.

Obtaining wget

wget is usually installed by default on Linux. On a Mac it may need to be installed separately, for example with a package manager such as Homebrew.

wget is also available for Windows as an executable file that can be run from a command window. For convenience, move it to a directory that is on your PATH so that it can be run from any location. You can also run wget via Windows Subsystem for Linux or MobaXterm.
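A quick way to confirm that wget can be found on your PATH is sketched below. The helper name have_command is purely illustrative; the check relies only on the standard shell built-in "command -v".

```shell
# Succeeds if the named command can be found on PATH (helper name is illustrative).
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command wget; then
    # Show the first line of the version banner for the installed wget.
    wget --version | head -n 1
else
    echo "wget not found on PATH; install it or adjust PATH" >&2
fi
```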

Obtaining a CEDA access token

Some CEDA data can be accessed using wget without authentication. However, many datasets have specific auditing policies or access restrictions in place which prevent users who are not logged in, or who are not authorised, from downloading the data. You may therefore need to obtain an access token in order to download data. If you download data without an access token and find that the files are empty or do not contain the expected content, the most likely cause is that an access token is required.

You can obtain an access token via the CEDA services portal. If you don't already have a CEDA account you will need to register first. If the data you wish to download has restricted access then you will need to apply for access to the data via your CEDA account. Your token will only allow you to access data that your account is permitted to access.

Once you have created a token you can use the 'Copy' button to copy the token text. For more details about access tokens, including how they can be obtained and used from scripts, see Using Archive Access Tokens.

Using an archive access token with wget

To submit your access token, add the following option to your wget command, where TOKEN_TEXT should be replaced by the token string:

--header "Authorization: Bearer TOKEN_TEXT"

As the token text is quite long, you may find it more convenient to store it in an environment variable. On most Linux systems you can create a variable named TOKEN containing your token text with the following command (replace 'TOKEN_TEXT' with your token string):

export TOKEN="TOKEN_TEXT"

You can then substitute the token text into the wget option as follows:

--header "Authorization: Bearer $TOKEN"
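If you would rather not paste the token into your terminal each session, one possible approach is to keep it in a file that only you can read and load it into TOKEN from there. The sketch below assumes this arrangement; the function name load_token and the file name .ceda_token are hypothetical and not part of any CEDA tooling.

```shell
# Read a token from a file into the TOKEN environment variable.
# Whitespace and newlines picked up when pasting are stripped.
# The function name and the file location are assumptions for illustration.
load_token() {
    TOKEN=$(tr -d '[:space:]' < "$1") || return 1
    export TOKEN
}

# Example usage (hypothetical file name):
#   chmod 600 "$HOME/.ceda_token"
#   load_token "$HOME/.ceda_token"
#   wget --header "Authorization: Bearer $TOKEN" URL
```

Tokens expire, so the file needs to be refreshed whenever you create a new token.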

Testing token access

If you want to test that your token is working you can run the following command (substituting your token string or environment variable for TOKEN_TEXT as described above):

wget -O - -q https://dap.ceda.ac.uk/badc/ARCHIVE_INFO/ACCESS_TEST/RESTRICTED/TOKEN_CHECK --header "Authorization: Bearer TOKEN_TEXT"

If it works then you should see the following short message:

Congratulations, you have successfuly authenticated with CEDA using a token

If you instead get a large amount of HTML output, the authentication has failed.
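In a script, the same test can be automated by inspecting the response text. The following is a minimal sketch that treats any response beginning with "Congratulations" as success and anything else (such as an HTML page) as failure; the function names are illustrative, and the prefix match is a heuristic based on the message shown above.

```shell
CHECK_URL="https://dap.ceda.ac.uk/badc/ARCHIVE_INFO/ACCESS_TEST/RESTRICTED/TOKEN_CHECK"

# Succeeds only if the response text looks like the short success message.
token_response_ok() {
    case "$1" in
        Congratulations*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fetch the test URL with the given token and classify the response.
# Needs network access; it is defined here but not called.
check_token() {
    response=$(wget -O - -q "$CHECK_URL" --header "Authorization: Bearer $1")
    token_response_ok "$response"
}

# Example usage:
#   if check_token "$TOKEN"; then echo "token OK"; else echo "token rejected" >&2; fi
```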

Simple downloads

When you browse the CEDA archive at https://data.ceda.ac.uk, click on the bulk download options button at the top of the page.

This will show a wget command that can be used to download all data starting from the current directory, including all files in subdirectories below this point. For example:

wget -e robots=off --mirror --no-parent -r https://dap.ceda.ac.uk/badc/acsoe/data/c-130/97-flights-trajectories/a574/

You can modify the options to this command to select only a subset of the files for download and to customise how they are stored once downloaded. See the Advanced options section of this page.

The wget command shown by the archive browser does not include the access token option described above, because the browser is currently unable to determine whether authentication is required for the data you want to access. If you try the command and it downloads files whose content does not look right, add the token authentication option (TOKEN_TEXT is either your token text or the environment variable containing it, see above):

--header "Authorization: Bearer TOKEN_TEXT"
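Putting the pieces together, the sketch below wraps the command shown by the archive browser in a small shell function that adds the token header only when the TOKEN environment variable is set. The function name ceda_mirror is hypothetical; the wget options are the ones used earlier on this page.

```shell
# Mirror a CEDA archive directory. The token header is added only when the
# TOKEN environment variable is set and non-empty.
# The function name is an assumption for illustration.
ceda_mirror() {
    if [ -n "${TOKEN:-}" ]; then
        wget -e robots=off --mirror --no-parent -r \
            --header "Authorization: Bearer $TOKEN" "$1"
    else
        wget -e robots=off --mirror --no-parent -r "$1"
    fi
}

# Example usage:
#   export TOKEN="TOKEN_TEXT"
#   ceda_mirror https://dap.ceda.ac.uk/badc/acsoe/data/c-130/97-flights-trajectories/a574/
```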

Advanced options

wget has many options. Type "wget --help" to see the options available for the version you are using (options may vary between versions). The wget manual is also available online.

The most useful options are those that allow you to select which files to download and to control how they are then stored on your local machine.

-A	Specifies a comma-separated list of file name patterns to accept, which can include wildcard characters. For example, '*.dat' or '2024*'. If you include wildcards then enclose the pattern in quotes. Note that the pattern is applied only to the file name and not to the directory path.
-R	Same as -A, but specifies file name patterns to be excluded.
--accept-regex	Specifies a regular expression to accept, matched against the entire URL (including the directory path). For example, '.*2024.*' includes any file where the file name or directory path contains '2024'.
--reject-regex	Same as --accept-regex, except that it specifies a pattern to reject.
-l	Sets the maximum subdirectory depth that will be retrieved. Use this to avoid downloading large amounts of unwanted data. Note that the --mirror option sets this value to 'unlimited'.
--no-directories	Puts all the retrieved files into the current directory instead of creating a directory hierarchy.
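As an illustration of how these options combine, the sketch below downloads only files matching a name pattern, limits the recursion depth, and places everything in the current directory. The function name fetch_subset is hypothetical. It uses -r rather than --mirror so that the -l depth limit is not reset to unlimited. If the data requires authentication, add the --header option described earlier on this page.

```shell
# Download files under URL whose names match PATTERN, descending at most
# DEPTH levels, and store them all in the current directory.
# Arguments: URL PATTERN DEPTH (function name is illustrative).
fetch_subset() {
    wget -e robots=off -r --no-parent \
        -l "$3" -A "$2" --no-directories "$1"
}

# Example usage: quote the pattern so the shell does not expand the wildcard.
#   fetch_subset https://dap.ceda.ac.uk/badc/acsoe/data/c-130/97-flights-trajectories/a574/ '*.dat' 2
```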

Problems

No files downloaded? Check the URL that you have used. Unless you are downloading a single file, it should end in a slash ('/') character.
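When directory URLs are assembled in a script, a small guard can add the trailing slash if it is missing. This is a minimal sketch with a hypothetical function name; it should only be applied to directory URLs, not to the URL of a single file.

```shell
# Print the given directory URL, appending '/' if it does not already end in one.
ensure_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Example usage:
#   url=$(ensure_trailing_slash "https://dap.ceda.ac.uk/badc/acsoe/data")
```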

If files are downloaded but their content is not as expected, it is probably because you need to authenticate with an access token. This is required for many datasets so that we can report usage statistics to the dataset originators. Please see above for details of how to obtain and use access tokens.
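A rough way to spot such files after a download is to look for ones that are empty or that begin with an HTML page. The sketch below is a heuristic only, with a hypothetical function name: it will also flag genuine HTML files and files that are legitimately empty in the archive.

```shell
# Succeeds if FILE is empty or its first 1024 bytes look like an HTML page,
# which may indicate that authentication was needed. Heuristic only.
looks_like_failed_download() {
    if [ ! -s "$1" ]; then
        return 0
    fi
    head -c 1024 "$1" | grep -qiE '<!doctype html|<html'
}

# Example usage: list suspicious files below the current directory.
#   find . -type f | while IFS= read -r f; do
#       if looks_like_failed_download "$f"; then echo "check: $f"; fi
#   done
```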