Downloading multiple files with wget
wget is a great tool for downloading multiple files from the CEDA archive. It can be used from a command line or inside scripts. It has many options for controlling what is selected for download and how the downloads are stored on your local computer. It provides an alternative to users who previously used FTP to download data from CEDA.
Obtaining wget
wget is usually installed by default on Linux. On macOS it may need to be installed separately, for example with a package manager such as Homebrew.
wget is also available for Windows as an executable file that can be run from a command window. For convenience, move it to a location that is on PATH, so it can be run from any location. You can also run wget via Windows Subsystem for Linux or MobaXterm.
Obtaining a CEDA access token
Some CEDA data can be accessed using wget without authentication. However, many datasets have specific auditing policies or access restrictions in place which prevent non logged-in or unauthorised users from downloading the data. In order to download data you may therefore need to obtain an access token. If you try downloading data without using an access token and find that the files downloaded are empty or do not contain the expected content then it may be because you need to use an access token.
You can obtain an access token via the CEDA services portal. Once you have created one, you can use the 'Copy' button to copy the token text. For more details about access tokens, including how they can be obtained and used from scripts, see Using Archive Access Tokens.
Using an archive access token with wget
To submit your access token, add the following option to your wget command, where TOKEN_TEXT should be replaced by the token string:
--header "Authorization: Bearer TOKEN_TEXT"
As the token text is quite long, you may find it more convenient to store it in an environment variable. On most Linux computers you can create a variable named TOKEN containing your token text with the following command (replace 'TOKEN_TEXT' with your token string):
export TOKEN="TOKEN_TEXT"
You can then substitute the token text into the wget option as follows:
--header "Authorization: Bearer $TOKEN"
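If you use the token in several commands, a small wrapper can save retyping the option. The following is a minimal sketch, not part of CEDA's tooling: the function names ceda_auth_header and ceda_wget are hypothetical, and it assumes the token is stored in the TOKEN environment variable as described above.

```shell
# Hypothetical helper: prints the Authorization header value,
# or nothing at all if TOKEN is unset or empty.
ceda_auth_header() {
    if [ -n "${TOKEN:-}" ]; then
        printf 'Authorization: Bearer %s\n' "$TOKEN"
    fi
}

# Hypothetical wrapper: runs wget with the header only when a token is set,
# passing all other arguments through unchanged.
ceda_wget() {
    header=$(ceda_auth_header)
    if [ -n "$header" ]; then
        wget --header "$header" "$@"
    else
        wget "$@"
    fi
}
```

With these functions defined in your shell, ceda_wget takes the same arguments as wget itself.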
Testing token access
If you want to test that your token is working you can run the following command (substituting your token string or environment variable for TOKEN_TEXT as described above):
wget -O - -q https://dap.ceda.ac.uk/badc/ARCHIVE_INFO/ACCESS_TEST/RESTRICTED/TOKEN_CHECK --header "Authorization: Bearer TOKEN_TEXT"
If it works then you should see the following short message:
Congratulations, you have successfuly authenticated with CEDA using a token
If you get a large amount of html output then it has failed.
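In a script you may prefer to check the result automatically rather than by eye. The sketch below assumes the success message contains the phrase quoted above and treats any other response as a failure. The function names are hypothetical, and the token is assumed to be in the TOKEN environment variable.

```shell
# Returns 0 if the response text contains the expected success phrase.
token_response_ok() {
    case "$1" in
        *"authenticated with CEDA using a token"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fetches the test URL using the token in $TOKEN and reports the outcome.
check_ceda_token() {
    response=$(wget -O - -q \
        https://dap.ceda.ac.uk/badc/ARCHIVE_INFO/ACCESS_TEST/RESTRICTED/TOKEN_CHECK \
        --header "Authorization: Bearer $TOKEN")
    if token_response_ok "$response"; then
        echo "Token accepted"
    else
        echo "Token check failed" >&2
        return 1
    fi
}
```

Running check_ceda_token before a long download is one way to find out early that a token is missing or no longer valid.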
Simple downloads
When you browse the CEDA archive at https://data.ceda.ac.uk, click the bulk download options button at the top of the page.
This will show a wget command that can be used to download all data starting from the current directory, including all files in subdirectories below this point. For example:
wget -e robots=off --mirror --no-parent -r https://dap.ceda.ac.uk/badc/acsoe/data/c-130/97-flights-trajectories/a574/
You can modify the options to this command to select only a subset of the files for download and to customise how they are stored when downloaded. See the Advanced options section of this page.
The wget command shown by the archive browser does not include the access token authentication described above, because the browser is currently unable to determine whether authentication is required for the data you want to access. If you try the command and it downloads files but their content does not look right, add the token authentication option (where TOKEN_TEXT is either your token text or a reference to the environment variable containing it, see above):
--header "Authorization: Bearer TOKEN_TEXT"
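Putting the two pieces together, the archive browser's command plus the token option might be wrapped as follows. This is a sketch under stated assumptions: the function name ceda_mirror and the DRY_RUN variable are hypothetical, and the token is assumed to be in the TOKEN environment variable.

```shell
# Hypothetical helper: mirror a CEDA archive directory with token authentication.
# Expects the token in $TOKEN. Set DRY_RUN=1 to print the command instead of running it.
ceda_mirror() {
    url=$1
    # Same options as the command shown by the archive browser, plus the header.
    set -- -e robots=off --mirror --no-parent -r \
        --header "Authorization: Bearer $TOKEN" "$url"
    if [ -n "${DRY_RUN:-}" ]; then
        # Printed for inspection only: the quotes around the header are not shown.
        printf 'wget %s\n' "$*"
    else
        wget "$@"
    fi
}
```

Setting DRY_RUN=1 before calling ceda_mirror with a directory URL lets you inspect the command before any data is transferred.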
Advanced options
wget has many options. Type "wget --help" to see the options available for the version you are using (options may vary between versions). An online version of the wget documentation is also available.
The most useful options are those that allow you to select which files to download and to control how they are then stored on your local machine.
-A | Specifies a comma-separated list of file name patterns to accept, which can include wildcard characters. For example, '*.dat' or '*2024*'. If you include wildcards then enclose the pattern in quotes. Note that this pattern only applies to the filename and not the directory path. |
-R | Same as -A, but specifies file name patterns to be excluded. |
--accept-regex | Specifies a regular expression to accept, matched against the entire URL (including the directory path). For example, '.*2024.*' to include any file where the filename or directory path contains '2024'. |
--reject-regex | Same as --accept-regex, except that it specifies a pattern to reject. |
-l | Sets the maximum subdirectory depth that will be retrieved. Use this to prevent downloading large amounts of unwanted data. Note that the --mirror option includes setting this value to 'unlimited'. |
--no-directories | Puts all the retrieved files into the current directory instead of creating a directory hierarchy. |
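The difference between -A and --accept-regex is easy to overlook, so here is a small illustration written in plain shell. It only approximates wget's behaviour and does not call wget. The function names and the example URL are hypothetical.

```shell
# Approximates -A: the wildcard pattern is tested against the file name only.
matches_accept() {  # $1 = URL, $2 = wildcard pattern
    name=${1##*/}   # strip everything up to the last slash
    case "$name" in
        $2) return 0 ;;
        *) return 1 ;;
    esac
}

# Approximates --accept-regex: the expression is tested against the whole URL.
matches_accept_regex() {  # $1 = URL, $2 = regular expression
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Hypothetical URL where '2024' appears in the directory path, not the file name.
url="https://dap.ceda.ac.uk/badc/example/2024/obs.dat"
matches_accept "$url" '*2024*' || echo "-A '*2024*' would skip this file"
matches_accept_regex "$url" '.*2024.*' && echo "--accept-regex '.*2024.*' would keep this file"
```

In this illustration the wildcard pattern fails because '2024' occurs only in the directory path, while the regular expression succeeds because it sees the full URL.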
Problems
No files downloaded? Check the URL that you used. Unless you are downloading a single file, it should end in a slash ('/') character.
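If directory URLs are built up in a script, the slash can be added automatically. A minimal sketch, with a hypothetical function name, intended for directory URLs only (a single-file URL should not be passed to it):

```shell
# Prints the given directory URL, appending '/' if it is missing.
ensure_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```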
If files are downloaded but their content is not as expected, it is probably because you need to authenticate with an access token. This is required for many datasets so that we can report usage statistics to the dataset originators. Please see above for details of how to obtain and use access tokens.
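One rough way to spot this problem after a bulk download is to look for files that are empty or that start like a web page. The sketch below is a heuristic only: the function name is hypothetical, it assumes a failed download produces either an empty file or HTML, and it will wrongly flag any dataset file that is legitimately HTML.

```shell
# Returns 0 if the file is empty or its first bytes look like an HTML page.
looks_unauthenticated() {
    file=$1
    if [ ! -s "$file" ]; then
        return 0
    fi
    # Inspect only the first 512 bytes, ignoring case.
    if head -c 512 "$file" | LC_ALL=C grep -qiE '<!doctype html|<html'; then
        return 0
    fi
    return 1
}
```

To check a downloaded file, call looks_unauthenticated with its path and test the exit status.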