Data Usage Statistics

Data in the CEDA archives are accessed by a worldwide user community, and understanding this usage is important, especially for our data providers and funders. Over the years the way usage is tracked and reported has evolved in response to the changing nature of the archive, its sheer scale, the volume of use and the ways of accessing data. This can make it challenging to report usage statistics and to interpret them, especially when comparing different time periods.

This article covers the various download statistics that are available for the CEDA archive as well as important caveats that users need to keep in mind when using them.

Archive download stats

Data in the CEDA archive have been accessible via various means over the years. These include web download (either manually or via scripts), FTP (now retired) and other mechanisms such as the CEDA Web Processing Service. The logging behind these downloads has also changed over time, but the resulting statistics are largely comparable, barring the caveats listed further down.

To access archive download stats

Access to the public stats service can either be obtained directly at https://public-stats.ceda.ac.uk/, which starts with a view covering the entire archive, or by using the 'Download stats' link on the right hand side of relevant dataset pages as shown below:

Note that the stats are based on archive paths, with each dataset catalogue record linked to the part of the archive that it covers. It is, though, possible to get aggregate stats for more than one dataset by altering the path under the 'Dataset' filter on the left of the public stats page:

There are various other filters there as noted below:

start date/end date - you can set one or both of these to get stats within the time period given. Note, however, that the CEDA stats presently extend back to mid 2023, though we will be reprocessing older logs to give earlier download statistics in due course.
Dataset - this shows the path in the archive below which stats are aggregated. It is possible to aggregate over more than one path in the archive by supplying them as a comma-separated list, strictly with no spaces, e.g. /badc/ukmo-midas-open/data/,/badc/ukmo-cet/data/ will cover usage of all the MIDAS Open and Met Office CET datasets below those two paths in the archive. This matters because obtaining the stats separately and adding them together could double count unique IP addresses and countries.
Method - use this to look at a specific method by which the data are accessed (e.g. dap, ftp)
User Type - all, anonymous or registered users only. This is useful for 'public' datasets when trying to tease apart some of the caveats around using IP addresses to record 'unique users', as noted below
Bot Filtering (experimental) - over the years we have seen increased traffic from bots crawling the archive and, where access permits, downloading data despite the use of 'robots.txt' files. This is a common issue for web services, not just the CEDA archive. This filter applies an experimental CEDA method to try to reduce the impact of such 'bots' crawling the archive on the download stats. (NOTE: 5th Sept, the filtering is known to be overly aggressive at present, but adjustments will be rolled out soon)
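
The 'Dataset' filter rule and the double-counting risk described above can be sketched in a few lines of Python. This is only an illustration: the helper name build_dataset_filter and the IP addresses are made up for the example, and only the joining rule (commas, no spaces) and the two example paths come from the text above.

```python
def build_dataset_filter(paths):
    """Join archive paths into one comma-separated string with no spaces."""
    cleaned = []
    for path in paths:
        path = path.strip()  # drop accidental leading/trailing whitespace
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute archive path, got {path!r}")
        if " " in path or "," in path:
            raise ValueError(f"path must not contain spaces or commas: {path!r}")
        cleaned.append(path)
    return ",".join(cleaned)


# Made-up client IP addresses seen under each path (illustration only).
midas_ips = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
cet_ips = {"192.0.2.3", "192.0.2.4"}

summed_separately = len(midas_ips) + len(cet_ips)  # counts 192.0.2.3 twice
aggregated = len(midas_ips | cet_ips)              # counts each address once

dataset_filter = build_dataset_filter(
    ["/badc/ukmo-midas-open/data/", " /badc/ukmo-cet/data/"]
)
print(dataset_filter)
print(summed_separately, aggregated)
```

The resulting string is what would be pasted into the 'Dataset' filter. The two counts differ because one address downloaded from both paths, which is why aggregating within the stats service is preferable to adding separate totals.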

The various views of the download stats allow users to explore the stats in different ways, and either to access them via JSON endpoints or to download them in various forms for local processing/visualisation (e.g. in spreadsheets to create additional charts).

Archive stats coverage

In mid-2025 a new CEDA archive stats service was implemented to address performance issues and logging inconsistencies that had built up over the years. In addition, increased activity from 'bots' (see note further down) has made it necessary to implement additional filtering within the stats index.

The index has been processed back to 2012 and is now updated on a regular basis.

Caveats

Archive download statistics do not include direct usage on JASMIN, usage via tools such as the CEDA WPS or usage via ESGF.

Additional important caveats to note are given below. A summary table shows whether they relate to fully 'Public' datasets or to those that require a user to log in. Details of the caveats are then given.

Caveat	Public data	Login required
'bot' usage	Yes	No
Crawler usage	Yes	No
multiple IP addresses per user	Yes	Yes
'bit torrent' downloads	Yes	Yes

Bots: there has been a significant increase in bots which actually download datasets, and we presume (from the bot names, e.g. Claude, ChatGPT) that the data are then used to train LLMs. We think this is a legitimate use. This activity continued to rise steadily throughout the year.
Crawlers (search engine bots) were also included for a while, because the robots.txt file on one of the download servers was deleted during an upgrade.
There was a large leap in activity days around August/September 2024. We see many IP addresses downloading multiple open datasets from across the CEDA holdings, involving many machines (probably via torrents). It is, therefore, likely that this artificially inflates the user numbers in this period.

Known periods of reduced download statistics coverage and unusual trends

Whilst the archive has tried to maintain its download access logs, there have been times when specific issues have affected the logging, as noted below:

Time start	Time end	Issue
End September 2021	Mid November 2021	During parts of September, October and November no stats were recorded, so numbers for these three months are underreported.
June 2023	June 2023	Some data from the end of June 2023 are missing as there were a few issues when a new logging system for one of the download methods was introduced.
25th October 2023	10th November 2023	Some stats were lost due to issues related to the JASMIN downtime in this period.
February 2024 August 2024	Ongoing	There has been an increase in download activity since February 2024 and even more since August 2024. We believe that some of this activity may be down to increased bot activity rather than real user downloads. This is being investigated and we are looking to provide updated download stats with this activity filtered out.
November 2024	onwards	The ftp server was turned off in November for security reasons, so this download method has no stats from this date.
November 2024	November 2024	There was an issue with the underlying storage hosting the datasets which meant that the datasets were not available for at least a 4-day period in November and with further disruption in the days before and after. Other downtime periods may have affected stats, but not presently listed.
October 2024; mid-February 2025	October 2024; mid-March 2025	The download logs for one of our popular download servers (dap.ceda.ac.uk) were unfortunately lost for most of October 2024 and also from mid-February to early/mid-March 2025, so download statistics will be low in these periods. There is now a monitor in place to ensure such a loss is noticed more quickly in future.
May 2025	Beginning of July 2025	Nearly the entire catalogue appears to have been downloaded by users based in Brazil. We think this is related to a workshop which was ongoing at the time.
October 2025	November 2025	An unusually high volume of data was downloaded in October and November 2025, caused by users using bit-torrent approaches to download a large amount of Sentinel and MODIS data.
February 2026	March 2026	There was an unusually high number of Vietnamese users in February (though numbers were also building in other months), probably due to a few users torrenting larger climate datasets to download items more efficiently. We believe this may relate to an upcoming climate event in Vietnam in late March 2026.
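
When comparing download statistics across time periods, it can help to check whether a period of interest overlaps any of the known issues in the table above. The following Python sketch shows one way to do this. The function name and the selection of rows are illustrative only, and only the 25th October to 10th November 2023 entry has exact dates in the table; the other date bounds are approximations chosen for this example.

```python
from datetime import date

# (start, end, note) adapted from the table above. Only the 2023 JASMIN
# downtime has exact dates there; the other bounds are approximations.
KNOWN_PERIODS = [
    (date(2021, 9, 1), date(2021, 11, 30),
     "no stats recorded for parts of Sep-Nov 2021"),
    (date(2023, 10, 25), date(2023, 11, 10),
     "stats lost during JASMIN downtime"),
    (date(2024, 10, 1), date(2024, 10, 31),
     "dap.ceda.ac.uk logs lost for most of October 2024"),
]


def overlapping_issues(start, end, periods=KNOWN_PERIODS):
    """Return notes for known periods overlapping the inclusive range [start, end]."""
    return [
        note
        for p_start, p_end, note in periods
        if p_start <= end and start <= p_end  # standard interval-overlap test
    ]


# Example: is November 2023 affected by any listed issue?
for note in overlapping_issues(date(2023, 11, 1), date(2023, 11, 30)):
    print("Caution:", note)
```

A period that returns no notes is not guaranteed to be complete, since other downtime periods may have affected the stats without being listed.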

JASMIN usage

At present there isn't a method available to capture file access from within the JASMIN system, so usage statistics cannot be provided for it. Potential methods to address this are being examined by CEDA.

ESGF usage

Various climate-related datasets are also made available as part of the Earth System Grid Federation (ESGF). The ESGF hosts a central stats service for accesses via the various ESGF nodes, operated by one of the ESGF partners. It is presently under redevelopment and will include further stats in due course. As ESGF 'mirrors' data across nodes it is useful to poll the ESGF stats service to find data usage across all nodes, including the CEDA ESGF node.

Caveats

Citation harvesting

Where datasets are assigned a 'DOI' (a permanent identifier that resolves to a given resource) it is becoming increasingly possible to track data usage where these DOIs are referenced in the open literature. CEDA uses a tool developed with other NERC data centres in the Environmental Data Service (EDS) to harvest such citations from 'indexing' services. When citations have been found they are added to the dataset catalogue records and can be viewed by clicking on the 'citations' link on the right side of the dataset record view:

This will take you to the bottom of the catalogue record, where the list of harvested citations is displayed:

This is, however, a tool that remains in development and there is a wide range of known issues upstream of the EDS citation harvesting tool that affect the citations harvested:

Caveats

The citation harvesting only works for datasets with a DOI. Thus older datasets without a DOI are not supported by this harvesting mechanism.
Authors of journal articles may not include data citations following best practice, so these citations aren't presented in the indexing services
Journals may not be supplying dataset citation information to the indexing services
Journals may push data citation information into "supplementary material" parts of the paper (especially when the number of citations would otherwise be large), where the citations are no longer indexed
There are known cases of false positives (i.e. incorrect linking) being pulled through to the indexing services
There are also known cases where citations are not being picked up by the indexing services
Citation information may come through incomplete or may be unresolvable by the EDS tool
There remain questions about what constitutes a 'dataset citation', with various usages being presented by the indexing services that may not conform to expectations (e.g. in the above screenshot there are 23 citations, but most are from dataset to dataset version or other relationship linking, as opposed to citations in journals)

As such the citations captured should be used as an indicator only of the types of usage that the data may have been downloaded for.