Data Usage Statistics
Data in the CEDA archives are accessed by a worldwide user community, and understanding this usage is important, especially for our data providers and funders. Over the years, the way usage has been tracked and reported has evolved in response to the changing nature of the archive, its sheer scale, the volume of use and the ways of accessing data. This can present challenges in reporting usage statistics and in interpreting them, especially when comparing different time periods.
This article covers the various download statistics that are available for the CEDA archive as well as important caveats that users need to keep in mind when using them.
Archive download stats
Data in the CEDA archive have been accessible via various means over the years. This includes web download (either manually or via scripts), FTP (now retired) and other mechanisms such as the CEDA Web Processing Service. The logging behind these downloads has also changed over time, but the resulting statistics are largely comparable, barring the caveats listed further down.
To access archive download stats
Access to the public stats service can either be obtained directly at https://public-stats.ceda.ac.uk/, which starts with a view covering the entire archive, or by using the 'Download stats' link on the right hand side of relevant dataset pages as shown below:
Note, the stats are based on archive paths with dataset catalogue records linked to one part of the archive that they cover. It is, though, possible to get aggregate stats for more than one dataset by altering the path under the 'Dataset' filter on the left of the public stats page:
There are various other filters there as noted below:
- start date/end date - you can set one or both of these to get stats within the time period given. Note, however, that the CEDA stats presently extend back to mid 2023, though we will be reprocessing older logs to give earlier download statistics in due course.
- Dataset - this shows the path in the archive below which stats are aggregated. It is possible to aggregate over more than one path in the archive by supplying them as a comma-separated list of paths, strictly with no spaces, e.g.
/badc/ukmo-midas-open/data/,/badc/ukmo-cet/data/
will cover usage of all the MIDAS Open and Met Office CET datasets below those two paths in the archive. This is important because obtaining the stats separately and adding them together could double count unique IP addresses and countries.
- Method - use this if you want to look at a specific method by which the data are accessed (e.g. dap, ftp)
- User Type - all, anonymous or registered users only. This is useful for 'public' datasets when trying to tease apart some of the caveats around using IP addresses to record 'unique users', as noted below
- Bot Filtering (experimental) - over the years we have seen increased traffic from bots crawling the archive and, where access permits, downloading data despite the use of 'robots.txt' files. This is a common issue for web services, not just the CEDA archive. This filter applies an experimental CEDA method to try to reduce the impact of such bots on the download stats. (NOTE: 5th Sept, the filtering is known to be overly aggressive at present, but adjustments will be rolled out soon)
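To illustrate the Dataset filter described above, the following is a minimal Python sketch, not part of the CEDA service. The helper name `build_dataset_filter` and the IP addresses are hypothetical and used only for illustration. It builds a comma-separated path value with no spaces, then shows why adding separately obtained unique-IP counts can double count an address that appears under both paths.

```python
def build_dataset_filter(paths):
    """Join archive paths into one comma-separated value with no spaces."""
    cleaned = []
    for path in paths:
        path = path.strip()  # drop accidental surrounding whitespace
        if not path:
            continue
        if " " in path or "," in path:
            raise ValueError(f"path must not contain spaces or commas: {path!r}")
        cleaned.append(path)
    return ",".join(cleaned)


# Value to paste into the 'Dataset' filter (paths taken from the example above).
dataset_filter = build_dataset_filter(
    ["/badc/ukmo-midas-open/data/", " /badc/ukmo-cet/data/"]
)
print(dataset_filter)

# Made-up IP addresses seen under each path; one address accessed both.
midas_ips = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
cet_ips = {"192.0.2.3", "192.0.2.4"}

summed = len(midas_ips) + len(cet_ips)  # counts 192.0.2.3 twice
aggregated = len(midas_ips | cet_ips)   # counts each address once
print(summed, aggregated)
```

With these made-up addresses the summed count is one higher than the aggregated count, which is the double counting that the combined path filter avoids.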
The various views of the download stats then allow users to explore the stats in different ways, and either to access them via JSON endpoints or to download them in various forms for local processing/visualisation (e.g. in spreadsheets to create additional charts).
Caveats
Archive download statistics do not include direct usage on JASMIN, usage via tools such as the CEDA WPS, or usage via ESGF.
Additional important caveats to note are given below. A summary table shows whether they relate to fully 'Public' datasets or to those that require a user to log in. Details of the caveats are then given.
Caveat | Public data | Login required |
---|---|---|
'bot' usage | Yes | No |
Crawler usage | Yes | No |
multiple IP addresses per user | Yes | Yes |
'bit torrent' downloads | Yes | Yes |
- Bots: there has been a significant increase in bots which actually download datasets, and we presume (from the bot names, e.g. Claude and ChatGPT) that the data are then used to train LLMs. Sam and I think this is a legitimate use. This activity continued to rise steadily throughout the year.
- Crawlers (search engine bots) were also included for a while, because the robots.txt on one of the download servers was deleted during an upgrade.
- There was a large leap in activity days around August/September 2024. We see multiple IP addresses downloading multiple open datasets across the CEDA holdings, involving many machines (probably via torrents). It is, therefore, likely that this is artificially inflating the user numbers in this period.
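The 'multiple IP addresses per user' caveat can be illustrated with a small Python sketch. The records below are entirely made up, and the assumption that only the IP address is available for counting anonymous access is ours for the purpose of the example. It shows how one person working from several machines appears as several 'unique users' when counting by IP address.

```python
# Made-up access records: (person, ip_address). The person column is
# included only to show the true picture; counting by IP ignores it.
records = [
    ("person_a", "198.51.100.10"),
    ("person_a", "198.51.100.11"),
    ("person_a", "198.51.100.12"),
    ("person_b", "203.0.113.5"),
]

unique_ips = {ip for _, ip in records}
unique_people = {person for person, _ in records}

print(len(unique_ips), len(unique_people))
```

Here two people produce four distinct IP addresses, so an IP-based 'unique users' figure would be double the real number.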
Known periods of reduced download statistics coverage
Whilst the archive has tried to maintain its download access logs, there have been times when specific issues have impacted the logging, as noted below:
Time start | Time end | Issue |
---|---|---|
End September 2021 | Mid November 2021 | During parts of September, October and November no stats were recorded, so numbers for these three months are underreported. |
June 2023 | June 2023 | Some data from the end of June 2023 are missing as there were a few issues when a new logging system for one of the download methods was introduced. |
25th October 2023 | 10th November 2023 | Some stats were lost due to issues related to the JASMIN downtime in this period. |
February 2024 and August 2024 | Ongoing | There has been an increase in download activity since February 2024 and even more since August 2024. We believe that some of this activity may be down to increased bot activity rather than real user downloads. This is being investigated and we are looking to provide updated download stats with this activity filtered out. |
November 2024 | onwards | The ftp server was turned off in November for security reasons, so this download method has no stats from this date. |
November 2024 | November 2024 | There was an issue with the underlying storage hosting the datasets, which meant that the datasets were not available for at least a 4-day period in November, with further disruption in the days before and after. Other downtime periods may have affected stats, but are not presently listed. |
October 2024 and mid-February 2025 | October 2024 and mid-March 2025 | The download logs for one of our popular download servers (dap.ceda.ac.uk) were unfortunately lost for most of October 2024 and also from mid-February to early/mid-March 2025, so download statistics will be low in these periods. There is now a monitor in place to ensure such a loss is noticed more quickly in future. |
May 2025 | Beginning of July 2025 | Nearly the entire catalogue appears to have been downloaded by users based in Brazil. We think this is related to a workshop which was ongoing at the time. |
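When comparing time periods in locally downloaded stats, it can help to flag the months affected by the logging gaps in the table above. The following Python sketch is one possible approach, not a CEDA tool. Treating each gap at whole-month granularity is an assumption (the table gives only approximate dates), the function name is hypothetical and the monthly counts are made up. It covers only the periods of missing logs, not the periods where activity is thought to be inflated.

```python
# Months (YYYY-MM) touched by the logging gaps listed in the table above.
# Whole-month granularity is an assumption made for this sketch.
REDUCED_COVERAGE_MONTHS = {
    "2021-09", "2021-10", "2021-11",  # no stats recorded for parts of these months
    "2023-06",                        # issues introducing a new logging system
    "2023-10", "2023-11",             # stats lost around JASMIN downtime
    "2024-10",                        # dap.ceda.ac.uk logs lost
    "2024-11",                        # storage issue, datasets unavailable
    "2025-02", "2025-03",             # dap.ceda.ac.uk logs lost
}


def flag_reduced_coverage(monthly_counts):
    """Return (month, count, affected) tuples sorted by month."""
    return [
        (month, count, month in REDUCED_COVERAGE_MONTHS)
        for month, count in sorted(monthly_counts.items())
    ]


# Made-up monthly download counts keyed by YYYY-MM.
example = {"2023-05": 1200, "2023-06": 800, "2023-07": 1250}
flagged = flag_reduced_coverage(example)

for month, count, affected in flagged:
    note = " (known reduced coverage)" if affected else ""
    print(f"{month}: {count}{note}")
```

Flagged months are better treated as lower bounds than compared directly with unaffected months.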
JASMIN usage
At present there isn't a method available to capture file access from within the JASMIN system to provide usage statistics for such usage. Potential methods to address this are being examined by CEDA.
ESGF usage
Various climate-related datasets are also made available as part of the Earth System Grid Federation (ESGF). The ESGF hosts a central stats service for accesses via the various ESGF nodes, operated by one of the ESGF partners. It is presently under redevelopment and will include further stats in due course. As ESGF 'mirrors' data across nodes, it is useful to poll the ESGF stats service to find data usage across all nodes, including the CEDA ESGF node.
Caveats
Citation harvesting
Where datasets are assigned a 'DOI' (a permanent identifier that resolves to a given resource) it is becoming increasingly possible to track data usage through references to these DOIs in the open literature. CEDA uses a tool developed with other NERC data centres in the Environmental Data Service (EDS) to harvest such citations from 'indexing' services. When citations have been found they are added to the dataset catalogue records and can be viewed by clicking on the 'citations' link on the right side of the dataset record view:
This will take you to the bottom of the catalogue record, where the list of harvested citations is displayed:
This is, however, a tool that remains in development and there is a wide range of known issues upstream of the EDS citation harvesting tool that affect the citations harvested:
Caveats
- The citation harvesting only works for DOIed datasets. Thus older, non-DOIed datasets will not be supported by this harvesting mechanism.
- Authors of journal articles may not include data citations in line with best practice, so the citations are not presented in the indexing services
- Journals may not be supplying dataset citation information to the indexing services
- Journals may push data citation information into "supplementary material" parts of the paper (especially when the number of citations would otherwise be large), so that it is no longer indexed
- There are known cases of false positives (i.e. incorrect linking) being pulled through to the indexing services
- There are also known cases where citations are not being picked up by the indexing services
- Citation information may come through incomplete or unresolvable by the EDS tool
- There remain questions about what constitutes a 'dataset citation', with various usages being presented by the indexing services that may not conform to expectations (e.g. in the above screenshot there are 23 citations, but most are from dataset to dataset version or other relationship linking, as opposed to citations in journals)
As such the citations captured should be used as an indicator only of the types of usage that the data may have been downloaded for.