Levels of Data Curation

The CEDA Archive seeks to offer long-term archiving for data, but the level of curation needed varies from dataset to dataset, depending on how likely onward re-use is. Generally, data prepared to a higher level are accessible to a wider pool of future users, but CEDA also recognises that resource constraints limit how much preparation is possible.

The table below shows the three levels of curation offered by the CEDA Archive, the level of re-use that can be expected for each, and the data standards and conventions the data need to meet.

	Reference	Structured	Interoperable
Suitable for	Complete datasets	Key or ongoing datasets	Core community datasets
Anticipated data re-use level	Low	Medium-high	High
Discoverable in CEDA data catalogue, Google Scholar, NERC Data Catalogue, Data.gov.uk etc.	✓	✓	✓
DOI-able dataset (citable in papers)	✓	✓	✓
Web, FTP download	✓	✓	✓
Direct JASMIN access	if permitted	if permitted	if permitted
Community-wide/archive-quality format (e.g. netCDF)	encouraged	✓	✓
File metadata follows conventions (e.g. CF)	encouraged	✓	✓
Extra data tools (e.g. subsetting)			✓

Reference

These are data that are discoverable and downloadable, but it is left to the user to work out some of the usability issues. CEDA will make a catalogue entry and add the data files to the archive. This level is suitable if mass interest in the data is unlikely and their principal purpose is to provide evidence supporting a publication. Data in this category should be small in volume (< 1 TB).

Minimum qualification: a paper/documentation referencing the dataset.

Structured

As well as being Reference-ready, these data are in a community-supported format, with a defined file and directory naming convention. They may also have specified file-level metadata attribute conventions. This level is suitable for a dataset where there is an intention to make the data more reusable.

Minimum qualification: evidence of use of similar datasets by CEDA core communities.

Interoperable

In addition to meeting the Reference and Structured levels, these data are connected to specific community tools or systems that enable better discovery or processing. Examples include climate model data in ESGF, MIDAS land surface station data in the CEDA WPS, and aircraft data in the Flight Finder tool.

Minimum qualification: evidence of use of similar datasets by CEDA core communities and community tool specifications, plus some evidence that the data will fit the tools.
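
Because each level builds on the one before it, the criteria above can be read as a simple cumulative check. The following is a minimal illustrative sketch in Python, not a CEDA tool or an official decision procedure: the class `DatasetProfile`, its field names, and the function `highest_curation_level` are all hypothetical, and the real assessment involves judgement that a set of yes/no flags cannot capture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetProfile:
    """Hypothetical yes/no summary of a dataset (illustrative field names)."""
    has_referencing_paper: bool          # paper/documentation referencing the dataset
    community_format: bool               # community-wide format, e.g. netCDF
    follows_metadata_conventions: bool   # file metadata conventions, e.g. CF
    similar_data_used_by_core_communities: bool
    fits_community_tool: bool            # some evidence the data fit a tool's specification


def highest_curation_level(profile: DatasetProfile) -> Optional[str]:
    """Return the highest level the profile appears to qualify for, or None.

    Assumption: levels are strictly cumulative, as the section describes.
    """
    if not profile.has_referencing_paper:
        return None  # does not meet the Reference minimum qualification

    level = "Reference"
    structured = (
        profile.community_format
        and profile.follows_metadata_conventions
        and profile.similar_data_used_by_core_communities
    )
    if structured:
        level = "Structured"
        if profile.fits_community_tool:
            level = "Interoperable"
    return level


# Example: a netCDF/CF dataset with community use but no tool integration.
example = DatasetProfile(
    has_referencing_paper=True,
    community_format=True,
    follows_metadata_conventions=True,
    similar_data_used_by_core_communities=True,
    fits_community_tool=False,
)
print(highest_curation_level(example))
```

In this sketch a dataset that fits a community tool but is not in a community format still falls back to Reference, reflecting the assumption that Interoperable presupposes Structured.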