Depositing data FAQs

What and how much data can I deposit?
Why should I archive data?
What data format should I use?
Why is metadata important?
What metadata should I include in my files?
How do I check my files?
How should I name my files?
Can I get a DOI for my dataset?
How long will it take to get a DOI?
How do I tell CEDA about my data?
How do I supply catalogue record information and other information?
What is a data management plan (DMP)?
Is my data worthy of archiving?
What if I want to change my data?
How can I send my data to CEDA?
How do I get data usage information?
Where do I go if my data is not suitable to be stored at CEDA?
Can I archive compressed and tar-ed data?
My data is stored on a group/project workspace. Does that mean it is stored on the archive?
What licence will my data be made available to others under?
Who owns my data?
What levels of archiving service does CEDA offer?
How long will CEDA keep my data?
Can I restrict who can use my data?
How is dataset access controlled?

What and how much data can I deposit?

CEDA will archive data relevant to Atmospheric Sciences and Earth Observation fields, including data from NERC funded grants. CEDA will also accept small volumes of data from non-NERC funded projects to act as a reference-only source. This includes a wide range of instrumental, satellite and aircraft observations, analyses and model datasets of interest to the scientific community.

Projects can archive up to 10TB of data, either as data for publication reference purposes or for re-use by the community.

Reference data: data archived so that it can be referenced in a peer-reviewed publication.

These are data that need to be discoverable and downloadable, but it is left to the user to work out some of the usability issues. CEDA will make a catalogue entry and add the data files to the archive. Their principal use will be to provide evidence for publication.

Reusable: data to be held on the CEDA archive for re-use by the UK research community

The data must be provided in a community supported format (e.g. NetCDF), with a defined file and directory naming convention; CEDA can provide support to organise the data. The data must also have associated documentation and file level metadata (preferably adhering to the Climate and Forecast (CF) metadata conventions).

Note that for projects with large volumes of model data, CEDA will in general only archive a small subset of the data to be used for reference in publications. Data that are perceived to have a high value to the wider scientific community (outside of the specific project) may be allocated a larger volume of space. However, there is a finite amount of disk space and it has to be shared across the community.
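Before contacting CEDA it can help to know how a dataset compares with the 10TB figure above. The following is a minimal sketch only: the function names are hypothetical, and it assumes that 1 TB means 10**12 bytes, which the text above does not specify.

```python
import os

# Assumption: 1 TB is taken as 10**12 bytes; the FAQ does not define the unit.
ARCHIVE_LIMIT_BYTES = 10 * 10**12


def total_size_bytes(root):
    """Sum the sizes of all regular files below root (symbolic links are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def within_limit(root, limit=ARCHIVE_LIMIT_BYTES):
    """Return True if the files below root fit within the given limit."""
    return total_size_bytes(root) <= limit
```

Calling within_limit on the top-level directory of a dataset gives a quick first indication; the volume that can actually be archived is still a matter for discussion with CEDA.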

Why should I archive data?

Sending data to a secure long-term archive is increasingly a necessity for science projects, both to meet funding body requirements and to fulfil citation stipulations when publishing work.

It is NERC’s data policy that all environmental data of long-term value generated through NERC-funded activities must be submitted to NERC for long-term management and dissemination and will be made openly available for others to use.

Archiving is also good practice: it supports long-term scientific aims and enables the preservation and re-use of valuable research data, which are unique and cannot be replaced if destroyed or lost.

What data format should I use?

It is essential that the data are provided in suitable standard data formats such as NetCDF, NASA Ames or BADC-CSV – preferably following the CF metadata convention. Conforming to standard formats will maximise the potential for re-use and long-term usability and interoperability of your data in CEDA's archives.

Everyone has their own favourite formats but with 14PB of data for >5000 datasets and >550 million files in the archive it is essential that correctly-applied standard data formats are used in the archive to enable efficient curation and preservation of your data.

Why is metadata important?

Metadata is information about the data itself. Metadata is used for data discovery and for explaining what the specifics of the data are. The archived data may be re-used in the future by parties not involved with the project and could help support their science - so it is essential that data producers provide as much context with their data as possible.

Metadata should answer these questions: who, what, why, where, when and how your data was produced.

What metadata should I include in my files?

Metadata should answer these questions: who, what, why, where, when and how your data was produced. CEDA encourages the use of international standards such as the Climate and Forecast (CF) conventions and Dublin Core. Most of our standard formats are already set up to follow these standards and depositors are encouraged to follow them.

For more information see Metadata Basics.

How do I check my files?

There are tools available for some formats to allow people to check their files for compliance to format and/or metadata standards. These include the following:

Format	File compliance tool
BADC-CSV	BADC-CSV file checker service
NASA Ames	NASA Ames file checker service
NetCDF	CEDA netCDF CF-compliance checking service; NCAS-CMS CF-netCDF checking service

How should I name my files?

The CEDA file naming convention given below enables quick access to pertinent metadata and avoids the need to open and read a file in order to assess its contents.

The CEDA file-naming convention for observation data:

instrument|model_[location|platform|modelnumber]_YYYYMMDD[hh][mm][ss][_extra].ext

For further information please see the File Names explained.
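File names can be checked against the convention above before delivery. The sketch below is one possible reading of the pattern, not an official CEDA checker: it assumes the first two fields are required, that hh, mm, ss and _extra are optional, and that fields contain only letters, digits and hyphens. The function name parse_filename and the example file names are hypothetical.

```python
import re
from datetime import datetime

# Assumed reading of the convention: the first two fields are required,
# hh/mm/ss and _extra are optional. The allowed characters are an assumption too.
FILENAME_PATTERN = re.compile(
    r"^(?P<source>[A-Za-z0-9-]+)"        # instrument or model
    r"_(?P<place>[A-Za-z0-9-]+)"         # location, platform or model number
    r"_(?P<date>\d{8})"                  # YYYYMMDD
    r"(?P<time>(?:\d{2}){0,3})"          # optional hh, hhmm or hhmmss
    r"(?:_(?P<extra>[A-Za-z0-9_-]+))?"   # optional extra information
    r"\.(?P<ext>[A-Za-z0-9]+)$"          # file extension
)


def parse_filename(name):
    """Return the fields of a file name as a dict, or None if it does not fit."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Slicing "%H%M%S" to the number of time digits gives "", "%H", "%H%M" or "%H%M%S".
    time_format = "%H%M%S"[: len(fields["time"])]
    try:
        fields["datetime"] = datetime.strptime(
            fields["date"] + fields["time"], "%Y%m%d" + time_format
        )
    except ValueError:
        return None  # the digits are present but do not form a real date or time
    return fields
```

With these assumptions, a made-up name such as radar_chilbolton_202001311230_v1.nc is accepted and split into its fields, while a name with an impossible date such as radar_chilbolton_20201331.nc is rejected.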

Can I get a DOI for my dataset?

CEDA is able to assign DOIs (Digital Object Identifiers) to datasets held within its archives. Publishers are increasingly requesting that researchers ensure that their data are lodged in a recognised data repository, preferably with DOIs assigned to the data to aid linking the published article and the referenced data resource.

In order to do this the data must be:

Complete - the data can be a whole dataset (or can apply to an ongoing time series).
Unchanging - whereby the actual content is fixed and will not be further amended to correct for inconsistencies, with the exception of additional files clearly labeled as errata.
Properly archived in a persistent location - such as the archives held by CEDA.
Accompanied by all the necessary information about the dataset to compile a suitable CEDA dataset catalogue page. This forms the DOI landing page to which the DOI will resolve, bringing the person following the DOI to the referenced resource. For this reason we don't change the title or the authors after we give the data a DOI.

CEDA data scientists will work with you to ensure that your data meet these criteria. While this may also necessitate some additional work on your part, we'll try to guide you through the process and make it as easy as possible.

How long will it take to get a DOI?

This will depend on the work needed to ensure the data and associated dataset catalogue page are suitably in place for the DOI landing page to be created.

In all cases we recommend that you contact CEDA at the earliest available opportunity to discuss your requirements as we may not be able to meet short deadlines.

How do I tell CEDA about my data?

If you have a NERC grant relating to atmospheric or earth observation fields, CEDA will contact you at the beginning of your project for further details and discussion about archival. Appropriate data management must occur for each funded project; NERC data policy guidance can be found here.

For non-NERC projects, please contact CEDA to discuss your requirements.

How do I supply catalogue record information and other information?

To help users find data in CEDA archives it is important that we have correct information about the data. This includes information about the dataset itself, such as a description and the geographic area covered by the data, but also information about the instruments or model used to create the data and other project background. These details can be sent in a text file with the data. We will assume that a file named metadata.yaml in the top level of the delivered dataset contains the details of the dataset, project and instrument or model. You can either use this simple metadata file creation utility, or edit one of the examples below.

example dataset details YAML file.

station-data_metadata_example.yaml

instrument_metadata_example.yaml

model_metadata_example.yaml

Adding the information this way helps us keep the record and the data together even if it is supplied over non-web-based channels like FTP.
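Since the metadata file is expected at the top level of the delivered dataset under the name metadata.yaml, a simple check before delivery can catch a missing or misplaced file. This is a minimal sketch with a hypothetical function name; it only checks that the file exists and is not empty, and does not validate its contents.

```python
from pathlib import Path

METADATA_NAME = "metadata.yaml"


def has_top_level_metadata(dataset_dir):
    """Return True if a non-empty metadata.yaml sits directly inside dataset_dir."""
    candidate = Path(dataset_dir) / METADATA_NAME
    return candidate.is_file() and candidate.stat().st_size > 0
```

A metadata.yaml placed in a subdirectory does not count here, because the check deliberately looks only at the top level of the dataset.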

Other useful information

Besides the information areas covered in the above links it is also helpful to provide the following types of information to CEDA to help curate your data for long-term use:

links to useful websites
script to read in/plot the data
documentation related to particular data formats
copies of project logos
photographs of instruments and sites where the instrument is deployed
calibration information
links to related articles

What is a data management plan (DMP)?

A data management plan is a formal document that outlines how data are to be handled both during a research project, and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation and data preservation before the project begins; this ensures that data are well-managed in the present, and prepared for preservation in the future.

For NERC grants, the outline data management plan from the proposal will be the start point for creation of the full data management plan. This full DMP should be mutually agreed between the Data Centre and the Principal Investigator within three months of the start date of the grant.

Is my data worthy of archiving?

Whatever the scale of the project it is useful to first determine the value of the data and whether there are any requirements to formally archive them. The NERC Data Value CheckList is an excellent and quick series of questions to help evaluate any data to see if they should be archived. While aimed primarily at NERC funded research, it can be used for non-NERC funded work too. Here is a flavour of the questions to address:

Is there any potential onward benefit for the data to others?
Am I mandated by a research council policy or legal requirement to make the data available?
Do I wish to obtain recognition for the data product in its own right? If so, how can I get this?
For how long does the data need to persist?
Have the data been, or will they be, used in publications? If so, is there a case to archive the data for reproducibility of the results?

What if I want to change my data?

CEDA strongly encourages depositors to archive data that are complete and finalised. However, we are aware that data may change due to errors found later on. In this instance you should get in touch with CEDA to discuss further. New versions of the data can be added by indicating the version number in the file name and recording what has changed in the history comments of the metadata.
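The versioning approach described above could be sketched as follows. The _v<number> suffix, the layout of the history line and both function names are assumptions made for illustration; agree the actual scheme with CEDA.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def versioned_name(filename, version):
    """Return filename with a _v<version> suffix before the extension.

    Assumption: versions are written as _v<number>; an existing suffix
    of that form is replaced rather than stacked.
    """
    path = PurePosixPath(filename)
    stem = re.sub(r"_v\d+$", "", path.stem)
    return f"{stem}_v{version}{path.suffix}"


def history_line(when, change):
    """Build a one-line history comment: ISO date, then what changed."""
    return f"{when.isoformat()}: {change}"
```

For a made-up file radar_chilbolton_20200131.nc, versioned_name(..., 2) gives a name ending in _v2.nc, and history_line(date(2021, 3, 4), "corrected calibration offset") gives text that could be appended to the history comments of the file metadata.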

How can I send my data to CEDA?

There are 3 ways you can send data to CEDA (HTTP, FTP and RSYNC). A member of the CEDA team will be assigned to liaise with the data provider to provide advice on data preparation and to help set up the delivery and ingestion route.

HTTP: A file uploader service which is suitable for small scale data providers and short lived projects. A step-by-step guide can be found here, along with a video tutorial.

FTP: Suitable for small to medium scale data uploads for projects where RSYNC is not an option. More information can be found here, along with a video tutorial.

RSYNC: This route is particularly suited to regular, automated data uploading and is especially useful for very large files and dataset transfers. More information can be found here, along with a video tutorial.

How do I get data usage information?

Where suitable statistics are available, CEDA can provide data providers with data usage information for use in reporting to funding agencies. All data usage information that we hold is covered by the Data Protection Act, so data providers can only have access to the data use information for which permission has already been granted by the CEDA facility user.

If you have a specific data usage reporting requirement please discuss this prior to archiving data with CEDA to ensure that we can set up the required access control mechanisms to capture the required information. Otherwise it may not be possible for us to provide the required information retrospectively, either because we do not hold a suitable record or because user permission to make the information available to the data provider is not in place.

The information that we are able to provide will depend on the type of access set (note - this is mainly for BADC and NEODC data, less information is held about UKSSDC and IPCC-DDC users). CEDA groups access as follows:

Open Access (Publicly available data) - no registration is required so limited information available
Restricted Access - All registered users - usage logs provide additional anonymised userbase profiling information (e.g. research fields supported by dataset, countries of users making use of dataset). Although CEDA is able to identify individual user accounts within internal logs, user details are not available to data providers.
Restricted Access - Application required - specific research descriptions available to permit authorisation of access to restricted resources in addition to user base profiling

Available Information	Use	Covers
Download statistics	Number of files and volume for a set period of time	public, registered user, restricted user
Userbase information	To permit anonymised profiling of user base for dataset - e.g. research fields supported	registered user, restricted user
Application information	For dataset authorisers to determine if access should be granted to restricted resource	restricted user only

Where do I go if my data is not suitable to be stored at CEDA?

NERC has a network of 5 environmental data centres covering a range of disciplines including:

British Oceanographic Data Centre (Marine)
Centre for Environmental Data Analysis which includes:
- British Atmospheric Data Centre (Atmospheric)
- NERC Earth Observation Data Centre (Earth observation)
- UK Solar System Data Centre (Solar and space physics)
Environmental Information Data Centre (Terrestrial and freshwater)
National Geoscience Data Centre (Geoscience)
Polar Data Centre (Polar and cryosphere)

The range of data held within the data centres is vast, covering all aspects of environmental science. Some centres also hold physical specimens and sample materials collected during NERC's activities, as well as material supplied by third parties (sometimes under statute).

Can I archive compressed and tar-ed data?

Moving and storing files can often be made more efficient by compressing the files themselves (e.g. .gz, .bz, .zip), sometimes in addition to "tar" which is used to ball together various files.

For archiving purposes, CEDA does not generally apply such compressions or balling together of files. Where this has been done it is usually for one of the following reasons:

To make it easier for end users to access and download large numbers of files and/or large individual files. For example, NIMROD rain radar products tend to be gzipped and then tarred together into daily tarballs. This avoids users having to deal with many hundreds of files for any given day, and reduces both the overall number of file objects and the overall volume of the data in the archive.
For consistency with older data. If a dataset has always been compressed we will generally continue to compress additions to that dataset so that anyone with automated processing of the data is not suddenly disrupted.
To follow an established convention. For example, Sentinel data uses zip as part of its SAFE packaging. The Sentinel user community has software that expects the zipped version.
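The "gzip each file, then tar the results together" pattern mentioned for daily tarballs can be sketched with the Python standard library. This only illustrates the general technique, not the procedure CEDA uses for NIMROD; the function name and file names are hypothetical.

```python
import gzip
import shutil
import tarfile
import tempfile
from pathlib import Path


def make_daily_tarball(files, tar_path):
    """Gzip each input file, then bundle the .gz files into one uncompressed tar."""
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(tar_path, "w") as tar:
            for src in map(Path, files):
                gz_path = Path(tmp) / (src.name + ".gz")
                # Compress the file individually first.
                with open(src, "rb") as fin, gzip.open(gz_path, "wb") as fout:
                    shutil.copyfileobj(fin, fout)
                # Store it in the tar under its bare name, without the temp path.
                tar.add(gz_path, arcname=gz_path.name)
    return tar_path
```

Because each member is compressed on its own, a user can extract and decompress a single file from the tarball without unpacking the whole day.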

My data is stored on a group/project workspace. Does that mean it is stored on the archive?

CEDA supports projects through shared storage spaces such as JASMIN group workspaces or FTP project spaces. Users of these services should understand that:

this is NOT the archive - placing data into these areas will not constitute having deposited data in the CEDA archive
the group-workspaces/project-spaces are NOT managed by CEDA and so content should be considered at risk

However, it is possible to prepare a dataset in these areas for eventual ingestion into the archive. If you wish to do this please contact your CEDA support officer in the first instance to discuss ingestion into the archive as it may be possible to ingest directly from these areas.

What licence will my data be made available to others under?

Generally the data produced by NERC funded research will be released under the Open Government Licence, as per NERC data policy. Data from other sources may use another licence, but we strongly encourage open licences such as Creative Commons licences.

Who owns my data?

Ownership of the data lies with the data creator, but NERC has the right to exploit data resulting from NERC funded research. Data creators will be required to agree to the data deposit conditions before the data are added to the archive.

What levels of archiving service does CEDA offer?

CEDA offers 3 levels of data archiving (reference, structured and compatible). Once CEDA has decided the data is suitable for deposit, one of these categories will be assigned depending on the level of interest to the scientific community.

Reference: These are data that need to be discoverable and downloadable, but it is left to the user to work out some of the usability issues. CEDA will make a catalogue entry and add the data files to the archive. This is a suitable solution if there is not likely to be mass interest in the data and their principal use will be to provide evidence for publication.

Minimum qualification: a paper referencing the dataset.

Structured: In addition to meeting the reference requirements, these data are in a community supported format, with a defined file and directory naming convention. They may also have specified file level metadata attribute conventions. The data are more usable by third parties, so this level is suitable for a dataset where there is an intention to make the data more reusable.

Minimum qualification: evidence of use of similar datasets by CEDA core communities.

Compatible: In addition to being structured, these data are connected to specified community tools or systems that enable better discovery or processing. For example, climate model data in ESGF, MIDAS data in the WPS or aircraft data in the Flight Finder.

Minimum qualification: evidence of use of similar datasets by CEDA core communities and community tool specifications. Some evidence that the data will fit the tools.

How long will CEDA keep my data?

For most unique observations of the environment for which the CEDA archive is the primary repository, data will be kept indefinitely. CEDA is constrained by resources, manpower and archive size, so we do have to make some choices about what we keep and whether it is selected for archiving in the first place.

Can I restrict who can use my data?

For NERC funded projects an embargo of a MAXIMUM of 2 YEARS can be put in place for project participants to work exclusively on, and publish the results of, the data they have collected. Access to all data submitted to the data centre will be restricted to project participants for the embargo period following the data production date, after which the data will be released into the public domain. During the embargo period, access may be extended to external collaborators who have been authorised by the PI or their delegated authority. Potential users of the archive will be required to agree to the Conditions of Use.

For non-NERC projects/datasets this depends on the licensing for that dataset. We strongly encourage Open Government licensing and are less likely to archive data with a locked down licence.

How is dataset access controlled?

CEDA recognises both the requirement to make data as open and as free as possible for use by the wider community and the need to protect data providers' IPR and right to first use of the data provided. For all NERC funded research, data providers are covered by the NERC Data Policy, whereby access can be limited to a maximum of 2 years after the date of collection/production, i.e. the date when the data were collected for observational data or the date when the simulation code produced the data.

CEDA therefore has various mechanisms to control access to data held in our archives to ensure that these two requirements are managed appropriately over time. The levels of control are as follows:

NOTE - conditions of use for the data are independent of the access control mechanism and all data users are required to adhere to the general CEDA conditions of use and the specific data licences pertaining to the data that they wish to use.

Publicly available data	Data are freely available to access, with no requirement for the user to register
Registered user accessible data	Data are freely available to access by registered users
Restricted data for limited period	Data are restricted for a set period of time, for example until 2 years after the date of production, after which access can be opened up to public/registered user. Access to such datasets is usually verified by a designated authoriser for applications.
Permanently restricted	Typically only applying to third-party datasets that CEDA obtains for use by the wider research community where access needs to be controlled to ensure use is within permitted conditions (e.g. only UK academics; those with NERC funding). Applications for these data are usually verified by CEDA staff.
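For data restricted for a limited period, the latest date on which a 2 year restriction could end follows directly from the collection/production date. The sketch below is illustrative only: the function name is hypothetical, and rolling a 29 February production date forward to 1 March is an assumption, since the policy text above does not cover that case.

```python
from datetime import date

MAX_EMBARGO_YEARS = 2  # maximum restriction period stated above


def latest_release_date(produced):
    """Return the date MAX_EMBARGO_YEARS after the collection/production date."""
    try:
        return produced.replace(year=produced.year + MAX_EMBARGO_YEARS)
    except ValueError:
        # 29 February has no counterpart in a non-leap year.
        # Assumption: roll forward to 1 March.
        return date(produced.year + MAX_EMBARGO_YEARS, 3, 1)
```

For example, data produced on 17 May 2020 could be restricted until 17 May 2022 at the latest; a shorter period can of course be agreed.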