DataCitation - CEDA
Data Citation and DOIs
Why this is a good idea:
- Encourages users to submit data to us
- with complete metadata
- in useful formats
- Encourages collaboration and data reuse, improves the scientific record, and gives better value for money.
Citation (i.e. assigning a DOI) is something that we can do in-house (and is valuable to our users in its own right).
Publication (involving full academic peer review) has to be done in collaboration with a journal. See here for a presentation (aimed at data producers) all about data citation and publication.
Why DOIs?
- They are actionable, interoperable, persistent links for (digital) objects
- Scientists are already used to citing papers using DOIs
- Pangaea assign DOIs, and ESSD use DOIs to link to the datasets they publish
- The British Library gave us an allocation of 500 DOIs to assign to datasets as we saw fit.
Choosing datasets to cite
Dataset has to be:
- Stable (i.e. not going to be modified)
- Complete (i.e. not going to be updated)
- Permanent – by assigning a DOI we’re committing to make the dataset available for posterity
- Good quality – by assigning a DOI we’re giving it our data centre stamp of approval, saying that it’s complete and all the metadata is available
Authors' permissions
For legacy datasets, we will need to ask permission to add a DOI to the dataset. For new datasets, permission to add a DOI will be part of the Data Management Plan.
The Rules
The official NERC data citation guidance documents are attached to this page.
- The guidelines for scientists document can be found at http://www.nerc.ac.uk/research/sites/data/doi.asp
- The guidance document and procedures for minting a DOI can be found in the Operational Policies and Procedures section of the EDC central iShare site at https://ishare.apps.nerc.ac.uk/teams/edcs/default.aspx
When a dataset is cited that means:
- There will be bitwise fixity
- With no additions or deletions of files
- No changes to the directory structure in the dataset “bundle”
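The three conditions above can be checked mechanically by fingerprinting the bundle when the DOI is assigned and comparing against it later. The following is a minimal sketch, assuming the bundle is a directory on local disk; the function names and the choice of SHA-256 are illustrative and are not an agreed CEDA procedure.

```python
import hashlib
from pathlib import Path


def bundle_fingerprint(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    fingerprint = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; very large files would
            # need chunked reading instead.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            fingerprint[path.relative_to(root).as_posix()] = digest
    return fingerprint


def fixity_changes(before, after):
    """Report files added, deleted, or altered between two fingerprints."""
    return {
        "added": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(
            name for name in before.keys() & after.keys()
            if before[name] != after[name]
        ),
    }
```

A file moved within the directory structure shows up as one deletion plus one addition, so a reorganised bundle is flagged as well. Empty directories are not tracked by this sketch.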
A DOI should point to an HTML representation of a record that describes a data object.
- Upgrades to versions of data formats will result in new editions of datasets – rules about how we do that to be decided.
- If there is a new version of the dataset, a new DOI is needed.
- We can only cite datasets that we have full authorization to distribute in perpetuity (i.e. we won't assign DOIs to Met Office data, as the Met Office reserves the right to have us delete it at any time).
We will need an inventory of files and their associated metadata records.
We can give separate DOIs to datasets and their metadata records (but we probably don’t want to).
DOI metadata will be scraped from the MOLES metadata records. It will follow the DataCite metadata schema as given at http://schema.datacite.org/
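Before a scraped record is submitted, it is worth confirming that the mandatory DataCite properties are all present. A minimal sketch, assuming the scraped MOLES record has already been mapped to a plain dictionary keyed by DataCite property name; that mapping and the function name are hypothetical.

```python
# The five mandatory properties of the DataCite schema.
MANDATORY_PROPERTIES = (
    "Identifier",
    "Creator",
    "Title",
    "Publisher",
    "PublicationYear",
)


def missing_properties(record):
    """Return the mandatory property names that are absent or blank."""
    missing = []
    for name in MANDATORY_PROPERTIES:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

An empty list means the record carries a value for every mandatory property; it says nothing about whether the values are correct.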
Landing page rules
Landing pages tend to have metadata about the object being referenced; e.g. author, abstract, publication date...
We decided to use the deployment records in our metadata catalogue as our DOI landing pages. So, for the GBS dataset, we’re using: http://team.badc.rl.ac.uk:50001/view/badc.nerc.ac.uk__ATOM__dep_11902946270621452 and http://badc.nerc.ac.uk/view/badc.nerc.ac.uk__ATOM__dep_11902119479621181
- We can change the landing page any time we like, but you had better be able to get to your digital object from there!
- Landing pages can have query based links to other things (papers which cite this dataset) etc ...
- The (DOI-mandatory) metadata describing the dataset should not change, because it describes the digital object and represents it faithfully. Any change to it ought to reflect a change to the digital object, and that should trigger a new DOI.
- The original landing page can indicate that a newer version of the dataset exists, but it should still point to the older version!
File Inventory
Probably will use checkm, but to be decided.
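Until the tool is chosen, the kind of inventory intended can be illustrated with a short script. This is a sketch only: the pipe-separated layout is loosely modelled on Checkm-style manifests and is not claimed to conform to the Checkm specification, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory_lines(root):
    """Yield one 'path | algorithm | digest | size' line per file under root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            size = path.stat().st_size
            yield f"{relative} | sha256 | {sha256_of(path)} | {size}"
```

Writing the lines to a file alongside the metadata record gives a fixed list of what the DOI refers to.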
Human readable citation string
The human readable citation string should follow the guidelines laid out in section 2.2 of the current DataCite metadata schema ( http://schema.datacite.org/ ), as copied below:
Because many users of this schema are members of a variety of academic disciplines, DataCite remains discipline‐agnostic concerning matters pertaining to academic style sheet requirements. Therefore, DataCite recommends rather than requires a particular citation format. In keeping with this approach, the following is the recommended format for rendering a DataCite citation for human readers using the first five properties of the schema:
Creator (PublicationYear): Title. Publisher. Identifier
It may also be desirable to include information from two optional properties, Version and ResourceType (as appropriate). If so, the recommended form is as follows:
Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier
For citation purposes, the Identifier may optionally appear both in its original format and in a linkable, http format, as it is practiced by the Organisation for Economic Co‐operation and Development (OECD), as shown below.
Regarding the PublicationYear, DataCite recommends, for resources that do not have a standard publication year value, to submit the date that would be preferred from a citation perspective. Here are several examples:
- Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127‐797. Geological Institute, University of Tokyo. doi:10.1594/PANGAEA.726855. http://dx.doi.org/10.1594/PANGAEA.726855
- Geofon operator (2009): GEFON event gfz2009kciu (NW Balkan Region). GeoForschungsZentrum Potsdam (GFZ). doi:10.1594/GFZ.GEOFON.gfz2009kciu. http://dx.doi.org/10.1594/GFZ.GEOFON.gfz2009kciu
- Denhard, Michael (2009): dphase_mpeps: MicroPEPS LAF‐Ensemble run by DWD for the MAP D‐PHASE project. World Data Center for Climate. doi:10.1594/WDCC/dphase_mpeps. http://dx.doi.org/10.1594/WDCC/dphase_mpeps
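The recommended format is simple enough to assemble programmatically. A minimal sketch of both forms (with and without the optional Version and ResourceType); the function name and argument names are illustrative and are not part of the DataCite schema.

```python
def citation_string(creator, publication_year, title, publisher, identifier,
                    version=None, resource_type=None):
    """Render 'Creator (PublicationYear): Title. [Version.] Publisher.
    [ResourceType.] Identifier', skipping optional parts that are not given."""
    parts = [f"{creator} ({publication_year}): {title}."]
    if version:
        parts.append(f"{version}.")
    parts.append(f"{publisher}.")
    if resource_type:
        parts.append(f"{resource_type}.")
    parts.append(identifier)
    return " ".join(parts)


print(citation_string(
    "Irino, T; Tada, R",
    2009,
    "Chemical and mineral compositions of sediments from ODP Site 127-797",
    "Geological Institute, University of Tokyo",
    "doi:10.1594/PANGAEA.726855",
))
```

The call above reproduces the first example citation, minus the optional http form of the identifier.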
DOI strings
Prefix (unique to NERC): 10.5285, followed by a unique string of our choice.
We decided (along with all the other NERC data centres) to use GUIDs (Globally Unique Identifiers) as the unique string.
The value of a GUID is represented as a 32-character hexadecimal string, such as {21EC2020-3AEA-1069-A2DD-08002B30309D}, and is usually stored as a 128-bit integer. The total number of unique keys is 2^128 or 3.4×10^38, roughly 2 trillion per cubic millimeter of the entire volume of the Earth. This number is so large that the probability of the same number being generated twice is extremely small.
Our DOIs will look something like this: 10.5285/e8f43a51-0198-4323-a926-fe69225d57dd (you can use a website like this to generate a GUID )
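A GUID can also be generated locally rather than via a website. A minimal sketch using Python's standard uuid module; using a random (version 4) GUID is an assumption here, since the guidance only asks for a GUID, and the function name is illustrative. Generating the string does not mint the DOI, it only produces the name to be registered.

```python
import uuid

NERC_DOI_PREFIX = "10.5285"  # prefix unique to NERC, as given above


def new_doi(prefix=NERC_DOI_PREFIX):
    """Build a candidate DOI string from the prefix and a fresh random GUID."""
    return f"{prefix}/{uuid.uuid4()}"


print(new_doi())
```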
If you want to assign a DOI to a Dataset (Observation in MOLES):
Note that this process is subject to change, depending on the number of DOIs that need to be minted. This document will be updated as the process changes. (Last update 7 June
- Requester confirms that the dataset to be cited meets the criteria specified above for DOI assignment.
- If the dataset does not meet the criteria, the dataset author should be informed of what needs to be done to the dataset to allow it to achieve the required criteria.
- Requester puts a message in the #ceda_doi slack channel (https://ncas-talk.slack.com/messages/C1B02HTRS/convo/C0PAPC5L0-1496742217.927695/) with the URL of the catalogue page for the dataset.
- DOI minter picks up the request (and lets the others know by doing a thumbs up on the request)
- DOI manager checks the landing page.
- DOI issuer mints the DOI.
- DOI issuer sends a slack message (on the #ceda_doi channel) to the requester confirming that the DOI has been minted.
- Requester sends a congratulatory email to dataset authors, confirming DOI has been minted.
- Note that there is some latency in the DOI system, and newly minted DOIs may not resolve for the first 24 hours. Check back the next day, and ping the #ceda_doi channel if there is an issue.
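The next-day check can be scripted. A minimal sketch, assuming the public resolver at https://doi.org (this page otherwise uses the older http://dx.doi.org form) and assuming that the resolver and the landing page both answer HEAD requests; the function name and the injectable opener argument are illustrative.

```python
import urllib.error
import urllib.request


def doi_resolves(doi, opener=urllib.request.urlopen, timeout=10):
    """Return True if the resolver redirects the DOI to a reachable page."""
    request = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        # urlopen follows the resolver's redirect to the landing page.
        with opener(request, timeout=timeout) as response:
            return response.status < 400
    except urllib.error.URLError:
        # HTTPError (e.g. 404 for a DOI not yet registered) is a subclass
        # of URLError, so both HTTP and network failures land here.
        return False
```

A False result within the first 24 hours is expected; after that it is worth raising on the #ceda_doi channel.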
If you want to assign a DOI to a Dataset Collection (Observation Collection in MOLES):
It is possible to assign a DOI to a dataset collection as well as to its lower-level datasets. However, each component dataset within the collection MUST itself meet the required standards for a DOI (bitwise fixity etc.), and the collection as a whole must also be up to standard. Collections containing third-party datasets therefore cannot receive a collection-level DOI, because those parts of the collection are outside the control of the DOI requester.
Adding DOI string to MOLES 3 record:
Once you have your DOI for the dataset/dataset collection in question, you need to add it to the relevant MOLES Observation/Observation Collection page that will act as the landing page for the resource.
This example record demonstrates the fields that have been completed to add the DOI information to the MOLES record as noted below.
To ensure that the DOI information is correctly included in the Observation/Observation Collection citation string:
- Add the DOI string into the "Identifiers" section, entering the full DOI URL into the URL field and selecting "DOI" as the Identifier Type (a short URL will be auto-generated for you). E.g. the full DOI URL is of the form:
http://dx.doi.org/10.5285/E8F43A51-0198-4323-A926-FE69225D57DD
- Change the "publication status" to "Citable"
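To avoid typing mistakes when entering the full DOI URL, it can be derived from the DOI string. A minimal sketch, assuming the NERC prefix and hyphenated GUID suffix described earlier on this page; the function name is illustrative, and upper-casing the suffix simply mirrors the example URL above (DOI names are case-insensitive).

```python
import uuid


def doi_url(doi, prefix="10.5285", resolver="http://dx.doi.org"):
    """Check a 'prefix/GUID' DOI string and return its full resolver URL."""
    found_prefix, _, suffix = doi.partition("/")
    if found_prefix != prefix:
        raise ValueError(f"expected prefix {prefix}, got {found_prefix!r}")
    # uuid.UUID raises ValueError for malformed input; the comparison also
    # rejects forms it accepts but we do not use (no hyphens, braces).
    if str(uuid.UUID(suffix)) != suffix.lower():
        raise ValueError(f"suffix is not a hyphenated GUID: {suffix!r}")
    return f"{resolver}/{prefix}/{suffix.upper()}"


print(doi_url("10.5285/e8f43a51-0198-4323-a926-fe69225d57dd"))
# → http://dx.doi.org/10.5285/E8F43A51-0198-4323-A926-FE69225D57DD
```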
See here for a presentation about requesting a DOI.