CMIP6 data at CEDA

The World Climate Research Programme (WCRP) Working Group on Coupled Modelling (WGCM) oversees the Coupled Model Intercomparison Project (CMIP), which is now in its sixth phase (CMIP6). CMIP is an ongoing programme that coordinates climate modelling experiments and produces huge quantities of data, which are archived around the world by members of the Earth System Grid Federation (ESGF). CMIP6 data have been available since 2018, and the analyses will inform the IPCC Sixth Assessment Report (AR6).

NOTE - CMIP6 is currently in progress and data are still being actively retrieved by CEDA, so not all datasets are complete yet. In addition, CEDA will only permanently hold a subset of the CMIP6 data (CEDA is currently prioritising data required for the AR6 WG1 report). Please see below for guidance on how to search the data we hold.

1. CMIP6 data structure

2. Data Access: CEDA

3. Data Access: ESGF

CMIP6 data structure

This section describes the directory structure and filename composition of the CMIP6 data so that users can find specific data more easily.

Directories

The directory structure for CMIP6 is as follows:

<mip_era>/<activity_id>/<institution_id>/<source_id>/<experiment_id>/<variant_label>/<table_id>/<variable_id>/<grid_label>/<version>

mip_era: the phase of the project; this will be CMIP6.
activity_id: the abbreviated identifier of the Model Intercomparison Project (MIP), for example the Aerosols and Chemistry Model Intercomparison Project (AerChemMIP), the Coupled Climate Carbon Cycle Model Intercomparison Project (C4MIP) or the Scenario Model Intercomparison Project (ScenarioMIP). A full list can be found here.
institution_id: the centre or institute responsible for the model.
source_id: the model used. Details for all models should be available through ES-DOC (note this is still an active piece of work).
experiment_id: the experiment being run for CMIP6, for example piControl, historical and 1pctCO2 (1 percent per year increase in CO2).
variant_label: a label constructed from 4 indices (ensemble identifiers) r<k>i<l>p<m>f<n> where:
- k = realization_index
- l = initialization_index
- m = physics_index
- n = forcing_index
table_id: the MIP table being used. The MIP tables are used to organise the variables. For example, Amon contains monthly atmospheric variables and Oday contains daily ocean data. Each variable in a MIP table must have a unique output name. To understand more about the naming conventions of MIP tables please see here.
variable_id: the data variable, for example Near-Surface Air Temperature (tas), Surface Air Pressure (ps) or Relative Humidity (hur).
grid_label: the model grid used, for example global mean data (gm), data reported on a model's native grid (gn) or regridded data reported on a grid other than the native grid and other than the preferred target grid (gr1).
version: the data version (for CMIP6 this is normally of the form vYYYYMMDD).
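As a minimal sketch of how the facets combine, the snippet below joins one value per facet in the documented order. The facet values are those used in Example 1 further down this page; the variable name cmip6_dir is only illustrative, and the resulting path is relative to the archive root.

```shell
# One value per facet (taken from Example 1 on this page; swap in your own).
mip_era=CMIP6
activity_id=ScenarioMIP
institution_id=BCC
source_id=BCC-CSM2-MR
experiment_id=ssp370
variant_label=r1i1p1f1
table_id=Amon
variable_id=tas
grid_label=gn
version=v20190314

# Join the facets in the documented order to form the relative directory path.
cmip6_dir="$mip_era/$activity_id/$institution_id/$source_id/$experiment_id/$variant_label/$table_id/$variable_id/$grid_label/$version"
echo "$cmip6_dir"
```

Replacing the version value with latest gives the path to the most recent version of the same dataset.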

Filenames

The filename structure for CMIP6 is as follows:

<variable_id>_<table_id>_<source_id>_<experiment_id>_<variant_label>_<grid_label>_<time_range>.nc

A number of filename facets are shared with the directory structure, so the only new facet is `time_range`:

time_range: the date range of the data file, given in the format YYYYMMDD-YYYYMMDD, with optional additional elements to cover hours (HH) and seconds (SS). Lower-frequency data use a shorter form, for example YYYYMM-YYYYMM for monthly data. For example, daily data from 1st Jan 1850 to 31st Dec 1899 would be: 18500101-18991231.
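The filename pattern can be taken apart again with plain shell string handling. The sketch below is illustrative only: the filename is a hypothetical one that follows the pattern (it is not a specific file in the archive), and it assumes that no facet value contains an underscore and that the file has a time_range.

```shell
# Hypothetical filename following the pattern above (not a specific archive file).
filename=tas_day_BCC-CSM2-MR_historical_r1i1p1f1_gn_18500101-18991231.nc

# Drop the .nc suffix, then split the remainder on underscores.
# Assumes no facet value itself contains an underscore.
IFS=_ read -r variable_id table_id source_id experiment_id variant_label grid_label time_range <<EOF
${filename%.nc}
EOF

# Split the time range into its start and end dates.
start=${time_range%-*}
end=${time_range#*-}

# Pull the four ensemble indices out of the variant label r<k>i<l>p<m>f<n>.
rest=${variant_label#r}
realization_index=${rest%%i*}
rest=${rest#*i}
initialization_index=${rest%%p*}
rest=${rest#*p}
physics_index=${rest%%f*}
forcing_index=${rest#*f}

echo "$variable_id $table_id $source_id $experiment_id $grid_label $start $end"
echo "r=$realization_index i=$initialization_index p=$physics_index f=$forcing_index"
```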

Version

As described above, the final part of the CMIP6 identifier is the version number. On the filesystem, each dataset directory contains a symbolic link called "latest" that always points to the most recent version. Older versions of the data retain their version number on the filesystem and so can be accessed if needed. The 'latest' directory should be the default that you use, as data may be updated to correct known errors. The data within the 'latest' directory may therefore change if a new version of the data is published.
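Because the contents of 'latest' can change, it may be worth recording which version the link pointed to when you ran an analysis. The following is a small sketch rather than a CEDA-provided tool: resolve_latest is a hypothetical helper name, and it relies on the readlink -f option found in GNU coreutils.

```shell
# Print the name of the version directory that the "latest" link currently
# points to. $1 is the dataset directory that contains the version folders.
resolve_latest() {
    basename "$(readlink -f "$1/latest")"
}
```

It would be called with a dataset directory, for example resolve_latest /badc/cmip6/data/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp370/r1i1p1f1/Amon/tas/gn, and prints a version name of the form vYYYYMMDD.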

Data Access: CEDA

CEDA data browser

Data can be obtained from the CEDA data browser by searching through the directories (see below) and clicking the download button under 'Actions'.

Data discovery: catalogue search

You can search for CMIP6 data through the CEDA Catalogue. Each institution has its own project record, which contains dataset records for the different models and experiments produced by that institute. The search bar can be used to refine what is shown in the results: you can search by institution, model, experiment etc.

From the dataset record page there are three methods of downloading the data you require, including the data browser as described above.

JASMIN users only

Alternatively, data can be obtained directly from JASMIN under:

 /badc/cmip6/data/

As CEDA will only hold a subset of the total CMIP6 archive, the following section gives examples of data discovery on JASMIN so that you can check whether the data you require are available in the CEDA Archive.

If you can't find the data you need, you can check whether they are available from the full CMIP6 distributed archive, which is searchable via the ESGF (see the section below).

Data discovery: example searches

The following section provides examples of Linux commands which identify what CMIP6 data are available. Multiple examples are provided depending on what data are required; these can be modified to suit your specific query. The commands must use the exact CMIP approved terminology for the names of the directories, e.g. ScenarioMIP (for guidance see the controlled vocabulary lists).

Example 1: CMIP identifier to path

If you have the specific CMIP identifier for a dataset and want to check if we hold it in the CEDA Archive, the following can be entered into the command line. Substitute in the CMIP identifier you have; the tr command translates the dots in the identifier into slashes to give the CEDA Archive path, and ls then lists the file(s) within this directory.

cmip_identifier=CMIP6.ScenarioMIP.BCC.BCC-CSM2-MR.ssp370.r1i1p1f1.Amon.tas.gn.v20190314
ls /badc/cmip6/data/$(echo $cmip_identifier | tr . /)
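If the dataset is not held, ls simply reports an error. A slightly more defensive sketch is shown below: it builds the path first and only lists it if the directory exists. The variable names archive_root and dataset_dir are illustrative, and the approach assumes the identifier contains no dots other than the facet separators.

```shell
cmip_identifier=CMIP6.ScenarioMIP.BCC.BCC-CSM2-MR.ssp370.r1i1p1f1.Amon.tas.gn.v20190314
archive_root=/badc/cmip6/data

# Translate the dots in the identifier into slashes to get the archive path.
dataset_dir="$archive_root/$(printf '%s' "$cmip_identifier" | tr . /)"

# List the files only if the directory exists in the archive.
if [ -d "$dataset_dir" ]; then
    ls "$dataset_dir"
else
    echo "Not found in the CEDA Archive: $cmip_identifier"
fi
```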

Example 2: List files

The following command lists all of the files which exist in the CEDA Archive under this specific combination of directories. Adding the last command ' | wc -l ' means the file paths are not listed and the number of files is displayed instead. This may be useful to see if there are a large number of files for your query. The '*.nc' at the end of the path makes ls print one file path per line, so that the count is not inflated by the headings that ls prints when it lists several directories. In this example, without the last command, a list of file paths will be returned looking at all the CMIP models and experiments for one parameter.

ls /badc/cmip6/data/CMIP6/HighResMIP/*/*/*/r1i1p1f1/Amon/tas/gn/latest/*.nc | wc -l

Example 3: List paths

To obtain a list of paths without listing all the files within them, ' -I '*.nc' ' will exclude all nc files. This command would be useful if you wanted to produce a list of paths to pass to a script, which then reads in all the files beneath. In this example, a list of paths will be returned looking at all the CMIP models for one specific parameter.

ls -I '*.nc' /badc/cmip6/data/CMIP6/ScenarioMIP/*/*/ssp370/*/Amon/tas/gn/latest
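When several directories match, ls prints each path as a heading followed by a colon. If a script needs clean paths, one alternative (a sketch, not the method used above) is the -d option, which lists the matching directories themselves rather than their contents. The function name list_ssp370_tas_dirs is hypothetical, and the archive root is passed in as an argument.

```shell
# List the matching dataset directories, one per line, without their contents.
# $1 is the archive root (/badc/cmip6/data on JASMIN).
list_ssp370_tas_dirs() {
    ls -d "$1"/CMIP6/ScenarioMIP/*/*/ssp370/*/Amon/tas/gn/latest
}
```

The output of list_ssp370_tas_dirs /badc/cmip6/data can be piped straight into a loop or saved to a file for a later script to read.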

Example 4: Replacing wildcard

If you would like to broaden your search and do not know the specific directories under each parent, the wildcard (*) can stand in for any directory name. This is also useful if you want to analyse data across all models or all parameters but do not know each specific path combination. In this example, a list of file paths will be returned looking at all the parameters for one specific model.

ls /badc/cmip6/data/CMIP6/CMIP/MOHC/HadGEM3-GC31-LL/piControl/r1i1p1f1/Amon/*/*/latest
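A wildcard search can return a long listing, so it can help to summarise it first. The sketch below prints the number of NetCDF files in each matching directory; count_nc_files is a hypothetical helper name, and it assumes the data files end in .nc and sit directly inside each directory.

```shell
# Print "<number of .nc files> <directory>" for each directory given.
count_nc_files() {
    for dir in "$@"; do
        # Skip anything that is not a directory (e.g. an unmatched pattern).
        [ -d "$dir" ] || continue
        # The trailing slash makes find follow a symbolic link such as "latest".
        n=$(find "$dir"/ -maxdepth 1 -name '*.nc' | wc -l)
        printf '%s %s\n' "$((n))" "$dir"
    done
}
```

For the search above it would be called as count_nc_files /badc/cmip6/data/CMIP6/CMIP/MOHC/HadGEM3-GC31-LL/piControl/r1i1p1f1/Amon/*/*/latest, giving one line per parameter.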

Data Access: ESGF

If the data you need are not available on JASMIN, please visit the Earth System Grid Federation (ESGF) site, which contains the full CMIP6 archive. From this site, you can download the data that you require for analysis. If you download data to your Group Workspace (GWS) on JASMIN for analysis, only do this for a limited number of small files. Otherwise please contact the CEDA helpdesk to request that the data be retrieved for the main archive.

The ESGF site allows you to search through all the available CMIP6 data by filtering with specific requirements. These filters mimic the directory structure used to store data in the CEDA Archive (as seen in the section above).

To access the NetCDF files, select 'list files'; from here the file can be downloaded via HTTP or OPeNDAP. Other methods of downloading the files are available, including a wget script or Globus Connect. For further information see the ESGF website.

Data Access: Object-store (Zarr)

An alternative method is accessing data via JASMIN Object Store where the CMIP6 holdings have been converted to Zarr format. Zarr files ..... These can be accessed by JASMIN notebooks.

What is object store?

An object store is a data storage system that manages data as objects referenced by a globally unique identifier, with attached metadata. This is a fundamental change from traditional file systems that you may be used to, as there is no directory hierarchy - the objects exist in a single flat domain. These semantics allow the object store to scale out much more easily than a traditional shared file system.

How do I access the data?

Data are accessed over HTTP, with authentication using HTTP headers. The data are organised in Zarr files (a format for the storage of chunked, compressed, N-dimensional arrays). Using the JASMIN Notebook service .....

Need to add something about the csv.