Compliance Checking and CEDA-CC

Introduction

For certain projects we run the CEDA-CC tool to check that the files received are compliant with the format specification for that project. The tool can check file names, time ranges across groups of files, and metadata inside files.

This page provides examples of problems identified with files and how they were tackled.

Note about installation of CEDA-CC on ingest1 server

On ingest1.ceda.ac.uk, CEDA-CC is deployed in the standard virtual environment (venv27). You can update the CEDA-CC version on this server by running a single script:

/usr/local/ingest_software/venv27/bin/update_ceda-cc.sh

Typical workflow

The typical compliance checking process is:

  1. The Data Provider places a batch of files in the arrivals space.
  2. CEDA runs the CEDA-CC tool on that batch.
  3. CEDA runs the CEDA-CC tool in summary mode to examine the logs of the main run.
  4. If no errors:
    • CEDA ingests the data into the archive
    • CEDA publishes the datasets to ESGF
  5. If errors:
    • Inform the Data Provider and then go back to step 1.
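
The choice between steps 4 and 5 can be made mechanically from the summary output. The sketch below is illustrative only. It assumes the summary always contains the two lines seen in the first example further down ("Summarising error reports from N log file" and "Number of files with no errors: M") and that the summary is written to standard output. The function name summary_is_clean is hypothetical and is not part of CEDA-CC.

```shell
# Read CEDA-CC summary text on stdin; exit 0 only if every log file was error-free.
# ASSUMPTION: the two header/footer lines matched below are always present.
summary_is_clean() {
    awk '
        /^Summarising error reports from/  { total = $5 }    # number of log files summarised
        /^Number of files with no errors:/ { clean = $NF }   # number of logs with no errors
        END { exit((total != "" && total == clean) ? 0 : 1) }
    '
}
```

It could be used as "ceda-cc --sum <log_dir> | summary_is_clean && echo ready to ingest". A missing or unparseable summary counts as "not clean", so a failed run is never mistaken for a pass.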

Example problems identified with files

This section lists a number of real-life example problems that have been found with input files. It provides hints on how to diagnose issues identified by CEDA-CC and how to communicate them back to the Data Provider, who is responsible for fixing the files.

NOTE: CEDA does not undertake to fix problems identified by CEDA-CC - that is the responsibility of the Data Provider (who should also run a local copy of CEDA-CC before sending us data).

1. Units errors in NetCDF files

In this example, the errors showed up as follows:

$ ceda-cc --sum specs_CNRM-CM5_batch2/
############################ /datacentre/processing/specs/CCCC/trunk/ceda_cc
Summarising error reports from 73018 log file

C4.002.005:  1511  [variable_ncattribute_mipvalues] :Variable [hus] has incorrect attributes: units="1" [correct:"kg kg-1"]
               hus_Amon_CNRM-CM_seaIceInit_S19790501_r10i1p1_197905-197911__qclog_20150521.txt
               hus_Amon_CNRM-CM_seaIceInit_S19790501_r10i2p1_197905-197911__qclog_20150521.txt
Number of files with no errors: 71507
				

I did some grepping and counting to double-check that the errors are all the same:

$ cd specs_CNRM-CM5_batch2/
$ for i in *_CNRM*.txt ; do grep FAILED $i >> ../FAILED.txt ; done
$ wc -l ../FAILED.txt
1511 ../FAILED.txt
$ sort -u ../FAILED.txt
C4.002.005: [variable_ncattribute_mipvalues]: FAILED:: Variable [hus] has incorrect attributes: units="1" [correct: "kg kg-1"]
				

This showed that there was a common error across all failures.
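
The same check can be done in one pass, without the intermediate FAILED.txt file. This is a minimal sketch, assuming the log files follow the "__qclog_<date>.txt" naming seen above and that each failure line contains the string FAILED. The function name tally_failures is made up for illustration.

```shell
# Tally distinct FAILED messages across all qclog files under a directory.
# Usage: tally_failures <log_dir>
tally_failures() {
    # grep -h drops the file name prefix so identical messages collapse together;
    # uniq -c prefixes each distinct message with its count, largest count first.
    find "$1" -type f -name '*__qclog_*.txt' -exec grep -h 'FAILED' {} + \
        | sort | uniq -c | sort -rn
}
```

A single output line means that every failure in the batch has the same cause, which makes the report to the Data Provider much simpler.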

Verified the error by looking inside one of the data files:

$ ncdump -h /group_workspaces/jasmin/specs/CNRM-CM5/batch2/CNRM/CNRM-CM/seaIceInit/S19790501/mon/atmos/hus/r1i1p1/hus_Amon_CNRM-CM_seaIceInit_S19790501_r1i1p1_197905-197911.nc | grep hus | grep units
                hus:units = "1" ;
				

Checked the CF standard name table at:

http://cfconventions.org/Data/cf-standard-names/28/build/cf-standard-name-table.html

It says that specific_humidity (the standard name for "hus") should have units of "1".

Checked in ceda-cc MIP table:

$ head -1329 /usr/local/ingest_software/venv27/config/specs_vocabs/mip/SPECS_Amon | tail -8
variable_entry:    hus
!============
modeling_realm:    atmos
!----------------------------------
! Variable attributes:
!----------------------------------
standard_name:     specific_humidity
units:             kg kg-1
				

So, even though the standard name table says the units should be "1", in SPECS the scientists have decided to use "kg kg-1".
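
Rather than counting lines with head and tail, the expected units can be pulled out of the MIP table by variable name. This is a sketch only, assuming the table uses the "variable_entry:" and "units:" layout shown in the excerpt above. The function name mip_units is hypothetical.

```shell
# Print the "units:" value recorded for one variable_entry in a MIP table.
# Usage: mip_units <mip_table_file> <variable>
mip_units() {
    awk -v var="$2" '
        # Each variable_entry line starts a new block; remember whether it is ours.
        $1 == "variable_entry:" { in_entry = ($2 == var) }
        in_entry && $1 == "units:" {
            sub(/^[ \t]*units:[ \t]*/, "")   # drop the key, keep the value
            sub(/[ \t]+$/, "")               # trim trailing whitespace
            print
            exit
        }
    ' "$1"
}
```

For the table excerpt shown above, "mip_units <path to SPECS_Amon> hus" would print "kg kg-1", which can then be compared with the hus:units value reported by ncdump.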

Outcome

We recommended that the Data Provider change the files.

Note that the code they used to fix this was quite simple:

# Find affected files, grab the units of "hus" and count them:
$ find /group_workspaces/jasmin/specs/CNRM-CM5/batch2/CNRM/CNRM-CM/seaIceInit -type f -name "hus_*.nc" -exec ncdump -h {} \; | grep hus:units | wc -l
   1511 
# Fix all the files using ncatted   
$ find /group_workspaces/jasmin/specs/CNRM-CM5/batch2/CNRM/CNRM-CM/seaIceInit -type f -name "hus_*.nc" -exec ncatted -a units,hus,m,c,"kg kg-1" {} \;
				

Pierre-Antoine fixed the files himself.

2. Variable not in the MIP Table

For some CORDEX files the summary output reported:

C4.002.002: 22
  --- [variable_in_group] :Variable hurs not in table day: 10
               hurs_AFR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19900101-19901231__qclog_20150603.txt
               hurs_AFR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19910101-19951231__qclog_20150603.txt
               hurs_AFR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19960101-20001231__qclog_20150603.txt
               hurs_AFR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_20010101-20051231__qclog_20150603.txt
               hurs_AFR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_20060101-20081130__qclog_20150603.txt
               hurs_EUR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19900101-19901231__qclog_20150603.txt
               hurs_EUR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19910101-19951231__qclog_20150603.txt
               hurs_EUR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_19960101-20001231__qclog_20150603.txt
               hurs_EUR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_20010101-20051231__qclog_20150603.txt
               hurs_EUR-44_ECMWF-ERAINT_evaluation_r1i1p1_MOHC-HadGEM3-RA_v1_day_20060101-20101231__qclog_20150603.txt
				

This means that the variable in these files (hurs) was not listed in the "day" MIP Table. A MIP Table is one of the configuration tables used to drive large experimental model runs. CEDA-CC uses it to check the outputs.
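
Before contacting the Data Provider, it can help to confirm what the deployed table actually contains. This is a minimal sketch, assuming the MIP table marks each variable with a "variable_entry:" line as in the SPECS excerpt in example 1. The function name mip_has_variable is hypothetical, and the location of the CORDEX tables on the server is not given here.

```shell
# Succeed (exit 0) if <variable> has a variable_entry in <mip_table_file>.
# Usage: mip_has_variable <mip_table_file> <variable>
mip_has_variable() {
    awk -v var="$2" '
        $1 == "variable_entry:" && $2 == var { found = 1; exit }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

If the check fails for a variable that the project documentation says is valid, the deployed table is probably out of date rather than the data being wrong.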

Outcome

This issue was raised with the Data Providers. They identified that the project had recently updated the MIP Tables. Contact was made with Martin Juckes and the new version of the CORDEX MIP Tables was added to CEDA-CC trunk and rolled out on the ingest1 server.

Once the new MIP Tables were in place this error ceased to be reported by the checker.

3. Incorrect global attribute value

In one CORDEX case many files had an unrecognised value for an expected global attribute. The summary said:

C4.002.006: 1830
  --- [global_ncattribute_cv] :Global attributes do not match constraints:[('driving_model_id', 'ERAINT', "['ECMWF-ERAINT', 'BCC-bcc-csm1-1', 'BCC-bcc-csm1-1-m', 'BNU-BNU-ESM']")]: 1824
               areacella_CAS-44_ERAINT_evaluation_r0i0p0_MOHC-HadRM3P_v1_fx__qclog_20150603.txt
               areacella_CAS-44i_ERAINT_evaluation_r0i0p0_MOHC-HadRM3P_v1_fx__qclog_20150603.txt
               clivi_CAS-44_ERAINT_evaluation_r1i1p1_MOHC-HadRM3P_v1_6hr_1990010106-1990123118__qclog_20150603.txt
               clivi_CAS-44_ERAINT_evaluation_r1i1p1_MOHC-HadRM3P_v1_6hr_1991010100-1991123118__qclog_20150603.txt
               clivi_CAS-44_ERAINT_evaluation_r1i1p1_MOHC-HadRM3P_v1_6hr_1992010100-1992123118__qclog_20150603.txt
               clivi_CAS-44_ERAINT_evaluation_r1i1p1_MOHC-HadRM3P_v1_6hr_1993010100-1993123118__qclog_20150603.txt
               clivi_CAS-44_ERAINT_evaluation_r1i1p1_MOHC-HadRM3P_v1_6hr_1994010100-1994123118__qclog_20150603.txt
...
				

From the output it is clear that the value given was 'ERAINT' and the list of expected values included 'ECMWF-ERAINT'.

Outcome

The resolution was simply to ask the Data Provider to modify the files (using NCO) so that the global attribute 'driving_model_id' had the value 'ECMWF-ERAINT'. This fixed the problem.
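
The exact commands the Data Provider ran were not recorded. As a sketch of what such a fix might look like, the function below prints (but does not run) one ncatted command per NetCDF file, so the list can be reviewed before anything is changed. The function name print_global_attr_fixes is hypothetical. The ncatted argument follows the same attribute,variable,mode,type,value form used in example 1, with "global" in place of a variable name.

```shell
# Print (do not run) one ncatted command per NetCDF file under a directory,
# each setting a global attribute to a new value.
# Usage: print_global_attr_fixes <dir> <attribute> <new_value>
print_global_attr_fixes() {
    find "$1" -type f -name '*.nc' | sort | while IFS= read -r f; do
        # "global" targets a global attribute; m = modify existing, c = character type.
        printf 'ncatted -a %s,global,m,c,"%s" "%s"\n' "$2" "$3" "$f"
    done
}
```

For example, "print_global_attr_fixes <data_dir> driving_model_id ECMWF-ERAINT" lists the commands, and the output can be piped to sh once it has been checked. This simple form assumes file paths without double quotes or newlines.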

4. Inconsistent file metadata

In one CORDEX case some files had global attributes that did not match the corresponding segments of the file name. The summary said:

C4.002.007:
---  [filename_filemetadata_consistency] :File name segments do not match corresponding global attributes:[(2, 'model_id'), (4, 'startdate'), (5, '@ensemble'), (6, '@forecast_reference_time:4:')]
               zg_day_CNRM-CM5-HRA-LRO_horizlResImpact_S19940501_r1i1p1_19940901-19940930__qclog_20160818.txt
               zg_day_CNRM-CM5-HRA-LRO_horizlResImpact_S19951101_r7i1p1_19960101-19960131__qclog_20160818.txt
				

Looking at the file attribute model_id:

$ ncdump -h zg_day_CNRM-CM5-HRA-LRO_horizlResImpact_S19940501_r1i1p1_19940901-19940930.nc | grep model_id
                :model_id = "CNRM-CM5-LRA-LRO" ;
				

We can see that the model_id in the filename and the model_id in the attributes are not the same: "-LRA-" in the attribute should be "-HRA-". The attributes 'startdate', 'forecast_reference_time' and 'associated_experiment' (which maps to @ensemble) are checked in the same way.
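
The numbers in the error message appear to be positions of underscore-separated segments in the file name, counting from 0 (position 2 is the model_id segment in the example above). This is an inference from the example, not something taken from the CEDA-CC documentation. Under that assumption, a small helper can extract a segment for comparison with the ncdump output. The function name filename_segment is hypothetical.

```shell
# Print underscore-separated segment <index> (counting from 0) of a data file name.
# Usage: filename_segment <file_name> <index>
filename_segment() {
    # Strip any directory part and the .nc extension, then pick the field;
    # cut counts fields from 1, hence the +1.
    basename "$1" .nc | cut -d_ -f"$(( $2 + 1 ))"
}
```

For the file above, segment 2 is "CNRM-CM5-HRA-LRO", which differs from the "CNRM-CM5-LRA-LRO" stored in the model_id attribute.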

Outcome

The resolution was simply to ask the Data Provider to modify the files (using NCO) so that the global attributes 'model_id', 'startdate', 'forecast_reference_time' and 'associated_experiment' had values consistent with the filename. This fixed the problem.
