The CF Metadata Convention
Introduction
In addition to the basic data in a netCDF file, the creator of a dataset needs to include information about the data themselves (e.g. type, units), and may also want to describe how the data were collected or warn future users of pitfalls. These items of information are metadata. Several groups have defined conventions for netCDF files to enable the exchange of data. Since future cataloguing and searching systems will rely on metadata standards, CEDA has decided to adopt the NetCDF Climate and Forecast (CF) Metadata Convention. The Unidata website gives more information about the CF Convention and about other possible conventions for netCDF files. The conventions define metadata that describe what the data in each variable represent. This enables users of data from different sources to decide which quantities are comparable. CEDA receives data from many sources (for example station, satellite, and model data), and we aim to describe all netCDF datasets with the CF conventions.
Whereas netCDF is a binary file format used to order and store data and metadata, with strict rules (so that software will fail to read your files unless the data are correctly structured), the CF conventions are guidelines and recommendations as to where to put metadata within a netCDF file. The CF conventions also advise on what type of information you may want to include.
Ideally, each variable should have a name and units, and possibly other information such as the direction of increasing coordinate value and any statistical processing (e.g. whether the data values are a mean, minimum, maximum, etc.). The CF conventions provide a list of standard_names for variables, held in the most recent version of the CF Standard Name Table. The list includes the units recommended for each standard name (most common prefixes can be used with the units, e.g. kilo (k), hecto (h), mega (M), etc.). If a standard_name metadata attribute is associated with a data variable, its value must be chosen from the list published in the standard name table. The CF conventions do not require a standard name for every data variable, but including one helps data users to understand the contents of a netCDF file. A long_name attribute can also be used to supply text that describes the variable more fully (and perhaps provide a handy graph-axis label). Both standard_name and long_name can be provided for a data variable, and the CF conventions recommend that at least one of them be supplied with each variable in a netCDF file.
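The naming rules above can be sketched as a small check in Python. This is only an illustration: STANDARD_NAME_EXCERPT is a hand-picked excerpt of three entries rather than the full CF Standard Name Table, and check_variable and the variable names passed to it are hypothetical names introduced for this example.

```python
# Tiny excerpt of the CF Standard Name Table (name -> canonical units).
# The real table is far larger; this subset is an assumption for illustration.
STANDARD_NAME_EXCERPT = {
    "air_temperature": "K",
    "air_pressure": "Pa",
    "eastward_wind": "m s-1",
}


def check_variable(name, attrs):
    """Return a list of problems found in one variable's attributes."""
    problems = []
    standard_name = attrs.get("standard_name")
    # A standard_name, if present, must come from the published table.
    if standard_name is not None and standard_name not in STANDARD_NAME_EXCERPT:
        problems.append(
            f"{name}: standard_name '{standard_name}' is not in the table excerpt"
        )
    # At least one of standard_name and long_name is recommended.
    if standard_name is None and not attrs.get("long_name"):
        problems.append(f"{name}: neither standard_name nor long_name is set")
    return problems


ok = check_variable("tempvar", {"standard_name": "air_temperature", "units": "K"})
bad_name = check_variable("tempvar", {"standard_name": "temperature_of_air"})
unnamed = check_variable("var3", {"units": "K"})
print(ok)        # []
print(bad_name)  # one problem: name not in the excerpt
print(unnamed)   # one problem: no standard_name or long_name
```

A variable carrying only a long_name passes this check, which matches the recommendation that at least one of the two attributes be present.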
The guidelines given here are not exhaustive, and some sections of the CF conventions are not covered. Please consult the CF web pages for further information. There is an introduction to CF in this presentation from the CF 2020 workshop.
CF conventions and the netCDF file
Figure 1 shows the structure of a netCDF file. The file has several basic components: dimensions, variables, data, and global attributes.
Dimensions
The dimensions of a variable define the axes of the quantity it contains. Dimensions can be spatial, temporal, or any other quantity (even an index). For example, typical dimensions for gridded model data are latitude, longitude, altitude, and time, while typical dimensions for radar data are range and time. Dimensions may be of any size, including unity. Optionally, one dimension in a netCDF file is allowed to be 'UNLIMITED'. The unlimited dimension allows a data file to be appended to at a later stage along a particular axis, for example, if data are still being collected. Most often this facility is used with the time dimension. The sizes of the various dimensions are declared at the top of the example shown in Figure 1.
Coordinate Variables and Data Variables
Coordinate variables are special variables in a netCDF file. The name of a coordinate variable is the same as the name of its dimension. In the example in Figure 1, the variables dimension1(dimension1) and dimension2(dimension2) are coordinate variables. The dimensions named in parentheses refer to those declared in the first section of the file and give the size of each coordinate variable. Coordinate variables can contain regularly or irregularly spaced steps.
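The naming rule for coordinate variables can be expressed in a few lines of Python. This is a simplified sketch: the file is represented as a plain dictionary mapping each variable name to a tuple of its dimension names, split_variables is a hypothetical helper, and everything that is not a coordinate variable is treated as a data variable.

```python
def split_variables(variables):
    """Split variables into coordinate variables and the rest.

    variables maps each variable name to a tuple of its dimension names.
    """
    coordinate, data = [], []
    for name, dims in variables.items():
        # A coordinate variable is one-dimensional and named after its dimension.
        if dims == (name,):
            coordinate.append(name)
        else:
            data.append(name)
    return coordinate, data


# Hypothetical file layout used only for this example.
variables = {
    "time": ("time",),
    "latitude": ("latitude",),
    "tempvar": ("time", "latitude"),
    "radar_frequency": (),  # scalar variable with no dimensions
}
coordinate_vars, data_vars = split_variables(variables)
print(coordinate_vars)  # ['time', 'latitude']
print(data_vars)        # ['tempvar', 'radar_frequency']
```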
Data variables contain the actual measured or modelled quantity, for example, air temperature. Data variables must not have the same name as any of their dimensions. In Figure 1 the variable declared as variable1(dimension1,dimension2) is a two-dimensional data variable. The data values are contained in variable1 itself and the corresponding coordinates are contained in the coordinate variables dimension1 and dimension2. A data variable can also be a scalar quantity with only a single value, for example radar frequency. Data variables should be given a 'standard_name' metadata attribute where possible; otherwise 'long_name' should be used to describe the variable. For example, a standard name from the CF standard name table can be assigned as follows: tempvar:standard_name = "air_temperature"; where tempvar is the name of a data variable containing air temperature values. An overview of CF standard names is available in this presentation from the CF 2020 workshop.
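To make the layout concrete, the following Python sketch prints a CDL-style header of the kind the netCDF text notation uses for dimensions, variables, and attributes. It does not read or write a netCDF file. The helper render_cdl, the dataset name "example", the dimension sizes, and the units strings are assumptions chosen for illustration; only string-valued attributes are handled.

```python
def render_cdl(name, dimensions, variables):
    """Render a CDL-style header as text.

    dimensions: dict of dimension name -> size (None means UNLIMITED).
    variables:  dict of variable name -> {"type", "dims", "attrs"}.
    """
    lines = [f"netcdf {name} {{", "dimensions:"]
    for dim, size in dimensions.items():
        shown = "UNLIMITED" if size is None else size
        lines.append(f"    {dim} = {shown} ;")
    lines.append("variables:")
    for var, spec in variables.items():
        dims = ", ".join(spec["dims"])
        shape = f"({dims})" if dims else ""  # scalar variables have no dimensions
        lines.append(f"    {spec['type']} {var}{shape} ;")
        for key, value in spec["attrs"].items():
            # Only string attributes are supported in this sketch.
            lines.append(f'        {var}:{key} = "{value}" ;')
    lines.append("}")
    return "\n".join(lines)


dimensions = {"time": None, "latitude": 3}
variables = {
    "time": {"type": "double", "dims": ["time"],
             "attrs": {"units": "hours since 2000-01-01 00:00:00"}},
    "latitude": {"type": "float", "dims": ["latitude"],
                 "attrs": {"units": "degrees_north"}},
    "tempvar": {"type": "float", "dims": ["time", "latitude"],
                "attrs": {"standard_name": "air_temperature", "units": "K"}},
    "radar_frequency": {"type": "float", "dims": [],
                        "attrs": {"long_name": "radar frequency", "units": "Hz"}},
}
cdl = render_cdl("example", dimensions, variables)
print(cdl)
```

Here time and latitude are coordinate variables, tempvar is a two-dimensional data variable, and radar_frequency is a scalar data variable described by a long_name only.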
It is often very useful to include a dummy value for missing data in a file. The CF conventions suggest using the '_FillValue' attribute for this, with a value of the same data type as the variable it is attached to.
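The effect of a fill value on downstream processing can be illustrated with plain Python lists. The value -9999.0 and the helper names below are hypothetical; the point is only that the fill value has the same type as the data (here a float) and must be excluded before any statistics are computed.

```python
FILL_VALUE = -9999.0  # hypothetical choice; a float, like the data it stands in for


def mask_fill(values, fill_value):
    """Replace fill values with None so they cannot be mistaken for data."""
    return [None if v == fill_value else v for v in values]


def mean_ignoring_missing(values, fill_value):
    """Average only the valid values; return None if there are none."""
    valid = [v for v in values if v != fill_value]
    if not valid:
        return None
    return sum(valid) / len(valid)


temperatures = [280.0, FILL_VALUE, 282.0, 284.0]
masked = mask_fill(temperatures, FILL_VALUE)
average = mean_ignoring_missing(temperatures, FILL_VALUE)
print(masked)   # [280.0, None, 282.0, 284.0]
print(average)  # 282.0
```

Without the masking step, the mean of the same list would be dragged far below any physically meaningful temperature by the dummy value.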
There are many other attributes that can be used to provide a detailed description of the variables inside a netCDF file. Please see the CF Conventions Document for the full list of attributes and examples of their usage.
Global attributes
These relate to the dataset at a more general level. They might include information such as instrument name and description, institution name, processing history, and references. The CF conventions do not prohibit extra attributes, so you can include further information in the global attributes if you think it would be useful for future users. CEDA recommends you include as much information as possible.
CF conventions make one global attribute mandatory:
Conventions | "CF-1.0" |
CF conventions recommend the following global attributes:
title | A succinct description of what is in the dataset. |
institution | Specifies where the original data were produced. |
source | The method of production of the original data. If the data are model generated, source should name the model and the version number. If the data are observational, source should characterize them, e.g. surface observation, radiosonde, satellite. |
history | Provides an audit trail for modifications to the original data. Well-behaved generic software will automatically append its name, input parameters, and a timestamp. |
references | Published or web-based references which describe the data, or the methods used to produce them. |
comment | Miscellaneous information about the data or the methods used to produce them. |
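A simple check for the global attributes listed above might look like the following Python sketch. It is not the CF-checker: check_global_attributes is a hypothetical helper, the global attributes are represented as a plain dictionary, and the example values are invented for illustration.

```python
RECOMMENDED = ("title", "institution", "source", "history", "references", "comment")


def check_global_attributes(attrs):
    """Return a list of problems found in a dict of global attributes."""
    problems = []
    conventions = attrs.get("Conventions", "")
    if not conventions.startswith("CF-"):
        problems.append("Conventions attribute is missing or does not start with 'CF-'")
    for name in RECOMMENDED:
        # Treat an absent or empty attribute as missing.
        if not attrs.get(name):
            problems.append(f"recommended attribute '{name}' is missing or empty")
    return problems


# Invented example values; a real file would describe a real dataset.
global_attrs = {
    "Conventions": "CF-1.0",
    "title": "Example surface air temperature observations",
    "institution": "Example Institute",
    "source": "surface observation",
    "history": "2020-01-01 created by an example script",
}
problems = check_global_attributes(global_attrs)
for problem in problems:
    print(problem)
```

With the example dictionary, the function reports the two attributes that were left out (references and comment) and accepts the rest.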
We have generated some examples of the types of information that could be put into the global attributes for three datasets.
The CF-checker
There is a web-based CF-checker that allows you to upload a file to test for compliance with the CF Convention. Visit the CF-checker page to use this service. Alternatively, the CF-checker software can be downloaded and installed by following the instructions at https://github.com/cedadev/cf-checker.