File Names Explained
Introduction
CEDA holds a wide range of instrumental, satellite, aircraft and model datasets of interest to the scientific community. From the point of view of data access, it is highly desirable to adhere to common file formats and file-naming conventions for all the data produced under the various projects. A well-thought-out and organised file-naming convention allows quick data access and saves the user from having to read a file to find out what it contains. Using this convention will save time and resources when setting up data management for each individual project. It will also allow greater analysis and manipulation of the data by software within CEDA and beyond.
There may be specific conventions followed for particular types of data or communities, such as the NWP data from met agencies, or the GHRSST satellite data standards. CEDA encourages the use of these standards where possible. Where no community standard exists, CEDA's own file-naming convention, given below, should be used. Where alternative file-naming conventions have been used in the archive, the nomenclature used should be explained on the relevant dataset catalogue page.
Basic Character set
Files and directories in the CEDA archive are named using a restricted character set to avoid problems when the files are used. For example, a plus sign (+) in a filename may produce an error in a web service that interprets it as a space. Any character that has a special meaning within a URL, Windows or the Unix shell is avoided. This leaves the following plain ASCII characters:
a-z A-Z 0-9 - _ .
Older data and data where there is a clear, long-established naming pattern may break this rule, but generally we will demand that new data deposits follow this convention.
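As an illustration only, the restricted character set can be checked with a short Python sketch. The function names are hypothetical and not part of any CEDA tool, and the sketch assumes the permitted set is exactly the characters listed above.

```python
import re

# Assumption: the permitted set is exactly a-z A-Z 0-9 - _ . as listed above.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._-]+")


def has_only_allowed_chars(name):
    """Return True if name is non-empty and uses only the restricted set."""
    return ALLOWED_NAME.fullmatch(name) is not None


def disallowed_chars(name):
    """Return the sorted, de-duplicated characters that break the rule."""
    return sorted(set(re.sub(r"[A-Za-z0-9._-]", "", name)))
```

A name containing a plus sign or a space fails the first check, and the second function reports which characters are the cause.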
CEDA File-naming convention
The file-naming convention for instrumental (and other) datasets uses long file names since these indicate significant information about the contents of the file without having to read the file or refer to the directory structure. Important attributes in a file name include INSTRUMENT, LOCATION/PLATFORM and TIME.
Please note that FAAM file names expand the convention below by allowing three [_extra] fields, two of which are mandatory for data collected on board the FAAM aircraft (for details, please refer to the FAAM Filename Convention). Participants in FAAM campaigns may generalise this rule to all data collected during FAAM campaigns, and use up to three extra fields separated by underscores, if they wish to do so.
The chosen convention is as follows:
instrument_[location|platform]_YYYYMMDD[hh][mm][ss][_extra][_cor#].ext
Where:
instrument
- is the instrument name (full or shortened) or model name. When the same instrument is used by a number of groups, the instrument name should be prefixed with the institute name/code and a hyphen, for example uea-ptrms and york-ptrms. See current list.
location|platform
- is the location name (full or shortened) or the name of the platform on which the instrument is deployed. This refers to the location/platform for the observation and not the institute or location of the participating scientist/group. This field could be used for a range of items such as a site, a station, a platform (e.g. an aircraft), an institute or a university. See current list.
YYYYMMDD
- is the date on which measurements were taken. If a data file spans more than one day then this field should represent the first day during which data was recorded. The year is given as four digits with month and day as two digits each.
[hh][mm][ss]
- is the time of day (optional). Hours, minutes and seconds are represented as two digits each. Hours can be used alone, hours and minutes can be used together, or all three fields can be included. However, minutes or seconds cannot be used without the preceding time unit (i.e. no minute field is allowed without the hour field).
[_extra]
- this section allows an additional code to define such things as different range resolutions and so forth. It could also be used for version numbers etc.
[_cor#]
- this section denotes that the file is a corrected version of a file previously released by the data provider. The number at the end of "cor" is linked to the details of the correction noted on the dataset catalogue page and in the appropriate readme file for the dataset.
.ext
- will normally be .nc (NetCDF) or .na (NASA Ames), although occasionally other formats will be used, in particular .png and .gif for image files. See current list.
Filenames should contain only the characters [-_.a-z0-9]. Spaces are forbidden and upper case characters should be avoided. The underscore "_" character should only be used as a separator between fields.
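The convention above can be sketched as a regular expression. This is a minimal Python illustration rather than an official CEDA validator: the function name is hypothetical, each field is assumed to contain only lowercase letters, digits and hyphens, and only a single [_extra] field is handled (the FAAM extension with up to three extra fields is not covered).

```python
import re
from datetime import datetime

# instrument_[location|platform]_YYYYMMDD[hh][mm][ss][_extra][_cor#].ext
# Assumption: fields hold only lowercase letters, digits and hyphens.
CEDA_NAME = re.compile(
    r"(?P<instrument>[a-z0-9-]+)"
    r"_(?P<location>[a-z0-9-]+)"
    r"_(?P<date>\d{8})"
    r"(?P<time>(?:\d{2}){0,3})"             # hh, hhmm or hhmmss
    r"(?:_(?!cor\d+\.)(?P<extra>[a-z0-9-]+))?"  # do not swallow a lone _cor#
    r"(?:_cor(?P<cor>\d+))?"
    r"\.(?P<ext>[a-z0-9]+)"
)


def parse_ceda_name(filename):
    """Split a filename into its fields, or return None if it does not fit."""
    match = CEDA_NAME.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # The time part has 0, 2, 4 or 6 digits, so slice the format to match.
    time_format = "%H%M%S"[: len(fields["time"])]
    try:
        fields["start"] = datetime.strptime(
            fields["date"] + fields["time"], "%Y%m%d" + time_format
        )
    except ValueError:
        return None  # e.g. month 13 or hour 25
    return fields
```

For instance, `parse_ceda_name("york-ptrms_mace-head_202001011230_v2_cor1.na")` yields the instrument `york-ptrms`, time `1230`, extra field `v2` and correction number `1`. The location code `mace-head` is made up for this example.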
File-naming for non-standard data (e.g. model, trajectory data)
Some projects will also generate model data, flight data, data recorded at sea (stationary and in transit), trajectories and other non-standard data types. It is suggested that the above format be adapted in the following ways:
- Data recorded on board moving craft
When data is recorded on a moving craft the varying spatial location should not be recorded in the filename. Instead, the location field in the filename should include a name (or code) for the vessel and optionally the flight/voyage code/number.
- Trajectory data
Calculated trajectory data is similar to data recorded on a moving craft. The varying spatial location should not be recorded in the filename. Instead, the location field in the filename should include a relevant code for the trajectory type/model/number.
- Model data
In the case of model data, the instrument field in the filename should instead be used for a model code (indicating the type, version etc. of the model). For box models running at one location only, the location field should include this. However, models that output data over a grid can use appropriate codes to represent this.
- Use of the [_extra] additional information field
The [_extra] field is unlikely to be used in most cases but is provided as an option for exceptional cases where the data producer wishes to include some additional information not otherwise catered for. One such use might be in forecast files, where the date and time provide the start time whilst the [_extra] field provides the time of the actual forecast. Care should be taken not to overload this field.
- Use of the [hh][mm][ss] time options
The [hh][mm][ss] options are included for occasions where data is produced at such a high frequency that storing it in multiple files per day, hour or minute becomes appropriate. This is unlikely to be commonplace but is available for special cases.
- Image files
Text files (.txt) may be included to describe image data. Apart from the file name extension (last field), files containing images and their associated metadata should have the same name. When data exist both as NASA Ames formatted files and as images, the files should also have the same name, except for the file name extension.
Standardising common names in the naming convention - adding new names
In order to standardise the names used within the file-naming convention CEDA will need to collate those currently used by the community and publish them via our website. This can be regularly extended to include new locations, instruments, models, etc. Interaction with instrument scientists and modellers will be essential to achieve this aim successfully. Should you need to extend the lists indicated above please contact us to discuss your requirements.
Old CEDA "8.3" filenames
Older datasets within the CEDA archive pre-date the CEDA file-naming convention and were constructed to conform to the filename length limits of MS-DOS: namely an "8.3" character string, i.e. 8 characters, a period and then 3 more characters. This restricted the way in which the necessary filename parts could be encoded. However, many such files in the CEDA archive follow this pattern:
ppYYMMDD.eee
where:
pp
- typically used for a location or platform identifier
YY
- two-digit year
MM
- month
DD
- day
eee
- the file extension, typically used to encode some information about the parameter or instrument data covered by the file.
For example in the ACSOE EAE-96 campaign at Mace Head we find files such as:
mh960703.cn1 mh960703.cn3 mh960703.epi mh960703.nx2 mh960703.o31
Here mh was used for the Mace Head site, while the various extensions (cn1, cn3, epi, nx2 and o31) correspond to different data outputs from the various suites of instruments deployed at the site during the campaign. Such platform and instrument codes have, where possible, been added as CEDA abbreviations to the relevant records in the CEDA data catalogue.
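A rough Python sketch of how such "8.3" names could be split into their parts is shown below. The function name is hypothetical, and because the two-digit year does not record the century, the sketch assumes a caller-supplied century that defaults to 1900.

```python
import re
from datetime import date

# ppYYMMDD.eee : 2-character platform code, 6-digit date, 3-character extension
OLD_NAME = re.compile(
    r"(?P<pp>[a-z0-9]{2})(?P<yy>\d{2})(?P<mm>\d{2})(?P<dd>\d{2})"
    r"\.(?P<eee>[a-z0-9]{3})"
)


def parse_old_name(filename, century=1900):
    """Split an 8.3 name into platform, date and extension, or return None.

    Assumption: the century is not stored in the name, so it must be
    supplied by the caller (1900 by default).
    """
    match = OLD_NAME.fullmatch(filename.lower())
    if match is None:
        return None
    try:
        day = date(century + int(match["yy"]), int(match["mm"]), int(match["dd"]))
    except ValueError:
        return None  # not a real calendar date
    return {"platform": match["pp"], "date": day, "extension": match["eee"]}
```

With the Mace Head example, `parse_old_name("mh960703.cn1")` gives the platform code `mh`, the date 3 July 1996 and the extension `cn1`.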
File Extensions
Extensions may give an indication of the format of the data. The following list gives the file extensions commonly used within the CEDA Archive.
Allowed values:
Value | Description |
---|---|
na | NASA Ames file |
csv | BADC-CSV file |
ict | ICARTT ASCII format (updated NASA Ames) |
nc | NetCDF file |
txt | free form metadata file |
jpg | JPEG image file |
gif | GIF image file |
png | PNG image file |
tar | TAR archive file |
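
The table above can be turned into a simple lookup, sketched here in Python. The dictionary copies the listed values; the function name is hypothetical, and returning None for any extension not in the table is an assumption about how unknown extensions should be handled.

```python
from pathlib import PurePosixPath

# Extensions and descriptions copied from the table above.
EXTENSIONS = {
    "na": "NASA Ames file",
    "csv": "BADC-CSV file",
    "ict": "ICARTT ASCII format (updated NASA Ames)",
    "nc": "NetCDF file",
    "txt": "free form metadata file",
    "jpg": "JPEG image file",
    "gif": "GIF image file",
    "png": "PNG image file",
    "tar": "TAR archive file",
}


def describe_extension(filename):
    """Return the description for a file's extension, or None if unlisted."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    return EXTENSIONS.get(ext)
```

Old-style names such as mh960703.cn1 return None here, because their extensions encode instrument information rather than a file format.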