File Formats Demystified
Introduction
A file format is a way to encode information in a computer file. A format specifies how to interpret the bytes in the file as information with meaning to the programs and people reading and writing it. Each format is designed to carry a particular type of data, though some formats are more specific and others more general in scope. For example, the PNG format is excellent for encoding an image, but could not easily be used to store a 3D computer-aided design model.
Text formats are those where the bytes in the file should be interpreted as text characters. This means that generic text editors can be used to view or change the data, which is very useful if your data is small and can be understood by human inspection. There are different ways to encode text, but most text files are encoded with ASCII or Unicode.
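As a minimal sketch of what "interpreting bytes as text" means, the Python snippet below decodes a few bytes with ASCII and with UTF-8 (a common Unicode encoding). The sample record and the helper name try_ascii are invented for illustration and are not taken from any CEDA dataset.

```python
# Illustrative only: this sample record is invented for the sketch.
raw = b"time,temp\n2020-01-01T00:00,4.5\n"

# Bytes that are pure ASCII decode to the same text as ASCII or as UTF-8.
as_ascii = raw.decode("ascii")
as_utf8 = raw.decode("utf-8")

# A non-ASCII character (here the degree sign) needs a Unicode encoding.
units = "°C".encode("utf-8")  # three bytes: 0xC2 0xB0 0x43


def try_ascii(data: bytes) -> str:
    """Hypothetical helper: decode as ASCII, or report that the bytes are not ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return "<not ASCII>"
```

If a text file displays strange characters in an editor, a mismatch between the encoding used to write the file and the one used to read it is a likely cause.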
Binary formats are those where the bytes have to be interpreted by the specific format rules to work out their meaning. This necessitates the use of specialised programs to read and write the data.
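The sketch below illustrates why the format rules matter for binary data, using Python's standard struct module. The record layout (a 16-bit station id followed by a 32-bit float) is hypothetical and chosen only for this example; real formats such as NetCDF or GRIB define far richer layouts.

```python
import struct

# Hypothetical record layout, invented for this sketch:
# little-endian, one unsigned 16-bit station id, then one 32-bit float.
RECORD = struct.Struct("<Hf")

# Write one record: six bytes in total.
payload = RECORD.pack(42, 1.5)

# Reading with the correct rules recovers the original values.
station_id, value = RECORD.unpack(payload)

# Reading the same six bytes with different rules (three unsigned
# 16-bit integers) yields numbers with no relation to the measurement.
as_three_shorts = struct.unpack("<3H", payload)
```

The bytes themselves never change; only the rules applied to them do, which is why binary data needs software that knows the format.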
The CEDA Archive includes a wide range of file formats: some are well supported, while other, historical ones are less so. The table below lists some of the main formats within the CEDA Archive, with links to tools supporting each format. For information about which format is used for a dataset, please see the CEDA data catalogue.
Additionally, how information is stored within or about files (so-called metadata) is key to how the data within files can be used. See the "Introduction to metadata" article for information about metadata formats.
Core Supported Formats
Format | Type | File endings | Commonly used for |
BADC-CSV | text | .csv | simple "1-D" type of data, e.g. instrument time series data |
NASA Ames | text | .na | aircraft and older instrument data (older data may have an older file-naming convention) |
HITRAN | text | various | spectroscopy data |
JCAMP-DX | text | .dx, .jdx | only suitable for spectra from spectroscopy experiments |
NetCDF | binary | .nc | CEDA's preferred data format. Model data and observational data with more than 1 dimension (e.g. time-height data). Suitable for gridded numeric data such as model output. CF conventions preferred - migration to make CF compliant acceptable. |
HDF | binary | .hdf | Satellite data. Requires consistent conventions to be followed. |
PP | binary | .pp | Met Office model output |
GRIB | binary | .grb | ECMWF model output |
GEOTIFF | image | .tif, .TIFF | Earth observation imagery data |
JPEG2000 | image | .jp2 | Earth observation imagery data |
JPEG | image | .jpg | For images |
TIFF | image | .tif, .tiff | For images |
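Many of the binary and image formats in the table above begin with a fixed byte signature (a "magic number"), which offers a quick first check when a file ending is missing or misleading. The sketch below is a rough illustration only: the function name sniff_format and the labels are invented here, the list is not exhaustive, and some files carry extra headers before the signature, so an "unknown" result does not prove a file is invalid.

```python
# Leading byte signatures for some formats in the table above.
# Labels are descriptive only; the list is deliberately incomplete.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also used by netCDF-4)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"GRIB", "GRIB"),
    (b"II*\x00", "TIFF or GeoTIFF (little-endian)"),
    (b"MM\x00*", "TIFF or GeoTIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
]


def sniff_format(header: bytes) -> str:
    """Hypothetical helper: guess a format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"
```

In practice the header would come from the start of a file, for example `open(path, "rb").read(8)`. A matching signature only suggests the container format; it says nothing about whether the contents follow conventions such as CF.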
Other Accepted Formats
A range of other formats have been added to the CEDA Archive over time. Some of these are historical, whilst others come from third-party sources whose data CEDA obtains in a facilitation role. Not all are listed here; refer to the file format information on the dataset records in the data catalogue.
Format | Type | File endings | Commonly used for | Notes |
PNG | image | .png | | |
BUFR | binary | | meteorology data | WMO standard |
Nimrod format | binary | .dat | Met Office NIMROD rain radar data | To be superseded by ODIMS compliant HDF5 |
BIL | binary | .bil | flat binary format used by ENVI users - produced by ARSF processing node | |
LAS | | | point cloud for EO data | |
PDF(a) | | | suitable for documentation only | |
plain text | text | .txt | suitable for documentation only. Data should utilise an approved format instead. | |
ENVI-HDF | | | | |
Not Accepted Formats
These formats have been reviewed by CEDA and deemed not acceptable for long-term data archival.
Format | Type | File endings | Alternative format to use |
csv/tsv | text | .csv, .tsv | BADC-CSV |
Excel | text | .xls | BADC-CSV |
HTML | text | .html | BADC-CSV |
Word | text | .doc, .docx | BADC-CSV, PDF |
Compression and Aggregations
At times it is desirable to compress files to reduce overall volumes, and to aggregate files together to aid transfer and storage. These options come into play primarily where there are either large numbers of files or large data volumes, though the impact on onward use of the data (the need to uncompress or unpack) should be considered too.
Note - these should only be applied to files that are already formatted in a permitted format given above.
Format | Type | File endings | Commonly used for |
internal compression | compression | (retains main format ending) | Reducing file sizes e.g. HDF5, netCDF |
tar | aggregation | .tar | Aggregating a number of files as a "tar ball" allows a set of files to be downloaded together. |
gzip, bzip, zip | compression | .gz, .bz, .zip | Reduce the volume required for the file to aid transfer and storage. Note, requires uncompressing before use. |
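The aggregation and compression steps in the table above can be sketched with Python's standard tarfile and gzip modules. This is an illustration under stated assumptions rather than a CEDA-recommended workflow: the function names and file names are hypothetical, and the placeholder file contents stand in for data already written in a permitted format.

```python
import gzip
import shutil
import tarfile
import tempfile
from pathlib import Path


def bundle_and_compress(files, tar_path):
    """Hypothetical helper: aggregate files into a tar ball, then gzip it."""
    tar_path = Path(tar_path)
    # Aggregation: store each file under its base name only.
    with tarfile.open(tar_path, "w") as tar:
        for f in files:
            tar.add(f, arcname=Path(f).name)
    # Compression: write a gzip copy alongside the tar ball.
    gz_path = tar_path.with_name(tar_path.name + ".gz")
    with open(tar_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def list_members(gz_path):
    """Hypothetical helper: list the file names inside a gzipped tar ball."""
    with tarfile.open(gz_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration in a temporary directory, using placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    names = ["a.nc", "b.nc"]  # invented file names
    for n in names:
        (tmp / n).write_bytes(b"placeholder, not real NetCDF")
    archive = bundle_and_compress([tmp / n for n in names], tmp / "dataset.tar")
    members = list_members(archive)
```

Listing the members reads through the compressed stream, which is a small instance of the extra unpacking cost that users of compressed or aggregated data incur.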