ASCII Data Formats
The CEDA archives contain many different data formats, including both "binary" and "ASCII" formats (such as "tsv" or "csv" data). ASCII (American Standard Code for Information Interchange) is a term often used to mean that the data are stored in a human-readable manner, though strictly it refers to a specific set of permitted characters (e.g. a-z, A-Z, 0-9, common punctuation and a small set of control characters).
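As a rough illustration of the stricter meaning of "ASCII", the sketch below checks whether a sequence of bytes stays within the 7-bit ASCII range (byte values 0 to 127). The function names are hypothetical and chosen only for this example; they are not part of any CEDA tool.

```python
def is_ascii(data: bytes) -> bool:
    """Return True when every byte is in the 7-bit ASCII range 0-127."""
    return data.isascii()  # bytes.isascii() is available from Python 3.7


def non_ascii_positions(data: bytes, limit: int = 5) -> list[int]:
    """Return the byte offsets of the first few non-ASCII bytes, if any."""
    positions = []
    for offset, byte in enumerate(data):
        if byte > 127:
            positions.append(offset)
            if len(positions) == limit:
                break
    return positions


# Illustrative input only: a degree sign is not an ASCII character.
print(is_ascii(b"temperature,12.5\n"))
print(non_ascii_positions("25 \u00b0C".encode("utf-8")))
```

A file that fails this check may still be perfectly readable text (for example UTF-8), so the result is a hint about the encoding rather than a verdict on the format.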
Whilst CEDA encourages the use of specific formats for encoding data (see the list on our File Formats Demystified page), and these cover the majority of the CEDA archive, some datasets (mainly legacy ones) use other formats, often referred to as "ASCII" formats.
Identifying an ASCII formatted file
The simplest way to spot an ASCII file is to try to open it: if you can read the contents then it's a text file. However, it is also worthwhile checking whether the format is bespoke to that dataset or conforms to one of the main archive text file formats, such as BADC-CSV or NASA-Ames. To help check, try the following:
- Check the dataset's catalogue page for any format information and linked documentation
- ASCII files often have a .txt file ending, though older files may follow the earlier 8.3 file-naming convention.
- BADC-CSV files have a .csv file ending and have Conventions,G,BADC-CSV at the top.
- NASA-Ames files typically have a .na file ending and a top line consisting of two numbers: the first is the number of lines in the file before the data, and the second (a 4-digit number) indicates the type of NASA-Ames file.
These two, and others, are detailed further on the File Formats Demystified page.
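The first-line checks described above can be sketched in a few lines of Python. This is only a heuristic built from the description on this page; the function name is hypothetical, the example first lines are illustrative, and a match does not guarantee that the rest of the file conforms to the format.

```python
def guess_archive_format(first_line: str) -> str:
    """Guess the archive text format from the first line of a file."""
    line = first_line.strip()

    # BADC-CSV: the file starts with a Conventions,G,BADC-CSV entry.
    if line.replace(" ", "").startswith("Conventions,G,BADC-CSV"):
        return "BADC-CSV"

    # NASA-Ames: two whole numbers, the second having four digits.
    parts = line.split()
    if (
        len(parts) == 2
        and all(part.isdigit() for part in parts)
        and len(parts[1]) == 4
    ):
        return "NASA-Ames"

    return "unknown"


# Illustrative first lines, not taken from real archive files.
for example in ["Conventions,G,BADC-CSV,1", "34 1001", "Some bespoke header"]:
    print(example, "->", guess_archive_format(example))
```

Anything reported as "unknown" is likely to be a bespoke format, in which case the dataset's documentation is the place to look next.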
ASCII Format Documentation
When CEDA archives data that doesn't use one of our archive standard formats, we try to obtain supporting documentation describing that format and add links to it on the relevant dataset pages in the CEDA data catalogue. Links to such documents can be found under the "Docs" tab on the relevant dataset's catalogue page. However, should you be unable to find relevant documentation, please contact the CEDA helpdesk for further assistance.
Reading ASCII formatted data
Sometimes software tools are available to read the data in the format provided. These will often be linked from the "Docs" tab of the dataset's catalogue page, or may be stored in a "software" folder within the dataset collection in the archive. An internet search may also reveal tools produced elsewhere.
However, this may not always be the case and instead the data may need to be read into a data processing programme, spreadsheet or database. In such cases the documentation should give an indication of how to read these data in, or the following two sections may assist.
Before starting to handle the file, though, it can be helpful to get an idea of what the file looks like by just opening the file with a text editor (for example, Windows - Notepad, Mac - Nano, Linux - Emacs).
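If a text editor is not to hand, the same quick look can be taken from Python. The sketch below returns the first few lines of any iterable of lines; the function name and the file name in the usage comment are hypothetical.

```python
from itertools import islice
from typing import Iterable


def preview(lines: Iterable[str], n_lines: int = 10) -> list[str]:
    """Return the first n_lines lines with their line endings removed."""
    return [line.rstrip("\r\n") for line in islice(lines, n_lines)]


# Usage with a file (hypothetical file name). errors="replace" stops a
# stray non-ASCII byte from aborting the preview.
# with open("example_dataset.txt", encoding="ascii", errors="replace") as handle:
#     for line in preview(handle, 15):
#         print(line)
```

Because only the requested number of lines is read, this also works on very large files.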
Common ASCII data structure
A common approach to ASCII data is to have two sections to the file:
- A header section - sometimes described as the metadata section. This may contain extra information either about how the data were collected/produced (e.g. instrument name, settings), some relevant additional information (e.g. who produced the data, where it was produced) or information on how to read the data in or interpret it (e.g. number of columns/rows in the data, headings for the columns, scale factors to use). However, these items may not be labelled to tell you what the values are for - here referring to external documentation is likely to help.
- A data section - below the header section will be the data themselves, which may be split up using a delimiter such as a comma, tab, "|" or similar character.
Comma (CSV), tab (TSV) and other delimited data
Often ASCII data will be delimited into columns to help users understand how the data should be treated. Common delimiters are a comma, semi-colon, "|", space(s) or tab(s). Such data can readily be imported into programmes such as spreadsheets by using an "import" function for text files and selecting the appropriate delimiter(s). Guides on how to do this in common packages such as Excel, Google Sheets and Access can easily be found by searching the internet.
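Delimited text can also be read with a few lines of code. The sketch below uses Python's standard csv module with an explicitly chosen delimiter, plus a deliberately simple helper that guesses the delimiter by counting candidate characters in one sample line. Both function names are hypothetical, and the guess should be checked by eye before it is trusted.

```python
import csv
import io


def guess_delimiter(sample_line: str, candidates: str = ",\t;|") -> str:
    """Return the candidate character that occurs most often in the line."""
    return max(candidates, key=sample_line.count)


def read_delimited(text: str, delimiter: str = ",") -> list[list[str]]:
    """Split delimited text into rows of stripped string fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[field.strip() for field in row] for row in reader if row]


# Made-up sample, for illustration only.
sample = "station|temp|pressure\nA1|12.5|1013.2\n"
delimiter = guess_delimiter(sample.splitlines()[0])
for row in read_delimited(sample, delimiter):
    print(row)
```

Data separated by a variable number of spaces does not suit a single-character delimiter; calling str.split() with no arguments on each line is usually simpler in that case.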
However, care should be taken, as the file may have a top (header) section that is not laid out as delimited data in the same way as the data lines themselves, as in the example below.
One example of an ASCII formatted file not covered by the standard archive formats is the one used in the CRU TS data, which typically looks like the following:
Climatic Research Unit Country File created on Thu 2 Jul 2015 18:16:18 BST, from CRU TS run #1506241137
Country = Afghanistan : Parameter = Precipitation : Units = mm/month
Period = 1901.2014 : missing value = -999.0 : format = (i5,17f8.1)
 YEAR     JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC     MAM     JJA     SON     DJF     ANN
 1901    65.4    13.5    46.8    32.5    50.9    20.1     7.8     2.6     3.9    11.5     7.7     7.7   130.1    30.6    23.1    48.6   270.4
 1902    19.8    21.1    41.6    33.8    14.1     5.2     2.5     3.3     1.9    19.8    53.9    19.0    89.5    11.0    75.6   117.9   236.0
 1903    49.0    49.8    68.2    38.1    80.6     7.1     7.4     6.6     6.3     2.0    12.0    21.4   186.8    21.1    20.2   113.8   348.4
 1904    67.8    24.6    77.2    23.6    28.2     0.3     3.0     4.1     8.6    21.6    26.3    20.4   129.0     7.4    56.4   128.8   305.6
 1905    71.3    37.1    71.4    34.4    15.8     4.5     1.1     3.6     5.5     2.4     3.0    43.1   121.7     9.2    10.9   176.8   293.2
 1906    20.8   112.9    56.3    43.2    14.6     6.2     6.7     5.5     2.3     4.7     5.4    34.7   114.1    18.4    12.4   137.8   313.2
 1907    52.0    51.1    40.9    50.8    40.8    15.1     6.4     4.7     3.3    19.0     8.3     9.9   132.5    26.2    30.6   127.5   302.3
 1908    80.1    37.5    55.5    87.1    10.5     3.1     9.8     7.3     6.1     2.8     1.2    45.1   153.2    20.2    10.1   137.4   346.2
 1909    31.5    60.8    52.1    79.4    26.4     8.1    12.8     2.5     2.5     5.9     4.4    47.5   157.9    23.4    12.7   161.3   333.9
 1910    82.2    31.5    53.7    30.0    19.0     4.8    13.5    17.0     0.7     2.5     1.9    22.5   102.7    35.4     5.2   140.2   279.5
 1911    83.3    34.5    91.8    30.2    32.9     1.3     0.4     2.5     0.9     6.3    11.5    21.1   154.9     4.2    18.7   129.5   316.7
Here the header or metadata section is clearly separated from the monthly values by the column headers line. The metadata section also gives useful information such as:
- who produced the file
- when it was produced
- from what source data it was produced
- which area is covered
- what the data represent and their units
- what period the data cover
- how to identify "missing data" points by a given data value
Below this header section, the data lines are then tab delimited.
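A file laid out like the example above could be read along the following lines. This is a minimal sketch rather than an official CRU or CEDA reader. It assumes that the column headers line is the first line starting with YEAR, that metadata entries take the "name = value" form separated by ":", and that columns are separated by whitespace (splitting on any whitespace copes with tabs and spaces alike). The function name is hypothetical.

```python
def parse_country_file(lines: list[str]):
    """Return (metadata, records) from the lines of a country file."""
    # Find the column headers line, which separates metadata from data.
    header_index = None
    for index, line in enumerate(lines):
        if line.split()[:1] == ["YEAR"]:
            header_index = index
            break
    if header_index is None:
        raise ValueError("no column headers line starting with YEAR found")

    # Collect "name = value" entries; pieces without "=" are ignored.
    metadata = {}
    for line in lines[:header_index]:
        for item in line.split(":"):
            if "=" in item:
                key, value = item.split("=", 1)
                metadata[key.strip()] = value.strip()

    # Assumed fallback of -999.0 if the header gives no missing value.
    missing = float(metadata.get("missing value", "-999.0"))
    columns = lines[header_index].split()[1:]

    # One record per year; missing values become None.
    records = {}
    for line in lines[header_index + 1:]:
        fields = line.split()
        if not fields:
            continue
        values = [float(field) for field in fields[1:]]
        records[int(fields[0])] = {
            name: (None if value == missing else value)
            for name, value in zip(columns, values)
        }
    return metadata, records


sample = """\
Country = Afghanistan : Parameter = Precipitation : Units = mm/month
Period = 1901.2014 : missing value = -999.0 : format = (i5,17f8.1)
 YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC MAM JJA SON DJF ANN
 1901 65.4 13.5 46.8 32.5 50.9 20.1 7.8 2.6 3.9 11.5 7.7 7.7 130.1 30.6 23.1 48.6 270.4
 1902 19.8 21.1 41.6 33.8 14.1 5.2 2.5 3.3 1.9 19.8 53.9 19.0 89.5 11.0 75.6 117.9 236.0
"""
metadata, records = parse_country_file(sample.splitlines())
print(metadata["Country"], metadata["Units"])
print(records[1901]["JAN"], records[1902]["ANN"])
```

Converting the missing value to None at read time keeps it out of any later averages or totals, where a stray -999.0 would otherwise skew the result.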