Handling Erroneous Data and Data Versioning

The notes below set out a common approach to handling minor and major updates to archived data after publication. A common approach helps users become familiar with how we handle such data in the archive, and gives us the potential to automate checks in the archive, and perhaps in the ingestion process, to cope with these amendments from data providers.

Definitions

When to use correction and versioning terminology

When a change in the processing of the incoming data warrants the creation of a new Observation record (in MOLES3 terms; in MOLES2, a new Deployment), this should be treated as a new VERSION, and the data in the archive are held distinctly separate from earlier versions of the data. However, when an error has been identified in data already provided (e.g. erroneously produced due to a processing hiccup) and only a portion of the dataset is to be re-issued, a CORRECTION is issued. The remainder of these notes covers how to handle corrections.

Handling data Corrections

Once alerted that erroneous data have been provided some initial steps need to be taken:

Establish whether the data are already in the archive. If not, prevent them from being ingested.
If they are already in the archive, determine whether they have already been used. If not, consider removing access to these files (BUT RETAIN them!).
Liaise with the data provider to establish the nature of the error, how it will be corrected and over what time scale.
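
The first two checks above can be sketched as a small decision function. This is only an illustration of the logic, not an existing archive tool: the names `Action` and `triage` are hypothetical, and in every case the data provider still needs to be contacted.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes described in the steps above."""
    BLOCK_INGEST = "not yet in the archive: prevent ingestion"
    WITHDRAW_ACCESS = "in the archive but unused: consider removing access, retain files"
    RETAIN_AND_DEMARCATE = "in the archive and used: retain and mark as erroneous"


def triage(in_archive: bool, already_used: bool) -> Action:
    """Return the suggested first action for a set of erroneous files."""
    if not in_archive:
        return Action.BLOCK_INGEST
    if not already_used:
        return Action.WITHDRAW_ACCESS
    # Files that have been used must stay available for reproducibility.
    return Action.RETAIN_AND_DEMARCATE
```

Note that none of the outcomes deletes anything: even where access is withdrawn, the files themselves are kept.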

Why keep the files?

The files need to be kept because they may already have been used in earnest: the data are in circulation and results based on them may have been made available. The data must therefore be retained so that those results can be reproduced.

How to archive the old data?

While the data are to be retained, they must be demarcated so that users are aware that they are subject to some error. To do this I employed the following method:

Create a sub-directory called “erroneous_files” at a suitable location.

For MST radar data products I created the following:

/badc/mst/data/mst-products-v3/st-mode/radial/erroneous_files
/badc/mst/data/mst-products-v3/st-mode/cartesian/erroneous_files

Within the erroneous_files sub-folder create a readme file to inform the user of the reason for the folder and to act as an index of sub-folders created therein, e.g.

  00README_error_details
  00README_error_details.

This file provides details of the contents of this directory. The contents have been moved from the main directory structure to demark these files as those known to suffer from a particular error.

These files are maintained within the archive to preserve file traceability and provenance where these have already been made use of within the user community and thus permit reproducibility of reported results based on these erroneous files.

In addition, where corrections to these erroneous files have been determined, details are provided. Where possible, corrected versions of these files have been produced by the data provider and placed within the archive. Such files will be duly marked as corrections with an appropriate suffix matching the directory names used within this archive and detailed below.

Index of folders containing erroneous files.
*******************************************

correction_c1
*************
date range - 2007/02/06 - 2008/04/08 inclusive
see mst.nerc.ac.uk/announce_20080416.html for further details

Owing to a bug in the new MST radar control and data acquisition system, there have been errors in the
reported ranges and altitudes of all MST radar data products for the period from 13:44:03 UT on 6th
February 2007 until 14:37:00 UT on 8th April 2008, inclusive.

Create sub-directories to contain the data that need moving from the archive, including any sub-folders needed to reflect the archive structure.
E.g. for the two sets of data above I have:
/badc/mst/data/mst-products-v3/st-mode/radial/erroneous_files/correction_c1/YYYY/MM/<files>
/badc/mst/data/mst-products-v3/st-mode/cartesian/erroneous_files/correction_c1/YYYY/MM/<files>
A readme file within each correction sub-directory can contain additional information about this correction, e.g.:
correction_c1
*************
date range - 2007/02/06 - 2008/04/08 inclusive
see mst.nerc.ac.uk/announce_20080416.html for further details

Owing to a bug in the new MST radar control and data acquisition system, there have been errors in the
reported ranges and altitudes of all MST radar data products for the period from 13:44:03 UT on 6th
February 2007 until 14:37:00 UT on 8th April 2008, inclusive.

A compensation for this problem was introduced into the signal processing software starting with the
cycle/dwell beginning at 12:48:15 UT on 7th February 2008. However, it turns out that this
over-corrected the problem. Moreover, this meant that the data at the lowest three range gates were
unreliable.
<and so on..>

Move files from the archive proper to the corrections sub-folder just created (and thus remove the data from the main part of the archive).
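
The directory creation and move steps above lend themselves to scripting. The following is a minimal sketch, assuming the affected files all sit below a single product directory (such as the radial or cartesian directories listed earlier) and that their relative layout (YYYY/MM/<files>) should be preserved. The function name `quarantine` and its arguments are hypothetical, not an existing archive tool.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def quarantine(product_dir: Path, files: Iterable[Path],
               correction: str = "correction_c1") -> List[Path]:
    """Move files into <product_dir>/erroneous_files/<correction>/,
    keeping each file's path relative to product_dir (e.g. YYYY/MM/<file>)."""
    target_root = product_dir / "erroneous_files" / correction
    moved = []
    for src in files:
        # Raises ValueError if src is not located below product_dir.
        relative = src.relative_to(product_dir)
        dest = target_root / relative
        if dest.exists():
            # Never overwrite: the retained copy is needed for provenance.
            raise FileExistsError(f"{dest} already exists")
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

The sketch deliberately refuses to overwrite anything already in the erroneous_files area, and it does not write the readme files, which still need to be created by hand.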

How to deal with subsequent supply of corrected data

Hopefully the data supplier will provide new data to replace those that have just been demarcated as erroneous. These should be ingested into the archive proper, but with a suffix that clearly marks the files as a) new, b) different and c) related to the holding in the erroneous_files directory. E.g. for the above files a suffix of “_c1” is added to the new files.
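
As an illustration of the naming convention, the sketch below derives a corrected file name and the matching erroneous_files sub-directory from a correction identifier. Placing the suffix before the file extension is an assumption on my part (the notes only say that a suffix is added), and the function names and the example file name are hypothetical.

```python
from pathlib import Path


def corrected_name(filename: str, correction_id: str = "c1") -> str:
    """Add "_<correction_id>" to a file name.

    Assumption: the suffix goes before the final extension.
    """
    p = Path(filename)
    return f"{p.stem}_{correction_id}{p.suffix}"


def correction_dir(correction_id: str = "c1") -> str:
    """Name of the matching sub-directory under erroneous_files."""
    return f"correction_{correction_id}"


# Hypothetical file name, for illustration only.
print(corrected_name("example_20070301.nc"))  # example_20070301_c1.nc
print(correction_dir())                       # correction_c1
```

Deriving both names from the same identifier keeps the corrected files and the retained erroneous files linked.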

Using a standard approach, we should then be able to script some of the steps required to deal with such files, and also to handle the case where a data provider sends replacement data without informing us. I propose that the ingest step checks the archive for data that may already exist and takes appropriate action, e.g. moving the newly arrived data to a holding directory and informing the data scientist of the new data.
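
A minimal sketch of the proposed ingest check follows. It is not the existing ingest software: the function `ingest_file`, the holding directory and the `notify` callback are all hypothetical, and a real check would probably also need to look in the erroneous_files area.

```python
import shutil
from pathlib import Path
from typing import Callable


def ingest_file(incoming: Path, archive_dir: Path, holding_dir: Path,
                notify: Callable[[str], None] = print) -> Path:
    """Ingest a file unless one with the same name is already archived.

    If it is, divert the new arrival to holding_dir and notify the
    data scientist instead of overwriting the archived copy.
    """
    dest = archive_dir / incoming.name
    if dest.exists():
        holding_dir.mkdir(parents=True, exist_ok=True)
        held = holding_dir / incoming.name
        shutil.move(str(incoming), str(held))
        notify(f"{incoming.name} already exists in {archive_dir}; "
               f"new copy held at {held}")
        return held
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(incoming), str(dest))
    return dest
```

Passing the notification in as a callback keeps the sketch independent of how the data scientist is actually informed (email, ticket, log entry).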