Filling in Missing Datasets for data.ceda.ac.uk (and trying to scrape internal metadata, e.g. parameters)

Background

The commands in this document index information about files and directories so that it can be displayed in data.ceda.ac.uk. As a side effect of these steps, the files are also submitted to the 'slow' queue for eventual in-depth scanning, which tries to pull out internal file metadata where possible.

dap.ceda.ac.uk - A live file listing of the archive (similar to an ls). It is used to serve files for download and to provide OPeNDAP services.

data.ceda.ac.uk - The browse interface for the archive, based on the Elasticsearch indices. There are two indices, but for the purposes of this system 'FBI' below refers to the whole indexing process.

Due to historical issues with the indexing system, some datasets may still not be fully indexed. To address this, a process works round the entire archive, storage spot by storage spot, checking the index against the file system, but a full circuit of the archive takes months. In the meantime, where parts of the archive are known to exist but are not visible in data.ceda.ac.uk, a forced scan can be initiated on targeted areas to give more immediate indexing and visibility.

Contents:

How long will things take?
- Check current queue size
How do the tools work?
Activating the correct environment
Choosing the right command
Commands
- fbi_directory_check
- fbi_rescan_dir
- fbi_q_check

How long will things take? A rough overview of the system.

Events processed by the deposit server (file deposit and removal, directory creation and removal) are passed on to the FBI exchange, provided they are made with the correct tools. The exchange then updates the Elasticsearch indices which provide data to data.ceda.ac.uk. (For this reason, direct interaction with the archive using ls, rm, mkdir etc. should be avoided.)

All messages are passed to both the fast and slow queues. The fast queue gets as much information as it can from the message without touching the filesystem. This should be enough for the item to display in data.ceda.ac.uk within a short time frame. The slow queue follows and adds richer metadata.
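To illustrate what "without touching the filesystem" can mean in practice, here is a minimal sketch that derives a few facts from a path using string operations only. The function name and the three fields are hypothetical; this document does not describe which fields the real FBI messages carry.

```shell
# Hypothetical helper, for illustration only: derive basic facts from an
# absolute path with string operations, i.e. without stat-ing or opening
# the file. The field names here are assumptions, not the FBI schema.
path_only_info() {
    path=$1
    name=${path##*/}              # final path component
    dir=${path%/*}                # everything before the final slash
    case $name in
        *.*) ext=${name##*.} ;;   # text after the last dot
        *)   ext= ;;              # no dot, so no extension
    esac
    printf 'name=%s\ndir=%s\next=%s\n' "$name" "$dir" "$ext"
}
```

Anything that needs the file itself (size, internal variables and so on) would have to wait for a later step, which is the role the slow queue plays in the description above.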

From the time you can see the files in the archive, you should expect to see your new files in data.ceda.ac.uk within 1 day. It will likely be much quicker than this (possibly immediate), but because the process is asynchronous, 1 day allows enough lag for any surges in ingest traffic to be processed.

The tools provided in this document feed into the FBI exchange and are used to manually modify the indices. 

All of these commands go through a queuing system, so how long they take depends on how many items are in each of the queues.

You can check the current queue size: https://archdash.ceda.ac.uk/current/es_queue

How do the tools work?

fbi_directory_check submits a list of directories to a local queue. A process (crawler) then takes a directory off the queue, checks the archive against the index and sends the difference to the FBI exchange.
Note: Because of this intermediate step, there can be a delay between submitting with fbi_directory_check and seeing things appear.
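The idea of "checking the archive against the index and sending the difference" can be sketched with two plain-text listings and the standard comm utility. This is only an illustration of the comparison, not the crawler's actual code; the function name and the "add"/"remove" labels are hypothetical.

```shell
# Illustrative only: compare a listing of the archive with a listing of
# the index and print the actions that would reconcile them.
# Both arguments are files containing one path per line.
listing_diff() {
    archive_list=$1
    index_list=$2
    a=$(mktemp) && i=$(mktemp) || return 1
    # comm needs sorted input; use the C locale for a stable ordering.
    LC_ALL=C sort -u "$archive_list" > "$a"
    LC_ALL=C sort -u "$index_list" > "$i"
    # In the archive but not in the index: needs adding.
    LC_ALL=C comm -23 "$a" "$i" | sed 's/^/add /'
    # In the index but not in the archive: needs removing.
    LC_ALL=C comm -13 "$a" "$i" | sed 's/^/remove /'
    rm -f "$a" "$i"
}
```

Paths present in both listings produce no output, which matches the note that this check decides only what to add or remove and does not look at content.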

fbi_rescan_dir treats the directory as if it were being scanned for the first time and bypasses the checks. This is useful if an item is present but its data is incorrect, or if a whole directory is not showing up and the checks can be skipped.

Activating the correct environment

  1. Log in to the ingest machine
  2. Activate the environment

    conda activate ingest_py3

  3. Run the command
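Before running anything, it can help to confirm that the environment really is active. The sketch below assumes the three tools are installed as commands on PATH inside ingest_py3; check_tools is a hypothetical helper, not part of the FBI tooling.

```shell
# Report any of the named commands that cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# After `conda activate ingest_py3`, one might run:
#   check_tools fbi_directory_check fbi_rescan_dir fbi_q_check
```

If any tool is reported as not found, the most likely cause is that the environment has not been activated in the current shell.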

Choosing the right command

In general, if you can see partially complete information, use fbi_directory_check. If large swathes are missing, fbi_rescan_dir will add them. If you wish to overwrite out-of-date file and directory metadata (MOLES catalogue labels, sizes, locations, variables), use fbi_rescan_dir. Look at the decision tree below to help you pick the right tool.
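The guidance in the paragraph above can be restated as a small lookup. The symptom labels (partial, missing, stale) and the function name are hypothetical shorthand introduced here for illustration.

```shell
# Map a symptom to the tool suggested in the text above.
# The symptom names are made up for this sketch.
choose_fbi_tool() {
    case $1 in
        partial) echo fbi_directory_check ;;  # some entries visible, some not
        missing) echo fbi_rescan_dir ;;       # large swathes absent
        stale)   echo fbi_rescan_dir ;;       # metadata out of date or wrong
        *)       echo "usage: choose_fbi_tool partial|missing|stale" >&2
                 return 2 ;;
    esac
}
```

For example, `choose_fbi_tool stale` prints fbi_rescan_dir, since only a rescan overwrites what is already in the index.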

fbi_directory_check

If you wish to overwrite the metadata (e.g. the MOLES catalogue information is incorrect), use fbi_rescan_dir.

Submit directories to be checked for consistency between the archive and the indices. This command does not check the content; it only decides whether something should be added to or removed from the index, based on the files present. This command submits items to a local queue for checking. If you believe the content to be incorrect, you should use fbi_rescan_dir.

Note: There can be a delay between submitting with fbi_directory_check and seeing things appear because of the checking step, especially if the number of submitted directories is high. You can check how many items are in this queue by running fbi_q_check and looking at the user-submitted queue count.

Usage:

fbi_directory_check (--dir <dir> | --file <file>) [-r] --conf <conf>

Examples:

fbi_directory_check --conf ~/software/fbi_directory_check/fbi-directory-check/fbi_directory_check/conf/index_updater.ini --dir /neodc/esacci/cloud/data/phase-2/L3C -r

fbi_directory_check --conf ~/software/fbi_directory_check/fbi-directory-check/fbi_directory_check/conf/index_updater.ini --file list_of_directories.txt
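The format of the file passed to --file is not spelled out here. Assuming it is one directory path per line (an assumption worth checking against the tool's own help), a list such as list_of_directories.txt could be built like this. list_subdirs is a hypothetical helper, and -mindepth/-maxdepth rely on GNU or BSD find.

```shell
# Print the immediate subdirectories of the given directory, one path
# per line, in a stable sorted order. Files are ignored.
list_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort
}

# Example (path taken from the examples in this document):
#   list_subdirs /neodc/esacci/cloud/data/phase-2 > list_of_directories.txt
```

Reviewing the generated file before submitting it is a cheap way to avoid queuing more directories than intended.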

Options:

Option    Description
-r        Will search all directories recursively
--dir     Accepts a directory path
--file    Accepts a file containing a list of directories
--conf    Path to configuration file

fbi_rescan_dir

Rescan the given directory. This will overwrite the content in the indices for this directory - useful if you know the metadata is out of date/incorrect.

Usage:
fbi_rescan_dir <dir> [-r] [--no-files] [--no-dirs] --conf <conf>

Examples:

Add all the directories in the given directory:

fbi_rescan_dir --conf ~/software/fbi_directory_check/fbi-directory-check/fbi_directory_check/conf/index_updater.ini /neodc/esacci/cloud/data/phase-2/L3C --no-files

Add all the files below the given directory:

fbi_rescan_dir --conf ~/software/fbi_directory_check/fbi-directory-check/fbi_directory_check/conf/index_updater.ini /neodc/esacci/cloud/data/phase-2/L3C -r --no-dirs

Options:

Option       Description
-r           Will search all directories recursively
--no-files   Will exclude files from the results and only change directories
--no-dirs    Will exclude directories from the results and only change files
--conf       Path to configuration file
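Every example in this document repeats the same long --conf path. One way to cut the repetition is a small wrapper, sketched below. FBI_CONF, FBI_DRY_RUN and fbi_with_conf are hypothetical names, and the default path is simply the one used in the examples, which may differ on your account.

```shell
# Default taken from the examples in this document; override by setting
# FBI_CONF before sourcing this snippet.
FBI_CONF=${FBI_CONF:-$HOME/software/fbi_directory_check/fbi-directory-check/fbi_directory_check/conf/index_updater.ini}

# Hypothetical wrapper: run the named tool with --conf filled in.
# Set FBI_DRY_RUN=1 to print the command instead of running it.
fbi_with_conf() {
    tool=$1
    shift
    if [ "${FBI_DRY_RUN:-0}" = 1 ]; then
        echo "$tool" --conf "$FBI_CONF" "$@"
    else
        "$tool" --conf "$FBI_CONF" "$@"
    fi
}
```

With this in place, `fbi_with_conf fbi_rescan_dir /neodc/esacci/cloud/data/phase-2/L3C -r --no-dirs` is equivalent to the second rescan example, and the dry-run mode gives a chance to check the arguments before anything is queued.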

fbi_q_check

Display the current number of directories in the user-submitted and bot queues. Directories in these queues are checked for differences between the index and the archive, and the resulting actions are then submitted to update the indices.
The user-submitted queue is built from items submitted through fbi_directory_check and is given priority over the bot queue.

Usage:
fbi_q_check
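To watch a queue drain after a large submission, the command can be re-run at intervals. The sketch below is a generic repeat loop and makes no assumption about the output format of fbi_q_check; repeat_cmd is a hypothetical helper.

```shell
# Run a command a fixed number of times, printing a timestamp before
# each run and pausing between runs.
# Usage: repeat_cmd COUNT PAUSE_SECONDS COMMAND [ARGS...]
repeat_cmd() {
    count=$1
    pause=$2
    shift 2
    n=0
    while [ "$n" -lt "$count" ]; do
        date '+%Y-%m-%d %H:%M:%S'
        "$@"
        n=$((n + 1))
        # No pause is needed after the final run.
        if [ "$n" -lt "$count" ]; then sleep "$pause"; fi
    done
}

# Example: check the queues every 10 minutes for an hour.
#   repeat_cmd 6 600 fbi_q_check
```

A long pause is usually enough, given that the text above describes processing times in hours rather than seconds.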
