Ingest Servers

This page outlines the Ingest Servers and the ingest processes to be used.

Presently the ingest servers are: ingest1.ceda.ac.uk (on JASMIN/CEMS)

Data being PUSHED to CEDA

As detailed on opman/arrivals, all data being pushed to CEDA arrives on a dedicated Arrivals machine. On that server the data is put into dataset subfolders, which appear under the top-level /datacentre/arrivals folder on the ingest servers.

Data being PULLED to CEDA

When CEDA staff update a dataset by pulling data into CEDA, or by copying it onto the system from other media (e.g. an external hard disk), the data should initially be placed within the dataset's subfolder of /datacentre/processing.

Under /datacentre/processing3 the dataset directory name should match either the name used within the archive itself or the stream name, e.g. use ukmo-rad instead of radios for UK Met Office global radiosonde data. Below the /datacentre/processing3/<dataset dir>/ directory the data scientist may structure the contents as desired.

Add a 00INFO.txt file to the processing directory so that people know the owner and purpose of the directory.
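
The layout of 00INFO.txt is not specified on this page. As a minimal sketch, assuming that a simple "Field: value" layout is acceptable, the file could be written from Python as follows. The write_info_file helper and the three field names are hypothetical and are not an agreed CEDA convention.

```python
import datetime
import getpass
import os


def write_info_file(processing_dir, purpose, owner=None):
    """Write a 00INFO.txt file recording the owner and purpose of a directory.

    The Owner/Created/Purpose fields are an assumed layout, not a standard.
    Returns the path of the file that was written.
    """
    if owner is None:
        # Default to the account running the script.
        owner = getpass.getuser()
    info_path = os.path.join(processing_dir, "00INFO.txt")
    lines = [
        "Owner: " + owner,
        "Created: " + datetime.date.today().isoformat(),
        "Purpose: " + purpose,
    ]
    with open(info_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return info_path
```

For example, write_info_file("/datacentre/processing3/ukmo-rad", "Staging area for radiosonde files") would record the calling user as the owner.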

Scheduling ingestion processes

A command line utility, ingest_control, is provided to control all ingest scripts. It is the only method for scheduling ingest jobs, which enables us to maintain a complete list of ingests.

General ingest software

Many ingest operations are common across ingest streams.

fileProcessor.py

If renaming, zipping and/or tarring operations need to happen during the ingest process then the fileProcessor.py script may be of use. It was written with the new ingest server and the /processingCache setup in mind, but could be applied to file processing operations elsewhere.

For details of how to use this script see these pages

fromdeliveries.py

Deprecated: still available on glacial. For ingesting files into the archive use the fromdeliveries.py script. For more details see these pages.

On ingest1 a new deposit service is in place to separate archive deposit actions from ingest processing actions. This ensures consistency of the archive and carries out some simple QC checks.

There are 2 ways to use this service: via a client library or via a command line tool.

The deposit client

Data should be copied to the archive via the deposit server, which maintains the correct permissions and ownership and performs logging. To use this service you can use a command line tool or import a deposit client library into an ingest script.

Use these lines to import the client and make an instance to use.

from deposit_client import DepositClient
D = DepositClient()

Use the deposit client like this:

D.deposit(rpath, readmepath)           # add a file to the archive
D.rmdir(d)                             # remove a directory in the archive
D.remove(f)                            # remove a file in the archive
D.deposit(src, dst, force=True)        # make a deposit with forced overwrite in the archive
D.symlink(linkto, linkname)            # make a symlink in the archive
D.makedirs(d)                          # recursively make directories
D.recursive_deposit(src, dst)          # recursive deposit from src to dst
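
As an illustration of how these calls might be combined in an ingest script, the sketch below deposits every regular file in a local directory into a single archive directory. The deposit_directory helper is hypothetical. Only the makedirs and deposit calls come from the list above, and the sketch assumes that dst is the full target path of the file and that makedirs accepts a directory that already exists, neither of which is confirmed on this page. The client is passed in as an argument so that the function does not depend on how DepositClient is constructed.

```python
import os


def deposit_directory(client, src_dir, archive_dir, force=False):
    """Deposit each regular file in src_dir into archive_dir.

    client is expected to behave like a DepositClient instance.
    Returns the list of archive paths that were deposited.
    """
    # Assumption: makedirs is safe to call on an existing directory.
    client.makedirs(archive_dir)
    deposited = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue  # subdirectories are skipped in this sketch
        dst = archive_dir.rstrip("/") + "/" + name
        if force:
            client.deposit(src, dst, force=True)
        else:
            client.deposit(src, dst)
        deposited.append(dst)
    return deposited
```

Passing the client in also makes it straightforward to exercise the function with a stand-in object before running it against the real deposit server.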

Command line tool

(venv27)[badc@ingest1 ~]$ cedaarchive -h
CEDA Archive deposit command line script for interactive interaction with the archive.
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] mkdir <dir>
           <dir> = directory to make in the archive
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] makedirs <dir>
           <dir> = directory to make in the archive (including all parent directories)
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] rmdir <dir>
           <dir> = directory to remove from the archive
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] remove <file>
           <file> = file or symlink to remove from the archive
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] deposit <src> <dst>
           <src> = file to deposit
           <dst> = target destination in the archive for the file
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] rdeposit <src> <dst>
           <src> = directory to deposit recursively
           <dst> = target destination in the archive for the directory
usage: /usr/local/ingest_software/venv27/bin/cedaarchive [options] symlink <linkto> <linkname>
           <linkto> = path in the archive to link to
           <linkname> = link name in the archive
-v verbose
-h this help
-t test
-f force overwiting of files that exist in archive
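
If the command line tool needs to be driven from a script, one option is to assemble the argument list in Python and hand it to subprocess. The sketch below is illustrative only: the build_deposit_command and run_deposit helpers are hypothetical, the flags are placed before the subcommand as shown in the usage text above, and it assumes that -t performs a trial run without changing the archive, which the help text does not spell out.

```python
import subprocess

# Path taken from the usage text printed by cedaarchive -h.
CEDAARCHIVE = "/usr/local/ingest_software/venv27/bin/cedaarchive"


def build_deposit_command(src, dst, test=True, force=False, verbose=False):
    """Return the argument list for a single-file cedaarchive deposit."""
    cmd = [CEDAARCHIVE]
    if verbose:
        cmd.append("-v")
    if test:
        cmd.append("-t")  # assumed to be a trial run
    if force:
        cmd.append("-f")
    cmd.extend(["deposit", src, dst])
    return cmd


def run_deposit(src, dst, **options):
    """Run the deposit and return the exit status of cedaarchive."""
    return subprocess.call(build_deposit_command(src, dst, **options))
```

Defaulting test to True is a deliberate choice in this sketch, so that a real deposit has to be requested explicitly with test=False.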

Arrivals deleter

Arrived files can be read on machines in group @badc_ingest (currently, glacial) as /arrivals is mounted there (as well as /.arrivals and /disks/tropical/* so that the symlinks work as expected). However, the directories are quite deliberately exported read-only. This is to ensure that if the ingest machines do any processing of the files before ingest, they do not store it back on the arrivals machine, but should do so locally instead. This means that there needs to be a way for ingest processes to delete files which are finished with. This is provided by means of a python module which connects to the file operations server.

The arrivals deleter lives here in subversion, and is installed at /home/badc/software/infrastructure/arrivals-deleter/ .

To use it:

from Python:

import sys
sys.path.append("/home/badc/software/infrastructure/arrivals-deleter/")
import arrivalsDeleter
d = arrivalsDeleter.ArrivalsDeleter()
d.delete(filename1)
d.delete(filename2)
...
d.close()

from the command line

(CAUTION: run this as user "badc", otherwise the file /home/badc/.fileOpsLogin won't be read and the program exits)

/home/badc/software/infrastructure/arrivals-deleter/arrivalsDeleter.py filename1 [filename2 ...]

(If you specify multiple filenames, these are all deleted in a single session, which is slightly more efficient.)

Note that the module uses the file /home/badc/.fileOpsLogin which contains login details for the file operations server. This is only readable by user badc . If you want to run the deleter from another username, then make an alternative copy of this file (similarly protected please), and override the default filename by using:

from python, an argument to the constructor: d = arrivalsDeleter.ArrivalsDeleter(login_file_name)

from the command line, the flag -l login_file_name

Note also that the command line interface supports the flag " -i " to make it prompt interactively for each file.

The arrivals deleter has now been extended to include a rename function. This now means that files on the arrivals area on tropical can be moved to different directories, provided that the destination directory has been created in advance. The syntax is:

import sys
sys.path.append("/home/badc/software/infrastructure/arrivals-deleter/")
import arrivalsDeleter
d = arrivalsDeleter.ArrivalsDeleter()
d.rename(oldfilename,newfilename)
...
d.close()
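
In an ingest script, the delete and close calls described in this section could be wrapped so that the session is always closed, even when one deletion fails. The sketch below is a suggestion only: the delete_arrived_files helper is hypothetical, and it assumes that delete signals a failure by raising an exception, which this page does not confirm. The deleter object is passed in, so it can be built with or without a login_file_name argument.

```python
def delete_arrived_files(deleter, filenames):
    """Delete each file through an ArrivalsDeleter-like object, then close it.

    Returns a list of (filename, exception) pairs for deletions that failed.
    """
    failures = []
    try:
        for filename in filenames:
            try:
                deleter.delete(filename)
            except Exception as err:
                # Record the failure and carry on with the remaining files.
                failures.append((filename, err))
    finally:
        # Always end the session with the file operations server.
        deleter.close()
    return failures
```

All files are handled within one session, in line with the note that deleting several files in a single session is slightly more efficient.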