unpacker.py
Introduction
A generic script that can be used to unpack an incoming archive ball (one with various levels of tarring, gzipping and/or zipping).
It supports:
- unpacking of incoming archive balls matching a regex (extractRegex)
- extraction of selected items that match a given regex (regex)
- optional handling of other files found in the incoming archive balls that do not match the regex
- an optional quarantine check and migration to another area, e.g. an ingestion area or an area where further processing can happen
- setting up an arrivals area if one does not already exist
- use of the arrivals library to determine incoming files etc. from multiple sources
- stand-alone operation, so runs can be scheduled and daisy-chained together as an ingestion stream with fileProcessor and ingester, plus other actions
Where source is stored
Source code is stored in the CEDA svn repository, presently just here: http://proj.badc.rl.ac.uk/badc/browser/ceda_software/unpacker/trunk
Limitations
At present the unpacker has the following limitations:
- You can't specify the unpack depth
- On unpacking it will ignore any internal directory structure within the tar ball
- It doesn't cope with bzip - only tar, gzip and zip
- You can't tell it to stop extracting at a matching gzip, tar or zip file within the archive ball; it will unpack past this. This can be a pain if you want to do something like: unpack to a gzip file, rename that gzip file with fileProcessor and then ingest. Instead you'll need to add the zipping step into fileProcessor.
- It is presently hardwired to process in batches of 1000 unpacked files
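The limitations on depth and directory structure are easier to picture with a small example. The following is a minimal sketch of that style of unpacking, not the unpacker's actual code: the function name `unpack_fully` and its arguments are hypothetical, and it is written for Python 3 using only the standard library. It peels off every tar, gzip and zip layer it finds, with no depth limit, and writes every file into a single flat output directory.

```python
import gzip
import os
import shutil
import tarfile
import zipfile


def unpack_fully(path, out_dir, _top=True):
    """Unpack every tar, gzip and zip layer of `path` into `out_dir`.

    Returns the paths of the leaf (non-archive) files. Directory structure
    inside the archives is discarded and there is no depth limit.
    (Hypothetical helper for illustration only.)
    """
    os.makedirs(out_dir, exist_ok=True)
    extracted = []  # files produced by peeling one layer off `path`

    if tarfile.is_tarfile(path):  # also true for gzip-compressed tar files
        with tarfile.open(path) as tar:
            for member in tar.getmembers():
                if member.isfile():
                    # basename() is what flattens the directory structure
                    target = os.path.join(out_dir, os.path.basename(member.name))
                    with tar.extractfile(member) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    extracted.append(target)
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    target = os.path.join(out_dir, os.path.basename(info.filename))
                    with zf.open(info) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    extracted.append(target)
    elif path.endswith(".gz"):
        target = os.path.join(out_dir, os.path.basename(path)[:-3])
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    else:
        return [path]  # not an archive: this is a leaf file

    if not _top:
        os.remove(path)  # drop intermediate archives, keep the original ball

    leaves = []
    for item in extracted:
        # Recurse unconditionally: nested archives are always unpacked further.
        leaves.extend(unpack_fully(item, out_dir, _top=False))
    return leaves
```

Because the recursion is unconditional, a gzip file nested inside a tar ball is always expanded, which is the behaviour described in the limitation above. Files with the same name in different internal directories would overwrite each other in this sketch.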
Files needed
This only requires a configuration file with an appropriate entry for the "stream" to be processed.
Config options
[stream-name]
owner: <insert your username here - this is important to help those looking after the system work out who is running jobs>
description: <a short description detailing the job and what it does>
# standard bits for most config files:
script: <command line call for the job, including optional entries>
lockfile: <path and name of a lockfile - standard practice is to pop these under /home/badc/lockfiles/>
when: <scheduler times in standard crontab format, e.g. minute hour day-of-month month day-of-week, at which to run the script. Used to schedule recurring tasks under ingest_control>
timeout: <number of hours the script is permitted to continue running for before being terminated - the default is 12>
notify_ok: <space separated list of email addresses to email if the job runs ok>
notify_warning: <space separated list of email addresses to email if there are warning messages issued>
notify_fail: <space separated list of email addresses to email if the job fails>
# end of scheduler details
arrivals_users: <either give a space separated list of the users who will contribute to this data stream>
arrivals_dirs: <OR a space separated list of absolute paths to the source directories for the incoming data>
arrivals_wait: <how old the files should be in seconds before being considered for ingestion>
fileAge: <how old the files should be in days before being considered for ingestion - note, will be retired in due course>
extractRegex: <a regex used to identify the incoming files to ingest and then to supply parameters for the dirtemplate/headerclass library call>
regex: <regular expression to uniquely identify files within a tarball (which would also be decompressed if zipped) to prepare a list for subsequent operations. This should be the first in the operations order if required.>
extraFiles: <option for handling extra files in the incoming archive ball not matched by the regex. Options are: split - unpacks these to a date-timed directory within the quarantine area and allows optional removal of the source archive ball; keep - keeps the original archive ball (overrides the deleterChoice setting to "keep"); ignore - doesn't do anything with these extra files and follows the deleterChoice. Finally, the argument can be followed by ",notify" to optionally report this back as a warning message>
deleterChoice: <one of arrivals|notArrivals to delete data from /datacentre/arrivals/users/ or /datacentre/processing; otherwise the files will be kept>
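As an illustration of how a stream entry of this shape can be read, here is a minimal sketch using Python 3's standard `configparser`. It is not taken from the unpacker itself: the helper `load_stream`, the `[example-unpacker]` section and its paths are hypothetical. The only value drawn from the description above is the 12 hour default for `timeout`.

```python
import configparser

# Hypothetical stream entry, for illustration only.
SAMPLE_CONFIG = r"""
[example-unpacker]
owner: someuser
arrivals_dirs: /tmp/arrivals/a/ /tmp/arrivals/b/
extractRegex: (.*/)?ball_(?P<year>[0-9]{4})(\.tar\.gz)$
extraFiles: split,notify
"""


def load_stream(config_text, stream_name):
    """Return selected options for one stream as a plain dict."""
    # interpolation=None stops a '%' in a regex being treated as a reference.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the mixed-case option names as written
    parser.read_string(config_text)
    section = parser[stream_name]
    return {
        "owner": section["owner"],
        # space separated list of absolute paths
        "arrivals_dirs": section.get("arrivals_dirs", "").split(),
        "extractRegex": section["extractRegex"],
        "extraFiles": section.get("extraFiles"),
        # documented default of 12 hours when no timeout is given
        "timeout": section.getint("timeout", fallback=12),
    }
```

Calling `load_stream(SAMPLE_CONFIG, "example-unpacker")` returns a dict in which `arrivals_dirs` is a list of two paths and `timeout` falls back to 12.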
For example:
[cfarr-lidar-ct75k-unpacker]
owner:gparton
description: unpacks ct75k archive balls ready for ingestion
script: python /usr/local/ingest_software/unpacker/unpacker.py -c /home/badc/software/datasets/chilbolton/chilbolton_fileProcessor.cfg -s cfarr-lidar-ct75k-unpacker
arrivals_dirs: /datacentre/arrivals/users/jagnew/cfarr-lidar/
quarantineDir: /datacentre/processing/chilbolton/quarantine/
ingestDir: /datacentre/processing/chilbolton/readyToIngest/
lockfile: /home/badc/lockfiles/chilbolton-lidar-ct75k-unpacker.lock
extractRegex: (.*/)?ct75k_(?P<year>[0-9]{4})(?P<month>[0-9]{2})(\.tar\.gz)$
regex: (.*/)?cfarr-lidar-ct75k_chilbolton_(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2}).(?P<product>png|nc)$
extraFiles: split,notify
notify_warning: graham.parton@stfc.ac.uk
fileAge:0
quaratinePeriod:0
quarantineCheck: fileAge
ingestRegex: cfarr-lidar-ct75k_chilbolton_(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2})(\.)(?P<product>nc|png)$
order: extract,moveToIngest
deleteOption: arrivals
mode: operational
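To check what the extractRegex and regex patterns in this example accept, they can be tried against sample names with Python's `re` module. The sketch below copies the two patterns from the example. The file names and the helper `fields` are made up for illustration, and the use of `match` (anchored at the start of the string) is an assumption about how the unpacker applies its patterns.

```python
import re

# Patterns copied from the example stream entry.
EXTRACT_REGEX = re.compile(
    r"(.*/)?ct75k_(?P<year>[0-9]{4})(?P<month>[0-9]{2})(\.tar\.gz)$"
)
MEMBER_REGEX = re.compile(
    r"(.*/)?cfarr-lidar-ct75k_chilbolton_"
    r"(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2}).(?P<product>png|nc)$"
)


def fields(pattern, name):
    """Return the named groups if `name` matches `pattern`, else None."""
    found = pattern.match(name)
    return found.groupdict() if found else None
```

With these definitions, `fields(EXTRACT_REGEX, ".../ct75k_201203.tar.gz")` yields the year and month of the archive ball, while a member such as `README.txt` returns `None` from `MEMBER_REGEX` and would be treated as an extra file. Note that the `.` before the product group in `regex` is unescaped, so it matches any single character, whereas ingestRegex uses `(\.)`.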
Additional, un-desired files
If the unpacker spots other files that have not matched the regex, it will follow the option set in the "extraFiles" field of the configuration settings. Options are:
- ignore - don't worry about their existence and continue to handle the source balls as normal under the deleterOption setting
- keep - keep the incoming source balls and prevent these from being deleted, i.e. override the deleterOption to "keep"
- split - splits out the files into a date-timed sub-directory within the quarantine area. This maintains the directory structure that the extra files were found under within the archive ball, but the files will have been unpacked too, thus losing any compression etc.
There is also the option of raising a notification about this issue with the "notify" option. This is strongly recommended for the keep and split options, and also recommended for the ignore option, in order to ensure that nothing nasty is happening and that you're not missing files due to an issue with your regex setting.
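The shape of an extraFiles value such as `split,notify` can be sketched as a small parser. This is an illustrative assumption about how the value might be interpreted, not the unpacker's own code; `parse_extra_files` and `VALID_ACTIONS` are hypothetical names.

```python
VALID_ACTIONS = ("split", "keep", "ignore")


def parse_extra_files(value):
    """Split an extraFiles setting into (action, notify).

    `action` is one of split, keep or ignore; `notify` is True when the
    setting is followed by ",notify".
    """
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if not parts or parts[0] not in VALID_ACTIONS:
        raise ValueError("extraFiles must start with one of %s" % (VALID_ACTIONS,))
    modifiers = parts[1:]
    unknown = [m for m in modifiers if m != "notify"]
    if unknown:
        raise ValueError("unrecognised extraFiles modifier(s): %s" % unknown)
    return parts[0], "notify" in modifiers
```

For example, `parse_extra_files("split,notify")` gives `("split", True)` and `parse_extra_files("ignore")` gives `("ignore", False)`. Rejecting unknown values early helps catch a mistyped setting before any archive ball is touched.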