opman/DataPushTool – CEDA

Original trac page

Description

The Data Push Tool is a Perl script for pushing data from remote sites to the BADC efficiently and robustly. Features include:

  • Support for ftp or bbftp transfer methods
  • Concurrent transfers
  • Checksumming of files after transfer, with:
    • Automatic retries on bad checksum
    • Email notification on repeated bad checksum
  • Logging

It has two basic modes of operation:

  • One-off mode: transfers a specified set of files (read from standard input) and then exits.
  • Daemon mode: keeps running, and watches a directory for ASCII files that contain lists of data files to transfer.

The main limitation at present is that it opens a separate ftp or bbftp session for each file to be transferred; it is therefore best suited to situations where the startup overhead of a session is small compared to the data transfer itself.

For each file to be transferred, it will launch a process to do the following:

  • Compute a local checksum.
  • Transfer the file to a temporary upload directory using ftp or bbftp.
  • Connect to the checksum service (CGI script) to obtain the checksum.
  • If successful (checksums match), connect to the related CGI script, which moves the file to another directory at the BADC end (effectively tagging the file as ready for ingest), and then delete the original local file.
  • If unsuccessful, retry up to some maximum number of times.
  • If still unsuccessful, send an email (up to some maximum number of emails per instance of the script).
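
The per-file steps above can be sketched as follows. This is a minimal illustration in Python, not the tool's own Perl code: the callables `transfer`, `remote_checksum`, `mark_ready`, `delete_local` and `notify` are hypothetical stand-ins for the ftp/bbftp session, the two CGI calls, the local delete and the email, and the use of MD5 is an assumption (the checksum algorithm is not stated here).

```python
import hashlib


def local_checksum(path):
    """Hex digest of a local file, read in chunks (MD5 is an assumption)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def push_file(path, transfer, remote_checksum, mark_ready, delete_local,
              notify, max_attempts=3):
    """Transfer one file, verify it, and retry on a bad checksum.

    All callables are hypothetical placeholders supplied by the caller.
    Returns True on success, False once max_attempts is exhausted.
    """
    expected = local_checksum(path)
    for _attempt in range(max_attempts):
        transfer(path)                      # upload to the temporary directory
        if remote_checksum(path) == expected:
            mark_ready(path)                # server moves file to "ready" area
            delete_local(path)              # remove the local original
            return True
    notify("transfer of %s failed after %d attempts" % (path, max_attempts))
    return False
```

The sketch leaves out the per-instance cap on emails; a real implementation would count notifications and stop sending once the configured maximum is reached.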

A number of the above processes will be launched in parallel, up to some maximum.
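
As a rough illustration of the "up to some maximum" limit, the sketch below runs a per-file function over a list of paths with a bounded number of workers. It uses Python threads as a stand-in for the separate processes the tool launches; `push_one` and `max_parallel` are hypothetical names, not settings taken from globals.pm.

```python
from concurrent.futures import ThreadPoolExecutor


def push_all(paths, push_one, max_parallel=4):
    """Run push_one(path) for every path, at most max_parallel at a time.

    push_one is a hypothetical callable returning True or False per file.
    Returns a dict mapping each path to its result.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(push_one, paths))
    return dict(zip(paths, results))
```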

Note that for I/O performance reasons, the tool uses a lockfile mechanism to ensure that files are checksummed sequentially at both ends, despite concurrent transfers. (This is separate from the similar mechanism imposed by the checksum service at the server end, and so in practice the latter is redundant where this tool is the client.)
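
One way to serialise checksumming across concurrent workers is an exclusive lock on a shared lock file, sketched below. This is an assumption about the general technique, not a description of how the tool implements its lockfile: it uses `fcntl.flock` (Unix only), and `lock_path` is a hypothetical location chosen by the caller.

```python
import fcntl
import hashlib


def checksum_serialised(path, lock_path):
    """Checksum a file while holding an exclusive lock on lock_path.

    Workers that share the same lock_path take turns, so only one of
    them reads from disk for checksumming at any moment.
    """
    with open(lock_path, "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)    # blocks until the lock is free
        try:
            digest = hashlib.md5()          # MD5 is an assumption
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            return digest.hexdigest()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Transfers themselves stay concurrent under this scheme; only the disk-heavy checksum step queues behind the lock.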

Installation

The data push tool is held in the BADC Subversion repository. Start by checking it out and copying it to the remote site. For example:

  • SROOT=svn+ssh://glue.badc.rl.ac.uk/svn/badc
  • svn co $SROOT/data-push-tool/trunk
  • tar cvfz data-push-tool.tar.gz --exclude=.svn trunk
  • copy tarball to remote site
  • unpack tarball in a scratch directory

To install, then:

  • set the environment variables:
    • $EMAIL (email address to receive notifications from the script)
    • $BASEDIR (desired base directory of installation; should differ from where you unpacked the tarball)
    If you do not set these, the installation will prompt you for the email address and/or default to installing into $HOME/data-push-tool.
  • run the Install script
  • customise the installation by editing the variables in the file script/lib/globals.pm of the installation, according to the description in the comment lines
  • ensure that permissions are set correctly; in particular, if the tool itself and the processes which provide input file lists for the tool running in daemon mode are to run under different user IDs, then you should ensure that they all have write permission on the relevant directories under log/ and run/.

Running the data push tool

  • First check the data file permissions. Ensure that the tool runs with permission to read the files and to delete them (i.e. write permission on the directory). Provided that these conditions are met, the tool does not need to run with ownership of the files. If bbftp is to be used, also ensure that the files to be transferred are group-readable: bbftp preserves file permissions, and the checksum service will fail if the files are not group-readable at the BADC end.
  • The main executable is called transfers.pl.
  • To run the tool in one-off mode:
    • Invoke the tool with a list of files on standard input. For example, filelist contains:
      /path/to/datafile1 
      /path/to/datafile2
      /path/to/datafile3
      
      and you do:
      /path/to/transfers.pl < filelist
      
  • To run the tool in daemon mode:
    • Start the daemon by:
      • either: invoking the tool with the -d flag.
      • or better: using the daemonctl script (in the same directory), which takes one of the following command line arguments: start, stop, restart, status, startmail. Most of these options are self-explanatory; startmail will start the daemon if not already running, and send an email if it does so; this is useful for cron jobs to ensure the daemon is running.
  • You then provide ASCII files containing lists of data files for transfer. The ASCII files can have arbitrary names, but should be put in the directory which you configured in globals.pm, the default being run/daemon/input_lists. These ASCII files must contain the line **END** at the end, so that the daemon knows when the ASCII file is completely written. For example, /path/to/data-push-tool/run/daemon/input_lists/my_arbitrary_file_name may contain:
    /path/to/datafile1 
    /path/to/datafile2
    /path/to/datafile3
    **END**
    
    The daemon will move these ASCII files to the output directory when it has read the list of files from them and added them to its in-memory queue for transfer.
  • The data push tool reports its progress by the following means:
    • The main log file is at log/transfers.log.<timestamp> or log/transfers-daemon.log.<timestamp> -- the timestamp is based on when the tool was launched
    • The directory log/ftplogs contains separate output files containing the output of each individual ftp session.
    • Serious errors will be communicated by email, e.g. if a file still fails to transfer after retries.
      • The maximum number of failed transfers to advise by email is configurable; if this number is exceeded, an email will be sent warning that further failures will not be notified by email, although they will continue to appear in the logs. The count of emails sent is per instance of the script, e.g. if running in daemon mode it is reset by restarting the daemon.
  • Signals:
    • "kill -USR1" the top-level process in order to reset the count of emails sent. This means that after the script has stopped sending emails because of hitting the maximum number, it is possible to make it again send emails in case of future errors, without having to restart the script.
    • "kill -USR2" the top-level process in order to make it write to a file in the log directory the lists of datafiles being transferred and awaiting transfer. This means for example that if you have to stop the script and launch a new instance, then it is easier to construct a list of files as input for the new instance to continue where the old one left off. Also if you "kill -TERM" the top-level script, it will produce these lists just before it terminates.
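
For sites that generate the daemon's input lists from another program, the list format described above (one data file path per line, then a final **END** line) could be produced along these lines. This is a sketch only: `write_input_list` and its arguments are hypothetical names, and the directory passed in should be whatever you configured in globals.pm.

```python
import os

END_MARKER = "**END**"  # the terminating line the daemon looks for


def write_input_list(input_dir, list_name, data_files):
    """Write a list file for the daemon and return its full path.

    The END marker is written last, so a list that is still being
    written never looks complete to the daemon.
    """
    target = os.path.join(input_dir, list_name)
    with open(target, "w") as handle:
        for path in data_files:
            handle.write(path + "\n")
        handle.write(END_MARKER + "\n")
    return target
```

For example, `write_input_list("/path/to/data-push-tool/run/daemon/input_lists", "my_arbitrary_file_name", ["/path/to/datafile1", "/path/to/datafile2"])` would queue two files, assuming the default input directory.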

Wishlist of future improvements

  • Warning on files not group readable if using bbftp
  • Option for emails on successful batch transfer (though maybe just use the arrival monitor instead).
  • Suspend and terminate signals that propagate to children.