June 12, 2011 | Lisa Childers

Part of my work for Globus Online User Services involves moving a great deal of data. Indeed, within the past seven months I have single-handedly moved over a petabyte of data using Globus Online. For any given transfer, verifying that the data at the destination matches the source is not always necessary, for example when executing a test run in preparation for a large transfer, or when trying to better understand the user interactions enabled by a new GO feature. In these cases the transferred data is a mere side effect.

Other transfers are all about the data: move files from A to B and make darn sure that every bit reaches its intended destination. In such cases I invariably use Globus Online's built-in data integrity checking, because the interface is so simple and integrates seamlessly with my workflow. If you care about data integrity, you should consider using it too!

You can find the bash script that I use for integrity checking here: http://www.mcs.anl.gov/~childers/go-dir-xfer.sh (note that to use this script you'll need to enable CLI access for your Globus Online account and substitute your GO account name for "username"). Here's my typical workflow:

1. activate my endpoints

2. execute a standalone GO transfer command on the source/dest directories via the Web or CLI

Then, if I want to check data integrity:

3. wait for the transfer in step #2 to finish

4. execute go-dir-xfer.sh on those same directories

go-dir-xfer.sh first transfers only those files that are newer at the source than at the destination (sync=2); this pass usually goes quickly because the comparison is date-based and the files were already transferred in step #2. Once the date-based pass is done, the script runs a checksum comparison (sync=3); this can take some time. All files that fail the checksum test are retransmitted and a new checksum test is initiated. If checksum mismatches are detected a second time, there is likely a critical infrastructure failure at one of the endpoints; in this case the script aborts with an error. If all checksums match, the script returns 0.
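To make the sequence of passes concrete, here is a minimal local sketch in Python. It is not the go-dir-xfer.sh script and does not call Globus Online at all: two local directories stand in for the source and destination endpoints, a plain file copy stands in for a transfer, and SHA-256 is an assumed checksum (the script's actual algorithm is not stated here). All function names are hypothetical.

```python
import hashlib
import os
import shutil


def relative_files(root):
    """Return every file under root as a sorted list of relative paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


def sha256_of(path):
    """Checksum one file (SHA-256 is an assumption made for this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_file(src_root, dst_root, rel):
    """Stand-in for a transfer: copy one file, preserving its timestamp."""
    target = os.path.join(dst_root, rel)
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    shutil.copy2(os.path.join(src_root, rel), target)


def newer_at_source(src_root, dst_root):
    """Date-based comparison: files missing at, or newer than, the destination."""
    stale = []
    for rel in relative_files(src_root):
        src = os.path.join(src_root, rel)
        dst = os.path.join(dst_root, rel)
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            stale.append(rel)
    return stale


def checksum_mismatches(src_root, dst_root):
    """Checksum-based comparison: files whose contents differ or are missing."""
    bad = []
    for rel in relative_files(src_root):
        src = os.path.join(src_root, rel)
        dst = os.path.join(dst_root, rel)
        if not os.path.exists(dst) or sha256_of(src) != sha256_of(dst):
            bad.append(rel)
    return bad


def verify_transfer(src_root, dst_root):
    """Return 0 if all checksums end up matching, 1 if a retry still fails."""
    # Pass 1: cheap date-based sync.
    for rel in newer_at_source(src_root, dst_root):
        copy_file(src_root, dst_root, rel)
    # Pass 2: checksum comparison; resend anything that fails.
    failed = checksum_mismatches(src_root, dst_root)
    for rel in failed:
        copy_file(src_root, dst_root, rel)
    # Pass 3: re-check only if something was resent; a second failure aborts.
    if failed and checksum_mismatches(src_root, dst_root):
        return 1
    return 0
```

Calling verify_transfer("/path/to/source", "/path/to/dest") on two local directories mirrors the flow described above: a quick date pass, a slower checksum pass, one retry, and a nonzero result if the retry still does not match.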

And there you have it!

Ensuring data integrity can add significant time to your transfer workflow, which is why I don't always do it. In my experience, errors that slip past the initial transfer are quite rare; the vast majority of problems are automatically detected and the files retransmitted by Globus Online in step #2. However, when data integrity is mission-critical, Globus Online's built-in integrity checking can provide the extra assurance you need.