Handling Science Data: As Always, Devil's in the Details

September 12, 2011   |  Stuart Martin

We've been operating the Globus Online file transfer service for 10 months now, and along the way we've learned a thing or two about handling data for scientists.

Yes, we designed GO's file transfer backend with performance, reliability and scalability in mind. And yes, we've written hundreds of automated tests to cover the various issues we've already run across -- but nothing is as humbling as dealing with reality on a daily basis.

Here are three real stories that took place during one week in August:

1) How many files in that directory?

Globus Online's file transfer backend was recently challenged by a large dataset from Sarah Kenny (a neuroscience researcher in the neurology department at the University of California, Irvine). Her dataset included 14 million files -- more than most, but no problem for GO in itself. However, one directory contained 415,484 files... and that did it. The backend timed out while waiting for the directory listing to return. We identified the problem and increased the timeout value. Sarah did not have to do anything, and her transfer request completed just fine.
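If you are curious whether your own dataset hides a directory like this, a quick local check is to count the direct entries in each directory. The following is a minimal Python sketch, not anything GO runs; the function names `entry_counts` and `largest_directories` are made up for illustration.

```python
import os


def entry_counts(root):
    """Map each directory under root to its number of direct entries."""
    counts = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Count only the immediate children of this directory.
        counts[dirpath] = len(dirnames) + len(filenames)
    return counts


def largest_directories(root, top=5):
    """Return up to `top` (directory, entry count) pairs, largest first."""
    counts = entry_counts(root)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Calling `largest_directories("/path/to/dataset")` on a tree of this size will take a while, since it has to visit every directory once.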

2) Is that file path for real?

Another transfer from Sarah Kenny pushed the limits on file path length. We'd set a limit of 1024, but that was not enough for some of her files. Some contained ~100 (I lost count) nested .../OrderA/OrderA/... directories in their paths! Fortunately, she did not need this directory, so she was able to remove it and get the files to where she needed them to go. We'll be increasing the path length limit soon.
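Runaway nesting like this is easy to spot before a transfer. Here is a minimal Python sketch that lists every path longer than a given limit. It assumes the limit is measured in bytes over the full path as passed in, which may not match how GO counts; `paths_over_limit` is a hypothetical helper name.

```python
import os


def paths_over_limit(root, limit=1024):
    """Yield (path, length) for each path under root longer than limit.

    Paths are handled as bytes, so the length is the on-disk byte
    length. Measuring in bytes is an assumption of this sketch.
    """
    root_bytes = os.fsencode(root)
    for dirpath, dirnames, filenames in os.walk(root_bytes):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if len(full) > limit:
                yield full, len(full)
```

Lowering `limit` is a handy way to try it on a small test tree first.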

3) File system encoding error?

Greg Daues (a research programmer at NCSA at the University of Illinois) reported this error from a transfer request: "Globus Online currently only supports UTF-8 encoded filesystems; please contact support@globusonline.org." In Unix, a filename is just a string of bytes. GO and other tools use a specific encoding to display those bytes as Unicode characters. UTF-8 is usually the default, and that is what GO currently supports. Bryce Allen (a GO developer) gave Greg instructions for identifying and eliminating any files with non-UTF-8 names. Greg followed the instructions, found and removed the files, and transferred his data.
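To illustrate the idea (this is not the set of instructions Bryce sent, just one possible approach), a minimal Python sketch can walk a tree using byte paths and report every name that does not decode as UTF-8. The helper names `is_valid_utf8` and `find_non_utf8_names` are made up for this example.

```python
import os


def is_valid_utf8(name):
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_names(root):
    """Yield byte paths under root whose file or directory name is not UTF-8."""
    # Passing bytes to os.walk makes it return raw byte names,
    # so nothing is decoded before we get to inspect it.
    root_bytes = os.fsencode(root)
    for dirpath, dirnames, filenames in os.walk(root_bytes):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

The results come back as bytes on purpose: printing them with `repr()` shows the offending byte values, which helps when deciding whether to rename or remove a file.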

Our goal at Globus Online is to enable scientists to "just move" their data, no matter what unusual methods were used to assemble, store or make that data available. As GO encounters more data scenarios, it will become more aware and capable. We look forward to the next challenge -- in an odd way, that is part of the fun!