February 22, 2018 | Lee Liming

More than just a convenient way to “fire and forget” your data transfers, Globus gives you and your colleagues the benefit of our experience managing wide-area data transfers for thousands of researchers, including code tweaks driven by what our customers need.

When we released the first Globus data transfer software in 2001, we wanted to change the data-intensive research experience for the better. Not surprisingly, we found that software alone wasn’t enough, though a few scientists achieved amazing results with it. But to make things work well, research teams needed a lot of expertise in arcane wide-area network techniques, and enlisting that expertise distracted from the very research we were trying to help.

Our aim with Globus SaaS, the browser-based interface and thin clients we now offer, was to supply the expertise that most research teams don’t have, and--let’s be honest--don’t want to have. It’s worked well: researchers at colleges and universities are now able to transfer, share, and publish their data without being--or enlisting--experts.

The benefits of today’s Globus are usually seen in terms of speedy performance and “fire and forget”-style reliability. But there’s another benefit that’s sometimes overlooked. We’re constantly improving things in Globus behind the scenes. And because Globus is provided as a service, everyone using it gets the benefits of those improvements immediately.

Let’s look at a recent example. ESnet, the Energy Sciences Network serving the U.S. National Laboratories, recently conducted a campaign to optimize and improve the performance of their network--and the data transfer endpoints at the labs--for use by cosmology and climate researchers. It’s an impressive success story, and their work greatly improved the research experience for scientists in both fields of study. But, like our early work on GridFTP, there was more that could be done.

As part of their work, ESnet told the Globus team about how climate data is organized. What they shared led us to realize that we could make their transfers work better with a modest change to the Globus transfer service. So in late October 2017, Globus added such a change aimed specifically at climate dataset transfers.

One of the many tricks Globus uses to speed up transfers is to automatically “batch” multiple files together and transfer them as if they were a single file. (This only works when someone requests a transfer that includes a lot of files.) Each individual file transfer requires some setup time, like “gear shifting” in a car. Automatic batching reduces the number of individual file transfers, thus reducing the time spent on gear shifting. Through mid-2017, Globus could batch up to 1,000 files into a single file transfer, depending on the sizes of the files. In late October 2017, we started batching up to 10,000 files once a “trial” batch of 1,000 files succeeded. Since climate datasets contain tens of thousands of files, this change eliminated a lot of gear shifting for those transfers. The effect of this change on climate dataset transfers can be seen in Figure 1.
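To get a feel for why the larger batch size matters, here is a rough back-of-the-envelope sketch in Python. It is not Globus code: the function name `count_batches` and the 50,000-file dataset are made up for illustration, and the model ignores file sizes entirely (the real service also takes file sizes into account). It simply counts how many “gear shifts” a transfer needs under the old 1,000-file limit versus the newer 10,000-file limit with a 1,000-file trial batch.

```python
def count_batches(num_files, max_batch, trial_batch=None):
    """Count the batches (and so the setup "gear shifts") for a transfer.

    Simplified, hypothetical model: every batch is filled to its limit,
    and file sizes are ignored.
    """
    if num_files <= 0:
        return 0
    batches = 0
    remaining = num_files
    if trial_batch is not None:
        # One smaller "trial" batch goes first; larger batches follow.
        remaining -= min(trial_batch, remaining)
        batches += 1
    # Ceiling division: a partially filled final batch still counts as one.
    batches += -(-remaining // max_batch)
    return batches


if __name__ == "__main__":
    files = 50_000  # hypothetical dataset with "tens of thousands" of files

    before = count_batches(files, max_batch=1_000)
    after = count_batches(files, max_batch=10_000, trial_batch=1_000)

    print(f"Batches with a 1,000-file limit: {before}")
    print(f"Batches with a 10,000-file limit after a trial batch: {after}")
```

In this toy model, the 50,000-file dataset goes from 50 batches down to 6, so the fixed setup cost is paid far less often. The actual savings depend on how long each setup takes and on the mix of file sizes, neither of which is modeled here.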


Figure 1. A typical climate dataset transfer (4.4 terabytes) from Oak Ridge National Laboratory (Oak Ridge, TN) to the National Energy Research Scientific Computing Center (Berkeley, CA). The blue line is from 2016; the green line is from after the Globus October 2017 change. (Time on the X axis is in seconds.) The periodic dips in performance on the blue line are due to “gear shifting” between individual file transfers. Note that one initial gear shift is still required before the change takes effect on the green line.

This small change in our service--less than ten lines of code--had a dramatically positive effect on the performance of climate data transfers among the research centers and campuses involved. ESnet’s Eli Dart reported, “We have seen significant performance improvements in our testing with complex, multi-Terabyte data sets with tens of thousands of files. We're now seeing performance levels of 20Gbps to over 50Gbps for production data transfers between DTN clusters at HPC facilities. This is a huge win for science collaborations in multiple fields.”

The datasets used in climate science and cosmology are extreme by today’s standards, though they’re already being used by researchers on campuses across the country. But we don’t believe datasets like these will be “extreme” forever. In the coming years, new research projects will generate and use new data, and some--maybe on your campus, maybe yours--will look like today’s climate and cosmology data. Now that we’ve seen what’s needed to transfer this kind of data well, and added the improvement to Globus to handle it, this optimization is available to everyone using Globus--now and in the future--without needing to be aware of what we did for extreme datasets on ESnet in 2017.

Our team will continue finding these sneaky optimizations, and not only for “extreme” datasets. (There’ve been many, and many more will come. In fact, we made more changes like this last month.) Researchers who use Globus to move, share, and publish their data will enjoy the benefits of these techniques without having to figure them out themselves. This “institutional memory,” brought to bear every time a researcher uses Globus to move or share their data, is arguably as valuable as the service’s performance and convenience.