August 18, 2011 | Ravi Madduri

When people hear about Globus Online, they probably think of simplified online file transfer using a GUI. While that is certainly true, what some people don’t realize is that, in addition to the nifty Web 2.0 interface, Globus Online also provides a powerful API that developers can use to integrate Globus Online into their own applications, bringing the transfer experience into their own environment.

This lets users interact with the service without leaving their familiar user interface, essentially adding to the capabilities of their existing system without having to learn or install anything new.

Below is an example of one such exercise, in which the Globus Online (GO) API provides reliable, large-file transfer capabilities to Galaxy, a popular next-gen sequencing analysis platform. Hopefully this example will be instructive for other communities, organizations, and developers who use web applications to perform data-intensive science.

What is Galaxy?

The Galaxy framework is a next-gen sequencing analysis platform that makes complex computational analyses accessible to experimental biologists through a Software-as-a-Service model. Workflows allow users to specify reusable multi-step analyses, with complex data flows and dependencies between steps specified in an intuitive way. The Galaxy framework provides a friendly user interface in which researchers can design custom analysis workflows for their specific needs. A finished workflow appears just like any other tool in Galaxy, and can easily be run or even composed into other, more complex workflows. Workflows can be shared with other Galaxy users, as well as exported in a portable format that can be moved between different Galaxy sites, or even run outside of Galaxy.

Problem

Next-gen sequencing is a data-intensive science. While Galaxy is very useful for people building sequencing pipelines, users often run into trouble when they need to transfer large amounts of data (on the order of gigabytes) into Galaxy instances for analysis, and to get data out of Galaxy once analysis is done. The current Galaxy implementation lets users transfer data in via HTTP or FTP, but both of these mechanisms become unreliable when moving the huge volumes of data that are common in next-gen sequencing. Users end up spending a lot of time babysitting transfers instead of thinking about the novel science that could be done.

How Globus Online and the REST API Helped

Globus Online provides a clean REST API, along with Python and Java clients, for submitting transfer tasks, monitoring them, managing file transfer endpoints, and listing remote directories. The API is described in more detail in the Globus Online documentation.
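To give a flavor of what a submission looks like at the REST level: the Transfer API accepts a JSON "transfer" document that names a submission ID, source and destination endpoints, and the items to move. The helper below is a minimal sketch that builds such a document; the field names follow the API's documented JSON shape, while the endpoint names and paths are illustrative placeholders, not real endpoints.

```python
# Sketch: build the JSON "transfer" document that the Globus Online
# Transfer API accepts. Endpoint names and paths here are illustrative.

def build_transfer_document(submission_id, source_endpoint,
                            destination_endpoint, items):
    """items is a list of (source_path, destination_path) pairs."""
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,   # obtained from the service
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
            }
            for src, dst in items
        ],
    }

# Example: move one sequencing run to a (hypothetical) Galaxy cluster endpoint.
doc = build_transfer_document(
    "0123-fake-submission-id",      # placeholder submission ID
    "sequencing#core",              # hypothetical source endpoint
    "galaxy#cluster",               # hypothetical destination endpoint
    [("/runs/run42/reads.fastq", "/import/reads.fastq")],
)
```

The resulting document would be POSTed to the transfer service; the Python client wraps this same structure so tool code never has to build the JSON by hand.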

For this project, we used the Python API to create a Galaxy transfer tool with a Galaxy web UI that lets users specify the location of the sequencing data to be analyzed. The destination of these transfers is typically a Galaxy cluster. When the user clicks ‘Submit’ on the GO Galaxy tool, which looks exactly like the Galaxy UI they are already familiar with, the transfer is submitted to Globus Online’s transfer service. We then used Globus Online’s Python API to monitor the transfers from within the Galaxy monitoring framework.
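The monitoring side boils down to a simple polling loop: ask the service for the task’s status until it reaches a terminal state. The sketch below abstracts the status fetch into a callable, so the same loop works whether the status comes from the REST API or the Python client. SUCCEEDED and FAILED are terminal Transfer API task statuses; the canned status sequence in the usage example is a stand-in for the remote service, not real output.

```python
import time

# Sketch of the polling logic for tracking a transfer task. In a real
# integration, fetch_status would query the Transfer API for the task's
# current status; here it is any callable returning a status string.

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}

def wait_for_task(fetch_status, poll_interval=0.0, max_polls=100):
    """Poll until the task reaches a terminal status; return that status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)
    raise RuntimeError("task did not finish within max_polls polls")

# Usage with a canned status sequence standing in for the remote service:
statuses = iter(["ACTIVE", "ACTIVE", "SUCCEEDED"])
result = wait_for_task(lambda: next(statuses))
```

Wrapping this loop in the Galaxy monitoring framework is what lets the transfer appear as an ordinary long-running Galaxy job, with Globus Online handling retries underneath.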

Users can now perform high-performance, reliable transfers of large quantities of data from within Galaxy without having to monitor and restart failed transfers. With Globus Connect, users of a Globus Online-enabled Galaxy can also reliably upload files from their own laptops and desktops.

Results and Applicability

There are two things we would like to emphasize here. First is the time it took to integrate Globus Online transfer with Galaxy: because of the simplicity of the API and the documentation provided, it took us less than a week to add this capability to the Galaxy framework. Second, with this enhancement, Galaxy users can now create scripted workflows that automatically retrieve input data from geographically distributed sources and reliably transfer results back to endpoints. Data transfer becomes one less thing for researchers to worry about, freeing them to focus on the science they really care about.