Globus CLI and Automating Research Data Handling
Geoffrey Lentner from Purdue University works with researchers to give them tools and best practices for working with data. His goal is to enable research teams (typically a PI and a group of grad students or postdocs) to develop methods that allow them to process scientific data on a routine basis, minimizing the tedious, manual tasks so they can focus on research questions and solutions.
The general pattern Geoffrey sees is that data periodically “arrives” at the research team via some means (e.g., an instrument, a simulation run, a collaboration) and the team needs to process the data to see what it tells them. It’s usually the senior grad student in the team whose job it is to “push the buttons” that will get the data processed. The goal is to minimize the effort needed to make that happen.
“Pushing the buttons” usually involves moving the data to a “landing zone” for the team, then getting it to the analysis system for processing, possibly putting it into a special high-performance storage system for analysis, unstaging it from the analysis system, then packaging it up and getting it into the tape archive. Purdue has a campus Globus subscription and all of their storage systems have Globus endpoints with sharing enabled, so all of the research team’s various storage locations are easily accessible to them via Globus. As easy as the Globus web app is to use, it’s still tedious to manually point-and-click through the web interface to make things happen.
Geoffrey tries to get research teams to script their processing steps so they can execute the steps with minimal human effort. In addition to making the team more efficient, scripting also helps the team capture the expertise needed to understand how the processing works, and assists with research reproducibility. The Globus Command Line Interface (CLI), which is a standalone application that provides a command line interface to both the Transfer and Auth services, is well-suited for use within these scripts.
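As a rough illustration of what such a script might look like, the sketch below wraps a `globus transfer` call in Python and pulls the task ID out of the CLI's JSON output. It is a minimal sketch, not the team's actual script: the function names, endpoint IDs, paths, and label are hypothetical, and it assumes the `globus` executable is on the PATH and that `globus login` has already been run.

```python
import json
import subprocess


def build_transfer_command(src_endpoint, src_path, dst_endpoint, dst_path, label):
    """Return the argument list for a recursive `globus transfer` with JSON output."""
    return [
        "globus", "transfer",
        f"{src_endpoint}:{src_path}",
        f"{dst_endpoint}:{dst_path}",
        "--recursive",
        "--label", label,
        "--format", "json",
    ]


def parse_task_id(cli_output):
    """Extract the task ID from the JSON document printed by `globus transfer`."""
    return json.loads(cli_output)["task_id"]


def submit_transfer(src_endpoint, src_path, dst_endpoint, dst_path, label):
    """Submit the transfer through the CLI and return its task ID."""
    command = build_transfer_command(
        src_endpoint, src_path, dst_endpoint, dst_path, label
    )
    # check=True raises CalledProcessError if the CLI exits with an error.
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return parse_task_id(result.stdout)


# Hypothetical usage (placeholder endpoint IDs and paths):
# task_id = submit_transfer(
#     "LANDING_ZONE_ENDPOINT_ID", "/incoming/run42/",
#     "ANALYSIS_ENDPOINT_ID", "/scratch/run42/",
#     "stage run42",
# )
```

Keeping command construction and output parsing in separate functions means both can be checked without contacting Globus; only `submit_transfer` actually invokes the CLI.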
One of Geoffrey’s examples involved showing a research team how to automate their data processing using the Unix make system. (Make is a software development tool with a built-in dependency engine.) The team was able to run their research data pipeline using make commands. They used the dependency engine to declare how each stage of the pipeline depends on the previous one(s), so that a request for the results of the pipeline for a new dataset automatically generated and executed all of the earlier stages. The transfer ID produced by the Globus CLI was a key piece of data for making these dependencies work: a later stage could be made to depend on the status of each earlier transfer ID reaching the “SUCCEEDED” state.
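The idea of gating a later stage on a transfer reaching “SUCCEEDED” can be sketched in Python with a make-style stamp file. This is not the team's makefile, which the article does not show; the function names and the stamp-file convention are assumptions for illustration. It relies on `globus task show TASK_ID --format json` reporting a `status` field, and it assumes the CLI is installed and logged in.

```python
import json
import subprocess
from pathlib import Path


def task_status(task_id, run=subprocess.run):
    """Return the status string reported by `globus task show` for a task."""
    result = run(
        ["globus", "task", "show", task_id, "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["status"]


def mark_stage_done(task_id, stamp_path, run=subprocess.run):
    """Create a stamp file only once the transfer has succeeded.

    Returns True if the stamp exists after the call, so a later stage
    can treat the stamp file as its prerequisite, much like a make target.
    """
    stamp = Path(stamp_path)
    if stamp.exists():
        return True  # stage already recorded as complete; nothing to do
    if task_status(task_id, run=run) != "SUCCEEDED":
        return False  # still running or failed; downstream must not start
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(task_id + "\n")
    return True
```

The `run` parameter exists only so the CLI call can be replaced in tests. In a real pipeline, `globus task wait TASK_ID` can be used to block until a task finishes before the status is checked.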
- “I started using the Globus CLI last year and I have never needed to look at the documentation. The built-in help interface and the self-documenting nature of the CLI made it very easy to figure out how to do what was needed just by trying commands and following the help text.”
- “Since all of the endpoints the research teams at Purdue need to use are shared endpoints, a single ‘globus login’ command provides access to the endpoints they’re authorized to use. Because the Globus CLI offers output in multiple formats (JSON, CSV, etc.), it’s easy to use the output from a CLI command in whatever scripting system one is using.”
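To make the point about machine-readable output concrete, here is a minimal sketch of consuming JSON from a directory listing in Python. It assumes the text comes from `globus ls ENDPOINT_ID:/path --format json` and that the document holds its entries in a `DATA` list with `name` and `type` fields; that layout should be checked against the installed CLI version. The function name is hypothetical.

```python
import json


def file_names_from_ls(ls_json):
    """Return the names of regular files from JSON-formatted `globus ls` output."""
    listing = json.loads(ls_json)
    # Assumed layout: {"DATA": [{"name": ..., "type": "file" or "dir", ...}, ...]}
    return [entry["name"] for entry in listing["DATA"] if entry["type"] == "file"]
```

A script could pass the resulting names to later processing steps without any screen scraping of human-readable output.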