SEA's Improving Scientific Software (ISS) Conference 2021
Globus will be presenting at SEA's ISS Conference 2021.
Scalable Data Management: Automation and the Modern Research Data Portal
Scientific instruments and core facilities generate large volumes of data daily. With growing data volumes, the research enterprise is increasingly challenged by what should be mundane tasks: reliably moving data from instruments and computing resources, easily describing data for downstream discovery, and making the data accessible (often with appropriate access controls) to distributed groups of collaborators. The ad hoc methods currently employed at many facilities place an undue burden on scientists and system administrators alike, and it is clear that some level of automation is required for these tasks.
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its interactive web browser interface addresses simple file transfer and sharing scenarios, large-scale automation typically requires integrating the research data management platform that Globus provides into bespoke applications.
We will describe one such example, the Petrel data portal (https://petreldata.net), used by researchers to manage data in diverse fields including materials science, cosmology, machine learning, and serial crystallography. The portal facilitates automated ingest of data, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users. As security and privacy are often critical requirements, the portal employs fine-grained permissions that control both visibility of metadata and access to the datasets themselves. It is based on the Modern Research Data Portal design pattern, jointly developed by the ESnet and Globus teams, and leverages capabilities such as the Science DMZ to enhance performance and streamline the user experience. We will describe common use cases that motivate the need for such data portals, illustrated by further examples, and will demonstrate how investigators can rapidly develop and deploy these capabilities to scale up their research.