DOE Data Day (D3)
Globus professional services manager Rick Wagner will be speaking at the 2019 Department of Energy (DOE) Data Day event at Lawrence Livermore National Laboratory (LLNL):
- Title: Globus Research Data Platform
- Date/Time: Wednesday, Sept. 25 @ 1:55 - 2:20 p.m. (part of "Data-Intensive Computing" session)
- Location: Bldg. 170, Room 1091
- Abstract: This presentation will introduce how Globus is used for data management, motivated by examples drawn from major facilities and projects within the DOE, NSF, and NIH funding ecosystems. Globus provides high-performance, secure file access, transfer, and synchronization directly between storage systems (i.e., without needing to relay via an intermediary machine). Globus scales to meet the needs of increasingly diverse data by handling all the difficult aspects of data transfer, from authentication at source and destination to performance optimization and automatic fault recovery. It supports both high-performance GridFTP transfers and secure HTTPS access for direct upload/download. Originating from Argonne National Laboratory and now developed and operated by the University of Chicago, Globus has become a preferred service for moving and sharing data among a wide variety of storage systems at research labs, campus computing resources, and national facilities like the Argonne and Oak Ridge Leadership Computing Facilities (ALCF and OLCF), DOE’s Joint Genome Institute, the National Energy Research Scientific Computing Center (NERSC), and the Advanced Photon Source (APS) at Argonne.
Globus relies on two core components: the Globus service, which coordinates data transfer; and the Globus Connect software that is deployed on storage systems to enable secure, high-performance data access. Globus Connect’s modular backend storage interface enables interoperability across HPC and cloud storage systems such as Amazon S3, Ceph, HPSS, HDFS, and Box. Globus provides secure data sharing, allowing users to make data on Globus endpoints accessible to other individual users and/or groups. The Globus Connect software provides advanced features that ensure that users who access such shared endpoints are restricted to the locations and permissions granted by the owner. Such shared endpoints can be created and managed dynamically by users and programs, providing a convenient mechanism for data sharing. All Globus services expose programmatic APIs that can be used by developers and data providers to offer robust file transfer and sharing, while leveraging advanced identity management, single sign-on, and authorization.
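The idea that users of a shared endpoint are restricted to the locations and permissions granted by the owner can be illustrated with a small sketch. This is not Globus Connect code or the Globus API: `AccessRule` and `is_allowed` are hypothetical names, and the rule model (a principal, a directory prefix, and a permission string of "r" or "rw") is an assumption made only for illustration.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AccessRule:
    """Hypothetical access rule set by the owner of a shared endpoint."""
    principal: str    # user or group identifier
    path: str         # directory that the rule covers, including everything below it
    permissions: str  # "r" (read) or "rw" (read and write)


def is_allowed(rules, principal, path, want):
    """Return True if `principal` holds permission `want` ("r" or "w") on `path`."""
    # Normalize first so that ".." segments cannot escape the shared directory.
    target = PurePosixPath(posixpath.normpath(path))
    for rule in rules:
        if rule.principal != principal:
            continue
        root = PurePosixPath(posixpath.normpath(rule.path))
        inside = target == root or root in target.parents
        if inside and want in rule.permissions:
            return True
    return False


# Example: the owner grants one collaborator read-only access to one directory.
rules = [AccessRule("alice@example.org", "/shared/project/", "r")]
can_read = is_allowed(rules, "alice@example.org", "/shared/project/run1/data.h5", "r")
can_write = is_allowed(rules, "alice@example.org", "/shared/project/run1/data.h5", "w")
```

In this sketch access is denied by default: a request succeeds only when some rule explicitly covers both the path and the requested permission.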
Globus’s data management features build on Globus Auth, a mature, widely used foundational identity and access management platform service designed to address the needs of the science and engineering community for authentication and authorization across platforms, institutions, and services. It serves to broker authentication and authorization interactions between end users, identity providers, resource servers (services), and clients (including web, mobile, desktop, and command line applications, and other services), permitting unified access to research data across all systems.
Developers can use Globus Auth APIs to integrate these capabilities into services, applications, and tools without needing to develop software to authenticate users, support peripheral workflows (e.g., password reset), or apply security updates. By eliminating friction associated with the frequent need for multiple accounts, identities, credentials, and groups when using distributed resources, Globus Auth streamlines the creation, integration, and use of advanced research applications and services. Globus Auth builds upon the OAuth 2 and OpenID Connect specifications to enable standards-compliant integration using existing client libraries. It supports identity federation models that enable linking of diverse identities (e.g., XSEDE, ORCiD, institutional), a secure scoped access token model for interacting with services, and APIs for resource servers and clients to validate and introspect tokens, with a delegation model by which services can obtain short-term delegated tokens to access other services.
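Because Globus Auth builds on OAuth 2 and OpenID Connect, the first step of a client integration looks like any standard authorization code flow. The following is a minimal sketch using only the Python standard library, with PKCE as defined in RFC 7636. The authorize endpoint URL, the client ID, and the redirect URI are assumptions or placeholders to be checked against the Globus Auth documentation, and `make_pkce_pair` and `build_authorize_url` are hypothetical helper names. In practice an existing OAuth 2 client library or an SDK would normally handle this step.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode

# Assumed Globus Auth authorize endpoint; verify against the current documentation.
AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"


def make_pkce_pair():
    """Create a PKCE code verifier and its S256 code challenge (RFC 7636)."""
    verifier = secrets.token_urlsafe(64)  # 86 URL-safe characters
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


def build_authorize_url(client_id, redirect_uri, scopes, challenge, state):
    """Build the URL that the user opens in a browser to log in and consent."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",     # authorization code flow
        "scope": " ".join(scopes),   # scopes are space-separated in OAuth 2
        "state": state,              # echoed back by the server to protect against CSRF
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return AUTHORIZE_URL + "?" + urlencode(params)


verifier, challenge = make_pkce_pair()
url = build_authorize_url(
    client_id="00000000-0000-0000-0000-000000000000",  # placeholder, not a real client
    redirect_uri="https://example.org/callback",       # placeholder
    scopes=["openid", "profile", "email"],             # standard OpenID Connect scopes
    challenge=challenge,
    state=secrets.token_urlsafe(16),
)
```

After the user consents, the service redirects back with an authorization code, which the client exchanges (together with `verifier`) for scoped access tokens at the token endpoint. That exchange requires a network call and is omitted here.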
Globus Auth capabilities are all accessible via a REST API and associated SDKs, making it easy to integrate them into applications and services. Its effectiveness is illustrated by its adoption by projects such as: NCAR's Research Data Archive; DOE’s KBase; the NSF’s JetStream cloud and XSEDE network; and NIH’s FaceBase Consortium. Globus Auth supports over 420 identity providers, including most DOE national laboratories. Other measures of maturity, adoption, and impact are that Globus Auth manages over 75,000 unique identities; supports 300 applications and services; and has issued almost 4.4 million access tokens.
Globus services are widely used within and outside the Department of Energy, with tens of thousands of users and more than 14,000 storage systems accessible via Globus, including at most leading US universities and research computing centers. By using Globus, users can leverage other DOE investments, such as ESnet; for example, researchers from Argonne recently completed the largest single transfer managed by Globus, moving 2.9 PB between OLCF and ALCF storage without disruption. DOE researchers use Globus to receive data from core facilities like the APS, move data to compute resources and archival systems, pull data from remote instrument facilities into analysis environments, and automate these tasks.
The D3 workshop, held September 25-26 at Lawrence Livermore National Laboratory (LLNL), is dedicated to data management activities in the Department of Energy (DOE) national laboratories. The DOE has joined the larger scientific community in promoting data management as a means to higher-quality and more efficient research and analysis. Data management includes a disciplined approach to metadata, which tracks provenance and provides traceability from raw data products through analysis results and potentially through production. Topic areas include Data Curation and Standards, Data Intensive Computing, Data Management in the Cloud, and Data Access, Sharing, and Sensitivity. For details, visit the event page.