I am pleased to announce that the Globus platform now supports search capabilities, providing a secure and scalable solution for research data discovery.
Discovery of research data relies on the ability to search and query over structured, domain-specific scientific metadata as well as free-form descriptive metadata. The Globus search ingestion API allows such metadata to be indexed, respecting the metadata schema where one is defined. While some domains have metadata standards that can be used, projects are often stymied by the need to define a metadata schema upfront or to adopt a generic standard that is not well suited to their use. Globus search is metadata schema agnostic: users can choose a metadata schema for their index, and the schema evolves automatically as additional metadata elements are added.
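To make the idea of schema-agnostic ingestion concrete, here is a minimal sketch of a toy in-memory index whose schema grows as new metadata fields arrive. It is an illustration only and not the Globus search ingestion API: the `ToyIndex` class, its `ingest` method, and the sample field names are all hypothetical.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory index; NOT the Globus search API."""

    def __init__(self):
        self.entries = {}                # subject -> metadata dict
        self.schema = defaultdict(set)   # field name -> observed type names

    def ingest(self, subject, metadata):
        """Store metadata for a subject and record any new fields seen."""
        self.entries[subject] = dict(metadata)
        for field, value in metadata.items():
            self.schema[field].add(type(value).__name__)


index = ToyIndex()
# No schema is declared up front; each record brings its own fields.
index.ingest("dataset-1", {"title": "Al-Cu alloy simulation", "temperature_k": 300})
index.ingest("dataset-2", {"title": "Ti oxide experiment", "instrument": "XRD"})

fields = sorted(index.schema)
print(fields)
```

The second record introduces an `instrument` field that the first one lacks, and the index simply records it rather than rejecting the entry.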
Once metadata is indexed, the search platform supports rich, faceted search. Using the query API, users can start with text searches and be guided interactively to relevant results through facets, which are effective in narrowing or expanding results in subsequent queries. More sophisticated queries, which integrate search results with other platforms such as computation services, are also supported. For example, researchers can use a query to find specific datasets to analyze and feed them into their workflows and tools, such as Jupyter notebooks.
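The interaction described above, a text search followed by facet-driven narrowing, can be sketched in a few lines. This is a toy model under stated assumptions, not the Globus query API: the `search` function, its parameters, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical sample records standing in for indexed metadata.
RECORDS = [
    {"title": "Copper alloy simulation", "material": "Cu", "year": 2016},
    {"title": "Copper oxide experiment", "material": "Cu", "year": 2017},
    {"title": "Titanium alloy simulation", "material": "Ti", "year": 2017},
]


def search(records, text, filters=None, facet_fields=()):
    """Return records whose title contains `text` and that match every
    filter, plus a count of values for each requested facet field."""
    filters = filters or {}
    hits = [
        r for r in records
        if text.lower() in r["title"].lower()
        and all(r.get(key) == value for key, value in filters.items())
    ]
    facets = {f: Counter(r[f] for r in hits if f in r) for f in facet_fields}
    return hits, facets


# Step 1: free-text search, asking for facet counts to guide the next query.
hits, facets = search(RECORDS, "alloy", facet_fields=("material", "year"))

# Step 2: narrow the same search by selecting one facet value.
narrowed, _ = search(RECORDS, "alloy", filters={"year": 2017})
```

The facet counts from the first call tell the user which values are available to filter on, and the second call applies one of them.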
Another barrier to making research data discoverable is the need to restrict the visibility of not only the data but also the metadata available for discovery. Often, access is limited to collaborators during various phases of a project, or different aspects of the metadata are of use to different groups of users. An entity in a Globus search index may have multiple metadata elements associated with it, each with its own access policy that limits its visibility in query results to the desired users or groups of users. In this way, the appropriate subset of metadata about each search result can be presented to each user.
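A minimal sketch of per-element visibility follows, assuming each metadata element carries a set of principals allowed to see it. The data layout, the `visible_to` key, the `group:collaborators` label, and the `visible_metadata` helper are assumptions made for illustration; they are not taken from the Globus search documentation.

```python
# Hypothetical entity with two metadata elements and different policies.
ENTITY = {
    "subject": "dataset-1",
    "entries": [
        {"visible_to": {"public"},
         "content": {"title": "Alloy simulation"}},
        {"visible_to": {"group:collaborators"},
         "content": {"raw_data_path": "/project/run42"}},
    ],
}


def visible_metadata(entity, principals):
    """Merge only the metadata elements the caller is allowed to see.

    `principals` is the set of identities and groups the caller holds;
    everyone is implicitly treated as holding "public".
    """
    allowed = set(principals) | {"public"}
    merged = {}
    for entry in entity["entries"]:
        if entry["visible_to"] & allowed:
            merged.update(entry["content"])
    return merged


anonymous_view = visible_metadata(ENTITY, [])
collaborator_view = visible_metadata(ENTITY, ["group:collaborators"])
```

An anonymous caller sees only the public title, while a member of the collaborators group also sees the restricted element for the same entity.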
The Globus search platform is a key part of The Federated Research Data Repository, a joint initiative by Compute Canada and Portage to provide a solution for discovery of research data across Canada.
In another example, the Materials Data Facility (MDF) uses the Globus search platform to allow users to find and use data harvested and indexed from heterogeneous resources (e.g., databases, datasets, and collections) across the materials science community. Currently, MDF has indexed over 3 million entries from 116 distinct data sources, and the index is expected to grow rapidly in the near future. This index provides a rich set of materials-specific metadata coupled with links to full simulation and experiment output files for its users to aggregate.
MDF’s use of Globus search not only serves users who want to browse and find datasets, but also allows simple programmatic access to query for datasets that can be integrated with high performance computing platforms and interactive tools like Jupyter notebooks. This combination of indexed metadata coupled with access to the associated output files has proven useful for machine learning applications in materials science.
The Globus search capability is in early production release and is available to users on request. More information can be found in the documentation. We continue to build on the capability to make it generally available, including self-service index creation and management, and general search user interfaces as part of the Globus web application.
We welcome comments and feedback. If you are interested in learning more about using this new capability for your project, please contact us at firstname.lastname@example.org.