Part IV in a Five-Part Series
Data volumes are exploding, and the need to store and share data quickly, reliably and securely, while also making the data discoverable, is increasingly important. Deliberate planning and execution are necessary to properly collect, curate, manage, protect and disseminate the data that are the lifeblood of the modern research enterprise.
To better understand the current state of research data management, we reached out to several research computing leaders at institutions around the world to gain insights into their approach and their recommendations for others. We wanted to explore how the data management tools, services, and processes deployed at their institutions help the research community spend less time addressing technological hurdles and accelerate their time to science.
At these and many other leading institutions, new methodologies and technologies are being developed and adopted to create a frictionless environment for scientific discovery. Self-service data portals and point-and-click data management tools such as Globus are being made available so that researchers can spend more time on science and less on technology. Education, documentation and a data management plan also help. Hackathons, webinars and researcher resource books are just some of the ways that institutions are educating researchers on best practices. Science Node published a summary of our findings, and we present here a more complete discussion with Matthew Harvey, former Research Computing Service Manager at Imperial College, in part four of our five-part series.
Matt Harvey, former Research Computing Service Manager, Imperial College (UK)
How does your institution approach research data management?
Until 2017, the College did not have a coordinated central strategy for research data management. There were many independent solutions targeted at "sync-and-share" (Dropbox, Box, etc.). PIs who needed lots of data storage would find their own small NAS or coordinate a larger purchase of storage in an HPC facility. Clearly this was not tenable. In 2018 we designed a centralized institutional data storage (IDS) platform. It needed to be fast, directly accessible, and as close to ubiquitous as possible: basically, accessible by researchers from their desktop systems, with some provision for moving data to and from other institutions or external collaborators.
What benefits do tools like Globus provide your organization?
One of the main reasons we use Globus is for sharing and external collaboration. Without Globus we would have no way of sharing without putting people through human resources processes, and that would still not solve the technical hurdles of accessing, syncing, and sharing our large files, which scale to terabytes in size. Security is super important as well. We go through a certification process and have the high assurance (HA) version of Globus, which gives us all the logging information we need. We can enforce multi-factor authentication (MFA) for our users, and we are prescriptive as to who can share data with Globus. Only people with appropriate authorization can share, and we built this process ourselves.
How do you educate your users and promote adoption of the tools and best practices?
To educate people and make them aware of Globus sharing, we list our services on the web, and Globus is at the top of the services webpage. When we encounter a service request where someone is going to send a hard disk full of data, we make them aware of Globus, suggest they use the service, and onboard them. Now, with the pandemic and many people working remotely, we are also promoting Globus Connect Personal to our user base. This is pretty useful, particularly for people in regions with high latency to the UK or who are unable to access a VPN service. Also, we are starting to see Globus used internally for moving data much more than I thought we would. For data acquisition devices that produce a lot of data, such as those for cryo-electron microscopy, we now require these services to be built with the data backhauled to the RDS, rather than building their own storage silo. We need to make the RDS as readily accessible as possible to researchers in the forms that they need. The RDS platform is designed with data, not compute, as its key feature.
What would you like to see in the future to add value to the data?
We have moved from a model where a researcher has multiple terabytes of storage to a fine-grained model where researchers cut a new allocation for each activity. This is so we know what each allocation is being used for. We also run a course for researchers on how to handle data. It is essentially a Software Carpentry module. We have a data repository, which is a place for researchers to publish data. This is fine for tiny data sets, but we have a problem with researchers who have large data sets that they want to publish when there is no community to publish to. We will allow researchers to store data sets indefinitely. Our HA version of Globus requires authentication, but for data that has gone through the publication process it would be great to be able to promote a URL and have truly public access. Another thing is managing data with personally identifiable information and data with regulatory controls. This is challenging because it requires providing a whole sandbox of compute and storage to constrain a research activity, and allowing access only to researchers on a particular project. Data movement in this sort of environment needs an audit trail both into and out of this “walled garden” or secure enclave. Globus provides a level of audit, but we also require a copy of the moved data to be kept in write-only storage so that we have a complete audit trail of what went in and out. Ideally we would have a Globus endpoint in each enclave, and it would be set up automatically. (Globus Note: many institutions are already doing this, and it will be further simplified by the Automate service that is currently under development.)