State of the Craft in Research Data Management
June 14, 2021 | Susan Tussy
Data volumes are exploding, and the need to efficiently store and share data quickly, reliably and securely—along with making the data discoverable—is increasingly important. Deliberate planning and execution are necessary to properly collect, curate, manage, protect and disseminate the data that are the lifeblood of the modern research enterprise.
To better understand the current state of research data management, we reached out to several research computing leaders at institutions around the world to gain insights into their approach and their recommendations for others. We wanted to explore how data management tools, services, and processes deployed at their institutions help the research community reduce the amount of time spent addressing technological hurdles and accelerate their time to science.
At these and many other leading institutions, new methodologies and technologies are being developed and adopted to create a frictionless environment for scientific discovery. Self-service data portals and point-and-click data management tools such as Globus are being made available to allow researchers to spend more time on science and less on technology. Education, documentation and a data management plan can assist researchers in their desire to spend more time on science. Hackathons, webinars and researcher resource books are just some of the ways that institutions are educating researchers on best practices. Science Node published a summary of our findings, and we present here a more complete discussion with Nick Jones from the New Zealand eScience Infrastructure, in the final installment of our five-part series.
Nick Jones - Director, New Zealand eScience Infrastructure
What is important with respect to data management in science?
For context, New Zealand’s national science system was a late adopter of advanced digital research infrastructures and platforms. A decade ago our national profile of advanced computational research skills and capabilities was slim. There were few nationally recognised big science missions. Today, natural hazards, our natural environment, and health, diversity, and genetics across all species are areas where we see significant capability. All of these areas, going all the way back to climate science, are computationally and data intensive. So in many respects, our ability to advance science through research data becomes critical for New Zealand.
One further point to set the scene in a local sense, and to offer perhaps a different perspective to build a broader view. In New Zealand we have a unique arrangement of governance and sovereignty between Māori, the tangata whenua (indigenous peoples) of Aotearoa New Zealand, and the Government. As a science system we are on a journey in recognising this relationship between Māori and Government - which has implications due to a complicated array of interests across many areas of research.
How would you describe your approach to research data management?
Our role as a research computing and data platform is to invest into enabling research across a broad and inclusive array of research activities. The sheer diversity of needs is driving us to invest in a much richer set of services than ever before. We’re extending our view out to data intensive methods and workflows which operate beyond the walls of our facilities.
We took a significant step towards this a few years ago by adopting Globus as our de facto national data transfer platform for research - we’ve invested to build and sustain capability to operate Globus as a national service provider, and to drive adoption widely across the NZ research system. Our goal has been to lower barriers, to normalise expectations of moving demanding volumes of data to enable data intensive science.
Our focus at NeSI remains on active data within computational workflows—where researchers need to move data in and out of campuses, on and off different facilities and instruments, with a principle of moving data as little as possible and implementing the right controls and governance protocols.
What do you do to encourage adoption of tools?
We love working with research communities, partnering with them to come up with tactics that help them meet their needs. One question we often face is how much advanced computational training we should do versus basic research computing skills training.
We’ve been involved in a wide range of activities, from “Hacky Hours” to webinars to hackathons to workshops on research data management, working through principles, tools and practices. Recently we held a “Git for research” webinar with 70 researchers, many of whom walked away with a real sense of progress in how they could work. There is real interest in communities to pick up basic skills. While these types of tactics are becoming common at a few institutions, there is little coordinated action at a national level.
In 2015 we formally adopted the Carpentries approach to basic computational skills training, recognising we needed national coordination and a strategy of instructor training (train the trainer). This has scaled up computational skills training for researchers, offered professional development, and formalised a key role for those supporting research and working as research software engineers. Still, there is no single party responsible for supporting the digital transformation of research—and where do you start when considering the diversity of needs and tools, and the lack of a viable standardised support model? Often this is seen as the role of an institution, though equally an institution might typically see this as a role for any research community within an academic tradition.
We tend to adopt a broader systems view when thinking about our investments, aiming to identify the levers that will drive the highest impact for researchers or the science system as a whole. An example—we’ve partnered with our national research & education network REANNZ, adopting the Science DMZ pattern, and focusing on building capability into institutions to grow the network. We support connections across that community, whether through shared support channels in Slack or hosting Globus webinars and national data transfer forums at national events.
One key to the evolution of our platform has been adopting a user-centric posture. We try to hold a broad view of who we’re here to support—being more inclusive helps us in uncovering new insights from users who haven’t learnt to work around the way things are. We’re looking closely at the effort invested by users in carrying out their research. And as we identify and address their pain points, we strengthen relationships and build confidence. This opens up conversations with researchers and institutions on their future directions, and it’s these conversations which are taking us to new and interesting places.
How do you address the needs for collaboration?
As a national provider, we operate at the intersection of national infrastructures and institutions—we’re a collaborative venture embedded across the sector, and as a small country we depend on collaboration to achieve a useful level of scale and capability.
For our own efforts it has taken time to lay down necessary foundations. We’ve invested in relationships and in building shared understanding. We align around shared objectives and mutually compatible goals. There are always differences of perspective to work through, and challenges in sustaining commitments—there is always a need for energy to be applied. Our small scale works in our favour to some extent, as it isn’t reasonable to allow collaboration to fail—we don’t have the diversity of investments where a failure in one simply opens opportunity for another. We’re getting far better at identifying the points of inertia at the national level, and being deliberate about structures and incentives to drive collaboration and shared outcomes.
Stepping right back and pondering our sense of nationhood, we are working through differences in belief systems and values. As a nation, and as researchers, we recognise Māori have an advanced system of knowledge, referred to as mātauranga Māori. Māori have always held a very long term view, and are keen observers of the stars and the galaxies, the seasons and weather patterns, all aspects of the environment, and especially of people and community. To work in research areas such as environment, ecology, population health, and disease—so many of the key needs we have for advancing our research involve tangata whenua rights, interests, and needs. And that research needs to be done in response to those interests.
There are now a wide range of areas of science where Māori are leading the work and where Māori communities are at the table, with the research guided by community-led governance. This is at least as important as the underlying tools and skills when approaching research data management in New Zealand. Māori data sovereignty is becoming a central tenet, as so many of our common interests at a national level intersect with tangata whenua interests in land, people, waters, and systems of knowledge and belief. If we’re to succeed in research data management, it is essential we keep working on effective models of collaboration and governance.
What trends are you seeing around data management?
We’re seeing a lift in expectations for more capability and maturity in research data management across a range of research communities. Research teams are focusing on building shared understanding of practices, data workflows, and governance. Research institutions are looking to invest in integrated and flexible research platforms to support a varied range of needs.
We see institutional Dropbox-style models widely adopted, common tooling like Globus being taken up, and automation being deployed to move data between storage and instruments in experimental workflows. Some patterns are emerging, but the integration points are challenging, such as the back end of the data transfer nodes (DTNs) and Globus storage connectors. We are starting to see things change as we leverage more cloud-first philosophies and cultures, though the cost barriers remain high. There are the starting threads of national discussions on open research, and on research data management as an investment priority for building institutional capability.
What are some of the challenges in research data management?
We have looked at a broader range of use cases, though with a strong focus on active data. Traditional HPC is optimised for high I/O on shared file systems, so the starting question is whether these systems work for a fuller life cycle of data management. Are they geared for that?
There are points of friction in integrating HPC into broader research data management needs, spanning challenges around delivering useful software and services, balancing security controls and restrictions, and, in general, evolving operating models that have been optimised for HPC over many decades. Traditional HPC systems are designed around shared multi-user file systems, so the security posture is quite closed. Exposing accessible APIs, or supporting interactive notebooks on the HPC, is complicated in a shared environment. Working inside an HPC platform, acquiring technologies that aren’t necessarily purpose-built for those environments, and mounting them for new communities can be a laborious process. It will be interesting to watch as cloud-native technologies become more prominent, though the rate of change in scientific computing is relatively slow due to the sunk cost in previous investments.
We’ve adopted JupyterHub and RStudio to improve our services for working with data. We’re exploring emerging superfacility patterns, including adopting microservices for various administrative tasks. We’re embracing cloud-native technologies and automation, and experimenting with emerging models such as serverless using platforms like funcX. All of these are driving us to change our platform designs and incorporate cloud-native platforms, technologies like Singularity and Kubernetes, and microservice architectures. And we’re seeing the traditional HPC vendors start to aggressively reengineer their HPC operating systems and management environments to adopt these same technologies. We’re witnessing a point of inflection, where cloud-native computing is delivering the performance required for HPC, and HPC technology vendors are now adopting these technologies at the core of their products.
One particular opportunity that comes from this shift is a move from shared multi-user environments to multi-tenancy. In HPC we’re moving into a technology environment where it’s feasible to completely isolate different user groups from each other while still operating on shared systems. This allows for the efficiency of consolidation while enabling us to meet a more diverse set of needs with a richer set of services. And more importantly it’ll allow us to venture into supporting human and clinical genomics, and into responding to concerns for data sovereignty as we work with Māori on research to meet their needs. The journey ahead is certainly looking interesting.