In this first post in a series on what’s going on in big data storage, particularly as it relates to research centers, universities, and labs, I’ll discuss the current state of campus research storage, the good and the bad, and point out what campuses should be considering as part of their data management planning.
Campus Research Storage Options
Let’s start with a summary of the options campuses have today in terms of providing storage for their researchers:
- On-premises options:
- Roll your own: Usually low-cost commodity servers, disk drives, and networking, with open source storage management software such as Ceph, ZFS, or Lustre, installed and managed by local staff
- Turnkey: A complete storage hardware and software solution supported by a vendor, such as a NAS appliance or Spectra Logic’s BlackPearl
- Cloud-based options:
- Cloud file storage: Google Drive, Box, Dropbox, OneDrive, etc.
- Cloud object storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Wasabi, etc.
- Cloud cold storage: Amazon Glacier, Google Cloud Storage Nearline, Azure Cool Blob Storage, etc.
- Cloud file system: Amazon EFS, Azure File Storage, etc.
Lots of options to choose from!
I Need More Storage
We hear stories like these all the time:
- A new imaging core in the medical center just spent $100K on “fast” storage attached to the Windows servers on their microscopes... now they’re wondering how they will back it up.
- The head of the physics department asked two of her postdocs to build a “big NAS system for all our data” and gave them a $30K budget.
- A researcher has been keeping their data on USB drives they bought online, but they need more space and are looking for a better option.
- The research computing center is building a new storage system from commodity components to make available to campus researchers under a charge-back model.
- A genomics project is moving data from their genomics core to the cloud where they are doing the processing.
- The CIO’s office announced that they signed a campus-wide license for “unlimited” storage on a public cloud provider, so every student, staff member, and faculty member should use that for their files from now on.
Institutions everywhere are building storage silos as researchers clamor for more space to hold their ever-increasing data volumes. Individual labs and departments invest in storage on the assumption that a single system can serve all their needs: support fast computation, provide easy access to large files and reference data, and archive all data products so they can be easily recovered in the future. But, as many of us know, buying a single system and expecting it to handle such diverse workloads effectively is not realistic. And once researchers run into issues with on-premises storage, they look to the cloud for relief, introducing additional challenges around security, durability, ease of access, and cost.
So what if we approached research storage as a shared collection of workload-specific tiers, instead of a collection of monolithic systems tied to specific owners and serving multiple purposes?
Researchers typically need some combination of: (a) fast, but not necessarily highly reliable (e.g., scratch), storage attached to HPC/HTC resources; (b) online, reliable, low-cost general purpose storage for all of their active datasets, large and small; and (c) very inexpensive, reliable, but potentially slow (e.g., nearline) storage for archive and backup. In an ideal world, researchers would access all of these systems via a unified interface (an end-user “single pane of glass,” if you will) and data would flow seamlessly between tiers, aided by robust automation that requires no user intervention.
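To make the idea of automated movement between tiers concrete: most cloud object stores already support a limited form of this via lifecycle policies. As a rough sketch (the bucket prefix and timing here are hypothetical, chosen for illustration), an AWS S3 lifecycle configuration like the following would automatically transition objects under a `completed-runs/` prefix from online object storage to Glacier cold storage 90 days after creation:

```json
{
  "Rules": [
    {
      "ID": "archive-completed-runs",
      "Status": "Enabled",
      "Filter": { "Prefix": "completed-runs/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

A policy like this can be applied with `aws s3api put-bucket-lifecycle-configuration`. Of course, this only automates movement within one provider’s tiers; the broader vision described above spans on-premises and cloud systems alike.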
Pipedream, you say? Perhaps, but this is the goal we’re working towards here at Globus. And, later in this blog series, we’ll hear from some campuses that adopted this view early on and are starting to reap the benefits.
In this series we’ll take in-depth looks at both on-premises and cloud-based storage solutions, their pros and cons, and practical advice for making the best of them.
We’ll kick off with on-premises turnkey solutions in our next article, with a Q&A discussion with storage leader Spectra Logic.
Thanks for reading!