November 6, 2017 | Steve Tuecke

This is the third in our Storage Innovations blog series. In our last post, we talked about the benefits of turnkey on-premise storage solutions. In today’s post I’ll be addressing the benefits and risks of the alternative: a roll-your-own approach.

Rolling your own is certainly tempting, primarily due to the perceived cost benefits. A robust tiered storage solution doesn't come cheap, so there's incentive for campuses to build vs. buy. There can also be control and customization/flexibility benefits to managing a solution you build yourself. But neither of these two benefit categories is as simple as it looks…

Is it really cheaper to roll your own?

Not always. Keep in mind that the “cost” of storage includes more than just the hardware -- the people cost to set up and maintain the system can be far greater. It takes a lot of time and expertise to spec out, procure, set up, test, tune, optimize and manage a shared on-premise storage system. If you don’t have the right experts on staff, you need to hire someone -- and storage experts don’t grow on trees. But even if you have this expertise in house, the cost of that person’s time is likely to be high.
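To make the hardware-versus-people point concrete, here is a minimal back-of-the-envelope sketch. Every number in it is a hypothetical placeholder rather than data from any campus or vendor, and the function name `total_cost` and its parameters are made up for illustration. Substitute your own quotes and salary figures.

```python
def total_cost(hardware_cost, fte_fraction, fte_annual_cost, years):
    """Hardware purchase plus staff time over the system's lifetime.

    fte_fraction is the share of one full-time person's time spent
    building and running the system (0.5 means half of one person).
    """
    staff_cost = fte_fraction * fte_annual_cost * years
    return hardware_cost + staff_cost


# Hypothetical inputs: cheap hardware that needs a lot of expert time...
roll_your_own = total_cost(
    hardware_cost=100_000, fte_fraction=0.5, fte_annual_cost=150_000, years=5
)

# ...versus pricier hardware that needs much less of it.
turnkey = total_cost(
    hardware_cost=250_000, fte_fraction=0.1, fte_annual_cost=150_000, years=5
)

print(f"Roll your own: ${roll_your_own:,.0f}")
print(f"Turnkey:       ${turnkey:,.0f}")
```

With these made-up inputs the staff cost dominates the roll-your-own total, which is the effect described above. Different inputs can just as easily tip the comparison the other way, so treat this as a way to frame the question, not as an answer.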

And here’s an additional factor: when you’ve hired or allocated an expert to build your own storage solution, you are now dependent on that person. If your expert leaves or becomes unavailable, you now have a custom-built storage solution that none of your staff know how to manage. Yes, you can perform knowledge transfer to guard against this, but as you may have experienced, it’s unwise to depend on this. Of course you can always hire, but as mentioned above the market for such experts is tight and then we’re also back to the cost issue: experts are expensive.

Finally, maintaining your own custom-built solution is heavy lifting, even if built well by an expert team who will never leave campus. Tools like Globus help by providing a seamless UI, but you still have to manage the systems themselves, facing all the challenges that come with hardware maintenance. Yes, you get control and flexibility, but you pay for this in money, time and risk.

So when does Roll-Your-Own storage make sense?

I’m not saying that rolling your own storage is always folly. I’m just urging you to head into such a project with your eyes wide open. Rolling your own can be a good idea for a campus with a large team with multiple areas of expertise, and also where institutional capability exists beyond this one storage system.

Examples – here are some good / bad fits for custom-built storage based on customer experiences:

  1. Good experience: A large R1 university that already had a substantial investment in Lustre for HPC needed a general purpose / commodity storage solution, so they decided to use Lustre for this as well. They procured the systems (optimizing for costs by using cheap commodity disk, Ethernet, etc.) and built out the Lustre filesystem themselves for all their storage needs. Their team is well established and redundant in that they aren’t dependent on a single Lustre expert.
  2. Not-so-good experience: Another customer, a smaller R1 university with just a few people in research computing, decided to build a ZFS-based solution since one of their team was skilled in ZFS. They bought two cheap Linux servers with local disks and ran ZFS using block replication to back up from one to the other. This took a long time to plan and procure -- and then, before the system even went live (do you see it coming?) -- yes, their ZFS expert left. The entire effort stalled out and to this day they are still scrambling to figure out how to get a solution in place.

The durability question

Cost and expertise issues aside, one factor that must be addressed in planning to build vs. buy a new storage solution is durability (i.e., how reliable the storage is from a data retention perspective). Here are a few examples of how to achieve higher durability when rolling your own research storage:

  1. Leverage existing tape or other inexpensive storage: If your institution has tape backup already, you can perhaps leverage that -- but make sure you understand incremental costs upfront (enterprise tape backup is often too expensive for research storage backup, as most campuses did not build their tape system to optimize for cost). This backup can be done using Globus, or using third-party backup tools.
  2. Leverage your existing storage system’s redundancy capabilities: For example, use erasure coding or mirroring. Lustre and Ceph, for instance, have increasingly sophisticated methods for redundancy built in. This approach may be suitable if your durability requirements are not too high. With this approach you may not have offsite backup, but you can typically survive some disk failure.
  3. Build a custom solution: It’s also an option to build your own backup in addition to your own storage - but of course, the same costs/risks all apply.
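The trade-off between mirroring and erasure coding in option 2 can be sketched with simple arithmetic. This is a generic illustration, not the configuration model of Lustre, Ceph, or any other product; the class and function names are hypothetical. It relies only on the standard definitions: an n-way mirror keeps n full copies and survives the loss of n-1 of them, while a k+m erasure code splits data into k units, adds m redundant units, and survives the loss of any m units within that group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyScheme:
    name: str
    data_units: int       # k: units holding the data itself
    redundant_units: int  # m: extra units added for protection

    @property
    def failures_tolerated(self) -> int:
        # Units that can be lost within one protection group without data loss.
        return self.redundant_units

    @property
    def raw_per_usable(self) -> float:
        # Raw capacity you must buy for each unit of usable capacity.
        return (self.data_units + self.redundant_units) / self.data_units


def mirroring(copies: int) -> RedundancyScheme:
    # An n-way mirror is one data unit plus (n - 1) full extra copies.
    return RedundancyScheme(f"{copies}-way mirror", 1, copies - 1)


def erasure_coding(k: int, m: int) -> RedundancyScheme:
    return RedundancyScheme(f"{k}+{m} erasure coding", k, m)


for scheme in (mirroring(2), mirroring(3), erasure_coding(8, 2)):
    print(
        f"{scheme.name}: tolerates {scheme.failures_tolerated} failure(s), "
        f"needs {scheme.raw_per_usable:.2f}x raw capacity"
    )
```

Note what the sketch leaves out: both schemes protect against failed disks or servers inside one system, and neither replaces an offsite copy if the whole system or site is lost.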

It’s easy when tasked with building your own solution to under-focus on durability – but you can get away with this only as long as it takes for the first failure to happen.

The rock and the hard place

Research computing professionals in charge of storage don’t have it easy. They are expected to deliver utterly robust and reliable solutions at the lowest possible costs. Propose spending more on a custom-built or turnkey solution, and the incremental gains are called into question. Save money with reliability shortcuts, and any issues or failures erode trust in the Research IT team.

And if you take too long delivering, end users may go around you and acquire their own cheap commodity storage (which, of course, will suffer from durability and performance issues). So it’s far from easy to pick a clear path -- just make sure you have all the right things in mind as you chart your course.

In summary

If/when rolling your own campus storage, be prepared to invest in terms of:

  • Dollars: You can buy cheap commodity systems, but be realistic about staffing and other costs.
  • Time: Everything takes longer than you think. The rule of thumb in the software industry is to multiply time estimates provided by engineers by 3; some of you may have experienced the same.
  • Expertise: You’ll need an onsite expert to build and run your storage, and you’ll need to be mindful of the risk if this person leaves (and how to guard against that).

Up next:

In our next article, we will discuss a new storage connector with a partner we’ll be announcing next week at SC17 - stay tuned and thanks for reading!