Case Study: The University of Alabama at Birmingham – OpenStack, Ceph, Dell • Kamesh Pemmaraju, Dell • John-Paul Robinson, UAB • OpenStack Summit 2014, Atlanta, GA
An overview • Dell – UAB backgrounder • What we were doing before • How the implementation went • What we’ve been doing since • Where we’re headed
Dell – UAB background • 900 researchers working on cancer and genomics projects • Their growing data sets challenged available resources • Research data was distributed across laptops, USB drives, local servers, and HPC clusters • Transferring datasets to HPC clusters took too much time and clogged shared networks • Distributed data management reduced researcher productivity and put data at risk • They therefore needed a centralized data repository for researchers to ensure compliance with data-retention requirements • They also wanted a cost-effective scale-out solution with hardware that could be re-purposed for compute & storage
Dell – UAB background (contd.) • Potential solutions investigated • Traditional SAN • Public cloud storage • Hadoop • UAB chose Dell/Inktank to architect a platform that would be highly scalable, provide a low cost per GB, and offer the best of both worlds: compute and storage on the same hardware.
A little background… • We didn’t get here overnight • 2000s-era High Performance Computing • ROCKS-based compute cluster • The Grid and proto-clouds • GridWay meta-scheduler • OpenNebula, an early entrant that connected grids with this thing called the cloud • Virtualization through-and-through • DevOps is US
Challenges and Drivers • Technology • Many hypervisors • Many clouds • We have the technology…can we rebuild it here? • Applications • Researchers started shouting “Data!”: NextGen sequencing, research data repositories, Hadoop • Researchers kept on shouting “Compute!”
Data Intensive Scientific Computing • We knew we needed storage and computing • We knew we wanted to tie it together with an HPC commodity scale-out philosophy • So in August 2012 we bought 10 Dell R720xd servers • 16-core • 96GB RAM • 36TB disk • A 192-core, ~1TB RAM, 360TB expansion to our HPC fabric • Now to integrate it…
December 2012 • Bob said: Hearing good things about OpenStack and Ceph this week at Dell World. Simon Anderson, CEO of DreamHost, spoke highly of Dell, OpenStack, and Ceph today. He is also chair of the company that supports Ceph. He also spoke highly of the Dell Crowbar deployment tool. • I said: Good to hear. I've been thinking a lot about Dell in this picture too. We have the building blocks in place. Might be a good way to speed the construction.
Lesson 1: Recognize when a partnership will help you achieve your goals.
The 2013 Implementation • The Timeline • In January we started our discussions with Dell and Inktank • By March we had committed to the fabric • A week in April and we had our own cloud in place • The Experience • Vendors committed to their product • Direct engagement through open communities • Bright people who share your development ethic
Next Step…Build Adoption • Defined a new storage product based on the commodity scale-out fabric • Able to focus on the strengths of Ceph to aggregate storage across servers • Provision images of any size to provide Flexible Block Storage (a short provisioning sketch follows this list) • Promote cloud adoption within IT and across the research community • Demonstrate utility with applications
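A minimal sketch of what “provision an image of any size” can look like against an OpenStack cloud whose Cinder backend is Ceph RBD. This uses today’s openstacksdk rather than whatever tooling UAB used in 2013, and the cloud name and volume name are placeholders:

```python
import openstack

# Connect using credentials from clouds.yaml; the cloud name is hypothetical.
conn = openstack.connect(cloud='uab-cloud')

# Create a block-storage volume of an arbitrary size (in GB). With Cinder
# configured for the RBD driver, this lands on the Ceph fabric.
volume = conn.block_storage.create_volume(name='research-scratch', size=500)
print(volume.id, volume.status)
```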
Applications • CrashPlan backup in the cloud • A couple of hours to provision the VM resources • An easy half-day deploy with the vendor because we controlled our resources, a.k.a. the firewall • Add storage containers on the fly as we grow…10TB in a few clicks • Gitlab hosting • Start a VM spec’d according to project size • Work with the Omnibus install. Hey, it uses Chef! • Research Storage • 1TB storage containers for cluster users • Uses Ceph RBD images and NFS (see the bulk-provisioning sketch below) • The storage infrastructure part was easy • Scaled provisioning: 100+ user containers (100TB) created in about 5 minutes • Add storage servers as existing ones fill
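The scaled per-user provisioning can be sketched with the Ceph librbd Python bindings. The pool name, image naming scheme, and user list below are illustrative assumptions, not UAB’s actual values:

```python
import rados
import rbd

# Connect to the cluster using the local ceph.conf and default keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')           # pool name is an assumption
    try:
        one_tb = 1024 ** 4                      # 1 TiB in bytes
        for user in ('alice', 'bob', 'carol'):  # hypothetical user list
            # One 1TB RBD image per user; these can then be exported over NFS.
            rbd.RBD().create(ioctx, 'research-{}'.format(user), one_tb)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```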
Lesson 2: Use it! That’s what it’s for! The sooner you start using the cloud the sooner you start thinking like the cloud.
How PoC Decisions Age Over Time • Pick the environment you want when you are in operation…you’ll be there before you know it • Simple networking is good • But don’t go basic unless you are able to reinstall the fabric • Class B ranges to match the campus fabric • We chose a split admin range to coordinate with our HPC admin range • We chose a collapsed admin/storage network due to a single switch…probably would have been better to keep them separate and allow growth • It’s OK to add non-provisioned interfacing nodes…know your net (see the addressing sketch after this list) • Avoid painting yourself into a corner • Don’t let the Paranoid Folk box in your deployment • An inaccessible fabric is an unusable fabric • Fixed IP range mismatch with “fake” reservations
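As an illustration of the “know your net” point, here is a small Python sketch of carving a campus-style Class B range into admin, storage, and fixed-IP pools up front. The 172.16.0.0/16 range and the /20 split are placeholders, not UAB’s real addressing:

```python
import ipaddress

campus = ipaddress.ip_network('172.16.0.0/16')   # placeholder Class B range

# Carve the /16 into /20 subnets and earmark a few for specific roles early,
# so later growth does not collide with ranges the deployer already claimed.
admin_net, storage_net, fixed_net, *spare = campus.subnets(new_prefix=20)
print('admin:  ', admin_net)
print('storage:', storage_net)
print('fixed:  ', fixed_net)
print('spare:  ', len(spare), 'subnets held in reserve')
```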
Lesson 3: The fabric is flexible. Let it help you solve your problems
Problems will Arise • The release version of the ixgbe driver in the Ubuntu 12.04.1 kernel didn’t perform well with our 10Gbit cards • Open source has an upstream • Use it as part of debugging the network (a quick driver check is sketched below) • Upgrading the drivers was a simple fix • Sometimes when you fix something you break something else • There are still a lot of moving parts, but each has a strong open source community • Work methodically • You will learn as you go • Recognize the stack is integrated and respect tool boundaries
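When chasing an issue like the ixgbe one, the first step is usually confirming which driver and version an interface is actually running. A minimal sketch wrapping ethtool from Python; the interface name is a placeholder and ethtool must be installed:

```python
import subprocess

iface = 'eth2'  # placeholder name for the 10GbE NIC

# `ethtool -i` reports the driver name, driver version, and firmware version,
# which is enough to confirm whether the upgraded ixgbe module is loaded.
info = subprocess.check_output(['ethtool', '-i', iface], text=True)
print(info)
```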
Sometimes a Problem is just a Problem • Code example
Lesson 4: The code *is* the documentation …and that’s a *good* thing
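One small, hypothetical illustration of treating the code as documentation: jump straight to the installed source of the component you are debugging. The 'rbd' module is just an example and must be installed for this to run:

```python
import importlib
import inspect

mod = importlib.import_module('rbd')  # any installed component works here
# getsourcefile() returns None for C extensions, so fall back to the module path.
print(inspect.getsourcefile(mod) or mod.__file__)
```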
Where we are today • OpenStack plus Ceph are here to stay for our Research Computing System • They give us the flexibility we need for an ever expanding research applications portfolio • Move our UAB Galaxy NextGen Sequencing platform to our Cloud • Add Object Storage services • Put the cloud in the hands of researchers • The big question…
…how far can we take it? • The goal of process automation is scale • Incompatible, non-repeatable, manual processes are a cost • Success is in dual-use • Satisfy your needs and customer demand • Automating a process implies documenting the process…great for compliance and repeatability • Recognize the latent talent in your staff: today’s system admins are tomorrow’s systems developers • Traditional infrastructure models are ripe for replacement
Lesson 5? You can learn from research and engage as a partner.
Want to learn more about Dell + OpenStack + Ceph? Join the session, 2:00 pm, Tuesday, Room #313: Software Defined Storage, Big Data and Ceph - What Is all the Fuss About? Neil Levine, Inktank & Kamesh Pemmaraju, Dell