This talk covers the history of and advances in scalable storage system design, including scalable file systems, the Google File System, the Hadoop File System, and related systems, and explores the challenges and solutions involved in making storage systems faster and more scalable.
Trends in Scalable Storage System Design and Implementation • Prof. Matthew O’Keefe, Department of Electrical and Computer Engineering, University of Minnesota • March 15, 2012
Organization of Talk • History: work in 1990’s and 2000’s • Scalable File Systems • Google File System • Hadoop File System • Sorrento, Ceph, Ward Swarms • Comment on FLASH • Questions…
Prior Work Part I: Generating Parallel Code • Detecting parallelism in scientific codes, generating efficient parallel code • Historically, this had been done on a loop-by-loop basis • Distributed memory parallel computers required more aggressive optimization • Parallel programming is still a lot like assembly language programming • The scope of code to analyze and optimize increases as parallelism increases • What’s needed is a way to express the problem solution at a much higher level, from which efficient code can be generated • Leverage design patterns and translation technologies to reduce the semantic gap
Prior Work Part II: Making Storage Systems Go Faster and Scale More
Storage System Scalability/Speed • Storage interface standards lacked the ability to scale in both speed and connectivity • Industry countered this with a new standard: Fibre Channel • Allowed shared disks, but system software like file systems and volume managers was not built to exploit this • Started the Global File System project at U. of Minnesota in 1995 to counter this • Started Sistina Software to commercialize GFS and LVM, sold to Red Hat in 2003 • Worked for Red Hat for 1.5 years, then (being a glutton for punishment) started another company (Alvarri) to do cloud backup in 2006 • Alvarri sold to Quest Software in late 2008; its dedupe engine is part of Quest’s backup product • So: two commercial products developed and still shipping
Making Storage Systems Faster and More Scalable • GFS pioneered several interesting techniques for cluster file systems: • no central metadata server • distributed journals for performance, fast recovery • first Distributed Lock Manager for Linux — now used in other cluster projects in Linux • Implemented POSIX IO • Assumption at the time was: POSIX is all there is, so we have to implement that • Kind of naïve; assumed that it had to be possible • UNIX/Windows view files as a linear stream of bytes which can be read/written anywhere in the file by multiple processors • Large files, small files, millions of files, directory tree structure, synchronous write/read semantics, etc. all make POSIX difficult to implement
Why POSIX File Systems Are Hard • They’re in the kernel and tightly integrated with complex kernel subsystems like virtual memory • Byte-granularity, coherency, randomness • Users expect them to be extremely fast, reliable, and resilient • Add parallel clients and large storage networks (e.g., Lustre or Panasas) and things get even harder • POSIX IO was the emphasis for parallel HPC IO (1999 through 2010) until recently • HPC community is re-thinking this • Web/cloud has already moved on
Meanwhile: Google File System and its Clone (Hadoop) • Google and others (Hadoop) went a different direction: change the interface from POSIX IO to something inherently more scalable • Users have to write (re-write) applications to exploit the interface • All about scalability — using commodity server hardware — for a specific kind of workload • Hardware-software co-design: • append-only write semantics from parallel producers • mostly write-once, read many times by consumers • explicit contract on performance expectations: small reads and writes — Fuggedaboutit! • Obviously quite successful, and Hadoop is becoming something of an industry standard (at least the APIs) • Lesson: if solving the problem is really, really hard, look at it a different way, move interfaces around, change your assumptions (e.g., as in the parallel programming problem)
Google/Hadoop File Systems • Google needed a storage system for its web index and various applications — enormous scale • GFS paper at the SOSP conference in 2003 led to development of Hadoop, an open source GoogleFS clone • Co-designed file system with applications • Applications use the map-reduce paradigm • Streaming (complete) reads of very large files/datasets; process this data into a reduced form (e.g., an index) • File access is write-once, append-only, read-many
Map-Reduce • cat * | grep | sort | uniq -c | cat > file • input | map | shuffle | reduce | output • Simple model for parallel processing • Natural for: log processing, web search indexing, ad-hoc queries • Popular at Facebook, Google, Amazon, etc. to determine what ads/products to throw at you
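To make the pipeline analogy concrete, here is a minimal word-count sketch of the map/shuffle/reduce stages in Python; the function names and toy data are purely illustrative and have nothing to do with Hadoop's actual API.

# Minimal word-count sketch of the map | shuffle | reduce pipeline above.
# Names and structure are illustrative only, not Hadoop's actual API.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit (word, 1) for every word in the input records
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: bring all values for the same key together (like "sort")
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # reduce: collapse each group to a count (like "uniq -c")
    for key, group in grouped:
        yield (key, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}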
Scalable File System Goals • Build with commodity components that frequently fail (to keep things cheap) • So the design assumes failure is the common case • Nodes incrementally join and leave the cluster • Scale to 10s to 100s of petabytes, headed towards exabytes; thousands to hundreds of thousands of storage nodes and clients • Automated administration, simplified recovery (in theory, not practice)
Hadoop Slides • Hadoop is open source • Until recently missing these GoogleFS features: • Snapshots • HA for the NameNode • Multiple writers (appenders) to a single file • GoogleFS • Within 12 to 18 months, GoogleFS was found to be inadequate for Google’s scaling needs; Google had to build federated GoogleFS clusters
Other Research on Scalable Storage Clusters (none in production yet) • Ceph (Sage Weil, UCSC): POSIX lite • Multiple metadata servers, dynamic workload balancing • Mathematical hash to map file segments to nodes (see the sketch below) • Sorrento (UCSB): POSIX with low write-sharing • Distributed algorithm for capacity and load balancing, distributed metadata • Lazy consistency semantics • Ward Swarms (Lee Ward, Sandia) • Similar to Sorrento, uses victim cache and storage tiering
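A hedged sketch of the hash-based placement idea, in the spirit of Ceph's approach (not its actual CRUSH algorithm): because the mapping from segment to nodes is a pure function of the segment's name, any client can compute where data lives without consulting a central metadata server. The node names and replica count below are assumptions for illustration.

# Toy hash-based placement: deterministic, so no metadata lookup is needed.
import hashlib

NODES = ["osd0", "osd1", "osd2", "osd3"]   # hypothetical storage nodes
REPLICAS = 2

def place(file_id, segment_no, nodes=NODES, replicas=REPLICAS):
    # Hash (file, segment) to a starting node, then take successors
    # around the ring for the remaining replicas.
    key = f"{file_id}:{segment_no}".encode()
    start = int(hashlib.sha1(key).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(place("/data/log.0", 17))   # e.g. ['osd2', 'osd3'] (depends on the hash)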
Reliability Motivates Action • DoE is building exascale machines with 100,000+ separate multicore processors for simulations • Storage systems for current teraflop and petaflop systems do not provide sufficient reliability or speed at scale • Just adding more disks does not solve the problem; it requires a whole new design philosophy
Semantics Motivate Action • IEEE POSIX I/O is nice and familiar • Portable and guarantees coherent access and views • Stable API • But it cannot scale
Architecture Motivates Action • Existing, accepted solutions all use the same architecture • Based on “Zebra”: a centrally managed metadata service and a collection of stores • Examples: Lustre, Minnesota GFS, Panasas File System, IBM GPFS, etc. • Existing at-scale file systems are challenged • Managing all the simultaneous transfers is daunting • Managing the number of components is daunting • Anticipate multiple orders of magnitude growth with the move to exascale
Nebula: A Peer-To-Peer Inspired Architecture • A self-organizing store • Peers join the swarm to meet aggregate bandwidth needs • A self-reorganizing store • Highly redundant components supply symmetric services • Failure of a component or comms link only motivates the search for an equivalent peer
Node-Local Capacity Management • Since we want to make more replicas than required, for performance reasons • We must not tolerate a concept of “full” • We must be able to eject file segments • Can’t do this in a file system; file systems are persistent stores • Can do this in a cache! • But scary things can happen • If all nodes are equal partners we can thrash • To the point where we deadlock, even • And at some point, somewhere, we must guarantee persistence
Cache Tiers • Need to order the storage nodes by capability • Some combination of storage latency, storage bandwidth, how full, communication performance characteristics • A tuple (which can be ordered) can approximate all of these • With an ordering, we can apply concepts like better and worse • When we can apply “better” and “worse” we can generate a set of rules • That tell us when and whether we can eject a segment • That tell us how we may eject a segment • Effectively, we will impose a tier-like structure
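One way to read the "tuple which can be ordered" idea, as a hedged sketch: each node advertises a small capability tuple, and ordinary lexicographic tuple comparison gives the better/worse ordering directly. The fields chosen here (latency, fullness, bandwidth) and their priority are illustrative assumptions, not the project's actual metric.

# Hedged sketch: order storage nodes by a capability tuple.
def rank(node):
    # Lower latency, emptier, higher bandwidth => sorts earlier => "better".
    return (node["latency_ms"], node["fraction_full"], -node["bandwidth_mb_s"])

nodes = [
    {"name": "tape-node", "latency_ms": 4000, "fraction_full": 0.1, "bandwidth_mb_s": 40},
    {"name": "ssd-node",  "latency_ms": 0.2,  "fraction_full": 0.6, "bandwidth_mb_s": 900},
    {"name": "disk-node", "latency_ms": 8.0,  "fraction_full": 0.3, "bandwidth_mb_s": 120},
]
best_to_worst = sorted(nodes, key=rank)
print([n["name"] for n in best_to_worst])   # ['ssd-node', 'disk-node', 'tape-node']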
Think of it as a Sliding Scale • “Worse” end: more reliable components, larger capacity, lower performance, maybe even offline; tends toward less churn and older data • “Better” end: less reliable, maybe even volatile, high performance; tends toward very high churn and only current data
Victim-Caches • With a tier-like concept we can implement victim-caches • Just a cache that another cache ejects into • We introduce a couple of relatively simple rules to motivate all kinds of things • When ejecting, if a copy must be maintained then eject into a victim cache that lies within a similar or “worse” tier • “Worse” translates as more persistence, potentially higher latency and lower bandwidth • When an application is recruiting for performance, the best candidates are those from “better” tiers • “Better” translates as more volatile, lower latency and higher bandwidth
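A hedged sketch of that ejection rule: a cache that must make room pushes the victim segment into a peer in the same or a "worse" (more persistent, slower) tier, so a persistent copy always survives somewhere. The classes, tier numbers, and capacities below are illustrative assumptions, not Nebula's implementation.

# Illustrative victim-cache ejection cascading toward "worse" tiers.
class TierCache:
    def __init__(self, tier, capacity, victim=None):
        self.tier = tier            # 0 is "best"; larger numbers are "worse"
        self.capacity = capacity
        self.victim = victim        # the cache we eject into, if any
        self.segments = {}          # segment id -> data

    def put(self, seg_id, data, must_persist=True):
        if len(self.segments) >= self.capacity:
            self._eject(must_persist)
        self.segments[seg_id] = data

    def _eject(self, must_persist):
        victim_id, victim_data = self.segments.popitem()
        if must_persist:
            assert self.victim is not None and self.victim.tier >= self.tier, \
                "a persistent copy must land in a similar or worse tier"
            self.victim.put(victim_id, victim_data, must_persist=True)
        # otherwise the segment may simply be dropped

archive = TierCache(tier=2, capacity=1000)          # slow, persistent
disk    = TierCache(tier=1, capacity=10, victim=archive)
ram     = TierCache(tier=0, capacity=2,  victim=disk)
for i in range(5):
    ram.put(f"seg{i}", b"...")                      # overflow cascades downward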
Data Movement: Writing • App writes into highly volatile, local swarm • Which is small and must eject • To less performant, more persistent nodes • Which must eject • To a low performance, high persistence source
Reading is similar • First, find all the parts of the file • Start requesting segments • If not happy with the achieved bandwidth • Find some nodes in a higher, “better” tier and ask them to stage parts of the file • Since they will use the same sources as the requestor, they will tend to ask nearby tiers to stage as well in order to provide the bandwidth they require • Rinse, wash, repeat throughout • Start requesting segments from them as well, redundantly • As they ramp up, source-bias logic will shift focus onto the recruited nodes
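A hedged sketch of that read loop: keep reading from the current sources and, whenever achieved bandwidth falls short of the target, recruit a node from a "better" tier, ask it to stage the file, and add it to the source set. The node model, bandwidth estimate, and numbers are invented for illustration only.

# Illustrative read loop: recruit better-tier nodes when bandwidth is short.
class Node:
    def __init__(self, name, mb_s):
        self.name, self.mb_s = name, mb_s
        self.staged = set()
    def stage(self, segs):
        self.staged.update(segs)      # pre-fetch segments from slower sources

def read_file(segments, sources, better_nodes, target_mb_s):
    for seg in segments:
        achieved = sum(n.mb_s for n in sources)          # crude bandwidth estimate
        if achieved < target_mb_s and better_nodes:
            recruit = better_nodes.pop(0)                # grab a faster node
            recruit.stage(segments)                      # ask it to stage the file
            sources.append(recruit)                      # and read from it, redundantly
        print(seg, "from", [n.name for n in sources])

archive = Node("archive", 40)
ssd_a, ssd_b = Node("ssd-a", 900), Node("ssd-b", 900)
read_file(["seg0", "seg1", "seg2"], [archive], [ssd_a, ssd_b], target_mb_s=1000)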
Current Status of Project • Primitive client and storage node prototype working now • Looking for students (undergrad and grad students) — happy to co-advise with other faculty • Ramping up… • Using memcached and libmemcached (name-value store) as a development and research tool: support HPC and web workloads • Nebula storage server semantics are richer • Question: how much does that help?
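For context on that comparison, the memcached interface being used as a development baseline is just a flat key-value get/set. A minimal example with the python-memcached client (the server address and key naming are assumptions) shows how little the interface expresses compared to richer storage-node semantics.

# Minimal memcached usage via the python-memcached client; the server
# address is an assumption. The interface is a flat get/set on opaque
# values -- no directories, segments, tiers, or placement policy.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])
mc.set("file:/data/log.0:seg:17", b"segment bytes", time=0)   # time=0: no expiry
blob = mc.get("file:/data/log.0:seg:17")
print(blob)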
Software-Hardware Co-Design of a First-Tier Storage Node • New hardware technologies (SSD, hybrid DRAM) are pushing the limits of the OS storage/networking stack • Question: Is it possible to co-design custom hardware and software for a first-tier storage node? • File system in VLSI: BlueArc NAS • Design the FPGA for a specific performance goal
Nibbler: Co-Design Hardware and Software • Accelerates memcached and Nebula storage node performance via SSDs and hybrid DRAM • Hardware-software co-design • e.g., BlueArc/HDS puts the file system in VLSI • Large, somewhat volatile memory • Performance first, but also power and density • The first-tier storage node will drive the ultimate performance achievable by this design
Top 10 Lessons in Software Development • Be humble (seriously) • Be paranoid: for large projects with aggressive performance or reliability goals, you have to do just about everything right • Be open: to new ideas, new people, new approaches • Make good senior people mentor the next generation (e.g., pair programming): the temptation is to reduce/remove their distractions • Let the smartest, most capable people bubble up to the top of the organization • Perfect is the enemy of the good • Take upstream processes seriously, but beware the fuzzy front-end • Upstream (formal) inspections, not just reviews • For projects of any size, never abandon good SE practices, especially rules around coding style, regularity, commenting, and documentation • Never stop learning: software engineering practices and tools, new algorithms and libraries, etc. — and if in management, create an environment that lets people learn