HPC Storage: Current Status and Futures
Torben Kling Petersen, PhD
Principal Architect, HPC
Agenda
• Where are we today?
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Solutions
• Final thoughts …

Pan Galactic Gargle Blaster: "Like having your brains smashed out by a slice of lemon wrapped around a large gold brick."
Current Top 10 …
n.b. NCSA Blue Waters: 24 PB, 1100 GB/s (Lustre 2.1.3)
Other parallel file systems
• GPFS
  • Running out of steam?
  • Let me qualify !! (and he then rambles on …)
• Fraunhofer
  • Excellent metadata performance
  • Many modern features
  • No real HA
• Ceph
  • New, interesting, and with a LOT of good features
  • Immature and with a limited track record
• Panasas
  • Still RAID 5 and running out of steam …
Object-based storage
• A traditional file system comprises a hierarchy of files and directories
  • Accessed via a file system driver in the OS
• Object storage is "flat"; objects are located by direct reference
  • Accessed via custom APIs: Swift, S3, librados, etc.
• The difference boils down to two questions:
  • How do you find files?
  • Where do you store metadata?
• Object store + metadata + driver is a file system (see the sketch below)
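To make that last bullet concrete, here is a minimal, hypothetical Python sketch: a flat store of objects addressed only by an opaque ID, plus a separate metadata map from path names to object IDs, already behaves like a tiny file system. The class and method names are invented for illustration and do not correspond to the Swift, S3, or librados APIs.

```python
import hashlib

class FlatObjectStore:
    """Flat object store: objects are addressed only by an opaque ID."""
    def __init__(self):
        self._objects = {}                       # object_id -> bytes

    def put(self, data: bytes) -> str:
        oid = hashlib.sha256(data).hexdigest()   # content-addressed object ID
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

class MetadataLayer:
    """Namespace metadata (path -> object ID) layered on top of the flat store."""
    def __init__(self, store: FlatObjectStore):
        self._store = store
        self._namespace = {}                     # "/path/to/file" -> object_id

    def write_file(self, path: str, data: bytes) -> None:
        self._namespace[path] = self._store.put(data)

    def read_file(self, path: str) -> bytes:
        return self._store.get(self._namespace[path])

if __name__ == "__main__":
    fs = MetadataLayer(FlatObjectStore())
    fs.write_file("/home/user/hello.txt", b"hello, object world")
    print(fs.read_file("/home/user/hello.txt"))
```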
Object storage backend: why?
• It's more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture rather than a file system.
• It's more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It's simpler. With the file-system-"isms" removed, an elegant (= scalable, flexible, reliable) foundation can be laid.
Elephants all the way down …
Most clustered file systems and object stores are built on local file systems, and inherit their problems:
• Native FS: XFS, ext4, ZFS, btrfs
• Object store on FS: Ceph on btrfs; Swift on XFS
• FS on object store on FS: CephFS on Ceph on btrfs; Lustre on OSS on ext4; Lustre on OSS on ZFS
The way forward …
• Object-store-based solutions offer a lot of flexibility:
  • Next-generation design for exascale-level size, performance, and robustness
  • Implemented from scratch: "If we could design the perfect exascale storage system …"
  • Not limited to POSIX
  • Non-blocking availability of data
• Multi-core aware
  • Non-blocking execution model with thread-per-core
• Support for non-uniform hardware
  • Flash, non-volatile memory, NUMA
• Using abstractions, guided interfaces can be implemented
  • e.g., for burst buffer management (pre-staging and de-staging); see the sketch below
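To illustrate what pre-staging and de-staging mean in practice, here is a hypothetical Python sketch that copies a job's input files from a parallel file system into a node-local burst buffer before the job runs, and drains the results back afterwards. The function names, directory layout, and demo paths are all assumptions made for this example, not part of any real burst-buffer API.

```python
import shutil
import tempfile
from pathlib import Path

def stage_in(pfs_dir: Path, bb_dir: Path, inputs):
    """Pre-stage input files from the parallel FS into the burst buffer."""
    bb_dir.mkdir(parents=True, exist_ok=True)
    for name in inputs:
        src = pfs_dir / name
        if src.exists():                       # tolerate missing files in this sketch
            shutil.copy2(src, bb_dir / name)

def stage_out(pfs_dir: Path, bb_dir: Path, outputs):
    """De-stage (drain) job output from the burst buffer back to the parallel FS."""
    for name in outputs:
        result = bb_dir / name
        if result.exists():
            shutil.copy2(result, pfs_dir / name)

if __name__ == "__main__":
    # Stand-in directories; a real job would use e.g. a Lustre project directory
    # and a node-local NVMe mount instead of temp dirs.
    pfs = Path(tempfile.mkdtemp(prefix="pfs_"))
    bb = Path(tempfile.mkdtemp(prefix="bb_"))
    (pfs / "input.dat").write_bytes(b"simulation input")

    stage_in(pfs, bb, ["input.dat"])
    (bb / "checkpoint.dat").write_bytes(b"results")   # the job writes at burst-buffer speed
    stage_out(pfs, bb, ["checkpoint.dat"])
    print(sorted(p.name for p in pfs.iterdir()))      # ['checkpoint.dat', 'input.dat']
```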
Interconnects (disk and fabrics)
• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach
• Ethernet
• InfiniBand
• Next-gen interconnects …
12 Gbit SAS
• Doubles bandwidth compared to 6 Gbit SAS
• Triples the IOPS
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s to follow
  • 2 streams moving to 4 streams (see the quick arithmetic below)
• 24 Gbit/s SAS is on the drawing board …
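A back-of-the-envelope check of those figures, assuming 8b/10b line encoding and a 4-lane wide port (both assumptions on my part, not stated on the slide):

```python
# Rough bandwidth check for 12 Gbit/s SAS.
LINE_RATE_GBIT = 12.0           # raw line rate per lane
ENCODING_EFFICIENCY = 8 / 10    # assumed 8b/10b line encoding
LANES = 4                       # assumed 4-lane wide port

per_lane_gbyte = LINE_RATE_GBIT * ENCODING_EFFICIENCY / 8      # ~1.2 GB/s per lane
wide_port_gbyte = per_lane_gbyte * LANES                       # ~4.8 GB/s per direction

print(f"per lane:    {per_lane_gbyte:.1f} GB/s")
print(f"4-lane port: {wide_port_gbyte:.1f} GB/s per direction")
print(f"8-lane port: {2 * wide_port_gbyte:.1f} GB/s per direction")  # the 9.6 GB/s figure
```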
PCI-E direct attach storage
• M.2 solid state storage modules
• Lowest latency / highest bandwidth
• Limited by the number of PCI-E lanes available
  • Ivy Bridge provides up to 40 lanes per chip
Ethernet – still going strong?
• Ethernet has now been around for 40 years
• Currently around 41% of Top500 systems: 28% 1 GbE, 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  • Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  • 400 GbE planned for 2017
Next-gen interconnects …
• Intel acquired QLogic's InfiniBand team …
• Intel acquired Cray's interconnect technologies …
• Intel has published on, and shown, silicon photonics …
• And this means WHAT?
Hard drive futures …
• Sealed helium drives (Hitachi)
  • Higher density: 6 platters / 12 heads
  • Less power (~1.6 W idle)
  • Less heat (~4 °C lower temperature)
• SMR drives (Seagate)
  • Denser packaging on current technology
  • Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  • Enterprise edition
  • Transparent SSD/HDD combination (aka Fusion drives)
  • eMLC + SAS
Hard drive futures …
• HAMR drives (Seagate)
  • Use a laser to heat the magnetic substrate (iron/platinum alloy)
  • Projected capacity: 30-60 TB per 3.5-inch drive …
  • 2016 timeframe …
• BPM (bit-patterned media recording)
  • Stores one bit per cell, as opposed to conventional hard-drive technology, where each bit is spread across a few hundred magnetic grains
  • Projected capacity: 100+ TB per 3.5-inch drive …
What about RAID?
• RAID 5: no longer viable
• RAID 6: still OK
  • But rebuild times are becoming prohibitive (see the estimate below)
• RAID 10
  • OK for SSDs and small arrays
• RAID-Z/Z2 etc.
  • A choice, but limited functionality on Linux
• Parity-declustered RAID
  • Gaining a foothold everywhere
  • But … PD-RAID ≠ PD-RAID ≠ PD-RAID …
• No RAID?
  • Using multiple distributed copies works, but …
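A rough, hypothetical estimate of why RAID 6 rebuilds become prohibitive: a rebuild is bounded by how fast the single replacement drive can be filled, so the time scales with capacity. The drive sizes and the 100 MB/s sustained rebuild rate below are illustrative assumptions, not figures from the slide.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float = 100.0) -> float:
    """Lower bound on rebuild time: refill one replacement drive at a sustained rate."""
    capacity_mb = capacity_tb * 1_000_000      # decimal TB -> MB, as drive vendors count
    return capacity_mb / rebuild_mb_per_s / 3600

# Assumed 100 MB/s sustained rebuild rate, which is optimistic under a live workload.
for tb in (1, 4, 8):
    print(f"{tb} TB drive: ~{rebuild_hours(tb):.0f} h minimum rebuild time")
```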
Flash (NAND)
• Supposed to "take over the world" [cf. Pinky and the Brain]
• But for high-performance storage there are issues …
  • Price and density not following the predicted evolution
  • Reliability (even for SLC) not as expected
  • Latency issues
    • SLC access ~25 µs, MLC ~50 µs …
  • Larger chips increase contention
    • Once a flash die is accessed, other dies on the same bus must wait
    • Up to 8 flash dies share a bus
  • Address translation, garbage collection, and especially wear leveling add significant latency
Flash (NAND)
• MLC: 3-4 bits per cell @ 10K program/erase cycles
• SLC: 1 bit per cell @ 100K program/erase cycles
• eMLC: 2 bits per cell @ 30K program/erase cycles
  • (see the endurance sketch below)
• Disk drive form factors (S-ATA / SAS bandwidth limitations)
• PCI-E accelerators
• PCI-E direct attach
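As a rough illustration of what those cycle counts mean for device lifetime, total bytes written before wear-out can be approximated as capacity × P/E cycles ÷ write amplification. The 400 GB capacity and the write-amplification factor of 2 below are assumptions for the sake of the example; only the cycle counts come from the slide.

```python
def tbw_estimate(capacity_gb: float, pe_cycles: int, write_amplification: float = 2.0) -> float:
    """Rough endurance in TB written: capacity * P/E cycles / write amplification."""
    return capacity_gb * pe_cycles / write_amplification / 1000   # GB -> TB

# Cycle counts from the slide; 400 GB device and WAF of 2 are assumed.
for cell_type, cycles in (("SLC", 100_000), ("eMLC", 30_000), ("MLC", 10_000)):
    print(f"{cell_type:>4}: ~{tbw_estimate(400, cycles):,.0f} TB written before wear-out")
```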
NV-RAM
• Flash is essentially NV-RAM, but …
• Phase-change memory (PCM)
  • Significantly faster and denser than NAND
  • Based on chalcogenide glass
  • A thermal rather than an electronic process
  • More resistant to external factors
  • Currently the expected solution for burst buffers etc. …
• But there's always Hybrid Memory Cubes …
Solutions …
• Size does matter …
  • 2014-2016: more than 20 proposals for 40+ PB file systems, running at 1-4 TB/s
• Heterogeneity is the new buzzword
  • Burst buffers, data capacitors, cache off-loaders …
  • Mixed workloads are now taken seriously …
• Data integrity is paramount
  • T10-DIF/X is a decent start, but … (see the guard-tag sketch below)
• Storage system resiliency is equally important
  • PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace
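For context on T10-DIF: each 512-byte sector carries an 8-byte protection tuple whose 16-bit guard tag is a CRC over the sector data. Below is a minimal Python sketch of that CRC-16 using the polynomial commonly documented for T10-DIF (0x8BB7, zero initial value, no reflection); it is a plain bit-by-bit implementation for illustration only, not production code, and the exact parameters should be checked against the T10 specification.

```python
T10_DIF_POLY = 0x8BB7   # CRC-16 polynomial commonly documented for the T10-DIF guard tag

def t10dif_crc16(data: bytes, crc: int = 0x0000) -> int:
    """Bit-by-bit CRC-16 over sector data (no reflection, no final XOR)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ T10_DIF_POLY) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

# Example: guard tag over one 512-byte sector of sample data.
sector = bytes(range(256)) * 2
print(f"guard tag: 0x{t10dif_crc16(sector):04X}")
```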
Final thoughts?
• Object-based storage
  • Not an IF but a WHEN …
  • Flavor(s) still TBD: DAOS, Exascale10, XEOS, …
• Data management is core to any solution
  • Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block …
• HPC storage = Big Data
• Live data:
  • Cache ↔ Near line ↔ Tier 2 ↔ Tape? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute
• Storage is no longer a second-class citizen
Thank You
torben_kling_petersen@xyratex.com