Advanced Lustre® Infrastructure Monitoring (Resolving the Storage I/O Bottleneck and managing the beast)
Torben Kling Petersen, PhD, Principal Architect, High Performance Computing
The REAL challenge
• File system
  • Up/down
  • Slow
  • Fragmented
  • Capacity planning
  • HA (fail-overs, etc.)
• Hardware
  • Nodes crashing
  • Components breaking
  • FRUs
  • Disk rebuilds
  • Cables ??
• Software
  • Upgrades / patches ??
  • Bugs
  • Clients
  • Quotas
  • Workload optimization
• Other
  • Documentation
  • Scalability
  • Power consumption
  • Maintenance windows
  • Back-ups
The Answer ??
• Tightly integrated solutions
  • Hardware
  • Software
  • Support
• Extensive testing
• Clear roadmaps
• In-depth training
• Even more extensive testing …
ClusterStor Software Stack Overview
• ClusterStor 6000 Embedded Application Server
  • Intel Sandy Bridge CPU, up to 4 DIMM slots
  • FDR & 40GbE front-end, SAS-2 (6G) back-end
  • SBB v2 form factor, PCIe Gen-3
  • Embedded RAID & Lustre support
• Software stack (on the embedded server modules of the CS 6000 SSU):
  • ClusterStor Manager
  • Lustre File System (2.x)
  • Data Protection Layer (RAID 6 / PD-RAID)
  • Linux OS
  • Unified System Management (GEM-USM)
ClusterStor dashboard
[Dashboard screenshot: problems found]
Let’s do some math ….
• Large systems use many HDDs to deliver both performance and capacity
• NCSA Blue Waters uses 17,000+ HDDs for the main scratch FS
• At 3% AFR this means 531 HDDs fail annually
  • That’s ~1.5 drives per day !!!!
• RAID 6 rebuild time under load is 24–36 hours
• Bottom line: the scratch system would NEVER be fully operational, and there would constantly be a risk of losing additional drives, leading to data loss !!
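To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python. Only the "17,000+" drive count and the 3% AFR come from the slide; the exact 17,700 figure is an assumption chosen because it reproduces the 531 failures quoted above.

```python
# Back-of-the-envelope failure-rate arithmetic from the slide above.
# Assumed fleet size: ~17,700 drives (the "17,000+" on Blue Waters); AFR: 3%.

drives = 17_700          # assumed scratch-FS drive count
afr = 0.03               # 3% annual failure rate

annual_failures = drives * afr
failures_per_day = annual_failures / 365

print(f"Expected failures per year: {annual_failures:.0f}")   # ~531
print(f"Expected failures per day:  {failures_per_day:.2f}")  # ~1.45
```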
Drive Technology/Reliability
• Xyratex pre-tests all drives used in ClusterStor™ solutions
  • Each drive is subjected to 24–28 hours of intense I/O
  • Reads and writes are performed to all sectors
  • Ambient temperature cycles between 40 °C and 5 °C
  • Any drive that survives goes on to additional testing
• As a result, Xyratex disk drives deliver proven reliability with less than a 0.3% annual failure rate
• Real-life impact
  • On a large system such as NCSA Blue Waters with 17,000+ disk drives, this means a predicted failure of 50 drives per year
  • “Other vendors” publicly state a failure rate of 3%*, which (given an equivalent number of disk drives) means 500+ drive failures per year
    • With a fairly even distribution of failures, the file system will ALWAYS be in a state of rebuild
    • In addition, since a file system with wide stripes performs according to its slowest OST, the entire system will always run in degraded mode …

*DDN, Keith Miller, LUG 2012
Annual Failure Rate of Xyratex Disks
• Actual AFR data (2012/13) experienced by Xyratex-sourced SAS drives
• Xyratex drive failure rate is less than half of the industry standard!
• At 0.3%, the annual failure count would be 53 HDDs
Evolution of HDD Technology: Impacts on System Rebuild Time
• As growth in areal density slows (<25% per generation), disk drive manufacturers have to increase the number of heads and platters per drive to keep increasing maximum capacity per drive year over year
  • 2TB drives today typically include just 5 heads and 3 platters
  • 6TB drives in 2014 will include a minimum of 12 heads and 6 platters
• More components will inevitably result in an increase in disk drive failures in the field
• Therefore systems using 6TB drives must be able to handle the increase in the number of array rebuild events
Why Does HDD Reliability Matter?
• The three key factors you must consider are drive reliability, drive size, and the rebuild rate of your system
• The scary fact is that new-generation, bigger drives will fail more often
• Drive failures have an even greater impact on file system performance, and on the risk of data loss, when using bigger drives such as 6TB or larger !!
  • The rebuild window is longer and the risk of data loss is greater
  • Traditional RAID technology can take days to rebuild a single failed 6TB drive (see the sketch below)
• Therefore Parity De-clustered RAID rebuild technology is essential for any HPC system
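To see roughly where "days" comes from, here is a crude estimate. The ~20 MB/s effective rebuild rate under client load is an assumed figure (not from the slides), picked so that the 2TB case lands inside the 24–36 hour window quoted earlier; real rates vary with controller load and concurrent I/O.

```python
# Rough rebuild-time estimates for a traditional (non-declustered) RAID 6 rebuild.
# The ~20 MB/s effective rebuild rate while serving client I/O is an assumption.

rebuild_rate_mb_s = 20  # assumed effective rebuild rate under load

for capacity_tb in (2, 6):
    capacity_mb = capacity_tb * 1_000_000          # decimal TB -> MB
    hours = capacity_mb / rebuild_rate_mb_s / 3600
    print(f"{capacity_tb} TB drive: ~{hours:.0f} h (~{hours / 24:.1f} days)")

# 2 TB drive: ~28 h (~1.2 days)
# 6 TB drive: ~83 h (~3.5 days)
```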
Parity Declustered RAID - Geometry
• PD-RAID geometry for an array is defined as P drives (N+K+A), for example: 41 (8+2+2) (see the toy sketch below)
  • P is the total number of disks in the array
  • N is the number of data blocks per stripe
  • K is the number of parity blocks per stripe
  • A is the number of distributed spare disk drives
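The sketch below is a toy Python illustration of declustering, not Xyratex's actual GridRAID layout algorithm: stripes of width N+K are placed pseudo-randomly across all P drives, so when one drive fails its rebuild reads are served by nearly every surviving drive rather than by a single N+K group. The geometry 41 (8+2+2) comes from the slide; the stripe count and random placement are illustration choices.

```python
# Toy illustration of why parity declustering speeds up rebuilds.
# Randomised placement stands in for a real declustered layout.

import random
from collections import Counter

P, N, K, A = 41, 8, 2, 2       # geometry from the slide: 41 (8+2+2)
stripe_width = N + K           # drives touched by each stripe
num_stripes = 2000             # arbitrary number of stripes for the demo

rng = random.Random(0)
stripes = [rng.sample(range(P), stripe_width) for _ in range(num_stripes)]

failed = 0                     # pretend drive 0 fails
affected = [s for s in stripes if failed in s]
helpers = Counter(d for s in affected for d in s if d != failed)

print(f"Stripes touching the failed drive: {len(affected)}")
print(f"Surviving drives contributing rebuild reads: {len(helpers)} of {P - 1}")
# In a traditional RAID 6 group, only N + K - 1 = 9 drives would supply rebuild reads.
```

Because the rebuild reads (and writes to the distributed spare space) are spread over ~40 drives instead of 9, the array can rebuild several times faster, which is the effect the next slide quantifies.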
Grid RAID advantage
• Rebuild speed increased by more than 3.5x
• No SSDs, no NV-RAM, no accelerators …
• PD-RAID as it was meant to be …
Thank you …. tkp@xyratex.com