
Beyond the File System

Designing Large Scale File Storage and Serving. Cal Henderson.


Presentation Transcript


  1. Beyond the File System Designing Large Scale File Storage and Serving Cal Henderson

  2. Hello!

  3. Big file systems? • Too vague! • What is a file system? • What constitutes big? • Some requirements would be nice

  4. 1 – Scalable • Looking at storage and serving infrastructures

  5. 2 – Reliable • Looking at redundancy, failure rates, on the fly changes

  6. 3 – Cheap • Looking at upfront costs, TCO and lifetimes

  7. Four buckets • Storage • Serving • BCP • Cost

  8. Storage

  9. The storage stack • File protocol: NFS, CIFS, SMB • File system: ext, reiserFS, NTFS • Block protocol: SCSI, SATA, FC • RAID: Mirrors, Stripes • Hardware: Disks and stuff

  10. Hardware overview The storage scale

  11. Internal storage • A disk in a computer • SCSI, IDE, SATA • 4 disks in 1U is common • 8 for half depth boxes

  12. DAS • Direct attached storage • Disk shelf, connected by SCSI/SATA • HP MSA30 – 14 disks in 3U

  13. SAN • Storage Area Network • Dumb disk shelves • Clients connect via a ‘fabric’ • Fibre Channel, iSCSI, Infiniband • Low level protocols

  14. NAS • Network Attached Storage • Intelligent disk shelf • Clients connect via a network • NFS, SMB, CIFS • High level protocols

  15. Of course, it’s more confusing than that

  16. Meet the LUN • Logical Unit Number • A slice of storage space • Originally for addressing a single drive: • c1t2d3 • Controller, Target, Disk (Slice) • Now means a virtual partition/volume • LVM, Logical Volume Management
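
The `c1t2d3` address on the slide above can be pulled apart mechanically. The following is a minimal sketch, assuming the Solaris-style naming convention in which an optional `s<N>` suffix names the slice; the function name `parse_address` and the returned dictionary layout are hypothetical, chosen only for illustration.

```python
import re

# Assumed pattern: c<controller>t<target>d<disk>, optionally followed by s<slice>.
_ADDRESS = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s(\d+))?$")


def parse_address(name):
    """Split a device name such as 'c1t2d3' into its numeric parts."""
    match = _ADDRESS.match(name)
    if match is None:
        raise ValueError(f"not a controller/target/disk address: {name!r}")
    controller, target, disk, slice_ = match.groups()
    return {
        "controller": int(controller),
        "target": int(target),
        "disk": int(disk),
        # The slice is optional, so it may be absent.
        "slice": None if slice_ is None else int(slice_),
    }
```

For example, `parse_address("c1t2d3")` yields controller 1, target 2, disk 3 and no slice.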

  17. NAS vs SAN • With a SAN, a single host (initiator) owns a single LUN/volume • With NAS, multiple hosts own a single LUN/volume • NAS head – NAS access to a SAN

  18. SAN Advantages Virtualization within a SAN offers some nice features: • Real-time LUN replication • Transparent backup • SAN booting for host replacement

  19. Some Practical Examples • There are a lot of vendors • Configurations vary • Prices vary wildly • Let’s look at a couple • Ones I happen to have experience with • Not an endorsement ;)

  20. NetApp Filers • Heads and shelves, up to 500TB in 6 Cabs • FC SAN with 1 or 2 NAS heads

  21. Isilon IQ • 2U Nodes, 3-96 nodes/cluster, 6-600 TB • FC/InfiniBand SAN with NAS head on each node

  22. Scaling Vertical vs Horizontal

  23. Vertical scaling • Get a bigger box • Bigger disk(s) • More disks • Limited by current tech – size of each disk and total number in appliance

  24. Horizontal scaling • Buy more boxes • Add more servers/appliances • Scales forever* *sort of

  25. Storage scaling approaches • Four common models: • Huge FS • Physical nodes • Virtual nodes • Chunked space

  26. Huge FS • Create one giant volume with growing space • Sun’s ZFS • Isilon IQ • Expandable on-the-fly? • Upper limits • Always limited somewhere

  27. Huge FS • Pluses • Simple from the application side • Logically simple • Low administrative overhead • Minuses • All your eggs in one basket • Hard to expand • Has an upper limit

  28. Physical nodes • Application handles distribution to multiple physical nodes • Disks, Boxes, Appliances, whatever • One ‘volume’ per node • Each node acts by itself • Expandable on-the-fly – add more nodes • Scales forever

  29. Physical Nodes • Pluses • Limitless expansion • Easy to expand • Unlikely to all fail at once • Minuses • Many ‘mounts’ to manage • More administration
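
The physical-nodes model in the two slides above puts distribution in the application. A minimal sketch of that idea follows, assuming the application keeps its own record of where each file went; the node names, `pick_node` and the `Catalog` class are all hypothetical and not taken from the talk.

```python
import zlib

# Hypothetical node names; in practice these would be mounts, hosts or appliances.
NODES = ["node01", "node02", "node03"]


def pick_node(file_id, nodes=NODES):
    """Choose a node for a new file using a stable hash of its id."""
    return nodes[zlib.crc32(file_id.encode("utf-8")) % len(nodes)]


class Catalog:
    """Application-side record of which node holds each file."""

    def __init__(self):
        self.location = {}

    def store(self, file_id, nodes=NODES):
        node = pick_node(file_id, nodes)
        # Remember the choice, so adding nodes later never moves old files.
        self.location[file_id] = node
        return node

    def locate(self, file_id):
        return self.location[file_id]
```

Recording the placement, rather than recomputing the hash on every read, is one way to keep "add more nodes" cheap: new nodes only receive new files.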

  30. Virtual nodes • Application handles distribution to multiple virtual volumes, contained on multiple physical nodes • Multiple volumes per node • Flexible • Expandable on-the-fly – add more nodes • Scales forever

  31. Virtual Nodes • Pluses • Limitless expansion • Easy to expand • Unlikely to all fail at once • Addressing is logical, not physical • Flexible volume sizing, consolidation • Minuses • Many ‘mounts’ to manage • More administration
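
"Addressing is logical, not physical" can be illustrated with a small lookup table from virtual volume to physical node. This is a sketch under assumed names (`VolumeMap`, `assign`, `move` and so on are hypothetical); it only shows that moving or consolidating a volume changes the table and nothing the application has stored.

```python
class VolumeMap:
    """Maps virtual volume numbers to the physical node that hosts them."""

    def __init__(self):
        self.volume_to_node = {}

    def assign(self, volume, node):
        self.volume_to_node[volume] = node

    def node_for(self, volume):
        return self.volume_to_node[volume]

    def move(self, volume, new_node):
        """Relocate a volume; files keep the same volume number."""
        if volume not in self.volume_to_node:
            raise KeyError(volume)
        self.volume_to_node[volume] = new_node

    def volumes_on(self, node):
        """List every volume currently hosted on one physical node."""
        return sorted(v for v, n in self.volume_to_node.items() if n == node)
```

A file recorded as living on volume 2 stays on volume 2 after `move(2, "box-b")`; only the table entry changes.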

  32. Chunked space • Storage layer writes parts of files to different physical nodes • A higher-level RAID striping • High performance for large files • read multiple parts simultaneously

  33. Chunked space • Pluses • High performance • Limitless size • Minuses • Conceptually complex • Can be hard to expand on the fly • Can’t manually poke it
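
The chunked-space idea above, a higher-level RAID stripe, can be sketched in a few lines. This is an in-memory illustration only, assuming fixed-size chunks and round-robin placement; the function names are hypothetical and no real storage layer is implied.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size parts; the last may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks, nodes):
    """Spread chunks round-robin, remembering each chunk's position."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append((index, chunk))
    return placement


def reassemble(placement):
    """Gather parts from every node and join them in original order."""
    parts = [part for stored in placement.values() for part in stored]
    parts.sort(key=lambda part: part[0])
    return b"".join(chunk for _, chunk in parts)
```

Because consecutive chunks land on different nodes, a reader could fetch them in parallel, which is where the performance for large files comes from.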

  34. Real Life Case Studies

  35. GFS – Google File System • Developed by … Google • Proprietary • Everything we know about it is based on talks they’ve given • Designed to store huge files for fast access

  36. GFS – Google File System • Single ‘Master’ node holds metadata • SPOF – Shadow master allows warm swap • Grid of ‘chunkservers’ • 64-bit filenames • 64 MB file chunks

  37. GFS – Google File System • [Diagram: Master and chunks labelled 1(a), 2(a), 1(b)]

  38. GFS – Google File System • Client reads metadata from master then file parts from multiple chunkservers • Designed for big files (>100MB) • Master server allocates access leases • Replication is automatic and self repairing • Synchronously for atomicity

  39. GFS – Google File System • Reading is fast (parallelizable) • But requires a lease • Master server is required for all reads and writes
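
With 64 MB chunks, working out which chunks a read touches is plain integer arithmetic. The sketch below assumes only the chunk size given on the slides; the function names are hypothetical and this is not Google's client code.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slides


def chunk_index(offset):
    """Index of the chunk that contains a given byte offset."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset, length):
    """Indexes of every chunk touched by reading `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A 100 MB read from offset 0 touches chunks 0 and 1, so a client could ask two chunkservers at once after a single metadata lookup.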

  40. MogileFS – OMG Files • Developed by Danga / SixApart • Open source • Designed for scalable web app storage

  41. MogileFS – OMG Files • Single metadata store (MySQL) • MySQL Cluster avoids SPOF • Multiple ‘tracker’ nodes locate files • Multiple ‘storage’ nodes store files

  42. MogileFS – OMG Files • [Diagram: Tracker, MySQL, Tracker]

  43. MogileFS – OMG Files • Replication of file ‘classes’ happens transparently • Storage nodes are not mirrored – replication is piecemeal • Reading and writing go through trackers, but are performed directly upon storage nodes
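
The split between trackers (which locate files) and storage nodes (which hold them) can be modelled in memory. This is a toy sketch, not the MogileFS API: every class and function name is hypothetical, a dict stands in for the MySQL metadata store, and for brevity the client writes every copy itself rather than relying on transparent replication.

```python
class StorageNode:
    """Stands in for one storage host; files are kept in a dict."""

    def __init__(self, name):
        self.name = name
        self.files = {}


class Tracker:
    """Answers 'where does this key live?' but never carries file data."""

    def __init__(self, nodes, copies_per_class):
        self.nodes = nodes
        self.copies_per_class = copies_per_class  # e.g. {"photo": 2}
        self.locations = {}  # stands in for the metadata store

    def plan_write(self, key, file_class):
        count = self.copies_per_class[file_class]
        # Pick the emptiest nodes, so copies spread out file by file
        # instead of mirroring whole nodes.
        targets = sorted(self.nodes, key=lambda n: len(n.files))[:count]
        self.locations[key] = targets
        return targets

    def locate(self, key):
        return self.locations[key]


def write_file(tracker, key, file_class, data):
    for node in tracker.plan_write(key, file_class):
        node.files[key] = data  # direct write to the storage node


def read_file(tracker, key):
    node = tracker.locate(key)[0]
    return node.files[key]  # direct read from the storage node
```

The per-class copy count is what lets, say, originals keep two copies while easily regenerated thumbnails keep one.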

  44. Flickr File System • Developed by Flickr • Proprietary • Designed for very large scalable web app storage

  45. Flickr File System • No metadata store • Deal with it yourself • Multiple ‘StorageMaster’ nodes • Multiple storage nodes with virtual volumes

  46. Flickr File System • [Diagram: three StorageMaster (SM) nodes]

  47. Flickr File System • Metadata stored by app • Just a virtual volume number • App chooses a path • Virtual nodes are mirrored • Locally and remotely • Reading is done directly from nodes

  48. Flickr File System • StorageMaster nodes only used for write operations • Reading and writing can scale separately
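
The write-through-StorageMaster, read-directly pattern described above might look like this in miniature. It is a sketch under heavy assumptions: the path layout, the class and function names, and the use of dicts as mirrored stores are all hypothetical and say nothing about Flickr's actual implementation.

```python
def choose_path(photo_id):
    """App-chosen path; this particular layout is hypothetical."""
    return f"/{photo_id % 1000:03d}/{photo_id}.jpg"


class StorageMaster:
    """Handles writes only, copying data to every mirror of a volume."""

    def __init__(self, mirrors):
        # virtual volume number -> list of mirror stores (dicts here)
        self.mirrors = mirrors

    def write(self, volume, path, data):
        for store in self.mirrors[volume]:
            store[path] = data


def read(mirrors, volume, path):
    """Reads bypass the StorageMaster and go straight to a mirror."""
    return mirrors[volume][0][path]
```

The application only has to remember the virtual volume number (and the path it chose) for each file, which is why no separate metadata store is needed, and why read and write capacity can grow independently.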

  49. Amazon S3 • A big disk in the sky • Multiple ‘buckets’ • Files have user-defined keys • Data + metadata

  50. Amazon S3 • [Diagram: Servers, Amazon]
