Introduction to Distributed Storage Systems
Harry Xu
CS 239, Fall 2019
Problems and Challenges
• Extremely large amounts of data are available these days
  • FB Social: 721M vertices, 68.7B edges in May 2011
  • Google Maps: 20 petabytes of data
• Where to put them?
  • Single machine? Servers?
• How can we enable applications to easily access them?
  • What interfaces do they provide?
  • What guarantees do they provide?
• How to enable applications to efficiently access these data?
  • What should be the right architecture (e.g., master+slave, peer-to-peer, etc.)?
• What if a machine crashes?
Solution: Distributed Storage Systems
• Where to put them?
  • On a cluster of commodity servers
• How to enable applications to easily access them?
  • Depending on data types (e.g., files, structured data, or unstructured data)
  • Standard interfaces
• What guarantees do they provide?
  • Consistency guarantees
• What if a machine crashes?
  • Fault tolerance: replication + quick recovery
  • Consistency between replicas
Three Different Kinds of Systems
• Distributed File Systems
  • HDFS -- Yahoo
  • GFS -- Google
• Distributed Structured Data Storage Systems (a.k.a. databases)
  • Bigtable (wide-column DB)
  • Spanner (NewSQL DB)
• A mix of both
  • Windows Azure Storage
Distributed File Systems
• HDFS
  • One “metadata” server (NameNode) and a set of DataNodes
  • A file is divided into blocks, and each block has several replicas on different DataNodes
  • File operations are recorded in a journal, which is replayed to restore a consistent state after a failure
  • Supports a wide variety of applications, including Hadoop and everything built on top of Hadoop
• GFS
  • Uses a similar architecture
  • Replicates both file chunks and namespaces
  • Uses checksums to detect data corruption
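To make the NameNode/DataNode split more concrete, here is a minimal sketch of the ideas on this slide: cutting a file into blocks, placing each block's replicas on distinct DataNodes, journaling metadata operations so they can be replayed after a crash, and using a checksum to detect corruption. It is not how HDFS or GFS is implemented. `ToyNameNode`, `split_into_blocks`, `is_corrupt`, the round-robin placement, the tiny `BLOCK_SIZE`, and the use of CRC32 are all assumptions chosen for illustration.

```python
import json
import zlib

BLOCK_SIZE = 4    # bytes; deliberately tiny for illustration (assumed value)
REPLICATION = 3   # replicas per block (assumed value)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def is_corrupt(block: bytes, expected_crc: int) -> bool:
    """Detect corruption by comparing a block against its stored checksum."""
    return zlib.crc32(block) != expected_crc


class ToyNameNode:
    """Hypothetical metadata server: tracks which DataNodes hold each block."""

    def __init__(self, datanodes):
        if len(datanodes) < REPLICATION:
            raise ValueError("need at least REPLICATION DataNodes")
        self.datanodes = list(datanodes)
        self.journal = []   # append-only log of metadata operations (JSON lines)
        self.files = {}     # path -> list of block records

    def _place(self, block_index: int) -> list:
        # Round-robin placement: REPLICATION distinct DataNodes per block.
        n = len(self.datanodes)
        return [self.datanodes[(block_index + k) % n] for k in range(REPLICATION)]

    def _apply(self, op: dict) -> None:
        # Apply one journaled operation to the in-memory namespace.
        if op["op"] == "create":
            self.files[op["path"]] = op["blocks"]
        elif op["op"] == "delete":
            self.files.pop(op["path"], None)

    def create(self, path: str, data: bytes) -> None:
        blocks = [
            {"id": i, "crc": zlib.crc32(b), "replicas": self._place(i)}
            for i, b in enumerate(split_into_blocks(data))
        ]
        op = {"op": "create", "path": path, "blocks": blocks}
        self.journal.append(json.dumps(op))  # record the operation first
        self._apply(op)                      # then update in-memory state

    def delete(self, path: str) -> None:
        op = {"op": "delete", "path": path}
        self.journal.append(json.dumps(op))
        self._apply(op)

    @classmethod
    def recover(cls, datanodes, journal):
        """Rebuild the namespace after a crash by replaying the journal in order."""
        node = cls(datanodes)
        for line in journal:
            node._apply(json.loads(line))
        node.journal = list(journal)
        return node


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    nn.create("/a.txt", b"hello world")
    restored = ToyNameNode.recover(nn.datanodes, nn.journal)
    print(restored.files == nn.files)
```

The sketch appends to the journal before touching in-memory state, so a replay always sees every operation that could have taken effect. A real system would also persist the journal durably and checkpoint it periodically, which this sketch leaves out.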
Data Storage Systems (NoSQL Databases)
• Bigtable
  • Built on top of GFS; available as part of Google Cloud Platform
  • It is a map (a wide-column store) from a row key and a column key to a byte array
  • Designed to scale to petabyte-size data
  • Each table has multiple dimensions and is divided into many small tablets for better integration with GFS
  • No notion of general (cross-row) transactions
• Spanner
  • A “NewSQL” database supporting externally consistent transactions
• Windows Azure Storage (WAS)
  • Supports strong consistency and various types of data
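The Bigtable data model can be pictured as a sorted, multi-dimensional map. The sketch below is a toy in-memory stand-in, not the Bigtable API. `ToyTable`, `split_into_tablets`, and the `rows_per_tablet` parameter are hypothetical names. The timestamp dimension goes slightly beyond what the slide states: it reflects the published Bigtable design, in which each cell can hold several timestamped versions.

```python
class ToyTable:
    """Hypothetical wide-column table: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept newest first
        self._cells = {}

    def put(self, row: str, column: str, value: bytes, timestamp: int) -> None:
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row: str, column: str, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row_keys(self) -> list:
        """All row keys in sorted order, as a tablet split would see them."""
        return sorted({row for row, _ in self._cells})


def split_into_tablets(row_keys, rows_per_tablet: int) -> list:
    """Partition sorted row keys into contiguous ranges ("tablets")."""
    keys = sorted(row_keys)
    return [keys[i:i + rows_per_tablet] for i in range(0, len(keys), rows_per_tablet)]


if __name__ == "__main__":
    table = ToyTable()
    table.put("com.cnn.www", "contents:", b"<html>v1", timestamp=1)
    table.put("com.cnn.www", "contents:", b"<html>v2", timestamp=2)
    table.put("com.example", "contents:", b"<html>", timestamp=1)
    print(table.get("com.cnn.www", "contents:"))
    print(split_into_tablets(table.row_keys(), rows_per_tablet=1))
```

Because rows are kept in sorted order and each tablet covers a contiguous row range, a scan over neighbouring row keys touches only a few tablets. A real system sizes tablets by data volume rather than by a fixed row count as this sketch does.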