An Introduction to Data Intensive Computing
Chapter 2: Data Management
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
What Are the Choices?
• Applications (R, SAS, Excel, etc.)
• File Systems
• Clustered File Systems (GlusterFS, …)
• Databases (SQL Server, Oracle, DB2)
• Distributed File Systems (Hadoop, Sector)
• NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)
What is the Fundamental Trade Off? Scale up … vs Scale out
Advice From Jim Gray
• Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
• Move the analysis to the data.
• Work with scientists to find the most common “20 queries” and make them fast.
• Go from “working to working.”
Pattern 1: Put the metadata in a database and point to files in a file system.
Example: Sloan Digital Sky Survey
• Two surveys in one:
  - Photometric survey in 5 bands
  - Spectroscopic redshift survey
• Data is public:
  - 40 TB of raw data
  - 5 TB of processed catalogs
  - 2.5 terapixels of images
• Catalog uses Microsoft SQL Server
• Started in 1992, finished in 2008
• The JHU SkyServer serves millions of queries
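For concreteness, here is a minimal sketch of this pattern in Java with JDBC. The file_metadata table, its columns (loosely modeled on an astronomical catalog), and the PostgreSQL connection URL are all hypothetical: the point is simply that the database answers metadata queries and returns paths, and the application then reads the bulk data directly from the file system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of Pattern 1: searchable metadata lives in a relational table,
// while the bulk data stays in ordinary files referenced by path.
// Table, columns, and connection URL are hypothetical.
public class MetadataCatalog {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/catalog", "user", "password");

        // Register a file: store descriptive metadata plus a pointer (the path).
        PreparedStatement ins = db.prepareStatement(
                "INSERT INTO file_metadata (dataset, ra_deg, dec_deg, path) VALUES (?, ?, ?, ?)");
        ins.setString(1, "run-1234");
        ins.setDouble(2, 180.25);   // right ascension
        ins.setDouble(3, -1.5);     // declination
        ins.setString(4, "/data/survey/run1234/frame-0042.fits");
        ins.executeUpdate();

        // Query the metadata, then open the file directly from the file system.
        PreparedStatement q = db.prepareStatement(
                "SELECT path FROM file_metadata WHERE dataset = ?");
        q.setString(1, "run-1234");
        ResultSet rs = q.executeQuery();
        while (rs.next()) {
            // Process the bulk data with ordinary file I/O from here on.
            System.out.println("data file: " + rs.getString("path"));
        }
        db.close();
    }
}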
Example: Bionimbus Genomics Cloud www.bionimbus.org
Bionimbus architecture (diagram):
• GWT-based front end
• Elastic cloud services (Eucalyptus, OpenStack)
• Database services (PostgreSQL)
• Analysis pipelines & re-analysis services
• Intercloud services (IDs, etc.)
• Large data cloud services (Hadoop, Sector/Sphere)
• Data ingestion services (UDT, replication)
Section 2.2: Distributed File Systems (Sector/Sphere)
Hadoop’s Large Data Cloud: Hadoop’s Stack (diagram)
• Applications
• Compute services: Hadoop’s MapReduce
• Data services: NoSQL databases
• Storage services: Hadoop Distributed File System (HDFS)
Hadoop Design
• Designed to run over commodity components that fail.
• Data replicated, typically three times.
• Block-based storage.
• Single name server containing all required metadata, which is a single point of failure.
• Optimized for efficient scans with high throughput, not low latency access.
• Designed for write once, read many (see the sketch below).
• Append operation planned for the future.
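A minimal sketch of this write-once, read-many usage with Hadoop's standard FileSystem API; the namenode URI and the path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of HDFS's write-once, read-many model using the
// standard FileSystem API. The namenode URI and path are hypothetical.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");  // placeholder namenode
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/datasets/example.txt");

        // Write once: the file is split into blocks, each replicated (typically 3x).
        FSDataOutputStream out = fs.create(p);
        out.writeUTF("hello, data intensive computing");
        out.close();

        // Read many: scans stream blocks from whichever datanodes hold replicas.
        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());
        in.close();
    }
}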
Hadoop Distributed File System (HDFS) Architecture (diagram)
• HDFS is block-based.
• Written in Java.
• Clients contact a single Name Node over the control path for metadata and exchange data directly with Data Nodes spread across racks.
Sector Distributed File System (SDFS) Architecture
• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide area operations.
Sector Distributed File System (SDFS) Architecture (diagram)
• Sector is file-based.
• Written in C++.
• Security server.
• Multiple masters.
• Clients contact a master node over the control path and exchange data directly with slave nodes spread across racks.
GlusterFS Architecture • No metadata server. • No single point of failure. • Uses algorithms to determine location of data. • Can scale out by adding more bricks.
GlusterFS Architecture (diagram)
• File-based.
• Clients exchange data directly with GlusterFS servers, which store it on bricks spread across racks.
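The "no metadata server" property can be illustrated with a toy placement function. This is a generic sketch of algorithmic placement, not GlusterFS's actual elastic hashing algorithm, and the brick names are made up.

import java.util.zip.CRC32;

// Toy illustration of algorithmic data placement (not GlusterFS's actual
// elastic hashing): every client hashes the file path to pick a brick,
// so no central metadata server is consulted.
public class HashPlacement {
    private final String[] bricks;

    public HashPlacement(String[] bricks) {
        this.bricks = bricks;
    }

    public String brickFor(String path) {
        CRC32 crc = new CRC32();
        crc.update(path.getBytes());
        return bricks[(int) (crc.getValue() % bricks.length)];
    }

    public static void main(String[] args) {
        HashPlacement fs = new HashPlacement(
                new String[] {"brick-1", "brick-2", "brick-3"});
        System.out.println(fs.brickFor("/home/alice/results.csv"));
    }
}

Because every client computes the same answer, no central lookup is needed; the trade-off is that adding a brick changes the mapping, which a real system handles by rebalancing.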
Evolution • Standard architecture for simple web applications: • Presentation: front-end, load balanced web servers • Business logic layer • Backend database • Database layer does not scale with large numbers of users or large amounts of data • Alternatives arose • Sharded (partitioned) databases or master-slave dbs • memcache
Scaling RDBMSs
• Master-slave database systems
  - Writes go to the master; reads go to the slaves
  - Propagating writes to the slaves can be a bottleneck, and slaves can be inconsistent
• Sharded databases (see the sketch below)
  - Applications and queries must understand the sharding schema
  - Both reads and writes scale
  - No native, direct support for joins across shards
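A minimal sketch of application-level sharding in Java with JDBC; the shard URLs, the users table, and the hash-by-user-id placement are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch of application-level sharding: the application hashes the
// shard key (here a user id) to pick one of several database connections.
// JDBC URLs and the users table are hypothetical.
public class ShardRouter {
    private final String[] shardUrls = {
            "jdbc:postgresql://db0/app", "jdbc:postgresql://db1/app",
            "jdbc:postgresql://db2/app", "jdbc:postgresql://db3/app"};

    private Connection shardFor(long userId) throws Exception {
        int shard = (int) (Math.abs(userId) % shardUrls.length);
        return DriverManager.getConnection(shardUrls[shard], "user", "password");
    }

    public String lookupName(long userId) throws Exception {
        try (Connection c = shardFor(userId);
             PreparedStatement q = c.prepareStatement(
                     "SELECT name FROM users WHERE user_id = ?")) {
            q.setLong(1, userId);
            ResultSet rs = q.executeQuery();
            return rs.next() ? rs.getString("name") : null;
        }
    }
}

A query that joins rows living on different shards would have to be executed against every shard and combined in the application, which is why cross-shard joins have no native support.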
NoSQL Systems • Suggests No SQL support, also Not Only SQL • One or more of the ACID properties not supported • Joins generally not supported • Usually flexible schemas • Some well known examples: Google’s BigTable, Amazon’s Dynamo & Facebook’s Cassandra • Several recent open source systems
CAP – Choose Two Per Operation (diagram)
• The three vertices: Consistency (C), Availability (A), Partition-resiliency (P).
• CA: available and consistent, unless there is a partition.
• CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum (e.g. BigTable, HBase).
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g. Dynamo, Cassandra).
CAP Theorem
• Proposed by Eric Brewer, 2000
• Three properties of a system: consistency, availability, and partition tolerance
• You can have at most two of these three properties for any shared-data system
• Scale-out requires partitions
• Most large web-based systems choose availability over consistency
References: Brewer, PODC 2000; Gilbert and Lynch, SIGACT News 2002
Eventual Consistency • If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent • Eventually, a node is either updated or removed from service. • Can be implemented with Gossip protocol • Amazon’s Dynamo popularized this approach • Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
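A toy sketch of the idea in Java: replicas periodically exchange state and keep the newer write for each key (last write wins). This is a generic illustration of anti-entropy gossip, not Dynamo's actual protocol, which uses vector clocks and more careful conflict resolution.

import java.util.HashMap;
import java.util.Map;

// Toy sketch of anti-entropy gossip with last-write-wins timestamps.
// After repeated pairwise exchanges, all replicas converge on the same
// state -- the essence of eventual consistency (not any specific system).
public class GossipReplica {
    static class Versioned {
        final String value;
        final long timestamp;
        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Versioned> store = new HashMap<>();

    public void put(String key, String value) {
        store.put(key, new Versioned(value, System.currentTimeMillis()));
    }

    // One gossip exchange: both replicas merge, keeping the newer write per key.
    public void gossipWith(GossipReplica peer) {
        merge(peer.store);
        peer.merge(this.store);
    }

    private void merge(Map<String, Versioned> remote) {
        for (Map.Entry<String, Versioned> e : remote.entrySet()) {
            Versioned local = store.get(e.getKey());
            if (local == null || e.getValue().timestamp > local.timestamp) {
                store.put(e.getKey(), e.getValue());
            }
        }
    }
}

After enough pairwise exchanges every replica holds the same, newest value for each key, which is exactly the eventual-consistency guarantee described above.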
Different Types of NoSQL Systems • Distributed Key-Value Systems • Amazon’s S3 Key-Value Store (Dynamo) • Voldemort • Cassandra • Column-based Systems • BigTable • HBase • Cassandra • Document-based systems • CouchDB
HBase Architecture (diagram)
• Many clients connect through the Java client or a REST API.
• An HBase Master coordinates a set of HRegionServers, each of which stores its data on disk.
Source: Raghu Ramakrishnan
HRegionServer
• Records are partitioned by column family into HStores.
• Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.
Source: Raghu Ramakrishnan
Facebook’s Cassandra • Modeled after BigTable’s data model • Modeled after Dynamo’s eventual consistency • Peer to peer storage architecture using consistent hashing (Chord hashing)
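A minimal sketch of consistent hashing on a ring in Java. A real system such as Cassandra adds virtual nodes, replicates each key to the next N nodes on the ring, and uses a stronger hash; the node and key names here are placeholders.

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch of consistent hashing on a ring, the placement idea behind
// Dynamo and Cassandra. A real system adds virtual nodes and replication;
// this toy version maps each key to the first node clockwise on the ring.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public String nodeFor(String key) {
        int h = hash(key);
        // First node at or after the key's position; wrap around if none.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;  // stand-in for a stronger hash like MD5
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println(ring.nodeFor("row-key-42"));
    }
}

Unlike modulo placement, adding or removing a node only remaps the keys between that node and its ring neighbor, so peers can join and leave with limited data movement.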
Zoom Levels / Bounds
• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images
Source: Andrew Levine
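The arithmetic behind the pyramid: zoom level z divides the map into a 2^z by 2^z grid, so there are 4^z tiles. A small sketch, assuming a -180..180 by -90..90 world extent (an assumption made for illustration):

import java.util.Arrays;

// Minimal sketch of the tile pyramid arithmetic: zoom level z divides the
// world into a 2^z x 2^z grid, so there are 4^z tiles (4, 16, 64, 256, ...).
// The -180..180 / -90..90 world bounds are assumed for illustration.
public class TilePyramid {
    public static double[] tileBounds(int zoom, int col, int row) {
        int n = 1 << zoom;                 // tiles per side = 2^zoom
        double width = 360.0 / n;
        double height = 180.0 / n;
        double minx = -180.0 + col * width;
        double miny = -90.0 + row * height;
        return new double[] {minx, miny, minx + width, miny + height};
    }

    public static void main(String[] args) {
        for (int z = 1; z <= 4; z++) {
            long tiles = 1L << (2 * z);    // 4^z
            System.out.println("zoom " + z + ": " + tiles + " tiles");
        }
        // Bounding box of one sample tile at zoom level 4.
        System.out.println(Arrays.toString(tileBounds(4, 2, 10)));
    }
}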
Build Tile Cache in the Cloud - Mapper (diagram)
• Step 1 (input to the mapper): the input key is a bounding box (e.g. minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5) and the input value is the source image.
• Step 2 (processing in the mapper): the mapper resizes and/or cuts the original image into pieces.
• Step 3 (mapper output): one record per piece, keyed by the piece's bounding box, with the image piece as the value.
Source: Andrew Levine
Build Tile Cache in the Cloud - Reducer (diagram)
• Step 1 (input to the reducer): the input key is a bounding box (e.g. minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375) and the input values are the image pieces that fall within it.
• Step 2 (reducer output): the pieces are assembled into a single image for the bounding box and written to HBase, building up layers for WMS for various datasets (a skeleton of this job is sketched below).
Source: Andrew Levine
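A skeleton of such a job against Hadoop's mapreduce API. This is a hedged sketch rather than the original code: it assumes the input is a sequence file of (bounding-box text, image bytes) pairs and elides the actual image cutting and mosaicking.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton of the tile-cache job described above. Keys are bounding boxes
// serialized as text ("minx,miny,maxx,maxy"); values are image bytes.
public class TileCacheJob {

    public static class TileMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text boundingBox, BytesWritable image, Context context)
                throws IOException, InterruptedException {
            // Resize and/or cut the source image into tile-sized pieces,
            // emitting each piece under the bounding box of the tile it covers.
            // (Image manipulation elided; the input is emitted as one piece.)
            context.write(boundingBox, image);
        }
    }

    public static class TileReducer
            extends Reducer<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text boundingBox, Iterable<BytesWritable> pieces,
                Context context) throws IOException, InterruptedException {
            // Assemble all pieces that share a bounding box into one tile,
            // then write it out (e.g. into the HBase table on the next slide).
            context.write(boundingBox, assemble(pieces));
        }

        private BytesWritable assemble(Iterable<BytesWritable> pieces) {
            // Placeholder: a real implementation would mosaic the images.
            BytesWritable last = new BytesWritable();
            for (BytesWritable p : pieces) {
                last = p;
            }
            return last;
        }
    }
}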
HBase Tables
• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query (layers, styles, projection, size) translates to the HBase schema:
  - Table name: WMS layer
  - Row ID: bounding box of the image
  - Column family: style name and projection
  - Column qualifier: width x height
  - Value: buffered image
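A sketch of how a tile write and a WMS read map onto this schema with the HBase client API of that era (roughly 0.90); the table name, column family, qualifier, and row key below are hypothetical instances of the scheme.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the WMS-to-HBase mapping: row = bounding box, family = style +
// projection, qualifier = width x height, value = encoded image bytes.
// All concrete names below are hypothetical.
public class WmsTileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wms_bluemarble");    // one table per WMS layer

        String rowKey = "-45.0,-2.8125,-43.59375,-2.109375";  // bounding box of the tile
        byte[] family = Bytes.toBytes("default_EPSG4326");    // style + projection
        byte[] qualifier = Bytes.toBytes("256x256");           // width x height
        byte[] imageBytes = new byte[0];                        // encoded tile image (elided)

        // Store a tile.
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(family, qualifier, imageBytes);
        table.put(put);

        // A WMS GetMap request translates into a get (or scan) on the same coordinates.
        Get get = new Get(Bytes.toBytes(rowKey));
        Result result = table.get(get);
        byte[] tile = result.getValue(family, qualifier);
        System.out.println(tile == null ? "miss" : tile.length + " bytes");
        table.close();
    }
}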
Pattern 4: Put the data into a distributed key-value store.
S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt.
S3 Keys • Keys must be unique within a bucket. • Values can be as large as 5 TB (formerly 5 GB)
S3 Security
• AWS access key (user name)
  - This functions as your S3 user name. It is an alphanumeric text string that uniquely identifies you.
• AWS secret key (functions as your password)
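A minimal sketch using the AWS SDK for Java; the credentials and the local file path are placeholders, and the bucket and key follow the naming example above.

import java.io.File;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

// Sketch using the AWS SDK for Java: the access key is the "user name"
// and the secret key the "password". Credentials and the local file path
// are placeholders; the bucket and key follow the naming example above.
public class S3Example {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(
                new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // Keys are unique within a bucket; bucket names are unique across AWS.
        String bucket = "tutorial.osdc.org";
        String key = "dataset1.txt";

        // Upload, then fetch the object back.
        s3.putObject(bucket, key, new File("/tmp/dataset1.txt"));
        S3Object obj = s3.getObject(bucket, key);
        System.out.println("size: " + obj.getObjectMetadata().getContentLength());
    }
}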
Access Keys (screenshot): the access key corresponds to the user name, the secret key to the password.
Other Amazon Data Services
• Amazon SimpleDB
• Amazon Elastic Block Storage (EBS)
The Basic Problem • TCP was never designed to move large data sets over wide area high performance networks. • As a general rule, reading data off disks is slower than transporting it over the network.
TCP Throughput vs RTT and Packet Loss (figure): throughput in Mb/s (y-axis, 200 to 1000) against round trip time in ms (x-axis, 1 to 400, spanning LAN, US, US-EU, and US-ASIA distances), with one curve per packet loss rate (0.01%, 0.05%, 0.1%, 0.5%). Source: Yunhong Gu, 2007, experiments over wide area 1G.
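A useful rule of thumb behind these curves is the standard approximation for steady-state TCP throughput (the Mathis et al. square-root formula, stated here from memory with the constant omitted):

    throughput ≈ MSS / (RTT * sqrt(p))

where MSS is the segment size, RTT the round trip time, and p the packet loss rate. Doubling the RTT or quadrupling the loss rate roughly halves the achievable rate, which is why a protocol tuned for LANs struggles on long, lossy wide-area paths.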
The Solution
• Use parallel TCP streams (e.g. GridFTP).
• Use specialized network protocols (UDT, FAST, etc.).
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.