big-data storage

Big Data Storage Model: Distributed Hash Table
http://www.allsoftsolutions.in

Introduction
• With the development of information technology and personal computing devices, the world's data is growing explosively. Huge volumes of data arrive from a variety of sources, in different forms, and at a rapid rate, which places new requirements on data capture, storage, query, sharing, analysis, and display.
• The diversity of data sources makes the data structurally complex: network logs, video, pictures, and geographical locations can all be big data sources, and most of them are semi-structured or unstructured. Big data is also generated in real time, so data management tools must analyze it and reach conclusions in real time.

Storage Models
• Distributed hash table (DHT)
• Key-value storage model
• Graph storage model
• Document storage model

Distributed Hash Table (DHT)
• DHTs enable a form of content overlay called a structured content overlay.
• Example DHT: Chord, together with the underlying mechanism that enables it, consistent hashing.

Distributed Hash Table (DHT)
• Chord:
  • Scalable
  • Distributed “lookup service”: a lookup service is simply any service that maps keys to values (e.g., DNS, directories)
• Properties:
  • Scalability
  • Provable correctness
  • Performance

Consistent Hashing
• The main idea is that keys and nodes map to the same ID space.
• Create a metric space, such as a ring, and place the nodes on this ring, each with its own ID.
(Figure: a ring with nodes at IDs 1, 10, 32, and 44.)

Consistent Hashing
• Keys also map to the ID space. For example, a 6-bit ID space ranges from 0 to 63.
• The consistent hash function assigns each node and each key an identifier in this space.
(Figure: the ring with IDs 1, 5, 8, 10, 17, 32, 37, 44, 55, and 60 marked.)

Consistent Hashing
• For nodes, the ID might be: hash(IP)
• For keys, the ID might simply be: hash(key)
• This mapping is used to resolve the lookup for a particular key.

Consistent Hashing
• Properties:
  • Load balance: all nodes receive roughly the same number of keys.
  • Flexibility: when a node joins or leaves the network, only a fraction of the keys has to be moved.
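
The ring lookup described on the consistent hashing slides can be sketched as follows. This is a minimal illustration, not Chord itself: it assumes that IDs 1, 10, 32, and 44 are nodes and that the other IDs on the ring are keys, that a key belongs to the first node at or after its ID (the successor rule), and it uses SHA-1 as an arbitrary hash. The joining node with ID 20 is hypothetical.

```python
import bisect
import hashlib

ID_BITS = 6                  # 6-bit identifier space (IDs 0..63), as on the slide
RING_SIZE = 2 ** ID_BITS


def ring_id(value: str) -> int:
    """Hash a node address or a key onto the ring (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def successor(node_ids, key_id):
    """Return the first node ID at or after key_id, wrapping past 63 back to 0."""
    ids = sorted(node_ids)
    index = bisect.bisect_left(ids, key_id)
    return ids[index % len(ids)]


def assign(node_ids, key_ids):
    """Map every key ID to the node responsible for it."""
    return {key_id: successor(node_ids, key_id) for key_id in key_ids}


nodes = [1, 10, 32, 44]            # assumed to be the node IDs on the ring
keys = [5, 8, 17, 37, 55, 60]      # assumed to be the key IDs on the ring

before = assign(nodes, keys)
after = assign(nodes + [20], keys)  # hypothetical node with ID 20 joins
moved = [k for k in keys if before[k] != after[k]]

print(before)   # key ID -> responsible node ID
print(moved)    # keys that changed owner when node 20 joined
```

Only the keys between the new node and its predecessor change owner, which is the flexibility property listed above.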

Generalized Column Model or DHT: Bigtable

Google Bigtable
• What is it?
• How does it work?
• Why is it important?
• Related products

Bigtable – What is it?
• Distributed multi-level map
• With an interesting data model
  • Data accessed by row key, column key, timestamp
• Fault-tolerant, persistent
• Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabytes of disk-based data
  • Millions of reads/writes per second, efficient scans
• Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

Bigtable – What is it?
• Bigtable is a distributed storage system for managing structured data
• NoSQL database developed by Google
• In-house database for very large data sets
• Has influenced the NoSQL database marketplace
• Highly distributed
• Client-based validation
• Row / column / timestamp indexing
• No joins available
• Based on the assumption of write once, read many

Bigtable – What is it?
• Every column can store arbitrary name-value pairs of the form column family:label, with a string as the value.
• Data in the same column family is stored close together in the distributed file system.
• Each Bigtable cell (a row and column pair) can contain multiple versions of the data, usually stored in decreasing timestamp order.
• In the web-page example, the row key is the record identifier: the row name is a reversed URL, the contents column family stores the page content, and the anchor column family stores the text of any anchors that reference the page.
• In that example each anchor cell has one version and the contents column has three versions.
• Each row range is called a tablet. The tablet is the unit of load balancing.
• Column names are defined dynamically and can hold actual data themselves.
• Timestamps allow multiple versions over time and make it possible to expire or garbage-collect old data.
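
The (row key, column key, timestamp) indexing and the newest-first version order can be illustrated with a toy in-memory map. This is only a sketch of the data model, not of how Bigtable stores data. The class `VersionedTable`, its methods, the row key `com.example.www`, and the column names are all hypothetical.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory map: (row key, column key, timestamp) -> value."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)

    def get(self, row, column, at=None):
        """Return the newest value, or the newest one with timestamp <= at."""
        for timestamp, value in self._cells.get((row, column), []):
            if at is None or timestamp <= at:
                return value
        return None

    def expire(self, row, column, keep):
        """Garbage-collect all but the `keep` newest versions of a cell."""
        del self._cells[(row, column)][keep:]


table = VersionedTable()
row = "com.example.www"                        # hypothetical reversed URL
for ts in (1, 2, 3):                           # three versions of the page content
    table.put(row, "contents:", ts, f"<html>v{ts}</html>")
table.put(row, "anchor:example.org", 4, "Example")   # one version of one anchor

latest = table.get(row, "contents:")           # newest version
older = table.get(row, "contents:", at=2)      # version as of timestamp 2
table.expire(row, "contents:", keep=1)         # keep only the newest version
remaining = table.get(row, "contents:", at=2)  # older versions are gone
```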

Bigtable – Example
• The location column family stores columns relating to where the procurement occurred, whereas the inventory column family stores the actual products procured and their classification.
• Note that there are two values for location, each with a different timestamp.
• Data in Bigtable can be stored in the normalized fashion.
(Figure: a procurement table with entries including College, Campus A, Campus B, Architecture, Informatics, Logistics, Furniture, Services, Desktop, and $800,000.)

Bigtable – How does it work?
• Uses the Google File System (GFS) and the Chubby lock service
• Storage and versioning based on timestamps
• Versions sorted newest first
• No validation; validation is left to the client
• Arbitrary columns per row
• Arbitrary column data types
• Very high data throughput

Bigtable tables on GFS

Bigtable tables on GFS
• Each table is split into row ranges called tablets.
• Each tablet is managed by a tablet server, which stores each column family for the given row range in a separate distributed file called an SSTable.
• In addition, a single metadata table is managed by a metadata server and is used to locate the tablets of any user table in response to a read or write request.
• The metadata table can itself be large, so it is also split into tablets; the root tablet points to the locations of the other metadata tablets.
• Bigtable relies on GFS and therefore supports parallel reads and inserts efficiently, even when they are performed simultaneously on the same table.
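
The two-level lookup (root tablet, then metadata tablet, then user tablet) can be sketched with sorted row ranges. This is a simplified illustration under stated assumptions: each range is identified by its inclusive end row key, and the tablet names, server names, and split points are invented for the example.

```python
import bisect


def find_range(entries, row_key):
    """entries is a sorted list of (end_row_key, target).

    Return the target of the first range whose end key is >= row_key.
    """
    end_keys = [end for end, _ in entries]
    index = bisect.bisect_left(end_keys, row_key)
    if index == len(entries):
        raise KeyError(row_key)
    return entries[index][1]


# Root tablet: end row key of each metadata tablet -> metadata tablet name.
ROOT_TABLET = [("m", "meta-1"), ("\uffff", "meta-2")]

# Metadata tablets: end row key of each user tablet -> tablet server.
METADATA_TABLETS = {
    "meta-1": [("f", "tabletserver-a"), ("m", "tabletserver-b")],
    "meta-2": [("t", "tabletserver-c"), ("\uffff", "tabletserver-a")],
}


def locate(row_key):
    """Resolve a row key to the tablet server holding its tablet."""
    meta_name = find_range(ROOT_TABLET, row_key)               # level 1: root tablet
    return find_range(METADATA_TABLETS[meta_name], row_key)    # level 2: metadata tablet


print(locate("com.example.www"))
print(locate("org.example"))
```

Because both levels are sorted by row key, each step is a binary search rather than a scan of all tablets.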

Bigtable Architecture

Bigtable – How does it work?
• Table columns are grouped into families
• Column families are named
• Access is granted at the family level
• Members of a family are compressed together
• No column datatype constraint
• Integrated with MapReduce

Bigtable – Related Products

Bigtable – Why is it important?
• Google leads the way in terms of functionality
• Their market needs drive their systems and offerings
• What they have done to date will affect the rest of the big data market in the future
  • Via cross-pollination of ideas
  • Via emulation of their success

HBase
HBase is an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.

Limitations of Hadoop
• Hadoop can perform only batch processing, and data is accessed only sequentially. That means the entire dataset has to be searched even for the simplest of jobs.
• Processing a huge dataset produces another huge dataset, which must also be processed sequentially. At this point a new solution is needed that can access any point of data in a single unit of time (random access).
Hadoop Random Access Databases
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are databases that store huge amounts of data and access it in a random manner.

Open-source, non-relational, distributed, column-oriented database modeled after Google’s Bigtable.
(row key, column family, column, timestamp) -> value

What HBase is NOT
• Not an SQL database
• Not relational
• No joins
• No fancy query language and no sophisticated query engine
• No transactions out of the box
• No secondary indices out of the box
• Not a drop-in replacement for your RDBMS

What: Features-1
• Linear scalability, capable of storing hundreds of terabytes of data
• Automatic and configurable sharding of tables
• Automatic failover support
• Strictly consistent reads and writes

What: Part of Hadoop ecosystem
(Diagram: a data producer and a data consumer read from and write to HBase and HDFS; HBase writes to HDFS.)

What: Features-2
• Integrates nicely with Hadoop MapReduce (both as source and destination)
• Easy Java API for client access
• Thrift gateway and REST APIs
• Bulk import of large amounts of data
• Replication across clusters and backup options
• Block cache and Bloom filters for real-time queries
• And many more...


How: the Data
Row key          Data
Minsk            geo:{‘country’:‘Belarus’,‘region’:‘Minsk’}
                 demography:{‘population’:‘1,937,000’@ts=2011}
New_York_City    geo:{‘country’:‘USA’,‘state’:‘NY’}
                 demography:{‘population’:‘8,175,133’@ts=2010, ‘population’:‘8,244,910’@ts=2011}
Suva             geo:{‘country’:‘Fiji’}

How: Writing the Data
• Row updates are atomic
• Updates across multiple rows are NOT atomic; no transaction support out of the box
• HBase stores N versions of a cell (default 3 in older releases; 1 since HBase 0.96)
• Tables are usually “sparse”: not all columns are populated in a row
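
Sparse rows and the N-versions limit can be sketched with nested dictionaries. This is a toy model, not the HBase client API: the `put` and `get` helpers are hypothetical, the limit of 3 follows the slide, and the `status` cell with values `v1` to `v4` is invented to show trimming. The city values come from the data slide.

```python
MAX_VERSIONS = 3   # value from the slide; real HBase sets this per column family


def put(table, row_key, family, qualifier, value, ts, max_versions=MAX_VERSIONS):
    """Store a new version of one cell and trim the cell to max_versions."""
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, []))
    cell.append((ts, value))
    cell.sort(key=lambda version: version[0], reverse=True)   # newest first
    del cell[max_versions:]                                    # drop the oldest


def get(table, row_key, family, qualifier):
    """Return the newest value of a cell, or None if it was never written."""
    cell = table.get(row_key, {}).get(family, {}).get(qualifier, [])
    return cell[0][1] if cell else None


table = {}
put(table, "New_York_City", "geo", "country", "USA", ts=2010)
put(table, "New_York_City", "geo", "state", "NY", ts=2010)
put(table, "New_York_City", "demography", "population", "8,175,133", ts=2010)
put(table, "New_York_City", "demography", "population", "8,244,910", ts=2011)
put(table, "Suva", "geo", "country", "Fiji", ts=2011)   # sparse: no demography

# Hypothetical cell written four times: only the three newest versions survive.
for ts in (1, 2, 3, 4):
    put(table, "New_York_City", "meta", "status", f"v{ts}", ts=ts)

print(get(table, "New_York_City", "demography", "population"))
print(table["New_York_City"]["meta"]["status"])
```

A row stores only the cells that were written, so the Suva row holds a single column and costs nothing for the columns it lacks.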

How: Reading the Data
• A reader will always read the last written (and committed) values
• Reading a single row: Get
• Reading multiple rows: Scan (very fast)
  • A Scan usually defines a start key and a stop key
  • Rows are ordered, so it is easy to do a partial key scan
• Query predicates are pushed down via server-side Filters

Row key                        Data
‘login_2012-03-01.00:09:17’    d:{‘user’:‘alex’}
...                            ...
‘login_2012-03-01.23:59:35’    d:{‘user’:‘otis’}
‘login_2012-03-02.00:00:21’    d:{‘user’:‘david’}
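
A partial key scan over ordered rows can be sketched as a range lookup on a sorted list. This is an illustration in plain Python, not the HBase Scan API: the `scan` function and its `predicate` parameter are hypothetical stand-ins for a Scan with a start key, a stop key, and a server-side Filter. The start key is treated as inclusive and the stop key as exclusive.

```python
import bisect


def scan(rows, start_key, stop_key, predicate=None):
    """Return rows with start_key <= row key < stop_key.

    rows must be sorted by row key; predicate optionally filters the data.
    """
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    selected = rows[lo:hi]
    if predicate is not None:
        selected = [(key, data) for key, data in selected if predicate(data)]
    return selected


rows = sorted(
    [
        ("login_2012-03-01.00:09:17", {"user": "alex"}),
        ("login_2012-03-01.23:59:35", {"user": "otis"}),
        ("login_2012-03-02.00:00:21", {"user": "david"}),
    ],
    key=lambda row: row[0],
)

# Partial key scan: every login on 2012-03-01.
march_first = scan(rows, "login_2012-03-01", "login_2012-03-02")
users = [data["user"] for _, data in march_first]

# The same scan with a filter applied to each row's data.
only_otis = scan(rows, "login_2012-03-01", "login_2012-03-02",
                 predicate=lambda data: data["user"] == "otis")
print(users)
```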

How: MapReduce Integration
• Out-of-the-box integration with Hadoop MapReduce
• Data from an HBase table can be the source for an MR job
• An MR job can write data into HBase
• An MR job can write data into HDFS directly, and the output files can then be loaded into HBase very quickly via the “Bulk Loading” functionality

How: Sharding the Data
• Automatic and configurable sharding of tables:
  • Tables are partitioned into Regions
  • A Region is defined by its start and end row keys
  • Regions are the distribution “atoms”
• Regions are assigned to RegionServers (HBase cluster slaves)
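
Region lookup and splitting can be sketched with a sorted list of start keys. This is a simplified model with assumptions stated up front: each region is stored as (start key, RegionServer), a region implicitly ends where the next one starts, and `split` hands the upper half directly to a chosen server. The server names `rs1` to `rs3` and the split points are invented.

```python
import bisect


def region_for(regions, row_key):
    """regions is a sorted list of (start_key, server); return the one holding row_key."""
    starts = [start for start, _ in regions]
    index = bisect.bisect_right(starts, row_key) - 1
    return regions[index]


def split(regions, split_key, new_server):
    """Split the region containing split_key; the upper half goes to new_server."""
    starts = [start for start, _ in regions]
    if split_key in starts:
        raise ValueError("a region already starts at this key")
    index = bisect.bisect_right(starts, split_key)
    return regions[:index] + [(split_key, new_server)] + regions[index:]


# The first region starts at the empty key, so every row key falls in some region.
regions = [("", "rs1"), ("h", "rs2"), ("q", "rs3")]

print(region_for(regions, "apple"))    # region starting at ""
print(region_for(regions, "zebra"))    # region starting at "q"

regions_after = split(regions, "d", "rs2")
print(region_for(regions_after, "dog"))
```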

How: Setup: Components
(Diagram: HBase components, including ZooKeeper and the client.)

How: Setup: Hadoop Cluster
Typical Hadoop + HBase setup (diagram):
• Master node: NameNode (HDFS), JobTracker (MapReduce), HMaster (HBase)
• Slave nodes: DataNode (HDFS), TaskTracker (MapReduce), RegionServer (HBase)

How: Setup: Automatic Failover
• DataNode failures are handled by HDFS (replication)
• RegionServer (RS) failures, including those caused by whole-server failure, are handled automatically
  • The Master re-assigns Regions to available RSs

When to Use HBase?

When: What HBase is good at
• Serving large amounts of data: built to scale from the get-go
• Fast random access to the data
• Write-heavy applications
• Append-style writing (inserting or overwriting new data) rather than heavy read-modify-write operations

When: HBase vs...
• Favors consistency over availability
• Part of the Hadoop ecosystem
• Great community; adopted by tech giants like Facebook, Twitter, Yahoo!, Adobe, etc.

When: Use-cases
• Audit logging systems
  • Track user actions
  • Answer questions/queries like:
    • What are the last 10 actions made by a user? Row key: userId_timestamp
    • Which users logged into the system yesterday? Row key: action_timestamp_userId
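
The two row key layouts above can be sketched as range scans over sorted keys. This is a plain Python illustration rather than HBase code. It assumes integer timestamps zero-padded to a fixed width so that lexicographic order matches time order, and user IDs and action names without underscores. The helper names, the sample events, and the "yesterday" window of 100 to 200 are all invented.

```python
import bisect

WIDTH = 13   # assumed fixed width for zero-padded timestamps


def user_key(user_id, ts):
    """Row key layout userId_timestamp."""
    return f"{user_id}_{ts:0{WIDTH}d}"


def action_key(action, ts, user_id):
    """Row key layout action_timestamp_userId."""
    return f"{action}_{ts:0{WIDTH}d}_{user_id}"


def range_scan(sorted_keys, start_key, stop_key):
    """Return keys with start_key <= key < stop_key."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Table 1: actions keyed by user, then time.
by_user = sorted(
    [user_key("alex", ts) for ts in range(1, 13)]
    + [user_key("otis", ts) for ts in range(1, 4)]
)
# Prefix scan on "alex_", then keep the 10 newest keys.
last_10 = range_scan(by_user, "alex_", "alex_" + "\uffff")[-10:]

# Table 2: actions keyed by action, then time, then user.
by_action = sorted([
    action_key("login", 100, "alex"),
    action_key("login", 150, "otis"),
    action_key("login", 250, "david"),
    action_key("logout", 160, "alex"),
])
yesterday = range_scan(by_action,
                       f"login_{100:0{WIDTH}d}",
                       f"login_{200:0{WIDTH}d}")
users_yesterday = [key.rsplit("_", 1)[1] for key in yesterday]
print(last_10[0], last_10[-1])
print(users_yesterday)
```

Each question becomes one contiguous scan because the leading part of the row key matches the question's filter.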

When: Use-cases
• Real-time analytics, OLAP
  • Real-time counters
  • Interactive reports showing trends, breakdowns, etc.
  • Time-series databases

When: Use-cases
• Monitoring system example

When: Use-cases
• Message-centered systems (Twitter-like messages/statuses)
• Content management systems serving content out of HBase
• Canonical use-case: webtable (pages stored while crawling the web)
• And others

Future
• Making it stable enough to substitute for an RDBMS in mission-critical cases
• Easier system management
• Performance improvements

Bigtable vs HBase
• HBase is an open source implementation of the Google Bigtable architecture.
• Ownership: HBase is an Apache project; Bigtable is from Google.
• Timestamps: HBase uses milliseconds; Bigtable uses microseconds.
• File system: HBase runs on HDFS and Bigtable on GFS; HBase can also run on other file systems.
• Mapping storage files into memory: HBase cannot.
• Commit logs: Bigtable has two commit logs; HBase has an option to skip the commit log.
• Access control: Bigtable enforces access control at the column-family level; HBase does not.
