This article discusses the foundational technologies in Big Data, including HDFS, Map-Reduce, and NoSQL databases. It covers concepts like distributed hash tables and the Chord algorithm.
Big Data Technologies: HDFS, MapReduce & NoSQL DBs. S. Sioutas, Ionian University; Ion Stoica, http://inst.eecs.berkeley.edu/~cs162
Big Data Technology is based on: Hash Functions (input: files, e.g. strings; output: hash keys) • Folding method:

    int h(String x, int D) {
        int i, sum;
        for (sum = 0, i = 0; i < x.length(); i++)
            sum += (int) x.charAt(i);
        return sum % D;   /* D is the cluster size */
    }

• sums the ASCII values of the letters in the string • the ASCII value of “A” is 65, so the sum falls in the range 650-900 for 10 upper-case letters; good when D is around 100, for example • the order of characters in the string has no effect
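A minimal usage sketch of the folding hash above, wrapped in a runnable class; the sample strings and the cluster size D = 100 are illustrative assumptions:

    public class FoldingHashDemo {
        // Sums the ASCII values of the characters and reduces modulo D (the cluster size).
        static int h(String x, int D) {
            int sum = 0;
            for (int i = 0; i < x.length(); i++) sum += (int) x.charAt(i);
            return sum % D;
        }

        public static void main(String[] args) {
            // Ten upper-case letters sum to 695 here, so both strings land in bucket 95 of 100.
            System.out.println(h("ABCDEFGHIJ", 100));   // 95
            System.out.println(h("JIHGFEDCBA", 100));   // 95: the order of characters has no effect
        }
    }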
Big Data Technology is based on: Distributed Hash Tables (DHTs) • Distribute (partition) a hash table data structure across a large number of servers • Also called a key-value store • Key identifier = SHA-1(key), node identifier = SHA-1(IP address) • Each key_id is mapped to the node_id with the smallest node_id >= key_id • Two operations: put(key, data); // insert “data” identified by “key” • data = get(key); // get the data associated with “key” • (Figure: sorted key-value pairs stored across the DHT.)
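A minimal sketch of the key-to-node mapping described above, using a sorted Java TreeMap to find the smallest node_id >= key_id (wrapping around the ring); the SHA-1 hashing of plain strings and the helper names are illustrative assumptions:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    public class SimpleDHTRing {
        // Sorted map: node_id -> node address (assumes at least one node has been added).
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        // SHA-1 of a string, interpreted as a non-negative integer identifier.
        static BigInteger id(String s) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);
        }

        void addNode(String address) throws Exception { ring.put(id(address), address); }

        // put()/get() would first call this: map key_id to the smallest node_id >= key_id.
        String lookup(String key) throws Exception {
            Map.Entry<BigInteger, String> e = ring.ceilingEntry(id(key));
            return (e != null) ? e.getValue() : ring.firstEntry().getValue();  // wrap around the ring
        }
    }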
Hadoop Distributed File System (HDFS) • Files split into 128 MB blocks • Blocks replicated across several datanodes (often 3) • Namenode stores metadata (file names, block locations, etc.) • Optimized for large files, sequential reads • Files are append-only • (Diagram: File1 split into blocks 1-4, each block replicated on three of the datanodes, with the namenode tracking their locations.)
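A minimal sketch of talking to HDFS through its Java client API, showing a write followed by a read; the namenode address, port, and file path are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed namenode URI
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.txt");           // assumed path
            try (FSDataOutputStream out = fs.create(file)) {     // write the file once...
                out.writeUTF("hello HDFS");
            }
            // ...afterwards it can only be appended to, never rewritten in place
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }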
Typical Hadoop Cluster • 40 nodes/rack, 1000-4000 nodes in cluster • 1 Gbps bandwidth in rack, 8 Gbps out of rack • Node specs (Facebook): 8-16 cores, 32-48 GB RAM, 10×2 TB disks • (Diagram: racks connect through rack switches to an aggregation switch.)
The lookup cluster architecture of a NoSQL DB (e.g. Cassandra) • N1, N2, ..., Nx are computing nodes of the same rack • M1, M2, ..., My are computing nodes of a second rack • Each rack is structured as a Chord overlay network • The whole CLUSTER (CLOUD) is structured as a Chord overlay between rack switches (each rack switch talks directly to its master node)
Hadoop Components • Distributed file system (HDFS) • Single namespace for entire cluster • Replicates data 3x for fault-tolerance • MapReduce framework • Runs jobs submitted by users • Manages work distribution & fault-tolerance • Colocated with file system
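To make the MapReduce side concrete, here is a compact sketch of a job a user might submit: the standard word-count example, with a mapper emitting (word, 1) pairs and a reducer summing them. The job name and the input/output paths taken from the command line are illustrative assumptions:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every token of an input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) { word.set(itr.nextToken()); ctx.write(word, ONE); }
            }
        }
        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }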
Distributed Hash Tables (DHTs) (cont’d) • Just need a lookup service, i.e., given a key (ID), map it to a machine: n = lookup(key); • Invoking put() and get() at node m:

    m.put(key, data) {
        n = lookup(key);        // get node “n” mapping “key”
        n.store(key, data);     // store data at node “n”
    }

    data = m.get(key) {
        n = lookup(key);        // get node “n” storing the data associated with “key”
        return n.retrieve(key); // get data stored at “n” associated with “key”
    }
Chord Lookup Service (Protocol) • Supports just one operation: given a key, Chord maps the key onto a node • Associate to each node and item a unique id/key in a one-dimensional space 0..2^m-1 • Partition this space across N machines • Each key is mapped to the node with the smallest id greater than or equal to the key's id (consistent hashing) • Key design decision: decouple correctness from efficiency • Properties: routing table size O(log N), where N is the total number of nodes; guarantees that a file is found in O(log N) steps
The lookup problem (figure): a publisher stores (key = “title”, value = MP3 data…) at some node in the cloud of nodes N1..N6; a client elsewhere issues Lookup(“title”) and must find which node holds it.
Routed queries (Freenet, Chord, etc.) (figure): the client's Lookup(“title”) is forwarded hop by hop through the overlay nodes N1..N9 until it reaches the node where the publisher stored key = “title”, value = MP3 data…
Chord software • 3000 lines of C++ code • Library to be linked with the application • Provides a lookup(key) function that yields the IP address of the node responsible for the key • Notifies the node of changes in the set of keys the node is responsible for
The Chord algorithm – Construction of the Chord ring • the consistent hash function assigns each node and each key an m-bit identifier using SHA-1 (Secure Hash Standard); m = any number big enough to make collisions improbable • Key identifier = SHA-1(key) • Node identifier = SHA-1(IP address) • Both are uniformly distributed • Both exist in the same ID space
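A small sketch of deriving m-bit identifiers with SHA-1 so that nodes and keys land in the same ID space; the toy value m = 8 and the sample IP/key strings are illustrative assumptions (Chord itself uses the full 160-bit SHA-1 output):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class ChordIds {
        // SHA-1 hash reduced modulo 2^m: node ids and key ids share the space 0..2^m-1.
        static BigInteger chordId(String s, int m) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d).mod(BigInteger.ONE.shiftLeft(m));
        }

        public static void main(String[] args) throws Exception {
            int m = 8;                                        // toy size; pick m large enough to avoid collisions
            System.out.println(chordId("10.0.0.5", m));       // node identifier = SHA-1(IP address) mod 2^m
            System.out.println(chordId("song.mp3", m));       // key identifier  = SHA-1(key) mod 2^m
        }
    }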
Challenges • System churn: machines can fail or exit the system at any time • Scalability: need to scale to tens or hundreds of thousands of machines • Heterogeneity: • Latency: 1 ms to 1000 ms • Bandwidth: 32 Kb/s to 100 Mb/s • Nodes stay in the system from tens of seconds to a year …
The Chord algorithm – Construction of the Chord ring • identifiers are arranged on an identifier circle modulo 2^m => the Chord ring
The Chord algorithm – Construction of the Chord ring • a key k is assigned to the node whose identifier is equal to or greater than the key's identifier • this node is called successor(k) and is the first node clockwise from k
The Chord algorithm – Simple node localization

    // ask node n to find the successor of id
    n.find_successor(id)
        if (id ∈ (n, successor])
            return successor;
        else
            // forward the query around the circle
            return successor.find_successor(id);

=> Number of messages is linear in the number of nodes!
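A small simulation sketch of this simple localization on a toy ring: every node knows only its successor, so the query walks the circle node by node, which is where the linear message count comes from. The 6-bit identifier space and the node ids are illustrative assumptions:

    import java.util.TreeSet;

    public class SimpleChordLookup {
        static final TreeSet<Integer> nodes = new TreeSet<>();    // simulated ring, ids 0..63

        // successor(id): first node clockwise from id.
        static int successor(int id) {
            Integer s = nodes.ceiling(id);
            return (s != null) ? s : nodes.first();
        }

        // Simple localization: hop successor by successor until id falls in (n, successor(n)].
        static int findSuccessor(int n, int id) {
            int hops = 0;
            int succ = successor(n + 1);
            while (!inHalfOpen(id, n, succ)) {
                n = succ;
                succ = successor(n + 1);
                hops++;
            }
            System.out.println("hops: " + hops);                  // grows linearly with the number of nodes
            return succ;
        }

        static boolean inHalfOpen(int x, int a, int b) {          // x in (a, b] on the circle
            if (a < b) return x > a && x <= b;
            return x > a || x <= b;                               // interval wraps past 0
        }

        public static void main(String[] args) {
            for (int n : new int[]{1, 8, 14, 21, 32, 38, 42, 48, 51, 56}) nodes.add(n);
            System.out.println(findSuccessor(8, 54));             // prints 56 (after 7 successor hops here)
        }
    }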
The Chord algorithm – Scalable node localization • Additional routing information to accelerate lookups • Each node n contains a routing table with up to m entries (m: number of bits of the identifiers) => the finger table • The i-th entry in the table at node n contains the first node s that succeeds n by at least 2^(i-1) • s = successor(n + 2^(i-1)) • s is called the i-th finger of node n
The Chord algorithm – Scalable node localization • Finger table: finger[i] = successor(n + 2^(i-1))
The Chord algorithm – Scalable node localization • Important characteristics of this scheme: • Each node stores information about only a small number of nodes (m) • Each node knows more about nodes closely following it than about nodes farther away • A finger table generally does not contain enough information to directly determine the successor of an arbitrary key k
The Chord algorithm – Scalable node localization • Search the finger table for the node which most immediately precedes id • Invoke find_successor from that node (see the sketch below) => Number of messages O(log N)!
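A compact sketch of the scalable version on the same toy ring: each node consults its finger table (finger[i] = successor(n + 2^(i-1))) and forwards the query to the closest preceding finger, so each hop roughly halves the remaining distance. The 6-bit identifier space and node ids are again illustrative assumptions:

    import java.util.TreeSet;

    public class ChordFingerLookup {
        static final int M = 6;                                // identifiers 0..2^6-1 (toy size)
        static final TreeSet<Integer> nodes = new TreeSet<>();

        static int successor(int id) {                         // first node clockwise from id
            Integer s = nodes.ceiling(id % (1 << M));
            return (s != null) ? s : nodes.first();
        }

        // finger[i] of node n, i = 1..M: successor(n + 2^(i-1))
        static int finger(int n, int i) { return successor((n + (1 << (i - 1))) % (1 << M)); }

        // Closest finger of n that strictly precedes id (fall back to n itself).
        static int closestPrecedingFinger(int n, int id) {
            for (int i = M; i >= 1; i--) {
                int f = finger(n, i);
                if (inOpen(f, n, id)) return f;
            }
            return n;
        }

        static int findSuccessor(int n, int id) {
            while (!inHalfOpen(id, n, successor(n + 1))) {
                int next = closestPrecedingFinger(n, id);
                if (next == n) break;                          // no closer finger: n precedes id
                n = next;                                      // forward the query: O(log N) such hops
            }
            return successor(n + 1);
        }

        static boolean inOpen(int x, int a, int b) {           // x in (a, b) on the circle
            if (a < b) return x > a && x < b;
            return x > a || x < b;
        }
        static boolean inHalfOpen(int x, int a, int b) {       // x in (a, b] on the circle
            if (a < b) return x > a && x <= b;
            return x > a || x <= b;
        }

        public static void main(String[] args) {
            for (int n : new int[]{1, 8, 14, 21, 32, 38, 42, 48, 51, 56}) nodes.add(n);
            System.out.println(findSuccessor(8, 54));          // prints 56 after only 2 forwarding hops
        }
    }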
The Chord algorithm – Node joins and stabilization • To ensure correct lookups, all successor pointers must be up to date • A stabilization protocol runs periodically in the background and updates finger tables and successor pointers
The Chord algorithm – Node joins and stabilization • Stabilization protocol: • Stabilize(): n asks its successor for its predecessor p and decides whether p should be n's successor instead (this is the case if p recently joined the system) • Notify(): notifies n's successor of its existence, so it can change its predecessor to n • Fix_fingers(): updates finger tables
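A pseudocode-style Java sketch of one stabilization round as described above; the Node fields, the notifyOf name, and the stubbed findSuccessor are assumptions for illustration only:

    // Illustrative sketch: a node with successor/predecessor references running
    // the periodic stabilization routines.
    class Node {
        int id;
        Node successor, predecessor;
        Node[] finger = new Node[6];                     // toy finger table, m = 6

        // stabilize(): ask the successor for its predecessor p; if p sits between this
        // node and the successor (p recently joined), adopt p as the new successor.
        void stabilize() {
            Node p = successor.predecessor;
            if (p != null && inOpen(p.id, id, successor.id)) successor = p;
            successor.notifyOf(this);                    // tell the successor that we exist
        }

        // notify: the successor adopts n as its predecessor if n is closer than the current one.
        void notifyOf(Node n) {
            if (predecessor == null || inOpen(n.id, predecessor.id, id)) predecessor = n;
        }

        // fix_fingers(): periodically refresh one finger table entry.
        void fixFinger(int i) {
            finger[i - 1] = findSuccessor((id + (1 << (i - 1))) % (1 << 6));
        }

        Node findSuccessor(int key) { /* lookup as sketched earlier */ return successor; }

        static boolean inOpen(int x, int a, int b) {     // x in (a, b) on the identifier circle
            if (a < b) return x > a && x < b;
            return x > a || x < b;
        }
    }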
The Chord algorithm – Node joins and stabilization • N26 joins the system • N26 acquires N32 as its successor • N26 notifies N32 • N32 acquires N26 as its predecessor
The Chord algorithm – Node joins and stabilization • N26 copies keys • N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26
The Chord algorithm – Node joins and stabilization • N21 acquires N26 as its successor • N21 notifies N26 of its existence • N26 acquires N21 as its predecessor
The Chord algorithm – Impact of node joins on lookups • All finger table entries are correct => O(log N) lookups • Successor pointers correct, but fingers inaccurate => correct but slower lookups
The Chord algorithm – Impact of node joins on lookups • Incorrect successor pointers => lookup might fail; retry after a pause • But correctness is still preserved!
The Chord algorithm – Impact of node joins on lookups • Stabilization completed => no influence on performance • Only in the negligible case that a large number of nodes joins between the target's predecessor and the target is the lookup slightly slower • No influence on performance as long as fingers are adjusted faster than the network doubles in size
The Chord algorithm – Failure of nodes • Correctness relies on correct successor pointers • What happens if N14, N21, and N32 fail simultaneously? • How can N8 acquire N38 as its successor?
The Chord algorithm – Failure of nodes • Each node maintains a successor list of size r • If the network is initially stable and every node fails with probability 1/2, find_successor still finds the closest living successor to the query key, and the expected time to execute find_successor is O(log N) • Proofs are in the research paper
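A small sketch of how the successor list is used: when the immediate successor has failed, the node falls back to the next live entry, which is why simultaneous failures rarely break lookups. The list size r = 3 and the isAlive stand-in are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: keep r successors so lookups survive individual node failures.
    class FaultTolerantNode {
        static final int R = 3;                               // successor list size (assumption)
        final List<FaultTolerantNode> successorList = new ArrayList<>(R);
        boolean alive = true;

        boolean isAlive() { return alive; }                   // stand-in for a ping/heartbeat check

        // First live entry of the successor list; skips failed nodes (e.g. N14, N21, N32 above).
        FaultTolerantNode liveSuccessor() {
            for (FaultTolerantNode s : successorList) {
                if (s.isAlive()) return s;
            }
            return null;                                      // all r successors failed at once
        }
    }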
The Chord algorithm – Failure of nodes • Massive failures have little impact: (1/2)^6 is about 1.6% • (Plot: failed lookups (percent) vs. failed nodes (percent).)
What Can You Run in Cloud Computing? • Almost everything! • Virtual machine instances • Storage services: Simple Storage Service (S3), Elastic Block Store (EBS) • Databases: database instances (e.g., MySQL, SQL Server, …), SimpleDB • Content distribution network: CloudFront • MapReduce: Amazon Elastic MapReduce • …