Large Scale Data Analysis with Map/Reduce
http://www.allsoftsolution.in
Contents
• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
  – Hadoop (Yahoo/Apache)
  – Cloud MapReduce (Accenture)
  – Elastic MapReduce (Hadoop on AWS)
  – MR.Flow
• Some M/R algorithms
  – Graph algorithms, text indexing & retrieval
Contents
Part I: Distributed computing frameworks
Scalability & Parallelisation
• Scalability approaches
  – Scale up (vertical scaling)
    • Only one direction of improvement (bigger box)
  – Scale out (horizontal scaling)
    • Two directions – add more nodes + scale up each node
    • Can achieve 4x the performance of a similarly priced scale-up system (ref?)
  – Hybrid (“scale out in a box”)
• Not everything parallelises
  – Algorithms with state
  – Dependencies from one iteration to another (recurrence, induction)
Parallelisation approaches
• Task decomposition
  – Distribute coarse-grained (synchronisation-wise) and computationally expensive tasks (otherwise too much coordination/management overhead)
  – Dependencies – execution order vs. data dependencies
  – Move the data to the processing (when needed)
• Data decomposition
  – Each parallel task works with a data partition assigned to it (no sharing)
  – Data has regular structure, i.e. chunks expected to need the same amount of processing time
  – Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours)
  – Move the processing to the data
Amdahl’s law
• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part
• Amdahl’s law
  – max_speedup = 1 / ((1 − P) + P/N)
  – P: proportion of the program that can be parallelised (1 − P remains serial or overhead)
  – N: number of processors / parallel nodes
• Example: P = 75% (i.e. 25% serial or overhead) => max_speedup ≤ 1 / 0.25 = 4, no matter how many nodes are added
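The law is easy to check numerically. A minimal sketch (the function name `max_speedup` simply mirrors the formula above):

```python
def max_speedup(p, n):
    """Amdahl's law: p = parallelisable fraction, n = number of parallel nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# With P = 75%, speedup is capped at 4x no matter how many nodes we add:
print(round(max_speedup(0.75, 4), 2))    # 2.29
print(round(max_speedup(0.75, 100), 2))  # 3.88
```

Note how quickly the returns diminish: going from 4 to 100 nodes buys less than a 2x improvement here.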
Map/Reduce
• Google (paper 2004, US patent 2010)
• General idea – co-locate data with computation nodes
• Data decomposition (parallelisation) – no data/order dependencies between tasks (except the Map-to-Reduce phase)
• Try to utilise data locality (bandwidth is $$$)
• Implicit data flow (higher abstraction level than MPI)
• Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
  – Map – for each input (Ki, Vi) produce zero or more output pairs (Km, Vm)
  – Combine – optional intermediate aggregation (less M->R data transfer)
  – Reduce – for input pair (Km, list(V1, V2, …, Vn)) produce zero or more output pairs (Kr, Vr)
Map/Reduce (2)
[diagram] (C) Jimmy Lin
Map/Reduce – examples
• In other words…
  – Map = partitioning of the data (compute part of a problem across several servers)
  – Reduce = processing of the partitions (aggregate the partial results from all servers into a single result set)
  – The M/R framework takes care of grouping of partitions by key
• Example: word count
  – Map (1 task per document in the collection)
    • In: doc_x
    • Out: (term1, count_1,x), (term2, count_2,x), …
  – Reduce (1 task per term in the collection)
    • In: (term1, <count_1,x, count_1,y, … count_1,z>)
    • Out: (term1, SUM(count_1,x, count_1,y, … count_1,z))
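The word count example above can be simulated in a few lines of plain Python. The `run_mapreduce` driver below is a single-process stand-in for the framework's shuffle & sort, not a real M/R runtime:

```python
from collections import defaultdict

def map_wordcount(doc_id, text):
    # Map: emit one (term, 1) pair per token in the document
    for term in text.lower().split():
        yield term, 1

def reduce_wordcount(term, counts):
    # Reduce: sum the partial counts for one term
    return term, sum(counts)

def run_mapreduce(docs, mapper, reducer):
    # stand-in for the framework: shuffle & sort = group intermediate values by key
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for k, v in mapper(doc_id, text):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

docs = {"doc1": "the chicken and the egg"}
print(run_mapreduce(docs, map_wordcount, reduce_wordcount))
# {'and': 1, 'chicken': 1, 'egg': 1, 'the': 2}
```

In a real deployment the mapper and reducer run on different nodes and the grouping happens over the network; only the two user functions would carry over unchanged.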
Map/Reduce – examples (2)
• Example: shortest path in graph (naïve)
  – Map: in (node_in, dist); out (node_out, dist+1) where node_in -> node_out
  – Reduce: in (node_r, <dist_a,r, dist_b,r, …, dist_c,r>); out (node_r, MIN(dist_a,r, dist_b,r, …, dist_c,r))
  – Multiple M/R iterations required, start with (node_start, 0)
• Example: inverted indexing (full text search)
  – Map
    • In: doc_x
    • Out: (term1, (doc_x, pos'_1,x)), (term1, (doc_x, pos''_1,x)), (term2, (doc_x, pos_2,x)), …
  – Reduce
    • In: (term1, <(doc_x, pos'_1,x), (doc_x, pos''_1,x), (doc_y, pos_1,y), … (doc_z, pos_1,z)>)
    • Out: (term1, <(doc_x, <pos'_1,x, pos''_1,x, …>), (doc_y, <pos_1,y>), … (doc_z, <pos_1,z>)>)
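One iteration of the naïve shortest-path example can be sketched as a single-process simulation. The small graph, the sentinel value 99 for "not reached yet", and the function names are illustrative assumptions:

```python
from collections import defaultdict

INF = 99  # sentinel for "not reached yet"

def sp_map(node, dist, adjacency):
    # keep the node's current distance and propagate dist+1 to its neighbours
    yield node, dist
    if dist < INF:  # no point propagating from unreached nodes
        for neighbour in adjacency.get(node, []):
            yield neighbour, dist + 1

def sp_reduce(node, dists):
    # keep the shortest distance seen for this node
    return node, min(dists)

def sp_iteration(distances, adjacency):
    # one full Map -> shuffle -> Reduce round
    groups = defaultdict(list)
    for node, dist in distances.items():
        for k, v in sp_map(node, dist, adjacency):
            groups[k].append(v)
    return {k: sp_reduce(k, vs)[1] for k, vs in groups.items()}

adjacency = {"A": ["B", "C"], "B": ["D"]}
dists = {"A": 0, "B": INF, "C": INF, "D": INF}  # start from node A
dists = sp_iteration(dists, adjacency)  # {'A': 0, 'B': 1, 'C': 1, 'D': 99}
dists = sp_iteration(dists, adjacency)  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
print(dists)
```

This shows why multiple iterations are required: D is two hops from A, so its distance only settles in the second round. A real job would stop once an iteration changes no distances.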
Map/Reduce – examples (3)
• Inverted index example rundown
• Input
  – Doc1: “Why did the chicken cross the road?”
  – Doc2: “The chicken and egg problem”
  – Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
  – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7))
  – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)), (“problem”,(doc2,5))
  – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))
Map/Reduce – examples (4)
• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
  – (“why”, <(doc1,1)>)
  – (“did”, <(doc1,2)>)
  – (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
  – (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
  – (“cross”, <(doc1,5)>)
  – (“road”, <(doc1,7)>)
  – (“and”, <(doc2,3)>)
  – (“egg”, <(doc2,4)>)
  – (“problem”, <(doc2,5)>)
  – (“kentucky”, <(doc3,1)>)
  – (“fried”, <(doc3,2)>)
Map/Reduce – examples (5)
• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
  – (“why”, <(doc1,<1>)>)
  – (“did”, <(doc1,<2>)>)
  – (“the”, <(doc1,<3,6>), (doc2,<1>)>)
  – (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
  – (“cross”, <(doc1,<5>)>)
  – (“road”, <(doc1,<7>)>)
  – (“and”, <(doc2,<3>)>)
  – (“egg”, <(doc2,<4>)>)
  – (“problem”, <(doc2,<5>)>)
  – (“kentucky”, <(doc3,<1>)>)
  – (“fried”, <(doc3,<2>)>)
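Putting the three phases of the rundown together, a single-process Python sketch (again standing in for a real M/R runtime) reproduces the reduce output above:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Map: emit (term, (doc, position)) for every token; 1-based positions
    for pos, term in enumerate(text.lower().replace("?", "").split(), start=1):
        yield term, (doc_id, pos)

def index_reduce(term, postings):
    # Reduce: group the positions per document, as in the reduce phase above
    per_doc = defaultdict(list)
    for doc_id, pos in sorted(postings):
        per_doc[doc_id].append(pos)
    return term, dict(per_doc)

def build_index(docs):
    # stand-in for the shuffle & sort phase: group postings by term
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            groups[term].append(posting)
    return {t: index_reduce(t, ps)[1] for t, ps in groups.items()}

docs = {
    "doc1": "Why did the chicken cross the road?",
    "doc2": "The chicken and egg problem",
    "doc3": "Kentucky Fried Chicken",
}
index = build_index(docs)
print(index["the"])      # {'doc1': [3, 6], 'doc2': [1]}
print(index["chicken"])  # {'doc1': [4], 'doc2': [2], 'doc3': [3]}
```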
Map/Reduce – pros & cons
• Good for
  – Lots of input, intermediate & output data
  – Little or no synchronisation required
  – “Read once”, batch oriented datasets (ETL)
• Bad for
  – Fast response time
  – Large amounts of shared data
  – Fine-grained synchronisation required
  – CPU intensive operations (as opposed to data intensive)
Dryad
• Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
  – Focus on throughput, not latency
  – Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
  – Vertices -> processes (processing nodes)
  – Edges -> communication channels between the processes
• DAG model benefits
  – Generic scheduler
  – No deadlocks / deterministic
  – Easier fault tolerance
Dryad DAG jobs
[diagram] (C) Michael Isard
Dryad (3)
• The job graph can mutate during execution(?)
• Channel types (one-way)
  – Files on a DFS
  – Temporary file
  – Shared memory FIFO
  – TCP pipes
• Fault tolerance
  – Node fails => re-run
  – Input disappears => re-run upstream node
  – Node is slow => run a duplicate copy at another node, take the first result
Dryad architecture & components
[diagram] (C) Mihai Budiu
Dryad programming
• C++ API (incl. Map/Reduce interfaces)
• SQL Integration Services (SSIS)
  – Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
  – LINQ to Dryad translator
• Distributed shell
  – Generalisation of the Unix shell & pipes
  – Many inputs/outputs per process!
  – Pipes span multiple machines
Dryad vs. Map/Reduce
[comparison] (C) Mihai Budiu
Contents
Part II: Open source Map/Reduce frameworks
Hadoop
• Apache Nutch (2004), Yahoo is currently the major contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
  – HDFS – distributed filesystem
  – HBase – distributed column store
  – Pig – high level query language (SQL-like)
  – Hive – Hadoop based data warehouse
  – ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
Hadoop – Map/Reduce
• Components
  – Job client
  – JobTracker
    • Only one
    • Scheduling, coordinating, monitoring, failure handling
  – TaskTracker
    • Many
    • Executes tasks received from the JobTracker
    • Sends “heartbeats” and progress reports back to the JobTracker
  – TaskRunner
    • The actual Map or Reduce task, started in a separate JVM
    • Crashes & failures do not affect the TaskTracker on the node!
Hadoop – Map/Reduce (2)
[diagram] (C) Tom White
Hadoop – Map/Reduce (3)
• Integrated with HDFS
  – Map tasks executed on the HDFS node where the data is (data locality => reduced traffic)
  – Data locality is not possible for Reduce tasks
  – Intermediate outputs of Map tasks are not stored on HDFS, but locally, and then sent to the proper Reduce task (node)
• Status updates
  – TaskRunner => TaskTracker, progress updates every 3 s
  – TaskTracker => JobTracker, heartbeat + progress for all local tasks every 5 s
  – If a task has no progress report for too long, it will be considered failed and re-started
Hadoop – Map/Reduce (4)
• Some extras
• Counters
  – Gather stats about a task
  – Globally aggregated (TaskRunner => TaskTracker => JobTracker)
  – M/R counters: M/R input records, M/R output records
  – Filesystem counters: bytes read/written
  – Job counters: launched M/R tasks, failed M/R tasks, …
• Joins
  – Copy the small set to each node and perform joins locally. Useful when one dataset is very large and the other very small (e.g. “Scalable Distributed Reasoning using MapReduce” from VUA)
  – Map side join – data is joined before the Map function; very efficient but less flexible (datasets must be partitioned & sorted in a particular way)
  – Reduce side join – more general but less efficient (Map generates (K,V) pairs using the join key)
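A reduce-side join can be sketched as follows. The `users`/`orders` datasets and the source tags are made-up examples, and the single-process driver stands in for the shuffle phase:

```python
from collections import defaultdict

def join_map(source, record):
    # Map: emit the join key plus the record tagged with its source dataset
    key = record[0]
    yield key, (source, record)

def join_reduce(key, tagged):
    # Reduce: cross-product of the left and right records sharing this key
    left = [r for s, r in tagged if s == "users"]
    right = [r for s, r in tagged if s == "orders"]
    return [(l, r) for l in left for r in right]

users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]

# stand-in for the shuffle: group tagged records by join key
groups = defaultdict(list)
for source, dataset in (("users", users), ("orders", orders)):
    for rec in dataset:
        for k, v in join_map(source, rec):
            groups[k].append(v)

joined = {k: join_reduce(k, vs) for k, vs in groups.items()}
print(joined["u1"])
# [(('u1', 'alice'), ('u1', 'book')), (('u1', 'alice'), ('u1', 'pen'))]
```

Tagging each record with its source is the key trick: the reducer sees all records for one join key and must be able to tell which side each came from.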
Hadoop – Map/Reduce (5)
• Built-in mappers and reducers
  – Chain – run a chain/pipe of sequential Maps (M+RM*). The last Map output is the Task output
  – FieldSelection – select a list of fields from the input dataset to be used as M/R keys/values
  – TokenCounterMapper, SumReducer – (remember the “word count” example?)
  – RegexMapper – matches a regex in the input key/value pairs
Cloud MapReduce
• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
  – Fast (reported as up to 60 times faster than Hadoop/EC2 in some cases)
  – Scalable & robust (no single point of bottleneck or failure)
  – Simple (3 KLOC)
• Features
  – No need for a centralised coordinator (JobTracker); job status is kept in the cloud datastore (SimpleDB)
  – All data transfer & communication is handled by the cloud
  – All I/O and storage is handled by the cloud
Cloud MapReduce (2)
[diagram] (C) Ricky Ho
Cloud MapReduce (3)
• Job client workflow
  – Store input data (S3)
  – Create a Map task for each data split & put it into the Mapper Queue (SQS)
  – Create Multiple Partition Queues (SQS)
  – Create the Reducer Queue (SQS) & put a Reduce task in it for each Partition Queue
  – Create the Output Queue (SQS)
  – Create a Job Request (with refs to all queues) and put it into SimpleDB
  – Start EC2 instances for Mappers & Reducers
  – Poll SimpleDB for job status
  – When the job completes, download results from S3
Cloud MapReduce (4)
• Mapper workflow
  – Dequeue a Map task from the Mapper Queue
  – Fetch data from S3
  – Perform the user defined map function, add the output (Km, Vm) pairs to the Partition Queue selected by hash(Km) => several partition keys may share the same partition queue!
  – When done, remove the Map task from the Mapper Queue
• Reducer workflow
  – Dequeue a Reduce task from the Reducer Queue
  – Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
  – Perform the user defined reduce function and add output pairs (Kr, Vr) to the Output Queue
  – When done, remove the Reduce task from the Reducer Queue
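The point that several partition keys may share the same partition queue follows directly from hashing keys into a fixed number of queues. A toy sketch (the hash function and key names are illustrative, not what Cloud MapReduce actually uses):

```python
NUM_QUEUES = 2

def partition_queue(key, num_queues=NUM_QUEUES):
    # toy deterministic hash: every mapper sends a given key to the same queue
    return sum(ord(c) for c in key) % num_queues

# with only two queues, distinct keys inevitably collide:
for k in ["apple", "banana", "cherry", "date"]:
    print(k, "-> queue", partition_queue(k))
```

The hash must be deterministic and shared by all mappers, so that every (Km, Vm) pair for one key lands in one queue; the reducer that drains that queue then has to separate the co-located keys itself.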
MR.Flow
• Web based M/R editor
• http://www.mr-flow.com
• Reusable M/R modules
• Execution & status monitoring (Hadoop clusters)
Contents
Part III: Some Map/Reduce algorithms
General considerations
• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned)
  – Not suitable for continuous input streams
  – There will be a spike in network utilisation after the Map / before the Reduce phase
• Number & size of key/value pairs
  – Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
  – Use Combiners
Graph algorithms
• Very suitable for M/R processing
  – Data (graph node) locality
  – “Spreading activation” type of processing
  – Some algorithms with sequential dependencies are not suitable for M/R
  – Breadth-first search algorithms work better than depth-first
• General approach
  – Graph represented by adjacency lists
  – Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result
  – Reduce task – aggregate values by key
  – Perform multiple iterations (with a termination criterion)
Social Network Analysis
• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
  – U (the target user) is fixed and its friends list is copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph
  – In: (X, <friendsX>), i.e. the local data for the cluster node
  – Out:
    • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U
    • nil otherwise
• Reduce task
  – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
  – Out: (U, <(X1, N1), (X2, N2), …>), where each Xi is a FOAF for U and Ni is its total number of occurrences in all FOAF lists (sort/rank the result!)
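The FOAF recommendation above can be sketched in plain Python. The small `graph` and all helper names are illustrative assumptions; the loop at the bottom stands in for the distributed map phase over graph partitions:

```python
from collections import Counter, defaultdict

def foaf_map(target, target_friends, x, x_friends):
    # only direct friends of the target contribute candidate lists
    if x in target_friends:
        candidates = set(x_friends) - set(target_friends) - {target}
        if candidates:
            yield target, candidates

def foaf_reduce(target, candidate_sets):
    # count how many mutual friends recommend each candidate, rank by count
    counts = Counter()
    for s in candidate_sets:
        counts.update(s)
    return counts.most_common()

graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave", "erin"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol"],
    "erin": ["bob"],
}

target = "alice"
groups = defaultdict(list)
for x, x_friends in graph.items():  # stand-in for the map phase over partitions
    for k, v in foaf_map(target, graph[target], x, x_friends):
        groups[k].append(v)

recommendations = foaf_reduce(target, groups[target])
print(recommendations)  # [('dave', 2), ('erin', 1)]
```

Dave ranks first because two of Alice's friends (Bob and Carol) know him, while Erin is known only through Bob, matching the count-and-rank step of the reduce task.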
PageRank with M/R
[diagram] (C) Jimmy Lin
Text Indexing & Retrieval
• Indexing is very suitable for M/R
  – Focus on scalability, not on latency & response time
  – Batch oriented
• Map task
  – Emit (Term, (DocID, position))
• Reduce task
  – Group pairs by Term and sort by DocID
Text Indexing & Retrieval (2)
[diagram] (C) Jimmy Lin
Text Indexing & Retrieval (3)
• Retrieval is not suitable for M/R
  – Focus on response time
  – Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
  – http://katta.sourceforge.net/
  – Distributed Lucene indexing with Hadoop (HDFS)
  – Multicast querying & ranking
Useful links
• “MapReduce: Simplified Data Processing on Large Clusters”
• “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”
• “Cloud MapReduce Technical Report”
• “Data-Intensive Text Processing with MapReduce”
• “Hadoop – The Definitive Guide”