
Berkeley Data Analysis Stack (BDAS)


Presentation Transcript


  1. Berkeley Data Analysis Stack (BDAS): Mesos, Spark, Shark, Spark Streaming

  2. Current Data Analysis Open Stack
  • Layers: Application / Data Processing / Storage / Infrastructure.
  • Characteristics:
    • Batch processing on on-disk data.
    • Not very efficient for “interactive” and “streaming” computations.

  3. Goal

  4. Berkeley Data Analytics Stack (BDAS)
  • Application: new apps such as AMP-Genomics, Carat, …
  • Data Processing: in-memory processing; trade between time, quality, and cost.
  • Data Management (Storage): efficient data sharing across frameworks.
  • Resource Management (Infrastructure): share infrastructure across frameworks (multi-programming for datacenters).

  5. BDAS Components

  6. Mesos
  • A platform for sharing commodity clusters between diverse computing frameworks.
  B. Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” tech report, UC Berkeley, 2010

  7. Mesos
  • Publishes “resource offers” to advertise available resources to frameworks.
  • Has to deal with framework-specific constraints (without knowing the specific constraints themselves), e.g. data locality.
  • Lets the framework scheduler “reject an offer” if its constraints are not met; the sketch below illustrates this offer/reject cycle.
  B. Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” tech report, UC Berkeley, 2010
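  The offer/reject cycle can be sketched as follows. This is an illustrative Scala model only: the types and method names here are hypothetical and do not correspond to the real Mesos scheduler API.

    // Hypothetical model of the Mesos resource-offer protocol (illustration only;
    // the real Mesos scheduler API differs).
    case class Offer(slaveId: String, cpus: Double, memGb: Double)
    case class Task(name: String, cpus: Double, memGb: Double)

    trait FrameworkScheduler {
      // Mesos pushes offers; the framework decides whether to use them, so
      // framework-specific constraints (e.g. data locality) stay inside the framework.
      def resourceOffer(offer: Offer): Option[Task]
    }

    class LocalityAwareScheduler(preferredSlaves: Set[String]) extends FrameworkScheduler {
      def resourceOffer(offer: Offer): Option[Task] =
        if (preferredSlaves.contains(offer.slaveId) && offer.cpus >= 1.0)
          Some(Task("map-task", cpus = 1.0, memGb = 2.0)) // accept: data is local
        else
          None // reject; Mesos will offer the resources to another framework
    }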

  8. Mesos
  • Other issues:
    • Resource allocation strategies: pluggable (a fair-sharing plugin is implemented); revocation.
    • Isolation: existing OS isolation techniques, e.g. Linux Containers.
    • Fault tolerance:
      • Master: standby master nodes and ZooKeeper.
      • Slaves: Mesos reports task/slave failures to the framework, which handles them.
      • Framework scheduler failure: replicate the scheduler.
  B. Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” tech report, UC Berkeley, 2010

  9. Spark
  • Current popular programming models for clusters transform data flowing from stable storage to stable storage.
  • Spark: in-memory cluster computing for iterative and interactive applications.
  • Assumptions:
    • The same working data set is used across iterations or for a number of interactive queries.
    • Commodity cluster.
    • Local data partitions fit in memory.
  (Some slides taken from the original Spark presentation.)

  10. Spark
  • Acyclic data flows are a powerful abstraction...
  • ...but they are not efficient for iterative/interactive applications that repeatedly reuse the same working data set.
  • Solution: augment the data flow model with in-memory “resilient distributed datasets” (RDDs).

  11. RDDs
  • An RDD is an immutable, partitioned, logical collection of records.
  • Need not be materialized; instead it carries the information needed to rebuild the dataset from stable storage (lazy loading plus lineage).
  • A lost partition can be rebuilt (transform once, read many).
  • Partitioning can be based on a key in each record (hash or range partitioning).
  • Created by transforming data in stable storage using data flow operators (map, filter, group-by, …).
  • Can be cached for future reuse, as the sketch below shows.
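  A minimal sketch of these properties using the classic Spark RDD API (the HDFS path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    // Transformations only record lineage; nothing is read or materialized yet.
    val lines  = sc.textFile("hdfs://...")            // placeholder path
    val errors = lines.filter(_.startsWith("ERROR"))  // adds a filter step to the lineage
    val cached = errors.cache()                       // mark for in-memory reuse

    // The first action materializes the RDD and populates the cache; a lost
    // partition is later rebuilt from lineage rather than from a replica.
    val n = cached.count()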

  12. Generality of RDDs
  • Claim: Spark’s combination of data flow with RDDs unifies many proposed cluster programming models.
  • General data flow models: MapReduce, Dryad, SQL.
  • Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing.
  • Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

  13. Programming Model

  14. Example: Log Mining
  • Load error messages from a log into memory, then interactively search for various patterns:
    lines = spark.textFile("hdfs://...")           // base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                  // cached RDD
    cachedMsgs.filter(_.contains("foo")).count     // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .
  (Diagram: the driver ships tasks to workers; each worker reads one block of the file, keeps its partition of messages in an in-memory cache, and returns results to the driver.)
  • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

  15. RDD Fault Tolerance
  • RDDs maintain lineage information that can be used to reconstruct lost partitions. Example:
    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()
  • Resulting lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(…)) → MappedRDD (func: split(…)) → CachedRDD
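  Spark can print this lineage chain directly via RDD.toDebugString; a small sketch (the path is a placeholder, and sc is a SparkContext as in the earlier sketch):

    // toDebugString prints the lineage of an RDD, one parent per indentation level.
    val cachedMsgs = sc.textFile("hdfs://...")
      .filter(_.contains("error"))
      .map(_.split('\t')(2))
      .cache()

    println(cachedMsgs.toDebugString)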

  16. Benefits of RDD Model
  • Consistency is easy due to immutability.
  • Inexpensive fault tolerance (log the lineage rather than replicating/checkpointing the data).
  • Locality-aware scheduling of tasks on partitions.
  • Despite being restricted, the model seems applicable to a broad variety of applications.

  17. Example: Logistic Regression
  • Goal: find the best line separating two sets of points, starting from a random initial line and iterating toward the target separator.
  (Plot: “+” and “–” points in the plane, with the random initial line converging to the target line.)

  18. Logistic Regression Code
    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
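  The slide code assumes helpers that are not shown (readPoint and a random Vector type with a dot product). A self-contained sketch of the same gradient loop over plain arrays, with those helpers filled in as assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.math.exp
    import scala.util.Random

    // DataPoint, readPoint, and the dot helper are assumptions added so the
    // example runs; the slide's Vector type is replaced by Array[Double].
    case class DataPoint(x: Array[Double], y: Double)

    def readPoint(line: String): DataPoint = {
      val tok = line.split(' ').map(_.toDouble)
      DataPoint(tok.init, tok.last)          // last column is the label (+1/-1)
    }

    def dot(a: Array[Double], b: Array[Double]): Double =
      (a, b).zipped.map(_ * _).sum

    val sc = new SparkContext(new SparkConf().setAppName("logreg-sketch"))
    val D = 10                               // feature dimension (assumed)
    val ITERATIONS = 100
    val data = sc.textFile("hdfs://...").map(readPoint).cache()

    var w = Array.fill(D)(2 * Random.nextDouble - 1)  // random initial plane
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map { p =>
        val s = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * s)                       // one gradient term per point
      }.reduce((a, b) => (a, b).zipped.map(_ + _))
      w = (w, gradient).zipped.map(_ - _)    // w -= gradient
    }
    println("Final w: " + w.mkString(", "))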

  19. Logistic Regression Performance
  • Hadoop: 127 s per iteration.
  • Spark: 174 s for the first iteration (loading the data), then 6 s per further iteration (reusing the cached working set).

  20. PageRank: Scala Implementation
    val links = // RDD of (url, neighbors) pairs
    var ranks = // RDD of (url, rank) pairs
    for (i <- 1 to ITERATIONS) {
      val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)
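  To exercise this loop, the two input RDDs can be seeded as follows; the three-page link graph is made up purely for illustration (sc is a SparkContext as in the earlier sketches):

    // Tiny hypothetical link graph, just to feed the loop above.
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")),
      ("b", Seq("c")),
      ("c", Seq("a"))
    )).cache()                               // reused every iteration, so cache it

    var ranks = links.mapValues(_ => 1.0)    // every page starts with rank 1.0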

  21. Spark Summary
  • Fast, expressive cluster computing system compatible with Apache Hadoop; works with any Hadoop-supported storage system (HDFS, S3, Avro, …).
  • Improves efficiency through in-memory computing primitives and general computation graphs: up to 100× faster.
  • Improves usability through rich APIs in Java, Scala, and Python and an interactive shell: often 2-10× less code.

  22. Spark Streaming
  • Framework for large-scale stream processing.
  • Scales to hundreds of nodes.
  • Can achieve second-scale latencies.
  • Integrates with Spark’s batch and interactive processing.
  • Provides a simple, batch-like API for implementing complex algorithms.
  • Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

  23. Requirements
  • Scalable to large clusters
  • Second-scale latencies
  • Simple programming model
  • Integrated with batch & interactive processing

  24. Stateful Stream Processing
  • Traditional streaming systems have an event-driven, record-at-a-time processing model:
    • Each node has mutable state.
    • For each record, update the state & send out new records.
  • State is lost if a node dies!
  • Making stateful stream processing fault-tolerant is challenging.
  (Diagram: input records flow through nodes 1-3, each holding mutable state.)

  25. Discretized Stream Processing
  • Run a streaming computation as a series of very small, deterministic batch jobs (see the sketch below):
    • Chop up the live data stream into batches of X seconds.
    • Spark treats each batch of data as an RDD and processes it using RDD operations.
    • Finally, the processed results of the RDD operations are returned in batches.
  (Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)
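  A minimal sketch of the model with the standard Spark Streaming API: one-second micro-batches over a TCP text source. The host/port are placeholders; feed it with e.g. `nc -lk 9999`.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-sketch")
    val ssc  = new StreamingContext(conf, Seconds(1))   // batch interval = X seconds

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(w => (w, 1))
                      .reduceByKey(_ + _)               // each batch is processed as an RDD

    counts.print()            // output operation: print each batch's result
    ssc.start()               // start chopping the live stream into batches
    ssc.awaitTermination()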

  26. Example 1 – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • tweets is a DStream: a sequence of RDDs representing a stream of data, fed here by the Twitter streaming API.
  • Each batch (@ t, @ t+1, @ t+2, …) is stored in memory as an RDD (immutable, distributed).
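  Since a DStream is just a sequence of RDDs, the per-batch RDDs can also be handled directly with DStream.foreachRDD; a small sketch:

    // foreachRDD exposes the underlying RDD of each batch.
    tweets.foreachRDD { rdd =>
      println(s"tweets in this batch: ${rdd.count()}")
    }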

  27. Example 1 – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
  • flatMap is a transformation: it modifies the data in one DStream to create a new DStream.
  • New RDDs are created for every batch: flatMap turns each batch RDD of tweets into the corresponding batch of the hashTags DStream ([#cat, #dog, …]).

  28. Example 1 – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
  • saveAsHadoopFiles is an output operation that pushes data to external storage: every batch of hashTags is saved to HDFS.

  29. Fault-tolerance
  • RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
  • Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
  • Data lost due to a worker failure can be recomputed from the replicated input data: lost partitions of the hashTags RDD are recomputed on other workers from the tweets RDD.

  30. Key concepts
  • DStream – sequence of RDDs representing a stream of data: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets.
  • Transformations – modify data from one DStream to another:
    • Standard RDD operations – map, countByValue, reduce, join, …
    • Stateful operations – window, countByValueAndWindow, … (see the windowed-count sketch below)
  • Output operations – send data to an external entity:
    • saveAsHadoopFiles – saves to HDFS
    • foreach – do anything with each batch of results
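  For instance, a sliding-window count over the hashtag stream from Example 1 might look like this (the window lengths are illustrative):

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Hashtag frequencies over the last 10 minutes, recomputed every second.
    val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                            .countByValue()
    tagCounts.print()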

  31. Comparison with Storm and S4
  • Higher throughput than Storm:
    • Spark Streaming: 670k records/second/node
    • Storm: 115k records/second/node
    • Apache S4: 7.5k records/second/node
