Berkeley Data Analysis Stack (BDAS): Mesos, Spark, Shark, Spark Streaming
Current Data Analysis Open Stack
[Stack diagram: Application → Data Processing → Storage → Infrastructure]
• Characteristics:
• Batch processing of on-disk data.
• Not very efficient for "interactive" and "streaming" computations.
Berkeley Data Analytics Stack (BDAS)
New apps: AMP-Genomics, Carat, …
[Stack diagram, layer by layer, old stack → BDAS:]
• Application → Application
• Data Processing → Data Processing: in-memory processing; trades off among time, quality, and cost
• Storage → Data Management: efficient data sharing across frameworks
• Infrastructure → Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
Mesos
• A platform for sharing commodity clusters between diverse computing frameworks.
B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," technical report, UC Berkeley, 2010
Mesos
• Publishes "resource offers" to advertise available resources to frameworks.
• Has to deal with framework-specific constraints (without knowing what those constraints are), e.g. data locality.
• Allows a framework's scheduler to reject an offer if its constraints are not met (see the sketch below).
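The offer/reject loop can be sketched as follows; the types below are hypothetical simplifications for illustration, not the real org.apache.mesos API:

// Hypothetical, simplified types illustrating the resource-offer protocol;
// NOT the actual Mesos API.
case class Offer(slaveId: String, cpus: Double, memMB: Int)
case class Task(id: String, preferredSlaves: Set[String])

class FrameworkScheduler(var pending: List[Task]) {
  // Mesos invokes this callback with resources it is willing to grant.
  def resourceOffers(offers: Seq[Offer]): Unit =
    offers.foreach { offer =>
      pending.find(_.preferredSlaves.contains(offer.slaveId)) match {
        case Some(task) =>                 // locality constraint met: accept
          pending = pending.filterNot(_.id == task.id)
          launch(task, offer)
        case None =>
          decline(offer)                   // constraint unmet: reject the offer
      }
    }

  private def launch(t: Task, o: Offer): Unit = println(s"launch ${t.id} on ${o.slaveId}")
  private def decline(o: Offer): Unit = println(s"decline offer from ${o.slaveId}")
}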
Mesos
• Other issues:
• Resource allocation strategies: pluggable; a fair-sharing plugin is implemented.
• Revocation.
• Isolation: existing OS isolation techniques (Linux Containers).
• Fault tolerance:
• Master: standby master nodes coordinated via ZooKeeper.
• Slaves: Mesos reports task/slave failures to the framework, which handles them.
• Framework scheduler failure: replicate the scheduler.
Spark
Current popular programming models for clusters transform data flowing from stable storage to stable storage.
• Spark: in-memory cluster computing for iterative and interactive applications.
• Assumptions:
• The same working data set is used across iterations or across a number of interactive queries.
• Commodity cluster.
• Local data partitions fit in memory.
Some slides taken from presentation:
Spark
• Acyclic data flows are a powerful abstraction,
• but they are not efficient for iterative/interactive applications that repeatedly use the same working data set.
Solution: augment the data flow model with in-memory "resilient distributed datasets" (RDDs).
RDDs
• An RDD is an immutable, partitioned, logical collection of records.
• It need not be materialized; instead it carries the information to rebuild the dataset from stable storage (lazy loading plus lineage),
• so a lost partition can be rebuilt (transform once, read many).
• Partitioning can be based on a key in each record (using hash or range partitioning).
• Created by transforming data in stable storage using data flow operators (map, filter, group-by, …).
• Can be cached for future reuse, as in the sketch below.
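For instance, building, hash-partitioning, and caching an RDD looks like this in Spark's Scala API (a minimal sketch assuming an existing SparkContext `sc`; the path is illustrative):

import org.apache.spark.HashPartitioner

// Build a pair RDD keyed on the first tab-separated field.
val pairs = sc.textFile("hdfs://...")
              .map(line => (line.split("\t")(0), line))
// Hash-partition by key and keep the partitions in memory for reuse.
val byKey = pairs.partitionBy(new HashPartitioner(8))
                 .cache()
// RDDs are lazy: nothing is computed until this action runs.
println(byKey.count())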
Generality of RDDs
• Claim: Spark's combination of data flow with RDDs unifies many proposed cluster programming models.
• General data flow models: MapReduce, Dryad, SQL.
• Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing.
• Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns.
[Diagram: the driver sends tasks to workers; each worker reads a block of the file, keeps its partition in a cache, and returns results to the driver.]
val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()                   // cached RDD
// Parallel operations on the cached data:
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions.
• Example:
val cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()
Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
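Spark can print this lineage directly; assuming the `cachedMsgs` RDD above, the `toDebugString` method shows the chain of parent RDDs used for reconstruction:

// Prints the lineage graph, e.g. MappedRDD <- FilteredRDD <- HadoopRDD.
println(cachedMsgs.toDebugString)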
Benefits of RDD Model
• Consistency is easy due to immutability.
• Inexpensive fault tolerance (log the lineage rather than replicating/checkpointing data).
• Locality-aware scheduling of tasks on partitions.
• Despite being restricted, the model seems applicable to a broad variety of applications.
Example: Logistic Regression
• Goal: find the best line separating two sets of points.
[Figure: a cloud of + and – points; a random initial line is iteratively adjusted until it converges to the target separator.]
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
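The snippet assumes helpers the slide leaves undefined (`D` and `ITERATIONS` also remain elided, as on the slide). A hypothetical sketch of what `readPoint` and a matching vector type might look like, just to make the example self-contained:

import scala.math.exp
import scala.util.Random

// Hypothetical supporting definitions (not on the slide): a dense vector
// with exactly the operations the snippet uses, and a point parser.
case class Vector(elems: Array[Double]) {
  def dot(o: Vector): Double = elems.zip(o.elems).map { case (a, b) => a * b }.sum
  def *(s: Double): Vector   = Vector(elems.map(_ * s))
  def +(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a + b })
  def -(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a - b })
}
object Vector {
  def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble - 1))
}
// Lets the snippet write scalar * vector as well as vector * scalar.
implicit class Scalar(val s: Double) extends AnyVal {
  def *(v: Vector): Vector = v * s
}

case class Point(x: Vector, y: Double)  // y is the label: +1 or -1

// Parse one whitespace-separated line: "y x1 x2 ... xD".
def readPoint(line: String): Point = {
  val nums = line.trim.split("\\s+").map(_.toDouble)
  Point(Vector(nums.tail), nums.head)
}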
Logistic Regression Performance
• Hadoop: 127 s / iteration
• Spark: first iteration 174 s, further iterations 6 s
PageRank: Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
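The slide elides how the two RDDs are built; one plausible initialization (a sketch, assuming whitespace-separated "source dest" link pairs in HDFS and a SparkContext `sc`) is:

// links: (url, neighbors), cached because it is reused every iteration.
val links = sc.textFile("hdfs://...")
              .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
              .groupByKey()
              .cache()
// Every page starts with the same rank.
var ranks = links.mapValues(_ => 1.0)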
Spark Summary
• Fast, expressive cluster computing system compatible with Apache Hadoop.
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …).
• Improves efficiency (up to 100× faster) through:
• In-memory computing primitives
• General computation graphs
• Improves usability (often 2-10× less code) through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark Streaming
• Framework for large-scale stream processing
• Scales to 100s of nodes
• Can achieve second-scale latencies
• Integrates with Spark's batch and interactive processing
• Provides a simple batch-like API for implementing complex algorithms
• Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
Stateful Stream Processing
• Traditional streaming systems have an event-driven, record-at-a-time processing model:
• Each node has mutable state.
• For each record, a node updates its state and sends out new records.
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging.
[Diagram: input records flow into nodes 1 and 2, each holding mutable state, which feed node 3.]
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live data stream into batches of X seconds.
• Spark treats each batch of data as an RDD and processes it using RDD operations.
• Finally, the processed results of the RDD operations are returned in batches.
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
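The batch size X is fixed when the streaming context is created; a minimal sketch with the classic Spark Streaming API and 1-second batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 1-second batch of the live stream becomes one RDD.
val conf = new SparkConf().setAppName("DStreamExample")
val ssc  = new StreamingContext(conf, Seconds(1))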
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
• DStream: a sequence of RDDs representing a stream of data.
[Diagram: the Twitter streaming API feeds the tweets DStream; each batch (@ t, @ t+1, @ t+2) is stored in memory as an RDD (immutable, distributed).]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
• Transformation: modifies the data in one DStream to create another DStream.
[Diagram: flatMap is applied to each batch of the tweets DStream, producing a new RDD of the hashTags DStream ([#cat, #dog, …]) for every batch.]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
• Output operation: pushes data to external storage.
[Diagram: every batch of the hashTags DStream is saved to HDFS.]
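Assembled, the example only runs once the context is started; a sketch that keeps the slides' early `ssc.twitterStream` API (later Spark versions moved this to TwitterUtils):

// End-to-end sketch of the slides' example; credentials and getTags
// are placeholders from the slides.
val tweets   = ssc.twitterStream(twitterUsername, twitterPassword)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

ssc.start()             // no data is processed until the context starts
ssc.awaitTermination()  // block while the streaming computation runs
                        // (awaitTermination exists in later releases)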
Fault-tolerance
• RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
• Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
• Data lost due to a worker failure can be recomputed from the replicated input data.
[Diagram: the tweets RDD (input data replicated in memory) feeds the hashTags RDD via flatMap; lost partitions are recomputed on other workers.]
Key concepts
• DStream – sequence of RDDs representing a stream of data
• Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
• Transformations – modify data in one DStream to produce another DStream
• Standard RDD operations – map, countByValue, reduce, join, …
• Stateful operations – window, countByValueAndWindow, … (see the sketch below)
• Output operations – send data to an external entity
• saveAsHadoopFiles – saves to HDFS
• foreach – do anything with each batch of results
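For example, a sliding hashtag count over the last 10 minutes, updated every second, assuming the `hashTags` DStream from the earlier example:

import org.apache.spark.streaming.{Minutes, Seconds}

// Count hashtags over a 10-minute window that slides every second...
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

// ...or use the combined stateful operation directly (the incremental
// variant typically requires checkpointing to be enabled).
val tagCounts2 = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))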
Comparison with Storm and S4
Spark Streaming achieves higher per-node throughput than Storm:
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node