Real-Time Stream Processing CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook
Agenda • Apache Storm • Apache Spark
Traditional Data Processing • (Diagram: !!!ALL!!! the data flows into a batch pre-computation (aka MapReduce), which builds indexes that serve queries)
Traditional Data Processing • Slow... and views are out of date • (Timeline: data up to the last batch run is absorbed into batch views; everything that has arrived since then, up to now, is not absorbed)
Compensating for the real-time stuff • Need some kind of stream processing system to supplement our batch views • Applications can then merge the batch and real-time views together!
Enter: Storm • Open-Source project originally built by Twitter • Now lives in the Apache Incubator • Enables distributed, fault-tolerant real-time computation
A History Lesson on Twitter Metrics • (Diagrams: consuming the Twitter Firehose to compute metrics, before Storm)
Problems! • Scaling is painful • Fault-tolerance is practically non-existent • Coding for it is awful
Wanted to Address • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”
Storm Delivers • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”
Use Cases • Stream Processing • Distributed RPC • Continuous Computation
Storm Architecture • (Diagram: a Nimbus master node coordinates multiple Supervisor worker nodes through a ZooKeeper cluster)
Glossary • Streams • Constant pump of data as Tuples • Spouts • Source of streams • Bolts • Process input streams and produce new streams • Functions, Filters, Aggregation, Joins, Talk to databases, etc. • Topologies • Network of spouts and bolts
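To make the spout role concrete, here is a minimal sketch of a spout in Java. The class name, the sentences, and the sleep interval are illustrative only; the SentenceSpout used in the word-count example later in these slides is presumably similar but is not shown here.

import java.util.Map;
import java.util.Random;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Hypothetical spout: emits one random sentence per call to nextTuple()
public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    public void nextTuple() {
        Utils.sleep(100);  // don't spin too fast
        String sentence = SENTENCES[random.nextInt(SENTENCES.length)];
        // Emit a one-field tuple (no message id, so this sketch is not replayable)
        collector.emit(new Values(sentence));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));  // name the output field
    }
}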
Grouping • When a Tuple is emitted from a Spout or Bolt, where does it go? • Shuffle Grouping • Pick a random task • Fields Grouping • Consistent hashing on a subset of tuple fields • All Grouping • Send to all tasks • Global Grouping • Pick task with lowest ID
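As a sketch, these groupings are declared when bolts are wired into a topology. SplitSentence and WordCount are the bolts from the word-count example later in these slides; RandomSentenceSpout is the sketch above, and Printer and GlobalTally are hypothetical bolts added only to illustrate all and global grouping. The integer component IDs mirror the style of the snippets that follow.

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new RandomSentenceSpout(), 5);

// Shuffle grouping: each tuple goes to a random SplitSentence task
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);

// Fields grouping: tuples with the same "word" always hit the same WordCount task
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));

// All grouping: every Printer task sees every tuple
builder.setBolt(4, new Printer(), 4).allGrouping(3);

// Global grouping: all tuples funnel into the single lowest-ID GlobalTally task
builder.setBolt(5, new GlobalTally()).globalGrouping(3);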
Topology • (Diagram: a network of spouts and bolts; edges are labeled with their groupings: shuffle, fields groupings on ["id1", "id2"] and ["url"], and all)
Guaranteed Message Processing • A tuple has not been fully processed until all tuples in its “tuple tree” have been completed • If the tree is not completed within a timeout, it is replayed • Programmers need to use the API to ‘ack’ a tuple as completed
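A hedged sketch of what this looks like with the lower-level IRichBolt API (the class name here is made up; the IBasicBolt-style bolts used elsewhere in these slides ack automatically when execute returns). Emitting with the input tuple as an anchor is what links the new tuples into the tuple tree.

public static class SplitSentenceAcking extends BaseRichBolt {
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple input) {
    for (String word : input.getString(0).split(" ")) {
      // Anchor the new tuple to the input tuple: it joins the tuple tree
      collector.emit(input, new Values(word));
    }
    collector.ack(input);   // mark this node of the tree as completed
    // on failure we could call collector.fail(input) to trigger a replay
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}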
Stream Processing Example: Word Count
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
public static class SplitSentence extends ShellBolt implements IRichBolt {
  public SplitSentence() {
    super("python", "splitsentence.py");
  }
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}

#!/usr/bin/python
import storm

class SplitSentenceBolt(storm.BasicBolt):
  def process(self, tup):
    words = tup.values[0].split(" ")
    for word in words:
      storm.emit([word])
public static class WordCount implements IBasicBolt {
  Map<String, Integer> counts = new HashMap<String, Integer>();

  public void prepare(Map conf, TopologyContext context) {}

  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) { count = 0; }
    ++count;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  public void cleanup() {}

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}
Local Mode!
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();
Command Line Interface • Starting a topology: storm jar mycode.jar twitter.storm.MyTopology demo • Stopping a topology: storm kill demo
DRPC Example: Reach • Reach is the number of unique people exposed to a specific URL on Twitter • (Diagram: URL → tweeters of the URL → followers of each tweeter → distinct followers → count → reach)
Reach Topology • (Diagram: Spout → GetTweeters (shuffle) → GetFollowers (shuffle) → Distinct (fields grouping on ["follower-id"]) → CountAggregator (global grouping))
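A rough sketch of the wiring shown in the diagram, in the style of the earlier snippets. The bolt classes (GetTweeters, GetFollowers, Distinct, CountAggregator) are the ones named in the diagram but their implementations are not shown, the spout is a placeholder, and a real DRPC topology would use Storm's DRPC facilities (e.g. LinearDRPCTopologyBuilder and a DRPC spout/return bolt) rather than a plain spout.

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new UrlSpout());  // hypothetical source of [request-id, url] tuples

// Fan the work out: random tasks look up tweeters, then their followers
builder.setBolt(2, new GetTweeters(), 4).shuffleGrouping(1);
builder.setBolt(3, new GetFollowers(), 12).shuffleGrouping(2);

// Same follower always goes to the same Distinct task, so duplicates collapse
builder.setBolt(4, new Distinct(), 6).fieldsGrouping(3, new Fields("follower-id"));

// Everything funnels into one CountAggregator task to produce the reach count
builder.setBolt(5, new CountAggregator()).globalGrouping(4);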
Storm Review • Distributes code and configurations across the cluster • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies
Concern! • Say I have an application that involves many iterations... • Graph Algorithms • K-Means Clustering • Six Degrees of Bieber Fever • What's wrong with Hadoop MapReduce?
New Frameworks! • Researchers have developed new frameworks to keep intermediate data in memory • These only support specific computation patterns (Map... Reduce... repeat) • No abstractions for general re-use of data
Enter: RDDs • Or Resilient Distributed Datasets • Fault-tolerant parallel data structures that enable: • Persisting data in memory • Specifying partitioning schemes for optimal placement • Manipulating them with a rich set of operators
Apache Spark: Lightning-Fast Cluster Computation • Open-source top-level Apache project that came out of Berkeley in 2010 • General-purpose cluster computation system • High-level APIs in Scala, Java, and Python • Higher-level tools: • Shark for HiveQL on Spark • MLlib for machine learning • GraphX for graph processing • Spark Streaming
RDD Persistence and Partitioning • Persistence • Users can control which RDDs will be reused and choose a storage strategy • Partitioning • What we know and love! • Hash-partitioning based on some key for efficient joins
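A minimal sketch of both controls using Spark's Java API (the slides note that Java is one of the high-level APIs). The data, the partition count, and the class name are made up for illustration.

import java.util.Arrays;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.StorageLevels;
import scala.Tuple2;

public class RddControlsSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "rdd-controls-sketch");

    // A small pair RDD of (url, clicks) -- illustrative data only
    JavaPairRDD<String, Integer> clicks = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a.com", 10),
        new Tuple2<String, Integer>("b.com", 3)));

    // Partitioning: hash-partition by key so later joins on the same key
    // do not need to reshuffle this RDD
    JavaPairRDD<String, Integer> byUrl = clicks.partitionBy(new HashPartitioner(4));

    // Persistence: keep the partitioned RDD around, spilling to disk if needed
    byUrl.persist(StorageLevels.MEMORY_AND_DISK);

    System.out.println(byUrl.count());  // an action forces computation and caching
    sc.stop();
  }
}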
RDD Fault-Tolerance • Replicating data in-flight is costly and hard • Instead of replicating every data set, let's just log the transformations of each data set to keep its lineage • Loss of an RDD partition can be rebuilt by replaying the transformations • Only the lost partitions need to be rebuilt!
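One way to see the lineage Spark records for this is toDebugString(), which prints the chain of transformations behind an RDD. A tiny sketch via the Java API; the file path is a placeholder and the class name is made up.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LineageSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "lineage-sketch");

    JavaRDD<String> lines = sc.textFile("hdfs://...");   // base RDD
    JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() {
      public Boolean call(String line) { return line.startsWith("ERROR"); }
    });

    // Prints the lineage. If a partition of `errors` is lost, Spark re-reads
    // only the corresponding input split and re-applies the filter --
    // no replication of the data set is needed.
    System.out.println(errors.toDebugString());

    sc.stop();
  }
}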
RDD Storage • Transformations are lazy operations • No computation occurs until an action • RDDs can be persisted in memory, but are spilled to disk if necessary • Users can specify a number of flags when persisting the data: • Only on disk • Partitioning schemes • Persistence priorities
RDD Eviction Policy • LRU policy at an RDD level • New RDD partition is computed, but not enough space? • Evict partition from the least recently accessed RDD • Unless it is the same RDD as the one with the new partition
Example! Log Mining • Say you want to search through terabytes of log files stored in HDFS for errors and play around with them

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
Example! Log Mining
// Count the number of error logs
errors.count()

// Count errors mentioning MySQL
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning HDFS as an array
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
Spark Execution Flow • Nothing happens to errors until an action occurs • The original HDFS file is not stored in memory, only the final RDD • This greatly speeds up all future actions on the RDD
Spark PageRank
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs

for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }

  // Sum contributions by URL and get new ranks
  // (a is the random-reset probability, N the total number of pages)
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
Spark API • Every data set is an object, and transformations are invoked on these objects • Start with a data set, then transform it using operators like map, filter, and join • Then, do some actions like count, collect, or save
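To illustrate that flow via the Java API, here is a self-contained sketch (the data and class name are invented): a data set is built, transformed lazily with map and filter, and only the actions at the end trigger computation.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class ApiSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "api-sketch");

    // Start with a data set...
    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // ...transform it (lazy: nothing runs yet)...
    JavaRDD<Integer> squares = nums.map(new Function<Integer, Integer>() {
      public Integer call(Integer x) { return x * x; }
    });
    JavaRDD<Integer> big = squares.filter(new Function<Integer, Boolean>() {
      public Boolean call(Integer x) { return x > 4; }
    });

    // ...then run actions, which trigger the computation
    long howMany = big.count();
    List<Integer> values = big.collect();
    System.out.println(howMany + " -> " + values);

    sc.stop();
  }
}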