What’s New in Spark 0.6 and Shark 0.2 November 5, 2012 www.spark-project.org UC BERKELEY
Agenda • Intro & Spark 0.6 tour (Matei Zaharia) • Standalone deploy mode (Denny Britz) • Shark 0.2 (Reynold Xin) • Q & A
What Are Spark & Shark? • Spark: fast cluster computing engine based on general operators & in-memory computing • Shark: Hive-compatible data warehouse system built on Spark • Both are open source projects from the UC Berkeley AMP Lab
What is the AMP Lab? • 60-person lab focusing on big data • Funded by NSF, DARPA, and 18 companies • Goal: build an open-source, next-generation analytics stack
[Stack diagram: Streaming, Learning, and Graph libraries and Shark built on Spark, alongside Hadoop and MPI, all running on Mesos]
Some Exciting News • Three full-time developers recently joined AMP to work on these projects • We also encourage outside contributions! • This release: Shark server (Yahoo!), improved accumulators (Quantifind)
Spark 0.6 Release • Biggest release so far in terms of features • Biggest in terms of developers (18 total, 12 new) • Focus areas: ease-of-use and performance
Ease-of-Use • Spark already had good traction despite two fairly research-oriented aspects • The Scala language • The requirement to run on Mesos • A big goal of this release was to improve on these: • Java API (and an upcoming Python API) • Simpler deployment (standalone mode, YARN)
Java API • Scala: lines.filter(_.contains("error")).count() • The same in Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("error"); }
  }).count();
Java API Features • Supports all existing Spark features • RDDs, accumulators, broadcast variables • Retains type safety through specific classes for RDDs of special types • E.g. JavaPairRDD<K, V> for key-value pairs
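As a concrete illustration of these features, here is a minimal, hedged sketch of accumulators and broadcast variables through the Java API, assuming the 0.6 package layout (spark.api.java.*) and the intAccumulator, broadcast, and Accumulator.add/value methods described in its docs; the class name, sample data, and threshold are illustrative:
  import java.util.Arrays;
  import spark.Accumulator;
  import spark.api.java.JavaRDD;
  import spark.api.java.JavaSparkContext;
  import spark.api.java.function.VoidFunction;
  import spark.broadcast.Broadcast;

  public class JavaFeatures {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext("local", "JavaFeatures");

      // Broadcast variable: a read-only value shipped once to each node
      final Broadcast<Integer> threshold = sc.broadcast(3);

      // Accumulator: workers add to it; only the driver reads the result
      final Accumulator<Integer> matches = sc.intAccumulator(0);

      JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
      nums.foreach(new VoidFunction<Integer>() {
        public void call(Integer x) {
          if (x > threshold.value()) matches.add(1);
        }
      });

      System.out.println("Values above threshold: " + matches.value());
    }
  }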
Using Key-Value Pairs
  import scala.Tuple2;

  JavaRDD<String> words = ...;
  JavaPairRDD<String, Integer> ones = words.map(
    new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

  // Can now call ones.reduceByKey(), groupByKey(), etc.
More info: spark-project.org/docs/0.6.0/
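For completeness, a hedged sketch of the reduceByKey step mentioned above, summing the ones per word (the counts variable and the Function2 body are illustrative):
  import java.util.List;
  import scala.Tuple2;
  import spark.api.java.JavaPairRDD;
  import spark.api.java.function.Function2;

  // Sum the 1s for each word to get per-word counts
  JavaPairRDD<String, Integer> counts = ones.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });

  // Bring the results back to the driver
  List<Tuple2<String, Integer>> results = counts.collect();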
Coming Next: PySpark
  lines = sc.textFile(sys.argv[1])
  counts = lines.flatMap(lambda x: x.split(' ')) \
                .map(lambda x: (x, 1)) \
                .reduceByKey(lambda x, y: x + y)
Simpler Deployment • Refactored Spark’s scheduler to allow running on different cluster managers • Denny will talk about the standalone mode…
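As a quick taste of what that looks like from application code, a hedged sketch of pointing a Java program at a standalone cluster, assuming the two-argument (master URL, job name) JavaSparkContext constructor; the host and port are placeholders (7077 is the standalone master's usual default):
  import spark.api.java.JavaSparkContext;

  // Instead of a Mesos master or "local", pass the standalone
  // master's spark:// URL when creating the context.
  JavaSparkContext sc = new JavaSparkContext(
      "spark://masterhost:7077", "MyApp");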
Other Ease-of-Use Work • Documentation • Big effort to improve Spark’s help and Scaladoc • Debugging hints (pointers to user code in logs) • Maven Central artifacts spark-project.org/documentation.html
Performance • New ConnectionManager and BlockManager • Replaces the simple HTTP shuffle with a faster, async NIO implementation • Faster control plane (task scheduling & launch) • Per-RDD control of storage level
Some Graphs
[Performance graphs: Wikipedia search demo; large user app (2000 maps / 1000 reduces)]
Per-RDD Storage Level
  import spark.storage.StorageLevel
  val data = file.map(...)

  // Keep in memory, recompute when out of space
  // (default behavior with cache())
  data.persist(StorageLevel.MEMORY_ONLY)

  // Drop to disk instead of recomputing
  data.persist(StorageLevel.MEMORY_AND_DISK)

  // Serialize in-memory data
  data.persist(StorageLevel.MEMORY_ONLY_SER)
Compatibility • We’ve always strived to stay source-compatible! • The only change in this release is in configuration: spark.cache.class is replaced with per-RDD storage levels
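A hedged before/after sketch of that migration (the old cache class name is only illustrative of the pre-0.6 style, and this assumes JavaRDD exposes the same persist(StorageLevel) call as the Scala example above):
  import spark.api.java.JavaRDD;
  import spark.storage.StorageLevel;

  // Before 0.6: one global cache policy for all RDDs, set via a
  // system property (class name shown is illustrative):
  // System.setProperty("spark.cache.class", "spark.DiskSpillingCache");

  // Spark 0.6: the global setting is gone; choose a level per RDD.
  JavaRDD<String> data = sc.textFile("hdfs://...");
  data.persist(StorageLevel.MEMORY_AND_DISK);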
Shark 0.2 • Hive compatibility improvements • Thrift server mode • Performance improvements • Simpler deployment (comes with Spark 0.6)
Hive Compatibility • Hive 0.9 support • Full UDF/UDAF support • ADD FILE support for running scripts • User-supplied jars using ADD JAR
Thrift Server • Contributed by Yahoo!, compatible with the Hive Thrift server • Enables multiple clients to share cached tables • BI tool integration (e.g. Tableau)
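Since the server is wire-compatible with Hive’s, a standard Hive 0.9 JDBC client should be able to talk to it; a hedged sketch (the host, port, credentials, and table name are placeholders):
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class SharkJdbcExample {
    public static void main(String[] args) throws Exception {
      // Register the Hive 0.9 JDBC driver, then connect to the
      // Shark Thrift server as if it were a Hive server.
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      Connection conn = DriverManager.getConnection(
          "jdbc:hive://sharkserver:10000/default", "", "");
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table");
      while (rs.next()) {
        System.out.println("rows: " + rs.getLong(1));
      }
      conn.close();
    }
  }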
Performance
[Performance graphs: join (1B rows joined with 150M); group by (1B items, 150M distinct)]
Shark 0.3 Preview • In-memory columnar compression (dictionary encoding, run-length encoding, etc.) • Map pruning • JVM bytecode generation for expression evaluation • Persisting cached table metadata across sessions
Spark 0.7+ • Spark Streaming • PySpark: Python API for Spark • Memory monitoring dashboard