1 / 6

Making Big Data Analytics Interactive

Making Big Data Analytics Interactive. Matei Zaharia , in collaboration with Mosharaf Chowdhury , Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin , Michael Franklin, Scott Shenker , Ion Stoica AMP Lab, UC Berkeley. UC BERKELEY. Motivation.

ramla
Download Presentation

Making Big Data Analytics Interactive

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Making Big Data Analytics Interactive Matei Zaharia, in collaboration withMosharafChowdhury, Tathagata Das, Ankur Dave, Haoyuan Li,Justin Ma, Murphy McCauley, Joshua Rosen, ReynoldXin,Michael Franklin, Scott Shenker, Ion Stoica AMP Lab, UC Berkeley UC BERKELEY

  2. Motivation Data is growing faster than computation speeds • Web apps, mobile, scientific, … Requires large clusters to analyze • Programming clusters is hard • Failures, placement, load balancing

  3. Spark Programming Model Manipulate distributed datasets as you would local collections Concept: resilient distributed datasets (RDDs) • Distributed collections of objects that are automatically reconstructed on failure • Built through parallel transformations (map, filter, etc) • Can be cached in memory for fast reuse

  4. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Cache 1 lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2)) messages.cache() Worker results tasks Block 1 Master messages.filter(_.contains(“Linux”)).count() Cache 2 messages.filter(_.contains(“PHP”)).count() Worker Cache 3 Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5 sec (vs 170 s for on-disk data) Block 2 Worker Block 3

  5. Projects Built on Spark • Spark available in Java, Scala, Python • Growing open source community • 500-member meetup • 14 companies contributing • Using it to develop a stack offastanalytics tools • Shark SQL, Spark Streaming, … . . . Streaming Learning Hadoop, MPI Graph BlinkDB . . . Shark Spark Mesos Private Cluster Cloud Berkeley Data AnalyticsStack (BDAS)

  6. Find Out More • Visit us at the AMP Lab: 4th floor Soda Hall www.spark-project.org UC BERKELEY

More Related