Intro to Spark 0.7: PySpark and Streaming
February 21, 2013
www.spark-project.org
UC Berkeley
Agenda
• Introduction (Matei Zaharia)
• Python API (Josh Rosen)
• Spark Streaming (Tathagata Das)
What is Spark?
• Fast, next-generation data analysis platform started at UC Berkeley
• Multiple emerging workloads:
  • Batch, interactive, streaming
• Easy APIs in multiple languages:
  • Java, Scala, Python
• Growing higher-level stack (e.g. Shark)
[Stack diagram, top to bottom: Shark (SQL), Learning, Graph, Streaming, ... | Spark | Mesos, YARN, EC2]
Spark 0.7: Statistics
• Biggest release yet in terms of contributors
• 30 people contributed (19 non-Berkeley)
• 12 companies contributed code
• 28K lines of patches, 700 commits
Spark 0.7: Contributors
• Mikhail Bautin* • Denny Britz • Paul Cavallaro* • Tathagata Das • Thomas Dudziak* • Harvey Feng • Stephen Haberman* • Tyson Hamilton* • Mark Hamstra* • Michael Heuer* • Shane Huang* • Andy Konwinski • Ryan LeCompte* • Haoyuan Li • Richard McKinley* • Sean McNamara* • Lee Moon Soo* • Fernand Pajot* • Nick Pentreath* • Andrew Psaltis* • Imran Rashid* • Charles Reiss • Josh Rosen • Peter Sankauskas* • Prashant Sharma* • Shivaram Venkataraman • Patrick Wendell • Reynold Xin • Matei Zaharia • Eric Zhang*
* = non-Berkeley
Spark 0.7: Features
• Two major additions:
  • Python API (PySpark)
  • Spark Streaming alpha
• Many smaller ones:
  • Memory monitoring dashboard
  • Maven build & Debian packages
  • RDD checkpointing
  • Metadata cleanup (TTL)
  • Shuffle speedups
  • Improved EC2 scripts
  • …
Expect the release in 3-4 days!