CS 239 – Big Data Systems
Fall 2019
Harry Xu, UCLA
My Research Background
• Compilers and systems
  • Static and dynamic program analysis
  • Compiler
  • Runtime/operating systems
• Big Data Analytics
  • Dataflow systems
  • Graph systems
  • Machine learning systems
• Some industrial experience
  • Microsoft – created and developed an optimizing compiler for Cosmos/SCOPE that improved the overall performance of production jobs by up to 3x
  • IBM – created and developed a series of profiling tools for large-scale systems
[Figure labels: Big Data; system support for scalable program analysis; system support for scalable analytics]
[Figure labels: BigDatalog; application circle; infrastructure circle]
This Course: Big Data Systems
• What it is about
  • Low-level infrastructures
  • Programming models
  • Runtimes
  • Scalability and efficiency
• What it is NOT about
  • High-level applications
  • Workloads
  • Data collection and usage
• An example
  • We are going to discuss some papers on machine learning systems
  • We are NOT going to discuss learning models and algorithms
Industrial Relevance
• Many papers came directly from industry
  • GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
  • HDFS (Yahoo)
  • Azure, Trill, Dryad, Naiad (Microsoft)
  • Spark, Tachyon (Databricks)
• Applications vs. systems
  • Many people can develop applications
  • Few people can develop systems
  • Applications are specific to domains, while the skills required to build infrastructure are generic
Goals to Achieve
• Understand what systems are available for data analytics
• Understand fundamental challenges in system design
• Understand how to design a customized system for a certain workload
• Gain experience with system development by proposing and implementing a new idea
What This Course is Related To
• Distributed systems
• Database systems
• Computer architecture
• Networking
• Storage (memory, disk, file system, etc.)
• Graph algorithms
• Statistics
• Machine learning
Aspects of Big Data Processing
• Where to put data?
• How to process data at scale?
• How to process different types of data?
  • Structured data
  • Unstructured data
  • Streaming data
  • Graph data
  • Data for model training
• How to take advantage of technological advances?
• How to make processing efficient?
Topics Covered (I)
• Distributed storage systems
  • HDFS, GFS, Bigtable, Spanner, and Azure storage
• Dataflow engines
  • MapReduce, Dryad, AsterixDB, Spark
• Batch processing
  • Hive, Spark SQL, and SCOPE
• Resource management
  • Mesos, YARN, LATE, Borg, Sparrow
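To make the "dataflow engines" item concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern that MapReduce-style engines expose, using word count as the example. It is an illustration only, not the API of any listed system: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real engine would run each phase in parallel across machines with fault tolerance.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (hypothetical driver, single process)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data systems", "Big Data"]))
```

The user writes only the map and reduce functions; grouping by key sits between them, and in a distributed engine that grouping is the step that moves data across the network.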
Topics Covered (II)
• Stream processing
  • Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
• Graph processing
  • Pregel, Ligra, GraphChi, X-Stream, GridGraph
• Machine learning
  • TensorFlow, Parameter Servers, Project Adam
Why Do We Need Those Systems?
• Enablers
• Better performance
• Scalability
• Efficiency
• Energy
• Easy/flexible programmability
Course Structure
• Paper critiques
  • Due before each presentation day
• Presentation
  • 20-25 mins
  • Participation in active discussion
• Project
  • 2-3 students form a group, working on an innovative idea in system development
Things about Presentations/Critiques
• Reuse slides as much as possible
• A good rule of thumb is to follow this order
  • What problems does the paper solve?
  • Why are they (serious) problems?
  • Why aren’t they already solved?
  • What are the main challenges?
  • How did the authors overcome them?
  • What evidence did the authors show that the problems are solved?
  • Questions, concerns, opportunities for improvement