Scalable Data Science Systems

My research on Data Science
• Input? Large data sets, large files, many documents, many tables, all growing fast => Big Data
• How? Fast external algorithms; efficient data structures at two storage levels
• Parallel: multi-threaded or multi-node (distributed)
• Ideal goals: linear time O(n), linear speedup
• Hardware? Multicore CPU, GPU, or parallel cluster
• Infrastructure? Cloud, distributed memory, parallel file system
• Analytics: queries, cubes, statistics, machine learning
• Challenge: apply CS theory to programming
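As an illustration of the goals above (one pass over the input, linear time, and work that can be split across threads or nodes), here is a minimal Python sketch. It is not code from the research group: the `Partial` class, the `blocks` and `summarize` helpers, and the block size are hypothetical names chosen for this example. Each block is summarized independently into a small partial result, and partial results are merged, which is the property that lets the same computation run block by block from disk or in parallel.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Partial:
    """Mergeable summary of one block: count, sum, and sum of squares."""
    n: int = 0
    s: float = 0.0
    ss: float = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.s += x
        self.ss += x * x

    def merge(self, other: "Partial") -> "Partial":
        # Merging is associative and commutative, so blocks can be
        # processed in any order (or by different workers) and combined later.
        return Partial(self.n + other.n, self.s + other.s, self.ss + other.ss)


def blocks(stream: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield fixed-size blocks so only one block is held in memory at a time."""
    it = iter(stream)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def summarize(stream: Iterable[float], block_size: int = 1000) -> Partial:
    """One pass over the stream: O(n) time, O(block_size) memory."""
    total = Partial()
    for block in blocks(stream, block_size):
        part = Partial()
        for x in block:
            part.add(x)
        total = total.merge(part)
    return total


def mean_and_variance(p: Partial) -> Tuple[float, float]:
    """Population mean and variance from the summary.

    Note: this textbook formula can lose precision when values are large
    relative to their spread; it is used here only to keep the sketch short.
    """
    mean = p.s / p.n
    return mean, p.ss / p.n - mean * mean
```

For example, `mean_and_variance(summarize(values, block_size=4096))` reads `values` once and never holds more than one block in memory. In a multi-threaded or multi-node setting, each worker would build its own `Partial` and a coordinator would call `merge` on the results.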

Data Systems research today
• Transaction processing? Moving more into main memory, lock-free
• Efficient analysis? Joins, compiled queries, streams; exploit ample RAM and multi-core; leverage R/ScaLAPACK
• Compiler versus interpreter? Developers are moving more toward Python and JavaScript
• Massive storage? POSIX file system vs. HDFS
• Fast external algorithms? Simple tasks
• Parallel computation? Multi-core with threads; shared-nothing (embedded message passing)
• Exploiting new hardware? Interesting, difficult, but customized

Data Science involves core CS research: Theory + Programming
• Theory we use:
  • Time complexity (big O) and I/O cost (disk, solid-state memory)
  • Many data structures (arrays, trees, hash tables, linked lists)
  • Relational model and information retrieval models
  • Linear algebra
  • Multivariate statistics, machine learning models
  • Compilers and programming languages: parsing/compiling/optimizing code; recursion
• Programming:
  • Languages: C++ and Python; also Java, combined with R, SQL, Scala
  • Systems: Unix (Linux), Spark
  • OS: Unix, but we have a lot of past work on MS Windows .NET
  • Systems aspects: threads, text/binary I/O, parallel file systems, memory management, code generation, code optimization... a lot of fun

Typical Problems
• Summarization for linear models: vector outer products
• Exploration: cubes, lattices
• Graphs: transitive closure (linear recursion), clique enumeration
• Bayesian models: MCMC, classification, regression, variable/feature selection
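To make the "transitive closure (linear recursion)" item concrete, here is a minimal Python sketch. It assumes the usual linear-recursive definition, R(x, z) ← E(x, z) and R(x, z) ← R(x, y), E(y, z), and evaluates it with semi-naive iteration, a standard technique swapped in here for illustration; the slides do not say which evaluation strategy the group uses. The function name `transitive_closure` and the integer vertex IDs are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def transitive_closure(edges: Iterable[Edge]) -> Set[Edge]:
    """Return all pairs (u, w) such that w is reachable from u by a path
    of one or more edges."""
    edge_set = set(edges)

    # Index edges by source vertex so each join step is a hash lookup.
    successors: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edge_set:
        successors[u].add(v)

    closure = set(edge_set)
    delta = set(edge_set)  # pairs discovered in the previous round
    while delta:
        new_pairs: Set[Edge] = set()
        # Linear recursion: join only the newest pairs with the base edges,
        # instead of re-joining the whole closure every round.
        for u, v in delta:
            for w in successors.get(v, ()):
                pair = (u, w)
                if pair not in closure:
                    new_pairs.add(pair)
        closure |= new_pairs
        delta = new_pairs
    return closure
```

The loop stops when a round discovers no new pair, so it also terminates on graphs with cycles. On a simple chain `1 → 2 → 3 → 4`, the result contains the three input edges plus `(1, 3)`, `(2, 4)`, and `(1, 4)`.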

Why join my group?
• Balance between theory (mathematics) and programming (C++)
• Lots of machine learning and graph analytics
• Build libraries and tools to help analysts
• Many scientific applications
