Scalable Data Science Systems
My research on Data Science
• Input? Large data sets, large files, many documents, many tables, fast growing => Big Data
• How? Fast external algorithms; efficient data structures at two storage levels
• Parallel: multi-threaded or multi-node, distributed
• Ideal goals: linear time O(n), linear speedup
• Hardware? Multicore CPU, GPU, or parallel cluster
• Infrastructure? Cloud, distributed memory, parallel file system
• Analytics: queries, cubes, statistics, machine learning
• Challenge: apply CS theory to programming
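As a rough illustration of the "fast external algorithms" and "linear time O(n)" goals, here is a minimal Python sketch of a one-pass summary that reads its input in fixed-size blocks, so memory use is bounded by the block size rather than by the data set size. The function names (`read_blocks`, `summarize_stream`), the one-number-per-line input format, and the block size are assumptions made for this example; they are not taken from the slides.

```python
import io


def read_blocks(stream, block_size=1000):
    """Yield lists of at most block_size floats, one value per input line."""
    block = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        block.append(float(line))
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block  # final, possibly shorter, block


def summarize_stream(stream, block_size=1000):
    """Single pass over the data: count, mean, and population variance."""
    n, total, total_sq = 0, 0.0, 0.0
    for block in read_blocks(stream, block_size):
        n += len(block)
        total += sum(block)
        total_sq += sum(v * v for v in block)
    if n == 0:
        raise ValueError("empty input")
    mean = total / n
    # Sum-of-squares formula: simple, but it can lose precision when the
    # variance is tiny compared with the mean.
    variance = total_sq / n - mean * mean
    return n, mean, variance


if __name__ == "__main__":
    # An in-memory stream stands in for a large file on disk.
    data = io.StringIO("\n".join(str(v) for v in range(1, 11)))
    print(summarize_stream(data, block_size=3))
```

Each value is read once and touched a constant number of times, so the running time is O(n) and the I/O cost is a single sequential scan.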
Data Systems research today
• Transaction processing? Moving into main memory, lock-free
• Efficient analysis? Joins, compiled queries, streams; exploit ample RAM and multi-core; leverage R/ScaLAPACK
• Compiler versus interpreter? Development is moving toward Python and JavaScript
• Massive storage? POSIX file system vs. HDFS
• Fast external algorithms? Simple tasks
• Parallel computation? Multi-core with threads; shared-nothing (embedded message passing)
• Exploiting new hardware? Interesting, difficult, but customized
Data Science involves core CS research: Theory + Programming
• Theory we use:
  • Time complexity (big O()) and I/O cost (disk, solid state memory)
  • Many data structures (arrays, trees, hash tables, linked lists)
  • Relational model and information retrieval models
  • Linear algebra
  • Multivariate statistics, machine learning models
  • Compilers and programming languages: parsing/compiling/optimizing code; recursion
• Programming:
  • Languages: C++ and Python; also Java, combined with R, SQL, Scala
  • Systems: Unix (Linux), Spark
  • OS: Unix, but we have a lot of past work on MS Windows .NET
  • Systems aspects: threads, text/binary I/O, parallel file systems, memory management, code generation, code optimization, ... a lot of fun
Typical Problems
• Summarization for linear models: vector outer products
• Exploration: cubes, lattices
• Graph transitive closure (linear recursion), clique enumeration
• Bayesian models: MCMC, classification, regression, variable/feature selection
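To make "summarization for linear models: vector outer products" concrete, here is a minimal Python sketch of one common formulation: a single pass accumulates the count n, the linear sum L = Σ x_i, and the sum of outer products Q = Σ x_i x_iᵀ, from which the mean and covariance follow without rescanning the data. The names `summarize_outer`, `merge_summaries`, and `mean_and_covariance` are hypothetical, and this is an assumed reading of the bullet, not necessarily the group's actual implementation.

```python
def summarize_outer(points, d):
    """One pass over d-dimensional points: n, L = sum x, Q = sum x x^T."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in points:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]  # entry (a, b) of the outer product
    return n, L, Q


def merge_summaries(s1, s2):
    """Combine summaries of two disjoint partitions by simple addition."""
    n1, L1, Q1 = s1
    n2, L2, Q2 = s2
    d = len(L1)
    L = [L1[a] + L2[a] for a in range(d)]
    Q = [[Q1[a][b] + Q2[a][b] for b in range(d)] for a in range(d)]
    return n1 + n2, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and population covariance from (n, L, Q)."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


if __name__ == "__main__":
    pts = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
    print(mean_and_covariance(*summarize_outer(pts, 2)))
```

Because the summaries are additive, each thread or node can summarize its own partition and the partial results can be merged at the end, which fits the parallel, shared-nothing setting described earlier.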
Why join my group?
• Balance between theory (mathematics) and programming (C++)
• Lots of machine learning and graph analytics
• Build libraries and tools to help analysts
• Many scientific applications