Dive into fast external algorithms, parallel computation, and efficient analysis in scalable data science systems. Explore hardware, infrastructure, analytics, and apply CS theory to programming challenges. Embrace various programming languages and systems aspects to solve typical problems in machine learning and graph analytics.
My research on Data Science
• Input? Large data sets, large files, many documents, many tables, fast growing => Big Data
• How? Fast external algorithms; efficient data structures at two storage levels
• Parallel: multi-threaded or multi-node, distributed
• Ideal goals: linear time O(n), linear speedup
• Hardware? Multicore CPU, GPU or parallel cluster
• Infrastructure? Cloud, distributed memory, parallel file system
• Analytics: queries, cubes, statistics, machine learning
• Challenge: apply CS theory to programming
Data Systems research today
• Transaction processing? Moving more into main memory, lock-free
• Efficient analysis? Joins, compiled queries, streams; exploit ample RAM and multi-core; leverage R/ScaLAPACK
• Compiler versus interpreter? Development is moving more toward Python and JavaScript
• Massive storage? POSIX file system vs HDFS
• Fast external algorithms? Simple tasks
• Parallel computation? Multi-core with threads, shared-nothing (embedded message-passing)
• Exploiting new hardware? Interesting, difficult, but customized
Data Science involves Core CS research: Theory + Programming
• Theory we use:
  • Time complexity (big O()) and I/O cost (disk, solid state memory)
  • Many data structures (arrays, trees, hash tables, linked lists)
  • Relational model and information retrieval models
  • Linear algebra
  • Multivariate statistics, machine learning models
  • Compilers and programming languages: parsing/compiling/optimizing code; recursion
• Programming:
  • Languages: C++ and Python; also Java, combined with R, SQL, Scala
  • Systems: Unix (Linux), Spark
  • OS: Unix, but we have a lot of past work on MS Windows .NET
  • Systems aspects: threads, text/binary I/O, parallel file systems, memory management, code generation, code optimization, ... a lot of fun.
Typical Problems
• Summarization for linear models: vector outer products
• Exploration: cubes, lattices
• Graph transitive closure (linear recursion), clique enumeration
• Bayesian models: MCMC, classification, regression, variable/feature selection
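Two of the problems above, summarization with vector outer products and graph transitive closure, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the group's actual implementation: the function names are hypothetical, the data is assumed to arrive as an iterable of numeric rows read once, the covariance shown is the population form, and semi-naive iteration is one common way to evaluate linear recursion (the slides do not say which evaluation strategy is used).

```python
def summarize(rows):
    """Single pass over the data: count n, linear sum L, and Q = sum of x * x^T."""
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]  # one entry of the outer product x * x^T
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean and population covariance from the summary alone."""
    d = len(L)
    mu = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov


def transitive_closure(edges):
    """Semi-naive evaluation of the linear recursion R = E union (R join E)."""
    edges = set(edges)
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    closure = set(edges)
    delta = set(edges)  # pairs discovered in the previous iteration
    while delta:
        # Join only the newly found pairs with the base edge table.
        joined = {(u, w) for (u, v) in delta for w in successors.get(v, ())}
        delta = joined - closure
        closure |= delta
    return closure


if __name__ == "__main__":
    n, L, Q = summarize([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
    print(mean_and_covariance(n, L, Q))
    print(sorted(transitive_closure([(1, 2), (2, 3), (3, 4)])))
```

The point of the summary is that the data set is scanned once in O(n) time, and the linear model statistics are then computed from the small d x d matrix instead of the raw data. In the closure sketch, each iteration joins only the new pairs with the edge table, so the loop stops as soon as no new pair appears, including on graphs with cycles.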
Why join my group?
• Balance between theory (mathematics) and programming (C++)
• Lots of machine learning and graph analytics
• Build libraries and tools to help analysts
• Many scientific applications