Database Systems
What is “Database systems” research?
• Input? Large data sets, large files, relational tables
• How? Fast external algorithms; RAM-efficient data structures at two storage levels
• Efficiency? Desirable linear time complexity O(n)
• Hardware? Small computer, single server, parallel DBMS server, parallel cluster; disk, RAID
• Infrastructure? DBMS, parallel file system; any large files
• Boring? Theory+programming
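To make "fast external algorithms" with O(n) time concrete, here is a minimal sketch of a one-pass scan that computes the mean and variance of a numeric column while holding only three accumulators in RAM. It is an illustration, not something taken from the slides: the function name `one_pass_stats` and the one-number-per-line input layout are assumptions.

```python
import io


def one_pass_stats(lines):
    """Scan the input once: O(n) time, O(1) memory regardless of file size.

    `lines` is any iterable of text lines (hypothetical layout: one number
    per line), so a file on disk is read sequentially and never fully loaded.
    """
    n = 0
    total = 0.0      # running sum of x
    total_sq = 0.0   # running sum of x*x
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return n, mean, variance


# Stand-in for a large file; open("data.txt") would be used the same way.
data = io.StringIO("1\n2\n3\n4\n")
n, mean, variance = one_pass_stats(data)
```

The same pattern (sequential read, constant-size state) is what keeps the I/O cost proportional to the file size. The sum-of-squares formula can lose precision on badly scaled data, so a production version might prefer a numerically stabler update.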
Database systems research today
• Transaction processing? Done
• Efficient querying? Done
• Fast external algorithms? Simple tasks
• Parallel computation? Well-proven DBMS technology, still many challenges
• Exploiting new hardware? Difficult
• Analyzing? Most difficult: data mining, stats
• Future? Information integration (db+docs)
DB Systems involves Core CS research: Theory+Programming
• Theory we use:
  • Time complexity and I/O cost
  • Data structures; especially external
  • Relational model is here to stay
  • Multivariate statistics, machine learning, discrete math
  • Numerical methods
  • Compilers: parsing/compiling/optimizing code; recursion
• Programming (even some hacking):
  • Systems in a broad sense
  • Languages: C, C++; efficiency, low-level pointer manipulation, legacy code; Java, C# mainly for portability
  • Numerical, OS libraries
  • DBMS
  • SQL
  • UDFs
  • API with C, C++, C#
Research topics
• GOAL: Integrating statistical and machine learning algorithms with a DBMS (external algorithms, queries, UDFs)
• Difference from machine learning algorithms: size, external algorithms (small RAM), queries, low-level optimization, generally simpler models
• Main topics by students:
  • Zhibo Chen: OLAP cubes, parametric statistical tests, cube ops on flash memory
  • Mario Navas: Singular Value Decomposition for PCA and ML Factor Analysis, data summarization on multicore CPUs
  • Carlos Garcia-Alvarado: keyword search across docs and db, ranking, query recommendation
  • Sasi Pitchaimalai: Bayesian classification, multithreaded summarization
  • Kai Zhao: predictive association rules, frequent subgraphs
  • Manish Limaye: ER modeling for data pre-processing
  • Anu Goyal: accelerating convergence of EM for mixtures of Gaussians
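As an illustration of pushing a statistical computation into the DBMS with a query, the sketch below uses one SQL aggregate pass to collect summary statistics (count, sums, sums of products) and then fits a simple linear regression from those few numbers. Everything concrete here is an assumption made for the example: the SQLite engine, the table name `points`, its columns `x1` and `x2`, and the toy data. It is not presented as the group's implementation.

```python
import sqlite3

# In-memory SQLite database standing in for a real DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x1 REAL, x2 REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
)

# One table scan inside the DBMS: only six numbers leave the database,
# no matter how many rows the table holds.
row = conn.execute(
    """
    SELECT COUNT(*), SUM(x1), SUM(x2),
           SUM(x1 * x1), SUM(x2 * x2), SUM(x1 * x2)
    FROM points
    """
).fetchone()
n, s1, s2, q11, q22, q12 = row

# The model is then computed from the summary alone, outside the scan.
mean1, mean2 = s1 / n, s2 / n
var1 = q11 / n - mean1 ** 2
cov12 = q12 / n - mean1 * mean2
slope = cov12 / var1                 # regression of x2 on x1
intercept = mean2 - slope * mean1

conn.close()
```

The design point is that the expensive part (the scan over a large table) runs as a query next to the data, while the model-fitting arithmetic works on a tiny summary that fits in RAM.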
Representative problems
• Finding predictive association rules
• OLAP cubes
• Clustering, PCA and regression
• Bayesian classification
Why is our database systems research “cool”?
• Theory+Programming
• Optimization, O(f(n)), systems (external data structures, discrete math, compiler, OS)
• Goes from hardware-level stuff (multi-core, cache memory) to high-level query optimization in SQL
• Database systems techniques are used in search engines like Google and Yahoo (and vice-versa)
• DBMS technology used everywhere
Why join the DBMS group?
• Balance between theory (math) and programming
• We target “cool” conferences: ACM SIGMOD (core database systems); ACM CIKM (IR+DB+DM); IEEE ICDM (DM)
• Mature and stable CS research area
• Job/internship opportunities in DBMS and search engines; job security in any IT department
• Visit my web page, DBLP. Google “Ordonez SQL”