DBMS Group Overview of our research

Carlos Ordonez, University of Houston • DBMS Group: Overview of our research

Outline • Research on DBS overview • Research topics • Papers • Working with me • Advice & Recommendations • Members

Notes • Hands on • no math or code! • Same presentation I give to my students • Once a year • Take notes!

OVERVIEW: AREAS & PROGRAMMING

DB systems today • Modeling: ER, UML, temporal, workflows, metadata, docs • Query languages: relational, logic • File systems and indexing: blocked, row store, B-trees, hash, bitmap • Query optimization: SPJA, recursion • Transaction processing: ACID, 2PL • Analytics: OLAP cubes and DM • Non-relational: provenance, keyword, column, XML, sensors, probabilistic

Research: Database Systems • Both CS main areas • Theory: sets, external algorithms, discrete math, relational, complexity, cost analysis • Systems: large indexed files, query optimizer • Software • Languages: C++ and SQL, but also Java and C#. • Systems: DBMS, Unix, MS-DOS • Math Tools: R, WEKA, SAS, Matlab • Libraries: LAPACK, BLAS

Knowledge Background • Computer Science • algorithms: linear, blocked, complexity, parallel • data structures: external, secondary storage • OS: multithreaded programming, file systems, parallel programming (shared-nothing) • Compilers: parsing, optimization, query languages • Mathematics: • Discrete math: sets, combinatorial math, graphs, boolean algebra, algebra, OR • Continuous math: probability, multivariate statistics, numerical methods, optimization

RESEARCH TOPICS

Research Topics • Main: • Integrating data mining algorithms with a DBMS • Other: • Query optimization: data pre-processing, linear recursive queries, transformation, cubes • Medical data mining (heart disease, cancer) • Data quality: referential integrity, distributed DBs • Information retrieval, keyword search, database and document integration, recommendation

Research Topic: Data Mining • Analytics • OLAP Cubes • Statistics: probability, multivariate stats, categorical data analysis, Bayesian, time series • Machine Learning, but Statistical Learning; much less pure AI: generative models • NOT • generic DM algorithm (complex math on flat files), • “mumbo jumbo” AI algorithms (Bayesian nets, SVM, Kohonen, FA, GMM, MRF, MCMC, ROC)

What we don’t do • Algorithms incompatible with a traditional DBMS • AI, machine learning (saturated) • Text mining • image processing, pattern recognition • search engine infrastructure • generic data mining

Specific topics, all “in a DBMS” • Numerical analysis • Bayesian statistics • Machine Learning models • Linear recursive queries • Measuring and repairing referential integrity • OLAP cubes, association rules and graphs • Keyword search • Query recommendation

PAPERS

Proceedings vs journals • Only CS has proceedings. Likely to change in future • Most journal papers appear in proceedings first • Thomson Reuters neutral to all areas, but incomplete and $$

A good CS publication: check yourself in Google! • DBLP • ACM • Has citations in Google Scholar • Has impact factor on MS Academic Search • A good paper in other areas: • Math: MathSciNet • Medical: PubMed • Thomson Reuters

Preferred publications: proceedings • Top proceedings • 1st: ACM (factor >20, SIG*) • 2nd: IEEE (stronger in imaging, AI, medical) • 3rd: LNCS from Springer • 4th: other, no proceedings, high % papers accepted, unknown PC members, local

Preferred publications: journals • 1st in CS: ACM, IEEE • 1st outside CS: SIAM, AMA, ASA • 2nd: Elsevier, Springer, IOS Press, Kluwer (Metapress) • 3rd: IGI, Wiley, Blackwell, Inderscience, KIISE • Impact factor (IF): • Thomson Reuters, Microsoft Academic, not yet Google Scholar • IF above 0.5 is OK, 1 is good, 2 is very good.

WORKING WITH ME

Recommendations • Read DBLP • Check ACM • Read DBWorld • Read CACM • Read SIGMOD site

Students: My experience • You need a lot of guidance • Most of you do not read papers on your own • Not everyone can do research: not motivated, lack initiative, lack knowledge • Few have good DB systems and math background • Some people are slow/bad programmers • Math background varies, but generally not good enough in linear algebra, calculus, multivariate statistics and discrete math

Classification of students • BAD • LAZY: excuses • DUMB: does not understand, sleepy, wishful thinking • LOST: I want to work with someone else on their project • RANDOM: X+Y+.. different ideas without thought • GOOD • CREATIVE: I have this idea backed by these experimental results • WINNER: I made several comparisons and my program is faster or more accurate • CRAZY: I read this paper and I think my idea is better because of XYZ reasons • SCIENTIST: I have a draft of a paper with theory and experiments, here it is

Bad student: typical comments • This stuff looks too difficult • Why resubmit a rejected paper? Not worth it • I cannot write well; it is hard and boring • I cannot follow your notation, I prefer to use one I invented; mine is better • I did some programming, but it is at home/USB • Finally here is my paper (a draft paper full of spelling errors and disconnected phrases) • I forgot to tell you I was not coming (yesterday, another day) • I did not read any new papers • Where is your web page? What is ACM? What is DBLP? What is DBWorld? • Can you help me debug my program? • Random ideas to do new stuff

Good student: am I dreaming? • I did some theoretical analysis that I want to discuss with you • I think this aspect from one of your papers or this famous paper can be improved • I have submitted two papers every semester • I have at least one paper accepted per year • My paper is ready two weeks before the deadline • Can we briefly meet on the weekend? • I would like to stay late (8pm) two days per week, can you let me know when you are around?

My requirements • B.S.: open • MS: One ACM/IEEE accepted paper; journal paper desirable, but not required • PhD • Preferred: 2 accepted journal papers; 5 papers total • Alternative: 10+ proceedings papers • 3+ papers on the same topic; 1+ paper per year • Defend dissertation between 4th and 5th year

Research method • Hypothesis • New knowledge • In context, knowledge boundary • Important, current, needed • tell me something I don't know • Prove it • Theory: theorem (property, bound, existence,.) • Experiment: feasible, better, faster • Theory versus Experiment? mention Einstein

Work routine • Send me status every week, Thursday morning preferred • Send paper, results, comments BEFORE we talk • My time is PRIME time. Do not waste it. • Compile experimental results in an .xls spreadsheet • Collect all status and my answers in a log file • Avoid excuses; avoid “forgetting” email • We can talk on the phone in the morning preferably (<30 minutes)

My commitment • Answer e-mail within 2 days unless traveling • Can call you any time • Guidance to submit papers • Give you lead in proceedings papers • Take lead in journal papers

Deliverables • Experiments: spreadsheet with one sheet per week (main ingredient for a strong paper) • Program: C++ or Java modules that can be independently called (easy to use by someone else) • Source code: follow math notation from my papers, index arrays from 1, log comments, parameters, config • Papers: latex, db.bib, send me pdf in ACM 2-column format by default

How you will be when finished • Understand how to apply math, theory • Strong C++, SQL programmer • More organized, proactive, creative • Better writing, better presenter

Job prospect: excellent • Corporate world runs on DBMSs, academics underexploit DBMSs • DBMS industry: Oracle, Microsoft, Teradata, Greenplum, IBM • Search Engines: Google, Yahoo • Academic positions: always vacant positions due to industry demand

ADVICE

Quotes • Datta: Every paper has a home • Ezquerra: A significant knowledge contribution should appear in journal form • Vardi: journals should be #1 • Journal papers last forever (archival) • CO: Citation impact: 0 worthless, 1 even, 5+ OK, 10+ good, 100+ famous • Gray: There are lies, damn lies and benchmarks • Europe: Theory is the compass of CS programming • People do care who is 1st author, except in theory; being nth author of 4+ authors is not worth much; 2-author papers are optimal, 3-author papers the upper bound • CO: Free ride OK in proceedings; rarely free ride in journal

CS vs other areas • A person in CS will not respect you if you do not have a paper in a top conference • A person outside CS will not respect you if you do not have a journal paper • People in industry will not respect you if you do not have a patent, but proceedings help!

Advice (mainly academic PhD) • Avoid gaps in your publication record (year without papers) • Always have journal papers in the pipeline • Submit to top conference every year • Choose 1 or 2 applications, careful 3+ • Paper count • Ph.D. without many papers is worthless, 1 journal • M.S. with 1 paper is exceptional • B.S. 1 paper automatic acceptance to PhD

Recommendations: programming • Try ideas soon (discussed with me) • Always ask yourself O(n) • Index arrays from 1..d, 1..n in C++ and Java (Example double X[d+1]; X[1]=0) • Choose an acronym for your research; 8 letters (lrq, udfmodel, olaptest, ssvs) • Every file you send me should have the acronym as prefix • Parameter passing for experiments and GUI; every algorithm has one main call • C++: Follow GNU C++, plain editor preferred • SQL: reserved keywords uppercase, tables, columns lower case, indent, one term per line
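The 1-based indexing and "one main call" recommendations above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the `udfmodel` prefix borrows one of the example acronyms from the slide, while `UdfmodelParams` and `udfmodel_mean` are hypothetical names, and `std::vector` stands in for the raw array `double X[d+1]` of the slide so that d and n can be chosen at run time.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle: all experiment settings travel through one struct.
struct UdfmodelParams {
    std::size_t n;  // number of points
    std::size_t d;  // number of dimensions
};

// Single entry point of the (hypothetical) module: column means of an n x d data set.
// X uses 1-based indexing: X[i][a] is valid for i in 1..n and a in 1..d,
// so X must hold n+1 rows of d+1 entries; row 0 and column 0 are unused padding.
// Cost: O(dn) time, O(d) extra space.
std::vector<double> udfmodel_mean(const std::vector<std::vector<double>>& X,
                                  const UdfmodelParams& p) {
    std::vector<double> mu(p.d + 1, 0.0);  // mu[0] is unused
    if (p.n == 0) {
        return mu;  // nothing to average
    }
    for (std::size_t i = 1; i <= p.n; ++i) {
        for (std::size_t a = 1; a <= p.d; ++a) {
            mu[a] += X[i][a];
        }
    }
    for (std::size_t a = 1; a <= p.d; ++a) {
        mu[a] /= static_cast<double>(p.n);
    }
    return mu;
}
```

Allocating d+1 and n+1 entries costs one extra slot per dimension, but the loop bounds then match the subscripts 1..n and 1..d used in the math notation.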

Recommendations: writing • 1st Section 4! End with abstract & related work • Novel writing: be clear Section 2, increase interest Section 3, deliver in Section 4, political Section 5 • Avoid creating multiple versions of same file; instead use source control • Write in ACM format by default (www-acm.cls) • Use our master db.bib • Put everything in zip file with acronym • Log main changes • Keep paper reviews handy

RESEARCH ACCOMPLISHMENTS

Visit my web page • Members • Articles • Courses

Research Accomplishments • Over 70 articles • 20 journal articles, 15 in Thomson Reuters • 50 ACM/IEEE proceedings • 800 citations Google Scholar • 15 patent applications; 9 patents • 57 articles on DBLP, 56 on ACM • H-index=14 per Google Scholar, 8 MS • Students • 1 PhD dissertation: Javier Garcia-Garcia • 6 MS theses • 2 PhD dissertations, in progress (Zhibo, Carlos G)
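For readers unfamiliar with the H-index quoted above: it is the largest h such that h papers have at least h citations each. A minimal C++ sketch of that standard definition follows; the function name `h_index` is illustrative, and this is not how Google Scholar or MS Academic compute their figures internally, which is why the two sources can report different values.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the largest h such that at least h papers have >= h citations each.
// Takes the citation counts by value because it sorts its own copy.
std::size_t h_index(std::vector<int> citations) {
    // Most-cited papers first.
    std::sort(citations.begin(), citations.end(), std::greater<int>());
    std::size_t h = 0;
    // The paper at 0-based position h is the (h+1)-th most cited paper;
    // it counts toward the index only if it has at least h+1 citations.
    while (h < citations.size() &&
           citations[h] >= static_cast<int>(h + 1)) {
        ++h;
    }
    return h;
}
```

Sorting dominates the cost, so the sketch runs in O(n log n) for n papers.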

My Publications • Conferences: • 1st: SIGMOD, CIKM, KDD, • 2nd: ICDM, ICDE, ADL • 3rd: MLDM • Journals: • 1st: IEEE TKDE, IEEE TITB, ACM TODS, • 2nd: DKE (Elsevier), KAIS (Springer), DSS (Elsevier), IDA (IOS Press) • 3rd: JCSE

Members • 4 PhD students: • Zhibo Chen (US, Halliburton) • Carlos Garcia-Alvarado (Mexico, Greenplum) • Mario Navas (Ecuador) • Sasi K. Pitchaimalai (India)

Old and new students • Old MS students: • Anu Goyal (India) • Kai Zhao (China) • Georgey Golovko (Ukraine) • Waree Rinsurong (Thailand) • Ahmad Qwasmeh (Jordan) • Rengan Xu (China) • New MS students: • Manish Limaye (India) • Manas Saha (India) • Naveen Mohanam (India)
