Get an overview of our research on Database Systems - topics, papers, collaboration, and recommendations. Dive into areas like modeling, query optimization, analytics, and more. Discover our research interests in integrating data mining algorithms with DBMS and data quality enhancement.
Carlos Ordonez, University of Houston, DBMS Group: Overview of our research
Outline • Research on DBS overview • Research topics • Papers • Working with me • Advice & Recommendations • Members
Notes • Hands on • no math or code! • Same presentation I give to my students • Once a year • Take notes!
DB systems today • Modeling: ER, UML, temporal, workflows, metadata, docs • Query languages: relational, logic • File systems and indexing: blocked, row store, B-trees, hash, bitmap • Query optimization: SPJA, recursion • Transaction processing: ACID, 2PL • Analytics: OLAP cubes and DM • Non-relational: provenance, keyword, column, XML, sensors, probabilistic
Research: Database Systems • Both CS main areas • Theory: sets, external algorithms, discrete math, relational, complexity, cost analysis • Systems: large indexed files, query optimizer • Software • Languages: C++ and SQL, but also Java and C#. • Systems: DBMS, Unix, MS-DOS • Math Tools: R, WEKA, SAS, Matlab • Libraries: LAPACK, BLAS
Knowledge Background • Computer Science • Algorithms: linear, blocked, complexity, parallel • Data structures: external, secondary storage • OS: multithreaded programming, file systems, parallel programming (shared-nothing) • Compilers: parsing, optimization, query languages • Mathematics: • Discrete math: sets, combinatorial math, graphs, boolean algebra, algebra, OR • Continuous math: probability, multivariate statistics, numerical methods, optimization
Research Topics • Main: • Integrating data mining algorithms with a DBMS • Other: • Query optimization: data pre-processing, linear recursive queries, transformation, cubes • Medical data mining (heart disease, cancer) • Data quality: referential integrity, distributed DBs • Information retrieval, keyword search, database and document integration, recommendation
Research Topic: Data Mining • Analytics • OLAP Cubes • Statistics: probability, multivariate stats, categorical data analysis, Bayesian, time series • Machine Learning, but Statistical Learning; much less pure AI: generative models • NOT • generic DM algorithm (complex math on flat files), • “mumbo jumbo” AI algorithms (Bayesian nets, SVM, Kohonen, FA, GMM, MRF, MCMC, ROC)
What we don’t do • Algorithms incompatible with a traditional DBMS • AI, machine learning (saturated) • Text mining • image processing, pattern recognition • search engine infrastructure • generic data mining
Specific topics, all “in a DBMS” • Numerical analysis • Bayesian statistics • Machine Learning models • Linear recursive queries • Measuring and repairing referential integrity • OLAP cubes, association rules and graphs • Keyword search • Query recommendation
Proceedings vs journals • Only CS has proceedings. Likely to change in the future • Most journal papers appear in proceedings first • Thomson Reuters is neutral to all areas, but incomplete and $$
A good CS publication: check yourself in Google! • DBLP • ACM • Has citations in Google Scholar • Has impact factor on MS Academic Search • A good paper in other areas: • Math: MathSciNet • Medical: PubMed • Thomson Reuters
Preferred publications: proceedings • Top proceedings • 1st: ACM (factor >20, SIG*) • 2nd: IEEE (stronger in imaging, AI, medical) • 3rd: LNCS from Springer • 4th: other, no proceedings, high % papers accepted, unknown PC members, local
Preferred publications: journals • 1st in CS: ACM, IEEE • 1st outside CS: SIAM, AMA, ASA • 2nd: Elsevier, Springer, IOS Press, Kluwer (Metapress) • 3rd: IGI, Wiley, Blackwell, Inderscience, KIISE • Impact factor (IF): • Thomson Reuters, Microsoft Academic, not yet Google Scholar • IF above 0.5 OK, 1 good, 2 is very good.
Recommendations • Read DBLP • Check ACM • Read DBWorld • Read CACM • Read SIGMOD site
Students: My experience • You need a lot of guidance • Most of you do not read papers on your own • Not everyone can do research: not motivated, lack initiative, lack knowledge • Few have good DB systems and math background • Some people are slow/bad programmers • Math background varies, but generally not good enough in linear algebra, calculus, multivariate statistics and discrete math
Classification of students • BAD • LAZY: excuses • DUMB: does not understand, sleepy, wishful thinking • LOST: I want to work with someone else on their project • RANDOM: X+Y+.. different ideas without thought • GOOD • CREATIVE: I have this idea backed by these experimental results • WINNER: I made several comparisons and my program is faster or more accurate • CRAZY: I read this paper and I think my idea is better because of XYZ reasons • SCIENTIST: I have a draft of a paper with theory and experiments, here it is
Bad student: typical comments • This stuff looks too difficult • Why resubmit a rejected paper? Not worth it • I cannot write well; it is hard and boring • I cannot follow your notation, I prefer to use one I invented; mine is better • I did some programming, but it is at home/USB • Finally here is my paper (a draft paper full of spelling errors and disconnected phrases) • I forgot to tell you I was not coming (yesterday, another day) • I did not read any new papers • Where is your web page? What is ACM? What is DBLP? What is DBWorld? • Can you help me debug my program? • Random ideas to do new stuff
Good student: am I dreaming? • I did some theoretical analysis that I want to discuss with you • I think this aspect from one of your papers or this famous paper can be improved • I have submitted two papers every semester • I have at least one paper accepted per year • My paper is ready two weeks before the deadline • Can we briefly meet on the weekend? • I would like to stay late (8pm) two days per week, can you let me know when you are around?
My requirements • B.S.: open • MS: One ACM/IEEE accepted paper; journal paper desirable, but not required • PhD • Preferred: 2 accepted journal papers; 5 papers total • Alternative: 10+ proceedings papers • 3+ papers on the same topic; 1+ paper per year • Defend dissertation between 4th and 5th year
Research method • Hypothesis • New knowledge • In context, knowledge boundary • Important, current, needed • Tell me something I don't know • Prove it • Theory: theorem (property, bound, existence, etc.) • Experiment: feasible, better, faster • Theory versus experiment? Mention Einstein
Work routine • Send me status every week, Thursday morning preferred • Send paper, results, comments BEFORE we talk • My time is PRIME time. Do not waste it. • Compile experimental results in an .xls spreadsheet • Collect all status and my answers in a log file • Avoid excuses; avoid “forgetting” email • We can talk on the phone in the morning preferably (<30 minutes)
My commitment • Answer e-mail within 2 days unless traveling • Can call you any time • Guidance to submit papers • Give you lead in proceedings papers • Take lead in journal papers
Deliverables • Experiments: spreadsheet with one sheet per week (main ingredient for a strong paper) • Program: C++ or Java modules that can be independently called (easy to use by someone else) • Source code: follow math notation from my papers, index arrays from 1, log comments, parameters, config • Papers: latex, db.bib, send me pdf in ACM 2-column format by default
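The "modules that can be independently called" deliverable can be sketched as one entry point that receives all of its parameters explicitly. This is only a minimal illustration under stated assumptions: the names `SummaryParams`, `SummaryResult` and `compute_summary` are hypothetical and do not come from the deck, and the computation (per-dimension sums) is a placeholder for a real algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle: everything the experiment needs is passed in,
// nothing is read from globals, so a script or a GUI can call the module alike.
struct SummaryParams {
    std::size_t n = 0;  // number of points
    std::size_t d = 0;  // number of dimensions
};

// Hypothetical result: sums[h] = sum of dimension h over all points, h in 1..d.
struct SummaryResult {
    std::vector<double> sums;  // indexed from 1, slot 0 unused
};

// The single main call of the module.
// X is indexed from 1: X[i][h] for i in 1..n and h in 1..d.
SummaryResult compute_summary(const SummaryParams& p,
                              const std::vector<std::vector<double>>& X) {
    assert(X.size() == p.n + 1);  // rows 0..n, row 0 unused
    SummaryResult r;
    r.sums.assign(p.d + 1, 0.0);
    for (std::size_t i = 1; i <= p.n; ++i) {
        assert(X[i].size() == p.d + 1);  // slots 0..d, slot 0 unused
        for (std::size_t h = 1; h <= p.d; ++h) {
            r.sums[h] += X[i][h];
        }
    }
    return r;
}
```

Someone else can use the module by filling in `SummaryParams`, passing the data, and reading `SummaryResult`, without touching its internals.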
How you will be when finished • Understand how to apply math, theory • Strong C++, SQL programmer • More organized, proactive, creative • Better writing, better presenter
Job prospect: excellent • Corporate world runs on DBMSs, academics underexploit DBMSs • DBMS industry: Oracle, Microsoft, Teradata, Greenplum, IBM • Search Engines: Google, Yahoo • Academic positions: always vacant positions due to industry demand
Quotes • Datta: Every paper has a home • Ezquerra: A significant knowledge contribution should appear in journal form • Vardi: journals should be #1 • Journal papers last forever (archival) • CO: Citation impact: 0 worthless, 1 even, 5+ OK, 10+ good, 100+ famous • Gray: There are lies, damn lies and benchmarks • Europe: Theory is the compass of CS programming • People do care who is 1st author, except in theory; nth author of 4+ authors is not worth much; 2-author papers are optimal, 3-author papers the upper bound • CO: Free ride OK in proceedings; rarely free ride in journal
CS vs other areas • A person in CS will not respect you if you do not have a paper in a top conference • A person outside CS will not respect you if you do not have a journal paper • People in industry will not respect you if you do not have a patent, but proceedings help!
Advice (mainly academic PhD) • Avoid gaps in your publication record (year without papers) • Always have journal papers in the pipeline • Submit to top conference every year • Choose 1 or 2 applications, careful 3+ • Paper count • Ph.D. without many papers is worthless, 1 journal • M.S. with 1 paper is exceptional • B.S. 1 paper automatic acceptance to PhD
Recommendations: programming • Try ideas soon (discussed with me) • Always ask yourself O(n) • Index arrays from 1..d, 1..n in C++ and Java (example: double X[d+1]; X[1]=0) • Choose an acronym for your research; 8 letters (lrq, udfmodel, olaptest, ssvs) • Every file you send me should have the acronym as prefix • Parameter passing for experiments and GUI; every algorithm has one main call • C++: follow GNU C++, plain editor preferred • SQL: reserved keywords uppercase; tables and columns lowercase; indent; one term per line
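The 1-based indexing recommendation can be sketched as follows. Note that `double X[d+1]` with a run-time `d` is a variable-length array, which is a compiler extension rather than standard C++, so this sketch assumes `std::vector` instead. The helper names `make_one_based` and `mean_one_based` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allocate d+1 doubles so that positions 1..d are usable.
// Slot 0 is unused padding; it exists only so subscripts match the math notation.
std::vector<double> make_one_based(std::size_t d) {
    return std::vector<double>(d + 1, 0.0);
}

// Mean of x[1..n], written with the same subscripts as the formula
// mean = (1/n) * sum_{i=1}^{n} x_i.
double mean_one_based(const std::vector<double>& x) {
    assert(x.size() >= 2);  // need slot 0 plus at least one usable slot
    const std::size_t n = x.size() - 1;
    double sum = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        sum += x[i];
    }
    return sum / static_cast<double>(n);
}
```

The cost is one wasted element per array; the benefit is that loop bounds such as `i = 1; i <= n` read exactly like the summation limits in the paper.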
Recommendations: writing • Write Section 4 first! End with abstract & related work • Novel writing: be clear in Section 2, increase interest in Section 3, deliver in Section 4, be political in Section 5 • Avoid creating multiple versions of the same file; instead use source control • Write in ACM format by default (www-acm.cls) • Use our master db.bib • Put everything in a zip file with the acronym • Log main changes • Keep paper reviews handy
Visit my web page • Members • Articles • Courses
Research Accomplishments • Over 70 articles • 20 journal articles, 15 in Thomson Reuters • 50 ACM/IEEE proceedings • 800 citations on Google Scholar • 15 patent applications; 9 patents • 57 articles on DBLP, 56 on ACM • H-index=14 per Google Scholar, 8 MS • Students • 1 PhD dissertation: Javier Garcia-Garcia • 6 MS theses • 2 PhD dissertations, in progress (Zhibo, Carlos G)
My Publications • Conferences: • 1st: SIGMOD, CIKM, KDD • 2nd: ICDM, ICDE, ADL • 3rd: MLDM • Journals: • 1st: IEEE TKDE, IEEE TITB, ACM TODS • 2nd: DKE (Elsevier), KAIS (Springer), DSS (Elsevier), IDA (IOS Press) • 3rd: JCSE
Members • 4 PhD students: • Zhibo Chen (US, Halliburton) • Carlos Garcia-Alvarado (Mexico, Greenplum) • Mario Navas (Ecuador) • Sasi K. Pitchaimalai (India)
Old and new students • Old MS students: • Anu Goyal (India) • Kai Zhao (China) • Georgey Golovko (Ukraine) • Waree Rinsurong (Thailand) • Ahmad Qwasmeh (Jordan) • Rengan Xu (China) • New MS students: • Manish Limaye (India) • Manas Saha (India) • Naveen Mohanam (India)