This article explores the integration of data mining into essential functions of DBMS, such as indexing, data cleaning, data integration, and query processing. It discusses the potential benefits of incorporating data mining techniques into DBMS and examines various applications, including indexing graphs, cleaning messy data, integrating multiple data relations, and refining query plans.
Will Data Mining Change the Functions of DBMS?
Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign
Will DM Be Integrated with DB Functions?
• DM: Already a functional component of DBMS
  • Microsoft/SQLServer: Analysis Manager
  • IBM/DB2 & IntelligentMiner
  • Oracle: Data Mining Package
• But will DM be “intruding” into DBMS, i.e., be integrated with essential DBMS functions?
  • Indexing
  • Data integration
  • Data cleaning
  • Query processing
Indexing by Data Mining
• Indexing graphs? (# of subgraphs: exponential!)
  • Chemical informatics/bioinformatics …
  • Discriminative frequent graph patterns (SIGMOD’04)
• Indexing subsequences?
  • Shopping sequence, DNA/protein sequence (SDM’05)
• When is discriminative frequent pattern indexing useful?
  • Complex objects, big (object) queries
[Figure: sample database of graphs (a), (b), (c) and a query graph]
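The filter-and-verify idea behind pattern-based indexing can be sketched on plain strings. This is a simplified illustration, not the method from the cited papers: it uses contiguous substrings instead of graph fragments, and the names `build_index` and `search`, the toy database, and the thresholds `min_support` and `min_ratio` are all assumptions made for the example. A pattern is indexed only if it is frequent and if it narrows the candidate set noticeably beyond what its already indexed sub-patterns achieve.

```python
from collections import defaultdict


def substrings(s, max_len):
    """Return all distinct contiguous substrings of s with length 1..max_len."""
    return {s[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(s) - n + 1)}


def build_index(db, max_len=3, min_support=2, min_ratio=1.5):
    """Index frequent patterns that are also discriminative (illustrative only)."""
    postings = defaultdict(set)
    for sid, seq in enumerate(db):
        for pat in substrings(seq, max_len):
            postings[pat].add(sid)

    index = {}
    # Visit shorter patterns first so sub-patterns are decided before super-patterns.
    for pat in sorted(postings, key=lambda p: (len(p), p)):
        ids = postings[pat]
        if len(ids) < min_support:
            continue  # not frequent enough to be worth indexing
        # Sequences that the already indexed sub-patterns of pat cannot rule out.
        implied = set(range(len(db)))
        for sub in substrings(pat, len(pat) - 1):
            if sub in index:
                implied &= index[sub]
        # Keep pat only if it prunes substantially more than its sub-patterns do.
        if len(implied) / len(ids) >= min_ratio:
            index[pat] = ids
    return index


def search(db, index, query, max_len=3):
    """Filter with the index, then verify each surviving candidate exactly."""
    candidates = set(range(len(db)))
    for pat in substrings(query, max_len):
        if pat in index:
            candidates &= index[pat]
    answers = sorted(sid for sid in candidates if query in db[sid])
    return candidates, answers


# Hypothetical toy database of DNA-like sequences.
db = ["ACGTAC", "ACGGTA", "TTACGA", "GGTACC", "CATTAG"]
index = build_index(db)
candidates, answers = search(db, index, "TAC")
```

Because every sequence containing the query also contains each of the query's substrings, the filter step never discards a true answer; the final containment check removes any false positives that remain.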
Data Cleaning by Data Mining
• Load messy data into a structured database?
  • Inconsistent data: age = “1946”?
  • Field misalignments
  • Glitches of data: completely messed up inputs
  • Missing/unmatched delimiters: XML, HTML data
  • Big field: BLOB, CLOB, multimedia and text
• Data mining
  • Data cleaning by distribution/outlier analysis
  • Dependency/correlation analysis
  • Schema-directed or schema “discovery”
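As one possible illustration of cleaning by distribution/outlier analysis, the sketch below flags an age value such as 1946 using a modified z-score based on the median and the median absolute deviation (MAD). The choice of this statistic, the function name `flag_outliers`, the threshold of 3.5, and the sample values are assumptions for the example; the slides do not prescribe a specific method.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: at least half the values equal the median,
        # so treat anything different from it as suspicious.
        return [v for v in values if v != med]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical "age" column in which a birth year was entered by mistake.
ages = [34, 41, 29, 1946, 52, 38, 45, 61, 27]
suspicious = flag_outliers(ages)
```

A median-based score is used here because a single extreme value like 1946 would inflate the mean and standard deviation enough to mask itself in a small sample.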
Data Integration by Data Mining
• Linking and mining across multiple data relations
  • CrossMine (classification across multiple data relations: ICDE’04)
• Search across heterogeneous databases
  • Object identification/merge, reference reconciliation (Alon’s group)
• Mining across heterogeneous DBs
• Personalizing data from heterogeneous sources
Query Processing by Data Mining
• Query plan refinement based on query execution history
• Better query planning by investigating additional data statistics
  • Current optimizer: key/foreign key, cardinality, # distinct values
  • Additional information:
    • Strong dependency/correlation
    • Histogram, dense vs. sparse regions, etc.
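To see why dependency/correlation information could matter to a planner, consider a conjunctive predicate over two strongly dependent columns. A common simplification is to multiply per-column selectivities as if the columns were independent; the sketch below compares that estimate with the true selectivity on a small table. The table contents and the helper `selectivity` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Hypothetical (make, model) table: the model almost determines the make.
rows = (
    [("Toyota", "Camry")] * 40
    + [("Toyota", "Corolla")] * 10
    + [("Honda", "Accord")] * 30
    + [("Honda", "Civic")] * 20
)


def is_toyota(row):
    return row[0] == "Toyota"


def is_camry(row):
    return row[1] == "Camry"


# Estimate under the independence assumption: product of single-column selectivities.
independent_estimate = selectivity(rows, is_toyota) * selectivity(rows, is_camry)

# True selectivity of the conjunction, measured directly on the data.
actual = selectivity(rows, lambda r: is_toyota(r) and is_camry(r))

# How far off the independence-based estimate is.
underestimate_factor = actual / independent_estimate
```

With these numbers the independence-based estimate is 0.5 × 0.4 = 0.2, while the measured selectivity is 0.4, so the row count would be underestimated by a factor of two. A mined dependency between model and make is the kind of additional statistic that could correct such an estimate.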
Conclusions
• DBers have been “invading” DM and have made great contributions
• It is time to consider that DM may invade DBMS to enhance its functionality
• General philosophy
  • Invisible data mining
  • Google is doing this successfully for page ranking
  • Can we do it to enhance DBMS?
• You can do better if you know your data better!