Explore the potential impact of integrating data mining into essential database management functions such as indexing, data cleaning, integration, and query processing. Understand how this integration can improve data management efficiency and unlock hidden insights from diverse data sources. Learn about the possibilities and challenges of merging data mining with DBMS.
Will Data Mining Change the Functions of DBMS?
Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign
Will DM Be Integrated with DB Functions?
• DM: already a functional component of DBMS
  • Microsoft SQL Server: Analysis Manager
  • IBM DB2 & Intelligent Miner
  • Oracle: Data Mining Package
• But will DM be “intruding” into DBMS, i.e., be integrated with essential DBMS functions?
  • Indexing
  • Data integration
  • Data cleaning
  • Query processing
Indexing by Data Mining
• Indexing graphs? The number of subgraphs is exponential!
  • Chemical informatics/bioinformatics …
  • Discriminative frequent graph patterns (SIGMOD’04)
• Indexing subsequences?
  • Shopping sequences, DNA/protein sequences (SDM’05)
• When is discriminative frequent pattern indexing useful?
  • Complex objects, big (object) queries
[Figure: sample database of graphs (a), (b), (c) and a query graph]
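The filtering idea behind pattern-based indexing can be sketched in a few lines. This is a heavily simplified illustration, not the SIGMOD’04 method: it assumes a graph is just a list of labeled edges, uses single edge-label pairs as the only "patterns", and stands in for discriminative selection with a plain support window. The names `edge_features`, `build_index`, `candidates`, `min_support`, and `max_fraction` are all hypothetical.

```python
from collections import defaultdict

def edge_features(graph):
    """Label pairs of a graph given as a list of (label_u, label_v) edges."""
    return {tuple(sorted(edge)) for edge in graph}

def build_index(database, min_support=2, max_fraction=0.8):
    """Map each retained feature to the ids of the graphs containing it.

    Assumption: a feature is kept if it is frequent (>= min_support graphs)
    but not so common that it filters nothing (<= max_fraction of graphs).
    """
    postings = defaultdict(set)
    for gid, graph in database.items():
        for feat in edge_features(graph):
            postings[feat].add(gid)
    n = len(database)
    return {f: ids for f, ids in postings.items()
            if min_support <= len(ids) <= max_fraction * n}

def candidates(index, database, query):
    """Graphs that may contain the query; a superset of the true answers."""
    result = set(database)
    for feat in edge_features(query):
        if feat in index:          # unindexed features simply cannot prune
            result &= index[feat]
    return result

# Toy database: atoms as labels, bonds as edges (made-up data).
database = {
    "g1": [("C", "C"), ("C", "O")],
    "g2": [("C", "C"), ("C", "N")],
    "g3": [("C", "O"), ("C", "N")],
    "g4": [("C", "C"), ("C", "O"), ("C", "N")],
}
index = build_index(database)
query = [("C", "C"), ("O", "C")]
print(candidates(index, database, query))
```

The candidate set still has to be verified with a real subgraph-isomorphism test, which this sketch leaves out; the index only reduces how many graphs reach that expensive step.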
Data Cleaning by Data Mining
• Load messy data into a structured database?
  • Inconsistent data: age = “1946”?
  • Field misalignments
  • Data glitches: completely garbled inputs
  • Missing or mismatched delimiters: XML, HTML data
  • Big fields: BLOB, CLOB, multimedia and text
• Data mining
  • Data cleaning by distribution/outlier analysis
  • Dependency/correlation analysis
  • Schema-directed cleaning or schema “discovery”
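As a rough illustration of cleaning by outlier analysis, the sketch below flags an age of 1946 (probably a birth year typed into the wrong field) using a median-based score. The choice of the modified z-score with the conventional 0.6745 constant and 3.5 cutoff is an assumption made for this example, not something the slides prescribe; `flag_outliers` is a hypothetical name.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which stay
    stable even when the column contains a few wildly wrong entries.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; cannot score
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

ages = [34, 41, 29, 1946, 52, 38, 45]   # made-up column with one glitch
print(flag_outliers(ages))              # [1946]
```

A mean/standard-deviation rule would be a poor fit here, because the glitch itself inflates both statistics; that is why the sketch uses the median.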
Data Integration by Data Mining
• Linking and mining across multiple data relations
  • CrossMine (classification across multiple data relations: ICDE’04)
• Search across heterogeneous databases
  • Object identification/merge, reference reconciliation (Alon’s group)
• Mining across heterogeneous DBs
• Personalizing data from heterogeneous sources
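Object identification across sources can be illustrated with a deliberately naive matcher. The sketch below assumes records are plain name strings and uses token-set Jaccard similarity with a fixed threshold; practical reference-reconciliation systems typically learn the matching function and use far richer evidence. `tokens`, `jaccard`, `match_records`, and the sample records are all hypothetical.

```python
import re

def tokens(name):
    """Lowercased alphanumeric tokens of a record name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two names (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def match_records(left, right, threshold=0.3):
    """Pair each left record with its most similar right record, if close enough."""
    pairs = []
    for l in left:
        best = max(right, key=lambda r: jaccard(l, r), default=None)
        if best is not None and jaccard(l, best) >= threshold:
            pairs.append((l, best))
    return pairs

# Two made-up sources that spell the same entities differently.
source_a = ["ACME Corp.", "Globex Inc"]
source_b = ["Acme Corp", "Initech LLC", "Globex Incorporated"]
print(match_records(source_a, source_b))
```

The threshold trades missed merges against false merges; at 0.3 the abbreviated "Globex Inc" still pairs with "Globex Incorporated", while a stricter cutoff would drop it.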
Query Processing by Data Mining
• Query plan refinement based on query execution history
• Better query planning by investigating additional data statistics
  • Current optimizer: key/foreign key, cardinality, # distinct values
  • Additional information:
    • Strong dependency/correlation
    • Histogram, dense vs. sparse regions, etc.
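To see why dependency information matters for planning, compare a cardinality estimate made under the usual column-independence assumption with one that uses mined pair statistics. This is a toy sketch over in-memory rows, with made-up data and hypothetical function names; it is not how any particular optimizer stores its statistics.

```python
from collections import Counter

def column_stats(rows, columns):
    """Per-column value counts, the kind of statistic an optimizer usually keeps."""
    return {c: Counter(r[c] for r in rows) for c in columns}

def pair_stats(rows, col_a, col_b):
    """Joint value counts for two columns: the 'additional' mined statistic."""
    return Counter((r[col_a], r[col_b]) for r in rows)

def estimate_independent(stats, n, col_a, val_a, col_b, val_b):
    """Estimated matching rows if the two predicates were independent."""
    return n * (stats[col_a][val_a] / n) * (stats[col_b][val_b] / n)

def estimate_with_pairs(pairs, val_a, val_b):
    """Estimated matching rows using the joint counts directly."""
    return pairs[(val_a, val_b)]

# Made-up table in which model fully determines make.
rows = ([{"make": "Honda", "model": "Civic"}] * 5
        + [{"make": "Ford", "model": "Focus"}] * 5)
n = len(rows)
stats = column_stats(rows, ["make", "model"])
pairs = pair_stats(rows, "make", "model")

print(estimate_independent(stats, n, "make", "Honda", "model", "Civic"))  # 2.5
print(estimate_with_pairs(pairs, "Honda", "Civic"))                       # 5
```

With a strong dependency between the columns, the independence estimate is off by a factor of two even on this tiny table; errors of that kind compound across joins and can push a planner toward a worse plan.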
Conclusions
• DB researchers have been “invading” DM and have made great contributions
• It is time to consider that DM may invade DBMS to enhance its functionality
• General philosophy
  • Invisible data mining
    • Google is doing this successfully for page ranking
  • Can we do it to enhance DBMS?
• You can do better if you know your data better!