Explore the potentials and challenges of data mining in deployed applications and commercial products, with a focus on vertical applications and horizontal tools. Discover new opportunities and challenges in non-conventional domains, structured and unstructured data, and security/privacy concerns.
Data Mining: Potentials and Challenges
Rakesh Agrawal & Jeff Ullman
Observations
• Transfer of data mining research into deployed applications and commercial products
• Greater success in vertical applications
• Horizontal tools, for example:
  • SAS Enterprise Miner: aimed at the sophisticated-statistician segment
  • DB2 Intelligent Miner: aimed at database applications requiring mining
• Emergence of data mining applications in non-conventional domains
• Combination of structured and unstructured data
• New challenges due to security/privacy concerns
• DARPA initiative to fund data mining research
Identifying Social Links Using Association Rules Input: Crawl of about 1 million pages
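The slide names only the technique and the input, so the following is a minimal sketch under stated assumptions rather than the authors' method. It assumes each crawled page has already been reduced to the set of person names it mentions (the name extraction step, the function `pair_rules`, and both thresholds are hypothetical), and it derives pairwise rules "a -> b" with their support and confidence.

```python
from collections import Counter
from itertools import combinations


def pair_rules(pages, min_support=2, min_confidence=0.6):
    """Return sorted rules (lhs, rhs, support, confidence) for lhs -> rhs.

    pages: iterable of name lists, one per crawled page (assumed input shape).
    support: number of pages that mention both names.
    confidence: support divided by the number of pages mentioning lhs.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for names in pages:
        unique = sorted(set(names))  # ignore repeated mentions on one page
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Toy stand-in for the crawl: four pages and the names found on each.
pages = [
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["alice", "carol"],
    ["bob", "dave"],
]
rules = pair_rules(pages)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support}  confidence={confidence:.2f}")
```

A rule with high confidence in both directions would suggest a symmetric link between two people; a one-directional rule suggests that one name rarely appears without the other.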
Website Profiling using Classification Input: Example pages for each category during training
Discovering Trends Using Sequential Patterns & Shape Queries Input: i) patent database ii) shape of interest
Discovering Micro-communities Frequently co-cited pages are related. Pages with large bibliographic overlap are related.
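Both relatedness signals on this slide can be computed directly from a link graph. The sketch below is illustrative only: it assumes the graph is a dict mapping each page to the set of pages it cites, and it measures bibliographic overlap with the Jaccard coefficient, which is a choice made here and not something the slide specifies. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """Count, for every pair of pages, how many pages cite both of them.

    links: dict mapping a citing page to the set of pages it cites.
    """
    counts = Counter()
    for cited in links.values():
        counts.update(combinations(sorted(cited), 2))
    return counts


def bibliographic_overlap(links, a, b):
    """Jaccard overlap of the outgoing citation sets of pages a and b."""
    refs_a = links.get(a, set())
    refs_b = links.get(b, set())
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


# Toy link graph: p1, p2, p3 are citing pages; x, y, z are cited pages.
links = {
    "p1": {"x", "y"},
    "p2": {"x", "y", "z"},
    "p3": {"y", "z"},
}
cocited = cocitation_counts(links)
overlap_p1_p2 = bibliographic_overlap(links, "p1", "p2")
overlap_p1_p3 = bibliographic_overlap(links, "p1", "p3")
```

Pairs with a high co-citation count, or citing pages with a high overlap score, are candidates for membership in the same micro-community.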
New Challenges
• Privacy-preserving data mining
• Data mining over compartmentalized databases
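Inducing Classifiers over Privacy Preserved Numeric Data
[Figure: original records (age | salary | …) pass through a Randomizer: 30 | 25K | … becomes 65 | 50K | …, and 50 | 40K | … becomes 35 | 60K | …. Labels: Alice’s age, Alice’s salary, John’s age. Example: age 30 becomes 65 (30+35). The age and salary distributions are reconstructed from the randomized records and passed to a decision tree algorithm, which produces the model.]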
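The pipeline on this slide (randomize each value, then reconstruct the distribution of the original values before training) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the noise is taken to be Gaussian with a known standard deviation `sigma`, the values are binned onto fixed `midpoints`, and the reconstruction is a simple iterative Bayesian update. All names and parameter values are hypothetical.

```python
import math
import random


def randomize(values, sigma, rng):
    """Perturb each value with independent Gaussian noise (assumed noise model)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def reconstruct(perturbed, midpoints, sigma, iterations=50):
    """Estimate the probability of each bin midpoint for the original values.

    Starts from a uniform guess and repeatedly replaces it with the average
    posterior over bins given each perturbed observation.
    """
    k = len(midpoints)
    probs = [1.0 / k] * k
    # Likelihood of each observation given each candidate original value.
    like = [
        [math.exp(-((w - m) ** 2) / (2.0 * sigma ** 2)) for m in midpoints]
        for w in perturbed
    ]
    for _ in range(iterations):
        new = [0.0] * k
        for row in like:
            weights = [l * p for l, p in zip(row, probs)]
            total = sum(weights)
            if total == 0.0:
                continue  # observation too far from every bin to be informative
            for j, wt in enumerate(weights):
                new[j] += wt / total
        norm = sum(new)
        probs = [x / norm for x in new]
    return probs


rng = random.Random(0)
ages = [30] * 1500 + [60] * 500           # synthetic "true" ages
perturbed = randomize(ages, sigma=10.0, rng=rng)
midpoints = list(range(0, 101, 10))
probs = reconstruct(perturbed, midpoints, sigma=10.0)
estimated_mean = sum(p * m for p, m in zip(probs, midpoints))
```

Only `perturbed` would leave the data owner's hands. The miner sees the noise parameters and the perturbed values, and recovers an approximate distribution, not any individual's record.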
Other recent work
• Cryptographic approach to privacy-preserving data mining
  • Lindell & Pinkas, Crypto 2000
• Privacy-preserving discovery of association rules
  • Vaidya & Clifton, KDD 2002
  • Evfimievski et al., KDD 2002
  • Rizvi & Haritsa, VLDB 2002
Some Hard Problems
• Past may be a poor predictor of the future
  • Abrupt changes
  • Wrong training examples
• Actionable patterns (principled use of domain knowledge?)
• Over-fitting vs. not missing the rare nuggets
• Richer patterns
• Simultaneous mining over multiple data types
• When to use which algorithm?
• Automatic, data-dependent selection of algorithm parameters
Discussion
• Should data mining be viewed as “rich” querying and “deeply” integrated with database systems?
  • Most current work makes little use of database functionality
• Should analytics be an integral concern of database systems?
• Issues in data mining over heterogeneous data repositories (relationship to the heterogeneous systems discussion)
Summary
• Data mining has shown promise but needs much more research
“We stand on the brink of great new answers, but even more, of great new questions” -- Matt Ridley