From DBMiner to WebMiner: What is the Future of Data Mining? Jiawei Han Intelligent Database System Research Lab School of Computing Science Simon Fraser University, Canada http://www.cs.sfu.ca/~han Tuesday, January 11, 2000
Data Mining: “Necessity is the mother of invention” • On-line databases are widely available • NASA’s EOS (Earth Observing System), WWW, Digital Library, stock market data, e-commerce, telecommunication data, credit card transactions, market basket data, bio-medical data, etc. • We are drowning in data, but starving for knowledge! • Requirements: fast response, interactive and exploratory analysis, mining hidden patterns
Data Mining: A KDD Process • Data mining: the core of the knowledge discovery process. • Process flow (from the slide diagram): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
Data Mining and Business Intelligence • Layers of the diagram, from top (highest potential to support business decisions) to bottom: • Making Decisions • Data Presentation: Visualization Techniques • Data Mining: Information Discovery • Data Exploration: Statistical Analysis, Querying and Reporting • Data Warehouses / Data Marts: OLAP, MDA • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP • Roles shown alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA
Why Data Mining? — Potential Applications • Database analysis and decision support • Market analysis and management • target marketing, customer relation management, market basket analysis, cross selling, market segmentation. • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis. • Fraud detection and management • Text mining (news, emails, documents) and Web mining. • BioInformatics (DNA), GeoInformatics (Maps, Remote sensing data), Intelligent query answering
Data Mining: On What Kind of Data? • Relational databases and Transactional databases • Data warehouses • Advanced DBMS and information repositories • Object-oriented and object-relational databases • Spatial databases • Time-series data and temporal data • Text databases and multimedia databases • Heterogeneous and legacy databases • WWW
Data Mining: Confluence of Multiple Disciplines • Database systems, data warehouse and OLAP • Statistics • Machine learning • Visualization • Information science • High performance computing • Business and application domain knowledge expertise • Other disciplines: • Neural networks, mathematical modeling, information retrieval, pattern recognition, etc.
Data Mining: Major Tasks • Characterization and descriptive data mining • Data distribution, dispersion and exception • Association, correlation, causality analysis • Find rules like “inside(x, city) → near(x, highway)” • Classification and predictive modeling • Classify countries based on climate • Predict sales based on product qualification • Clustering and outlier analysis • Cluster houses to find distribution patterns • Temporal and sequential pattern analysis • Trend and deviation, sequential patterns, periodicity
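A rule such as “inside(x, city) → near(x, highway)” is usually judged by its support and confidence. The following is a minimal sketch of those two measures, not the deck's own implementation; the predicate names and the toy data in `OBJECTS` are assumptions made up for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate P(consequent | antecedent) from the transactions."""
    antecedent = frozenset(antecedent)
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | frozenset(consequent)) / base


# Hypothetical data: each set lists the predicates that hold for one object x.
OBJECTS = [
    frozenset({"inside_city", "near_highway"}),
    frozenset({"inside_city", "near_highway"}),
    frozenset({"inside_city"}),
    frozenset({"near_highway"}),
]

rule_support = support(OBJECTS, {"inside_city", "near_highway"})
rule_confidence = confidence(OBJECTS, {"inside_city"}, {"near_highway"})
```

A rule is typically reported only when both values exceed user-chosen thresholds.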
Batch Data Mining vs. On-Line Analytical Mining • Data mining: a costly process • Deep analysis: association, classification, prediction, clustering, sequence analysis, outlier analysis, etc. • Huge amounts of data with wide diversity • Batch processing (“submit and wait?!”) is the status quo but is not the answer! • On-line analytical mining (OLAM) • Fast, interactive mining of multi-dimensional databases: response in seconds! • OLAM operations: mining with drilling, etc.
Expected Features of On-Line Analytical Mining • Ability to mine anywhere • OLAP-like exploratory mining (interactive, progressive deepening, intelligent focusing) • Efficient, data cube-based mining methods • Dynamic selection and integration of data mining, OLAP, and statistical functions • Fast response and high performance • Visualization and extensibility
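To make "OLAP-like exploratory mining" concrete, here is a small sketch of rolling up and drilling down over a fact table in plain Python. It only illustrates the navigation step that a mining function would then run on; the dimension names, the `FACTS` rows, and the `aggregate` helper are hypothetical and are not DBMiner's API.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical fact rows: (country, city, product, units_sold).
DIMENSIONS = ("country", "city", "product")
FACTS = [
    ("Canada", "Vancouver", "tea", 5),
    ("Canada", "Vancouver", "coffee", 3),
    ("Canada", "Toronto", "tea", 2),
    ("USA", "Seattle", "coffee", 7),
]


def aggregate(
    facts: List[tuple],
    group_by: List[str],
    where: Optional[Dict[str, str]] = None,
) -> Dict[Tuple[str, ...], int]:
    """Sum the measure over the chosen dimensions, optionally within one slice."""
    where = where or {}
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    totals: Dict[Tuple[str, ...], int] = defaultdict(int)
    for row in facts:
        if all(row[index[dim]] == value for dim, value in where.items()):
            key = tuple(row[index[dim]] for dim in group_by)
            totals[key] += row[-1]
    return dict(totals)


# Roll up to the country level, then drill down into one country by city.
by_country = aggregate(FACTS, ["country"])
canada_by_city = aggregate(FACTS, ["city"], where={"country": "Canada"})
```

In an OLAM setting the analyst would pick a cell such as Canada from `by_country`, drill down, and then invoke a mining function (association, classification, and so on) on just that slice.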
On-Line Analytical Mining: An Architecture • Layer 4, User Interface: receives the mining query, returns the mining result • User GUI API • Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine • Data Cube API • Layer 2, MDDB: MDDB, Meta Data • Database API; Filtering and Integration • Layer 1, Data Repository: Databases, Data Warehouse; data cleaning, data integration, filtering
From Research Prototypes to Data Mining System Products • DBMiner: one of the pioneering data mining systems. • Integration of data warehousing (OLAP) with data mining • On-Line Analytical Mining. • From research prototype to Enterprise 2.0 (6 years of R&D results). • Demonstrated at many conferences and in trial use at Boeing, HP, and Hughes Research Labs.
Distinct Features of DBMiner • Multiple data mining functions. • OLAP service, cube exploration, statistical analysis, classification (market/customer segmentation, decision trees), association (basket data analysis), cluster analysis, etc. • On-line analytical mining of Microsoft/PLATO OLAP cube. • Data and knowledge visualization tools: visual data mining. • OLE DB and RDBMS connections.
A Few Snapshots of DBMiner • OLAP-based graphical user interface • OLAP-based multi-dimensional analysis • Association rule graph • Association 2-D plane • Classification (decision tree analysis) • Cluster analysis • 3-D cube viewer and analyzer
Brief History of DBMiner Technology Inc. • Research on data mining since 1989. • International reputation and recognition. • Substantial research support and contracts. • DBMiner Technology Inc.: A Simon Fraser University Spin-Off Company • Incorporated in March 1997, dedicated to data mining system development and commercialization. • Major products: DBMiner 2.0 (Enterprise) • Customization and application-oriented data mining systems • GeoMiner, WebMiner, WebLogMiner, …, more miners in progress
Mining Complex Data: Costly and Largely Unexplored Frontier • Spatial OLAP and spatial data mining • maps, satellite images, geo-spatial modeling and reasoning • Time-series and sequential pattern mining • pattern match, pattern discovery, trend and periodicity analysis. • Mining hypertext and hypermedia data • Visual data mining • Scientific data mining • Web mining
Spatial OLAP: Pre- vs. On-line Computation • Precomputing all: too much storage space • On-line merge: very expensive
Spatial Classification • Generalization-based induction • Interactive classification
From Coarse to Fine Resolution Mining • Progressive resolution refinement: progressively mine finer resolutions only on candidate frequent item-sets • Diagram labels: feature localization, minimum bounding circles, tile size; coarse resolution to fine resolution • Algorithm: i = 0; D_0 = D; while (i < maxResLevel) do { R_i = {sufficiently frequent item-sets at resolution i}; i = i + 1; D_i = Filter(D_{i-1}, R_{i-1}); }
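The loop on this slide can be sketched in Python as follows. This is one possible reading, under stated assumptions: items are "/"-separated paths whose prefixes stand for coarser resolutions, item-sets are limited to size two, and the slide's `Filter` step is interpreted as dropping fine-grained items whose coarse form did not appear in any frequent item-set. The names `project`, `frequent_itemsets`, `filter_data`, and the `SCENES` data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple

Itemset = Tuple[str, ...]


def project(item: str, level: int) -> str:
    """Coarsen a '/'-separated item to a resolution level (0 is coarsest)."""
    return "/".join(item.split("/")[: level + 1])


def frequent_itemsets(
    data: List[Set[str]], level: int, min_count: int, max_size: int = 2
) -> Set[Itemset]:
    """R_i: item-sets (up to max_size) seen in at least min_count transactions."""
    counts: Counter = Counter()
    for transaction in data:
        items = sorted({project(x, level) for x in transaction})
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset for itemset, n in counts.items() if n >= min_count}


def filter_data(
    data: List[Set[str]], level: int, frequent: Set[Itemset]
) -> List[Set[str]]:
    """Filter(D_{i-1}, R_{i-1}): keep only items whose coarse form was frequent."""
    keep = {item for itemset in frequent for item in itemset}
    filtered = [{x for x in t if project(x, level) in keep} for t in data]
    return [t for t in filtered if t]


def progressive_refinement(
    transactions: List[Set[str]], max_level: int, min_count: int
) -> List[Set[Itemset]]:
    """Mine each resolution level on data pruned by the previous level."""
    data = [set(t) for t in transactions]
    results = []
    for level in range(max_level):
        frequent = frequent_itemsets(data, level, min_count)
        results.append(frequent)
        data = filter_data(data, level, frequent)
    return results


# Hypothetical scenes described at two resolution levels.
SCENES = [
    {"water/lake", "road/highway"},
    {"water/lake", "road/highway"},
    {"water/river", "park/garden"},
]
levels = progressive_refinement(SCENES, max_level=2, min_count=2)
```

The pruning is safe because an item-set that is frequent at a fine resolution is also frequent once coarsened, so nothing frequent at the finer level is lost by filtering on the coarser one.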
Web Mining: Lots To Be Done! • A taxonomy of Web mining • Web content mining • Web usage mining • Interesting and challenging problems on Web mining • Mining what Web search engine finds • Weblog mining (usage, access, and evolution) • Identification of authoritative Web pages • Web document classification • Warehousing a Meta-Web: Web yellow page service • Intelligent query answering in Web search • Web mining requires your response in seconds!
Challenges to Web Mining • Web: A huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving, hypertext/hypermedia information repository. • Problems: • the “abundance” problem • limited coverage of the Web (hidden Web sources) • limited query interface: keyword-oriented search • limited customization to individual users • DBMS, DBers, and data miners will play an increasingly important role in the new generation of Internet
Mine What Web Search Engine Finds • Current Web search engines: convenient source for mining • keyword-based, return too many answers, low quality answers, still missing a lot, not customized, etc. • Data mining will help: • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies • better search primitives: user preferences/hints • linkage analysis: authoritative pages and clusters • Web-based languages: XML + WebSQL + WebML • customization: home page + Weblog + user profiles
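One way to read "enlarge and then shrink" is: first broaden the keyword set with synonyms to improve coverage, then narrow the enlarged answer set using a user preference. The sketch below follows that reading only; the synonym table, the page records, the example.org URLs, and the topic-based `shrink` step are all assumptions, not part of any real search engine API.

```python
from typing import Dict, List, Set

# Hypothetical synonym table and page index.
SYNONYMS: Dict[str, Set[str]] = {"car": {"automobile", "auto"}}
PAGES = [
    {"url": "http://example.org/a", "terms": {"car", "dealer"}, "topic": "shopping"},
    {"url": "http://example.org/b", "terms": {"automobile", "history"}, "topic": "history"},
    {"url": "http://example.org/c", "terms": {"train"}, "topic": "shopping"},
]


def enlarge(keywords: Set[str], synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Broaden the query with known synonyms of each keyword."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= synonyms.get(keyword, set())
    return expanded


def search(pages: List[dict], keywords: Set[str]) -> List[dict]:
    """Return pages that share at least one term with the query."""
    return [page for page in pages if page["terms"] & keywords]


def shrink(results: List[dict], preferred_topic: str) -> List[dict]:
    """Narrow the enlarged answer set using a user preference."""
    return [page for page in results if page["topic"] == preferred_topic]


broad = search(PAGES, enlarge({"car"}, SYNONYMS))
narrow = shrink(broad, "history")
```

A conceptual hierarchy could play the same role as the synonym table, for example by expanding "car" to its parent concept "vehicle".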
Web Log Mining • Weblog provides rich information about Web dynamics • Multidimensional Weblog analysis: • disclose potential customers, users, markets, etc. • Plan mining (mining general Web accessing regularities): • Web linkage adjustment, performance improvements • Web accessing association/sequential pattern analysis: • Web caching, prefetching, swapping • Trend analysis: • Dynamics of the Web: what has been changing? • Customized to individual users
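As a rough illustration of how access patterns could drive prefetching, the sketch below counts page-to-page transitions in sessionized log data and proposes the most likely next page. It is a first-order simplification, not a full sequential pattern miner; the session data, the function names, and the 0.5 confidence threshold are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def transition_counts(sessions: List[List[str]]) -> Dict[str, Counter]:
    """Count how often each page is followed directly by another page."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def prefetch_candidate(
    counts: Dict[str, Counter], page: str, min_confidence: float = 0.5
) -> Optional[str]:
    """Suggest the most frequent next page if it is likely enough."""
    followers = counts.get(page)
    if not followers:
        return None
    following, hits = followers.most_common(1)[0]
    if hits / sum(followers.values()) >= min_confidence:
        return following
    return None


# Hypothetical sessions reconstructed from a Web log.
SESSIONS = [
    ["/", "/products", "/cart"],
    ["/", "/products"],
    ["/", "/about"],
]
COUNTS = transition_counts(SESSIONS)
next_after_home = prefetch_candidate(COUNTS, "/")
```

The same counts could feed Web linkage adjustment, for example by adding a direct link between pages that are often visited in sequence.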
Discovery of Authoritative Pages in WWW • Page-rank method (Brin and Page, 1998): • Rank the "importance" of Web pages, based on a model of a "random browser." • Hub/authority method (Kleinberg, 1998): • Prominent authorities often do not endorse one another directly on the Web. • Hub pages have a large number of links to many relevant authorities. • Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs. • Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
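The mutually reinforcing relationship between hubs and authorities can be sketched as a simple iteration. This is a simplified illustration of the hub/authority idea, without the root-set and base-set construction of the published method; the `hits` function name, the fixed iteration count, and the tiny `LINKS` graph are assumptions.

```python
import math
from typing import Dict, List, Tuple


def hits(
    links: Dict[str, List[str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed link graph."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(LINKS)
```

In this toy graph the authority pages never link to each other, yet `a1` still ranks above `a2` because more hubs point to it.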
Web Document Classification • Web document classification: • Good classification: Yahoo!, CS term hierarchies • Training set and learning model • Keyword-based classification is different from multi-dimensional classification • Association- or clustering-based classification is often more effective • Multi-level classification is important • See K. Wang’s work and also S. Chakrabarti’s COMPUTER Aug.’99 paper.
Warehousing a Meta-Web: An MLDB Approach • Meta-Web: A structure which summarizes the contents, structure, linkage, and access of the Web and which evolves with the Web • Layer 0: the Web itself • Layer 1: the lowest layer of the Meta-Web • an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc. • Layer 2 and up: summary/classification/clustering in various ways and distributed for various applications • Meta-Web can be warehoused and incrementally updated • Querying and mining can be performed on or assisted by the Meta-Web (a multi-layer digital library catalogue, yellow page).
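A Layer 1 entry and one possible Layer 2 generalization might look like the sketch below. The field names follow the list on this slide, but the types, the `PageSummary` and `generalize` names, the per-class aggregation, and the example.org entries are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageSummary:
    """Hypothetical Layer 1 entry: a summary of one Web page."""
    url: str
    page_class: str
    time: str
    keywords: List[str] = field(default_factory=list)
    popularity: int = 0
    links: List[str] = field(default_factory=list)


def generalize(entries: List[PageSummary]) -> Dict[str, dict]:
    """Hypothetical Layer 2: one aggregated description per page class."""
    layer2: Dict[str, dict] = defaultdict(
        lambda: {"pages": 0, "popularity": 0, "keywords": set()}
    )
    for entry in entries:
        summary = layer2[entry.page_class]
        summary["pages"] += 1
        summary["popularity"] += entry.popularity
        summary["keywords"].update(entry.keywords)
    return dict(layer2)


LAYER1 = [
    PageSummary("http://example.org/db", "database", "2000-01-11",
                ["OLAP", "cube"], popularity=40),
    PageSummary("http://example.org/dm", "database", "2000-01-10",
                ["mining", "cube"], popularity=25),
    PageSummary("http://example.org/ir", "retrieval", "2000-01-09",
                ["search"], popularity=10),
]
LAYER2 = generalize(LAYER1)
```

Because the Layer 2 summaries are simple aggregates, they could be updated incrementally as Layer 1 entries are added or refreshed.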
A Multiple Layered Meta-Web Architecture • Layer n: more generalized descriptions • ... • Layer 1: generalized descriptions • Layer 0
Construction of Multi-Layer Meta-Web • XML: facilitates structured and meta-information extraction • Hidden Web: DB schema “extraction” + other meta info • Automatic classification of Web documents: • based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance) • Automatic ranking of important Web pages • authoritative site recognition and clustering Web pages • Generalization-based multi-layer meta-Web construction • With the assistance of clustering and classification analysis
Use of Multi-Layer Meta Web • Benefits of Multi-Layer Meta-Web: • Multi-dimensional Web info summary analysis • Approximate and intelligent query answering • Web high-level query answering (WebSQL, WebML) • Web content and structure mining • Observing the dynamics/evolution of the Web • Is it realistic to construct such a meta-Web? • Benefits even if it is partially constructed • Benefits may justify the cost of tool development, standardization and partial restructuring
Intelligent Web Query Answering • What is intelligent query answering? • Smart alternative answers, summary information, etc. • Based on the user’s profile or history • Web queries need a more intelligent query answering mechanism • How to develop it? • Data warehouse and Web Yellow Page service will help • Data mining will help too!
Conclusions • Data Mining • A rich, promising, young field with broad applications and many challenging research issues • Progress • From research prototype to an on-line analytical mining system: DBMiner 2.0 (Enterprise) • Future work • Application-specific data mining • From DBMiner to WebMiner, and many more!
Current On-Going Projects (1) • Spatial data mining • GeoMiner: (SIGMOD’97 demo) • Spatial data warehouse modeling and spatial OLAP (TKDE’99) • Spatial data cube and on-line aggregation (PAKDD’98, SSD’99) • Constraint-based spatial clustering (VLDB’00 sub?) • Multimedia mining • MultiMediaMiner: (SIGMOD’98 demo) • Multimedia data cube and multi-dimension analysis • Mining multimedia associations (ICDE’00) • Time-series data mining • Partial periodicity mining (KDD’98, ICDE’99) • Inter-transaction association mining (TOIS’99, KDD’99)
Current On-Going Projects (2) • Web mining (WebMiner and MetaWeb) • Three categories of Web mining: structure, usage, and content. • Web mining language: WebML (WIDM’98) • Document classification • Weblog mining (ADL’98) • Plan mining: mining plan databases • Plan mining by divide-and-conquer (DMKD’99) • Intelligent query answering • Intelligent query answering by data mining techniques (TKDE’96) • Book • Data Mining: Concepts and Techniques (Han & Kamber’00)
References: http://www.cs.sfu.ca/~han • J. Han. Towards on-line analytical mining in large databases. ACM-SIGMOD Record, 27:97-107, 1998 • J. Han, et al. DBMiner: A system for data mining in relational databases and data warehouses. Cascon'97 and KDD'96. • J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, Zurich, Switzerland, Sept. 1995. • J. Han, K. Koperski, and N. Stefanovic. GeoMiner: A system prototype for spatial data mining. SIGMOD'97 (demo), Tucson, Arizona, May 1997. • J. Han, L. V. S. Lakshmanan, and R. T. Ng. Human-centered, multidimensional data mining -- the constraints way. COMPUTER, 8, 1999. • K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. SSD'95, Portland, Maine, Aug. 1995. • L. V. S. Lakshmanan, R. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. SIGMOD'99, Philadelphia, PA, June 1999. • R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, Seattle, Washington. • O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. ADL'98, Santa Barbara, CA. • O. R. Zaiane, J. Han, et al. MultiMedia-Miner: A system prototype for multimedia data mining, SIGMOD'98 (demo), Seattle, Washington, June 1998.
http://db.cs.sfu.ca/ Thank you !!!