Special Topics in Database Systems. Martin Ester Simon Fraser University School of Computing Science CMPT 884 Spring 2009. Introduction. [Fayyad, Piatetsky-Shapiro & Smyth 96].
Special Topics in Database Systems Martin Ester Simon Fraser University School of Computing Science CMPT 884 Spring 2009 CMPT 884, SFU, Martin Ester, 1-09
Introduction [Fayyad, Piatetsky-Shapiro & Smyth 96]
• Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
  o valid
  o previously unknown
  o and potentially useful.
• Remarks
  o (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
  o valid: in the statistical sense.
  o previously unknown: not explicit, no "common sense knowledge".
  o potentially useful: for some given application.
Introduction
• Statistics [Hand, Mannila & Smyth 2001]
  o representation of uncertainty
  o model-based inferences
  o focus on numeric data
• Machine Learning [Mitchell 1997]
  o knowledge representation
  o search strategies
  o focus on symbolic data
• Database Systems [Han & Kamber 2000]
  o data management
  o integration of data mining with DBS
  o scalability for large databases
Introduction
• KDD Process [Han & Kamber 2000] (diagram): Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
• KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996] (diagram): Database → Focussing → Pre-processing → Transformation → Data Mining → Patterns → Evaluation → Knowledge
Data Mining
• Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
  o Data Mining is the application of efficient algorithms to determine the patterns contained in some database.
• Data-Mining Tasks
  o clustering
  o classification
  o association rules (e.g., A and B → C)
  o generalisation
  o other tasks: regression, outlier detection, . . .
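To make the association-rule task concrete, the following is a minimal sketch of how a rule of the form "A and B → C" could be scored on a toy transaction database. Support and confidence are the standard measures for such rules, but they are not named on the slide; the function names and the transactions are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    sup_antecedent = support(transactions, antecedent)
    if sup_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_antecedent


# Toy database (hypothetical): each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]

# Score the rule "A and B -> C".
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

Here the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain both A and B. An association-rule miner would typically keep only rules whose support and confidence exceed user-given thresholds.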
Trends in KDD Research
• KDD 2000 Conference
  o New Data Mining Algorithms
  o Efficiency and Scalability of Data Mining Algorithms
  o Interactive Data Exploration
  o Visualization
  o Constraints and Evaluation in the KDD Process
Trends in KDD Research
• KDD 2002 Conference
  o Statistical Methods
  o Frequent Patterns
  o Streams and Time Series
  o Visualization
  o Web Search and Navigation
  o Text and Web Page Classification
  o Intrusion and Privacy
  o Applications
Trends in KDD Research
• KDD 2004 Conference
  o Frequent Patterns / Association Rules
  o Clustering
  o Mining Spatio-Temporal Data
  o Mining Data Streams
  o Dimensionality Reduction
  o Privacy-Preserving Data Mining
  o Mining Biological Data
  o Applications (Web, biological data, security, . . .)
Trends in KDD Research
• KDD 2006 Conference
  o Clustering
  o Classification / supervised ML
  o Privacy
  o Web / Graph Mining
  o Web / Text Mining
  o Frequent Pattern Mining
  o Structured Data
Trends in KDD Research
• KDD 2008 Conference
  o Text Mining
  o Data Integration
  o Social Networks
  o Graph Mining
  o Distance Functions and Metric Learning
  o Active and Semi-supervised Learning
  o Pattern Mining
  o Collaborative Filtering
Trends in KDD Research
• Some Hot Topics
  o Social networks: THE hot topic of KDD 08, topic of the only panel
  o Graph mining
  o Text mining and information extraction / integration
  o Collaborative filtering, or more generally recommender systems: $1M Netflix Prize
Overview of this Course
• Prerequisites
  o Foundations of database systems and statistics
  o Introductory graduate data mining course or equivalent
• Objectives
  o Introduction to some hot topics of data mining research
  o Training in research methodology
  o Presentation skills
  o Start thesis work after this class!
Overview of this Course
• Topics
  o Graph mining: social network analysis and analysis of biological networks as driving applications
  o Recommender systems: in particular, trust-based recommendation
  o Information extraction and integration: integration with existing databases
Overview of this Course
• Format
  o Tutorial surveys by instructor
  o Written research paper reviews by students
  o Research paper presentations by students, with discussions in class
  o Course research projects by students on a topic of their choice
Overview of this Course
• Tentative Grading Scheme
  o Paper review (20%)
  o Paper presentation (20%)
  o Course project report (40%), in two steps: project proposal, final project report
  o Course project presentation (20%)
  o Marking criteria: originality, technical quality, presentation
Overview of this Course
• Types of Course Projects
  o Literature survey: summarize the state of the art and identify open research problems
  o New problem: introduce and analyze a new problem
  o New algorithm for known problem: implement and evaluate the algorithm
  o Improvement of existing algorithm: implement and compare the algorithm
  o Comparison of existing algorithms on a new, interesting dataset: identify criteria for choice of algorithms / open research problems
Graph Mining
• Motivating Applications
• Social network analysis
  o What communities exist?
  o How does information about a new product spread?
  o What customers should be targeted to maximize the profit of a marketing campaign?
• Analysis of biological networks
  o What are the functional modules of an organism?
  o How do biological networks evolve in the course of time?
  o What protein should be targeted to inhibit some virulent bacteria?
Graph Mining
• Methods
• Frequent subgraph mining
  o frequent pattern mining approach
• Graph clustering
  o e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters relative to the size of the components
• Graph generative models
  o probabilistic models that generate graphs similar to real graphs / networks
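As an illustration of the graph clustering objective, here is a minimal sketch that evaluates the normalized cut of a two-way partition of an undirected, unweighted graph. It assumes the usual two-way definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of node degrees in a component; the function name and the example graph are hypothetical. The sketch only scores a given partition; finding the minimizing partition is the hard part and is not shown.

```python
def normalized_cut(edges, part_a):
    """Normalized cut of the partition (part_a, remaining nodes).

    `edges` is a list of undirected edges (u, v); `part_a` is the set of
    nodes in the first component.
    """
    part_a = set(part_a)
    cut = 0      # number of edges crossing the partition
    vol_a = 0    # sum of degrees of nodes in part_a
    vol_b = 0    # sum of degrees of the remaining nodes
    for u, v in edges:
        # Each undirected edge adds 1 to the degree of both endpoints.
        vol_a += (u in part_a) + (v in part_a)
        vol_b += (u not in part_a) + (v not in part_a)
        if (u in part_a) != (v in part_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        raise ValueError("both components must contain at least one edge endpoint")
    return cut / vol_a + cut / vol_b


# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

balanced = normalized_cut(edges, {0, 1, 2})   # splits along the bridge
lopsided = normalized_cut(edges, {0})         # isolates a single node
print(balanced, lopsided)
```

The balanced split cuts one edge and scores 2/7, while isolating node 0 cuts two edges and scores above 1. The normalization by volume is what penalizes splitting off tiny components, which a plain minimum cut would not do.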
Graph Mining
• Challenges
• Complexity of graph algorithms
  o Many graph mining problems are NP-hard.
  o Real graphs tend to be extremely large.
  → need efficient algorithms
• Attribute data
  o Many graphs have attributes associated with the nodes.
  o Transformation into a weighted graph loses a lot of information.
  → need new models / algorithms considering both relationship and attribute data
Recommender Systems
• Motivating Applications
• Motivation
  o The internet provides a flood of information on all kinds of items.
  o There is a great need for personalized recommendations.
  o The internet also provides a wealth of item ratings / reviews.
• Typical applications
  o Movie recommendation
  o Product recommendation
  o Keyword recommendation
Recommender Systems
• Methods
• Collaborative filtering
  o Uses only a database of user-item ratings.
  o Recommendation based on ratings by users with similar rating patterns.
• Content-based recommender systems
  o Use information about the content of items and / or the properties of users.
  o Recommend items that have content similar to items liked by the user.
• Trust-based recommender systems
  o Assume a social network / trust network. Trust can be defined explicitly or implicitly.
  o Recommendation based on ratings by trusted neighbors.
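The collaborative filtering idea can be sketched as follows: predict a user's rating for an item as a similarity-weighted average of the ratings given by other users. This is a minimal memory-based sketch, not a method prescribed by the slides; the choice of cosine similarity over co-rated items, the function names, and the toy ratings are all assumptions.

```python
from math import sqrt


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity of two users, computed over co-rated items only."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None if no other user with nonzero similarity rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# Hypothetical user-item rating database (ratings on a 1-5 scale).
ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob":   {"m1": 5, "m2": 1, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}

prediction = predict(ratings, "alice", "m3")
print(prediction)
```

Alice's ratings match Bob's exactly and differ from Carol's, so the prediction for "m3" lands closer to Bob's rating of 4 than to Carol's 2. A trust-based variant would replace the similarity weights with trust values taken from the trust network.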
Recommender Systems
• Challenges
• High dimensionality and sparsity of data
  o The overwhelming majority (> 99%) of user-item ratings are unknown.
  o Recommendation is especially hard for cold-start users and controversial items.
  → dimensionality reduction, model-based methods, trust-based approach
• Fraud
  o Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.
  → trust-based approach is more robust to fraud
• Privacy issues with trust network data
  o Only very few trust networks are in the public domain.
Information Extraction and Integration
• Motivating Applications
• Importance of unstructured text data
  o The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.
• Biomedical literature
  o Contains a wealth of valuable information that cannot be processed / searched automatically.
  o Extraction of entities and relationships such as proteins and their localizations.
• Online product reviews
  o A lot of product "reviews" are available online in community databases or blogs.
  o Companies want to know what customers think of their products.
Information Extraction and Integration
• Methods
• Basic NLP methods
  o Part-of-speech tagging
  o Lexica, ontologies, . . .
• Machine learning methods
  o Typically, supervised classification.
  o CRFs and similar methods are state-of-the-art.
• Bootstrapping approach
  o Using a small labeled training dataset, find textual extraction patterns.
  o Using these patterns, extract further entities / relationships and continue.
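The bootstrapping loop described above can be sketched on toy data: start from one known (protein, localization) pair, learn the text between the two mentions as an extraction pattern, apply the patterns to find new pairs, and repeat. This is a deliberately naive sketch under strong assumptions: patterns are exact strings, entities are single words, and the sentences and protein names are invented placeholders rather than real data.

```python
import re


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns from known pairs and
    extracting new pairs with the learned patterns."""
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: learn patterns, i.e. the text between the two
        # mentions of an already known pair.
        for sentence in sentences:
            for x, y in known:
                m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply every pattern to every sentence to extract pairs.
        extracted = set()
        for sentence in sentences:
            for p in patterns:
                m = re.search(r"(\w+)" + re.escape(p) + r"(\w+)", sentence)
                if m:
                    extracted.add((m.group(1), m.group(2)))
        if extracted <= known:
            break  # nothing new was found, so stop early
        known |= extracted
    return known, patterns


# Invented toy corpus with placeholder protein names.
sentences = [
    "ProtA is localized in the cytoplasm.",
    "ProtB is localized in the nucleoid.",
    "ProtC accumulates at the membrane.",
    "ProtB accumulates at the nucleoid.",
]
seeds = {("ProtA", "cytoplasm")}

pairs, learned_patterns = bootstrap(sentences, seeds, rounds=2)
print(sorted(pairs))
print(sorted(learned_patterns))
```

In the first round the seed yields the pattern " is localized in the ", which extracts the ProtB pair; in the second round that new pair yields " accumulates at the ", which in turn extracts the ProtC pair. Realistic systems additionally score patterns and extractions, since a single bad pattern can otherwise propagate errors through later rounds.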
Information Extraction and Integration
• Challenges
• Text data is hard to understand
  o Many of the NLP problems are still essentially unsolved.
  → relatively simple NLP methods are often sufficient for information extraction
• Portability across domains
  o Extraction methods need to be portable from one domain to another.
  o Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.
  → machine learning methods
• Entity mentions need to be resolved
  o Information extraction produces strings referencing an entity of a given type.
  o Without mapping to known real-world entities, extracted information is of limited usefulness.
  → need to integrate extracted information with existing databases
References
• Graph mining
  o X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08
  o Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007
• Recommender systems
  o Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
• Information extraction and integration
  o Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06
  o AnHai Doan & Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006