Course on Data Mining (581550-4): Seminar Meetings

[Seminar schedule table]
• Dates: 02.11., 09.11., 16.11., 23.11., 30.11.
• Topics: Ass. Rules, Clustering, Episodes, KDD Process, Text Mining, Home Exam
• Legend: M = Seminar by Mika, P = Seminar by Pirjo

Today 16.11.2001
• R. Feldman, M. Fresko, H. Hirsh, et al.: "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998.
• B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.

Good to Read as Background
• Both papers refer to the Agrawal and Srikant paper we had last week: Rakesh Agrawal and Ramakrishnan Srikant: "Mining Sequential Patterns", Int'l Conference on Data Engineering, 1995.

Knowledge Management: A Text Mining Approach
R. Feldman, M. Fresko, H. Hirsh, et al.
Bar-Ilan University and Instinct Software, Israel; Rutgers University, USA; LIA-EPFL, Switzerland
Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge Management)
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

KM: A Text Mining Approach
• Basic idea (see selected phases on the next slides):
1. Get the input data in SGML (or XML) format. Select only the contents of the desired elements (title, abstract, etc.).
2. Do linguistic preprocessing:
   2.1 Term extraction (use linguistic software for this)
   2.2 Term generation (combine adjacent terms into morpho-syntactic patterns like "noun-noun", "adj.-noun", etc. by calculating association coefficients)
   2.3 Term filtering (select only the top M most frequent terms)
3. Create taxonomies (there is a tool for this)
4. Generate associations (you may constrain the creation)
5. Visualize/explore the results
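Steps 2.2 and 2.3 can be illustrated with a small Python sketch. It is only a sketch under stated assumptions, not the system described in the paper: the input is assumed to be already part-of-speech tagged (step 2.1), the Dice coefficient stands in for the unspecified association coefficient, and the names `generate_terms`, `top_m` and `ALLOWED_PATTERNS` are hypothetical.

```python
from collections import Counter

# Assumption: only these adjacent tag patterns may be merged into a term.
ALLOWED_PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}


def count_terms(tagged_docs):
    """Count single words and adjacent word pairs that match an allowed pattern."""
    unigram, bigram = Counter(), Counter()
    for doc in tagged_docs:
        for word, _tag in doc:
            unigram[word] += 1
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if (t1, t2) in ALLOWED_PATTERNS:
                bigram[(w1, w2)] += 1
    return unigram, bigram


def dice(pair, unigram, bigram):
    """Dice coefficient, used here as a stand-in association coefficient."""
    w1, w2 = pair
    return 2 * bigram[pair] / (unigram[w1] + unigram[w2])


def generate_terms(tagged_docs, min_score=0.5):
    """Step 2.2: keep adjacent pairs whose association score is high enough."""
    unigram, bigram = count_terms(tagged_docs)
    return {pair: n for pair, n in bigram.items()
            if dice(pair, unigram, bigram) >= min_score}


def top_m(term_counts, m):
    """Step 2.3: keep only the M most frequent generated terms."""
    return [term for term, _ in Counter(term_counts).most_common(m)]


# Toy input: each document is a list of (word, tag) pairs.
DOCS = [
    [("data", "NOUN"), ("mining", "NOUN"), ("finds", "VERB"),
     ("useful", "ADJ"), ("patterns", "NOUN")],
    [("data", "NOUN"), ("mining", "NOUN"), ("uses", "VERB"), ("data", "NOUN")],
]
TERMS = generate_terms(DOCS)
BEST = top_m(TERMS, 1)
```

With this toy input, "data mining" occurs twice and "useful patterns" once, so `top_m(TERMS, 1)` keeps only "data mining".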

2.1: Term Extraction

3: Taxonomy Construction

4: Association Rule Generation

5.1: Visualization/Exploration

5.2: Visualization/Exploration

Discovering Trends in Text Databases
Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA
Published in KDD'97
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

Discovering Trends in Text Databases
• Basic ideas:
  • Identify frequent phrases using sequential pattern mining (see the slides and summaries for the Agrawal and Srikant paper "Mining Sequential Patterns" (MSP))
  • Generate histories of phrases
  • Find phrases that satisfy a specified trend
• Definitions:
  • Phrase: a phrase p is ⟨(w1)(w2) … (wn)⟩, where each wi is a word
  • 1-phrase: ⟨⟨(IBM)⟩ ⟨(data)(mining)⟩⟩
  • 2-phrase: ⟨⟨⟨(IBM)⟩ ⟨(data)(mining)⟩⟩ ⟨⟨(Anderson)(Consulting)⟩ ⟨(decision)(support)⟩⟩⟩
  • Itemset, sequence, is contained, etc.: as in the MSP paper
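The three basic ideas can be sketched in Python as follows. This is a simplified illustration, not the method of Lent et al.: contiguous word n-grams stand in for phrases found by sequential pattern mining, support is the fraction of documents in a time period that contain the phrase, and a plain "strictly rising" predicate stands in for a trend query. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(words, n):
    """All contiguous word n-grams of a document (a stand-in for mined phrases)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_histories(docs, n=2, min_support=0.25):
    """Map each frequent phrase to its support per time period.

    docs is an iterable of (period, list_of_words) pairs.
    """
    by_period = defaultdict(list)
    for period, words in docs:
        by_period[period].append(words)

    histories = defaultdict(dict)
    for period, doc_list in by_period.items():
        counts = Counter()
        for words in doc_list:
            counts.update(set(ngrams(words, n)))  # count each document once
        for phrase, count in counts.items():
            support = count / len(doc_list)
            if support >= min_support:
                histories[phrase][period] = support
    return dict(histories)


def is_rising(history, periods):
    """Trend predicate: support strictly increases from period to period."""
    values = [history.get(p, 0.0) for p in periods]
    return all(a < b for a, b in zip(values, values[1:]))


# Toy corpus: four two-word documents per year.
DOCS = (
    [(1999, ["data", "mining"])] * 1 + [(1999, ["text", "search"])] * 3
    + [(2000, ["data", "mining"])] * 2 + [(2000, ["text", "search"])] * 2
    + [(2001, ["data", "mining"])] * 3 + [(2001, ["text", "search"])] * 1
)
PERIODS = [1999, 2000, 2001]
HISTORIES = phrase_histories(DOCS)
RISING = [p for p, h in HISTORIES.items() if is_rising(h, PERIODS)]
```

In the toy corpus the support of "data mining" grows from 0.25 to 0.75 over the three years while "text search" declines, so only "data mining" ends up in `RISING`.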

Discovering Trends in Text Databases
• Gaps: minimum and maximum gaps between adjacent words identify relations of words/phrases inside sentences or paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
  • Sentence boundary: 1,000
  • Paragraph boundary: 100,000
  • Section boundary: 10,000,000
• Phases:
  • Partition the documents by time stamp and create phrases for each partition (Lent et al. use patent documents)
  • Select the frequent phrases and save their frequencies
  • Define shape queries using SDL (Shape Definition Language)
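One way to read the boundary values is as jumps in a running word position, so that a maximum gap below 1,000 keeps two words inside the same sentence. The Python sketch below illustrates that reading only. The nested-list document layout and the names `number_words` and `pairs_within_gap` are assumptions for the example, and it presumes sentences and paragraphs are short enough that in-sentence positions never add up to a boundary value.

```python
SENTENCE = 1_000
PARAGRAPH = 100_000
SECTION = 10_000_000


def number_words(sections):
    """Assign a position to every word.

    sections is a list of sections; a section is a list of paragraphs;
    a paragraph is a list of sentences; a sentence is a list of words.
    The position grows by 1 per word and jumps at each boundary.
    """
    numbered = []
    pos = 0
    for s_idx, section in enumerate(sections):
        if s_idx > 0:
            pos += SECTION
        for p_idx, paragraph in enumerate(section):
            if p_idx > 0:
                pos += PARAGRAPH
            for t_idx, sentence in enumerate(paragraph):
                if t_idx > 0:
                    pos += SENTENCE
                for word in sentence:
                    pos += 1
                    numbered.append((word, pos))
    return numbered


def pairs_within_gap(numbered, first, second, min_gap=1, max_gap=SENTENCE - 1):
    """Positions where `second` follows `first` within the allowed gap."""
    return [
        (p1, p2)
        for w1, p1 in numbered if w1 == first
        for w2, p2 in numbered if w2 == second and min_gap <= p2 - p1 <= max_gap
    ]


# One section with two paragraphs.
DOC = [[
    [["data", "mining", "works"], ["mining", "helps"]],
    [["data", "grows"]],
]]
NUMBERED = number_words(DOC)
SAME_SENTENCE = pairs_within_gap(NUMBERED, "data", "mining")
SAME_PARAGRAPH = pairs_within_gap(NUMBERED, "data", "mining",
                                  max_gap=PARAGRAPH - 1)
```

Here "data" and "mining" are adjacent in the first sentence (positions 1 and 2), while the second "mining" sits at position 1,004, so it is only matched when the maximum gap is widened to the paragraph level.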

