Text Data Mining: Introduction

Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley hchen@sims.berkeley.edu

The KDD Process for Extracting Useful Knowledge from Volumes of Data • Large databases are becoming ubiquitous • grocery store checkout registers • credit card authorization • Computer technology allows efficient and inexpensive data storage and access • But our ability to analyze and understand large datasets lags far behind.

Manual Data Analysis Impractical • Slow, expensive, and highly subjective • Becomes impractical as data volumes grow • N: number of records (10^9) • D: number of fields (10^2 to 10^3) • Need computer technology to automate the bookkeeping. • First KDD Workshop in 1989

Definitions of KDD • Knowledge Discovery from Data: the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

KDD Process: Selection • Learning the application domain • Creating a target dataset

KDD Process: Preprocessing • Data cleaning & preprocessing • remove noise • handle missing data fields • time sequence information
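
The cleaning steps on this slide can be illustrated with a minimal sketch. It assumes records are plain dictionaries, that "noise" means a value outside a plausible range, and that missing fields are filled with the median of the valid values. The function name `clean_records`, the field name `amount`, and the sample rows are hypothetical, chosen only for illustration.

```python
from statistics import median

def clean_records(records, field, low, high):
    """Drop out-of-range values (noise) and fill missing ones with the median."""
    valid = [r[field] for r in records
             if r.get(field) is not None and low <= r[field] <= high]
    fill = median(valid)  # assumes at least one valid value exists
    cleaned = []
    for r in records:
        value = r.get(field)
        if value is None:
            cleaned.append({**r, field: fill})   # missing field: impute
        elif low <= value <= high:
            cleaned.append(dict(r))              # plausible value: keep
        # otherwise the record is treated as noise and dropped
    return cleaned

# Invented sample data: one missing amount, one implausible amount.
records = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 9999.0},
    {"id": 4, "amount": 14.0},
]
cleaned = clean_records(records, "amount", low=0.0, high=1000.0)
```

Median imputation and range filtering are only two of many possible choices; which one is appropriate depends on the application domain learned in the selection step.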

KDD Process: Transformation • Data reduction & projection • feature extraction • dimensionality reduction • invariant representation

KDD Process: Data Mining • Choosing the function of data mining • Choosing data mining algorithms • Data mining: searching for patterns of interest

KDD Process: Interpretation / Evaluation • Interpretation • Using discovered knowledge

What is Data Mining? • Fitting models to or determining patterns from very large datasets. • A “regime” which enables people to interact effectively with massive data stores. • Deriving new information from data. • finding patterns across large datasets • discovering heretofore unknown information

What is Data Mining? • Potential point of confusion: • The extracting ore from rock metaphor does not really apply to the practice of data mining • If it did, then standard database queries would fit under the rubric of data mining • Find all employee records in which the employee earns $300/month less than their manager • In practice, DM refers to: • finding patterns across large datasets • discovering heretofore unknown information

Another Definition of DM • What SQL currently cannot do. • A standard query does not infer new information • It retrieves a subset of what is already present and known. • SQL originally intended for business apps • DM requires sophisticated aggregate queries
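
To make the point about standard queries concrete, here is a small sketch of the employee query mentioned earlier, run against an in-memory SQLite database. The table layout, column names, and sample rows are assumptions made for illustration, and "earns $300/month less" is read here as "exactly 300 less".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, manager_id INTEGER)"
)
# Invented sample rows: (id, name, monthly salary, manager id).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", 5000, None),
        (2, "Ben", 4700, 1),
        (3, "Cy", 4900, 1),
        (4, "Di", 4400, 2),
    ],
)

# Self-join: pair each employee with their manager, keep pairs 300 apart.
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.salary - e.salary = 300
    ORDER BY e.name
    """
).fetchall()
names = [name for (name,) in rows]
conn.close()
```

The result is a subset of rows that were already stored and already known. The query applies a condition the analyst specified in advance; it does not propose a pattern nobody asked about.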

DM Touchstone Applications • Finding patterns across data sets: • Reports on changes in retail sales • to improve sales • Patterns of sizes of TV audiences • for marketing • Patterns in NBA play • to alter, and so improve, performance • Deviations in standard phone calling behavior • to detect fraud • for marketing

DM Touchstone Applications • Separating signal from noise: • Classifying faint astronomical objects • Finding genes within DNA sequences • Discovering novel tectonic activity

Components of Data Mining • The model • function of the model • classification • clustering • representational form of the model • linear function of multiple variables • Gaussian probability density function • The preference criterion • goodness of fit • avoiding overfitting • The search algorithm
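
The three components can be seen side by side in a toy sketch. It assumes a linear model of one variable as the representational form, mean squared error as the goodness-of-fit criterion, and an exhaustive grid search as the search algorithm. The data points and the candidate grid are invented, and the sketch does not address overfitting.

```python
from itertools import product

def model(x, slope, intercept):
    """Representational form: a linear function of one variable."""
    return slope * x + intercept

def mean_squared_error(params, data):
    """Preference criterion: goodness of fit (lower is better)."""
    slope, intercept = params
    return sum((model(x, slope, intercept) - y) ** 2 for x, y in data) / len(data)

def grid_search(data, slopes, intercepts):
    """Search algorithm: try every parameter pair and keep the best one."""
    return min(product(slopes, intercepts),
               key=lambda params: mean_squared_error(params, data))

# Invented observations that lie close to y = 2x + 1.
data = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.0)]
candidates = [i / 2 for i in range(9)]  # 0.0, 0.5, ..., 4.0
best = grid_search(data, candidates, candidates)
```

Swapping any one component (a Gaussian density instead of a line, a penalized criterion instead of plain squared error, gradient descent instead of a grid) yields a different data mining method while the overall structure stays the same.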

Model Function • Classification • Regression • Clustering • Summarization • Dependency modeling • Link analysis • Sequence analysis

Model Representation • Decision tree • Linear model • Nonlinear model (e.g. Neural Network) • Example-based method (e.g. Nearest Neighbor) • Probabilistic graphical dependency model (e.g. Bayesian Network) • Relational attribute model

Search Algorithm • Parameter search, given a model • Model search over model space • predictive • descriptive

What’s New Here? • Sounds like statistical modeling or machine learning. • Main difference: scale and availability • Datasets too large for classical analysis • Increased opportunity for access • end user is often not a statistician • New issues in sampling

Statistician’s Viewpoint • What’s new about DM? • Returns statisticians to their empirical roots • exploration rather than modeling • Hypothesis testing may be irrelevant • given the large data sizes, everything is significant • Data were collected for a purpose other than the one they are now being analyzed for
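
The remark that everything becomes significant at large data sizes can be checked with a short calculation. The sketch assumes a one-sample z-test with known variance, so the test statistic is the effect size (in standard deviations) times the square root of the sample size. The effect size and the two sample sizes are illustrative choices, not values from the talk.

```python
import math

def two_sided_p_value(effect_size, n):
    """Two-sided p-value of a z-test for a mean shift of `effect_size` std devs."""
    z = effect_size * math.sqrt(n)
    # P(|Z| > z) for a standard normal Z equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# The same negligible effect (a thousandth of a standard deviation) ...
small_sample = two_sided_p_value(0.001, 10_000)         # z = 0.1
large_sample = two_sided_p_value(0.001, 1_000_000_000)  # z is about 31.6
```

With ten thousand records the effect is nowhere near significant; with a billion records the p-value is vanishingly small, even though the effect is just as practically irrelevant as before.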

The Statistician’s Viewpoint (David Hand 97) • Statistics vs. Machine Learning • Statistics: conservative, rigorous, abstract, idealized • Machine Learning: adventurous, engineering, practical, real solutions

Research Challenges • Massive datasets & high dimensionality • User interaction & prior knowledge • Overfitting & assessing statistical significance • Missing data • Understandability of patterns • Managing changing data and knowledge • Integration • Nonstandard, multimedia, object-oriented data

A Database Perspective on Knowledge Discovery • Concept of data mining as a querying process • First steps toward efficient development of knowledge discovery applications

New Research Frontier • Short term: efficient algorithms implementing machine learning tools on top of large databases • Long term: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces

KDDMS • KDD objects • a rule • a classifier • a clustering • KDD queries • a predicate returning a set of KDD or DB objects

Examples of KDD Query • Generate a classifier • Generate the strongest rule • Generate all rules with consequent attribute values computed by SQL query • Find tuples that belong to the largest cluster
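
As a rough illustration of what a query such as "generate the strongest rule" might compute, the sketch below searches single-item association rules over a list of transactions. It assumes that "strongest" means highest confidence among rules that meet a minimum support count; the function name `strongest_rule` and the transactions are hypothetical and not part of any KDD query language described here.

```python
from itertools import permutations

def strongest_rule(transactions, min_support=2):
    """Return (antecedent, consequent, confidence) of the best one-item rule."""
    items = sorted(set().union(*transactions))
    best = None
    for antecedent, consequent in permutations(items, 2):
        both = sum(1 for t in transactions
                   if antecedent in t and consequent in t)
        if both < min_support:
            continue  # rule is too rare to be trusted
        base = sum(1 for t in transactions if antecedent in t)
        confidence = both / base
        if best is None or confidence > best[2]:
            best = (antecedent, consequent, confidence)
    return best

# Invented market-basket data.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rule = strongest_rule(transactions)
```

Here the returned KDD object is a rule rather than a set of tuples, which is the distinction the KDDMS slides draw between KDD objects and ordinary database objects.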

Future Directions • KDD applications need development support • query KDD objects • data mining operations • nearest neighbors • clustering • Development of querying tools is a big challenge • Enable developers to build applications using a KDD query language

Text Data Mining • People’s first thought: • Make it easier to find things on the Web. • But this is information retrieval! • The metaphor of extracting ore from rock: • Does make sense for extracting documents of interest from a huge pile. • But does not reflect notions of DM in practice: • finding patterns across large collections • discovering heretofore unknown information

Real Text DM • What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible! From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Real Text DM • The point: • Discovering heretofore unknown information is not what we usually do with text. • (If it weren’t known, it could not have been written by someone!) • However: • There is a field whose goal is to learn about patterns in text for its own sake ...

Observation Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

TDM using Metadata (instead of Text) • Data: • Reuters newswire (22,000 articles, late 1980s) • Categories: commodities, time, countries, people, and topic • Goals: • distributions of categories across time (trends) • distributions of categories between collections • category co-occurrence (e.g., topic|country) • Interactive Interface: • lists, pie charts, 2D line plots
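
The goals on this slide reduce to counting over category labels. The sketch below assumes each article is a record with a year, a list of topics, and a list of countries; the three records are invented stand-ins, not actual newswire data, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Invented metadata records standing in for categorized newswire articles.
articles = [
    {"year": 1987, "topics": ["wheat"], "countries": ["usa", "ussr"]},
    {"year": 1987, "topics": ["crude"], "countries": ["iran"]},
    {"year": 1988, "topics": ["wheat", "corn"], "countries": ["usa"]},
]

def topic_trend(articles, topic):
    """Distribution of one category across time: article count per year."""
    return Counter(a["year"] for a in articles if topic in a["topics"])

def topic_country_counts(articles):
    """Category co-occurrence: how often each (topic, country) pair appears."""
    return Counter(pair for a in articles
                   for pair in product(a["topics"], a["countries"]))

def topic_given_country(articles, topic, country):
    """Estimate of topic|country: share of a country's articles on a topic."""
    with_country = [a for a in articles if country in a["countries"]]
    return sum(topic in a["topics"] for a in with_country) / len(with_country)

wheat_trend = topic_trend(articles, "wheat")
pair_counts = topic_country_counts(articles)
corn_given_usa = topic_given_country(articles, "corn", "usa")
```

Counts like these are what the lists, pie charts, and line plots of an interactive interface would display.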

Combining Text with Metadata (images, hyperlinks) • Examples • Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford) • Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC) • Images + Text to improve image search
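
One way to find "authority pages" from links alone is the hubs-and-authorities iteration associated with Kleinberg. The following is a simplified sketch of that idea on a tiny invented link graph, not a reproduction of any published implementation: a page's authority score is the sum of the hub scores of pages linking to it, a page's hub score is the sum of the authority scores of pages it links to, and both are renormalized each round.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point here.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages this one points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Invented link graph: three pages point at "c"; "d" also points at "e".
links = {"a": ["c"], "b": ["c"], "d": ["c", "e"], "c": []}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In practice the link graph would be restricted to pages matching a text query, which is where the combination of text and links comes in.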

True Text Data Mining: Don Swanson’s Medical Work • Given • medical titles and abstracts • a problem (incurable rare disease) • some medical expertise • find causal links among titles • symptoms • drugs • results

Swanson Example (1991) • Problem: Migraine headaches (M) • stress associated with M • stress leads to loss of magnesium • calcium channel blockers prevent some M • magnesium is a natural calcium channel blocker • spreading cortical depression (SCD) implicated in M • high levels of magnesium inhibit SCD • M patients have high platelet aggregability • magnesium can suppress platelet aggregability • All extracted from medical journal titles
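
The chain of evidence in this example has a simple shape: some titles connect migraine to an intermediate term, and other titles connect that same term to magnesium, while no title mentions both. A minimal sketch of that linking step follows. It assumes whitespace tokenization and uses invented titles that paraphrase the bullets above; it is not Swanson's actual procedure, which was only partially automated and relied on medical expertise.

```python
def linking_terms(titles, source, target):
    """Find terms that co-occur with `source` and, separately, with `target`."""
    tokenized = [set(title.lower().split()) for title in titles]

    def neighbours(term):
        found = set()
        for words in tokenized:
            if term in words:
                found |= words - {term}
        return found

    # True if some single title already mentions both terms together.
    direct = any({source, target} <= words for words in tokenized)
    shared = sorted(neighbours(source) & neighbours(target))
    return shared, direct

# Invented titles paraphrasing the slide's bullets.
titles = [
    "stress triggers migraine",
    "stress depletes magnesium",
    "platelet aggregability in migraine",
    "magnesium suppresses platelet aggregability",
]
shared, direct = linking_terms(titles, "migraine", "magnesium")
```

A real system would need stop-word removal, phrase detection, and a way to rank the many candidate links, which is where the required human expertise enters.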

Swanson’s TDM • Two of his hypotheses have received some experimental verification. • His technique • Only partially automated • Required medical expertise • Few people are working on this.

Conclusions • Currently, what might be construed as Text Data Mining is really Computational Linguistics • Text is tricky to process, but rich and abundant (now) • There are many CL tools available • Data Mining directly from text • tells us about language • produces meta-information that may be useful for information access

Conclusions • Information Access != Text Data Mining • IA = finding needle in haystack • TDM = finding patterns or new information • However, Information Access may potentially be served by Text Data Mining techniques: • automated metadata assignment • collection overviews • The synthesis of ideas from TDM and IA: • Perhaps a new field of exploratory data analysis over text!

Promising Research Directions • Text Data Mining Problems: • Patterns within sets of documents: • What is the latest in this field? • How is this field related to that field? • Chains of evidence embedded in text: • What drugs have been tested for this symptom? • What effects did this funding have on that field? • Human use of information over time • How does information diffuse across the web?

Needed from Systems • Support for linking chains of associations • Support for combined structured and unstructured data • Support for combining disparate collections

Statistical Themes & Lessons for Data Mining • Statistical themes • Statistical lessons • Cooperation between statistical and computational communities

Overview of Statistical Science • Probability distributions • Estimation, consistency, uncertainty, assumptions, robustness, and model averaging • Hypothesis testing • Model scoring • Markov Chain Monte Carlo • Generalized model classes

Overview of Statistical Science • Rational decision making and planning • Inference to causes • Prediction

Important Themes of Statistics to Data Mining • Clarity about goals • Use of models that are reliable means to the goal, understandable and plausible to users • Sense of uncertainties of models and predictions

Lessons • Data can lie • Sometimes it’s not what’s in the data that matters • Perversity of the pervasive P-value • Intervention and prediction
