Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley hchen@sims.berkeley.edu
The KDD Process for Extracting Useful Knowledge from Volumes of Data • Large databases have become ubiquitous • grocery store checkout registers • credit card authorization • Computer technology allows efficient and inexpensive data storage and access • But our ability to analyze and understand large datasets lags far behind.
Manual Data Analysis Impractical • Slow, expensive, and highly subjective • Becomes impractical as data volumes grow • N: number of records (10^9) • D: number of fields (10^2 to 10^3) • Need computer technology to automate the bookkeeping. • First KDD Workshop in 1989
Definitions of KDD • Knowledge Discovery from Data: The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
KDD Process: Selection • Learning the application domain • Creating a target dataset
KDD Process: Preprocessing • Data cleaning & preprocessing • remove noise • handle missing data fields • time sequence information
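As one illustration of the "handle missing data fields" step, here is a minimal sketch that fills missing numeric values with the mean of the observed ones. Mean imputation is only one of several possible strategies and is an assumption here, not something the slides prescribe; the function name `impute_missing` and the toy records are hypothetical.

```python
from statistics import mean

def impute_missing(records, field):
    """Return copies of `records` with missing `field` values set to the observed mean."""
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        # Nothing to estimate from: return unchanged copies.
        return [dict(r) for r in records]
    fill = mean(observed)
    return [{**r, field: fill if r.get(field) is None else r[field]} for r in records]

# Hypothetical checkout records; the second one lacks an amount.
RECORDS = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 20.0},
]
cleaned = impute_missing(RECORDS, "amount")
print(cleaned)
```

The original records are left untouched, so the raw data stays available for comparison against the cleaned copy.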
KDD Process: Transformation • Data reduction & projection • features extraction • dimensionality reduction • invariant representation
KDD Process: Data Mining • Choosing function of data mining • Choosing data mining algorithms • Data mining: searching for patterns of interest
KDD Process: Interpretation / Evaluation • Interpretation • Using discovered knowledge
What is Data Mining? • Fitting models to or determining patterns from very large datasets. • A “regime” which enables people to interact effectively with massive data stores. • Deriving new information from data. • finding patterns across large datasets • discovering heretofore unknown information
What is Data Mining? • Potential point of confusion: • The “extracting ore from rock” metaphor does not really apply to the practice of data mining • If it did, then standard database queries would fit under the rubric of data mining • Find all employee records in which the employee earns $300/month less than their manager • In practice, DM refers to: • finding patterns across large datasets • discovering heretofore unknown information
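To make the contrast concrete, the employee/manager example above can be written as an ordinary SQL self-join. The sketch below uses Python's built-in `sqlite3` module; the table layout, the sample rows, and the reading of "$300/month less" as "at least $300 less" are all assumptions made for illustration. The query retrieves rows that are already in the table and infers nothing new.

```python
import sqlite3

def underpaid_vs_manager(rows, gap=300):
    """Return names of employees earning at least `gap` less per month than their manager."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, "
        "salary INTEGER, manager_id INTEGER)"
    )
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", rows)
    # Self-join: pair each employee row (e) with its manager's row (m).
    query = """
        SELECT e.name
        FROM emp AS e
        JOIN emp AS m ON e.manager_id = m.id
        WHERE m.salary - e.salary >= ?
        ORDER BY e.name
    """
    names = [row[0] for row in conn.execute(query, (gap,))]
    conn.close()
    return names

# Hypothetical rows: (id, name, monthly salary, manager id)
ROWS = [
    (1, "Ada", 5000, None),
    (2, "Ben", 4800, 1),
    (3, "Cy", 4500, 1),
    (4, "Di", 4300, 3),
]
print(underpaid_vs_manager(ROWS))  # ['Cy']
```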
Another Definition of DM • What SQL currently cannot do. • A standard query does not infer new information • It retrieves a subset of what is already present and known. • SQL originally intended for business apps • DM requires sophisticated aggregate queries
DM Touchstone Applications • Finding patterns across data sets: • Reports on changes in retail sales • to improve sales • Patterns of sizes of TV audiences • for marketing • Patterns in NBA play • to alter, and so improve, performance • Deviations in standard phone calling behavior • to detect fraud • for marketing
DM Touchstone Applications • Separating signal from noise: • Classifying faint astronomical objects • Finding genes within DNA sequences • Discovering novel tectonic activity
Components of Data Mining • The model • function of the model • classification • clustering • representational form of the model • linear function of multiple variables • Gaussian probability density function • The preference criterion • goodness of fit • avoiding overfitting • The search algorithm
Model Function • Classification • Regression • Clustering • Summarization • Dependency modeling • Link analysis • Sequence analysis
Model Representation • Decision tree • Linear model • Nonlinear model (e.g. Neural Network) • Example-based method (e.g. Nearest Neighbor) • Probabilistic graphical dependency model (e.g. Bayesian Network) • Relational attribute model
Search Algorithm • Parameter search, given a model • Model search over model space • predictive • descriptive
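The three components (model, preference criterion, search algorithm) can be seen side by side in a toy regression. This is a sketch under stated assumptions: the model is a line `y = a*x + b`, the preference criterion is the sum of squared errors, and the search is an exhaustive scan of a small integer grid. None of these specific choices comes from the slides, and the function names are hypothetical.

```python
def predict(params, x):
    """Model: a linear function of one variable."""
    a, b = params
    return a * x + b

def squared_error(params, data):
    """Preference criterion: goodness of fit as the sum of squared residuals."""
    return sum((predict(params, x) - y) ** 2 for x, y in data)

def grid_search(data, candidates):
    """Search algorithm: parameter search over a fixed candidate set, given the model."""
    return min(candidates, key=lambda p: squared_error(p, data))

# Toy data generated from y = 2x + 1.
DATA = [(0, 1), (1, 3), (2, 5), (3, 7)]
CANDIDATES = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]

best = grid_search(DATA, CANDIDATES)
print(best, squared_error(best, DATA))
```

Swapping any one component (a different model form, a criterion that penalizes complexity to avoid overfitting, or a smarter search) leaves the other two unchanged, which is the point of separating them.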
What’s New Here? • Sounds like statistical modeling or machine learning. • Main difference: scale and availability • Datasets too large for classical analysis • Increased opportunity for access • end user is often not a statistician • New issues in sampling
Statistician’s Viewpoint • What’s new about DM? • Returns statisticians to their empirical roots • exploration rather than modeling • Hypothesis testing may be irrelevant • given the large data sizes, everything is significant • Data were collected for a purpose other than the one they are now being analyzed for
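The claim that "everything is significant" at large data sizes can be checked numerically. The sketch below assumes a one-sample z-test with known variance, where the test statistic is the effect size (in standard deviations) times the square root of n; the tiny effect size of 0.01 is an arbitrary choice for illustration.

```python
import math

def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a z-test for a mean shift of `effect_in_sd` standard deviations."""
    z = effect_in_sd * math.sqrt(n)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# The same negligible effect, tested at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.01, n))
```

With 100 records the effect is nowhere near significant; with a million records the p-value is vanishingly small, even though the effect itself has not changed and may be of no practical interest.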
The Statistician’s Viewpoint (David Hand 97) • Statistics vs. Machine Learning • Statistics: conservative, rigorous, abstract, idealized • Machine Learning: adventurous, engineering, practical, real solutions
Research Challenges • Massive datasets & high dimensionality • User interaction & prior knowledge • Overfitting & assessing statistical significance • Missing data • Understandability of patterns • Managing changing data and knowledge • Integration • Nonstandard, multimedia, object-oriented data
A Database Perspective on Knowledge Discovery • Concept of data mining as a querying process • First steps toward efficient development of knowledge discovery applications
New Research Frontier • Short term: Efficient algorithms implementing machine learning tools on top of large databases • Long term: Building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces
KDDMS • KDD objects • a rule • a classifier • a clustering • KDD queries • a predicate returning a set of KDD or DB objects
Examples of KDD Query • Generate a classifier • Generate the strongest rule • Generate all rules with consequent attribute values computed by SQL query • Find tuples that belong to the largest cluster
Future Directions • KDD applications need development support • query KDD objects • data mining operations • nearest neighbors • clustering • Development of querying tools is a big challenge • Enable developers to build applications using a KDD query language
Text Data Mining • People’s first thought: • Make it easier to find things on the Web. • But this is information retrieval! • The metaphor of extracting ore from rock: • Does make sense for extracting documents of interest from a huge pile. • But does not reflect notions of DM in practice: • finding patterns across large collections • discovering heretofore unknown information
Real Text DM • What would finding a pattern across a large text collection really look like?
Bill Gates + MS-DOS in the Bible! From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)
Real Text DM • The point: • Discovering heretofore unknown information is not what we usually do with text. • (If it weren’t known, it could not have been written by someone!) • However: • There is a field whose goal is to learn about patterns in text for its own sake ...
Observation Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.
TDM using Metadata (instead of Text) • Data: • Reuters newswire (22,000 articles, late 1980s) • Categories: commodities, time, countries, people, and topic • Goals: • distributions of categories across time (trends) • distributions of categories between collections • category co-occurrence (e.g., topic|country) • Interactive Interface: • lists, pie charts, 2D line plots
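Category co-occurrence such as topic|country reduces to counting over the metadata fields, with no text processing at all. The sketch below assumes each article is a dict with `topics` and `countries` lists; that layout, the function names, and the four sample articles are hypothetical and are not taken from the newswire collection itself.

```python
from collections import Counter
from itertools import product

def cooccurrence(articles):
    """Count how many articles carry each (topic, country) pair."""
    counts = Counter()
    for art in articles:
        for pair in product(set(art["topics"]), set(art["countries"])):
            counts[pair] += 1
    return counts

def topic_given_country(articles, topic, country):
    """Estimate P(topic | country) as a fraction of the articles tagged with `country`."""
    with_country = [a for a in articles if country in a["countries"]]
    if not with_country:
        return 0.0
    return sum(topic in a["topics"] for a in with_country) / len(with_country)

# Hypothetical metadata records.
ARTICLES = [
    {"topics": ["wheat"], "countries": ["usa"]},
    {"topics": ["wheat", "trade"], "countries": ["usa", "japan"]},
    {"topics": ["crude"], "countries": ["usa"]},
    {"topics": ["trade"], "countries": ["japan"]},
]
counts = cooccurrence(ARTICLES)
print(counts.most_common(3))
print(topic_given_country(ARTICLES, "wheat", "usa"))
```

Grouping the same counts by a date field instead of by country would give the trend distributions listed among the goals.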
Combining Text with Metadata (images, hyperlinks) • Examples • Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford) • Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC) • Images + Text to improve image search
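The link-based notion of "authority pages" can be sketched with the hub/authority iteration from Kleinberg's work (commonly called HITS): a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. The code below is a simplified illustration on a hypothetical four-page link graph. It omits the text-based step that selects the pages in the first place, and the fixed iteration count is an assumption.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a graph given as {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical link graph: a, b and d all point at c; b also points at d.
LINKS = {"a": ["c"], "b": ["c", "d"], "c": [], "d": ["c"]}
hub, auth = hits(LINKS)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```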
True Text Data Mining: Don Swanson’s Medical Work • Given • medical titles and abstracts • a problem (incurable rare disease) • some medical expertise • find causal links among titles • symptoms • drugs • results
Swanson Example (1991) • Problem: Migraine headaches (M) • stress associated with M • stress leads to loss of magnesium • calcium channel blockers prevent some M • magnesium is a natural calcium channel blocker • spreading cortical depression (SCD) implicated in M • high levels of magnesium inhibit SCD • M patients have high platelet aggregability • magnesium can suppress platelet aggregability • All extracted from medical journal titles
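The structure of the migraine example is a chain: migraine is linked to intermediate terms in some titles, and those same terms are linked to magnesium in other titles, while no single title mentions both. A minimal sketch of that linking step follows. The titles are invented paraphrases of the bullets above, not actual journal titles; the controlled vocabulary and the substring matching are simplifying assumptions; and this is not a reconstruction of Swanson's actual procedure, which was only partially automated.

```python
def linking_terms(titles, a, c, vocabulary):
    """Find terms that co-occur with `a` in some title and with `c` in some title.

    Returns (sorted linking terms, whether `a` and `c` ever share a title).
    """
    def terms(title):
        text = title.lower()
        return {t for t in vocabulary if t in text}

    term_sets = [terms(t) for t in titles]
    with_a = set().union(*(s for s in term_sets if a in s))
    with_c = set().union(*(s for s in term_sets if c in s))
    direct = any(a in s and c in s for s in term_sets)
    return sorted((with_a & with_c) - {a, c}), direct

# Invented titles for illustration only.
TITLES = [
    "Stress as a trigger of migraine",
    "Stress and urinary loss of magnesium",
    "Spreading cortical depression in migraine",
    "Magnesium inhibits spreading cortical depression",
    "Platelet aggregability in migraine patients",
    "Magnesium suppresses platelet aggregability",
]
VOCABULARY = {
    "migraine", "magnesium", "stress",
    "spreading cortical depression", "platelet aggregability",
}
bridges, direct = linking_terms(TITLES, "migraine", "magnesium", VOCABULARY)
print(bridges, direct)
```

A non-empty list of bridging terms combined with `direct` being false is the pattern of interest: the literature connects the two topics only indirectly, which suggests a hypothesis worth expert review.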
Swanson’s TDM • Two of his hypotheses have received some experimental verification. • His technique • Only partially automated • Required medical expertise • Few people are working on this.
Conclusions • Currently, what might be construed as Text Data Mining is really Computational Linguistics • Text is tricky to process, but rich and abundant (now) • There are many CL tools available • Data Mining directly from text • tells us about language • produces meta-information that may be useful for information access
Conclusions • Information Access != Text Data Mining • IA = finding needle in haystack • TDM = finding patterns or new information • However, Information Access may potentially be served by Text Data Mining techniques: • automated metadata assignment • collection overviews • The synthesis of ideas from TDM and IA: • Perhaps a new field of exploratory data analysis over text!
Promising Research Directions • Text Data Mining Problems: • Patterns within sets of documents: • What is the latest in this field? • How is this field related to that field? • Chains of evidence embedded in text: • What drugs have been tested for this symptom? • What effects did this funding have on that field? • Human use of information over time • How does information diffuse across the web?
Needed from Systems • Support for linking chains of associations • Support for combined structured and unstructured data • Support for combining disparate collections
Statistical Themes & Lessons for Data Mining • Statistical themes • Statistical lessons • Cooperation between statistical and computational communities
Overview of Statistical Science • Probability distributions • Estimation, consistency, uncertainty, assumptions, robustness, and model averaging • Hypothesis testing • Model scoring • Markov Chain Monte Carlo • Generalized model classes
Overview of Statistical Science • Rational decision making and planning • Inference to causes • Prediction
Important Themes of Statistics to Data Mining • Clarity about goals • Use of models that are reliable means to the goal, understandable and plausible to users • Sense of uncertainties of models and predictions
Lessons • Data can lie • Sometimes it’s not what’s in the data that matters • Perversity of the pervasive P-value • Intervention and prediction