Text Data Mining: Introduction

Presentation Transcript


  1. Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley hchen@sims.berkeley.edu

  2. The KDD Process for Extracting Useful Knowledge from Volumes of Data • Large databases have become ubiquitous • grocery store checkout registers • credit card authorization • Computer technology allows efficient and inexpensive data storage and access • But our ability to analyze and understand large datasets lags far behind.

  3. Manual Data Analysis Impractical • Slow, expensive, and highly subjective • Becomes impractical as data volumes grow • N: number of records (10^9) • D: number of fields (10^2 to 10^3) • Need computer technology to automate the bookkeeping. • First KDD Workshop in 1989
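
To make the scale argument on this slide concrete, here is a back-of-the-envelope sketch. It takes the slide's orders of magnitude (a billion records, hundreds to thousands of fields) and assumes, purely for illustration, that a person could inspect one value per second without pause. The one-second pace and the function name are assumptions, not figures from the talk.

```python
# Rough scale estimate using the slide's orders of magnitude.
N_RECORDS = 10**9                       # N: number of records
FIELDS_LOW, FIELDS_HIGH = 10**2, 10**3  # D: number of fields
SECONDS_PER_YEAR = 365 * 24 * 3600


def manual_review_years(n_records, n_fields, seconds_per_value=1.0):
    """Years needed to look at every value once at a fixed (assumed) pace."""
    return n_records * n_fields * seconds_per_value / SECONDS_PER_YEAR


low = manual_review_years(N_RECORDS, FIELDS_LOW)
high = manual_review_years(N_RECORDS, FIELDS_HIGH)
print(f"{low:,.0f} to {high:,.0f} person-years of nonstop reading")
```

Even under this generous assumption the answer is thousands of person-years, which is why the slide calls for automating the bookkeeping.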

  4. Definitions of KDD • Knowledge Discovery from Data: The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

  5. KDD Process: Selection • Learning the application domain • Creating a target dataset

  6. KDD Process: Preprocessing • Data cleaning & preprocessing • remove noise • handle missing data fields • time sequence information
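
A minimal sketch of two of the cleaning steps named on this slide, assuming records are plain dictionaries. The field name, the valid range, and the choice of median imputation are illustrative assumptions. The talk does not prescribe a particular method.

```python
from statistics import median


def clean(records, field, valid_range):
    """Drop out-of-range values as noise, then fill missing values with the median."""
    lo, hi = valid_range
    # Keep rows whose value is missing (handled below) or inside the valid range.
    kept = [r for r in records if r[field] is None or lo <= r[field] <= hi]
    observed = [r[field] for r in kept if r[field] is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [{**r, field: fill if r[field] is None else r[field]} for r in kept]


raw = [  # invented example data
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},      # missing data field
    {"id": 3, "amount": 9.0},
    {"id": 4, "amount": -9999.0},   # sentinel value treated as noise
    {"id": 5, "amount": 15.0},
]
cleaned = clean(raw, "amount", valid_range=(0.0, 1000.0))
```

Here record 4 is dropped and record 2 receives the median of the remaining observed amounts. Whether dropping or imputing is appropriate depends on the application domain learned in the selection step.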

  7. KDD Process: Transformation • Data reduction & projection • feature extraction • dimensionality reduction • invariant representation
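
As one very simple instance of dimensionality reduction, the sketch below drops columns whose variance is near zero, since a field that barely changes carries little information for pattern finding. This low-variance filter is a stand-in chosen for brevity. It is not a method named in the talk, and the threshold and data are invented.

```python
from statistics import pvariance


def drop_low_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, keep


rows = [  # invented example: column 0 is constant, column 2 barely moves
    [1.0, 5.0, 0.10],
    [1.0, 7.0, 0.11],
    [1.0, 9.0, 0.10],
    [1.0, 3.0, 0.12],
]
reduced, kept = drop_low_variance(rows, threshold=0.01)
```

Only column 1 survives here. Projection methods that combine fields rather than discard them serve the same goal, at higher cost.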

  8. KDD Process: Data Mining • Choosing the function of data mining • Choosing data mining algorithms • Data mining: searching for patterns of interest

  9. KDD Process: Interpretation / Evaluation • Interpretation • Using discovered knowledge

  10. What is Data Mining? • Fitting models to or determining patterns from very large datasets. • A “regime” which enables people to interact effectively with massive data stores. • Deriving new information from data. • finding patterns across large datasets • discovering heretofore unknown information

  11. What is Data Mining? • Potential point of confusion: • The “extracting ore from rock” metaphor does not really apply to the practice of data mining • If it did, then standard database queries would fit under the rubric of data mining • Find all employee records in which the employee earns $300/month less than their manager • In practice, DM refers to: • finding patterns across large datasets • discovering heretofore unknown information

  12. Another Definition of DM • What SQL currently cannot do. • A standard query does not infer new information • It retrieves a subset of what is already present and known. • SQL was originally intended for business apps • DM requires sophisticated aggregate queries
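
The employee query mentioned earlier in the deck can be written as an ordinary self-join, which illustrates the point that a standard query only retrieves what is already stored. The sketch below uses Python's built-in `sqlite3` module. The table layout, the sample rows, and the reading of the condition as "at least $300/month less" are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        monthly_salary INTEGER NOT NULL,
        manager_id INTEGER REFERENCES employee(id)
    );
    INSERT INTO employee VALUES
        (1, 'Ada',   5000, NULL),
        (2, 'Ben',   4800, 1),
        (3, 'Chloe', 4500, 1),
        (4, 'Dev',   4300, 3);
""")

# Self-join: pair each employee (e) with their manager (m) and filter on the gap.
rows = conn.execute("""
    SELECT e.name, e.monthly_salary, m.name, m.monthly_salary
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.id
""").fetchall()

for employee, salary, manager, manager_salary in rows:
    print(employee, salary, manager, manager_salary)
```

Every row returned was already in the table. Nothing about salary structure in general has been inferred, which is the distinction the slide draws.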

  13. DM Touchstone Applications • Finding patterns across data sets: • Reports on changes in retail sales • to improve sales • Patterns of sizes of TV audiences • for marketing • Patterns in NBA play • to alter, and so improve, performance • Deviations in standard phone calling behavior • to detect fraud • for marketing

  14. DM Touchstone Applications • Separating signal from noise: • Classifying faint astronomical objects • Finding genes within DNA sequences • Discovering novel tectonic activity

  15. Components of Data Mining • The model • function of the model • classification • clustering • representational form of the model • linear function of multiple variables • Gaussian probability density function • The preference criterion • goodness of fit • avoiding overfitting • The search algorithm
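
The three components on this slide can be seen side by side in a toy example: a one-parameter linear model, a preference criterion that combines goodness of fit with a small penalty on the parameter, and a search algorithm that simply tries every candidate on a grid. The data, the penalty weight, and the grid are invented for illustration. Real systems would use far richer models and searches.

```python
def predict(w, x):
    """The model: a linear function of one variable."""
    return w * x


def score(w, data, penalty=0.01):
    """The preference criterion: squared error plus a small complexity penalty."""
    sse = sum((y - predict(w, x)) ** 2 for x, y in data)
    return sse + penalty * w * w


def grid_search(data, candidates):
    """The search algorithm: evaluate each candidate parameter and keep the best."""
    return min(candidates, key=lambda w: score(w, data))


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # invented (x, y) pairs
candidates = [i / 10 for i in range(0, 41)]               # 0.0, 0.1, ..., 4.0
best_w = grid_search(data, candidates)
print(best_w)
```

With a single parameter the penalty does little. It is included only to show where an overfitting control would sit in the criterion.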

  16. Model Function • Classification • Regression • Clustering • Summarization • Dependency modeling • Link analysis • Sequence analysis

  17. Model Representation • Decision tree • Linear model • Nonlinear model (e.g. Neural Network) • Example-based method (e.g. Nearest Neighbor) • Probabilistic graphical dependency model (e.g. Bayesian Network) • Relational attribute model

  18. Search Algorithm • Parameter search, given a model • Model search over model space • predictive • descriptive

  19. What’s New Here? • Sounds like statistical modeling or machine learning. • Main difference: scale and availability • Datasets too large for classical analysis • Increased opportunity for access • end user is often not a statistician • New issues in sampling

  20. Statistician’s Viewpoint • What’s new about DM? • Returns statisticians to their empirical roots • exploration rather than modeling • Hypothesis testing may be irrelevant • given the large data sizes everything is significant • Data was collected for some other purpose than what it is being analyzed for now
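
The remark that everything becomes significant at large data sizes can be checked numerically. The sketch below computes the two-sided p-value of a one-sample z-test for a fixed, practically negligible shift in the mean (0.01 standard deviations) at increasing sample sizes. The effect size and sample sizes are invented, and the z-test is just one convenient choice of test.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for an observed mean shift."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  p = {two_sided_p_value(effect=0.01, sd=1.0, n=n):.3g}")
```

The same tiny effect is unremarkable at a hundred records and overwhelmingly "significant" at a million, so the p-value says more about the sample size than about whether the pattern matters.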

  21. The Statistician’s Viewpoint (David Hand 97) • Statistics vs. Machine Learning • [Figure: two contrasting sets of descriptors: conservative, rigorous, abstract, idealized vs. adventurous, engineering, practical, real solutions]

  22. Research Challenges • Massive datasets & high dimensionality • User interaction & prior knowledge • Overfitting & assessing statistical significance • Missing data • Understandability of patterns • Managing changing data and knowledge • Integration • Nonstandard, multimedia, object-oriented data

  23. A Database Perspective on Knowledge Discovery • Concept of data mining as a querying process • First steps toward efficient development of knowledge discovery applications

  24. New Research Frontier • Short term: efficient algorithms implementing machine learning tools on top of large databases • Long term: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces

  25. KDDMS • KDD objects • a rule • a classifier • a clustering • KDD queries • a predicate returning a set of KDD or DB objects

  26. Examples of KDD Query • Generate a classifier • Generate the strongest rule • Generate all rules with consequent attribute values computed by SQL query • Find tuples that belong to the largest cluster
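
To give a feel for what "generate the strongest rule" might return, the sketch below scans market-basket data for single-item rules of the form A → B and reports the one with the highest confidence among those meeting a minimum support. The function name, the support threshold, the use of confidence as the measure of strength, and the baskets are all hypothetical. The talk does not define a concrete query language.

```python
from itertools import permutations


def strongest_rule(transactions, min_support=0.5):
    """Return (antecedent, consequent, confidence) for the most confident
    single-item rule whose support meets `min_support`, or None."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    best = None
    for a, b in permutations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        has_a = sum(1 for t in transactions if a in t)
        if has_a == 0 or both / n < min_support:
            continue  # rule is too rare to report
        confidence = both / has_a
        if best is None or confidence > best[2]:
            best = (a, b, confidence)
    return best


baskets = [  # invented transactions
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "milk"},
]
rule = strongest_rule(baskets)
print(rule)
```

Unlike the SQL retrieval discussed earlier in the deck, the answer here is a KDD object (a rule) rather than a subset of stored tuples.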

  27. Future Directions • KDD applications need development support • query KDD objects • data mining operations • nearest neighbors • clustering • Development of querying tools is a big challenge • Enable developers to build applications using a KDD query language

  28. Text Data Mining • People’s first thought: • Make it easier to find things on the Web. • But this is information retrieval! • The metaphor of extracting ore from rock: • Does make sense for extracting documents of interest from a huge pile. • But does not reflect notions of DM in practice: • finding patterns across large collections • discovering heretofore unknown information

  29. Real Text DM • What would finding a pattern across a large text collection really look like?

  30. Bill Gates + MS-DOS in the Bible! From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

  31. From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

  32. Real Text DM • The point: • Discovering heretofore unknown information is not what we usually do with text. • (If it weren’t known, it could not have been written by someone!) • However: • There is a field whose goal is to learn about patterns in text for its own sake ...

  33. Observation Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

  34. TDM using Metadata (instead of Text) • Data: • Reuters newswire (22,000 articles, late 1980s) • Categories: commodities, time, countries, people, and topic • Goals: • distributions of categories across time (trends) • distributions of categories between collections • category co-occurrence (e.g., topic|country) • Interactive Interface: • lists, pie charts, 2D line plots
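
The category co-occurrence goal (topic|country) amounts to estimating a conditional distribution from metadata tags. Here is a small sketch that does so with counters. The article records are invented and are not taken from the Reuters collection, and normalizing by the number of topic tags (rather than the number of articles) is a simplifying assumption.

```python
from collections import Counter, defaultdict


def conditional_distribution(articles):
    """Estimate the share of each topic tag among articles tagged with a country."""
    counts = defaultdict(Counter)
    for article in articles:
        for country in article["countries"]:
            counts[country].update(article["topics"])
    return {
        country: {t: c / sum(topic_counts.values()) for t, c in topic_counts.items()}
        for country, topic_counts in counts.items()
    }


articles = [  # invented metadata records
    {"countries": ["brazil"], "topics": ["coffee"]},
    {"countries": ["brazil"], "topics": ["coffee", "trade"]},
    {"countries": ["japan"], "topics": ["trade"]},
    {"countries": ["brazil", "japan"], "topics": ["trade"]},
]
dist = conditional_distribution(articles)
print(dist["brazil"])
```

Grouping the same counts by publication date instead of by country would give the trend view listed among the goals.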

  35. Combining Text with Metadata (images, hyperlinks) • Examples • Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford) • Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC) • Images + Text to improve image search

  36. True Text Data Mining: Don Swanson’s Medical Work • Given • medical titles and abstracts • a problem (incurable rare disease) • some medical expertise • find causal links among titles • symptoms • drugs • results

  37. Swanson Example (1991) • Problem: Migraine headaches (M) • stress associated with M • stress leads to loss of magnesium • calcium channel blockers prevent some M • magnesium is a natural calcium channel blocker • spreading cortical depression (SCD) implicated in M • high levels of magnesium inhibit SCD • M patients have high platelet aggregability • magnesium can suppress platelet aggregability • All extracted from medical journal titles
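
The chain of evidence on this slide has a simple shape: some titles connect migraine to an intermediate concept, and other titles connect that concept to magnesium. The sketch below finds such intermediate terms, assuming each title has already been reduced to a set of terms. The term sets are hand-made from the slide's bullets, not real journal titles, and the function is a hypothetical simplification, not Swanson's actual procedure.

```python
def linking_terms(titles, a, c):
    """Terms that co-occur with `a` in some titles and with `c` in others."""
    with_a = set().union(*(t for t in titles if a in t))
    with_c = set().union(*(t for t in titles if c in t))
    return sorted((with_a & with_c) - {a, c})


titles = [  # hand-made term sets paraphrasing the slide's bullets
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "calcium channel blocker"},
    {"magnesium", "calcium channel blocker"},
    {"migraine", "spreading cortical depression"},
    {"magnesium", "spreading cortical depression"},
    {"migraine", "platelet aggregability"},
    {"magnesium", "platelet aggregability"},
    {"migraine", "aura"},  # invented distractor with no link to magnesium
]
links = linking_terms(titles, "migraine", "magnesium")
print(links)
```

Co-occurrence alone only proposes candidates. Judging which links are causally meaningful is where the medical expertise listed among the inputs comes in.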

  38. Swanson’s TDM • Two of his hypotheses have received some experimental verification. • His technique • Only partially automated • Required medical expertise • Few people are working on this.

  39. Conclusions • Currently, what might be construed as Text Data Mining is really Computational Linguistics • Text is tricky to process, but rich and abundant (now) • There are many CL tools available • Data Mining directly from text • tells us about language • produces meta-information that may be useful for information access

  40. Conclusions • Information Access != Text Data Mining • IA = finding needle in haystack • TDM = finding patterns or new information • However, Information Access may potentially be served by Text Data Mining techniques: • automated metadata assignment • collection overviews • The synthesis of ideas from TDM and IA: • Perhaps a new field of exploratory data analysis over text!

  41. Promising Research Directions • Text Data Mining Problems: • Patterns within sets of documents: • What is the latest in this field? • How is this field related to that field? • Chains of evidence embedded in text: • What drugs have been tested for this symptom? • What effects did this funding have on that field? • Human use of information over time • How does information diffuse across the web?

  42. Needed from Systems • Support for linking chains of associations • Support for combined structured and unstructured data • Support for combining disparate collections

  43. Statistical Themes & Lessons for Data Mining • Statistical themes • Statistical lessons • Cooperation between statistical and computational communities

  44. Overview of Statistical Science • Probability distributions • Estimation, consistency, uncertainty, assumptions, robustness, and model averaging • Hypothesis testing • Model scoring • Markov Chain Monte Carlo • Generalized model classes

  45. Overview of Statistical Sciences • Rational decision making and planning • Inference to causes • Prediction

  46. Important Themes of Statistics to Data Mining • Clarity about goals • Use of models that are reliable means to the goal, and understandable and plausible to users • Sense of uncertainties of models and predictions

  47. Lessons • Data can lie • Sometimes it’s not what’s in the data that matters • Perversity of the pervasive P-value • Intervention and prediction
