Text Mining
Data Mining, Volinsky, 2011, Columbia University
What is Text Mining?
• There are many examples of text-based documents (all in electronic format): e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more.
• There is not enough time or patience to read them all.
• Can we extract the most vital kernels of information?
• We wish to find a way to gain knowledge (in summarized form) from all that text, without reading or examining it fully first.
• Some of it (e.g. DNA sequences) is hard to comprehend at all!
What is Text Mining?
• Traditional data mining uses "structured data" (an n x p matrix).
• Free-form text is referred to as "unstructured data"; successful categorization of such data can be a difficult and time-consuming task.
• Often we can combine free-form text and structured data to derive valuable, actionable information (e.g. as in typical surveys); such data is "semi-structured".
Text Mining: Examples
• Text mining is the exercise of gaining knowledge from stores of natural-language text, for example:
• Web pages
• Medical records
• Customer surveys
• Email filtering (spam)
• DNA sequences
• Incident reports
• Drug interaction reports
• News stories (e.g. to predict stock movement)
What is Text Mining?
• Data examples:
• Web pages
• Customer surveys
Amazon.com
Of Mice and Men: Concordance
A concordance is an alphabetized list of the most frequently occurring words in a book, excluding common words such as "of" and "it". The font size of a word is proportional to the number of times it occurs in the book.
Of Mice and Men: Text Stats
Text Mining: Yahoo Buzz
Text Mining: Google News
Text Mining
• Typically falls into one of two categories:
• Analysis of text: I have a bunch of text I am interested in; tell me something about it (e.g. sentiment analysis, "buzz" searches).
• Retrieval: there is a large corpus of text documents, and I want the ones closest to a specified query (e.g. web search, library catalogs, legal and medical precedent studies).
Text Mining: Analysis
• Which words are most frequent?
• Which words are most surprising?
• Which words help define the document?
• Which text phrases are interesting?
Text Mining: Retrieval
• Find the k documents in the corpus that are most similar to my query.
• Can be viewed as "interactive" data mining: the query is not specified a priori.
• Main problems of text retrieval:
• What does "similar" mean?
• How do I know if I have the right documents?
• How can I incorporate user feedback?
Text Retrieval: Challenges
• Calculating similarity is not obvious: what is the distance between two sentences or queries?
• Evaluating retrieval is hard: what is the "right" answer? (There is no ground truth.)
• Users can query things you have not seen before, e.g. misspelled, foreign, or new terms.
• The goal (score function) differs from classification/regression: we are not trying to model all of the data, just to get the best results for a given user.
• Words can hide semantic content:
• Synonymy: a keyword T does not appear anywhere in a document, even though the document is closely related to T (e.g. "data mining").
• Polysemy: the same keyword may mean different things in different contexts (e.g. "mining").
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e. "correct" responses).
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
Precision vs. Recall
• We've been here before!
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Trade-off:
• If the algorithm is "picky": precision high, recall low.
• If the algorithm is "relaxed": precision low, recall high.
• BUT: recall is often hard, if not impossible, to calculate.
(Confusion matrix: predicted outcome vs. actual outcome, with cells TP, FP, FN, TN.)
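A minimal sketch of these two measures in Python; the document IDs and relevance labels below are invented for illustration:

```python
# Minimal sketch: precision and recall for one query.
# The relevance labels below are made up for illustration.
relevant = {"d1", "d3", "d5", "d8"}        # documents truly relevant to the query
retrieved = ["d1", "d2", "d3", "d9"]       # documents the engine returned

tp = sum(1 for d in retrieved if d in relevant)   # true positives
fp = len(retrieved) - tp                          # retrieved but not relevant
fn = len(relevant) - tp                           # relevant but missed

precision = tp / (tp + fp)   # 2/4 = 0.50
recall = tp / (tp + fn)      # 2/4 = 0.50
print(f"precision={precision:.2f} recall={recall:.2f}")
```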
Precision-Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point for precision vs. recall (similar to thresholds in ROC curves).
• Different retrieval algorithms might have very different curves, making it hard to tell which is "best".
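A sketch of how those points are traced out, assuming an invented ranked list of relevance labels (1 = relevant):

```python
# Sketch: precision-recall points as the number of returned documents grows.
# The labels are made up; in practice they come from a labelled test set,
# ordered by the retrieval engine's score.
relevance = [1, 0, 1, 1, 0, 0, 1, 0]
tp = 0
total_relevant = sum(relevance)
for k, rel in enumerate(relevance, start=1):
    tp += rel
    # one (recall, precision) point per cutoff k
    print(f"top-{k}: precision={tp / k:.2f} recall={tp / total_relevant:.2f}")
```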
Term/Document Matrix
• The most common representation in text mining is the term-document matrix.
• Term: typically a single word, but could be a word phrase like "data mining".
• Document: a generic term for a unit of text to be retrieved.
• The matrix can be large: often 50k or more terms, and billions of documents (the Web).
• Entries can be binary (term present/absent) or counts.
Term-Document Matrix
• Example: 10 documents, 6 terms.
• Each document is now just a vector of term counts, sometimes boolean.
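A minimal sketch of building such a matrix, using scikit-learn's CountVectorizer on four invented documents:

```python
# Sketch: building a small term-document matrix.
# The four documents are invented stand-ins for the slide's 10 x 6 example.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "database sql index query",
    "sql query database index updates",
    "regression likelihood linear model",
    "linear regression analysis",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # documents x terms, sparse counts
print(vectorizer.get_feature_names_out())   # the term list
print(X.toarray())                          # the raw count matrix
```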
Term-Document Matrix
• We have lost all semantic content!
• Be careful constructing your term list:
• Not all words are created equal (stop words).
• Words that are essentially the same should be treated the same (stemming).
Stop Words
• Many of the most frequently used words in English are worthless for retrieval and text mining; these are called stop words (the, of, and, to, ...).
• There are typically about 400 to 500 such words.
• For a given application, an additional domain-specific stop word list may be constructed.
• Why remove stop words?
• Reduce indexing (data) file size: stop words account for 20-30% of total word counts.
• Improve efficiency: stop words are not useful for searching or text mining, and they always produce a large number of hits.
A sketch of stop word filtering appears below.
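A minimal sketch, assuming a tiny illustrative stop list (real lists run to several hundred words):

```python
# Sketch: filtering stop words before building the term list.
# This tiny stop list is illustrative only.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the mining of text is the goal".split()))
# ['mining', 'text', 'goal']
```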
Stemming
• Techniques used to find the root/stem of a word. For example:
• user, users, used, using -> stem: use
• engineering, engineered, engineer -> stem: engineer
• Usefulness:
• Improves the effectiveness of retrieval and text mining by matching similar words.
• Reduces indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.
Basic Stemming Methods
• Remove endings:
• If a word ends with a consonant other than s, followed by an s, then delete the s.
• If a word ends in es, drop the s.
• If a word ends in ing, delete the ing unless the remaining word consists of only one letter or is "th".
• If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
• ...
• Transform words:
• If a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
A direct implementation of these rules is sketched below.
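A minimal sketch implementing exactly the heuristics above; real systems use fuller algorithms such as the Porter stemmer, and the output shows how crude rules can over-stem ("using" -> "us"):

```python
# Sketch: the slide's rule-based stemming heuristics, implemented directly.
VOWELS = set("aeiou")

def simple_stem(word):
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"                              # ies -> y
    if w.endswith("es"):
        return w[:-1]                                    # es: drop the s
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS | {"s"}:
        return w[:-1]                                    # consonant + s: delete s
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]                                    # delete ing
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]                                    # consonant + ed: delete ed
    return w

print([simple_stem(w) for w in ["flies", "uses", "runs", "using", "engineered"]])
# ['fly', 'use', 'run', 'us', 'engineer']
```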
Feature Selection
• The performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms, even after stemming and stop word removal.
• Greedy search: start from the full set and delete one term at a time, finding the least important variable (the Gini index can be used for this in a classification problem).
• Often performance does not degrade even with order-of-magnitude reductions.
• Chakrabarti, Chapter 5, patent data: 9,600 patents in communication, electricity and electronics; only 140 out of 20,000 terms were needed for classification!
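The slide describes greedy backward deletion scored by the Gini index; as a simpler stand-in, here is a sketch of filter-style selection with a chi-squared score on invented toy data:

```python
# Sketch: keeping only the most discriminative terms.
# A chi-squared filter is a common, simpler alternative to greedy deletion.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not (toy labels)

X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=3)          # keep the 3 highest-scoring terms
X_small = selector.fit_transform(X, labels)
print(X.shape, "->", X_small.shape)
```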
Distances in TD Matrices
• Given a term-document matrix representation, we can now define distances between documents (or terms!).
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalized).
• We can use Euclidean or cosine distance.
• Cosine distance is the cosine of the angle between the two vectors: d_c(x, y) = (x · y) / (||x|| ||y||).
• Not intuitive, but it has been proven to work well.
• If two documents are the same, d_c = 1; if they have nothing in common, d_c = 0.
• We can calculate cosine and Euclidean distance for this matrix.
• What would you want the distances to look like?
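A small sketch of both distances on two invented count vectors; rows of a real TD matrix work the same way:

```python
# Sketch: cosine and Euclidean distance between two document vectors.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

doc_a = np.array([3, 1, 0, 0, 2, 0])    # term counts for document A
doc_b = np.array([1, 2, 0, 0, 1, 0])    # term counts for document B

cos_sim = 1 - cosine(doc_a, doc_b)      # scipy's cosine() returns 1 - similarity
print("cosine similarity:", round(cos_sim, 3))
print("euclidean distance:", round(euclidean(doc_a, doc_b), 3))
```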
Document Distance
• Pairwise distances between documents.
• Image plots of cosine distance, Euclidean distance, and scaled Euclidean distance (R function: image).
Weighting in TD Space
• Not all terms are of equal importance (e.g. "David" is less discriminative than "Beckham").
• If a term occurs frequently in many documents, it has less discriminatory power.
• One way to correct for this is inverse document frequency (IDF): IDF_j = log(N / N_j), where N_j = number of documents containing term j and N = total number of documents.
• Term importance = term frequency (TF) x IDF.
• A term is "important" if it has a high TF and/or a high IDF; TF x IDF is a common measure of term importance.
TF x IDF
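A hand-rolled sketch of the TF x IDF weight just defined, on an invented three-document corpus; libraries such as scikit-learn's TfidfVectorizer use smoothed variants of the same idea:

```python
# Sketch: computing TF x IDF by hand, matching IDF_j = log(N / N_j).
import math

docs = [["sql", "database", "index"],
        ["database", "query", "sql", "sql"],
        ["regression", "linear", "model"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                          # term frequency in this doc
    n_j = sum(1 for d in docs if term in d)       # documents containing the term
    return tf * math.log(N / n_j)

print(tf_idf("sql", docs[1]))         # frequent term, in 2 of 3 docs: 0.81
print(tf_idf("regression", docs[2]))  # rarer term, in 1 of 3 docs: 1.10
```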
Queries
• A query is a representation of the user's information need, normally a list of words.
• Once we have a TD matrix, a query can be represented as a vector in the same space, e.g. "Database Index" = (1, 0, 1, 0, 0, 0).
• A query can also be a simple question in natural language.
• Calculate the cosine distance between the query and the TF x IDF version of the TD matrix.
• This returns a ranked vector of documents.
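A minimal end-to-end sketch of this retrieval step, with invented documents and query:

```python
# Sketch: ranking documents by cosine similarity between a query vector
# and the TF-IDF version of the TD matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["database index tuning", "sql query optimizer",
        "linear regression model", "database query index"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                 # TF-IDF TD matrix

query_vec = vectorizer.transform(["database index"])
scores = cosine_similarity(query_vec, X).ravel()   # one score per document
ranking = scores.argsort()[::-1]                   # best match first
print([(docs[i], round(scores[i], 3)) for i in ranking])
```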
Latent Semantic Indexing
• Criticism: queries can be posed in many ways but still mean the same thing:
• "data mining" and "knowledge discovery"
• "car" and "automobile"
• "beet" and "beetroot"
• Semantically these are the same, and documents containing either term are relevant.
• Synonym lists or thesauri are solutions, but messy and difficult to maintain.
• Latent Semantic Indexing (LSI) tries to extract hidden semantic structure in the documents.
• Search for what I meant, not what I said!
LSI
• Approximate the T-dimensional term space using principal components calculated from the TD matrix.
• The first k PC directions provide the best set of k orthogonal basis vectors: these explain the most variance in the data.
• The data is reduced to an N x k matrix, without much loss of information.
• Each "direction" is a linear combination of the input terms, and together they define a clustering of "topics" in the data.
• What does this mean for our toy example?
LSI
• Typically done using the Singular Value Decomposition (SVD) to find the principal components: X = U S V^T, where X is the term weighting by document (the 10 x 6 TD matrix), V holds the new orthogonal basis for the data (the PC directions), and S is the diagonal matrix of singular values.
• For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8).
• Fraction of the variance explained by PC1 and PC2: (77.4^2 + 69.5^2) / sum(s_i^2) = 92.5%.
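A sketch of the same decomposition in code, on an invented 4 x 4 count matrix; the slide's 10 x 6 example works identically:

```python
# Sketch: LSI by truncated SVD on a small count matrix.
import numpy as np

X = np.array([[3, 1, 0, 0],      # documents x terms, raw counts
              [2, 2, 0, 1],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_2d = U[:, :k] * s[:k]       # each document as k pseudo-term coordinates
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print("document coordinates:\n", np.round(docs_2d, 2))
print("variance explained by 2 PCs:", round(explained, 3))
```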
LSI
• The top 2 PCs make new pseudo-terms with which to define the documents.
• We can also look at the first two principal components:
• (0.74, 0.49, 0.27, 0.28, 0.18, 0.19) emphasizes the first two terms.
• (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31) separates the two clusters.
• Note how distance from the origin reflects the number of terms, and the angle from the origin shows similarity as well.
LSI
• Here we show the same plot, but with two new documents: one containing the term "SQL" 50 times, the other containing the term "Databases" 50 times.
• Even though they have no terms in common, they are close in LSI space.
Textual Analysis
• Once we have the data in a nice matrix representation (TD, TF x IDF, or LSI), we can throw the data mining toolbox at it:
• Classification of documents (if we have training data for the classes).
• Clustering of documents (unsupervised).
Automatic Document Classification
• Motivation:
• Automatic classification for the tremendous number of online text documents (Web pages, e-mails, etc.).
• Customer comments: requests for information, complaints, inquiries.
• A classification problem:
• Training set: human experts generate a training data set.
• Classification: the computer system discovers the classification rules.
• Application: the discovered rules can be applied to classify new/unknown documents.
• Techniques: linear/logistic regression, naïve Bayes.
• Trees are not so good here due to the massive dimension and few interactions.
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model; also called "multivariate Bernoulli".
• Assumes conditional independence given the class: p(x | c_k) = ∏_j p(x_j | c_k).
• Note that we model each term x_j as a discrete random variable.
• In other words, the probability that a collection of words comes from a given class equals the product of the individual probabilities of those words.
Multinomial Classifier for Text
• Multinomial classification model: assumes that the data are generated by a p-sided die (multinomial model):
p(x | c_k) = (N_x! / ∏_j x_j!) ∏_j p(t_j | c_k)^(x_j)
where N_x = number of terms (total count) in document x, x_j = number of times term j occurs in the document, and c_k = class k.
• Based on the training data, each class has its own multinomial probability distribution across all words.
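A sketch contrasting the two models on invented toy data; scikit-learn's BernoulliNB binarizes the counts (term presence/absence), while MultinomialNB uses the counts themselves:

```python
# Sketch: multivariate Bernoulli vs. multinomial naive Bayes for text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["sql database index", "database query sql",
        "linear regression model", "regression likelihood model"]
labels = ["db", "db", "stats", "stats"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
test = vectorizer.transform(["sql index query"])

# BernoulliNB binarizes counts internally; MultinomialNB uses raw counts.
print(BernoulliNB().fit(X, labels).predict(test))
print(MultinomialNB().fit(X, labels).predict(test))
```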
Naïve Bayes vs. Multinomial
• There are many extensions and adaptations of both; text mining classification models are usually a version of one of these.
• Example: Web pages.
• Classify web pages from CS departments into: student, faculty, course, project.
• Train on ~5,000 hand-labeled web pages from Cornell, Washington, U. Texas, and Wisconsin.
• Crawl and classify a new site (CMU).
NB vs. multinomial
Highest-Probability Terms in the Multinomial Distributions
• Classifying web pages at a university.
Document Clustering
• We can also do clustering, i.e. unsupervised learning of documents: automatically group related documents based on their content, requiring no training sets or predetermined taxonomies.
• Major steps:
• Preprocessing: remove stop words, stem, feature extraction, lexical analysis, ...
• Hierarchical clustering: compute similarities, apply clustering algorithms, ...
• Slicing: fan-out controls; flatten the tree to the desired number of levels.
• As in all clustering problems, success is relative.
Document Clustering
• To cluster, we can use LSI.
• Another model: Latent Dirichlet Allocation (LDA).
• LDA is a generative probabilistic model of a corpus. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
• LDA involves three concepts: words, topics, and documents.
• Documents are collections of words and have a probability distribution over topics.
• Topics have a probability distribution over words.
• It is a fully Bayesian model.
LDA
• Assume the data were generated by the following generative process:
• θ is a document, made up of topics drawn from a probability distribution.
• z is a topic, made up of words drawn from a probability distribution.
• w is a word, the only real observable (N = number of words in all documents).
• The LDA equations are then specified in a fully Bayesian model, where α = the parameter of the per-document topic distributions.
• These can be solved via advanced computational techniques; see Blei et al. (2003).
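A minimal sketch of fitting LDA, using scikit-learn's implementation on an invented corpus; the topic count is an assumption of the sketch:

```python
# Sketch: fitting LDA and reading off per-document topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["sql database index query", "database sql tuning",
        "regression linear model fit", "likelihood regression model"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)      # per-document topic distribution
print(doc_topics.round(2))             # each row sums to 1 across topics
# lda.components_ holds each topic's (unnormalized) distribution over words
```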
LDA Output
• The result can be an often-useful classification of documents into topics, and a distribution of each topic across words.
Another Look at LDA
• Model: topics, made up of words, are used to generate documents.
Another Look at LDA
• Reality: we observe the documents and must infer the topics.