ICS 278: Data Mining
Lecture 14: Text Mining and Information Retrieval
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Lecture Topics in Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  • query-based search of large text archives, e.g., the Web
• Text Classification
  • automated assignment of topics to Web pages, e.g., Yahoo, Google
  • automated classification of email into spam and non-spam
• Text Clustering
  • automated organization of search results in real time into categories
  • discovering clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  • extracting standard fields from free text, e.g.,
    • extracting names and places from reports and newspapers (e.g., military applications)
    • automatically extracting structured information from resumes
    • extracting protein-interaction information from biology papers
General Concepts in Information Retrieval
• Representation language
  • typically a vector of d attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  • typically represented as an N x d matrix
• Query Q
  • the user poses a query to search D
  • the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words it contains
    • the query Q is also expressed as a set of words, e.g., "data" and "mining"
Query by Content
• Traditional DB query: exact matches
  • e.g., query Q = [level = MANAGER] AND [age < 30]
  • or a Boolean match on text:
    • query = "Irvine" AND "fun": return all docs containing both "Irvine" and "fun"
    • not useful when there are many matches, e.g., "data mining" in Google returns 60 million documents
• Query by content: more general / less precise
  • e.g., which record is most similar to a query Q?
  • for text data, often called information retrieval (IR)
  • can also be used for images, sequences, video, etc.
  • Q can itself be an object (e.g., a document) or a shorter version (e.g., a single word)
• Goal
  • match the query Q to the N objects in the database
  • return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
• What representation language to use
• How to measure similarity between Q and each object in D
• How to compute the results in real time (for interactive querying)
• How to rank the results for the user
• How to allow user feedback (query modification)
• How to evaluate and compare different IR algorithms/systems
The Standard Approach
• Fixed-length (d-dimensional) vector representation
  • for query (1 x d vector Q) and database (n x d matrix X) objects
• Use domain-specific higher-level features (rather than raw data)
  • images: "bag of features": color (e.g., RGB), texture (e.g., Gabor or Fourier coefficients), ...
  • text: "bag of words": frequency count for each word in each document, ...
  • also known as the "vector-space" model
• Compute distances between the vectorized representations
  • use k-NN to find the k vectors in X closest to Q
Text Retrieval
• Document: book, paper, WWW page, ...
• Term: word, word pair, phrase, ... (often 50,000+ terms)
• Query Q = set of terms, e.g., "data" + "mining"
• Full NLP (natural language processing) is too hard, so we want a (vector) representation for text that
  • retains maximum useful semantics
  • supports efficient distance computations between documents and Q
• Term weights
  • Boolean (e.g., term is in the document or not): "bag of words"
  • real-valued (e.g., frequency of the term in the doc, possibly relative to all docs), ...
• Note: this representation loses word order, sentence structure, etc.
Practical Issues
• Tokenization
  • convert the document to word counts
  • word token = "any nonempty sequence of characters"
  • for HTML (etc.) need to remove formatting
• Canonical forms, stopwords, stemming
  • remove capitalization
  • stopwords: remove very frequent words (a, the, and, ...); can use a standard list; can also remove very rare words
  • stemming (next slide)
• Data representation (see the sketch below)
  • e.g., 3-column format: <docid termid position>
  • inverted index (faster): list of sorted <termid docid> pairs, useful for finding docs containing certain terms
  • equivalent to a sparse representation of the term x doc matrix
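As a concrete illustration, here is a minimal Python sketch of tokenization plus an inverted index. The example documents, the tiny stopword list, and the function names are all illustrative, not from the lecture:

```python
# A minimal sketch of tokenization plus an inverted index;
# documents, stopwords, and names here are illustrative.
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "and", "of", "in"}  # a standard list would be longer

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    return [t for t in re.split(r"\W+", text.lower()) if t and t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to a sorted list of doc ids containing it:
    a sparse representation of the term x doc matrix."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Data mining finds patterns in data.",
        2: "Text mining applies data mining to text."}
index = build_inverted_index(docs)
print(index["mining"])   # -> [1, 2]: both docs contain the term "mining"
```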
Stemming
• Want to reduce all morphological variants of a word to a single index term
  • e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (fishing does not occur explicitly in the document)
• Stemming: reduce words to their root form
  • e.g., fish becomes the single index term
• Porter stemming algorithm (1980)
  • relies on a preconstructed suffix list with associated rules
  • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
  • BINARIZATION => BINARIZE (sketched below)
• Not always desirable: e.g., {university, universal} -> univers (in Porter's algorithm)
• WordNet: a dictionary-based approach
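A toy Python illustration of the single suffix rule quoted above (IZATION -> IZE, subject to the vowel-followed-by-consonant condition). This is not the full Porter algorithm, just one rule from it:

```python
# One Porter-style suffix rule, not the full algorithm.
def has_vowel_consonant(stem):
    """True if the stem contains a vowel followed by a consonant."""
    vowels = "aeiou"
    return any(stem[i] in vowels and stem[i + 1] not in vowels
               for i in range(len(stem) - 1))

def stem(word):
    word = word.lower()
    # rule: suffix IZATION -> IZE, if the prefix passes the condition
    if word.endswith("ization") and has_vowel_consonant(word[:-7]):
        return word[:-7] + "ize"
    return word

print(stem("BINARIZATION"))  # -> binarize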
Toy example of a document-term matrix
[slide shows the 10-document x 6-term TF matrix used in the examples that follow]
Document Similarity
• Measuring similarity between two documents x and y
• Wide variety of distance metrics:
  • Euclidean (L2): d(x, y) = sqrt( Σ_i (x_i - y_i)^2 )
  • L1: d(x, y) = Σ_i |x_i - y_i|
  • ...
  • weighted L2: d(x, y) = sqrt( Σ_i (w_i x_i - w_i y_i)^2 )
• Cosine distance between docs, based on cos(x, y) = Σ_i x_i y_i / ( ||x|| ||y|| )
  • often gives better results than Euclidean
  • normalizes relative to document length
Distance matrices for toy document-term data
TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23
[slide shows the corresponding Euclidean and cosine distance matrices]
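The distance matrices on the slide can be reproduced from the TF matrix; below is a numpy sketch (assumed code, not from the lecture) computing both Euclidean and cosine distances:

```python
# Euclidean and cosine distances for the toy TF matrix above.
import numpy as np

X = np.array([[24, 21,  9,  0,  0,  3],
              [32, 10,  5,  0,  3,  0],
              [12, 16,  5,  0,  0,  0],
              [ 6,  7,  2,  0,  0,  0],
              [43, 31, 20,  0,  3,  0],
              [ 2,  0,  0, 18,  7, 16],
              [ 0,  0,  1, 32, 12,  0],
              [ 3,  0,  0, 22,  4,  2],
              [ 1,  0,  0, 34, 27, 25],
              [ 6,  0,  0, 17,  4, 23]], dtype=float)

# Euclidean distance matrix: sensitive to document length
diff = X[:, None, :] - X[None, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=2))

# Cosine distance = 1 - cosine similarity: normalizes for length
norms = np.linalg.norm(X, axis=1)
cosine = 1.0 - (X @ X.T) / np.outer(norms, norms)

print(np.round(euclidean[:3, :3], 1))
print(np.round(cosine[:3, :3], 3))
```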
TF-IDF Term Weighting Schemes
• Not all terms in a query or document are equally important
• TF (term frequency): term weight = number of times the term occurs in that document
  • problem: a term common to many docs has low discrimination power
• IDF (inverse document frequency of a term)
  • n_j documents contain term j, out of N documents in total
  • IDF_j = log(N / n_j)
  • favors terms that occur in relatively few documents
• TF-IDF: TF(term) * IDF(term)
  • no real theoretical basis, but works well empirically and is widely used
TF-IDF Example
TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23

IDF weights per term: (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
e.g., TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9)

TF-IDF doc-term matrix:
      t1    t2    t3    t4   t5   t6
d1    2.5  14.6   4.6   0    0    2.1
d2    3.4   6.9   2.6   0    1.1  0
d3    1.3  11.1   2.6   0    0    0
d4    0.6   4.9   1.0   0    0    0
d5    4.5  21.5  10.2   0    1.1  0
...
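A quick numpy check of the IDF numbers quoted above, assuming the natural logarithm (which matches the stated weights); n_j is read off the TF matrix:

```python
# Verifying the quoted IDF weights: IDF_j = log(N / n_j), natural log.
import numpy as np

N = 10
n_j = np.array([9, 5, 6, 5, 7, 5])      # docs containing t1..t6 (counted above)
idf = np.log(N / n_j)
print(np.round(idf, 1))                 # -> [0.1 0.7 0.5 0.7 0.4 0.7]
print(round(24 * np.log(10 / 9), 1))    # TF-IDF(t1 in d1) -> 2.5
```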
Baseline Document Querying System
• Queries Q represented as binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking
Baseline Document Querying System: Example
(TF and TF-IDF matrices as on the previous slides)
Query Q = (1, 0, 1, 0, 0, 0), i.e., terms t1 and t3

Cosine similarity of Q with each document, under TF and TF-IDF weighting:
      TF    TF-IDF
d1    0.70  0.32
d2    0.77  0.51
d3    0.58  0.24
d4    0.60  0.23
d5    0.79  0.43
...
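A numpy sketch of the full baseline pipeline on the toy data: TF-IDF weighting followed by cosine ranking of Q = (1, 0, 1, 0, 0, 0). The code is an assumed reconstruction, but it reproduces the scores in the table above:

```python
# Baseline querying on the toy data: TF-IDF weights + cosine ranking.
import numpy as np

X = np.array([[24, 21,  9,  0,  0,  3],
              [32, 10,  5,  0,  3,  0],
              [12, 16,  5,  0,  0,  0],
              [ 6,  7,  2,  0,  0,  0],
              [43, 31, 20,  0,  3,  0],
              [ 2,  0,  0, 18,  7, 16],
              [ 0,  0,  1, 32, 12,  0],
              [ 3,  0,  0, 22,  4,  2],
              [ 1,  0,  0, 34, 27, 25],
              [ 6,  0,  0, 17,  4, 23]], dtype=float)

idf = np.log(X.shape[0] / (X > 0).sum(axis=0))   # natural log, as before
tfidf = X * idf
q = np.array([1.0, 0, 1, 0, 0, 0])               # binary query: terms t1, t3

def cosine_scores(M, q):
    """Cosine similarity of the query with each row (document) of M."""
    return (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))

for name, M in (("TF", X), ("TF-IDF", tfidf)):
    print(name, np.round(cosine_scores(M, q)[:5], 2))
# TF     -> [0.70 0.77 0.58 0.60 0.79]
# TF-IDF -> [0.32 0.51 0.24 0.23 0.43]
```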
Synonymy and Polysemy
• Synonymy
  • the same concept can be expressed using different sets of terms
  • e.g., bandit, brigand, thief
  • negatively affects recall
• Polysemy
  • identical terms can be used in very different semantic contexts
  • e.g., bank: (1) a repository where important material is saved; (2) the slope beside a body of water
  • negatively affects precision
Latent Semantic Indexing
• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  • principal components analysis, or SVD
  • known as "latent semantic indexing" (LSI) when applied to text
  • captures dependencies among terms
  • in effect, replaces the original d-dimensional basis with a k-dimensional basis
  • e.g., terms like SQL, indexing, query could be approximated as coming from a single "hidden" term
• Why is this useful?
  • the query contains "automobile", the document contains "vehicle"
  • Q can still match the document, since the two terms will be close in k-space (though not in the original space), i.e., this addresses the synonymy problem
SVD
• M = U S V^T
  • M: n x d, the original document-term matrix (the data)
  • U: n x d, each row is the vector of weights for one document
  • S: d x d diagonal matrix of singular values
  • columns of V (rows of V^T): the new orthogonal basis for the data
• Each singular value indicates how much of the variation in the data is captured by the corresponding basis vector
• Typically select just the first k basis vectors, k << d (sketched below)
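A minimal numpy sketch of truncated SVD / LSI on the toy matrix, assuming k = 2; the variable names are illustrative, and numpy may return the basis vectors with flipped signs relative to the slide:

```python
# Truncated SVD / LSI on the toy document-term matrix, k = 2.
import numpy as np

M = np.array([[24, 21,  9,  0,  0,  3],
              [32, 10,  5,  0,  3,  0],
              [12, 16,  5,  0,  0,  0],
              [ 6,  7,  2,  0,  0,  0],
              [43, 31, 20,  0,  3,  0],
              [ 2,  0,  0, 18,  7, 16],
              [ 0,  0,  1, 32, 12,  0],
              [ 3,  0,  0, 22,  4,  2],
              [ 1,  0,  0, 34, 27, 25],
              [ 6,  0,  0, 17,  4, 23]], dtype=float)

# Thin SVD: M = U @ np.diag(s) @ Vt, singular values s in decreasing order
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                # keep the two strongest directions
docs_2d = U[:, :k] * s[:k]           # documents projected into the latent space
M_k = docs_2d @ Vt[:k, :]            # rank-k approximation of M

# Fold a query into the same k-dimensional space:
q = np.array([1.0, 0, 0, 0, 0, 0])   # e.g., term t1 alone
q_2d = q @ Vt[:k, :].T

print(np.round(Vt[:k], 2))           # first two basis vectors (signs may differ)
```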
Example of SVD
[slide shows the SVD of the toy document-term matrix]
First two basis vectors for the toy matrix:
v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]
D1 = pseudo-document with the term "database" occurring 50 times
D2 = pseudo-document with the term "SQL" occurring 50 times
[slide shows D1 and D2 projected into the 2-dimensional latent space]
Probabilistic Approaches to Retrieval
• Compute P(q | d) for each document d
• Intuition: the relevance of d to q is related to how likely it is that q was generated by d, i.e., "how likely is q under a model for d?"
• Simple model for P(q | d)
  • Pe(q | d) = empirical frequency of the query words in document d
  • "tuned" to d, but likely to be sparse (will contain many zeros)
• 2-stage probabilistic model (or linear interpolation model); see the sketch below
  • P(q | d) = λ Pe(q | d) + (1 - λ) Pe(q | corpus)
  • λ can be fixed, e.g., tuned to a particular data set
  • or it can depend on d, e.g., λ = n_d / (n_d + m), where n_d = number of words in doc d and m = a constant (e.g., 1000)
• Can also use more sophisticated models for P(q | d), e.g., topic-based models
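A minimal sketch of the interpolated model above, with a fixed λ and made-up toy documents; only the scoring formula follows the slide, everything else is illustrative:

```python
# Query-likelihood retrieval with linear interpolation smoothing:
# P(q|d) = lam * Pe(q|d) + (1 - lam) * Pe(q|corpus), scored in log space.
import math
from collections import Counter

def log_p_query(query_terms, doc_terms, corpus_terms, lam=0.5):
    """Sum over query words of log(lam * Pe(w|d) + (1-lam) * Pe(w|corpus))."""
    doc_counts = Counter(doc_terms)
    corpus_counts = Counter(corpus_terms)
    n_d, n_c = len(doc_terms), len(corpus_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / n_d         # empirical Pe(w|d); often zero
        p_corpus = corpus_counts[w] / n_c   # corpus estimate smooths the zeros
        score += math.log(lam * p_doc + (1 - lam) * p_corpus)
    return score

docs = [["data", "mining", "finds", "patterns"],
        ["text", "mining", "of", "documents"]]
corpus = [w for d in docs for w in d]
for i, d in enumerate(docs):
    print(f"doc {i}: {log_p_query(['data', 'mining'], d, corpus):.3f}")
```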
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is clear:
  • score = accuracy on unseen test data
• Evaluation is more complex for query by content
  • the real score = how "useful" the retrieved information is (subjective)
  • e.g., how would you define the real score for Google's top 10 hits?
• Toward objectivity, assume:
  • 1) each object is "relevant" or "irrelevant"
    • simplification: binary, and the same for all users (e.g., committee vote)
  • 2) each object is labelled by an objective/consistent oracle
• These assumptions suggest a classifier approach is possible
  • but the goals are rather different: we want the objects nearest to Q, not separability per se
  • and it would require learning a classifier at query time (with Q as the positive class)
  • which is why a k-NN type approach seems so appropriate
Precision versus Recall
• Rank the documents (numerically) with respect to the query
• Compute precision and recall by thresholding the rankings (see the sketch below)
  • precision: fraction of retrieved objects that are relevant
  • recall: fraction of all relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• Very similar to ROC analysis in concept
• For multiple queries, precision at specific ranges of recall can be averaged (so-called "interpolated precision")
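A short sketch of precision and recall computed by thresholding a ranked list at each position k; the relevance labels are hypothetical:

```python
# Precision/recall at each cutoff k of a ranked list.
def precision_recall_at_k(ranked_relevance, total_relevant):
    """ranked_relevance: 0/1 relevance labels in ranked order."""
    out, hits = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        precision = hits / k              # relevant among the k retrieved
        recall = hits / total_relevant    # relevant retrieved / all relevant
        out.append((k, precision, recall))
    return out

ranked = [1, 1, 0, 1, 0, 0, 1, 0]         # hypothetical oracle labels
for k, p, r in precision_recall_at_k(ranked, total_relevant=4):
    print(f"top-{k}: precision={p:.2f} recall={r:.2f}")
```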
Precision-Recall Curve (a form of ROC)
[slide shows precision-recall curves for three systems A, B, and C; C is universally worse than A and B]
Alternative single-number summaries:
• precision at the point where recall = precision
• precision for a fixed number of retrievals
• average precision over multiple recall levels
TREC Evaluations
• Text REtrieval Conference (TREC)
  • Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  • e.g., D = 1 million documents
  • TREC organizers supply contestants with several hundred queries Q
  • each competing system provides its ranked list of documents
  • the union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query Q
  • precision, recall, etc., are then calculated and the systems compared
Other Examples of Evaluation Data Sets
• Cranfield data
  • number of documents = 1400
  • 225 queries: "medium length", manually constructed "test questions"
  • relevance determined by an expert committee (from 1968)
• Newsgroups
  • articles from 20 Usenet newsgroups
  • queries = randomly selected documents
  • relevance: is document d in the same category as the query document?
Performance on Cranfield Document Set
[slide shows retrieval performance results on the Cranfield data]
Performance on Newsgroups Data
[slide shows retrieval performance results on the Newsgroups data]
Related Types of Data
• Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,
  • "transaction data": rows = customers, columns = products
  • Web log data (ignoring sequence): rows = Web surfers, columns = Web pages
• Recommender systems
  • given some products from user i, suggest other products to the user
  • e.g., Amazon.com's book recommender
• Collaborative filtering
  • use the k nearest individuals as the basis for predictions
• Many similarities with querying and information retrieval
  • e.g., use of cosine distance to normalize vectors
Web-based Retrieval
• Additional information in Web documents
  • link structure (e.g., PageRank: to be discussed later)
  • HTML structure
  • link/anchor text
  • title text
  • etc.
  • all can be leveraged for better retrieval
• Additional issues in Web retrieval
  • scalability: the size of the "corpus" is huge (10 to 100 billion docs)
  • constantly changing:
    • crawlers to update document-term information
    • need schemes for efficiently updating the indices
  • evaluation is more difficult: how is relevance measured? how many documents in total are relevant?
Further Reading
• Text: Chapter 14
• General reference on text and language modeling:
  • Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
• Very useful reference on indexing and searching text:
  • Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, I. H. Witten, A. Moffat, and T. C. Bell, Morgan Kaufmann, 1999.
• Web-related document search:
  • an excellent resource is Chapter 3, "Web Search and Information Retrieval", in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
• Information on how real Web search engines work:
  • http://searchenginewatch.com/
• Latent semantic analysis applied to grading of essays:
  • "The debate on automated grading", IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf