ICS 278: Data Mining
Lecture 12: Text Mining

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Data Mining Lectures, Lecture 12: Text Mining, Padhraic Smyth, UC Irvine
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  • query-based search of large text archives, e.g., the Web
• Text Classification
  • automated assignment of topics to Web pages, e.g., Yahoo, Google
  • automated classification of email into spam and non-spam
• Text Clustering
  • automated organization of search results in real time into categories
  • discovering clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  • extracting standard fields from free text
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting structured information automatically from resumes
    • extracting protein-interaction information from biology papers
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
General concepts in Information Retrieval
• Representation language
  • typically a vector of p attribute values, e.g.,
    • color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  • typically represented as an N x p matrix
• Query Q
  • user poses a query to search D
  • query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • query Q is also a set of words, e.g., "data" and "mining"
Query by Content
• Traditional DB query: exact matches
  • e.g., query Q = [level = MANAGER] & [age < 30]
• Query-by-content: more general / less precise
  • e.g., Q = which historical record is most similar to a new one?
  • for text data, often called "information retrieval" (IR)
• Goal
  • match query Q to the N objects in the database
  • return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
• What representation language to use
• How to measure similarity between Q and each object in D
• How to compute the results in real time (for interactive querying)
• How to rank the results for the user
• How to allow user feedback (query modification)
• How to evaluate and compare different IR algorithms/systems
The Standard Approach
• Fixed-length (d-dimensional) vector representation
  • for query (d-by-1 Q) and database (d-by-n X) objects
• Use domain-specific higher-level features (vs. raw data)
  • images: color (e.g., RGB), texture (e.g., Gabor filters, Fourier coefficients), ...
  • text: "bag of words" (frequency count for each word in each document), ...
• Compute distances between the vectorized representations
• Use k-NN to find the k vectors in X closest to Q
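The standard approach above can be sketched in a few lines. This is a minimal pure-Python illustration with a hypothetical toy database `X`; the Euclidean metric and the tiny vectors are assumptions for demonstration only.

```python
import math

def euclidean(x, y):
    # L2 distance between two equal-length feature vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_retrieve(query, docs, k=3):
    # indices of the k database vectors closest to the query, nearest first
    ranked = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
    return ranked[:k]

# toy database X of four 3-dimensional feature vectors (hypothetical data)
X = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
print(knn_retrieve([1, 0, 0], X, k=2))  # [0, 2]: the exact match ranks first
```

A real system would replace the linear scan with an index structure, but the query-to-ranked-list contract is the same.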
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is clear:
  • score = accuracy on unseen test data
• Evaluation is more complex for query by content
  • the real score is how "useful" the retrieved information is (subjective)
  • e.g., how would you define a real score for Google's top 10 hits?
• Toward objectivity, assume:
  • 1) each object is "relevant" or "irrelevant"
    • simplification: binary and the same for all users (e.g., committee vote)
  • 2) each object is labeled by an objective, consistent oracle
• These assumptions suggest a classifier approach is possible
  • but the goals differ: we want the objects nearest to Q, not separability per se
  • and it would require learning a classifier at query time (Q = positive class)
  • which is why a k-NN-type approach seems so appropriate ...
Precision versus Recall
• DQ = Q's ranked retrievals (smallest distance first)
• DQT = those with distance < threshold
  • threshold near 0: few false positives (FP: retrieved but not relevant), many false negatives (FN: relevant but not retrieved)
  • large threshold: few false negatives, many false positives
• Precision = TP / (TP + FP)
  • fraction of retrieved objects that are relevant
• Recall = TP / (TP + FN)
  • fraction of all relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• For multiple queries, precision at specific ranges of recall can be averaged (so-called "interpolated precision").
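The two definitions above are easy to compute once the oracle's relevance judgments are available as sets. A small sketch, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    # retrieved: set of doc ids returned under the distance threshold
    # relevant:  set of doc ids judged relevant by the oracle
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; 2 relevant docs were missed
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6})
print(p, r)  # 0.75 0.6
```

Sweeping the distance threshold and recomputing the pair at each cutoff traces out the precision-recall curve.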
Precision-Recall Curve (a form of ROC curve)
• Alternative single-point summaries:
  • precision where recall = precision, or
  • precision for a fixed number of retrievals, or
  • average precision over multiple recall levels
• In the figure, curve C is universally worse than A and B
TREC evaluations
• Text REtrieval Conference (TREC)
  • Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  • e.g., D = 1 million documents
• TREC organizers supply contestants with several hundred queries Q
• Each competing system provides its ranked list of documents for each query
• The union of the top 100 or so documents from each system is then manually judged as relevant or non-relevant for each query Q
• Precision, recall, etc., are then calculated and the systems compared
Text Retrieval
• Document: book, paper, WWW page, ...
• Term: word, word pair, phrase, ... (often 50,000+ terms)
• Query Q = set of terms, e.g., "data" + "mining"
• Full NLP (natural language processing) is too hard, so ...
• We want a (vector) representation for text that
  • retains as much useful semantics as possible
  • supports efficient distance computation between documents and Q
• Term weights
  • Boolean (e.g., term present in document or not): "bag of words"
  • real-valued (e.g., frequency of term in document, relative to all documents) ...
• Note: this representation loses word order, sentence structure, etc.
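Building a bag-of-words vector over a fixed term vocabulary can be sketched in a few lines; the vocabulary and sentence here are invented for illustration:

```python
from collections import Counter

def bag_of_words(text, vocab):
    # term-frequency vector over a fixed vocabulary;
    # word order and sentence structure are discarded
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

vocab = ["data", "mining", "text"]
print(bag_of_words("Data mining of text data", vocab))  # [2, 1, 1]
```

Words outside the vocabulary (here, "of") simply vanish, which is the lossiness the slide warns about.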
Toy example of a document-term matrix
Distances between Documents
• Measuring the distance between two documents:
• Wide variety of distance metrics:
  • Euclidean (L2): d(x,y) = sqrt( Σ_i (x_i - y_i)^2 )
  • L1: d(x,y) = Σ_i |x_i - y_i|
  • ...
  • weighted L2: d(x,y) = sqrt( Σ_i (w_i x_i - w_i y_i)^2 )
• Cosine distance between documents D_i = (d_i1, ..., d_iT):
  • d_c(D_i, D_j) = Σ_k d_ik d_jk / sqrt( Σ_k d_ik^2 Σ_k d_jk^2 ), sums over k = 1, ..., T
  • can give better results than Euclidean because it normalizes relative to document length
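The cosine formula above, written out directly (the doubled document `d2` is a made-up example to show the length-invariance property):

```python
import math

def cosine_similarity(x, y):
    # dot product normalized by both vector lengths; unlike Euclidean
    # distance, this is insensitive to overall document length
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# doubling every term count leaves the cosine similarity unchanged
d1 = [24, 21, 9, 0, 0, 3]
d2 = [48, 42, 18, 0, 0, 6]
print(cosine_similarity(d1, d2))  # 1.0 (up to floating-point rounding)
```

Under Euclidean distance the same pair would be far apart, even though the two "documents" use words in identical proportions.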
Distance matrices for toy document-term data

TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23

(The figure shows the resulting Euclidean and cosine distance matrices.)
TF-IDF Term Weighting Schemes
• Binary weights favor larger documents, so ...
• TF (term frequency): term weight = number of times the term occurs in that document
  • problem: a term common to many documents has low discriminative power
• IDF (inverse document frequency of a term)
  • n_j documents contain term j, out of N documents in total
  • IDF = log(N / n_j)
  • favors terms that occur in relatively few documents
• TF-IDF: TF(term) * IDF(term)
  • no real theoretical basis, but works well empirically and is widely used
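The TF-IDF scheme can be applied to the toy TF matrix from the earlier slide. This sketch uses the natural logarithm, which is what reproduces the slide's numbers (e.g., 24 * log(10/9) is about 2.5); base 10 or base 2 would work too, differing only by a constant factor.

```python
import math

# TF matrix from the toy example: 10 documents x 6 terms
M = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
]

def tf_idf(tf):
    n = len(tf)                      # N = number of documents
    terms = range(len(tf[0]))
    # n_j = number of documents containing term j
    nj = [sum(1 for row in tf if row[j] > 0) for j in terms]
    idf = [math.log(n / nj[j]) for j in terms]   # natural log, as on the slide
    return [[row[j] * idf[j] for j in terms] for row in tf]

W = tf_idf(M)
print(round(W[0][0], 1))  # TF-IDF of t1 in d1 = 24 * log(10/9), about 2.5
```

Note t1 occurs in 9 of the 10 documents, so its IDF weight (0.1) nearly erases its large raw counts, exactly the discrimination effect TF-IDF is after.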
TF-IDF Example

TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23

IDF weights (natural log): (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)

TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) = 2.5

TF-IDF doc-term matrix:
      t1    t2    t3    t4   t5   t6
d1    2.5  14.6   4.6   0    0    2.1
d2    3.4   6.9   2.6   0    1.1  0
d3    1.3  11.1   2.6   0    0    0
d4    0.6   4.9   1.0   0    0    0
d5    4.5  21.5  10.2   0    1.1  0
...
Typical Document Querying System
• Queries Q = binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking

Query Q = (1, 0, 1, 0, 0, 0)

TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23

Cosine similarity to Q:
      TF    TF-IDF
d1    0.70  0.32
d2    0.77  0.51
d3    0.58  0.24
d4    0.60  0.23
d5    0.79  0.43
...
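Putting the pieces together, a querying system of this kind ranks every document by cosine similarity to the binary query vector. A sketch over the raw TF rows of the toy matrix (the TF-IDF column would use the weighted matrix instead):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def rank_documents(query, docs):
    # document indices sorted by decreasing cosine similarity to the query
    return sorted(range(len(docs)),
                  key=lambda i: cosine(query, docs[i]), reverse=True)

# raw TF rows from the toy matrix; the binary query asks for terms t1 and t3
M = [
    [24, 21,  9,  0,  0,  3], [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0], [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0], [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0], [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25], [ 6,  0,  0, 17,  4, 23],
]
Q = [1, 0, 1, 0, 0, 0]
order = rank_documents(Q, M)
print(order[0])  # 4, i.e. d5: its TF similarity of about 0.79 is the highest
```

The scores reproduce the TF column on the slide (d5 = 0.79, d2 = 0.77, d1 = 0.70, ...).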
Synonymy and Polysemy
• Synonymy
  • the same concept can be expressed using different sets of terms
  • e.g., bandit, brigand, thief
  • negatively affects recall
• Polysemy
  • identical terms can be used in very different semantic contexts
  • e.g., bank
    • a repository where important material is saved
    • the slope beside a body of water
  • negatively affects precision
Latent Semantic Indexing
• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  • principal components analysis, computed via the SVD
  • known as "latent semantic indexing" (LSI) when applied to text
• Captures dependencies among terms
  • in effect, replaces the original d-dimensional basis with a k-dimensional basis
  • e.g., terms like SQL, indexing, and query could be modeled as arising from a single "hidden" term
• Why is this useful?
  • query contains "automobile", document contains "vehicle"
  • Q can still match the document, since the two terms will be close in k-space (though not in the original space), i.e., LSI addresses the synonymy problem
Toy example of a document-term matrix
SVD
• M = U S V^T
  • M = n x d: the original document-term matrix (the data)
  • U = n x d: each row is a vector of weights for one document
  • S = d x d: diagonal matrix of singular values
  • columns of V (rows of V^T): the new orthogonal basis for the data
• Each singular value indicates how much of the data's variance is captured by the corresponding new "basis" vector
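In practice one would call a linear-algebra library for the full SVD, but the idea behind finding the dominant basis vector can be sketched in pure Python: power iteration on M^T M converges to the top right-singular vector. The diagonal test matrix at the end is a made-up sanity check, not data from the lecture.

```python
import math

def matvec(A, x):
    # matrix-vector product for a list-of-rows matrix
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def leading_singular(M, iters=200):
    # power iteration on M^T M converges to the top right-singular
    # vector v1; the corresponding singular value is ||M v1||
    Mt = [list(col) for col in zip(*M)]        # M transposed
    v = [1.0] * len(M[0])
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))           # apply (M^T M)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]              # renormalize each step
    s = math.sqrt(sum(x * x for x in matvec(M, v)))
    return s, v

# sanity check: for a diagonal matrix the singular values sit on the diagonal
s, v = leading_singular([[3.0, 0.0], [0.0, 1.0]])
print(round(s, 6))  # 3.0
```

Repeating this on the residual M - s * (M v1) v1^T yields the next singular pair, which is how a rank-k LSI basis could be built up one vector at a time.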
Example of SVD
First two basis vectors for the example:
v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]
(In the figure: D1 = database x 50, D2 = SQL x 50.)
Another LSI Example
• A collection of documents:
  d1: Indian government goes for open-source software
  d2: Debian 3.0 Woody released
  d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0
  d4: gnuPOD released: iPOD on Linux ... with GPLed software
  d5: Gentoo servers running at open-source mySQL database
  d6: Dolly the sheep not totally identical clone
  d7: DNA news: introduced low-cost human genome DNA chip
  d8: Malaria-parasite genome database on the Web
  d9: UK sets up genome bank to protect rare sheep breeds
  d10: Dolly's DNA damaged
LSI Example (continued)
• The term-document matrix X:

              d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
open-source    1  0  0  0  1  0  0  0  0  0
software       1  0  0  1  0  0  0  0  0  0
Linux          0  0  0  1  0  0  0  0  0  0
released       0  1  1  1  0  0  0  0  0  0
Debian         0  1  1  0  0  0  0  0  0  0
Gentoo         0  0  1  0  1  0  0  0  0  0
database       0  0  0  0  1  0  0  1  0  0
Dolly          0  0  0  0  0  1  0  0  0  1
sheep          0  0  0  0  0  1  0  0  0  0
genome         0  0  0  0  0  0  1  1  1  0
DNA            0  0  0  0  0  0  2  0  0  1
LSI Example
• The reconstructed term-document matrix after projecting onto a subspace of dimension K = 2
• Singular values: S = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)

              d1     d2     d3     d4     d5     d6     d7     d8     d9     d10
open-source   0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
software      0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
Linux         0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
released      0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
Debian        0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
Gentoo        0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
database      0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
Dolly        -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
sheep        -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
genome        0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
DNA          -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
Further Reading
• Text: Chapter 14
• Web-related document search:
  • An excellent resource is Chapter 3, "Web Search and Information Retrieval", in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
• Information on how real Web search engines work:
  • http://searchenginewatch.com/
• Latent semantic analysis:
  • applied to the grading of essays: "The debate on automated grading", IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf
Next up ...
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction