Understand the challenges in representing, storing, and processing text mining documents. Learn about document attributes, sparse matrices, storage strategies, and input format issues. Explore text representation, word extraction, and normalization techniques for different document types.
COMP527: Data Mining
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool, 2009
Text Mining: Challenges, Basics — March 24, 2009
Course outline:
• Introduction to the Course
• Introduction to Data Mining
• Introduction to Text Mining
• General Data Mining Issues
• Data Warehousing
• Classification: Challenges, Basics
• Classification: Rules
• Classification: Trees
• Classification: Trees 2
• Classification: Bayes
• Classification: Neural Networks
• Classification: SVM
• Classification: Evaluation
• Classification: Evaluation 2
• Regression, Prediction
• Input Preprocessing
• Attribute Selection
• Association Rule Mining
• ARM: A Priori and Data Structures
• ARM: Improvements
• ARM: Advanced Techniques
• Clustering: Challenges, Basics
• Clustering: Improvements
• Clustering: Advanced Algorithms
• Hybrid Approaches
• Graph Mining, Web Mining
• Text Mining: Challenges, Basics
• Text Mining: Text-as-Data
• Text Mining: Text-as-Language
• Revision for Exam
Today's Topics
• Text Representation
• What is a word?
• Dimensionality Reduction
• Text Mining vs Data Mining on Text
Representation of Documents
Basic goal: data mining on documents. Each document must be an instance.
First problem: what are the attributes of a document?
Easy attributes: format, length in bytes, and potentially some metadata extractable from headers or file properties (author, date, etc.).
Harder attributes: how do we usefully represent the text itself? The basic idea is to treat each word as an attribute, either boolean (present / not present) or numeric (the number of times the word occurs).
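The word-as-attribute idea can be sketched in a few lines. The vocabulary and document below are invented examples, not data from the lecture:

```python
from collections import Counter

def boolean_vector(doc_tokens, vocabulary):
    """Boolean representation: 1 if the word occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocabulary]

def frequency_vector(doc_tokens, vocabulary):
    """Numeric representation: number of times each word occurs."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocabulary]

# Illustrative vocabulary and document
vocab = ["data", "mining", "text", "liverpool"]
doc = "text mining is data mining on text".split()

print(boolean_vector(doc, vocab))    # [1, 1, 1, 0]
print(frequency_vector(doc, vocab))  # [1, 2, 2, 0]
```

Note that 'is' and 'on' are silently dropped because they are outside the vocabulary; in practice the vocabulary is every word seen in the collection, which is what causes the sparsity problem on the next slide.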
Representation of Documents
Second problem: our instances will have a LOT of false/zero attribute values — a very sparse matrix, requiring a LOT of storage space.
1,000,000 documents × 500,000 distinct words = 500,000,000,000 entries
≈ 60 gigabytes if we store each entry as a single bit
≈ 950 gigabytes if we store each entry as a short integer (2 bytes)
Google's dictionary has 5 million words, times 18 billion web pages... (Process that, WEKA!)
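The storage figures on the slide can be checked directly; the document and word counts are the slide's own illustrative numbers:

```python
# Back-of-envelope check of the dense-matrix storage cost from the slide
docs, words = 1_000_000, 500_000
entries = docs * words            # 5e11 cells in the full document-term matrix

bits_gb = entries / 8 / 2**30     # one bit per cell
shorts_gb = entries * 2 / 2**30   # one 2-byte short integer per cell

print(f"{bits_gb:.0f} GB as bits, {shorts_gb:.0f} GB as shorts")
# → 58 GB as bits, 931 GB as shorts (the slide's ~60 GB and ~950 GB)
```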
Representation of Documents
Store only the true values: (1, 4, 100, 212, 13948)
Or true values with frequency: (1:3, 4:1, 100:1, 212:3, 13948:4)
For compressed storage, we can store the differences between consecutive attribute ids:
• Attributes in order: 1, 2, 4, 5, 7, 10, 15, 18, ... 2348651, ...
• Intervals: 1, 1, 2, 1, 2, 3, 5, 3, ... 6, ...
• With frequency interleaved: 1,4, 1,3, 2,5, 1,3, 2,6, 3,3, ...
The intervals can always be stored in a short integer, and regular compression algorithms run efficiently on this sequence. Reordering attributes by frequency, rather than alphabetically, makes the intervals smaller still.
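The interval (gap) encoding above is simple to implement; the attribute ids below are the slide's own example:

```python
def gap_encode(sorted_ids):
    """Replace each attribute id with its difference from the previous id."""
    prev, gaps = 0, []
    for i in sorted_ids:
        gaps.append(i - prev)
        prev = i
    return gaps

def gap_decode(gaps):
    """Recover the original ids by accumulating the gaps."""
    total, ids = 0, []
    for g in gaps:
        total += g
        ids.append(total)
    return ids

ids = [1, 2, 4, 5, 7, 10, 15, 18]
gaps = gap_encode(ids)
print(gaps)                      # [1, 1, 2, 1, 2, 3, 5, 3]
assert gap_decode(gaps) == ids   # encoding is lossless
```

For the "with frequency" variant, each gap would simply be followed by that attribute's count, giving the interleaved sequence on the slide.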
Input?
That's nice... but WEKA needs ARFF! Problem with many toolkits: they won't accept a sparse input format.
How do classification algorithms cope with this dimensionality?
• Rules: possible, but unlikely
• Trees: less likely than unlikely
• Bayes: fine, especially multinomial Naive Bayes
• Bayesian networks: maybe... but too many possible networks
• SVM: fine
• Neural networks: not fine! Far too many nodes
• Perceptron/Winnow: see NN, but more feasible as there is no hidden layer
• kNN: very slow without supporting data structures, due to the number of comparisons
Input?
Overall problem for text classification: accurate models over such high-dimensional data (eg SVM, multinomial Naive Bayes) are impossible for humans to understand.
Association Rule Mining: fine for the presence of a word, but how do we represent word frequency? Classification Association Rule Mining is a possibly good solution where understandability matters.
Clustering: very high dimensionality is a problem for many algorithms, especially those that make many comparisons (eg partitioning algorithms).
Document Types
First problem: we need to be able to extract the text from the file, and the processing differs greatly between file types, eg XML, HTML, RSS, Word, Open Document, PDF, RTF, LaTeX, ...
We may also want to treat different parts of a document separately, eg title vs authors vs abstract vs body text vs references.
We want to normalise texts into semantic areas across the different formats: an abstract in a PDF is just several lines of text, but in ODF it is an XML element, and in LaTeX it is marked up with the abstract environment or a similar command...
Term Extraction
Requirement: extract words from text. But what is a 'word'?
• Obvious 'words' (eg consecutive non-space characters)
• Numbers (1,000 55.6 $10 1012 64.000 vs 64.0)
• Hyphenation (book case vs book-case vs bookcase), but also ranges: 32-45 or "New York-New Jersey"
• URIs: http://www.liv.ac.uk/ and more complicated forms
• Punctuation (Rob's vs 'Robs' vs Robs' vs Robs)
• Dates as a single token?
• Non-alphanumeric characters: AT&T, Yahoo!
• ...
Term Extraction
Requirement: extract 'words' from text.
• The period character is problematic: end of sentence, end of abbreviation, internal to acronyms (but not always present), internal to numbers (with two different meanings), dotted-quad notation (eg 138.253.81.72)
• Emoticons: :( :) >:( =>
• Extra processing needed for diacritics? eg é ë ç etc.
• We might want to use phrases as attributes ('with respect to', 'data mining', etc.), but determining appropriate phrases is complicated
• Expand abbreviations? Expand acronyms?
• Expand ranges? (1999-2007 means all the years, not just the end points)
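A naive regex tokeniser shows how quickly these edge cases bite. The pattern below is a deliberately simple sketch, not a recommended design:

```python
import re

# Toy tokeniser: alphabetic runs (allowing internal hyphens/apostrophes)
# or digit runs with internal commas/periods. Real tokenisers need far
# more rules for URIs, emoticons, dates, acronyms, ...
TOKEN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|\d+(?:[.,]\d+)*")

def tokenise(text):
    return TOKEN.findall(text)

print(tokenise("Rob's book-case cost $1,000.50 at AT&T."))
# "Rob's", "book-case" and "1,000.50" survive as single tokens,
# but AT&T is wrongly split into 'AT' and 'T', and the $ is lost.
```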
Dimensionality Reduction
Requirement: reduce the number of words (dimensionality reduction). Many words are useless for distinguishing a document, and we don't want to store non-useful words:
• a, an, the, these, those, them, they...
• of, with, to, in, towards, on...
• while, however, because, also, who, when, where...
A long list of words to ignore, called 'stopwords'. BUT... "The Who": a band, or stopwords? Part-of-speech filtering is more accurate, but more expensive.
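Stopword removal is a one-line filter against a fixed list. The set below is a small invented sample; real lists run to hundreds of words:

```python
# Tiny illustrative stopword list, not a real one
STOPWORDS = {"a", "an", "the", "these", "those", "of", "with", "to",
             "in", "on", "while", "however", "because", "who", "when"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the who played in the town".split()))
# → ['played', 'town'] — note that 'The Who' (the band) is lost entirely
```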
Dimensionality Reduction
Requirement: normalise terms, for consistency and dimensionality reduction.
• We normally want to ignore case: 'computer' and 'Computer' should be the same attribute. But acronyms differ: ram vs RAM, us vs US.
• We normally want to use word stems: 'computer' and 'computers' should be the same attribute. The Porter algorithm relies on suffix matching, but note that 'ram' could be a noun or a verb... Also, stems can be meaningless: "datum mine".
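A toy suffix-stripper illustrates the idea of stemming. This is far simpler than the real Porter algorithm (which applies cascaded rules with measure conditions); the suffix list here is invented for illustration:

```python
# Illustrative suffix list only — NOT the Porter rule set
SUFFIXES = ["ing", "ers", "er", "es", "ed", "s"]

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(crude_stem("Computers"), crude_stem("computer"), crude_stem("computing"))
# → comput comput comput  (all three conflate to one attribute)
print(crude_stem("ram"))
# → ram  (too short to strip — but is it the noun or the verb?)
```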
Dimensionality Reduction
We can also use simple statistics to reduce dimensionality:
• If a word appears evenly across all classes, it doesn't distinguish any particular class, is not useful for classification, and can be ignored (eg 'the').
• Equally, a word that appears in only one document is perfectly discriminating, but also probably over-fitting.
• Words that appear in most documents (regardless of class distribution) are also unlikely to be useful.
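Filtering by document frequency captures two of the points above: drop words that appear in too few documents (likely over-fitting) or in too many (likely uninformative). The thresholds and toy corpus below are invented for illustration:

```python
from collections import Counter

def filter_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Keep words occurring in at least min_df documents but in no more
    than max_df_ratio of all documents. Thresholds are illustrative."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))   # count documents, not occurrences
    n = len(docs)
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [d.split() for d in [
    "the cat sat", "the dog ran", "the cat ran", "the bird flew"]]
print(sorted(filter_by_df(docs)))
# → ['cat', 'ran'] — 'the' is in every document, the rest in only one
```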
Text Mining vs Data Mining
Data Mining: discover hidden models that describe the data.
Text Mining: discover hidden facts within bodies of text.
Completely different approaches: DM tries to generalise all of the data into a single model without getting caught up in over-fitting. TM tries to understand the details, and cross-reference between individual instances.
Text Mining vs Data Mining
Text Mining uses Natural Language Processing techniques to 'understand' the data: it tries to understand the semantics of the text (information) rather than treating it as a big bag of character sequences.
Major processes:
• Part of Speech Tagging
• Phrase Chunking
• Deep Parsing
• Named Entity Recognition
• Information Extraction
Text Mining vs Data Mining
Part of Speech Tagging:
• Tag each word with its part of speech (noun, verb, adjective, etc.)
• A classification problem in itself, but essential to understanding the text, especially the verbs.
Phrase Chunking:
• Discover sequences of words that constitute phrases, eg noun phrase, verb phrase, prepositional phrase.
• Also essential, to discover clauses rather than working with individual words.
Text Mining vs Data Mining
Deep Parsing:
• Discover the structure of the clauses and the participants in verbs etc., eg "dog bites man", not "man bites dog".
• Essential as the first step where the semantics are really used.
Named Entity Recognition:
• Discover 'entities' within the text and tag co-referring mentions with the same identifier, eg Magnesium and Mg are the same; Bush, President Bush, G.W. Bush, Dubya, and the President are all the same.
• Essential for correlating entities.
Text Mining vs Data Mining
Information Extraction:
• With all of the previous information, find all of the facts about each entity from all occurrences within all clauses.
• Remove duplicates and find correlations.
• Look for interesting correlations, perhaps according to some set of rules for what counts as interesting.
In practice this is an impossibly large task given any reasonable body of text, and the interestingness of 'new' facts is often very low.
Text Mining vs Data Mining
DM is crucial for TM, eg correct classification of parts of speech. But TM processes are also important for accurate dimensionality reduction when doing DM on texts. For example:
• Every word: an average of 100 attributes per vector, 85.7% accuracy over 10 classes with SVM.
• Same data, with linguistic stems and filtered to nouns, verbs, and adjectives: an average of 64 attributes per vector, 87.2% accuracy.
Further Reading
• Baeza-Yates, Modern Information Retrieval
• Weiss, Chapters 2 and 4
• Berry, Survey of Text Mining, Chapter 5 (he gets around, doesn't he?!)
• Konchady
• Witten, Managing Gigabytes