
Using TF-IDF Anomalies to Cluster Documents on Subject Matter


Presentation Transcript


  1. Natural Language Processing and Computational Linguistics. Using TF-IDF Anomalies to Cluster Documents on Subject Matter: An Analysis Using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies. Whitney St. Charles, Research Alliance in Math and Science, 2007. Mentors: Yu (Cathy) Jiao, Ph.D., and Robert Patton, Ph.D., Computational Sciences and Engineering Division

  2. Purposes of document clustering • Data overabundance • YouTube generates 200 terabytes of data per day • How do we sift through those kinds of quantities? • Searching • Reduces the set tremendously • Document Clustering • Is a knowledge discovery technique • Categorizes results into meaningful groups • Allows the user to browse quickly to the target

  3. Document clustering users • Financial analysts • Identify certain trends to develop forecasts about a particular company • Business Intelligence • Identify products that are associated with or dependent upon one another • Military • Identify terrorist cells from blog activity and movement of materials • You! • Narrow down hundreds of thousands of internet search results to find the kinds of sites you want

  4. Current document clustering technique • A word-by-word comparison of each document is made to determine similarity • Unfortunately, this method… • Does not handle context very well • Compares several hundred to several thousand words for each document • Is very computationally expensive • Requires expensive SIMD machines

  5. Contributions to the field • Identify only those words which are more indicative of the subject matter • If “airline” occurs 20% more often than is “normal,” it has something to do with the subject • Examine both simple and complex noun phrases to address the context of the document • Generate much smaller vectors, containing an average of 82% fewer terms! • Cluster more accurately because only “important” words are chosen

  6. Our method

  7. Establishing the baseline • Train the program to recognize what is “normal” for a given term • Need an entire English language corpus • Corpus: a large, structured set of texts compiled to be representative of a language • Uses hundreds of thousands of words in every allowable way • Using a corpus, the program can • Establish usage statistics • Learn linguistic rules Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm
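
As a rough illustration of this baseline step, here is a minimal sketch that gathers corpus-wide usage statistics from the Brown Corpus with NLTK. The presentation does not say which tools were actually used, so NLTK and the lowercasing here are assumptions.

```python
# A minimal sketch of building baseline usage statistics from the Brown
# Corpus (the corpus the slide shows as an example). NLTK is assumed;
# the presentation does not name the tooling actually used.
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Count corpus-wide occurrences of each (lowercased) word.
baseline = FreqDist(word.lower() for word in brown.words())

# The relative frequency approximates what is "normal" for a term.
print(baseline["airline"] / baseline.N())
```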

  8. Extracting words and phrases

  9. Part-of-speech tagging • Tags every word in the sentence with the correct part of speech • Achieves an accuracy of 97.24% • Is necessary because the token extraction methods each depend on correct tagging • Passes the tagged sentence to the token extractor Example: The/dt desperate/adj summer/n intern/n tried/vbd to/to keep/vb everyone/n awake/adj.
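
A minimal tagging sketch, assuming NLTK's off-the-shelf tagger; the slides report 97.24% accuracy for their tagger but do not name it. Note that NLTK emits Penn Treebank tags (DT, JJ, NN) rather than the slide's abbreviated tag set.

```python
# POS tagging the slide's example sentence with NLTK's default tagger
# (an assumption; the presentation's tagger is not identified).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The desperate summer intern tried to keep everyone awake."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)  # [('The', 'DT'), ('desperate', 'JJ'), ('summer', 'NN'), ...]
```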

  10. Token extractor • Extracts • Words • Simple noun phrases • Complex noun phrases [Diagram: a document flows into the token extractor, which emits words and noun phrases]

  11. Word extraction • Uses POS-tagged data to identify only adjectives, verbs, and nouns • Uses the Porter stemmer to identify unique words • Cuts common suffixes such as -ing, -tion, -e, -es, -s • Example: “recreation” and “recreational” are both identified as “recreat”
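
A quick check of the stemming behavior described above, using NLTK's PorterStemmer:

```python
# Both surface forms collapse to the same Porter stem, so the word
# extractor counts them as one unique term.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("recreation"))    # recreat
print(stemmer.stem("recreational"))  # recreat
```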

  12. Why nouns? • Are named entities • Answer the question “what?” • Are less ambiguous than verbs • Example: the verb in “cook up a good meal” and “cook up a new solution” shifts meaning with context

  13. Simple noun phrase extraction • Accepts only consecutive nouns • Example: summer intern, union representative • Provides a set of short, highly descriptive phrases
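
A minimal sketch of the consecutive-noun rule, assuming Penn Treebank noun tags (NN, NNS, NNP, NNPS) and a two-noun minimum for a phrase; both details are illustrative choices, not taken from the slides.

```python
# Simple noun phrase extraction: accept only maximal runs of consecutive
# nouns from POS-tagged tokens (two or more nouns, an assumed minimum).
def simple_noun_phrases(tagged):
    run, phrases = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
            run.append(word)
        else:
            if len(run) >= 2:
                phrases.append(" ".join(run))
            run = []
    if len(run) >= 2:
        phrases.append(" ".join(run))
    return phrases

tagged = [("The", "DT"), ("summer", "NN"), ("intern", "NN"),
          ("met", "VBD"), ("the", "DT"), ("union", "NN"),
          ("representative", "NN"), (".", ".")]
print(simple_noun_phrases(tagged))  # ['summer intern', 'union representative']
```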

  14. Complex noun phrase extraction techniques • Static rule-based / finite state automata • Rely on the aptitude of the linguist formulating the rule set • Machine learning • Relies on the “completeness” of the training set

  15. Static rule-based extraction • Establishes a list of linguistic rules • A determiner preceding a noun marks the beginning of a noun phrase • A determiner may not precede a noun phrase [Figure: finite state automaton with states S0, S1, and NP; transitions labeled noun/pronoun/determiner, determiner/adjective, adjective, noun/pronoun, and relative clause/prepositional phrase/noun]
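
A toy static rule in the spirit of the automaton above, written with NLTK's RegexpParser; the single pattern here (optional determiner, any adjectives, then nouns) is an illustrative stand-in for a real rule set.

```python
# Static rule-based NP chunking: one hand-written rule, applied as a
# pattern over POS tags. Real systems encode many such rules.
import nltk

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = [("The", "DT"), ("desperate", "JJ"), ("summer", "NN"),
          ("intern", "NN"), ("tried", "VBD"), ("to", "TO"),
          ("keep", "VB"), ("everyone", "NN"), ("awake", "JJ")]
print(chunker.parse(tagged))
# (S (NP The/DT desperate/JJ summer/NN intern/NN) tried/VBD to/TO
#    keep/VB (NP everyone/NN) awake/JJ)
```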

  16. Static extraction shortcomings • Unanticipated rules • The subjective nature of language • Difficulty finding non-recursive, base NPs • [The man [whose red hat [I borrowed yesterday]RC]RC [in the street]PP [that is next to my house]RC]NP lives [next door]NP. • [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP. • Structural ambiguity

  17. Structural ambiguity example • “I saw the man with the telescope.” • The sentence parses two ways: the prepositional phrase can attach to the verb (the telescope was used to see) or to the noun (the man was carrying the telescope)

  18. Machine learning extraction • Is all about training • Uses a corpus • Is based on statistics • The more it sees a particular occurrence, the more likely it is to prefer it • Makes better educated guesses about structural ambiguity • Discovers thousands of unanticipated rules

  19. Transformation-based complex noun phrase extraction An ‘error-driven’ approach for learning an ordered set of rules 1. Generate all rules that correct at least one error. 2. For each rule: (a) Apply to a copy of the most recent state of the training set. (b) Score result 3. Select rule with best score. 4. Update training set by applying selected rule. 5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
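
A toy end-to-end version of this loop for tag correction, to make the control flow concrete. The rule template (change tag A to B when the previous tag is C) and the error-reduction scoring are simplifications, not the presentation's actual templates.

```python
# A toy transformation-based learner in the spirit of this slide (a
# Brill-style loop). Rules: "change tag A to B when the previous tag is C".
def apply_rule(rule, tags):
    a, b, prev = rule
    out = list(tags)
    for i in range(1, len(out)):     # apply left to right
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(tags, gold, threshold=1):
    learned = []
    while True:
        # 1. Generate candidate rules that correct at least one error.
        candidates = {(t, g, tags[i - 1])
                      for i, (t, g) in enumerate(zip(tags, gold))
                      if i > 0 and t != g}
        # 2-3. Score each rule on a copy of the current state; keep the best.
        scored = [(errors(tags, gold) - errors(apply_rule(r, tags), gold), r)
                  for r in candidates]
        if not scored:
            return learned
        best_score, best_rule = max(scored)
        # 5. Stop when the improvement falls below the threshold.
        if best_score < threshold:
            return learned
        # 4. Update the training set by applying the selected rule.
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)

gold = ["dt", "n", "v", "dt", "n"]
tags = ["dt", "v", "v", "dt", "v"]   # noisy initial tagging
print(learn_rules(tags, gold))       # [('v', 'n', 'dt')]
```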

  20. Determining anomaly sets • TF-IDF: Term Frequency – Inverse Document Frequency • Number of local occurrences of term multiplied by uniqueness measure of term in document set • TF-ICF: Term Frequency – Inverse Corpus Frequency • Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
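
The slide gives both weights in words only; here is a sketch using the standard log-based uniqueness measure. The log formulation and the +1 smoothing are assumptions, since the exact formulas are not shown.

```python
import math

def tf_idf(tf, num_docs, doc_freq):
    """Local term count x uniqueness of the term in the document set."""
    return tf * math.log(num_docs / (1 + doc_freq))

def tf_icf(avg_corpus_tf, num_corpus_docs, corpus_doc_freq):
    """Average corpus count x uniqueness of the term in the corpus."""
    return avg_corpus_tf * math.log(num_corpus_docs / (1 + corpus_doc_freq))

# A term is a candidate anomaly when its TF-IDF in a document stands out
# against the TF-ICF baseline learned from the corpus.
print(tf_idf(5, 1000, 20), tf_icf(0.3, 500, 400))
```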

  21. Each document has its own anomaly vector

  22. Clustering the data • Unweighted Pair Group Method with Arithmetic Mean (UPGMA), an average-linkage hierarchical clustering method
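
UPGMA is what SciPy calls average linkage; a minimal sketch with toy anomaly vectors, where the cosine distance is an assumed (not stated) choice:

```python
# UPGMA = average-linkage hierarchical clustering (SciPy method="average").
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

vectors = np.array([[1.0, 0.0, 2.0],   # toy per-document anomaly vectors
                    [0.9, 0.1, 1.8],
                    [0.0, 3.0, 0.1]])
Z = linkage(pdist(vectors, metric="cosine"), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. [1 1 2]
```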

  23. Performance metrics used • Precision = (number of correct responses) / (number of responses) • Recall = (number of correct responses) / (number correct in key) • F-measure = 2RP / (R + P)
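
These three formulas translate directly to code; the counts in the usage line are made-up numbers for illustration.

```python
def precision(correct, responses):
    return correct / responses

def recall(correct, correct_in_key):
    return correct / correct_in_key

def f_measure(r, p):
    # Harmonic mean of recall and precision.
    return 2 * r * p / (r + p)

# Hypothetical counts: 45 correct out of 50 responses, 60 correct in the key.
p, r = precision(45, 50), recall(45, 60)
print(f_measure(r, p))  # 0.818...
```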

  24. RESULTS • [Chart: 80% vs. 89%] • With 82% fewer comparisons!

  25. Future Work • Determine clustering results for both simple and complex noun phrases • Could be applied to other clustering techniques, such as swarming

  26. Acknowledgements • The Research Alliance in Math and Science program • Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy. • Dr. Cathy Jiao • Dr. Robert Patton • Dr. Thomas Potok

  27. QUESTIONS?
