Preserving Semantic Content in Text Mining Using Multigrams
Yasmin H. Said
Department of Computational and Data Sciences, George Mason University
QMDNS 2010, May 26, 2010
This is joint work with Edward J. Wegman.
Outline
• Background on Text Mining
• Bigrams
• Term-Document and Bigram-Document Matrices
• Term-Term and Document-Document Associations
• Example using 15,863 Documents

To read between the lines is easier than to follow the text. -Henry James
Text Data Mining
• Synthesis of:
  • Information Retrieval
    • Focuses on retrieving documents from a fixed database
    • May be multimedia including text, images, video, audio
  • Natural Language Processing
    • Usually more challenging questions
    • Bag-of-words methods
    • Vector space models
  • Statistical Data Mining
    • Pattern recognition, classification, clustering
Natural Language Processing
• Key elements are:
  • Morphology (grammar of word forms)
  • Syntax (grammar of word combinations to form sentences)
  • Semantics (meaning of a word or sentence)
  • Lexicon (vocabulary or set of words)
• Example: "Time flies like an arrow" can be read as
  • Time passes speedily like an arrow passes speedily, or
  • Measure the speed of a fly like you would measure the speed of an arrow
• Ambiguity of nouns and verbs
• Ambiguity of meaning
Text Mining Tasks
• Text Classification
  • Assigning a document to one of several pre-specified classes
• Text Clustering
  • Unsupervised learning
• Text Summarization
  • Extracting a summary for a document
  • Based on syntax and semantics
• Author Identification/Determination
  • Based on stylistics, syntax, and semantics
• Automatic Translation
  • Based on morphology, syntax, semantics, and lexicon
• Cross Corpus Discovery
  • Also known as Literature Based Discovery
Preprocessing
• Denoising
  • Means removing stopper words: words with little semantic meaning such as the, an, and, of, by, that, and so on
  • Stopper words may be context dependent, e.g., Theorem and Proof in a mathematics document
• Stemming
  • Means removing suffixes, prefixes, and infixes to reduce words to their root
  • An example: wake, waking, awake, woke → wake
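A minimal Python sketch of the two steps, for illustration only: the tiny STOPPERS set and the crude suffix rules are hypothetical stand-ins for the full 313-word stopper list and for a real stemmer such as Porter's, not the preprocessing actually used in the study.

```python
# Minimal denoising + stemming sketch (illustrative only).
STOPPERS = {"the", "an", "a", "and", "of", "by", "that"}
SUFFIXES = ("ing", "ed", "es", "s")

def denoise(tokens):
    """Drop stopper words (words with little semantic meaning)."""
    return [t for t in tokens if t.lower() not in STOPPERS]

def stem(token):
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print([stem(t) for t in denoise("the dogs of war are waking".split())])
# -> ['dog', 'war', 'are', 'wak']
```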
Bigrams and Trigrams
• A bigram is a word pair where the order of words is preserved.
  • The first word is the reference word.
  • The second is the neighbor word.
• A trigram is a word triple where order is preserved.
• Bigrams and trigrams are useful because they can capture semantic content.
Example
• Hell hath no fury like a woman scorned.
• Denoised: Hell hath no fury like woman scorned.
• Stemmed: Hell has no fury like woman scorn.
• Bigrams:
  • Hell has, has no, no fury, fury like, like woman, woman scorn, scorn .
• Note that the "." (any sentence-ending punctuation) is treated as a word
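A minimal sketch, assuming the preprocessed tokens from the slide are already available; the sentence-ending "." is kept and treated as a word, as noted above.

```python
def bigrams(tokens):
    """Ordered (reference word, neighbor word) pairs."""
    return list(zip(tokens, tokens[1:]))

def trigrams(tokens):
    """Ordered word triples, order preserved."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
print(bigrams(tokens))
# [('hell', 'has'), ('has', 'no'), ('no', 'fury'), ('fury', 'like'),
#  ('like', 'woman'), ('woman', 'scorn'), ('scorn', '.')]
```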
Bigram Proximity Matrix
• The bigram proximity matrix (BPM) is computed for an entire document
• Entries in the matrix may be either binary or a frequency count
• The BPM is a mathematical representation of a document with some claim to capturing semantics
  • Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
• Martinez (2002)
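One way to build a BPM, sketched under the same assumptions as above (this is not Martinez's 2002 implementation): only nonzero cells are stored, since a full lexicon-by-lexicon array would be almost entirely zero.

```python
from collections import Counter

def bigram_proximity_matrix(tokens, binary=False):
    """BPM stored as a sparse dict keyed by (reference word, neighbor word)."""
    counts = Counter(zip(tokens, tokens[1:]))
    if binary:
        return {pair: 1 for pair in counts}
    return dict(counts)

tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
bpm = bigram_proximity_matrix(tokens)
print(bpm[("no", "fury")])   # 1 (frequency count; pass binary=True for 0/1 entries)
```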
Vector Space Methods
• The classic structure in vector space text mining methods is a term-document matrix where
  • Rows correspond to terms, columns correspond to documents, and
  • Entries may be binary or frequency counts
• A simple and obvious generalization is a bigram-document matrix where
  • Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts
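A sketch of both constructions, assuming documents are already preprocessed token lists; the matrices are kept as a vocabulary plus one Counter per document column rather than as dense arrays, and a sparse-matrix library would be the natural choice at the scale reported below.

```python
from collections import Counter

def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are frequency counts."""
    vocab = sorted({term for doc in docs for term in doc})
    columns = [Counter(doc) for doc in docs]   # one sparse column per document
    return vocab, columns

def bigram_document_matrix(docs):
    """Same construction, but rows are ordered word pairs (bigrams)."""
    bigram_docs = [list(zip(doc, doc[1:])) for doc in docs]
    return term_document_matrix(bigram_docs)

docs = [["hell", "has", "no", "fury"], ["fury", "like", "woman", "scorn"]]
vocab, columns = term_document_matrix(docs)
print(len(vocab), columns[0]["fury"], columns[1]["fury"])   # 7 1 1
```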
Example Data
• The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002)
• The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995
• The full lexicon for the text database included 68,354 distinct words
• In all, 313 stopper words are removed
  • After denoising and stemming, 45,021 words remain in the lexicon
• The example that I report here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk, although he considered a subset of 503 documents.
Vector Space Methods
• A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams
• Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863
• The term vector is 45,021-dimensional and the bigram vector is 1,834,123-dimensional
• The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse
Term-Document Matrix Analysis
• Zipf's Law [figure]
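Zipf's law says a term's frequency is roughly inversely proportional to its frequency rank, so frequency times rank stays roughly constant and a log-log plot of frequency against rank is roughly linear. A small illustrative check in Python, not the analysis behind the slide's figure:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank terms by frequency and report freq * rank, which Zipf's law
    predicts should stay roughly constant across ranks."""
    ranked = Counter(tokens).most_common(top)
    return [(rank, term, freq, freq * rank)
            for rank, (term, freq) in enumerate(ranked, start=1)]

corpus = "the cat sat on the mat and the cat ran after the dog".split()
for rank, term, freq, product in zipf_table(corpus):
    print(rank, term, freq, product)
```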
Text Example - Clusters • A portion of the hierarchical agglomerative tree for the clusters
Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
Text Example - Clusters
Cluster 24, Size: 1788, ISim: 0.012, ESim: 0.007
Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%, percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%, music 0.6%
Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student 1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi 0.8%, music 0.8%
Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai 650, look 630, call 588, live 535, lot 498
Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71, thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn 48, san.francisco 47
Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo 32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28, jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn 25
Bigrams
• Cluster 1 Bigrams [figure]
Cluster Identities
• Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter.
• Cluster 08: Oklahoma City Bombing.
• Cluster 11: Bosnian-Serb Conflict.
• Cluster 12: Court-Law, O.J. Simpson Case.
• Cluster 15: Cessna Plane Crashed onto the South Lawn of the White House.
• Cluster 19: American Army Helicopter Emergency Landing in North Korea.
• Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea's Nuclear Ambitions.
• Cluster 26: Shootings at Abortion Clinics in Boston.
• Cluster 28: Two Americans Detained in Iraq.
• Cluster 30: Earthquake that Hit Japan.
Closing Remarks
• Text mining presents great challenges, but is amenable to statistical/mathematical approaches
• Text mining using vector space methods raises both mathematical and visualization challenges
  • Especially in terms of dimensionality, sparsity, and scalability
Acknowledgments
• Dr. Angel Martinez
• Dr. Jeff Solka and Avory Bryant
• Dr. Walid Sharabati
• Funding Sources
  • National Institute on Alcohol Abuse and Alcoholism (Grant Number F32AA015876)
  • Army Research Office (Contract W911NF-04-1-0447)
  • Army Research Laboratory (Contract W911NF-07-1-0059)
  • Isaac Newton Institute
Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: ysaid99@hotmail.com
Phone: 301-538-7478

The length of this document defends it well against the risk of its being read. -Winston Churchill