
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A


Presentation Transcript


  1. NYU Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman

  2. Bag of Words Models • do we really need elaborate linguistic analysis? • look at text mining applications • document retrieval • opinion mining • association mining • see how far we can get with document-level bag-of-words models • and introduce some of our mathematical approaches NYU

  3. document retrieval • opinion mining • association mining NYU

  4. Information Retrieval • Task: given query = list of keywords, identify and rank relevant documents from collection • Basic idea: find documents whose set of words most closely matches words in query NYU

  5. Topic Vector • Suppose the document collection has n distinct words, w1, …, wn • Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document Example • D1 = [The cat chased the mouse.] • D2 = [The dog chased the cat.] • W = [The, chased, dog, cat, mouse] (n = 5) • V1 = [ 2 , 1 , 0 , 1 , 1 ] • V2 = [ 2 , 1 , 1 , 1 , 0 ] NYU
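A minimal sketch of the topic-vector construction from slide 5, using only the Python standard library; the tokenizer and the vocabulary ordering here are illustrative assumptions, not part of the lecture:

```python
from collections import Counter

def tokenize(text):
    # crude tokenizer for the example sentences: lowercase, drop the final period
    return text.lower().rstrip(".").split()

def topic_vector(doc, vocab):
    # i-th component = frequency of vocab[i] in the document
    counts = Counter(tokenize(doc))
    return [counts[w] for w in vocab]

W = ["the", "chased", "dog", "cat", "mouse"]          # n = 5
V1 = topic_vector("The cat chased the mouse.", W)     # [2, 1, 0, 1, 1]
V2 = topic_vector("The dog chased the cat.", W)       # [2, 1, 1, 1, 0]
```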

  6. Weighting the components • Unusual words like “elephant” determine the topic much more than common words such as “the” or “have” • can ignore words on a stop list or • weight each term frequency tfi by its inverse document frequency idfi = log ( N / ni ) • where N = size of collection and ni = number of documents containing term i NYU
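The idf weighting on slide 6 can be sketched as below; this assumes the common idfi = log(N / ni) form with natural log, and the function names are illustrative:

```python
import math

def idf_weights(doc_vectors):
    # doc_vectors: list of term-frequency vectors over the same vocabulary
    N = len(doc_vectors)
    n_terms = len(doc_vectors[0])
    idf = []
    for i in range(n_terms):
        n_i = sum(1 for v in doc_vectors if v[i] > 0)   # docs containing term i
        idf.append(math.log(N / n_i) if n_i else 0.0)
    return idf

def tfidf(tf_vector, idf):
    # weight each term frequency tf_i by its idf_i
    return [tf * w for tf, w in zip(tf_vector, idf)]
```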

  7. Cosine similarity metric • Define a similarity metric between topic vectors • A common choice is cosine similarity (normalized dot product): sim ( A, B ) = ( A · B ) / ( |A| |B| ) = Σi ai bi / ( √(Σi ai²) √(Σi bi²) ) NYU

  8. Cosine similarity metric • the cosine similarity metric is the cosine of the angle between the term vectors [figure: two term vectors w1 and w2 with the angle between them] NYU
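A small sketch of the cosine similarity metric from slides 7–8, written against the plain Python lists built above:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two term vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# cosine_similarity([2, 1, 0, 1, 1], [2, 1, 1, 1, 0]) == 6 / 7 ≈ 0.857
```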

  9. Verdict: a success • For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years

  10. document retrieval • opinion mining • association mining NYU

  11. Opinion Mining • Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic • classification task • valuable for producers/marketers of all sorts of products • Simple strategy: bag-of-words • make lists of positive and negative words • see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type) • problem: hard to make such lists • lists will differ for different topics NYU

  12. Training a Model • Instead, label the documents in a corpus and then train a classifier from the corpus • labeling is easier than thinking up words • in effect, learn positive and negative words from the corpus • probabilistic classifier • identify the most likely class t* = argmax t є {pos, neg} P ( t | W ) NYU

  13. Using Bayes’ Rule • P ( t | W ) = P ( W | t ) P ( t ) / P ( W ) ∝ P ( t ) Πi P ( wi | t ) • The last step is based on the naïve assumption of independence of the word probabilities NYU

  14. Training • We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators • P ( t ) = count ( docs labeled t ) / N • P ( wi | t ) = count ( docs labeled t containing wi ) / count ( docs labeled t ) NYU
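A sketch of the maximum-likelihood training step on slides 12–14. Counts are over documents (a word is either present or absent in a document, as on slide 14); the data format and function names are assumptions:

```python
def train_mle(labeled_docs):
    # labeled_docs: list of (set_of_words, label) pairs, label in {'pos', 'neg'}
    # assumes every label occurs at least once in the corpus
    N = len(labeled_docs)
    labels = {"pos", "neg"}
    vocab = set().union(*(words for words, _ in labeled_docs))
    prior, cond = {}, {}
    for t in labels:
        docs_t = [words for words, lbl in labeled_docs if lbl == t]
        prior[t] = len(docs_t) / N                                    # P(t)
        cond[t] = {w: sum(1 for d in docs_t if w in d) / len(docs_t)  # P(w | t)
                   for w in vocab}
    return prior, cond, vocab
```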

  15. A Problem • Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews • P ( positive | GR ) = ? NYU

  16. P ( positive | GR ) = 0, because P ( “mathematical” | positive ) = 0 • The maximum likelihood estimate is poor when there is very little data • We need to ‘smooth’ the probabilities to avoid this problem NYU

  17. Laplace Smoothing • A simple remedy is to add 1 to each count • to keep them as probabilities we increase the denominator N by the number of outcomes (values of t) (2 for ‘positive’ and ‘negative’) • for the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent) NYU
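The same estimator with add-one (Laplace) smoothing from slide 17, plus the argmax classification from slide 12; working in log probabilities is an implementation detail not mentioned on the slides, added here to avoid numeric underflow:

```python
import math

def train_smoothed(labeled_docs):
    # labeled_docs: list of (set_of_words, label) pairs, label in {'pos', 'neg'}
    N = len(labeled_docs)
    labels = {"pos", "neg"}
    vocab = set().union(*(words for words, _ in labeled_docs))
    prior, cond = {}, {}
    for t in labels:
        docs_t = [words for words, lbl in labeled_docs if lbl == t]
        prior[t] = (len(docs_t) + 1) / (N + len(labels))             # add 1; 2 outcomes of t
        cond[t] = {w: (sum(1 for d in docs_t if w in d) + 1) / (len(docs_t) + 2)
                   for w in vocab}                                    # +2: w present or absent
    return prior, cond, vocab

def classify(words, prior, cond, vocab):
    # words: set of words in the document to classify
    # t* = argmax_t P(t) * prod_i P(w_i | t), computed in log space
    scores = {}
    for t in prior:
        score = math.log(prior[t])
        for w in words & vocab:          # words never seen in training are ignored
            score += math.log(cond[t][w])
        scores[t] = score
    return max(scores, key=scores.get)
```

With smoothing, a glowing review containing one previously negative-only word such as “mathematical” no longer forces P ( positive | GR ) to zero.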

  18. An NLTK Demo Sentiment Analysis with Python NLTK Text Classification • http://text-processing.com/demo/sentiment/ NLTK Code (simplified classifier) • http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier NYU

  19. Is “low” a positive or a negative term? NYU

  20. Ambiguous terms “low” can be positive “low price” or negative “low quality” NYU

  21. Verdict: mixed • A simple bag-of-words strategy with an NB model works quite well for simple reviews referring to a single item, but fails • for ambiguous terms • for comparative reviews • to reveal aspects of an opinion • the car looked great and handled well, but the wheels kept falling off NYU

  22. document retrieval • opinion mining • association mining NYU

  23. Association Mining • Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B • e.g., “people who bought A also bought B” • For text: documents with term A also have term B • widely used in scientific and medical literature NYU

  24. Bag-of-words • Simplest approach • look for words A and B for which frequency ( A and B in same document ) >> frequency of A × frequency of B • doesn’t work well • want to find names (of companies, products, genes), not individual words • interested in specific types of terms • want to learn from a few examples • need contexts to avoid noise NYU
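A sketch of the simple co-occurrence test described on slide 24; the lift-style threshold is an assumption, and, as the slide notes, this naive approach has real limitations:

```python
from itertools import combinations

def associated_pairs(docs, threshold=3.0):
    # docs: list of sets of terms; flag pairs (A, B) whose co-occurrence rate
    # greatly exceeds frequency(A) * frequency(B), i.e. what independence predicts
    N = len(docs)
    counts = {}
    for d in docs:
        for t in d:
            counts[t] = counts.get(t, 0) + 1
    freq = {t: c / N for t, c in counts.items()}
    pairs = []
    for a, b in combinations(sorted(freq), 2):
        both = sum(1 for d in docs if a in d and b in d) / N
        if both / (freq[a] * freq[b]) >= threshold:
            pairs.append((a, b))
    return pairs
```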

  25. Needed Ingredients • Effective text association mining needs • name recognition • term classification • preferably: ability to learn patterns • preferably: a good GUI • demo: www.coremine.com NYU

  26. Conclusion • Some tasks can be handled effectively (and very simply) by bag-of-words models, • but most benefit from an analysis of language structure NYU
