
Lecture 12 - Classifiers Part 2: Maximum Entropy Markov Models and Evaluation

This lecture covers topics such as classifiers, maxent classifiers, maximum entropy Markov models, information extraction and chunking, and evaluation of classifiers.





Presentation Transcript


  1. Lecture 12 Classifiers Part 2 CSCE 771 Natural Language Processing • Topics • Classifiers • Maxent Classifiers • Maximum Entropy Markov Models • Information Extraction and chunking intro • Readings: Chapters 6, 7.1 • February 25, 2013

  2. Overview • Last Time • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Today • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Readings • NLTK Ch 6

  3. Evaluation of classifiers again • Last time • Recall • Precision • F value • Accuracy

  4. Reuters Data set • 21,578 documents • 118 categories • a document can belong to multiple categories • so classification is handled with 118 binary classifiers, one per category (see the sketch below)
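
A quick sketch of what "118 binary classifiers" means in practice: each category gets its own yes/no classifier, and a document's label set is whatever subset of classifiers fire. The binary_classifiers mapping and its classify interface below are illustrative, not a specific NLTK API.

    # A minimal sketch of one-vs-rest multi-label classification.
    # binary_classifiers: dict mapping category name -> trained binary
    # classifier exposing classify(featureset) -> True/False (hypothetical).
    def categories_for(featureset, binary_classifiers):
        return [cat for (cat, clf) in binary_classifiers.items()
                if clf.classify(featureset)]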

  5. Confusion matrix • Cij – documents that are really Ci but are classified as Cj • Cii – documents that are really Ci and are correctly classified
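
Per-class precision and recall fall straight out of these counts: recall divides the diagonal entry by its row sum, precision by its column sum:

    Recall(Ci)    = Cii / Σj Cij    (row sum: all documents that are really Ci)
    Precision(Ci) = Cii / Σj Cji    (column sum: all documents labeled Ci)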

  6. Micro averaging vs Macro averaging • Macro averaging – average the performance of the individual classifiers (an average of averages, so every category counts equally) • Micro averaging – sum the correct decisions, false positives, and false negatives across all categories first, then compute the metrics from the pooled counts (so frequent categories dominate); both are sketched below
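
A minimal sketch of the two schemes, assuming the per-class true-positive, false-positive, and false-negative counts have already been collected (the stats structure is illustrative):

    # stats: list of (tp, fp, fn) tuples, one per binary classifier
    def macro_precision(stats):
        # average the per-class precisions: every category counts equally
        return sum(float(tp) / (tp + fp) for (tp, fp, fn) in stats) / len(stats)

    def micro_precision(stats):
        # pool the raw counts first: frequent categories dominate
        tp = sum(s[0] for s in stats)
        fp = sum(s[1] for s in stats)
        return float(tp) / (tp + fp)

Micro and macro recall are analogous, with fn in place of fp.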

  7. Training, Development and Test Sets

  8. Code_consecutive_pos_tagger.py revisited to trace history development

    def pos_features(sentence, i, history):    # [_consec-pos-tag-features]
        if debug == 1: print "pos_features*********************************"
        if debug == 1: print " sentence=", sentence
        if debug == 1: print " i=", i
        if debug == 1: print " history=", history
        features = {"suffix(1)": sentence[i][-1:],
                    "suffix(2)": sentence[i][-2:],
                    "suffix(3)": sentence[i][-3:]}
        if i == 0:
            features["prev-word"] = "<START>"
            features["prev-tag"] = "<START>"
        else:
            features["prev-word"] = sentence[i-1]
            features["prev-tag"] = history[i-1]
        if debug == 1: print "pos_features features=", features
        return features

  9. Trace of one sentence - SIGINT to interrupt

    sentence= ['Rookie', 'southpaw', 'George', 'Stepanovich', 'relieved', 'Hyde', 'at', 'the', 'start', 'of', 'the', 'ninth', 'and', 'gave', 'up', 'the', "A's", 'fifth', 'tally', 'on', 'a', 'walk', 'to', 'second', 'baseman', 'Dick', 'Howser', ',', 'a', 'wild', 'pitch', ',', 'and', 'Frank', "Cipriani's", 'single', 'under', 'Shortstop', 'Jerry', "Adair's", 'glove', 'into', 'center', '.']
    i= 0
    history= []
    pos_features features= {'suffix(3)': 'kie', 'prev-word': '<START>', 'suffix(2)': 'ie', 'prev-tag': '<START>', 'suffix(1)': 'e'}

  10. Trace continued

    pos_features *************************************
    sentence= ['Rookie', … '.']
    i= 1
    history= ['NN']
    pos_features features= {'suffix(3)': 'paw', 'prev-word': 'Rookie', 'suffix(2)': 'aw', 'prev-tag': 'NN', 'suffix(1)': 'w'}

    pos_features *************************************
    sentence= ['Rookie', 'southpaw', … '.']
    i= 2
    history= ['NN', 'NN']
    pos_features features= {'suffix(3)': 'rge', 'prev-word': 'southpaw', 'suffix(2)': 'ge', 'prev-tag': 'NN', 'suffix(1)': 'e'}

  11. nltk.tag • Classes: AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger • Functions: batch_pos_tag, pos_tag, untag

  12. Module nltk.tag.hmm • Source Code for Module nltk.tag.hmm

    import nltk
    nltk.tag.hmm.demo()
    nltk.tag.hmm.demo_pos()
    nltk.tag.hmm.demo_pos_bw()

  13. HMM demo

    import nltk
    nltk.tag.hmm.demo()
    nltk.tag.hmm.demo_pos()
    nltk.tag.hmm.demo_pos_bw()

  14. Common Suffixes

    from nltk.corpus import brown

    suffix_fdist = nltk.FreqDist()
    for word in brown.words():
        word = word.lower()
        suffix_fdist.inc(word[-1:])
        suffix_fdist.inc(word[-2:])
        suffix_fdist.inc(word[-3:])
    common_suffixes = suffix_fdist.keys()[:100]   # NLTK 2: keys() come back sorted by frequency
    print common_suffixes
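
Following the NLTK book, the common suffixes can then be turned into boolean features for a word-level classifier; a sketch (the name suffix_features is mine):

    # One boolean 'endswith' feature per common suffix
    def suffix_features(word):
        features = {}
        for suffix in common_suffixes:
            features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
        return features

The book feeds featuresets like these to nltk.DecisionTreeClassifier to build a suffix-based POS tagger.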

  15. rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]

    extractor = nltk.RTEFeatureExtractor(rtepair)
    print extractor.text_words
    set(['Russia', 'Organisation', 'Shanghai', …
    print extractor.hyp_words
    set(['member', 'SCO', 'China'])
    print extractor.overlap('word')
    set([])
    print extractor.overlap('ne')
    set(['SCO', 'China'])
    print extractor.hyp_extra('word')
    set(['member'])

  16. tagged_sents = list(brown.tagged_sents(categories='news'))

    random.shuffle(tagged_sents)
    size = int(len(tagged_sents) * 0.1)
    train_set, test_set = tagged_sents[size:], tagged_sents[:size]

    # alternative: split by file, so train and test come from different documents
    file_ids = brown.fileids(categories='news')
    size = int(len(file_ids) * 0.1)
    train_set = brown.tagged_sents(file_ids[size:])
    test_set = brown.tagged_sents(file_ids[:size])

    # alternative: train and test on different genres
    train_set = brown.tagged_sents(categories='news')
    test_set = brown.tagged_sents(categories='fiction')

    classifier = nltk.NaiveBayesClassifier.train(train_set)   # raises ValueError - next slide

  17. Traceback (most recent call last):

    File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module>
      classifier = nltk.NaiveBayesClassifier.train(train_set)
    File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
      for featureset, label in labeled_featuresets:
    ValueError: too many values to unpack
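
The error is just what the traceback says: NaiveBayesClassifier.train expects a list of (featureset, label) pairs, but train_set holds tagged sentences, i.e. lists of (word, tag) tuples, so the unpacking inside naivebayes.py fails. A hedged sketch of one repair, flattening the sentences and reusing the illustrative suffix_features extractor from slide 14's follow-up:

    # Convert tagged sentences into (featureset, label) pairs
    train_pairs = [(suffix_features(word), tag)
                   for sent in train_set
                   for (word, tag) in sent]
    classifier = nltk.NaiveBayesClassifier.train(train_pairs)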

  18. from nltk.corpus import brown

    brown_tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(brown_tagged_sents) * 0.9)
    train_sents = brown_tagged_sents[:size]
    test_sents = brown_tagged_sents[size:]
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
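
The trained backoff chain can then be scored on the held-out 10% with the tagger's built-in evaluate method:

    print t2.evaluate(test_sents)   # about 0.845 in the NLTK book's run of this experiment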

  19. def tag_list(tagged_sents):
        return [tag for sent in tagged_sents for (word, tag) in sent]

    def apply_tagger(tagger, corpus):
        return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

    gold = tag_list(brown.tagged_sents(categories='editorial'))
    test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
    cm = nltk.ConfusionMatrix(gold, test)
    print cm.pp(sort_by_count=True, show_percents=True, truncate=9)

  20. Confusion matrix output

        |                                         N                      |
        |      N      I      A      J             N             V      N |
        |      N      N      T      J      .      S      ,      B      P |
    ----+----------------------------------------------------------------+
     NN | <11.8%>  0.0%      .   0.2%      .   0.0%      .   0.3%   0.0% |
     IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
     AT |      .      .  <8.6%>     .      .      .      .      .      . |
     JJ |   1.7%      .      .  <3.9%>     .      .      .   0.0%   0.0% |
      . |      .      .      .      .  <4.8%>     .      .      .      . |
    NNS |   1.5%      .      .      .      .  <3.2%>     .      .   0.0% |
      , |      .      .      .      .      .      .  <4.4%>     .      . |
     VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
     NP |   1.0%      .      .   0.0%      .      .      .      .  <1.8%>|
    ----+----------------------------------------------------------------+
    (row = reference; col = test)

  21. Entropy

    import math

    def entropy(labels):
        freqdist = nltk.FreqDist(labels)
        probs = [freqdist.freq(l) for l in freqdist]
        return -sum([p * math.log(p, 2) for p in probs])

  22. Entropy examples

    print entropy(['male', 'male', 'male', 'male'])          # -0.0
    print entropy(['male', 'female', 'male', 'male'])        # 0.811278124459
    print entropy(['female', 'male', 'female', 'male'])      # 1.0
    print entropy(['female', 'female', 'male', 'female'])    # 0.811278124459
    print entropy(['female', 'female', 'female', 'female'])  # -0.0
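
These numbers follow directly from the definition: the 3-vs-1 splits give H = -(0.75·log2 0.75 + 0.25·log2 0.25) ≈ 0.811, the 2-vs-2 split gives -(2 · 0.5·log2 0.5) = 1.0, and a single-label set gives 0 bits, printed as -0.0 because negating a floating-point 0.0 yields negative zero.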

  23. The Rest of NLTK Chapter 06 • 6.5 Naïve Bayes Classifiers • 6.6 Maximum Entropy Classifiers • nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False) • 6.7 Modeling Linguistic Patterns • 6.8 Summary • But no more Code?!?

  24. Maximum Entropy Models (again) • Features are elements of evidence that connect observations d with categories c • f: C × D → ℝ • Example feature: f(c, d) = [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)] • An "input-feature" is a property of an unlabeled token • A "joint-feature" is a property of a labeled token (a sketch follows below)
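
A minimal sketch of the example joint-feature as an indicator function (the argument names w and w_prev are mine):

    # Fires (returns 1) only when a capitalized word preceded by 'in'
    # is being considered for the LOCATION label
    def f_location(c, w, w_prev):
        if c == 'LOCATION' and w_prev.lower() == 'in' and w[:1].isupper():
            return 1
        return 0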

  25. Feature-Based Linear Classifiers • The maxent model assigns

    p(c | d, λ) = exp(Σi λi fi(c, d)) / Σc′ exp(Σi λi fi(c′, d))
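
The same model in executable form, assuming a helper feats(c, d) that returns the feature values fi(c, d) for a candidate class (feats, classes, and lambdas are illustrative names, not NLTK API):

    import math

    def maxent_prob(c, d, lambdas, classes, feats):
        # unnormalized score: exp of the weighted feature sum
        def score(ci):
            return math.exp(sum(l * f for (l, f) in zip(lambdas, feats(ci, d))))
        # normalize over all candidate classes
        return score(c) / sum(score(ci) for ci in classes)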

  26. Maxent Model revisited

  27. Maximum Entropy Markov Models (MEMM) • Apply a maxent classifier repeatedly, left to right, over a sequence, feeding each decision back in as history for the next one (a decoding sketch follows below)
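
A minimal sketch of left-to-right MEMM-style decoding, reusing the pos_features(sentence, i, history) extractor from slide 8 and assuming a trained nltk.MaxentClassifier:

    def memm_tag(classifier, sentence):
        history = []
        for i in range(len(sentence)):
            featureset = pos_features(sentence, i, history)
            history.append(classifier.classify(featureset))   # greedy choice
        return zip(sentence, history)

This greedy pass commits to each tag immediately; full MEMM decoders typically run Viterbi search over the per-position distributions instead.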

  28. Named Entity Recognition (NER) • entity –
    1 a: being, existence; especially: independent, separate, or self-contained existence
      b: the existence of a thing as contrasted with its attributes
    2: something that has separate and distinct existence and objective or conceptual reality
    3: an organization (as a business or governmental unit) that has an identity separate from those of its members
  • A named entity is one of those with a name
  • http://nlp.stanford.edu/software/CRF-NER.shtml

  29. Classes of Named Entities • Person (PERS) • Location (LOC) • Organization (ORG) • Date (DATE) • Example: "Jim bought 300 shares of Acme Corp. in 2006." produces an annotated block of text, such as this one: • <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>. • http://nlp.stanford.edu/software/CRF-NER.shtml

  30. IOB tagging • B – begins a chunk, e.g., B-NP (or B-LOC for named entities) • I – inside (continues) a chunk • O – outside any chunk • Example (a grouping sketch follows below):

    text = '''
    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
    of IN B-PP
    vice NN B-NP
    chairman NN I-NP
    , , O
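
To make the convention concrete, a hand-rolled sketch (not an NLTK API) that groups (word, IOB-tag) pairs into chunks:

    def iob_to_chunks(pairs):
        chunks, current = [], []
        for (word, tag) in pairs:
            if tag.startswith('B'):        # B- opens a new chunk
                if current: chunks.append(current)
                current = [word]
            elif tag.startswith('I'):      # I- continues the open chunk
                current.append(word)
            else:                          # O closes any open chunk
                if current: chunks.append(current)
                current = []
        if current:
            chunks.append(current)
        return chunks

On the example above it yields [['he'], ['accepted'], ['the', 'position'], ['of'], ['vice', 'chairman']].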


  32. Chunking - partial parsing

  33. NLTK ch07.py

    def ie_preprocess(document):
        sentences = nltk.sent_tokenize(document)
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        sentences = [nltk.pos_tag(sent) for sent in sentences]
        return sentences

    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),   # [_chunkex-sent]
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]
    grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar]
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(sentence)
    print result

  34. (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      barked/VBD
      at/IN
      (NP the/DT cat/NN))

  35. chunkex-draw

    grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar]
    cp = nltk.RegexpParser(grammar)    # [_chunkex-cp]
    result = cp.parse(sentence)        # [_chunkex-test]
    print result                       # [_chunkex-print]
    result.draw()

  36. Chunk two consecutive nouns

    nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
    grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
    cp = nltk.RegexpParser(grammar)
    print cp.parse(nouns)
    (S (NP money/NN market/NN) fund/NN)

  Note that only money/market form a chunk: once the rule has matched those two nouns, matching resumes after the chunk, leaving fund with no following noun to pair with.

  37. cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

    brown = nltk.corpus.brown
    for sent in brown.tagged_sents():
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.node == 'CHUNK': print subtree

    (CHUNK combined/VBN to/TO achieve/VB)
    …
    (CHUNK serve/VB to/TO protect/VB)
    (CHUNK wanted/VBD to/TO wait/VB)
    …

  38. nltk.chunk.accuracy example

    from nltk.corpus import conll2000
    test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    print nltk.chunk.accuracy(cp, test_sents)
    0.41745994892

  39. First attempt ?!?

    from nltk.corpus import conll2000
    cp = nltk.RegexpParser("")
    test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    print cp.evaluate(test_sents)
    ChunkParse score:
        IOB Accuracy: 43.4%
        Precision: 0.0%
        Recall: 0.0%
        F-Measure: 0.0%

  The empty grammar creates no chunks, so every word gets an O tag; O tags are common enough in the gold standard that IOB accuracy reaches 43.4% even though precision and recall over actual NP chunks are zero.

  40. from nltk.corpus import conll2000

    test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    print nltk.chunk.accuracy(cp, test_sents)
    0.41745994892

    text = '''
    Carlyle NNP B-NP
    Group NNP I-NP
    , , O
    a DT B-NP
    merchant NN I-NP
    banking NN I-NP
    concern NN I-NP
    . . O
    '''
    nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

    from nltk.corpus import conll2000
    print conll2000.chunked_sents('train.txt')[99]

  41. Chunking using conll2000

    text = '''
    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
    of IN B-PP
    vice NN B-NP
    chairman NN I-NP
    …
    . . O
    '''
    nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

    from nltk.corpus import conll2000
    print conll2000.chunked_sents('train.txt')[99]

  42. (S
      (PP Over/IN)
      (NP a/DT cup/NN)
      (PP of/IN)
      (NP coffee/NN)
      ,/,
      (NP Mr./NNP Stone/NNP)
      (VP told/VBD)
      (NP his/PRP$ story/NN)
      ./.)

  43. A Real Attempt

    grammar = r"NP: {<[CDJNP].*>+}"
    cp = nltk.RegexpParser(grammar)
    print cp.evaluate(test_sents)
    ChunkParse score:
        IOB Accuracy: 87.7%
        Precision: 70.6%
        Recall: 67.8%
        F-Measure: 69.2%

  44. Information extraction • A step towards understanding • Find the named entities • Figure out what is being said about them – in practice, just the relations among the named entities • http://en.wikipedia.org/wiki/Information_extraction

  45. Outline of natural language processing • 1 What is NLP? • 2 Prerequisite technologies • 3 Subfields of NLP • 4 Related fields • 5 Processes of NLP: Applications, Components • 6 History of NLP • 6.1 Timeline of NLP software • 7 General NLP concepts • 8 NLP software • 8.1 Chatterbots • 8.2 NLP toolkits • 8.3 Translation software • 9 NLP organizations • 10 NLP publications: Books, Journals • 11 Persons • 12 See also • 13 References • 14 External links • http://en.wikipedia.org/wiki/Outline_of_natural_language_processing

  46. Persons influential in NLP • Alan Turing – originator of the Turing Test • Noam Chomsky – author of the seminal work Syntactic Structures, which revolutionized Linguistics with 'universal grammar', a rule-based system of syntactic structures [15] • Daniel Bobrow – • Joseph Weizenbaum – author of the ELIZA chatterbot • Roger Schank – introduced the conceptual dependency theory for natural language understanding [16] • Terry Winograd – • Kenneth Colby – • Rollo Carpenter – • David Ferrucci – principal investigator of the team that created Watson, IBM's AI computer that won the quiz show Jeopardy! • William Aaron Woods
