Lecture 12 Classifiers Part 2 CSCE 771 Natural Language Processing • Topics • Classifiers • Maxent Classifiers • Maximum Entropy Markov Models • Information Extraction and chunking intro • Readings: Chapters 6, 7.1 February 25, 2013
Overview • Last Time • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Today • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Readings • NLTK Ch 6
Evaluation of classifiers again • Last time • Recall • Precision • F value • Accuracy
Reuters Data set • 21578 documents • 118 categories • document can be in multiple classes • 118 binary classifiers
Confusion matrix • Cij – documents that are really in class Ci but are classified as Cj • Cii – documents that are really in class Ci and are correctly classified
Micro averaging vs Macro Averaging • Macro averaging – average the performance of the individual classifiers (an average of averages) • Micro averaging – sum all the correct classifications, false positives, and false negatives across the classifiers, then compute the metrics from the pooled counts
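A minimal sketch of the difference, assuming two binary classifiers with made-up (tp, fp, fn) counts; the class names and numbers are purely illustrative:

    # Hypothetical per-class counts (tp, fp, fn) -- illustrative only
    counts = {'acq':   (90, 10, 30),
              'wheat': ( 5,  5,  5)}

    # Macro-averaged precision: average the per-class precisions
    macro_p = sum(float(tp) / (tp + fp) for (tp, fp, fn) in counts.values()) / len(counts)

    # Micro-averaged precision: pool the counts first, then compute one precision
    tp_sum = sum(tp for (tp, fp, fn) in counts.values())
    fp_sum = sum(fp for (tp, fp, fn) in counts.values())
    micro_p = float(tp_sum) / (tp_sum + fp_sum)

    print "macro precision =", macro_p   # 0.70  -- each class weighted equally
    print "micro precision =", micro_p   # ~0.86 -- dominated by the frequent class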
Code consecutive_pos_tagger.py revisited to trace history development • def pos_features(sentence, i, history): # [_consec-pos-tag-features] • if debug == 1 : print "pos_features*********************************" • if debug == 1 : print " sentence=", sentence • if debug == 1 : print " i=", i • if debug == 1 : print " history=", history • features = {"suffix(1)": sentence[i][-1:], • "suffix(2)": sentence[i][-2:], • "suffix(3)": sentence[i][-3:]} • if i == 0: • features["prev-word"] = "<START>" • features["prev-tag"] = "<START>" • else: • features["prev-word"] = sentence[i-1] • features["prev-tag"] = history[i-1] • if debug == 1 : print "pos_features features=", features • return features
Trace of one sentence - SIGINT to interrupt • sentence= ['Rookie', 'southpaw', 'George', 'Stepanovich', 'relieved', 'Hyde', 'at', 'the', 'start', 'of', 'the', 'ninth', 'and', 'gave', 'up', 'the', "A's", 'fifth', 'tally', 'on', 'a', 'walk', 'to', 'second', 'baseman', 'Dick', 'Howser', ',', 'a', 'wild', 'pitch', ',', 'and', 'Frank', "Cipriani's", 'single', 'under', 'Shortstop', 'Jerry', "Adair's", 'glove', 'into', 'center', '.'] i= 0 • history= [ ] • pos_features features= {'suffix(3)': 'kie', • 'prev-word': '<START>', • 'suffix(2)': 'ie', • 'prev-tag': '<START>', • 'suffix(1)': 'e'}
Trace continued • pos_features ************************************* • sentence= ['Rookie', …'.'] • i= 1 • history= ['NN'] • pos_features features= {'suffix(3)': 'paw', 'prev-word': 'Rookie', 'suffix(2)': 'aw', 'prev-tag': 'NN', 'suffix(1)': 'w'} • pos_features ************************************* • sentence= ['Rookie', 'southpaw', … '.'] • i= 2 • history= ['NN', 'NN'] • pos_features features= {'suffix(3)': 'rge', 'prev-word': 'southpaw', 'suffix(2)': 'ge', 'prev-tag': 'NN', 'suffix(1)': 'e'}
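For reference, a sketch of the classifier-based tagger that drives this trace, using the pos_features function above and modeled on the ConsecutivePosTagger class from NLTK book chapter 6 (a sketch, not the exact lecture code):

    class ConsecutivePosTagger(nltk.TaggerI):
        def __init__(self, train_sents):
            train_set = []
            for tagged_sent in train_sents:
                untagged_sent = nltk.tag.untag(tagged_sent)
                history = []
                for i, (word, tag) in enumerate(tagged_sent):
                    featureset = pos_features(untagged_sent, i, history)
                    train_set.append((featureset, tag))
                    history.append(tag)          # history grows tag by tag, as in the trace
            self.classifier = nltk.NaiveBayesClassifier.train(train_set)

        def tag(self, sentence):
            history = []
            for i, word in enumerate(sentence):
                tag = self.classifier.classify(pos_features(sentence, i, history))
                history.append(tag)
            return zip(sentence, history)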
nltk.tag • Classes • AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger • Functions • batch_pos_tag, pos_tag, untag
Module nltk.tag.hmm • Source Code for Module nltk.tag.hmm • import nltk • nltk.tag.hmm.demo() • nltk.tag.hmm.demo_pos() • nltk.tag.hmm.demo_pos_bw()
Common Suffixes • from nltk.corpus import brown • suffix_fdist = nltk.FreqDist() • for word in brown.words(): • word = word.lower() • suffix_fdist.inc(word[-1:]) • suffix_fdist.inc(word[-2:]) • suffix_fdist.inc(word[-3:]) • common_suffixes = suffix_fdist.keys()[:100] • print common_suffixes
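These common suffixes can feed a simple word-level feature extractor and classifier, along the lines of NLTK book section 6.1 (a sketch using the common_suffixes and brown objects above; the function is renamed to avoid clashing with pos_features, and training on the full corpus can be slow):

    import nltk

    def suffix_features(word):
        features = {}
        for suffix in common_suffixes:
            features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
        return features

    tagged_words = brown.tagged_words(categories='news')
    featuresets = [(suffix_features(w), t) for (w, t) in tagged_words]
    size = int(len(featuresets) * 0.1)
    train_set, test_set = featuresets[size:], featuresets[:size]
    classifier = nltk.DecisionTreeClassifier.train(train_set)
    print nltk.classify.accuracy(classifier, test_set)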
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33] • extractor = nltk.RTEFeatureExtractor(rtepair) • print extractor.text_words • set(['Russia', 'Organisation', 'Shanghai', … • print extractor.hyp_words • set(['member', 'SCO', 'China']) • print extractor.overlap('word') • set([ ]) • print extractor.overlap('ne') • set(['SCO', 'China']) • print extractor.hyp_extra('word') • set(['member'])
tagged_sents = list(brown.tagged_sents(categories='news')) • random.shuffle(tagged_sents) • size = int(len(tagged_sents) * 0.1) • train_set, test_set = tagged_sents[size:], tagged_sents[:size] • file_ids = brown.fileids(categories='news') • size = int(len(file_ids) * 0.1) • train_set = brown.tagged_sents(file_ids[size:]) • test_set = brown.tagged_sents(file_ids[:size]) • train_set = brown.tagged_sents(categories='news') • test_set = brown.tagged_sents(categories='fiction') • classifier = nltk.NaiveBayesClassifier.train(train_set)
Traceback (most recent call last): • File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module> • classifier = nltk.NaiveBayesClassifier.train(train_set) • File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train • for featureset, label in labeled_featuresets: • ValueError: too many values to unpack
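The ValueError arises because train_set holds tagged sentences (lists of (word, tag) pairs), while NaiveBayesClassifier.train expects a list of (featureset, label) pairs. One hedged way to repair it, using simple suffix features (the helper name is illustrative):

    def word_features(word):
        return {"suffix(1)": word[-1:],
                "suffix(2)": word[-2:],
                "suffix(3)": word[-3:]}

    # flatten the tagged sentences into (featureset, tag) pairs
    train_feats = [(word_features(w), t) for sent in train_set for (w, t) in sent]
    test_feats  = [(word_features(w), t) for sent in test_set for (w, t) in sent]

    classifier = nltk.NaiveBayesClassifier.train(train_feats)
    print nltk.classify.accuracy(classifier, test_feats)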
from nltk.corpus import brown • brown_tagged_sents = brown.tagged_sents(categories='news') • size = int(len(brown_tagged_sents) * 0.9) • train_sents = brown_tagged_sents[:size] • test_sents = brown_tagged_sents[size:] • t0 = nltk.DefaultTagger('NN') • t1 = nltk.UnigramTagger(train_sents, backoff=t0) • t2 = nltk.BigramTagger(train_sents, backoff=t1)
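The backoff chain can then be scored with the standard TaggerI interface (the figure is roughly what the NLTK book reports for this setup):

    print t2.evaluate(test_sents)   # about 0.84 on the held-out news sentences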
def tag_list(tagged_sents): • return [tag for sent in tagged_sents for (word, tag) in sent] • def apply_tagger(tagger, corpus): • return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] • gold = tag_list(brown.tagged_sents(categories='editorial')) • test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial'))) • cm = nltk.ConfusionMatrix(gold, test) • print cm.pp(sort_by_count=True, show_percents=True, truncate=9)
Columns (test tags), left to right: NN IN AT JJ . NNS , VB NP • ----+----------------------------------------------------------------+ • NN | <11.8%> 0.0% . 0.2% . 0.0% . 0.3% 0.0% | • IN | 0.0% <9.0%> . . . 0.0% . . . | • AT | . . <8.6%> . . . . . . | • JJ | 1.7% . . <3.9%> . . . 0.0% 0.0% | • . | . . . . <4.8%> . . . . | • NNS | 1.5% . . . . <3.2%> . . 0.0% | • , | . . . . . . <4.4%> . . | • VB | 0.9% . . 0.0% . . . <2.4%> . | • NP | 1.0% . . 0.0% . . . . <1.8%>| • ----+----------------------------------------------------------------+ • (row = reference; col = test)
Entropy • import math • def entropy(labels): • freqdist = nltk.FreqDist(labels) • probs = [freqdist.freq(l) for l in freqdist] • return -sum([p * math.log(p,2) for p in probs])
print entropy(['male', 'male', 'male', 'male']) • -0.0 • print entropy(['male', 'female', 'male', 'male']) • 0.811278124459 • print entropy(['female', 'male', 'female', 'male']) • 1.0 • print entropy(['female', 'female', 'male', 'female']) • 0.811278124459 • print entropy(['female', 'female', 'female', 'female']) • -0.0
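The 0.811 value is just -(3/4·log2(3/4) + 1/4·log2(1/4)); a quick check:

    import math
    print -(0.75 * math.log(0.75, 2) + 0.25 * math.log(0.25, 2))   # 0.811278124459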
The Rest of NLTK Chapter 06 • 6.5 Naïve Bayes Classifiers • 6.6 Maximum Entropy Classifiers • nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False) • 6.7 Modeling Linguistic Patterns • 6.8 Summary • But no more Code?!?
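For 6.6, a minimal sketch of training a maxent classifier in NLTK, reusing the gender-name task from earlier in chapter 6 (the feature function, algorithm choice, and iteration count are illustrative, and training is much slower than naive Bayes):

    import random
    import nltk
    from nltk.classify import MaxentClassifier
    from nltk.corpus import names

    def gender_features(word):
        return {'last_letter': word[-1]}

    labeled_names = ([(n, 'male') for n in names.words('male.txt')] +
                     [(n, 'female') for n in names.words('female.txt')])
    random.shuffle(labeled_names)
    featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
    train_set, test_set = featuresets[500:], featuresets[:500]

    me_classifier = MaxentClassifier.train(train_set, algorithm='IIS', max_iter=10)
    print nltk.classify.accuracy(me_classifier, test_set)
    me_classifier.show_most_informative_features(5)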
Maximum Entropy Models (again) • features are elements of evidence that connect observations d with categories c • f: C × D → R • Example feature • f(c,d) = { c = LOCATION & w-1 = IN & isCapitalized(w) } • An "input-feature" is a property of an unlabeled token. • A "joint-feature" is a property of a labeled token.
Feature-Based Linear Classifiers • P(c | d, λ) = exp( Σi λi fi(c,d) ) / Σc' exp( Σi λi fi(c',d) ), i.e., exponentiate the weighted feature votes for each class and normalize over all classes
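A toy numeric version of that formula; the weights λi and feature values fi(c,d) below are made up purely for illustration:

    import math

    # hypothetical weighted feature sums  sum_i lambda_i * f_i(c, d)  for two classes
    scores = {'LOCATION': 1.8 * 1 + 0.3 * 1,
              'OTHER':    0.6 * 1}

    Z = sum(math.exp(s) for s in scores.values())   # normalizing constant
    for c in scores:
        print c, math.exp(scores[c]) / Z            # P(c | d, lambda)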
Maximum Entropy Markov Models (MEMM) • apply a maxent classifier repeatedly, once per position in the sequence, using the previously assigned labels as features for the next decision
Named Entity Recognition (NER) • entities – • a : being, existence; especially : independent, separate, or self-contained existence b : the existence of a thing as contrasted with its attributes • : something that has separate and distinct existence and objective or conceptual reality • : an organization (as a business or governmental unit) that has an identity separate from those of its members • a named entity is one of these that has a name • http://nlp.stanford.edu/software/CRF-NER.shtml
Classes of Named Entities • Person (PERS) • Location (LOC) • Organization (ORG) • DATE • Example: Jim bought 300 shares of Acme Corp. in 2006. And producing an annotated block of text, such as this one: • <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>. http://nlp.stanford.edu/software/CRF-NER.shtml
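NLTK's built-in named-entity chunker can produce this kind of markup from POS-tagged text (a sketch; the exact labels depend on the NLTK version and its bundled model):

    import nltk
    sent = nltk.pos_tag(nltk.word_tokenize("Jim bought 300 shares of Acme Corp. in 2006."))
    print nltk.ne_chunk(sent)
    # typically marks Jim as PERSON and Acme Corp. as ORGANIZATION inside an (S ...) tree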
IOB tagging • B – beginning of a chunk, e.g., B-LOC • I – inside a chunk • O – outside any chunk • Example • text = ''' • he PRP B-NP • accepted VBD B-VP • the DT B-NP • position NN I-NP • of IN B-PP • vice NN B-NP • chairman NN I-NP • , , O
NLTK ch07.py • def ie_preprocess(document): • sentences = nltk.sent_tokenize(document) • sentences = [nltk.word_tokenize(sent) for sent in sentences] • sentences = [nltk.pos_tag(sent) for sent in sentences] • sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), # [_chunkex-sent] ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")] • grammar = "NP: {<DT>?<JJ>*<NN>}" # [_chunkex-grammar] • cp = nltk.RegexpParser(grammar) • result = cp.parse(sentence) • print result
(S • (NP the/DT little/JJ yellow/JJ dog/NN) • barked/VBD • at/IN • (NP the/DT cat/NN))
chunkex-draw • grammar = "NP: {<DT>?<JJ>*<NN>}" # [_chunkex-grammar] • cp = nltk.RegexpParser(grammar) # [_chunkex-cp] • result = cp.parse(sentence) # [_chunkex-test] • print result # [_chunkex-print] • result.draw()
Chunk two consecutive nouns • nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")] • grammar = "NP: {<NN><NN>} # Chunk two consecutive nouns" • cp = nltk.RegexpParser(grammar) • print cp.parse(nouns) • (S (NP money/NN market/NN) fund/NN)
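To chunk a whole run of nouns instead of just the first pair, the tag pattern can use + (a small variation on the grammar above):

    grammar = "NP: {<NN>+}   # Chunk one or more consecutive nouns"
    cp = nltk.RegexpParser(grammar)
    print cp.parse(nouns)
    # (S (NP money/NN market/NN fund/NN))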
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}') • brown = nltk.corpus.brown • for sent in brown.tagged_sents(): • tree = cp.parse(sent) • for subtree in tree.subtrees(): • if subtree.node == 'CHUNK': print subtree • (CHUNK combined/VBN to/TO achieve/VB) • … • (CHUNK serve/VB to/TO protect/VB) • (CHUNK wanted/VBD to/TO wait/VB) • …
nltk.chunk.accuracy example • from nltk.corpus import conll2000 • test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • print nltk.chunk.accuracy(cp, test_sents) • 0.41745994892
First attempt ?!? • from nltk.corpus import conll2000 • cp = nltk.RegexpParser("") • test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • print cp.evaluate(test_sents) • ChunkParse score: • IOB Accuracy: 43.4% • Precision: 0.0% • Recall: 0.0% • F-Measure: 0.0% • The empty grammar proposes no chunks at all, so precision and recall are zero; the 43.4% IOB accuracy just reflects the fraction of tokens that are genuinely outside any NP chunk (tag O)
Chunking using conll2000 • text = ''' • he PRP B-NP • accepted VBD B-VP • the DT B-NP • position NN I-NP • of IN B-PP • vice NN B-NP • chairman NN I-NP • … • Carlyle NNP B-NP • Group NNP I-NP • , , O • a DT B-NP • merchant NN I-NP • banking NN I-NP • concern NN I-NP • . . O • ''' • nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw() • from nltk.corpus import conll2000 • print conll2000.chunked_sents('train.txt')[99]
(S • (PP Over/IN) • (NP a/DT cup/NN) • (PP of/IN) • (NP coffee/NN) • ,/, • (NP Mr./NNP Stone/NNP) • (VP told/VBD) • (NP his/PRP$ story/NN) • ./.)
A Real Attempt • grammar = r"NP: {<[CDJNP].*>+}" • cp = nltk.RegexpParser(grammar) • print cp.evaluate(test_sents) • ChunkParse score: • IOB Accuracy: 87.7% • Precision: 70.6% • Recall: 67.8% • F-Measure: 69.2% • The tag pattern <[CDJNP].*>+ chunks any run of tokens whose tags begin with C, D, J, N, or P (e.g., CD, DT, JJ, NN, NNP, PRP$)
Information extraction • A step towards understanding • Find the named entities • Figure out what is being said about them; in practice, just the relations between the named entities http://en.wikipedia.org/wiki/Information_extraction
Outline of natural language processing • 1 What is NLP? • 2 Prerequisite technologies • 3 Subfields of NLP • 4 Related fields • 5 Processes of NLP: Applications, Components • 6 History of NLP • 6.1 Timeline of NLP software • 7 General NLP concepts • 8 NLP software • 8.1 Chatterbots • 8.2 NLP toolkits • 8.3 Translation software • 9 NLP organizations • 10 NLP publications: Books, Journals • 11 Persons • 12 See also • 13 References • 14 External links http://en.wikipedia.org/wiki/Outline_of_natural_language_processing
Persons influential in NLP • Alan Turing – originator of the Turing Test. • Noam Chomsky – author of the seminal work Syntactic Structures, which revolutionized Linguistics with 'universal grammar', a rule-based system of syntactic structures.[15] • Daniel Bobrow – • Joseph Weizenbaum – author of the ELIZA chatterbot. • Roger Schank – introduced the conceptual dependency theory for natural language understanding.[16] • Terry Winograd – • Kenneth Colby – • Rollo Carpenter – • David Ferrucci – principal investigator of the team that created Watson, IBM's AI computer that won the quiz show Jeopardy! • William Aaron Woods