Lecture 12 Classifiers Part 2 • CSCE 771 Natural Language Processing • Topics • Classifiers • Maxent Classifiers • Maximum Entropy Markov Models • Information Extraction and chunking intro • Readings: Chapter 6, 7.1 • February 25, 2013
Overview • Last Time • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Today • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Readings • NLTK Ch 6
Evaluation of classifiers again • Last time • Recall • Precision • F value • Accuracy
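As a reminder of how the four measures relate, here is a minimal sketch that computes them from the cells of a 2x2 contingency table. The function name `evaluate` and the counts are hypothetical, chosen only for illustration; F is taken to be the balanced F1.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from the four cell counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        # Balanced F value: harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts for one binary classifier over 100 documents.
p, r, f, a = evaluate(tp=8, fp=2, fn=4, tn=86)
print(p, r, f, a)
```

With a rare class, accuracy stays high (0.94 here) even though a third of the positive documents are missed, which is why precision and recall are reported as well.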
Reuters Data set • 21578 documents • 118 categories • document can be in multiple classes • 118 binary classifiers
Confusion matrix • Cij – documents that are really Ci but are classified as Cj • Cii – documents that are really Ci and are correctly classified as Ci
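The Cij counts can be collected in a few lines of plain Python. This is a sketch on made-up toy labels, not Reuters data; `confusion_matrix` is a hypothetical helper, and `cm[(i, j)]` plays the role of Cij.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


# Toy labels, invented for illustration.
gold      = ["earn", "earn", "acq", "acq", "grain"]
predicted = ["earn", "acq",  "acq", "acq", "earn"]

cm = confusion_matrix(gold, predicted)
# The diagonal entries Cii are the correctly classified documents.
correct = sum(n for (i, j), n in cm.items() if i == j)
print(cm)
print(correct)
```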
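Micro Averaging vs Macro Averaging • Macro averaging – average the performance of the individual classifiers (an average of averages) • Micro averaging – sum the correct decisions, false positives (fp) and false negatives (fn) over all classifiers, then compute the measure once from the pooled counts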
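The difference shows up when class sizes are unequal. The sketch below computes both averages for precision; the function name and the per-class counts are hypothetical.

```python
def macro_micro_precision(counts):
    """counts maps class name -> (true positives, false positives)."""
    # Macro: compute precision per class, then average the results.
    per_class = [tp / float(tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)
    # Micro: pool the counts over all classes, then compute precision once.
    total_tp = sum(tp for tp, fp in counts.values())
    total_fp = sum(fp for tp, fp in counts.values())
    micro = total_tp / float(total_tp + total_fp)
    return macro, micro


# One large, easy class and one small, hard class (made-up numbers).
counts = {"big_class": (90, 10), "small_class": (1, 9)}
macro, micro = macro_micro_precision(counts)
print(macro, micro)
```

Macro averaging weights every class equally, so the small class pulls the score down to 0.5; micro averaging weights every decision equally, so the large class dominates.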
nltk.tag • Classes • AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger • Functions • batch_pos_tag, pos_tag, untag
Module nltk.tag.hmm • Source Code for Module nltk.tag.hmm • import nltk • nltk.tag.hmm.demo() • nltk.tag.hmm.demo_pos() • nltk.tag.hmm.demo_pos_bw()
HMM demo • import nltk • nltk.tag.hmm.demo() • nltk.tag.hmm.demo_pos() • nltk.tag.hmm.demo_pos_bw()
Common Suffixes • import nltk • from nltk.corpus import brown • suffix_fdist = nltk.FreqDist() • for word in brown.words(): • word = word.lower() • suffix_fdist.inc(word[-1:]) • suffix_fdist.inc(word[-2:]) • suffix_fdist.inc(word[-3:]) • common_suffixes = suffix_fdist.keys()[:100] • print common_suffixes
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33] • extractor = nltk.RTEFeatureExtractor(rtepair) • print extractor.text_words • set(['Russia', 'Organisation', 'Shanghai', … • print extractor.hyp_words • set(['member', 'SCO', 'China']) • print extractor.overlap('word') • set([ ]) • print extractor.overlap('ne') • set(['SCO', 'China']) • print extractor.hyp_extra('word') • set(['member'])
tagged_sents = list(brown.tagged_sents(categories='news')) • random.shuffle(tagged_sents) • size = int(len(tagged_sents) * 0.1) • train_set, test_set = tagged_sents[size:], tagged_sents[:size] • file_ids = brown.fileids(categories='news') • size = int(len(file_ids) * 0.1) • train_set = brown.tagged_sents(file_ids[size:]) • test_set = brown.tagged_sents(file_ids[:size]) • train_set = brown.tagged_sents(categories='news') • test_set = brown.tagged_sents(categories='fiction') • classifier = nltk.NaiveBayesClassifier.train(train_set)
Traceback (most recent call last): • File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module> • classifier = nltk.NaiveBayesClassifier.train(train_set) • File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train • for featureset, label in labeled_featuresets: • ValueError: too many values to unpack
from nltk.corpus import brown • brown_tagged_sents = brown.tagged_sents(categories='news') • size = int(len(brown_tagged_sents) * 0.9) • train_sents = brown_tagged_sents[:size] • test_sents = brown_tagged_sents[size:] • t0 = nltk.DefaultTagger('NN') • t1 = nltk.UnigramTagger(train_sents, backoff=t0) • t2 = nltk.BigramTagger(train_sents, backoff=t1)
def tag_list(tagged_sents): • return [tag for sent in tagged_sents for (word, tag) in sent] • def apply_tagger(tagger, corpus): • return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] • gold = tag_list(brown.tagged_sents(categories='editorial')) • test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial'))) • cm = nltk.ConfusionMatrix(gold, test) • print cm.pp(sort_by_count=True, show_percents=True, truncate=9)
    |      NN      IN      AT      JJ       .     NNS       ,      VB      NP |
----+-------------------------------------------------------------------------+
 NN | <11.8%>    0.0%       .    0.2%       .    0.0%       .    0.3%    0.0% |
 IN |    0.0%  <9.0%>       .       .       .    0.0%       .       .       . |
 AT |       .       .  <8.6%>       .       .       .       .       .       . |
 JJ |    1.7%       .       .  <3.9%>       .       .       .    0.0%    0.0% |
  . |       .       .       .       .  <4.8%>       .       .       .       . |
NNS |    1.5%       .       .       .       .  <3.2%>       .       .    0.0% |
  , |       .       .       .       .       .       .  <4.4%>       .       . |
 VB |    0.9%       .       .    0.0%       .       .       .  <2.4%>       . |
 NP |    1.0%       .       .    0.0%       .       .       .       .  <1.8%> |
----+-------------------------------------------------------------------------+
(row = reference; col = test)
Entropy • import math • def entropy(labels): • freqdist = nltk.FreqDist(labels) • probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)] • return -sum([p * math.log(p,2) for p in probs])
print entropy(['male', 'male', 'male', 'male']) • -0.0 • print entropy(['male', 'female', 'male', 'male']) • 0.811278124459 • print entropy(['female', 'male', 'female', 'male']) • 1.0 • print entropy(['female', 'female', 'male', 'female']) • 0.811278124459 • print entropy(['female', 'female', 'female', 'female']) • -0.0
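The printed values follow directly from the definition of entropy over the label distribution. As a worked check, take the second list (three 'male', one 'female'):

```latex
H(\text{labels}) = -\sum_{l} P(l)\,\log_2 P(l)

H = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right)
  \approx 0.75 \times 0.415 + 0.25 \times 2
  \approx 0.811
```

A list with a single label has P(l) = 1, so every term is zero, while an even two-way split gives the maximum of 1 bit.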
The Rest of NLTK Chapter 06 • 6.5 Naïve Bayes Classifiers • 6.6 Maximum Entropy Classifiers • nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False) • 6.7 Modeling Linguistic Patterns • 6.8 Summary • But no more Code?!?
Maximum Entropy Models (again) • Features are elements of evidence that connect observations d with categories c • f: C × D → R • Example feature • f(c, d) = 1 if c = LOCATION and w_{-1} = IN and isCapitalized(w), otherwise 0 • An “input-feature” is a property of an unlabeled token. • A “joint-feature” is a property of a labeled token.
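A joint feature like the example can be written as an ordinary function of a class and an observation. In this sketch the observation d is assumed to be a dict with hypothetical keys `word` and `prev_word`, and w-1 = IN is read as "the previous word is the preposition in"; the slide leaves open whether IN is a word or a tag.

```python
def f_location(c, d):
    """Indicator joint feature: fires only for one class plus matching evidence."""
    return int(c == "LOCATION"
               and d["prev_word"].lower() == "in"
               and d["word"][:1].isupper())


# Hypothetical observation: the token "Arcadia" preceded by "in".
d = {"prev_word": "in", "word": "Arcadia"}
print(f_location("LOCATION", d))   # evidence and class both match
print(f_location("PERSON", d))     # same evidence, different class
```

The same input evidence yields a different feature value for each candidate class, which is what makes it a joint-feature rather than an input-feature.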
Feature-Based Linear Classifiers • $p(c \mid d, \lambda) = \dfrac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$
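A minimal sketch of such a classifier, assuming the usual log-linear (maxent) form in which p(c | d, lambda) is the exponentiated weighted feature sum for c, normalized over all classes. The two features, their weights, and the class set are invented for illustration and are not trained values.

```python
import math


def maxent_prob(classes, features, weights, d):
    """Return {class: p(class | d)} for a log-linear model."""
    # Linear score per class: sum_i lambda_i * f_i(c, d).
    scores = {c: sum(w * f(c, d) for f, w in zip(features, weights))
              for c in classes}
    # Exponentiate and normalize so the probabilities sum to 1.
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}


def f1(c, d):
    return int(c == "LOCATION" and d["prev_word"].lower() == "in"
               and d["word"][:1].isupper())


def f2(c, d):
    return int(c == "PERSON" and d["word"][:1].isupper())


classes = ["LOCATION", "PERSON", "OTHER"]
probs = maxent_prob(classes, [f1, f2], [1.8, 0.6],
                    {"prev_word": "in", "word": "Arcadia"})
print(probs)
```

Each weight acts as a vote for its class: a larger lambda for f1 shifts more probability mass to LOCATION whenever that feature fires.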
Maximum Entropy Markov Models (MEMM) • Apply a Maxent classifier repeatedly along a sequence, one token at a time, so that each decision can use the previous decisions as features
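The idea can be sketched as a left-to-right tagger whose local classifier sees the previous tag. This sketch uses greedy decoding for brevity, whereas a full MEMM would normally search over tag sequences (for example with Viterbi). The tag set, feature templates, and weights are all hypothetical.

```python
import math


def local_probs(prev_tag, word, tags, weights):
    """P(tag | prev_tag, word) from a log-linear model with indicator features."""
    scores = {}
    for t in tags:
        s = weights.get(("word", word.lower(), t), 0.0)
        s += weights.get(("prev", prev_tag, t), 0.0)
        s += weights.get(("cap", word[:1].isupper(), t), 0.0)
        scores[t] = s
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


def greedy_tag(words, tags, weights):
    """Tag left to right, feeding each chosen tag into the next decision."""
    prev, out = "<START>", []
    for w in words:
        probs = local_probs(prev, w, tags, weights)
        prev = max(probs, key=probs.get)
        out.append(prev)
    return out


TAGS = ["LOC", "O"]
WEIGHTS = {  # invented weights, not trained
    ("cap", True, "LOC"): 1.0,
    ("cap", False, "O"): 2.0,
    ("prev", "LOC", "LOC"): 1.5,
    ("prev", "<START>", "O"): 2.0,
    ("word", "york", "LOC"): 1.0,
}
tagged = greedy_tag(["He", "visited", "New", "York"], TAGS, WEIGHTS)
print(tagged)
```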
Named Entity Recognition (NER) • entity (dictionary definition) – • a: being, existence; especially: independent, separate, or self-contained existence b: the existence of a thing as contrasted with its attributes • something that has separate and distinct existence and objective or conceptual reality • an organization (as a business or governmental unit) that has an identity separate from those of its members • A named entity is one of those with a name • http://nlp.stanford.edu/software/CRF-NER.shtml
Classes of Named Entities • Person (PERS) • Location (LOC) • Organization (ORG) • DATE • Example input: Jim bought 300 shares of Acme Corp. in 2006. • NER produces an annotated block of text, such as this one: • <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>. • http://nlp.stanford.edu/software/CRF-NER.shtml
NLTK ch07.py • def ie_preprocess(document): • sentences = nltk.sent_tokenize(document) • sentences = [nltk.word_tokenize(sent) for sent in sentences] • sentences = [nltk.pos_tag(sent) for sent in sentences] • sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")] • grammar = "NP: {<DT>?<JJ>*<NN>}" • cp = nltk.RegexpParser(grammar) • result = cp.parse(sentence) • print result
(S • (NP the/DT little/JJ yellow/JJ dog/NN) • barked/VBD • at/IN • (NP the/DT cat/NN)) • (S • (NP the/DT little/JJ yellow/JJ dog/NN) • barked/VBD • at/IN • (NP the/DT cat/NN)) • (S (NP money/NN market/NN) fund/NN)
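To see what a tag pattern such as NP: {<DT>?<JJ>*<NN>} does, the NP spans of the first parse can be recovered with the standard `re` module alone. This is a sketch of the idea, not NLTK's implementation; `np_chunks` is a hypothetical helper that encodes the tag sequence as a string and matches an ordinary regular expression against it.

```python
import re


def np_chunks(tagged):
    """Return (start, end) token spans matching <DT>?<JJ>*<NN>."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    spans = []
    for m in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        # Each "<" before the match corresponds to one earlier token.
        start = tag_string[:m.start()].count("<")
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
spans = np_chunks(sentence)
chunks = [[w for w, _ in sentence[s:e]] for s, e in spans]
print(chunks)
```

Tokens outside any span (barked, at) stay unchunked, just as they hang directly off S in the tree.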
(CHUNK combined/VBN to/TO achieve/VB) • (CHUNK continue/VB to/TO place/VB) • (CHUNK serve/VB to/TO protect/VB) • (CHUNK wanted/VBD to/TO wait/VB)
from nltk.corpus import conll2000 • print conll2000.chunked_sents('train.txt')[99] • print " B********************************************" • print conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99] • print " C********************************************" • from nltk.corpus import conll2000 • cp = nltk.RegexpParser("") • test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • print cp.evaluate(test_sents)
Information extraction • A step towards understanding • Find named entities • Figure out what is being said about them; in practice, just the relations among named entities • http://en.wikipedia.org/wiki/Information_extraction
Outline of natural language processing http://en.wikipedia.org/wiki/Natural_language_processing_toolkits