Ling 570 Day 16: Sequence Modeling and Named Entity Recognition
Sequence Labeling • Goal: Find most probable labeling of a sequence • Many sequence labeling tasks • POS tagging • Word segmentation • Named entity tagging • Story/spoken sentence segmentation • Pitch accent detection • Dialog act tagging
HMM search space • [Trellis with one column per word of "time flies like an arrow"; each column holds the candidate tags N, V, P, DT]
Viterbi over the trellis • Best sequence: N N V DT N • Find the max in the last column, then follow back-pointer chains to recover that best sequence
Viterbi • Initialization: v_1(j) = π_j · b_j(o_1) • Recursion: v_t(j) = max_i [ v_{t-1}(i) · a_{ij} ] · b_j(o_t), with a back-pointer to the maximizing i • Termination: best score = max_i v_T(i); backtrace from the maximizing tag to recover the sequence
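A minimal Python sketch of this recursion (not from the slides), assuming a bigram HMM given as plain dictionaries pi (initial), a (transition), and b (emission); unseen words get a tiny floor probability purely for illustration.

# Minimal Viterbi decoder over an HMM given as dictionaries.
# pi[tag]      : initial probability of tag
# a[prev][tag] : transition probability prev -> tag
# b[tag][word] : emission probability of word given tag
def viterbi(words, tags, pi, a, b):
    # v[i][tag] = probability of the best path ending in tag at position i
    v = [{t: pi[t] * b[t].get(words[0], 1e-9) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tags:
            # Recursion: choose the best predecessor for tag t at position i
            best_prev = max(tags, key=lambda p: v[i - 1][p] * a[p][t])
            v[i][t] = v[i - 1][best_prev] * a[best_prev][t] * b[t].get(words[i], 1e-9)
            back[i][t] = best_prev
    # Termination: best tag in the last column, then follow back-pointers
    last = max(tags, key=lambda t: v[-1][t])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))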
Decoding • Goal: Identify the highest-probability tag sequence • Issues: • Features include tags of previous words, which are not available until those words have been tagged • Because features use tag history, knowing only the single highest-probability preceding tag is not sufficient
Decoding • Approach: Retain multiple candidate tag sequences • Essentially search through tagging choices • Which sequences? • We can’t look at all of them – exponentially many! • Instead, use top K highest probability sequences
Breadth-First Search • [Trellis for "<s> time flies like an arrow": starting from BOS, each word's column is expanded in turn, and every partial tag sequence is extended with every candidate tag (N, V, P, ...) of the next word]
Breadth-first Search • Is breadth-first search efficient? • No: it extends every possible tag sequence
Beam Search • Intuition: • Breadth-first search explores all paths • Lots of paths are (pretty obviously) bad • Why explore bad paths? • Restrict to (apparently best) paths • Approach: • Perform breadth-first search, but • Retain only k ‘best’ paths thus far • k: beam width
Beam Search, k=3 • [Same trellis for "<s> time flies like an arrow", but at each word only the 3 highest-probability partial sequences are kept and extended; all other paths are pruned]
Beam Search • W = {w_1, w_2, …, w_n}: test sentence • s_{i,j}: j-th highest-probability sequence up to and including word w_i • Generate tags for w_1, keep the top k, set s_{1,j} accordingly • for i = 2 to n: • Extension: add tags for w_i to each s_{i-1,j} • Beam selection: • Sort sequences by probability • Keep only the top k sequences • Return the highest-probability sequence s_{n,1}
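A Python sketch of this loop; score(prev_tags, word, tag) is a hypothetical stand-in for whatever model (HMM, MaxEnt, ...) supplies log-probabilities, and is not defined in the slides.

# Beam-search tagger: keep only the k best partial tag sequences per word.
def beam_search(words, tags, score, k=3):
    beam = [(0.0, [])]  # each hypothesis: (log probability, tag sequence)
    for word in words:
        candidates = []
        for logp, seq in beam:
            # Extension: try every tag for the current word
            for tag in tags:
                candidates.append((logp + score(seq, word, tag), seq + [tag]))
        # Beam selection: sort by score and keep only the top k
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam[0][1]  # highest-probability complete sequence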
POS Tagging • Overall accuracy: 96.3+% • Unseen word accuracy: 86.2% • Comparable to HMM or TBL tagging accuracy • Provides: • A probabilistic framework • Better able to model different information sources • Topline accuracy is only 96-97%, due to annotation consistency issues
Beam Search • Beam search decoding: • Variant of breadth-first search • At each layer, keep only the top k sequences • Advantages: • Efficient in practice: a beam of 3-5 is near-optimal • Empirically, the beam covers only 5-10% of the search space, i.e. prunes 90-95% • Simple to implement: just extensions + sorting, no dynamic programming • Running time: O(kT) [vs. O(N^T) for exhaustive search] • Disadvantage: not guaranteed optimal (or complete)
Viterbi Decoding • Viterbi search: • Exploits dynamic programming, memoization • Requires a small history window • Efficient search: O(N²T) • Advantage: • Exact: the optimal solution is returned • Disadvantage: • Limited window of context
Beam vs Viterbi • Dynamic programming vs heuristic search • Guaranteed optimal vs no guarantee • Different context window
MaxEnt POS Tagging • Part of speech tagging by classification: • Feature design • word and tag context features • orthographic features for rare words • Sequence classification problems: • Tag features depend on prior classification • Beam search decoding • Efficient, but inexact • Near optimal in practice
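An illustrative feature extractor of the kind such a tagger might use; the specific feature names below are made up for this sketch, not taken from the tagger evaluated in the slides.

# Features for classifying word i, combining word/tag context with
# orthographic cues that help on rare or unseen words.
def extract_features(words, prev_tags, i):
    w = words[i]
    feats = {}
    feats["word=" + w.lower()] = 1
    feats["prev_word=" + (words[i - 1].lower() if i > 0 else "<s>")] = 1
    feats["next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>")] = 1
    # Tag-history feature: depends on earlier tagging decisions,
    # which is why decoding uses beam search rather than per-token classification
    feats["prev_tag=" + (prev_tags[-1] if prev_tags else "<s>")] = 1
    # Orthographic features for rare words
    feats["is_capitalized"] = int(w[0].isupper())
    feats["has_digit"] = int(any(c.isdigit() for c in w))
    feats["has_hyphen"] = int("-" in w)
    feats["suffix3=" + w[-3:].lower()] = 1
    return feats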
Roadmap • Named Entity Recognition • Definition • Motivation • Challenges • Common Approach
Named Entity Recognition • Task: Identify Named Entities in (typically) unstructured text • Typical entities: • Person names • Locations • Organizations • Dates • Times
Example • Lady Gaga [person] is playing a concert for the Bushes [person] in Texas [location] next September [time]
Example from financial news • Ray Dalio’s [person] Bridgewater Associates [organization] is an extremely large and extremely successful hedge fund. • Based in Westport [location] and known for its strong -- some would say cultish -- culture, it has grown to well over $100 billion [value] in assets under management with little negative impact on its returns.
Entity types may differ by application • News: • People, countries, organizations, dates, etc. • Medical records: • Diseases, medications, organisms, organs, etc.
Named Entity Types and Examples • Common categories: persons, organizations, locations, dates and times, monetary values
Why NER? • Machine translation: • Lady Gaga is playing a concert for the Bushes in Texas next September • La señora Gaga está tocando un concierto para los arbustos … ("the Bushes" mistranslated as shrubs) • Numbers: • 9/11: date vs. ratio • 911: emergency phone number vs. simple number
Why NER? • Information extraction: • MUC task: joint ventures/mergers • Focus on company names, person names (CEOs), valuations • Information retrieval: • Named entities are the focus of retrieval • In some data sets, 60+% of queries target NEs • Text-to-speech: • 206-616-5728 • Phone numbers must be read differently from other digit strings, and conventions differ by language
Challenges • Ambiguity • "Washington chose …": D.C., the state, George, etc.? • Most digit strings are ambiguous • "cat" (95 results): • CAT(erpillar) stock ticker • Computerized Axial Tomography • Chloramphenicol Acetyl Transferase • the small furry mammal
Evaluation • Precision: fraction of predicted entities that are correct • Recall: fraction of gold entities that are found • F-measure: harmonic mean of precision and recall
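For NER these are typically computed over entity spans rather than tokens. A minimal sketch, assuming entities are represented as (start, end, type) tuples (a representation chosen here for illustration):

# Precision, recall, and F1 over entity spans given as (start, end, type).
def ner_prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)  # exact span-and-type matches
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1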
Resources • Online: • Name lists • Baby name, who’s who, newswire services, census.gov • Gazetteers • SEC listings of companies • Tools • Lingpipe • OpenNLP • Stanford NLP toolkit
Approaches to NER • Rule/Regex-based: • Match names/entities in lists • Regex: e.g. \d\d/\d\d/\d\d matches 11/23/11 • Currency: \$\d+\.\d+ • Machine learning via sequence labeling: • Better for names, organizations • Hybrid approaches combine both
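A small Python illustration of the rule/regex side, using the date and currency patterns above plus a toy name list; the list contents and function name are made up for the example.

import re

DATE_RE = re.compile(r"\b\d\d/\d\d/\d\d\b")   # dates like 11/23/11
CURRENCY_RE = re.compile(r"\$\d+\.\d+")       # amounts like $12.50
PERSON_LIST = {"Lady Gaga", "Ray Dalio"}      # stand-in for a real name list

def rule_based_ner(text):
    entities = []
    entities += [("DATE", m.group()) for m in DATE_RE.finditer(text)]
    entities += [("MONEY", m.group()) for m in CURRENCY_RE.finditer(text)]
    # Naive gazetteer lookup by substring match
    entities += [("PER", name) for name in PERSON_LIST if name in text]
    return entities

print(rule_based_ner("Lady Gaga paid $12.50 for tickets on 11/23/11"))
# [('DATE', '11/23/11'), ('MONEY', '$12.50'), ('PER', 'Lady Gaga')]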
NER as Classification Task • Instance: token • Labels: • Position: B(eginning), I(nside), O(utside) • NER types: PER, ORG, LOC, NUM • Label: Type-Position, e.g. PER-B, PER-I, O, … • How many tags? (|NER Types| × 2) + 1, i.e. 9 for the four types above
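A small sketch of this label set and of converting labeled spans into per-token labels; the token-index span format and helper name are illustrative.

NER_TYPES = ["PER", "ORG", "LOC", "NUM"]
# (|NER types| x 2) + 1 = 9 labels: a B and I label per type, plus O
LABELS = [t + "-" + pos for t in NER_TYPES for pos in ("B", "I")] + ["O"]

def spans_to_labels(tokens, spans):
    # spans: list of (start, end, type) over token indices, end exclusive
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = etype + "-B"
        for i in range(start + 1, end):
            labels[i] = etype + "-I"
    return labels

tokens = ["Lady", "Gaga", "is", "playing", "in", "Texas"]
print(spans_to_labels(tokens, [(0, 2, "PER"), (5, 6, "LOC")]))
# ['PER-B', 'PER-I', 'O', 'O', 'O', 'LOC-B']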