Ling 570 Day 16: Sequence Modeling and Named Entity Recognition
Sequence Labeling • Goal: Find most probable labeling of a sequence • Many sequence labeling tasks • POS tagging • Word segmentation • Named entity tagging • Story/spoken sentence segmentation • Pitch accent detection • Dialog act tagging
HMM search space • [Trellis with one column per word of "time flies like an arrow"; each column holds the candidate tags N, V, P, DT]
Viterbi over the trellis • Best sequence: N N V DT N • Find the max in the last column, then follow back-pointer chains to recover that best sequence
Viterbi • Initialization: v_1(j) = π_j · b_j(o_1) • Recursion: v_t(j) = max_i [ v_{t-1}(i) · a_{ij} ] · b_j(o_t), with a back-pointer to the maximizing i • Termination: best score = max_i v_T(i); backtrace from the maximizing tag to recover the sequence
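A minimal Python sketch of this recursion (not from the slides), assuming a bigram HMM given as plain dictionaries pi (initial), a (transition), and b (emission); unseen words get a tiny floor probability purely for illustration.

# Minimal Viterbi decoder over an HMM given as dictionaries.
# pi[tag]      : initial probability of tag
# a[prev][tag] : transition probability prev -> tag
# b[tag][word] : emission probability of word given tag
def viterbi(words, tags, pi, a, b):
    # v[i][tag] = probability of the best path ending in tag at position i
    v = [{t: pi[t] * b[t].get(words[0], 1e-9) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tags:
            # Recursion: choose the best predecessor for tag t at position i
            best_prev = max(tags, key=lambda p: v[i - 1][p] * a[p][t])
            v[i][t] = v[i - 1][best_prev] * a[best_prev][t] * b[t].get(words[i], 1e-9)
            back[i][t] = best_prev
    # Termination: best tag in the last column, then follow back-pointers
    last = max(tags, key=lambda t: v[-1][t])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))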
Decoding • Goal: Identify the highest-probability tag sequence • Issues: • Features include tags of previous words, which are not available until those words have been tagged • Because features use tag history, knowing only the single highest-probability preceding tag is not sufficient
Decoding • Approach: Retain multiple candidate tag sequences • Essentially search through tagging choices • Which sequences? • We can’t look at all of them – exponentially many! • Instead, use top K highest probability sequences
Breadth-First Search • [Trellis for "<s> time flies like an arrow": starting from BOS, each word's column is expanded in turn, and every partial tag sequence is extended with every candidate tag (N, V, P, ...) of the next word]
Breadth-first Search • Is breadth-first search efficient? • No: it extends every possible tag sequence
Beam Search • Intuition: • Breadth-first search explores all paths • Lots of paths are (pretty obviously) bad • Why explore bad paths? • Restrict to (apparently best) paths • Approach: • Perform breadth-first search, but • Retain only k ‘best’ paths thus far • k: beam width
Beam Search, k=3 • [Same trellis for "<s> time flies like an arrow", but at each word only the 3 highest-probability partial sequences are kept and extended; all other paths are pruned]
Beam Search • W = {w_1, w_2, …, w_n}: test sentence • s_{i,j}: j-th highest-probability sequence up to and including word w_i • Generate tags for w_1, keep the top k, set s_{1,j} accordingly • for i = 2 to n: • Extension: add tags for w_i to each s_{i-1,j} • Beam selection: • Sort sequences by probability • Keep only the top k sequences • Return the highest-probability sequence s_{n,1}
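A Python sketch of this loop; score(prev_tags, word, tag) is a hypothetical stand-in for whatever model (HMM, MaxEnt, ...) supplies log-probabilities, and is not defined in the slides.

# Beam-search tagger: keep only the k best partial tag sequences per word.
def beam_search(words, tags, score, k=3):
    beam = [(0.0, [])]  # each hypothesis: (log probability, tag sequence)
    for word in words:
        candidates = []
        for logp, seq in beam:
            # Extension: try every tag for the current word
            for tag in tags:
                candidates.append((logp + score(seq, word, tag), seq + [tag]))
        # Beam selection: sort by score and keep only the top k
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam[0][1]  # highest-probability complete sequence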
POS Tagging • Overall accuracy: 96.3+% • Unseen word accuracy: 86.2% • Comparable to HMM or TBL tagging accuracy • Provides: • A probabilistic framework • Better able to model different information sources • Topline accuracy is only 96-97%, due to annotation consistency issues
Beam Search • Beam search decoding: • Variant of breadth-first search • At each layer, keep only the top k sequences • Advantages: • Efficient in practice: a beam of 3-5 is near-optimal • Empirically, the beam covers only 5-10% of the search space, i.e. prunes 90-95% • Simple to implement: just extensions + sorting, no dynamic programming • Running time: O(kT) [vs. O(N^T) for exhaustive search] • Disadvantage: not guaranteed optimal (or complete)
Viterbi Decoding • Viterbi search: • Exploits dynamic programming, memoization • Requires a small history window • Efficient search: O(N²T) • Advantage: • Exact: the optimal solution is returned • Disadvantage: • Limited window of context
Beam vs Viterbi • Dynamic programming vs heuristic search • Guaranteed optimal vs no guarantee • Different context window
MaxEnt POS Tagging • Part of speech tagging by classification: • Feature design • word and tag context features • orthographic features for rare words • Sequence classification problems: • Tag features depend on prior classification • Beam search decoding • Efficient, but inexact • Near optimal in practice
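An illustrative feature extractor of the kind such a tagger might use; the specific feature names below are made up for this sketch, not taken from the tagger evaluated in the slides.

# Features for classifying word i, combining word/tag context with
# orthographic cues that help on rare or unseen words.
def extract_features(words, prev_tags, i):
    w = words[i]
    feats = {}
    feats["word=" + w.lower()] = 1
    feats["prev_word=" + (words[i - 1].lower() if i > 0 else "<s>")] = 1
    feats["next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>")] = 1
    # Tag-history feature: depends on earlier tagging decisions,
    # which is why decoding uses beam search rather than per-token classification
    feats["prev_tag=" + (prev_tags[-1] if prev_tags else "<s>")] = 1
    # Orthographic features for rare words
    feats["is_capitalized"] = int(w[0].isupper())
    feats["has_digit"] = int(any(c.isdigit() for c in w))
    feats["has_hyphen"] = int("-" in w)
    feats["suffix3=" + w[-3:].lower()] = 1
    return feats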
Roadmap • Named Entity Recognition • Definition • Motivation • Challenges • Common Approach
Named Entity Recognition • Task: Identify Named Entities in (typically) unstructured text • Typical entities: • Person names • Locations • Organizations • Dates • Times
Example • Lady Gaga [person] is playing a concert for the Bushes [person] in Texas [location] next September [time]
Example from financial news • Ray Dalio’s [person] Bridgewater Associates [organization] is an extremely large and extremely successful hedge fund. • Based in Westport [location] and known for its strong -- some would say cultish -- culture, it has grown to well over $100 billion [value] in assets under management with little negative impact on its returns.
Entity types may differ by application • News: • People, countries, organizations, dates, etc. • Medical records: • Diseases, medications, organisms, organs, etc.
Named Entity Types and Examples • Common categories: persons, organizations, locations, dates and times, monetary values
Why NER? • Machine translation: • Lady Gaga is playing a concert for the Bushes in Texas next September • La señora Gaga está tocando un concierto para los arbustos … ("the Bushes" mistranslated as shrubs) • Numbers: • 9/11: date vs. ratio • 911: emergency phone number vs. simple number
Why NER? • Information extraction: • MUC task: joint ventures/mergers • Focus on company names, person names (CEOs), valuations • Information retrieval: • Named entities are the focus of retrieval • In some data sets, 60+% of queries target NEs • Text-to-speech: • 206-616-5728 • Phone numbers must be read differently from other digit strings, and conventions differ by language
Challenges • Ambiguity • "Washington chose …": D.C., the state, George, etc.? • Most digit strings are ambiguous • "cat" (95 results): • CAT(erpillar) stock ticker • Computerized Axial Tomography • Chloramphenicol Acetyl Transferase • the small furry mammal
Evaluation • Precision: fraction of predicted entities that are correct • Recall: fraction of gold entities that are found • F-measure: harmonic mean of precision and recall
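For NER these are typically computed over entity spans rather than tokens. A minimal sketch, assuming entities are represented as (start, end, type) tuples (a representation chosen here for illustration):

# Precision, recall, and F1 over entity spans given as (start, end, type).
def ner_prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)  # exact span-and-type matches
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1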
Resources • Online: • Name lists • Baby name, who’s who, newswire services, census.gov • Gazetteers • SEC listings of companies • Tools • Lingpipe • OpenNLP • Stanford NLP toolkit
Approaches to NER • Rule/Regex-based: • Match names/entities in lists • Regex: e.g. \d\d/\d\d/\d\d matches 11/23/11 • Currency: \$\d+\.\d+ • Machine learning via sequence labeling: • Better for names, organizations • Hybrid approaches combine both
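A small Python illustration of the rule/regex side, using the date and currency patterns above plus a toy name list; the list contents and function name are made up for the example.

import re

DATE_RE = re.compile(r"\b\d\d/\d\d/\d\d\b")   # dates like 11/23/11
CURRENCY_RE = re.compile(r"\$\d+\.\d+")       # amounts like $12.50
PERSON_LIST = {"Lady Gaga", "Ray Dalio"}      # stand-in for a real name list

def rule_based_ner(text):
    entities = []
    entities += [("DATE", m.group()) for m in DATE_RE.finditer(text)]
    entities += [("MONEY", m.group()) for m in CURRENCY_RE.finditer(text)]
    # Naive gazetteer lookup by substring match
    entities += [("PER", name) for name in PERSON_LIST if name in text]
    return entities

print(rule_based_ner("Lady Gaga paid $12.50 for tickets on 11/23/11"))
# [('DATE', '11/23/11'), ('MONEY', '$12.50'), ('PER', 'Lady Gaga')]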
NER as Classification Task • Instance: token • Labels: • Position: B(eginning), I(nside), O(utside) • NER types: PER, ORG, LOC, NUM • Label: Type-Position, e.g. PER-B, PER-I, O, … • How many tags? (|NER Types| × 2) + 1, i.e. 9 for the four types above
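A small sketch of this label set and of converting labeled spans into per-token labels; the token-index span format and helper name are illustrative.

NER_TYPES = ["PER", "ORG", "LOC", "NUM"]
# (|NER types| x 2) + 1 = 9 labels: a B and I label per type, plus O
LABELS = [t + "-" + pos for t in NER_TYPES for pos in ("B", "I")] + ["O"]

def spans_to_labels(tokens, spans):
    # spans: list of (start, end, type) over token indices, end exclusive
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = etype + "-B"
        for i in range(start + 1, end):
            labels[i] = etype + "-I"
    return labels

tokens = ["Lady", "Gaga", "is", "playing", "in", "Texas"]
print(spans_to_labels(tokens, [(0, 2, "PER"), (5, 6, "LOC")]))
# ['PER-B', 'PER-I', 'O', 'O', 'O', 'LOC-B']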