IE by Candidate Classification: Jansche & Abney, Cohen et al

William Cohen, 1/19/03

SCAN: Search & Summarization for Audio Collections (AT&T Labs)

Why IE from personal voicemail • Unified interface for email, voicemail, fax, … requires uniform headers: • Sender, Time, Subject, … • Headers are key for uniform interface • Independently, voicemail access is slow: • useful to have fast access to important parts of message (contact number, caller)

Why else to read this paper • Robust information extraction • Generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts • Place of hand-coding vs. learning in information extraction • How to break up the task • Where and how to use engineering • Pipeline: candidate generator → candidate phrase → learned filter → extracted phrase

Voicemail corpus • About 10,000 manually transcribed and annotated voice messages. • 1869 used for evaluation

Observation: caller phrases are short and near the beginning of the message.

Caller-phrase extraction • Propose start positions i1,…,iN • Use a learned decision tree to pick the best i • Propose end positions i+j1,i+j2,…,i+jM • Use a learned decision tree to pick the best j
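The two-stage procedure above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `score_start` and `score_end` are hypothetical stand-ins for the two learned decision trees, and the function name and signature are assumptions.

```python
from typing import Callable, List, Tuple

def extract_caller_phrase(
    tokens: List[str],
    start_candidates: List[int],
    length_candidates: List[int],
    score_start: Callable[[List[str], int], float],
    score_end: Callable[[List[str], int, int], float],
) -> Tuple[int, int]:
    """Return (start, end) token indices of the caller phrase, end exclusive.

    Assumes at least one start and one length candidate fit in the message.
    """
    # Stage 1: pick the best-scoring start position among the proposals.
    starts = [i for i in start_candidates if i < len(tokens)]
    best_i = max(starts, key=lambda i: score_start(tokens, i))
    # Stage 2: with the start fixed, pick the best-scoring phrase length.
    lengths = [j for j in length_candidates if best_i + j <= len(tokens)]
    best_j = max(lengths, key=lambda j: score_end(tokens, best_i, j))
    return best_i, best_i + best_j
```

Fixing the start before choosing the end keeps each classifier's decision small: N candidates, then M candidates, rather than N × M spans.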

Baseline (HZP, Col log-linear) • IE as tagging: • Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via a MAXENT model • Beam search to find the best tag sequence given the word sequence • Features of the model are words, word pairs, word pair + tag trigrams, ….
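A minimal sketch of the beam search step in the tagging baseline, assuming a hypothetical `log_prob(words, i, history, tag)` callable that stands in for the MAXENT estimate of log Pr(tag_i | words, previous tags). The names and the default beam width are illustrative assumptions, not details from the paper.

```python
from typing import Callable, List, Sequence, Tuple

def beam_search_tags(
    words: Sequence[str],
    tags: Sequence[str],
    log_prob: Callable[[Sequence[str], int, Tuple[str, ...], str], float],
    beam_width: int = 3,
) -> List[str]:
    """Approximate the highest-scoring tag sequence for `words`."""
    # Each beam entry is (total log-probability, tags chosen so far).
    beam: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for i in range(len(words)):
        # Extend every surviving hypothesis with every possible tag.
        expanded = [
            (score + log_prob(words, i, history, tag), history + (tag,))
            for score, history in beam
            for tag in tags
        ]
        # Keep only the `beam_width` best partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return list(beam[0][1])
```

Because the model conditions on earlier tags, the search cannot pick each tag independently; the beam trades exactness for a bounded number of hypotheses per position.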

Performance

Observation: caller names are really short and near the beginning of the message.

What about ASR transcripts?

Extracting phone numbers • Phase 1: hand-coded grammar proposes candidate phone numbers • Not too hard, due to limited vocabulary • Optimize recall (96%), not precision (30%) • Phase 2: a learned decision tree filters candidates • Use length, position, context, …
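The two phases can be illustrated with a toy sketch. Everything here is an assumption for illustration: the "grammar" is reduced to maximal runs of digit words, and `keep_candidate` is a hand-written length test standing in for the learned decision tree (which, per the slide, also uses position and context).

```python
from typing import List, Tuple

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}

Candidate = Tuple[int, int, str]  # (start index, end index exclusive, digit string)

def propose_candidates(tokens: List[str]) -> List[Candidate]:
    """Phase 1: propose every maximal run of digit words (favors recall)."""
    candidates: List[Candidate] = []
    start = None
    # A sentinel token closes any run that reaches the end of the message.
    for i, tok in enumerate(tokens + ["<end>"]):
        if tok in DIGIT_WORDS:
            if start is None:
                start = i
        elif start is not None:
            digits = "".join(DIGIT_WORDS[t] for t in tokens[start:i])
            candidates.append((start, i, digits))
            start = None
    return candidates

def keep_candidate(cand: Candidate) -> bool:
    """Phase 2 stand-in: a learned filter would replace this length test."""
    _, _, digits = cand
    return len(digits) in (7, 10)
```

The generator is deliberately permissive; precision is recovered by the filter, which only has to judge a handful of candidates per message.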

Results

Their Conclusions

Cohen, Wang, Murphy • Another paper with a similar flavor: • IE for a particular task • IE using a similar propose-and-filter approach • When and how do you engineer, and when and how do you use learning?

Background – subcellular localization The most important tool for studying protein localizations is fluorescence microscopy. New image processing techniques can automatically produce a quantitative description of subcellular localization.

Background – subcellular localization: two Golgi proteins that cannot be distinguished by eye

Background – subcellular localization Entrez: “a new 376kD Golgi complex outher membrane protein” SWISSProt: “INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE” Entrez: “GPP130; type II Golgi membrane protein” SWISSProt: nothing

Overview of SLIF: image analysis of existing images from online publications. Diagram labels: On-line paper; Figure finder; Figure; Image; Panel Splitter; Panel Classifier; Fl. Micr. Panel; Scale Finder; Micr. Scale

Overview of SLIF: image analysis of existing images from online publications End result: collection of on-line fluorescence microscope images, with quantitative description of localization. E.g.: we know this figure section shows a tubulin-like protein… …but not which one!

Background – overview of SLIF2.0. Diagram labels: Image; Caption; Image Pointer Finder; Panel Splitter; Panel Label Matcher; Panel Classifier; Scope Finder; Fl. Micr. Panel; Scale Finder; Name Finder; Protein Name; Micr. Scale; Cell Type

Background – overview of SLIF2.0 An old issue: entity recognition (e.g., BY-2, U2B″-GFP, p80-coilin, anti-p80 coilin). A new issue: “caption understanding” - where are the entities in the image? Example caption: Figure 1. (A) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and autoantibody against p80 coilin (right panel). Three nuclei are shown, and the bright GFP spots colocalize with bright foci of anti-coilin labeling. There is some labeling of the cytoplasm by anti-p80 coilin. (B) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and 4G3 antibody (right panel). Three nuclei are shown. Most coiled bodies are in the nucleoplasm, but occasionally are seen in the nucleolus (arrows). All coiled bodies that contain U2B″ also express the U2B″-GFP fusion. Bars, 5 μm. (Source: Movement of Coiled Bodies, Vol. 10, July 1999, p. 2299.)

Why caption understanding? - Location proteomics. - Remove extraneous junk from caption text for “ordinary” IE, NLP, indexing, … - Better text- or content-based image retrieval for scientific images.

Identify image pointers: substrings that refer to parts of the image. Will focus on text issues, not matching.

Identify image pointers: substrings that refer to parts of the image. Classify image pointers as citation-style or bullet-style.

Compute scopes: - The scope of a bullet-style image pointer is all words between it and the next “bullet”. In the example caption, the slide marks the scope of (A) and the scope of (B).

Compute scopes: - The scope of a bullet-style image pointer is all words after it, but before the next “bullet”. - The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation).
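The bullet-style rule can be sketched in a few lines. This is a simplified illustration: it assumes every parenthesized capital letter such as "(A)" is a bullet-style pointer, which skips the classification step, and it does not attempt the heuristic citation-style scoping. The regex and function name are assumptions.

```python
import re
from typing import List, Tuple

# Assumption: a bullet-style pointer looks like "(A)", "(B)", ...
BULLET = re.compile(r"\(([A-Z])\)")

def bullet_scopes(caption: str) -> List[Tuple[str, str]]:
    """Pair each bullet label with the text up to the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = []
    for k, m in enumerate(matches):
        # The scope ends where the next bullet starts, or at the caption end.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes.append((m.group(1), caption[m.end():end].strip()))
    return scopes
```

For a caption like "Figure 1. (A) Cells expressing GFP. (B) Cells labeled with 4G3 antibody.", this pairs "A" with the first sentence and "B" with the second.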

Image pointers share all entities in their “scope”. Entities are assigned to panels based on matches of image pointers to annotations in panels.

Outline • Details on caption understanding • Baseline hand-coded methods • Learning methods • Experimental results

Task • Identify image pointers in captions. • Classify image pointers: • bullet-style, citation-style, or NP-style • E.g., “Panels A and C show the …” • Won’t talk about scoping • Will focus first on extracting image pointers—i.e., binary classification of substrings “is this an image pointer” • Data: 100 captions from 100 papers—about 600 positive examples.

Baseline methods • Labeled 100 sample figure captions. • HANDCODE-1: patterns like (A), (B-E), (c and d), etc. • HANDCODE-2: all short parenthesized expressions & patterns like “panel A” or “in B-C” Some plausible tricks (like filtering HC-2) don’t help much…
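As a rough illustration of a HANDCODE-1 style pattern, the regex below matches parenthesized single letters joined by "-", "," or "and", covering the three examples on the slide. The actual hand-coded patterns are not given in the slides, so this regex is an assumption.

```python
import re
from typing import List

# Assumption: one letter, optionally followed by more letters joined by
# "-", "," or "and", all inside parentheses: (A), (B-E), (c and d).
HANDCODE1 = re.compile(r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)")

def find_handcode1(caption: str) -> List[str]:
    """Return every substring of the caption matching the pattern."""
    return [m.group(0) for m in HANDCODE1.finditer(caption)]
```

Note that a short parenthesized phrase such as "(left panel)" is not matched here; it would only be proposed by a looser HANDCODE-2 style rule.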

How hard is the problem? Some citation-style image pointers

How hard is the problem? NP-style non-image pointers The difficulty of the task suggests using a learning approach

Another use of propose-and-filter Note that Hand-Code2 (recall 98%) is a natural candidate generator. We’ll start with “off the shelf” features… Candidate Generator Candidate phrase Learned filter Extracted phrase

Learning methods: boosting Generalized version of AdaBoost (Singer & Schapire, 99). Allows “real-valued” predictions for each “base hypothesis”, including a value of zero.

Learning methods: boosting rules • Weak learner: to find weak hypothesis t: • Split the data into Growing and Pruning sets • Let R_t be an empty conjunction • Greedily add conditions to R_t, guided by the Growing set • Greedily remove conditions from R_t, guided by the Pruning set • Convert R_t to a weak hypothesis • Constraint: W+ > W-, where the caret denotes smoothing
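The formulas did not survive on this slide. As a hedged sketch, the functions below follow the published SLIPPER description (Cohen & Singer, 1999) as best recalled: W+ and W- are the total weights of positive and negative examples covered by the rule, the growing step favors rules with large sqrt(W+) - sqrt(W-), and the rule's confidence is a smoothed log-ratio. Treat the exact forms, and the function names, as assumptions rather than a transcription of the slide.

```python
import math

def grow_objective(w_plus: float, w_minus: float) -> float:
    """Assumed growing criterion: larger means a better rule."""
    return math.sqrt(w_plus) - math.sqrt(w_minus)

def rule_confidence(w_plus: float, w_minus: float, n: int) -> float:
    """Assumed smoothed confidence assigned to examples the rule covers.

    `n` is the number of training examples; the 1/(2n) term is the
    smoothing that keeps the ratio finite when w_minus is zero.
    """
    eps = 1.0 / (2 * n)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))
```

Under this reading, the constraint W+ > W- is what makes the confidence positive, so a rule only ever votes for the positive class and abstains (predicts zero) on examples it does not cover.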

Learning methods: boosting rules SLIPPER also produces fairly compact rule sets.

Learning methods: BWI • Boosted wrapper induction (BWI) learns to extract substrings from a document. • Learns three concepts: firstToken(x), lastToken(x), substringLength(k) • Conditions are tests on tokens before/after x • E.g., tok_{i-2}=‘from’, isNumber(tok_{i+1}) • SLIPPER weak learner, no pruning. • Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size. • Good results in (Kushmerick and Freitag, 2000)

Learning methods: ABWI • “Almost boosted wrapper induction” (ABWI) learns to extract substrings: • Learns to filter candidate substrings (HandCode2) • Conditions are the same tests on tokens near x: • E.g., tok_{i-2}=‘from’, isNumber(tok_{i+1}) • SLIPPER weak learner, no pruning. • Greedy search extends window size by any amount, uses no lookahead, has a fixed limit on window size. • Optimal window sizes for this problem seem to be small…

Learning methods • Features: W tokens before/after, all tokens inside. • Learner: 100 rounds of boosting conjunctions of feature tests • Inspired by BWI (Freitag & Kushmerick) • Implemented with the SLIPPER learner
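A minimal sketch of the feature set described above, assuming a candidate is a token span `[start, end)` and features are encoded as strings. The encoding (`tok[-1]=...`, `inside=...`) and the function name are illustrative assumptions.

```python
from typing import List, Set

def window_features(tokens: List[str], start: int, end: int, w: int = 2) -> Set[str]:
    """Features for the candidate tokens[start:end] with window size w."""
    feats: Set[str] = set()
    for k in range(1, w + 1):
        # k-th token before the candidate, if it exists.
        if start - k >= 0:
            feats.add(f"tok[-{k}]={tokens[start - k]}")
        # k-th token after the candidate, if it exists.
        if end - 1 + k < len(tokens):
            feats.add(f"tok[+{k}]={tokens[end - 1 + k]}")
    # All tokens inside the candidate, regardless of position.
    for tok in tokens[start:end]:
        feats.add(f"inside={tok}")
    return feats
```

A boosted rule is then a conjunction of tests on these features, for example "tok[-1]=in and inside=A".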

Other learning methods All learning methods are competitive with hand-coded methods

Additional features • Check if candidate contains certain “special” substrings: • Matches color name: labeled color • Matches HANDCODE-1 pattern: handcode1 • Matches “mm”, “mg”, etc: measure • Matches 1980,…,2003, “et al”: citation • Matches “top”, “left”, etc: place • Added “sentence boundary” substrings: • Feature is “distance to boundary”.
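The "special substring" checks can be sketched as below. The word lists are assumptions that extend the slide's examples ("mm", "mg", "top", "left"), and the HANDCODE-1 regex is an illustrative guess, not the authors' pattern. The sentence-boundary distance feature is not covered here.

```python
import re
from typing import Set

# Assumed vocabularies; the slides only give a few examples of each.
COLORS = {"red", "green", "blue", "yellow"}
MEASURES = {"mm", "mg", "ml", "nm"}
PLACES = {"top", "bottom", "left", "right"}
YEAR = re.compile(r"\b(?:19[89]\d|200[0-3])\b")  # 1980 through 2003
ET_AL = re.compile(r"\bet al\b")
HANDCODE1 = re.compile(r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)")

def special_features(candidate: str) -> Set[str]:
    """Return the names of the special-substring features that fire."""
    lower = candidate.lower()
    words = set(re.findall(r"[a-z]+", lower))
    feats: Set[str] = set()
    if words & COLORS:
        feats.add("color")
    if HANDCODE1.fullmatch(candidate):
        feats.add("handcode1")
    if words & MEASURES:
        feats.add("measure")
    if YEAR.search(candidate) or ET_AL.search(lower):
        feats.add("citation")
    if words & PLACES:
        feats.add("place")
    return feats
```

Several of these features (citation, measure) mostly mark candidates that are not image pointers, which is why a learner restricted to positively correlated patterns gains little from them.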

Learning with expanded feature set Many new features are inversely correlated with class (e.g. citation), but ABWI looks only for positively-correlated patterns.

Learning with expanded feature set SABWI is a symmetric version of ABWI: can use rules and/or conditions negatively or positively correlated with the class

Task • Identify image pointers in captions. • Classify image pointers: • bullet-style, citation-style, or NP-style • Combine these to get a four-class problem: • bullet-style, citation-style, NP-style, or other • No hand-coded baseline methods

Four-class extraction results

Further improvement is probable with additional labeled data
