Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld
Hidden Markov Model Structures
• Machine learning tool applied to Information Extraction
• Part-of-speech tagging (Kupiec 1992)
• Topic detection & tracking (Yamron et al. 1998)
• Dialog act modeling (Stolcke, Shriberg, et al. 1998)
HMM in Information Extraction
• Gene names and locations (Leek 1997)
• Named-entity extraction (the Nymble system, Bikel et al. 1997; Freitag & McCallum 1999)
• Information Extraction strategy:
• 1 HMM = 1 field
• 1 state / class
• Models hand-built using human inspection of the data
HMM Advantages
• Strong statistical foundations
• Widely used in Natural Language Processing
• Handles new data robustly
• Uses established training algorithms that are computationally efficient to develop and evaluate
HMM Disadvantages
• Requires an a priori notion of the model topology
• Needs large amounts of training data
Authors’ Contribution
• Automatically determined model structure from data
• One HMM to extract all information
• Introduced DISTANTLY-LABELED DATA
OUTLINE
• Information Extraction basics with HMM
• Learning model structure from data
• Training data
• Experiment results
• Model selection
• Error breakdown
• Conclusions
• Future work
Information Extraction basics with HMM
• OBJECTIVE – label every word of a CS research paper header
• Classes: title, author, date, keyword, etc.
• One HMM models a whole header
• Each header is one path from the initial state to the final state
Discrete output, First-order HMM
• Q – set of states
• qI – initial state
• qF – final state
• Σ = {σ1, σ2, . . ., σm} – discrete output vocabulary
• x = x1 x2 . . . xl – output string
PROCESS
• Start in the initial state qI, transition to a new state, emit an output symbol
• Transition to another state, emit another output symbol
• . . . continue until the FINAL STATE qF is reached
PARAMETERS
• P(q → q′) – transition probabilities
• P(q ↑ σ) – emission probabilities
The probability of string x being emitted by an HMM M is computed as a sum over all possible paths where q0 and ql+1 are restricted to be qI and qF respectively, and xl+1 is an end-of-string token (uses Forward algorithm)
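The equation itself was an image on the slide; the reconstruction below follows the description above and the paper's notation (a sum over all state paths of the products of transition and emission probabilities):

P(x \mid M) = \sum_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k), \qquad q_0 = q_I,\; q_{l+1} = q_F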
The output is observable, but the underlying state sequence is HIDDEN
To recover the state sequence with the highest probability V(x|M) of having produced the observation sequence (uses the Viterbi algorithm)
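The Viterbi quantity replaces the sum above with a maximization over state paths (same caveat: reconstructed rather than copied from the slide):

V(x \mid M) = \max_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)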
HMM application
• Each state has a class (e.g. title, author)
• Each word in the header is an observation
• Each state emits header words tagged with its CLASS TAG
• The model is learned from TRAINING DATA (toy decoding sketch below)
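To make the tagging step concrete, here is a toy Python sketch: a two-state model (title, author) with invented probabilities, decoded with the Viterbi algorithm so that every header word receives a class tag. None of the states, vocabulary, or numbers come from the paper; they only illustrate the mechanism.

# Illustrative sketch only: a toy discrete HMM over header words, decoded with
# Viterbi.  States, vocabulary, and probabilities are invented for the example.
import math

states = ["title", "author"]                     # hypothetical class states
start_p = {"title": 0.8, "author": 0.2}          # P(qI -> q)
trans_p = {"title": {"title": 0.7, "author": 0.3},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.3, "hidden": 0.3, "markov": 0.3,
                    "kristie": 0.05, "seymore": 0.05},
          "author": {"learning": 0.05, "hidden": 0.05, "markov": 0.05,
                     "kristie": 0.4, "seymore": 0.45}}

def viterbi(words):
    """Return the most probable class tag for each word (log-space Viterbi)."""
    # delta[state] = best log-probability of any path ending in `state`
    delta = {s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}
    back = []                                    # back-pointers, one dict per word after the first
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans_p[p][s]))
            new_delta[s] = delta[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][w])
            ptr[s] = prev
        delta, back = new_delta, back + [ptr]
    # Recover the best path by following back-pointers from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return list(zip(words, path))

print(viterbi(["learning", "hidden", "markov", "kristie", "seymore"]))
# -> [('learning', 'title'), ('hidden', 'title'), ('markov', 'title'),
#     ('kristie', 'author'), ('seymore', 'author')]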
Learning model structure from data
• Decide on states and their associated transitions
• Set up labeled training data
• Use MERGE techniques (a simplified merge sketch follows this list):
• Neighbor merging – collapse adjacent states that share a label (e.g. link all adjacent words of a title into one title state)
• V-merging – merge 2 states with the same label that share transitions (e.g. one transition into title and one out)
• Apply Bayesian model merging to maximize result accuracy
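A minimal sketch of what merging two same-label states amounts to in practice: pool their emission and transition counts and redirect incoming transitions. The dictionary layout, function, and toy numbers below are invented for illustration; they are not the paper's implementation.

# Illustrative sketch: merging two HMM states that share a class label by
# pooling their counts.  Data structures are invented for the example.
from collections import Counter

def merge_states(model, a, b, merged):
    """Replace states `a` and `b` (same class label) with a single state `merged`."""
    # model maps state name -> {"label": str, "emit": Counter, "trans": Counter}
    assert model[a]["label"] == model[b]["label"], "only same-label states are merged"
    model[merged] = {
        "label": model[a]["label"],
        "emit": model[a]["emit"] + model[b]["emit"],     # pooled emission counts
        "trans": model[a]["trans"] + model[b]["trans"],  # pooled outgoing transition counts
    }
    del model[a], model[b]
    # Redirect transitions that pointed at a or b so they point at the merged state
    for state in model.values():
        for old in (a, b):
            if old in state["trans"]:
                state["trans"][merged] += state["trans"].pop(old)
    return model

# Toy usage: two "title" states collapse into one
m = {
    "title_1": {"label": "title", "emit": Counter({"learning": 2}), "trans": Counter({"title_2": 2})},
    "title_2": {"label": "title", "emit": Counter({"hidden": 1}), "trans": Counter({"author_1": 1})},
    "author_1": {"label": "author", "emit": Counter({"seymore": 1}), "trans": Counter()},
}
merge_states(m, "title_1", "title_2", "title")
print(sorted(m))   # -> ['author_1', 'title']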
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
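In symbols, the merging procedure greedily searches for the model that maximizes the posterior (standard Bayesian model merging criterion; the prior P(M) favors smaller models):

M^{*} = \arg\max_{M} P(M \mid D) = \arg\max_{M} P(D \mid M)\, P(M)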
Three types of training data
• Labeled data
• Unlabeled data
• Distantly-labeled data
Labeled data
• Manually produced and expensive
• Provides counts c() from which the model parameters are estimated
Formulas for deriving parameters using counts c(): (4) transition probabilities and (5) emission probabilities (given below)
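The two equations were images on the slide; reconstructed here as the usual maximum-likelihood ratios of counts:

(4)  P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}

(5)  P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}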
Unlabeled Data
• Needs initial parameter estimates from labeled data
• Trained with the Baum-Welch algorithm
• An iterative expectation-maximization (EM) algorithm that adjusts model parameters to locally maximize the likelihood of the unlabeled data (objective sketched below)
• Sensitive to initial parameters
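Stated as an objective (added here for clarity; not on the slide), Baum-Welch climbs toward a local maximum of the unlabeled-data likelihood:

\hat{\theta} = \arg\max_{\theta} \; \sum_{x \in U} \log P(x \mid M_{\theta}) \quad \text{(local maximum only)}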
Distantly-labeled data
• Data labeled for another purpose
• Partially applicable to this domain for training
• EXAMPLE – for CS research paper headers, labeled BibTeX bibliographic citations
Experiment results
• Prepare the text with a preprocessing program (sketched below)
• Header = from the beginning of the paper to INTRODUCTION or the end of the 1st page
• Remove punctuation, case, & newlines
• Label markers:
• +ABSTRACT+ – abstract
• +INTRO+ – introduction
• +PAGE+ – end of 1st page
• Manually label 1000 headers (65 discarded due to poor formatting)
• Derive fixed word vocabularies from the training data
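A rough sketch of the kind of preprocessing the slide describes (lowercasing, stripping punctuation and newlines, keeping the +MARKER+ tokens); the exact rules in the authors' program are not given, so this is an approximation:

# Rough approximation of the header preprocessing described above:
# lowercase, strip punctuation and newlines, keep special +MARKER+ tokens.
import re

def clean_header(raw):
    tokens = []
    for tok in raw.split():
        if re.fullmatch(r"\+[A-Z]+\+", tok):         # keep +ABSTRACT+, +INTRO+, +PAGE+ as-is
            tokens.append(tok)
        else:
            tok = re.sub(r"[^\w]", "", tok.lower())  # drop punctuation, lowercase
            if tok:
                tokens.append(tok)
    return tokens

print(clean_header("Learning Hidden Markov Model Structure,\nby K. Seymore +ABSTRACT+"))
# -> ['learning', 'hidden', 'markov', 'model', 'structure', 'by', 'k', 'seymore', '+ABSTRACT+']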
Model selection
• MODELS 1–4 – 1 state / class
• MODEL 1 – fully connected HMM with uniform transition estimates between states
• MODEL 2 – maximum-likelihood estimates for transitions seen in training, uniform for the rest
• MODEL 3 – maximum-likelihood transition estimates only (the BASELINE model)
• MODEL 4 – adds smoothing, so no transition has zero probability
ACCURACY OF MODELS (% word classification accuracy)
• L – labeled data
• L+D – labeled and distantly-labeled data
[results table not reproduced in the slide text]
Multiple states / class
• hand distantly-labeled
• automatic distantly-labeled
[results chart not reproduced in the slide text]
UNLABELED DATA & TRAINING
• INITIAL – L + D + U
• λ = 0.5 – 0.5 weight on each emission distribution
• λ varies – optimum distribution weight
• PP includes smoothing
Error breakdown
• Errors broken down by CLASS TAG
• BOLD – distantly-labeled data tags
Conclusions
• HMMs work well for extracting research paper header fields
• Improvement factors:
• Multi-state classes
• Distantly-labeled data (10% improvement)
• Distantly-labeled data can reduce the amount of hand-labeled data needed
Future work
• Use Bayesian model merging to completely automate model learning
• Also describe layout by position on the page
• Model internal state structure
Model of Internal State Structure
• First 2 words – explicit
• Multiple affiliations possible
• Last 2 words – explicit
My Assessment
• Highly mathematical and complex
• Even the unlabeled data arrives in a preset order
• The model requires substantial work to set up the training data
• A change in the target data would completely change the model
• Valuable experiments showing how heuristics and smoothing affect the results
• Wish they had included a sample 1st page