Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld
Hidden Markov Model Structures
• Machine learning tool applied to Information Extraction
• Part-of-speech tagging (Kupiec 1992)
• Topic detection & tracking (Yamron et al. 1998)
• Dialog act modeling (Stolcke, Shriberg, et al. 1998)
HMM in Information Extraction
• Gene names and locations (Leek 1997)
• Named-entity extraction (the Nymble system, Bikel et al. 1997; Freitag & McCallum 1999)
• Information Extraction strategy:
• 1 HMM = 1 field
• 1 state / class
• Models hand-built using human inspection of the data
HMM Advantages
• Strong statistical foundations
• Widely used in Natural Language Processing
• Handles new data robustly
• Uses established training algorithms that are computationally efficient to develop and evaluate
HMM Disadvantages
• Requires an a priori notion of the model topology
• Needs large amounts of training data
Authors’ Contribution
• Automatically determined model structure from data
• One HMM to extract all information
• Introduced DISTANTLY-LABELED DATA
OUTLINE
• Information Extraction basics with HMM
• Learning model structure from data
• Training data
• Experiment results
• Model selection
• Error breakdown
• Conclusions
• Future work
Information Extraction basics with HMM
• OBJECTIVE – label every word of a CS research paper header
• Classes: title, author, date, keyword, etc.
• One HMM models a whole header
• Each header is one path from the initial state to the final state
Discrete output, First-order HMM
• Q – set of states
• qI – initial state
• qF – final state
• Σ = {σ1, σ2, . . ., σm} – discrete output vocabulary
• x = x1 x2 . . . xl – output string
PROCESS
• Start in the initial state qI, transition to a new state, emit an output symbol
• Transition to another state, emit another output symbol
• . . . continue until the FINAL STATE qF is reached
PARAMETERS
• P(q → q′) – transition probabilities
• P(q ↑ σ) – emission probabilities
The probability of string x being emitted by an HMM M is computed as a sum over all possible paths where q0 and ql+1 are restricted to be qI and qF respectively, and xl+1 is an end-of-string token (uses Forward algorithm)
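The equation itself was an image on the slide; the reconstruction below follows the description above and the paper's notation (a sum over all state paths of the products of transition and emission probabilities):

P(x \mid M) = \sum_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k), \qquad q_0 = q_I,\; q_{l+1} = q_F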
The output is observable, but the underlying state sequence is HIDDEN
To recover the state sequence with the highest probability V(x|M) of having produced the observation sequence (uses the Viterbi algorithm)
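The Viterbi quantity replaces the sum above with a maximization over state paths (same caveat: reconstructed rather than copied from the slide):

V(x \mid M) = \max_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)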
HMM application
• Each state has a class (e.g. title, author)
• Each word in the header is an observation
• Each state emits header words tagged with its CLASS TAG
• The model is learned from TRAINING DATA (toy decoding sketch below)
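To make the tagging step concrete, here is a toy Python sketch: a two-state model (title, author) with invented probabilities, decoded with the Viterbi algorithm so that every header word receives a class tag. None of the states, vocabulary, or numbers come from the paper; they only illustrate the mechanism.

# Illustrative sketch only: a toy discrete HMM over header words, decoded with
# Viterbi.  States, vocabulary, and probabilities are invented for the example.
import math

states = ["title", "author"]                     # hypothetical class states
start_p = {"title": 0.8, "author": 0.2}          # P(qI -> q)
trans_p = {"title": {"title": 0.7, "author": 0.3},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.3, "hidden": 0.3, "markov": 0.3,
                    "kristie": 0.05, "seymore": 0.05},
          "author": {"learning": 0.05, "hidden": 0.05, "markov": 0.05,
                     "kristie": 0.4, "seymore": 0.45}}

def viterbi(words):
    """Return the most probable class tag for each word (log-space Viterbi)."""
    # delta[state] = best log-probability of any path ending in `state`
    delta = {s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}
    back = []                                    # back-pointers, one dict per word after the first
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans_p[p][s]))
            new_delta[s] = delta[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][w])
            ptr[s] = prev
        delta, back = new_delta, back + [ptr]
    # Recover the best path by following back-pointers from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return list(zip(words, path))

print(viterbi(["learning", "hidden", "markov", "kristie", "seymore"]))
# -> [('learning', 'title'), ('hidden', 'title'), ('markov', 'title'),
#     ('kristie', 'author'), ('seymore', 'author')]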
Learning model structure from data
• Decide on states and their associated transitions
• Set up labeled training data
• Use MERGE techniques (a simplified merge sketch follows this list):
• Neighbor merging – collapse adjacent states that share a label (e.g. link all adjacent words of a title into one title state)
• V-merging – merge 2 states with the same label that share transitions (e.g. one transition into title and one out)
• Apply Bayesian model merging to maximize result accuracy
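A minimal sketch of what merging two same-label states amounts to in practice: pool their emission and transition counts and redirect incoming transitions. The dictionary layout, function, and toy numbers below are invented for illustration; they are not the paper's implementation.

# Illustrative sketch: merging two HMM states that share a class label by
# pooling their counts.  Data structures are invented for the example.
from collections import Counter

def merge_states(model, a, b, merged):
    """Replace states `a` and `b` (same class label) with a single state `merged`."""
    # model maps state name -> {"label": str, "emit": Counter, "trans": Counter}
    assert model[a]["label"] == model[b]["label"], "only same-label states are merged"
    model[merged] = {
        "label": model[a]["label"],
        "emit": model[a]["emit"] + model[b]["emit"],     # pooled emission counts
        "trans": model[a]["trans"] + model[b]["trans"],  # pooled outgoing transition counts
    }
    del model[a], model[b]
    # Redirect transitions that pointed at a or b so they point at the merged state
    for state in model.values():
        for old in (a, b):
            if old in state["trans"]:
                state["trans"][merged] += state["trans"].pop(old)
    return model

# Toy usage: two "title" states collapse into one
m = {
    "title_1": {"label": "title", "emit": Counter({"learning": 2}), "trans": Counter({"title_2": 2})},
    "title_2": {"label": "title", "emit": Counter({"hidden": 1}), "trans": Counter({"author_1": 1})},
    "author_1": {"label": "author", "emit": Counter({"seymore": 1}), "trans": Counter()},
}
merge_states(m, "title_1", "title_2", "title")
print(sorted(m))   # -> ['author_1', 'title']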
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
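In symbols, the merging procedure greedily searches for the model that maximizes the posterior (standard Bayesian model merging criterion; the prior P(M) favors smaller models):

M^{*} = \arg\max_{M} P(M \mid D) = \arg\max_{M} P(D \mid M)\, P(M)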
Three types of training data
• Labeled data
• Unlabeled data
• Distantly-labeled data
Labeled data
• Manually produced and expensive
• Provides counts c() from which the model parameters are estimated
Formulas for deriving parameters using counts c(): (4) transition probabilities and (5) emission probabilities (given below)
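The two equations were images on the slide; reconstructed here as the usual maximum-likelihood ratios of counts:

(4)  P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}

(5)  P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}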
Unlabeled Data
• Needs initial parameter estimates from labeled data
• Trained with the Baum-Welch algorithm
• An iterative expectation-maximization (EM) algorithm that adjusts model parameters to locally maximize the likelihood of the unlabeled data (objective sketched below)
• Sensitive to initial parameters
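Stated as an objective (added here for clarity; not on the slide), Baum-Welch climbs toward a local maximum of the unlabeled-data likelihood:

\hat{\theta} = \arg\max_{\theta} \; \sum_{x \in U} \log P(x \mid M_{\theta}) \quad \text{(local maximum only)}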
Distantly-labeled data
• Data labeled for another purpose
• Partially applicable to this domain for training
• EXAMPLE – for CS research paper headers, labeled BibTeX bibliographic citations
Experiment results
• Prepare the text with a preprocessing program (sketched below)
• Header = from the beginning of the paper to INTRODUCTION or the end of the 1st page
• Remove punctuation, case, & newlines
• Label markers:
• +ABSTRACT+ – abstract
• +INTRO+ – introduction
• +PAGE+ – end of 1st page
• Manually label 1000 headers (65 discarded due to poor formatting)
• Derive fixed word vocabularies from the training data
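A rough sketch of the kind of preprocessing the slide describes (lowercasing, stripping punctuation and newlines, keeping the +MARKER+ tokens); the exact rules in the authors' program are not given, so this is an approximation:

# Rough approximation of the header preprocessing described above:
# lowercase, strip punctuation and newlines, keep special +MARKER+ tokens.
import re

def clean_header(raw):
    tokens = []
    for tok in raw.split():
        if re.fullmatch(r"\+[A-Z]+\+", tok):         # keep +ABSTRACT+, +INTRO+, +PAGE+ as-is
            tokens.append(tok)
        else:
            tok = re.sub(r"[^\w]", "", tok.lower())  # drop punctuation, lowercase
            if tok:
                tokens.append(tok)
    return tokens

print(clean_header("Learning Hidden Markov Model Structure,\nby K. Seymore +ABSTRACT+"))
# -> ['learning', 'hidden', 'markov', 'model', 'structure', 'by', 'k', 'seymore', '+ABSTRACT+']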
Model selection
• MODELS 1–4 – 1 state / class
• MODEL 1 – fully connected HMM with uniform transition estimates between states
• MODEL 2 – maximum-likelihood estimates for transitions seen in training, uniform for the rest
• MODEL 3 – maximum-likelihood transition estimates only (the BASELINE model)
• MODEL 4 – adds smoothing, so no transition has zero probability
ACCURACY OF MODELS (% word classification accuracy)
• L – labeled data
• L+D – labeled and distantly-labeled data
[results table not reproduced in the slide text]
Multiple states / class
• hand distantly-labeled
• automatic distantly-labeled
[results chart not reproduced in the slide text]
UNLABELED DATA & TRAINING
• INITIAL – L + D + U
• λ = 0.5 – 0.5 weight on each emission distribution
• λ varies – optimum distribution weight
• PP includes smoothing
Error breakdown
• Errors broken down by CLASS TAG
• BOLD – distantly-labeled data tags
Conclusions
• HMMs work well for extracting research paper header fields
• Improvement factors:
• Multi-state classes
• Distantly-labeled data (10% improvement)
• Distantly-labeled data can reduce the amount of hand-labeled data needed
Future work
• Use Bayesian model merging to completely automate model learning
• Also describe layout by position on the page
• Model internal state structure
Model of Internal State Structure
• First 2 words – explicit
• Multiple affiliations possible
• Last 2 words – explicit
My Assessment
• Highly mathematical and complex
• Even the unlabeled data arrives in a preset order
• The model requires substantial work to set up the training data
• A change in the target data would completely change the model
• Valuable experiments showing how heuristics and smoothing affect the results
• Wish they had included a sample 1st page