
Learning Hidden Markov Model Structure for Information Extraction


Presentation Transcript


  1. Learning Hidden Markov Model Structure for Information Extraction Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld

  2. Hidden Markov Model Structures • Machine learning tool applied to Information Extraction • Part of speech tagging (Kupiec 1992) • Topic detection & tracking (Yamron et al 1998) • Dialog act modeling (Stolcke, Shriberg, & others 1998)

  3. HMM in Information Extraction • Gene names and locations (Leek 1997) • Named-entity extraction (Nymble system – Bikel et al. 1997) • Information Extraction Strategy • 1 HMM = 1 field • 1 state / class • Hand-built models based on human inspection of the data

  4. HMM Advantages • Strong statistical foundations • Widely and successfully used in natural language processing • Handles new data robustly • Uses established training algorithms that are computationally efficient to develop and evaluate

  5. HMM Disadvantages • Require an a priori notion of the model topology • Need large amounts of training data

  6. Authors’ Contribution • Automatically determined model structure from data • One HMM to extract all information • Introduced DISTANTLY-LABELED DATA

  7. OUTLINE • Information Extraction basics with HMM • Learning model structure from data • Training data • Experiment results • Model selection • Error breakdown • Conclusions • Future work

  8. Information Extraction basics with HMM • OBJECTIVE – label every word of a CS research paper header with its class • Title • Author • Date • Keyword • Etc. • One pass through the HMM per header • Initial state to final state

  9. Discrete output, first-order HMM • Q – set of states • qI – initial state • qF – final state • ∑ = {σ1, σ2, . . . , σm} – discrete output vocabulary • x = x1 x2 . . . xl – output string PROCESS • Start in the initial state -> transition to a new state -> emit an output symbol -> transition to another state -> emit another output symbol -> . . . until the FINAL STATE is reached PARAMETERS • P(q -> q’) – transition probabilities • P(q ↑ σ) – emission probabilities
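
As a concrete illustration of this definition (not taken from the paper), a minimal discrete-output, first-order HMM can be written down as explicit transition and emission tables and then sampled from. The state names, vocabulary, and probabilities below are invented for the example.

```python
# A minimal sketch of a discrete-output, first-order HMM as defined on this
# slide. The states, vocabulary, and probabilities are invented for
# illustration; they are not the model from the paper.
import random

# P(q -> q'): transition probabilities, including the initial and final states.
trans = {
    "q_I":    {"title": 1.0},
    "title":  {"title": 0.6, "author": 0.3, "q_F": 0.1},
    "author": {"author": 0.7, "q_F": 0.3},
}

# P(q ^ sigma): emission probabilities for each emitting state.
emit = {
    "title":  {"learning": 0.4, "hidden": 0.3, "markov": 0.3},
    "author": {"kristie": 0.5, "andrew": 0.5},
}

def sample(trans, emit):
    """Walk from q_I to q_F, emitting one symbol per visited state."""
    state, output = "q_I", []
    while True:
        nxt = random.choices(list(trans[state]), weights=list(trans[state].values()))[0]
        if nxt == "q_F":
            return output
        output.append(random.choices(list(emit[nxt]), weights=list(emit[nxt].values()))[0])
        state = nxt

print(sample(trans, emit))
```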

  10. The probability of string x being emitted by an HMM M is computed as a sum over all possible state paths: P(x \mid M) = \sum_{q_1, \ldots, q_l \in Q^l} \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k), where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (computed with the Forward algorithm)
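
A compact sketch of the forward computation the slide refers to. The dict-based model layout (trans, emit, with special q_I and q_F entries) and the toy numbers are my own choices for illustration, not the paper's implementation.

```python
# A minimal forward-algorithm sketch for computing P(x | M), summing over all
# state paths from q_I to q_F. Model layout and toy parameters are illustrative.
trans = {
    "q_I":    {"title": 1.0},
    "title":  {"title": 0.6, "author": 0.3, "q_F": 0.1},
    "author": {"author": 0.7, "q_F": 0.3},
}
emit = {
    "title":  {"learning": 0.4, "hidden": 0.3, "markov": 0.3},
    "author": {"kristie": 0.5, "andrew": 0.5},
}

def forward_probability(x, trans, emit):
    # alpha[q] = probability of emitting x[:t+1] along some path now sitting in q
    alpha = {q: p * emit[q].get(x[0], 0.0)
             for q, p in trans["q_I"].items() if q != "q_F"}
    for symbol in x[1:]:
        new_alpha = {}
        for q, a in alpha.items():
            for q2, p in trans[q].items():
                if q2 != "q_F":
                    new_alpha[q2] = (new_alpha.get(q2, 0.0)
                                     + a * p * emit[q2].get(symbol, 0.0))
        alpha = new_alpha
    # the last step is the end-of-string transition into the final state q_F
    return sum(a * trans[q].get("q_F", 0.0) for q, a in alpha.items())

print(forward_probability(["learning", "kristie"], trans, emit))
```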

  11. The output is observable, but the underlying state sequence is HIDDEN

  12. The state sequence V(x|M) that has the highest probability of having produced the observation sequence is recovered with the Viterbi algorithm
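
A minimal Viterbi sketch for recovering the most likely hidden state path. As above, the model layout and toy parameters are illustrative choices of mine, not the paper's code.

```python
# A minimal Viterbi sketch: recover the most likely hidden state sequence for
# an observed word sequence. Model layout and toy parameters are illustrative.
trans = {
    "q_I":    {"title": 1.0},
    "title":  {"title": 0.6, "author": 0.3, "q_F": 0.1},
    "author": {"author": 0.7, "q_F": 0.3},
}
emit = {
    "title":  {"learning": 0.4, "hidden": 0.3, "markov": 0.3},
    "author": {"kristie": 0.5, "andrew": 0.5},
}

def viterbi(x, trans, emit):
    # delta[q] = (best probability of any path ending in q, that path)
    delta = {q: (p * emit[q].get(x[0], 0.0), [q])
             for q, p in trans["q_I"].items() if q != "q_F"}
    for symbol in x[1:]:
        new_delta = {}
        for q, (prob, path) in delta.items():
            for q2, p in trans[q].items():
                if q2 == "q_F":
                    continue
                cand = prob * p * emit[q2].get(symbol, 0.0)
                if cand > new_delta.get(q2, (0.0, None))[0]:
                    new_delta[q2] = (cand, path + [q2])
        delta = new_delta
    # close the path with the transition into the final state
    best_q = max(delta, key=lambda q: delta[q][0] * trans[q].get("q_F", 0.0))
    return delta[best_q][1]

print(viterbi(["learning", "hidden", "kristie"], trans, emit))
```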

  13. HMM application • Each state has a class (e.g. title, author) • Each word in the header is an observation • Each state emits words from the header with its associated CLASS TAG • This is learned from TRAINING DATA

  14. Learning model structure from data • Decide on the states and the transitions between them • Set up labeled training data • Use MERGE techniques (see the sketch below) • Neighbor merging (collapse runs of adjacent words with the same label, e.g. all adjacent title words, into one state) • V-merging (merge two states with the same label that share a transition to or from a common state, e.g. a single transition into and out of the title state) • Apply Bayesian model merging to maximize the resulting accuracy
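
The sketch below illustrates the neighbor-merging idea only in simplified form: one state per labeled word, with adjacent same-label states collapsed. The paper's actual procedure also pools transition and emission counts during merging; this is just an illustration.

```python
# A simplified sketch of "neighbor merging": start from one state per labeled
# word, then collapse runs of adjacent states that share a label into a single
# state (which would get a self-transition). Illustrative only, not the paper's
# exact procedure.
def neighbor_merge(labeled_header):
    """labeled_header: list of (word, label) pairs for one header."""
    merged = []  # list of (label, words emitted by that merged state)
    for word, label in labeled_header:
        if merged and merged[-1][0] == label:
            merged[-1][1].append(word)      # same label as previous state: merge
        else:
            merged.append((label, [word]))  # new state for a new label
    return merged

header = [("learning", "title"), ("hidden", "title"), ("markov", "title"),
          ("kristie", "author"), ("andrew", "author")]
print(neighbor_merge(header))
# -> [('title', ['learning', 'hidden', 'markov']), ('author', ['kristie', 'andrew'])]
```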

  15. Example Hidden Markov Model

  16. Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
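
In symbols, the objective this slide describes is the posterior over model structures; the following is the standard statement of Bayesian model merging rather than a formula quoted from the slide.

```latex
% Bayesian model merging objective (standard form, not quoted from the slide):
% choose the structure M that maximizes the posterior given training data D;
% the prior P(M) favors smaller models, while P(D | M) rewards fit to the data.
M^{*} = \arg\max_{M} P(M \mid D)
      = \arg\max_{M} \frac{P(D \mid M)\, P(M)}{P(D)}
      = \arg\max_{M} P(D \mid M)\, P(M)
```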

  17. Three types of training data • Labeled data • Unlabeled data • Distantly-labeled data

  18. Labeled data • Manually labeled, and therefore expensive • Provides the counts c() from which model parameters are estimated

  19. Formulas for deriving parameters using counts c(): (4) transition probabilities, (5) emission probabilities
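
The formulas labeled (4) and (5) are not reproduced in this transcript; they almost certainly correspond to the standard ratio-of-counts maximum-likelihood estimates, reconstructed here rather than copied from the slide.

```latex
% Maximum-likelihood parameter estimates from labeled-data counts c()
% (standard ratio-of-counts form, reconstructed, not copied from the slide):
P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}
\qquad
P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}
```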

  20. Unlabeled Data • Needs initial parameter estimates from labeled data • Use the Baum-Welch training algorithm • An iterative expectation-maximization (EM) algorithm that adjusts model parameters to locally maximize the likelihood of the unlabeled data • Sensitive to initial parameters
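
To make the Baum-Welch description concrete, here is a from-scratch sketch of one re-estimation step for a discrete-output HMM. The array layout and variable names are my own; a real implementation would also handle log-space scaling, multiple sequences, and explicit start/end-state re-estimation.

```python
# A minimal sketch of one Baum-Welch (EM) re-estimation step for a
# discrete-output HMM. Array layout and names are illustrative.
import numpy as np

def forward(obs, start, trans, emit):
    T, K = len(obs), len(start)
    alpha = np.zeros((T, K))
    alpha[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha

def backward(obs, trans, emit):
    T, K = len(obs), trans.shape[0]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(obs, start, trans, emit):
    """One E-step + M-step on a single observation sequence."""
    alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood                    # per-time state posteriors
    xi = np.zeros_like(trans)                            # expected transition counts
    for t in range(len(obs) - 1):
        xi += (alpha[t][:, None] * trans
               * emit[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / likelihood
    new_trans = xi / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit)
    for t, o in enumerate(obs):
        new_emit[:, o] += gamma[t]                       # expected emission counts
    new_emit /= new_emit.sum(axis=1, keepdims=True)
    return new_trans, new_emit, likelihood

# Toy example: 2 hidden states, vocabulary of 3 symbols, one short sequence.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(baum_welch_step([0, 2, 1, 2], start, trans, emit))
```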

  21. Distantly-labeled data • Data labeled for another purpose • Partially applicable to this domain for training • EXAMPLE – labeled BibTeX bibliographic citations reused for CS research paper headers

  22. Experiment results • Headers prepared automatically by a program • Header = from the beginning of the paper to the word "Introduction" or to the end of the first page • Punctuation, case, and newlines removed or normalized • Special labels • +ABSTRACT+ Abstract • +INTRO+ Introduction • +PAGE+ End of 1st page • 1000 headers manually labeled, minus 65 discarded due to poor formatting • Fixed word vocabularies derived from the training data
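
As a rough, illustrative guess at the kind of preprocessing this slide describes (the authors' exact rules are not given in the transcript), something along these lines lowercases the text, strips punctuation, and inserts the special tokens.

```python
# A rough, illustrative guess at the header preprocessing described on this
# slide: lowercase, strip punctuation, and mark section boundaries with the
# special +ABSTRACT+ / +INTRO+ / +PAGE+ tokens. Not the authors' exact rules.
import re

def normalize_header(first_page_text: str) -> list[str]:
    text = first_page_text.lower().replace("\n", " ")
    text = re.sub(r"\babstract\b", " +ABSTRACT+ ", text)
    text = re.sub(r"\bintroduction\b", " +INTRO+ ", text)
    text = re.sub(r"[^\w+ ]", " ", text)      # drop punctuation, keep '+' markers
    return text.split() + ["+PAGE+"]          # mark the end of the first page

print(normalize_header("Learning HMM Structure\nAbstract: We present..."))
```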

  23. Sources & Amounts of Training Data

  24. Model selection • MODELS 1–4 – one state per class • MODEL 1 – fully connected HMM with uniform transition estimates between states • MODEL 2 – maximum likelihood transition estimates combined with uniform estimates for the rest • MODEL 3 – maximum likelihood estimates for all transitions (the BASELINE HMM) • MODEL 4 – adds smoothing so that no transition probability is zero

  25. ACCURACY OF MODELS (% word classification accuracy) • L – labeled data • L+D – labeled and distantly-labeled data

  26. Multiple states / class • (-) hand distantly-labeled • (+) automatic distantly-labeled

  27. Compared the BASELINE model to the best MULTI-STATE and V-MERGED models

  28. UNLABELED DATA & TRAINING • Initial: L + D + U • λ = 0.5 – each emission distribution weighted 0.5 • λ varied – optimum distribution • PP includes smoothing
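
One way to read the λ mixing on this slide (my interpretation, not a formula quoted from the transcript) is as a linear interpolation of emission distributions estimated from labeled (plus distantly-labeled) data and from unlabeled data, with λ = 0.5 weighting them equally.

```latex
% Hedged reading of the lambda mixing: linear interpolation of two emission
% distributions, P_{L+D} from labeled + distantly-labeled data and P_{U} from
% unlabeled data. Interpretation only; not copied from the slide.
P(\sigma \mid q) \;=\; \lambda \, P_{L+D}(\sigma \mid q) \;+\; (1 - \lambda) \, P_{U}(\sigma \mid q)
```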

  29. Error breakdown • Errors by CLASS TAG • BOLD – distantly-labeled data tags

  30. Conclusions • HMMs perform well on research paper headers • Improvement factors • Multi-state classes • Distantly-labeled data (10%) • Distantly-labeled data can reduce the amount of labeled data needed

  31. Future work • Use Bayesian model merging to completely automate model learning • Also describe layout by position on page • Model internal state structure

  32. Model of Internal State Structure • First 2 words – explicit • Multiple affiliations possible • Last 2 words – explicit

  33. My Assessment • Highly mathematical and complex • Even the unlabeled data is in a preset order • The model requires significant work to set up the training data • A change in the target data would completely change the model • Valuable experiments showing how heuristics and smoothing affect the results • I wish they had included a sample first page

  34. QUESTIONS
