Hidden Markov Models Applied to Information Extraction • Part I: Concept • HMM Tutorial • Part II: Sample Application • AutoBib: web information extraction Larry Reeve INFO629: Artificial Intelligence Dr. Weber, Fall 2004
Part I: Concept HMM Motivation • The real world has structures and processes that have (or produce) observable outputs • Usually sequential (the process unfolds over time) • Cannot see the events producing the output • Example: speech signals • Problem: how to construct a model of the structure or process given only observations
HMM Background • Basic theory developed and published in the 1960s and 70s • No widespread understanding and application until the late 80s • Why? • Theory was published in mathematics journals not widely read by practicing engineers • Insufficient tutorial material for readers to understand and apply the concepts
HMM Uses • Speech recognition • Recognizing spoken words and phrases • Text processing • Parsing raw records into structured records • Bioinformatics • Protein sequence prediction • Finance • Stock market forecasts (price pattern prediction) • Comparison shopping services
HMM Overview • Machine learning method • Makes use of state machines • Based on probabilistic models • Useful in problems having sequential steps • Can only observe output from states, not the states themselves • Example: speech recognition • Observe: acoustic signals • Hidden states: phonemes (distinctive sounds of a language) [Figure: state machine diagram]
Observable Markov Model Example • Weather • Once each day the weather is observed • State 1: rain • State 2: cloudy • State 3: sunny • What is the probability the weather for the next 7 days will be: sun, sun, rain, rain, sun, cloudy, sun? • Each state corresponds to a physically observable event [Figure: state transition matrix]
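A minimal sketch of this computation in Python. The transition probabilities below are illustrative values in the spirit of Rabiner's tutorial, not figures from this slide, and conditioning on today being sunny is an assumption:

```python
import numpy as np

states = ["rain", "cloudy", "sunny"]

# A[i][j] = P(tomorrow = j | today = i) -- illustrative numbers
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(seq, start):
    """P(seq | first day = start) for an observable Markov model."""
    idx = [states.index(s) for s in [start] + seq]
    p = 1.0
    for i, j in zip(idx, idx[1:]):   # multiply one transition per day
        p *= A[i][j]
    return p

# P(sun, sun, rain, rain, sun, cloudy, sun | today = sunny)
print(sequence_probability(
    ["sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"],
    start="sunny"))
```

Because every state is directly observable, the probability of a sequence is just a product of transition probabilities; no inference over hidden states is needed.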
Hidden Markov Model Example • Coin toss: • Heads, tails sequence with 2 coins • You are in a room, with a wall • Person behind wall flips coin, tells result • Coin selection and toss is hidden • Cannot observe events, only output (heads, tails) from events • Problem is then to build a model to explain observed sequence of heads and tails
HMM Components • A set of states (x's) • A set of possible output symbols (y's) • A state transition matrix (a's) • probability of making a transition from one state to the next • Output emission matrix (b's) • probability of emitting/observing a symbol at a particular state • Initial probability vector • probability of starting at a particular state • Not shown, sometimes assumed to be 1
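These components map directly onto arrays. A minimal sketch for the two-coin example; all probability values here are made-up assumptions:

```python
import numpy as np

states  = ["coin1", "coin2"]   # hidden states (x's)
symbols = ["heads", "tails"]   # observable output symbols (y's)

# State transition matrix (a's): P(next state | current state)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Output emission matrix (b's): P(symbol | state)
# e.g. coin 2 is assumed to be biased toward tails
B = np.array([[0.5, 0.5],
              [0.2, 0.8]])

# Initial probability vector: P(starting in each state)
pi = np.array([0.6, 0.4])

# Each row of A and B, and the vector pi, must sum to 1
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```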
Common HMM Types • Ergodic (fully connected): • Every state of model can be reached in a single step from every other state of the model • Bakis (left-right): • As time increases, states proceed from left to right
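The two topologies differ only in which transition-matrix entries may be nonzero; a sketch with illustrative numbers:

```python
import numpy as np

# Ergodic: every entry nonzero, so any state is reachable
# from any other in a single step
A_ergodic = np.array([[0.5, 0.3, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5]])

# Bakis (left-right): zeros below the diagonal, so the state
# index can only stay the same or increase as time advances
A_bakis = np.array([[0.6, 0.3, 0.1],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])
```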
HMM Core Problems • Three problems must be solved for HMMs to be useful in real-world applications 1) Evaluation 2) Decoding 3) Learning
HMM Evaluation Problem • Purpose: score how well a given model matches a given observation sequence • Example (speech recognition): • Assume HMMs (models) have been built for the words 'home' and 'work' • Given a speech signal, evaluation can determine the probability that each model represents the utterance
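Evaluation is solved efficiently by the forward algorithm. A minimal, unoptimized sketch (no scaling, so very long sequences will underflow):

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward algorithm: P(observation sequence | model).

    obs: list of symbol indices; A: (N, N) transition matrix;
    B: (N, M) emission matrix; pi: (N,) initial probabilities.
    """
    alpha = pi * B[:, obs[0]]           # initialize with first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction step
    return alpha.sum()                  # terminate: sum over final states
```

A word recognizer would call forward() once per word model ('home', 'work', ...) on the same observation sequence and pick the model with the highest score.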
HMM Decoding Problem • Given a model and a set of observations, what are the hidden states most likely to have generated the observations? • Useful to learn about internal model structure, determine state statistics, and so forth
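Decoding is solved by the Viterbi algorithm. A compact sketch using log probabilities to avoid underflow (zero-probability entries become -inf, which max/argmax handle correctly):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely hidden state sequence for the observations."""
    N, T = A.shape[0], len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)      # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + np.log(A) # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace back from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In the AutoBib setting of Part II, decoding is what assigns each token of a raw record to its most likely field state.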
HMM Learning Problem • Goal is to learn the HMM parameters (training) • State transition probabilities • Observation probabilities at each state • Training is crucial: • it adapts the model parameters to training data observed from real-world phenomena • No known method obtains optimal parameters from data, only approximations • Can be a bottleneck in HMM usage
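The standard approximation is the Baum-Welch (EM) re-estimation procedure. One iteration is sketched below, without scaling, so it is only an illustration for short sequences:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One Baum-Welch re-estimation step; repeat until convergence."""
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)

    # Forward and backward passes
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Expected state occupancies and transition counts
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # Re-estimate parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```

Each step is guaranteed not to decrease the likelihood of the training data, but it converges only to a local optimum, which is one reason training can be a bottleneck.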
HMM Concept Summary • Build models representing the hidden states of a process or structure using only observations • Use the models to evaluate the probability that a model represents a particular observation sequence • Use the evaluation information in an application to recognize speech, parse addresses, and many other tasks
Part II: Application AutoBib System • Provides a uniform view of several computer science bibliographic web data sources • An automated web information extraction system that requires little human input • Web pages are designed differently from site to site • IE requires training samples • HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)
Approach 1) Provide seed database of structured records 2) Extract raw records from relevant Web pages 3) Match structured records to raw records • To build training samples 4) Train HMM-based parser 5) Parse unmatched raw records into structured records 6) Merge new structured records into database
Step 1 - Seeding • Provide seed database of structured records • Take a small collection of BibTeX-format records and insert them into the database • Cleaning step normalizes record fields (see the sketch below) • Examples: • "Proc." → "Proceedings" • "Jan" → "January" • Manual step, executed once only
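A sketch of such a cleaning step; the expansion table is an illustrative assumption, not AutoBib's actual rule set:

```python
# Token-level normalization table (illustrative entries only)
NORMALIZE = {
    "Proc.": "Proceedings",
    "Jan": "January",
    "Feb": "February",
}

def clean_field(value: str) -> str:
    """Replace each known abbreviation with its canonical form."""
    return " ".join(NORMALIZE.get(tok, tok) for tok in value.split())
```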
Step 2 – Extract Raw Records • Extract raw records from relevant Web pages • User specifies • Web pages to extract from • How to follow ‘next page’ links for multiple pages • Raw records are extracted • Uses record-boundary discovery techniques • Subtree of Interest = largest subtree of HTML tags • Record separators = frequent HTML tags
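A rough sketch of the record-separator heuristic, under the simplifying assumption that the most frequent tag inside the subtree of interest marks record boundaries; this is not AutoBib's exact procedure:

```python
import re
from collections import Counter

def guess_separator(subtree_html: str) -> str:
    """Pick the most frequent HTML tag as the candidate separator."""
    tags = re.findall(r"<(\w+)[^>]*>", subtree_html)
    return Counter(tags).most_common(1)[0][0]

def split_records(subtree_html: str) -> list[str]:
    """Split the subtree of interest into raw records at the separator."""
    sep = guess_separator(subtree_html)
    return re.split(rf"<{sep}[^>]*>", subtree_html)[1:]
```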
Tokenized Records (Replace all HTML tags with ^)
Step 3 - Matching • Match raw records R to structured records S • Apply 4 tests (heuristic-based): • Match at least one author in R to an author in S • S.year must appear in R • If S.pages exists, R must contain it • S.title is 'approximately contained' in R (Levenshtein edit distance, an approximate string match)
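The 'approximately contained' test can be sketched with a Levenshtein distance over a sliding window; the 10% error threshold is an illustrative assumption:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_contained(title: str, raw: str, max_err: float = 0.1) -> bool:
    """True if some window of raw is within max_err edits of title."""
    n = len(title)
    best = min(levenshtein(title, raw[i:i + n])
               for i in range(max(1, len(raw) - n + 1)))
    return best <= max_err * n
```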
Step 4 – Parser Training • Train HMM-based parser • For each pair of R and S that match, annotate tokens in raw record with field names • Annotated raw records are fed into HMM parser in order to learn: • State transition probabilities • Symbol probabilities at each state
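Because the training records are annotated, both probability tables can be estimated by simple counting (maximum likelihood). A minimal sketch; the token/state pairs are hypothetical:

```python
from collections import Counter, defaultdict

def train(annotated):
    """annotated: list of [(token, state), ...] sequences."""
    trans = defaultdict(Counter)   # state -> next-state counts
    emit = defaultdict(Counter)    # state -> token counts
    start = Counter()              # initial-state counts
    for seq in annotated:
        start[seq[0][1]] += 1
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[s][s_next] += 1
        for tok, s in seq:
            emit[s][tok] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return ({s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()},
            normalize(start))
```

In practice the emission estimates would also need smoothing (e.g. add-one) so that tokens unseen in training do not get zero probability.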
Parser Training, continued • Key consideration is the HMM structure for navigating record fields (fields, delimiters) • Special states • start, end • Normal states • author, title, year, etc. • Best structure found: multiple delimiter and tag states, one for each normal state • Example: author-delimiter, author-tag
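A sketch of how such a state set might be enumerated; the field list is illustrative:

```python
FIELDS = ["author", "title", "year", "pages"]   # illustrative field set

states = ["start", "end"]                       # special states
for f in FIELDS:
    # one normal state per field, plus its own tag and delimiter states
    states += [f, f + "-tag", f + "-delimiter"]
```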
Sample HMM (Method 3) Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
Step 5 - Conversion • Parse unmatched raw records into structured records using the HMM parser • Matched raw records can be converted directly without parsing because they were annotated in the matching step
Step 6 - Merging • Merge new structured records into database • Initial seed database has now grown • New records will be used for improved matching on the next run
Evaluation • Success rate = (# of tokens correctly labeled by HMM) / (# of tokens labeled by a person) • DBLP: 98.9% • Computer Science Bibliography • CSWD: 93.4% • CompuScience WWW-Database
HMM Advantages / Disadvantages • Advantages • Effective • Can handle variations in record structure • Optional fields • Varying field ordering • Disadvantages • Requires training using annotated data • Not completely automatic • May require manual markup • Size of training data may be an issue
Other Methods • Wrappers • Specification of areas of interest on a Web page • Hand-crafted • Wrapper induction • Requires manual training • Does not always accommodate changing page structure • Syntax-based; no semantic labeling
Application to Other Domains • E-Commerce • Comparison shopping sites • Extract product/pricing information from many sites • Convert information into structured format and store • Provide interface to look up product information and then display pricing information gathered from many sites • Saves users time • Rather than navigating to and searching many sites, users can consult a single site
References • Concept: • Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285. • Application: • Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS’04), 193-204.