Information Extraction
Entity Extraction: Statistical Methods
Sunita Sarawagi
What Are Statistical Methods?
• "Statistical methods of entity extraction convert the extraction task to a problem of designing a decomposition of the unstructured text and then labeling various parts of the decomposition, either jointly or independently."
• Models
  • Token-level
  • Segment-level
  • Grammar-based
• Training
  • Likelihood
  • Max-margin
Token-level Models
• The text is decomposed into a sequence of tokens (characters, words, or n-grams)
• An entity label is assigned to each token (see the example below)
• A generalization of the classification problem, since the labels of neighboring tokens are interdependent
• Careful feature design is important
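As a concrete illustration (a hypothetical example, not taken from the slides), an address string can be split into word tokens, each carrying an entity label in the common BIO encoding:

    # Hypothetical example: word tokens of an address string, each assigned
    # an entity label (B- begins an entity, I- continues it).
    tokens = ["4089", "Whispering", "Pines", "Rd", "Sunnyvale", "CA", "94087"]
    labels = ["B-HouseNo", "B-Street", "I-Street", "I-Street",
              "B-City", "B-State", "B-Zip"]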
Features
• Each feature is a real-valued function f : (x, y, i) → R of the input x, a candidate label y, and the token position i
• Word features
  • The surface word itself is a strong indicator of which label to use
• Orthographic features
  • Capitalization patterns (cap-words)
  • Presence of special characters
  • Alphanumeric generalization of the characters in the token
• Dictionary lookup features
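A minimal sketch of such feature functions in Python (the specific feature instances and the dictionary below are illustrative assumptions, not prescribed by the slides):

    import re

    # Token-level feature functions f(x, y, i) -> R; each fires (returns 1.0)
    # when its pattern holds for token i under candidate label y.

    def word_feature(x, y, i):
        # Surface word paired with a label.
        return 1.0 if (x[i].lower(), y) == ("rd", "I-Street") else 0.0

    def cap_word_feature(x, y, i):
        # Orthographic: token begins with a capital letter.
        return 1.0 if x[i][:1].isupper() and y.startswith("B-") else 0.0

    def all_digits_feature(x, y, i):
        # Alphanumeric generalization: token consists only of digits.
        return 1.0 if re.fullmatch(r"\d+", x[i]) and y == "B-Zip" else 0.0

    CITY_DICT = {"sunnyvale", "san diego", "mumbai"}  # illustrative dictionary

    def dictionary_feature(x, y, i):
        # Dictionary lookup: token matches a known city name.
        return 1.0 if x[i].lower() in CITY_DICT and y == "B-City" else 0.0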
Models for Labeling Tokens
• Logistic classifier
• Support Vector Machine (SVM)
• Hidden Markov Models (HMMs)
• Maximum entropy Markov Model (MEMM)
• Conditional Markov Model (CMM)
• Conditional Random Fields (CRFs)
  • Define a single joint distribution Pr(y|x) over the entire label sequence
  • Built on a global scoring function w·f(x, y)
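Spelled out, a CRF defines the label distribution in the standard log-linear form (this equation is standard for CRFs but is not written out on the slide):

    \Pr(y \mid x) \;=\; \frac{\exp\bigl(w \cdot f(x, y)\bigr)}{\sum_{y'} \exp\bigl(w \cdot f(x, y')\bigr)}

where f(x, y) accumulates the token-level features over all positions of the sequence.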
Segment-level Models
• The text is decomposed into a sequence of segments
• An entity label is assigned to each segment
• Features can span multiple tokens
Entity-level Features
• Exact match of the whole segment (e.g., against a dictionary of known entities)
• Similarity functions such as TF-IDF
• Segment length
Global Segmentation Models
• Define a probability distribution over entire segmentations, Pr(s|x) = exp(w·f(x, s)) / Z(x), analogous to a token-level CRF
• The goal is to find the segmentation s such that w·f(x, s) is maximized
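A sketch of how a segmentation is scored under such a model (the particular segment features and dictionary here are illustrative assumptions):

    # Score a segmentation s of tokens x under weights w.
    # s is a list of (start, end, label) triples covering x; end is exclusive.

    def segment_features(x, start, end, label):
        # Illustrative segment-level features: segment length and dictionary match.
        segment = " ".join(x[start:end]).lower()
        return {
            ("len", end - start, label): 1.0,
            ("in_city_dict", label): 1.0 if segment in {"san diego", "sunnyvale"} else 0.0,
        }

    def score_segmentation(w, x, s):
        # Computes w . f(x, s): weighted feature sum over all segments.
        total = 0.0
        for start, end, label in s:
            for name, value in segment_features(x, start, end, label).items():
                total += w.get(name, 0.0) * value
        return total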
Grammar-based Models
• The decomposition is driven by the production rules of a grammar
• Labeling produces parse trees over the text
• A scoring function is attached to each production
Training Algorithms
• Given an input x, the trained model outputs some y
  • A sequence of labels for sequence models
  • A segmentation of x for segment-level models
  • A parse tree for grammar-based models
• The output is the argmax of s(y) = w·f(x, y), where f(x, y) is a feature vector
• Two types of training methods
  • Likelihood-based training
  • Max-margin training
Likelihood Trainer
• The model defines a probability distribution Pr(y|x, w) = exp(w·f(x, y)) / Z_w(x), where Z_w(x) = Σ_y' exp(w·f(x, y'))
• Training maximizes the log-likelihood of the labeled examples: L(w) = Σ_ℓ [w·f(x_ℓ, y_ℓ) − log Z_w(x_ℓ)]
• The weight vector w that maximizes L(w) is found by gradient-based optimization
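The gradient has the familiar "observed minus expected" form for log-linear models (standard, though not spelled out on the slide):

    \nabla L(w) \;=\; \sum_{\ell} \Bigl( f(x_\ell, y_\ell) \;-\; \mathbb{E}_{y' \sim \Pr(\cdot \mid x_\ell, w)}\bigl[ f(x_\ell, y') \bigr] \Bigr)

This is why the expected-feature-value inference query described below is needed during training.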
Max-margin Training
• "an extension of support vector machines for training structured models"
• Finds the weight vector w that separates the score of the correct output y_ℓ from the score of every alternative y by a margin
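One standard way to write the constraints is the margin-rescaled structured SVM (the exact variant is an assumption here):

    w \cdot f(x_\ell, y_\ell) \;\ge\; w \cdot f(x_\ell, y) + \mathrm{Err}(y_\ell, y) - \xi_\ell \qquad \text{for all } y \ne y_\ell

where Err(y_ℓ, y) measures how wrong the alternative y is and ξ_ℓ ≥ 0 is a slack variable; the objective minimizes ||w||² plus the total slack.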
Inference Algorithms
• Two kinds of inference queries
  • MAP labeling: find the highest-scoring output
  • Expected feature values under the model distribution
• Both can be solved using dynamic programming
MAP for Sequential Labeling
• Also known as the Viterbi algorithm
• The best labeling of x is found by the recursion V(i, y) = max_y' [V(i−1, y') + w·f(x, i, y, y')], where n is the length of x and the answer is max_y V(n, y)
• Runs in O(nm²) time, where m is the number of labels
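A compact sketch of the recursion (the scoring interface `score(x, i, y, y_prev)`, standing in for w·f, is an assumption):

    def viterbi(x, labels, score):
        # score(x, i, y, y_prev) -> w . f(x, i, y, y_prev); y_prev is None at i = 0.
        n = len(x)
        # V[i][y] = (best score of a labeling of x[:i+1] ending in y, its path)
        V = [{y: (score(x, 0, y, None), [y]) for y in labels}]
        for i in range(1, n):
            V.append({})
            for y in labels:
                # Best previous label for each current label: O(m^2) work per position.
                best_prev = max(labels, key=lambda yp: V[i - 1][yp][0] + score(x, i, y, yp))
                s, path = V[i - 1][best_prev]
                V[i][y] = (s + score(x, i, y, best_prev), path + [y])
        best = max(labels, key=lambda y: V[n - 1][y][0])
        return V[n - 1][best][1]  # MAP label sequence

On a length-n input this evaluates the score for every (position, label, previous-label) triple, matching the O(nm²) bound above.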
MAP for Segmentations
• The Viterbi recursion is extended so that each step chooses both the label and the length of the segment ending at the current position
• Runs in O(nLm²) time, where L is the size of the largest segment
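The segment-level variant of the sketch above (again assuming a user-supplied scoring function):

    def viterbi_segments(x, labels, L, seg_score):
        # seg_score(x, start, end, y, y_prev) scores the segment x[start:end]
        # labeled y, following a segment labeled y_prev (None at the start).
        n = len(x)
        # V[j][y] = (best score of a segmentation of x[:j] ending in y, its segments)
        V = [dict() for _ in range(n + 1)]
        V[0] = {None: (0.0, [])}
        for j in range(1, n + 1):
            for y in labels:
                best = None
                for d in range(1, min(L, j) + 1):   # candidate segment length
                    for yp, (s, segs) in V[j - d].items():
                        cand = s + seg_score(x, j - d, j, y, yp)
                        if best is None or cand > best[0]:
                            best = (cand, segs + [(j - d, j, y)])
                V[j][y] = best
        return max(V[n].values(), key=lambda t: t[0])[1]  # MAP segmentation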
MAP for Parse Trees
• The best tree is found by a CKY-style dynamic program over spans, where the maximization at each span goes over all possible nonterminals
• Runs in O(n³M³) time for a binarized grammar, where M is the total number of terminals and nonterminals
Expected Feature Values for Sequential Labelings
• Computed by dynamic programming over values at each node
• A forward recursion computes α(i, y), the total unnormalized weight of all labelings of the first i tokens ending in label y
• A backward recursion computes β(i, y), the total weight of all labelings of the remaining tokens given label y at position i
• The expected value of a feature is its value at each position, weighted by the marginal α·exp(score)·β / Z(x) and summed
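A minimal unnormalized version of the α/β recursions (numerical-underflow handling is omitted, and the scoring and feature interfaces are the same assumptions as above):

    import math

    def expected_feature(x, labels, score, f):
        # E[f] under Pr(y|x) proportional to exp(sum_i score(x, i, y_i, y_{i-1})).
        n = len(x)
        # alpha[i][y]: total weight of prefixes ending at position i with label y.
        alpha = [{y: math.exp(score(x, 0, y, None)) for y in labels}]
        for i in range(1, n):
            alpha.append({y: sum(alpha[i - 1][yp] * math.exp(score(x, i, y, yp))
                                 for yp in labels) for y in labels})
        # beta[i][y]: total weight of suffixes after position i, given label y at i.
        beta = [dict() for _ in range(n)]
        beta[n - 1] = {y: 1.0 for y in labels}
        for i in range(n - 2, -1, -1):
            beta[i] = {yp: sum(math.exp(score(x, i + 1, y, yp)) * beta[i + 1][y]
                               for y in labels) for yp in labels}
        Z = sum(alpha[n - 1][y] for y in labels)
        # Sum the feature over positions, weighted by the pairwise marginals.
        total = sum(alpha[0][y] * beta[0][y] / Z * f(x, 0, y, None) for y in labels)
        for i in range(1, n):
            for yp in labels:
                for y in labels:
                    p = alpha[i - 1][yp] * math.exp(score(x, i, y, yp)) * beta[i][y] / Z
                    total += p * f(x, i, y, yp)
        return total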
Summary
• The most prominent models in use
  • Maximum entropy taggers (MaxEnt)
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
• CRFs are now established as the state of the art
• Segment-level and grammar-based CRFs are not yet as popular
Further Readings
• Active learning
• Bootstrapping from structured data
• Transfer learning and domain adaptation
• Collective inference