Learning structured outputs: a reinforcement learning approach P. Gallinari, F. Maes, L. Denoyer Patrick.gallinari@lip6.fr www-connex.lip6.fr University Pierre et Marie Curie – Paris – France Computer Science Lab.
Outline • Motivation and examples • Approaches for structured learning • Generative models • Discriminant models • Reinforcement learning for structured prediction • Experiments ANNPR 2008 - P. Gallinari
Machine learning and structured data • Different types of problems • Model, classify, cluster structured data • Predict structured outputs • Learn to associate structured representations • Structured data and applications in many domains • chemistry, biology, natural language, web, social networks, databases, etc.
Sequence labeling: POS tagging (figure: example sentence annotated with tags such as coordinating conjunction, noun, verb 3rd person, adverb, plural noun, determiner, gerund verb, plural verb, adjective)
Segmentation + labeling: syntactic chunking (Washington Univ. tagger) (figure: example sentence segmented into adverbial, noun, and verb phrases)
Syntactic parsing (Stanford Parser)
Tree mapping • Problem • Querying heterogeneous XML collections • Web information extraction • Requires knowing the correspondence between the structured representations, usually built by hand • Goal: learn the correspondence between the different sources • Labeled tree mapping problem
Others • Taxonomies • Social networks • Adversarial computing: Web spam, blog spam, … • Translation • Biology • …
Is structure really useful? Can we make use of structure? • Yes • Evidence from many domains and applications • Mandatory for many problems • e.g. a 10,000-class classification problem • Yes, but • Complex or long-term dependencies often correspond to rare events • Practical evidence for large-scale problems • Simple models sometimes offer competitive results • Information retrieval • Speech recognition, etc.
Why is structure prediction difficult? • Size of the output space • # of potential labelings for a sequence • # of potential joint labelings and segmentations • # of potential labeled trees for an input sequence • … • Exponential output space • Inference often amounts to a combinatorial search problem • Scaling issues in most models
Structured learning • X, Y: input and output spaces • Structured output • y ∈ Y decomposes into parts of variable size • y = (y1, y2,…, yT) • Dependencies • Relations between the parts of y • Local, long-term, global • Cost function • 0/1 loss: Δ(y, ŷ) = 1 if y ≠ ŷ, 0 otherwise • Hamming loss: Δ(y, ŷ) = Σt 1[yt ≠ ŷt] • F-score, BLEU, etc.
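The two decomposable losses above can be sketched in a few lines; the tag names in the example are illustrative only:

```python
def zero_one_loss(y_true, y_pred):
    """0/1 loss: 1 if any part of the output differs, else 0 (all-or-nothing)."""
    return 0 if y_true == y_pred else 1

def hamming_loss(y_true, y_pred):
    """Hamming loss: fraction of parts that differ (gives partial credit)."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)

y_true = ["N", "V", "D", "N"]
y_pred = ["N", "V", "D", "A"]  # one of four tags is wrong
print(zero_one_loss(y_true, y_pred))  # 1
print(hamming_loss(y_true, y_pred))   # 0.25
```

The 0/1 loss treats a single wrong tag the same as an entirely wrong sequence, which is why structured models prefer decomposable losses such as Hamming.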
General approach • Predictive approach: ŷ = argmax y∈Y F(x, y) • where F : X × Y → R is a score function used to rank potential outputs • F trained to optimize some loss function • Inference problem • |Y| sometimes exponential • The argmax is often intractable; hypotheses: • decomposability of the score function over the parts of y • restricted set of outputs
Structured algorithms differ by: • Feature encoding • Hypotheses on the output structure • Hypotheses on the cost function • Type of structure prediction problem
Three main families of SP models Generative models Discriminant models Search models
Generative models • 30 years of generative models • Hidden Markov Models • Many extensions • Hierarchical HMMs, Factorial HMMs, etc. • Probabilistic Context-Free Grammars • Tree models, graph models
Usual hypotheses • Features: “natural” encoding of the input • Local dependency hypothesis • On the outputs (Markov) • On the inputs • Cost function • Usually joint likelihood • Decomposes, e.g. as a sum of local costs on each subpart • Inference and learning usually use dynamic programming • HMMs • Decoding complexity O(n|Q|^2) for a sequence of length n and |Q| states • PCFGs • Decoding complexity O(m^3 n^3), with n the sentence length and m the number of non-terminals in the grammar • Prohibitive for many problems
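The O(n|Q|^2) HMM decoding mentioned above is the Viterbi dynamic program; a minimal sketch follows, with toy model parameters invented purely for illustration:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Best state sequence under an HMM, in O(n * |Q|^2) time."""
    # V[t][s]: best log-probability of a state sequence ending in s at step t
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at step t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:  # |Q| current states ...
            prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])  # ... times |Q| predecessors
            back[-1][s] = prev
            V[t][s] = V[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):  # follow backpointers to recover the sequence
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy 2-state POS model (all probabilities made up for the example)
lg = math.log
states = ["Noun", "Verb"]
start = {"Noun": lg(0.6), "Verb": lg(0.4)}
trans = {"Noun": {"Noun": lg(0.3), "Verb": lg(0.7)},
         "Verb": {"Noun": lg(0.6), "Verb": lg(0.4)}}
emit = {"Noun": {"fish": lg(0.8), "sleep": lg(0.2)},
        "Verb": {"fish": lg(0.3), "sleep": lg(0.7)}}
print(viterbi(["fish", "sleep"], states, start, trans, emit))  # ['Noun', 'Verb']
```

The nested loop over current states and predecessors is exactly where the |Q|^2 factor comes from, repeated n times along the sequence.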
Summary: generative SP models • Well-known approaches • Large number of existing models • Local hypotheses • Often imply using dynamic programming for both inference and learning
Discriminant models • Structured Perceptron (Collins 2002) • Breakthrough • Large margin methods (Tsochantaridis et al. 2004, Taskar 2004) • Many others now • Considered an extension of multi-class classification
Usual hypotheses • Joint representation of input and output Φ(x, y) • Encodes potential dependencies among and between inputs and outputs • e.g. histogram of state transitions observed in the training set, frequency of (xi, yj), POS tags, etc. • Large feature sets (10^2 to 10^4) • Linear score function: F(x, y) = ⟨θ, Φ(x, y)⟩ • Decomposability of the feature set (over the outputs) and of the loss function
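A minimal sketch of such a joint feature map and linear score for sequence labeling; the feature names ("emit", "trans") and the toy weights are illustrative, not a specific system's encoding:

```python
from collections import Counter

def phi(x, y):
    """Hypothetical joint features Phi(x, y): counts of (word, tag) pairs
    and of (tag, next tag) transitions, as on the slide."""
    feats = Counter()
    for xi, yi in zip(x, y):
        feats[("emit", xi, yi)] += 1
    for a, b in zip(y, y[1:]):
        feats[("trans", a, b)] += 1
    return feats

def score(theta, x, y):
    """Linear score F(x, y) = <theta, Phi(x, y)>, with theta a sparse dict."""
    return sum(theta.get(f, 0.0) * v for f, v in phi(x, y).items())

theta = {("emit", "the", "D"): 1.0, ("trans", "D", "N"): 0.5}
print(score(theta, ["the", "dog"], ["D", "N"]))  # 1.5
```

Because the features decompose over adjacent output parts, the argmax over y can still be computed by dynamic programming, which is what the large-margin methods below rely on.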
Extension of large margin methods • 2 problems • The classical 0/1 loss is meaningless with structured outputs • Structured output prediction is not a simple extension of multiclass classification • Generalize the max-margin principle to loss functions other than the 0/1 loss • The number of constraints of the QP problem is proportional to |Y|, i.e. potentially exponential • Different solutions • Consider only a polynomial number of “active” constraints -> bounds on the optimality of the solution • Learning requires solving an argmax problem at each iteration • i.e. dynamic programming
Summary: discriminant approaches • Hypotheses • Local dependencies for the output • Decomposability of the loss function • Long-term dependencies in the input • Nice convergence properties + bounds • Complexity • Learning often does not scale
Precursors • Incremental Parsing, Collins 2004 • SEARN, Daume et al. 2006 • Reinforcement Learning, Maes et al. 2007
General idea • The structured output is built incrementally • ŷ = (ŷ1, ŷ2,…, ŷT) • Different from solving the prediction problem directly • It is built via machine learning • Goal • Construct ŷ incrementally so as to minimize the loss • Training: learn how to search the solution space • Inference: build ŷ as a sequence of successive decisions
Example: sequence labeling • 2 labels, R and B • Search space: (input sequence, {sequences of labels}) • For a sequence of size 3, x = x1x2x3 • A node represents a state in the search space
Example: actions and decisions • Sequence of size 3 with a given target labeling • Actions extend the partial labeling one element at a time • Π(s, a): decision function
Example: expected loss • (figure: search tree for the size-3 sequence, with a local cost C on each action and a total cost CT on each complete labeling, ranging here from CT = 0 to CT = 3) • Loss does not always separate!
Inference • Suppose we have a policy function F which decides at each step which action to take • Inference is performed by computing • ŷ1 = F(x, ·), ŷ2 = F(x, ŷ1), …, ŷT = F(x, ŷ1,…, ŷT-1) • ŷ = (ŷ1,…, ŷT) • No dynamic programming needed
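The inference loop above can be sketched in a few lines; `policy_score` stands in for the learned policy F, and the copy-the-input toy policy in the demo is purely illustrative:

```python
def greedy_inference(x, labels, policy_score):
    """Build y_hat incrementally: at each step, take the action (label)
    the policy scores highest given the input and the partial output."""
    partial = []
    for _ in range(len(x)):
        best = max(labels, key=lambda a: policy_score(x, tuple(partial), a))
        partial.append(best)
    return partial

# Toy policy (illustrative): score 1 for the action matching the current input symbol
toy_policy = lambda x, partial, a: 1.0 if a == x[len(partial)] else 0.0
print(greedy_inference(["R", "B", "B"], ["R", "B"], toy_policy))  # ['R', 'B', 'B']
```

Each step is a single argmax over the action set, so inference costs O(T × #labels) policy evaluations instead of a dynamic program over all labelings.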
Example: state-space exploration guided by local costs • (figure: the same search tree, with exploration following the low-cost actions) • Goal: generalize to unseen situations
Training • Learn a policy F • From a set of (input, output) pairs • Which takes the optimal action at each state • Which is able to generalize to unknown situations
Reinforcement learning for SP • Formalizes search-based ideas, using Markov Decision Processes as the model and Reinforcement Learning for training • Provides a general framework for this approach • Many RL algorithms could be used for training • Differences with the classical use of RL • Specific hypotheses • Deterministic environment • Problem size • State: up to 10^6 variables • Generalization properties
Markov Decision Process • An MDP is a tuple (S, A, P, R) • S is the state space • A is the action space • P is a transition function describing the dynamics of the environment • P(s, a, s’) = P(st+1 = s’ | st = s, at = a) • R is a reward function • R(s, a, s’) = E[rt | st+1 = s’, st = s, at = a]
Example: prediction problem
Example: state space • Deterministic MDP • States: input + partial output • Actions: elementary modifications of the partial output • Rewards: quality of prediction
Example: partially observable reward • The reward function is only partially observable: limited set of training (input, output) pairs
Learning the MDP • The reward function cannot be computed on the whole MDP • Approximated RL • Learn the policy on a subspace • Generalize to the whole space • The policy is learned as a linear function: Q(s, a) = ⟨θ, Φ(s, a)⟩ • Φ(s, a) is a joint description of (s, a) • θ is a parameter vector • Learning algorithms • Sarsa, Q-learning, etc.
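A minimal, self-contained sketch of Sarsa with a linear Q-function on the deterministic labeling MDP; the feature encoding, the per-step reward (1 for a correct label, 0 otherwise), and the toy data are assumptions for illustration, not the authors' exact setup:

```python
import random
from collections import defaultdict

def features(x, partial, a):
    """Hypothetical Phi(s, a): current input symbol and previous label, paired with action a."""
    t = len(partial)
    prev = partial[-1] if partial else "<s>"
    return [("emit", x[t], a), ("trans", prev, a)]

def q_value(theta, x, partial, a):
    """Linear Q(s, a) = <theta, Phi(s, a)>."""
    return sum(theta[f] for f in features(x, partial, a))

def sarsa_train(data, labels, epochs=20, alpha=0.1, eps=0.1, seed=0):
    """Sarsa on the deterministic labeling MDP, learning theta on training pairs."""
    rng = random.Random(seed)
    theta = defaultdict(float)

    def pick(x, partial):  # epsilon-greedy action selection
        if rng.random() < eps:
            return rng.choice(labels)
        return max(labels, key=lambda a: q_value(theta, x, partial, a))

    for _ in range(epochs):
        for x, y in data:
            partial = []
            a = pick(x, partial)
            for t in range(len(x)):
                r = 1.0 if a == y[t] else 0.0      # reward observable on training pairs
                nxt = partial + [a]                # deterministic transition
                if t + 1 < len(x):
                    a2 = pick(x, nxt)
                    target = r + q_value(theta, x, nxt, a2)  # undiscounted Sarsa target
                else:
                    a2, target = None, r                      # terminal step
                td = target - q_value(theta, x, partial, a)
                for f in features(x, partial, a):
                    theta[f] += alpha * td
                partial, a = nxt, a2
    return theta

# Train on one toy pair, then decode greedily with the learned Q-function
theta = sarsa_train([(["a", "b"], ["X", "Y"])], labels=["X", "Y"])
pred = []
for t in range(2):
    pred.append(max(["X", "Y"], key=lambda a: q_value(theta, ["a", "b"], pred, a)))
print(pred)
```

Because Q is linear in shared features rather than a table over the exponential state space, the learned policy can generalize to states never visited during training, which is the point of the approximated-RL approach.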
Inference • State at step t • Initial input • Partial output • Once the MDP is learned • Using the state at step t, compute the state at step t+1 • Until a solution is reached • Greedy approach • Only the best action is chosen at each step
Structured outputs and MDP • State st • Input x + partial output ŷt • Initial state: (x, ∅) • Actions • Task dependent • POS: new tag for the current word • Tree mapping: insert a new path into a partial tree • Inference • Greedy approach • Only the best action is chosen at each step • Reward • Final reward, or heuristic intermediate rewards
Example: sequence labeling • Left-right model • Actions: label • Order-free model • Actions: label + position • Loss: Hamming cost or F-score • Tasks • Named entity recognition (shared task at CoNLL 2002: 8,000 train, 1,500 test) • Chunking, noun phrases (CoNLL 2002) • Handwritten word recognition (5,000 train, 1,000 test) • Complexity of inference • O(sequence size × number of labels)
Tree mapping: document mapping • Problem: labeled tree mapping • Different instances • Flat text to XML • HTML to XML • XML to XML • …
Document mapping problem • Central issue: complexity • Large collections • Large feature space: 10^3 to 10^6 • Large search space (exponential) • Other SP methods fail
XML structuring • Action • Attach the path ending with the current leaf to a position in the current partial tree • Φ(s, a) encodes a series of potential (state, action) pairs • Loss: F-score for trees
Example: HTML-to-XML conversion (animated figure) • Input document: an HTML tree (HEAD > TITLE "Example"; BODY containing IT "Francis MAES", H1 "Title of the section", P "Welcome to INEX", FONT "This is a footnote") • Target document: an XML tree (DOCUMENT > TITLE "Example", AUTHOR "Francis MAES", SECTION > TITLE "Title of the section", TEXT "Welcome to INEX") • The target tree is built action by action: first TITLE "Example" is attached to DOCUMENT, then AUTHOR "Francis MAES", and so on