Markov Logic in Natural Language Processing
Hoifung Poon
Dept. of Computer Science & Eng., University of Washington
Overview
• Motivation
• Foundational areas
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Languages Are Structural
• Morphology: governments → govern-ment-s; Hebrew lm$pxtm → l-m$px-t-m ("according to their families")
• Syntax: parse tree for "IL-4 induces CD11B" (S over NP and VP; VP over V and NP)
• Event extraction: "Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 …" yields a nested event structure: involvement (Theme: up-regulation, Cause: activation); up-regulation (Theme: IL-10, Cause: gp41, Site: human monocytes); activation (Theme: p70(S6)-kinase)
• Coreference: "George Walker Bush was the 43rd President of the United States. … Bush was the eldest son of President G. H. W. Bush and Barbara Bush. … In November 1977, he met Laura Welch at a barbecue."
Languages Are Structural
• Objects are not just feature vectors
• They have parts and subparts
• Which have relations with each other
• They can be trees, graphs, etc.
• Objects are seldom i.i.d. (independent and identically distributed)
• They exhibit local and global dependencies
• They form class hierarchies (with multiple inheritance)
• Objects' properties depend on those of related objects
• Deeply interwoven with knowledge
First-Order Logic
• Main theoretical foundation of computer science
• General language for describing complex structures and knowledge
• Trees, graphs, dependencies, hierarchies, etc. easily expressed
• Inference algorithms (satisfiability testing, theorem proving, etc.)
Languages Are Statistical
• Paraphrases: "Microsoft buys Powerset" / "Microsoft acquires Powerset" / "Powerset is acquired by Microsoft Corporation" / "The Redmond software giant buys Powerset" / "Microsoft's purchase of Powerset, …"
• Attachment ambiguity: "I saw the man with the telescope" (does the telescope go with the man, or with the seeing?)
• Named entities: "Here in London, Frances Deek is a retired teacher …" / "In the Israeli town …, Karen London says …" / "Now London says …": is "London" a PERSON or a LOCATION?
• Coreference: "G. W. Bush …… Laura Bush …… Mrs. Bush ……": which one does "Mrs. Bush" refer to?
Languages Are Statistical
• Languages are ambiguous
• Our information is always incomplete
• We need to model correlations
• Our predictions are uncertain
• Statistics provides the tools to handle this
Probabilistic Graphical Models
• Mixture models
• Hidden Markov models
• Bayesian networks
• Markov random fields
• Maximum entropy models
• Conditional random fields
• Etc.
The Problem
• Logic is deterministic, requires manual coding
• Statistical models assume i.i.d. data, objects = feature vectors
• Historically, statistical and logical NLP have been pursued separately
• We need to unify the two!
• Burgeoning field in machine learning: statistical relational learning
Costs and Benefits of Statistical Relational Learning
• Benefits
  • Better predictive accuracy
  • Better understanding of domains
  • Enable learning with less or no labeled data
• Costs
  • Learning is much harder
  • Inference becomes a crucial issue
  • Greater complexity for user
Progress to Date
• Probabilistic logic [Nilsson, 1986]
• Statistics and beliefs [Halpern, 1990]
• Knowledge-based model construction [Wellman et al., 1992]
• Stochastic logic programs [Muggleton, 1996]
• Probabilistic relational models [Friedman et al., 1999]
• Relational Markov networks [Taskar et al., 2002]
• Etc.
• This talk: Markov logic [Domingos & Lowd, 2009]
Markov Logic: A Unifying Framework
• Probabilistic graphical models and first-order logic are special cases
• Unified inference and learning algorithms
• Easy-to-use software: Alchemy
• Broad applicability
• Goal of this tutorial: quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Markov Networks
• Undirected graphical models (example: a network over Smoking, Cancer, Asthma, Cough)
• Potential functions defined over cliques: P(x) = (1/Z) Π_k φ_k(x_{k}), with Z = Σ_x Π_k φ_k(x_{k})
• Log-linear model: P(x) = (1/Z) exp(Σ_i w_i f_i(x)), where w_i is the weight of feature i and f_i(x) is feature i
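To make the log-linear form concrete, here is a minimal Python sketch (not from the tutorial; the weights, feature functions, and variable names are illustrative assumptions) that scores worlds with weighted Boolean features and normalizes by brute-force enumeration:

```python
import itertools
import math

# Toy log-linear Markov network over three Boolean variables, loosely
# mirroring the Smoking/Cancer/Cough example. Weights are made up.
variables = ["Smoking", "Cancer", "Cough"]
features = [
    (1.5, lambda w: (not w["Smoking"]) or w["Cancer"]),  # Smoking => Cancer
    (1.0, lambda w: (not w["Cancer"]) or w["Cough"]),    # Cancer  => Cough
]

def score(world):
    """Unnormalized score: exp(sum_i w_i * f_i(x))."""
    return math.exp(sum(wt for wt, f in features if f(world)))

# Partition function Z by brute force (fine for a handful of variables).
worlds = [dict(zip(variables, vals))
          for vals in itertools.product([False, True], repeat=len(variables))]
Z = sum(score(w) for w in worlds)

def prob(world):
    """P(x) = exp(sum_i w_i f_i(x)) / Z."""
    return score(world) / Z

print(prob({"Smoking": True, "Cancer": True, "Cough": True}))
```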
Inference in Markov Networks
• Goal: compute marginals and conditionals of P(X)
• Exact inference is #P-complete
• Conditioning on the Markov blanket is easy: P(x_i | MB(x_i)) = exp(Σ_j w_j f_j(x)) / [exp(Σ_j w_j f_j(x[x_i=0])) + exp(Σ_j w_j f_j(x[x_i=1]))], and only the features involving x_i matter
• Gibbs sampling exploits this
MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
  for each variable x
    sample x according to P(x | neighbors(x))
    state ← state with new value of x
P(F) ← fraction of states in which F is true
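A self-contained Python sketch of the sampler above, on the same kind of toy weighted-feature model (all names, weights, and parameters are illustrative): it repeatedly resamples each variable from its conditional given the rest of the state, then reports the fraction of samples in which the query atom is true.

```python
import math
import random

# Toy log-linear model: weighted Boolean features over three variables.
variables = ["Smoking", "Cancer", "Cough"]
features = [
    (1.5, lambda w: (not w["Smoking"]) or w["Cancer"]),
    (1.0, lambda w: (not w["Cancer"]) or w["Cough"]),
]

def score(world):
    return math.exp(sum(wt for wt, f in features if f(world)))

def gibbs(query, num_samples=10000, burn_in=1000, seed=0):
    """Estimate P(query = True) by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in variables}  # random truth assignment
    hits = total = 0
    for t in range(num_samples):
        for x in variables:
            # P(x | rest) only depends on the features that mention x,
            # but comparing full-world scores gives the same conditional.
            s_true = score({**state, x: True})
            s_false = score({**state, x: False})
            state[x] = rng.random() < s_true / (s_true + s_false)
        if t >= burn_in:
            total += 1
            hits += state[query]
    return hits / total

print(gibbs("Cancer"))
```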
Other Inference Methods
• Belief propagation (sum-product)
• Mean field / variational approximations
MAP/MPE Inference
• Goal: find the most likely state of the world given evidence: arg max_y P(y | x), where y is the query and x the evidence
MAP Inference Algorithms
• Iterated conditional modes
• Simulated annealing
• Graph cuts
• Belief propagation (max-product)
• LP relaxation
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Generative Weight Learning
• Maximize likelihood
• Gradient: ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)], i.e., the no. of times feature i is true in the data minus the expected no. of times feature i is true according to the model
• Use gradient ascent or L-BFGS
• No local maxima
• Requires inference at each step (slow!)
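A hedged sketch of the gradient step (weight_i += η · [count_i(data) − E_w[count_i]]), with the expected counts computed exactly by enumerating a toy model; in realistic models this expectation is where the slow inference comes in. Data, weights, and feature names are illustrative.

```python
import itertools
import math

# Gradient ascent on the average log-likelihood of a tiny log-linear model.
variables = ["Smoking", "Cancer"]
feature_fns = [
    lambda w: (not w["Smoking"]) or w["Cancer"],  # Smoking => Cancer
    lambda w: w["Smoking"],                       # Smoking
]
weights = [0.0, 0.0]

data = [  # toy training worlds
    {"Smoking": True, "Cancer": True},
    {"Smoking": False, "Cancer": False},
    {"Smoking": True, "Cancer": True},
    {"Smoking": True, "Cancer": False},
]

worlds = [dict(zip(variables, v))
          for v in itertools.product([False, True], repeat=len(variables))]

def score(world):
    return math.exp(sum(wt for wt, f in zip(weights, feature_fns) if f(world)))

eta = 0.1
for step in range(200):
    Z = sum(score(w) for w in worlds)
    for i, f in enumerate(feature_fns):
        observed = sum(f(x) for x in data)  # no. of times feature i is true in data
        expected = len(data) * sum(score(w) / Z for w in worlds if f(w))  # model expectation
        weights[i] += eta * (observed - expected) / len(data)

print(weights)
```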
Pseudo-Likelihood
• PL(x) = Π_i P(x_i | neighbors(x_i)): likelihood of each variable given its neighbors in the data
• Does not require inference at each step
• Widely used in vision, spatial statistics, etc.
• But PL parameters may not work well for long inference chains
Discriminative Weight Learning
• Maximize conditional likelihood of query (y) given evidence (x)
• Gradient: ∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)], i.e., the no. of true groundings of clause i in the data minus the expected no. of true groundings according to the model
• Approximate the expected counts by the counts in the MAP state of y given x
Voted Perceptron
• Originally proposed for training HMMs discriminatively
• Assumes network is a linear chain
• Can be generalized to arbitrary networks

w_i ← 0
for t ← 1 to T do
  y_MAP ← Viterbi(x)
  w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return w_i / T
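A generic Python sketch of the update above, written as an averaged ("voted") structured perceptron; the MAP inference routine (e.g. Viterbi for a linear chain) and the feature counter are assumed to be supplied by the caller, and all names are illustrative.

```python
from collections import defaultdict

def voted_perceptron(examples, count_features, map_infer, num_epochs=10, eta=1.0):
    """
    examples       : list of (x, y_gold) pairs
    count_features : (x, y) -> {feature index: count}
    map_infer      : (x, weights) -> approximate argmax_y score(x, y), e.g. Viterbi
    Returns the averaged weight vector.
    """
    w = defaultdict(float)      # current weights
    w_sum = defaultdict(float)  # running sum of weights for averaging
    steps = 0
    for _ in range(num_epochs):
        for x, y_gold in examples:
            y_map = map_infer(x, w)
            gold = count_features(x, y_gold)
            pred = count_features(x, y_map)
            # w_i <- w_i + eta * [count_i(y_gold) - count_i(y_MAP)]
            for i in set(gold) | set(pred):
                w[i] += eta * (gold.get(i, 0) - pred.get(i, 0))
            for i, v in w.items():
                w_sum[i] += v
            steps += 1
    return {i: v / steps for i, v in w_sum.items()}
```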
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
First-Order Logic
• Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y)
• Literal: predicate or its negation
• Clause: disjunction of literals
• Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
• World (model, interpretation): assignment of truth values to all ground predicates
Inference in First-Order Logic
• Traditionally done by theorem proving (e.g.: Prolog)
• Propositionalization followed by model checking turns out to be faster (often by a lot)
• Propositionalization: create all ground atoms and clauses
• Model checking: satisfiability testing
• Two main approaches:
  • Backtracking (e.g.: DPLL)
  • Stochastic local search (e.g.: WalkSAT)
Satisfiability
• Input: set of clauses (convert the KB to conjunctive normal form (CNF))
• Output: truth assignment that satisfies all clauses, or failure
• The paradigmatic NP-complete problem
• Solution: search
• Key point: most SAT problems are actually easy
• Hard region: narrow range of #Clauses / #Variables
Stochastic Local Search
• Uses complete assignments instead of partial
• Start with random state
• Flip variables in unsatisfied clauses
• Hill-climbing: minimize # unsatisfied clauses
• Avoid local minima: random flips
• Multiple restarts
The WalkSAT Algorithm
for i ← 1 to max-tries do
  solution ← random truth assignment
  for j ← 1 to max-flips do
    if all clauses satisfied then
      return solution
    c ← random unsatisfied clause
    with probability p
      flip a random variable in c
    else
      flip the variable in c that maximizes # satisfied clauses
return failure
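A runnable Python sketch of WalkSAT as described above, with clauses given as lists of DIMACS-style integer literals (positive k means variable k is true); the parameter defaults are illustrative.

```python
import random

def walksat(clauses, num_vars, max_tries=10, max_flips=1000, p=0.5, seed=0):
    """Return a satisfying assignment {var: bool}, or None on failure."""
    rng = random.Random(seed)

    def satisfied(clause, assign):
        return any((lit > 0) == assign[abs(lit)] for lit in clause)

    def num_satisfied(assign):
        return sum(satisfied(c, assign) for c in clauses)

    for _ in range(max_tries):
        # solution <- random truth assignment
        assign = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, assign)]
            if not unsat:
                return assign                  # all clauses satisfied
            clause = rng.choice(unsat)         # random unsatisfied clause
            if rng.random() < p:
                var = abs(rng.choice(clause))  # random-walk flip
            else:                              # greedy flip: maximize # satisfied clauses
                def gain(v):
                    assign[v] = not assign[v]
                    g = num_satisfied(assign)
                    assign[v] = not assign[v]
                    return g
                var = max((abs(lit) for lit in clause), key=gain)
            assign[var] = not assign[var]
    return None

# Example: (x1 v ~x2) ^ (x2 v x3) ^ (~x1 v ~x3)
print(walksat([[1, -2], [2, 3], [-1, -3]], num_vars=3))
```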
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Rule Induction
• Given: set of positive and negative examples of some concept
  • Example: (x1, x2, …, xn, y)
  • y: concept (Boolean)
  • x1, x2, …, xn: attributes (assume Boolean)
• Goal: induce a set of rules that cover all positive examples and no negative ones
  • Rule: xa ^ xb ^ … ⇒ y (xa: literal, i.e., xi or its negation)
  • Same as Horn clause: Body ⇒ Head
  • Rule r covers example x iff x satisfies the body of r
• Eval(r): accuracy, info gain, coverage, support, etc.
Learning a Single Rule
head ← y
body ← Ø
repeat
  for each literal x
    r_x ← r with x added to body
    compute Eval(r_x)
  body ← body ^ best x
until no x improves Eval(r)
return r
Learning a Set of Rules
R ← Ø
S ← examples
repeat
  learn a single rule r
  R ← R ∪ { r }
  S ← S − positive examples covered by r
until S contains no positive examples
return R
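A compact Python sketch combining the two pseudocode boxes above (sequential covering with greedy single-rule growth); the dictionary-based example representation and the accuracy-based Eval are illustrative assumptions, not part of the tutorial.

```python
def learn_rules(examples, attributes, eval_rule):
    """Sequential covering: learn one rule, drop the positives it covers, repeat."""
    def covers(rule, ex):
        return all(ex[a] == v for a, v in rule)

    def learn_one_rule(exs):
        rule = []  # body starts empty
        while True:
            used = {a for a, _ in rule}
            candidates = [rule + [(a, v)]
                          for a in attributes if a not in used
                          for v in (True, False)]
            if not candidates:
                return rule
            best = max(candidates, key=lambda r: eval_rule(r, exs))
            if eval_rule(best, exs) <= eval_rule(rule, exs):
                return rule  # no literal improves Eval(r)
            rule = best

    rules, remaining = [], list(examples)
    while any(ex["y"] for ex in remaining):  # positive examples left to cover
        r = learn_one_rule(remaining)
        rules.append(r)
        remaining = [ex for ex in remaining if not (covers(r, ex) and ex["y"])]
    return rules

def accuracy(rule, exs):
    """Eval(r): fraction of covered examples that are positive."""
    covered = [ex for ex in exs if all(ex[a] == v for a, v in rule)]
    return sum(ex["y"] for ex in covered) / len(covered) if covered else 0.0

data = [{"a": True,  "b": True,  "y": True},
        {"a": True,  "b": False, "y": True},
        {"a": False, "b": True,  "y": False},
        {"a": False, "b": False, "y": False}]
print(learn_rules(data, ["a", "b"], accuracy))  # e.g. [[("a", True)]]
```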
First-Order Rule Induction
• y and the xi are now predicates with arguments. E.g.: y is Ancestor(x,y), xi is Parent(x,y)
• Literals to add are predicates or their negations
• Literal to add must include at least one variable already appearing in the rule
• Adding a literal changes # groundings of the rule. E.g.: Ancestor(x,z) ^ Parent(z,y) ⇒ Ancestor(x,y)
• Eval(r) must take this into account. E.g.: multiply by # positive groundings of the rule still covered after adding the literal
Overview
• Motivation
• Foundational areas
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Markov Logic
• Syntax: weighted first-order formulas
• Semantics: feature templates for Markov networks
• Intuition: soften logical constraints
• Give each formula a weight (higher weight ⇒ stronger constraint)
Example: Coreference Resolution Barack Obama, the 44th President of the United States, is the first African American to hold the office. ……
Example: Coreference Resolution Two mention constants: A and B Apposition(A,B) Head(A,“President”) Head(B,“President”) MentionOf(A,Obama) MentionOf(B,Obama) Head(A,“Obama”) Head(B,“Obama”) Apposition(B,A)
Markov Logic Networks
• An MLN is a template for ground Markov nets
• Probability of a world x: P(x) = (1/Z) exp(Σ_i w_i n_i(x)), where w_i is the weight of formula i and n_i(x) is the no. of true groundings of formula i in x
• Typed variables and constants greatly reduce the size of the ground Markov net
• Functions, existential quantifiers, etc.
• Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
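To ground the formula P(x) = (1/Z) exp(Σ_i w_i n_i(x)) in code, here is a toy Python sketch with two hypothetical weighted formulas over two constants (in the spirit of the classic smokers example); it counts true groundings per formula and normalizes by enumerating every world, which is only feasible at this tiny scale.

```python
import itertools
import math

people = ["Anna", "Bob"]
atoms = ([("Smokes", (p,)) for p in people] +
         [("Cancer", (p,)) for p in people] +
         [("Friends", (a, b)) for a in people for b in people])

# Each weighted formula reports n_i(x): its number of true groundings in world x.
def n_smoking_causes_cancer(w):   # Smokes(x) => Cancer(x)
    return sum((not w[("Smokes", (p,))]) or w[("Cancer", (p,))] for p in people)

def n_friends_smoke_alike(w):     # Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not w[("Friends", (a, b))]) or
               (w[("Smokes", (a,))] == w[("Smokes", (b,))])
               for a in people for b in people)

formulas = [(1.5, n_smoking_causes_cancer), (1.1, n_friends_smoke_alike)]

def log_score(world):
    return sum(wt * n(world) for wt, n in formulas)  # sum_i w_i * n_i(x)

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
Z = sum(math.exp(log_score(w)) for w in worlds)      # only 2^8 worlds here

def prob(world):
    return math.exp(log_score(world)) / Z            # P(x) = exp(...) / Z

example = {a: False for a in atoms}
example[("Smokes", ("Anna",))] = True
print(prob(example))
```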
Relation to Statistical Models
• Special cases: Markov networks, Markov random fields, Bayesian networks, log-linear models, exponential models, max. entropy models, Gibbs distributions, Boltzmann machines, logistic regression, hidden Markov models, conditional random fields
• Obtained by making all predicates zero-arity
• Markov logic allows objects to be interdependent (non-i.i.d.)
Relation to First-Order Logic
• Infinite weights ⇒ first-order logic
• Satisfiable KB, positive weights ⇒ satisfying assignments = modes of the distribution
• Markov logic allows contradictions between formulas
MAP/MPE Inference
• Problem: find the most likely state of the world given evidence: arg max_y P(y | x), where y is the query and x the evidence