Markov Logic in Natural Language Processing
Hoifung Poon
Dept. of Computer Science & Eng., University of Washington
Overview
• Motivation
• Foundational areas
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Languages Are Structural
• Morphology: governments → govern-ment-s; Hebrew lm$pxtm → l-m$px-t-m ("according to their families")
• Syntax: parse tree for "IL-4 induces CD11B" (S over NP and VP; VP over V and NP)
• Event extraction: "Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 …" yields a nested event structure: involvement (Theme: up-regulation, Cause: activation); up-regulation (Theme: IL-10, Cause: gp41, Site: human monocytes); activation (Theme: p70(S6)-kinase)
• Coreference: "George Walker Bush was the 43rd President of the United States. … Bush was the eldest son of President G. H. W. Bush and Barbara Bush. … In November 1977, he met Laura Welch at a barbecue."
Languages Are Structural
• Objects are not just feature vectors
• They have parts and subparts
• Which have relations with each other
• They can be trees, graphs, etc.
• Objects are seldom i.i.d. (independent and identically distributed)
• They exhibit local and global dependencies
• They form class hierarchies (with multiple inheritance)
• Objects' properties depend on those of related objects
• Deeply interwoven with knowledge
First-Order Logic
• Main theoretical foundation of computer science
• General language for describing complex structures and knowledge
• Trees, graphs, dependencies, hierarchies, etc. easily expressed
• Inference algorithms (satisfiability testing, theorem proving, etc.)
Languages Are Statistical
• Paraphrases: "Microsoft buys Powerset" / "Microsoft acquires Powerset" / "Powerset is acquired by Microsoft Corporation" / "The Redmond software giant buys Powerset" / "Microsoft's purchase of Powerset, …"
• Attachment ambiguity: "I saw the man with the telescope" (does the telescope go with the man, or with the seeing?)
• Named entities: "Here in London, Frances Deek is a retired teacher …" / "In the Israeli town …, Karen London says …" / "Now London says …": is "London" a PERSON or a LOCATION?
• Coreference: "G. W. Bush …… Laura Bush …… Mrs. Bush ……": which one does "Mrs. Bush" refer to?
Languages Are Statistical
• Languages are ambiguous
• Our information is always incomplete
• We need to model correlations
• Our predictions are uncertain
• Statistics provides the tools to handle this
Probabilistic Graphical Models
• Mixture models
• Hidden Markov models
• Bayesian networks
• Markov random fields
• Maximum entropy models
• Conditional random fields
• Etc.
The Problem
• Logic is deterministic, requires manual coding
• Statistical models assume i.i.d. data, objects = feature vectors
• Historically, statistical and logical NLP have been pursued separately
• We need to unify the two!
• Burgeoning field in machine learning: statistical relational learning
Costs and Benefits of Statistical Relational Learning
• Benefits
  • Better predictive accuracy
  • Better understanding of domains
  • Enable learning with less or no labeled data
• Costs
  • Learning is much harder
  • Inference becomes a crucial issue
  • Greater complexity for user
Progress to Date
• Probabilistic logic [Nilsson, 1986]
• Statistics and beliefs [Halpern, 1990]
• Knowledge-based model construction [Wellman et al., 1992]
• Stochastic logic programs [Muggleton, 1996]
• Probabilistic relational models [Friedman et al., 1999]
• Relational Markov networks [Taskar et al., 2002]
• Etc.
• This talk: Markov logic [Domingos & Lowd, 2009]
Markov Logic: A Unifying Framework
• Probabilistic graphical models and first-order logic are special cases
• Unified inference and learning algorithms
• Easy-to-use software: Alchemy
• Broad applicability
• Goal of this tutorial: quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Markov Networks
• Undirected graphical models (example: a network over Smoking, Cancer, Asthma, Cough)
• Potential functions defined over cliques: P(x) = (1/Z) Π_k φ_k(x_{k}), with Z = Σ_x Π_k φ_k(x_{k})
• Log-linear model: P(x) = (1/Z) exp(Σ_i w_i f_i(x)), where w_i is the weight of feature i and f_i(x) is feature i
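To make the log-linear form concrete, here is a minimal Python sketch (not from the tutorial; the weights, feature functions, and variable names are illustrative assumptions) that scores worlds with weighted Boolean features and normalizes by brute-force enumeration:

```python
import itertools
import math

# Toy log-linear Markov network over three Boolean variables, loosely
# mirroring the Smoking/Cancer/Cough example. Weights are made up.
variables = ["Smoking", "Cancer", "Cough"]
features = [
    (1.5, lambda w: (not w["Smoking"]) or w["Cancer"]),  # Smoking => Cancer
    (1.0, lambda w: (not w["Cancer"]) or w["Cough"]),    # Cancer  => Cough
]

def score(world):
    """Unnormalized score: exp(sum_i w_i * f_i(x))."""
    return math.exp(sum(wt for wt, f in features if f(world)))

# Partition function Z by brute force (fine for a handful of variables).
worlds = [dict(zip(variables, vals))
          for vals in itertools.product([False, True], repeat=len(variables))]
Z = sum(score(w) for w in worlds)

def prob(world):
    """P(x) = exp(sum_i w_i f_i(x)) / Z."""
    return score(world) / Z

print(prob({"Smoking": True, "Cancer": True, "Cough": True}))
```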
Inference in Markov Networks
• Goal: compute marginals and conditionals of P(X)
• Exact inference is #P-complete
• Conditioning on the Markov blanket is easy: P(x_i | MB(x_i)) = exp(Σ_j w_j f_j(x)) / [exp(Σ_j w_j f_j(x[x_i=0])) + exp(Σ_j w_j f_j(x[x_i=1]))], and only the features involving x_i matter
• Gibbs sampling exploits this
MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
  for each variable x
    sample x according to P(x | neighbors(x))
    state ← state with new value of x
P(F) ← fraction of states in which F is true
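A self-contained Python sketch of the sampler above, on the same kind of toy weighted-feature model (all names, weights, and parameters are illustrative): it repeatedly resamples each variable from its conditional given the rest of the state, then reports the fraction of samples in which the query atom is true.

```python
import math
import random

# Toy log-linear model: weighted Boolean features over three variables.
variables = ["Smoking", "Cancer", "Cough"]
features = [
    (1.5, lambda w: (not w["Smoking"]) or w["Cancer"]),
    (1.0, lambda w: (not w["Cancer"]) or w["Cough"]),
]

def score(world):
    return math.exp(sum(wt for wt, f in features if f(world)))

def gibbs(query, num_samples=10000, burn_in=1000, seed=0):
    """Estimate P(query = True) by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in variables}  # random truth assignment
    hits = total = 0
    for t in range(num_samples):
        for x in variables:
            # P(x | rest) only depends on the features that mention x,
            # but comparing full-world scores gives the same conditional.
            s_true = score({**state, x: True})
            s_false = score({**state, x: False})
            state[x] = rng.random() < s_true / (s_true + s_false)
        if t >= burn_in:
            total += 1
            hits += state[query]
    return hits / total

print(gibbs("Cancer"))
```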
Other Inference Methods
• Belief propagation (sum-product)
• Mean field / variational approximations
MAP/MPE Inference
• Goal: find the most likely state of the world given evidence: arg max_y P(y | x), where y is the query and x the evidence
MAP Inference Algorithms
• Iterated conditional modes
• Simulated annealing
• Graph cuts
• Belief propagation (max-product)
• LP relaxation
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Generative Weight Learning
• Maximize likelihood
• Gradient: ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)], i.e., the no. of times feature i is true in the data minus the expected no. of times feature i is true according to the model
• Use gradient ascent or L-BFGS
• No local maxima
• Requires inference at each step (slow!)
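A hedged sketch of the gradient step (weight_i += η · [count_i(data) − E_w[count_i]]), with the expected counts computed exactly by enumerating a toy model; in realistic models this expectation is where the slow inference comes in. Data, weights, and feature names are illustrative.

```python
import itertools
import math

# Gradient ascent on the average log-likelihood of a tiny log-linear model.
variables = ["Smoking", "Cancer"]
feature_fns = [
    lambda w: (not w["Smoking"]) or w["Cancer"],  # Smoking => Cancer
    lambda w: w["Smoking"],                       # Smoking
]
weights = [0.0, 0.0]

data = [  # toy training worlds
    {"Smoking": True, "Cancer": True},
    {"Smoking": False, "Cancer": False},
    {"Smoking": True, "Cancer": True},
    {"Smoking": True, "Cancer": False},
]

worlds = [dict(zip(variables, v))
          for v in itertools.product([False, True], repeat=len(variables))]

def score(world):
    return math.exp(sum(wt for wt, f in zip(weights, feature_fns) if f(world)))

eta = 0.1
for step in range(200):
    Z = sum(score(w) for w in worlds)
    for i, f in enumerate(feature_fns):
        observed = sum(f(x) for x in data)  # no. of times feature i is true in data
        expected = len(data) * sum(score(w) / Z for w in worlds if f(w))  # model expectation
        weights[i] += eta * (observed - expected) / len(data)

print(weights)
```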
Pseudo-Likelihood
• PL(x) = Π_i P(x_i | neighbors(x_i)): likelihood of each variable given its neighbors in the data
• Does not require inference at each step
• Widely used in vision, spatial statistics, etc.
• But PL parameters may not work well for long inference chains
Discriminative Weight Learning
• Maximize conditional likelihood of query (y) given evidence (x)
• Gradient: ∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)], i.e., the no. of true groundings of clause i in the data minus the expected no. of true groundings according to the model
• Approximate the expected counts by the counts in the MAP state of y given x
Voted Perceptron
• Originally proposed for training HMMs discriminatively
• Assumes network is a linear chain
• Can be generalized to arbitrary networks

w_i ← 0
for t ← 1 to T do
  y_MAP ← Viterbi(x)
  w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return w_i / T
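A generic Python sketch of the update above, written as an averaged ("voted") structured perceptron; the MAP inference routine (e.g. Viterbi for a linear chain) and the feature counter are assumed to be supplied by the caller, and all names are illustrative.

```python
from collections import defaultdict

def voted_perceptron(examples, count_features, map_infer, num_epochs=10, eta=1.0):
    """
    examples       : list of (x, y_gold) pairs
    count_features : (x, y) -> {feature index: count}
    map_infer      : (x, weights) -> approximate argmax_y score(x, y), e.g. Viterbi
    Returns the averaged weight vector.
    """
    w = defaultdict(float)      # current weights
    w_sum = defaultdict(float)  # running sum of weights for averaging
    steps = 0
    for _ in range(num_epochs):
        for x, y_gold in examples:
            y_map = map_infer(x, w)
            gold = count_features(x, y_gold)
            pred = count_features(x, y_map)
            # w_i <- w_i + eta * [count_i(y_gold) - count_i(y_MAP)]
            for i in set(gold) | set(pred):
                w[i] += eta * (gold.get(i, 0) - pred.get(i, 0))
            for i, v in w.items():
                w_sum[i] += v
            steps += 1
    return {i: v / steps for i, v in w_sum.items()}
```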
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
First-Order Logic
• Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y)
• Literal: predicate or its negation
• Clause: disjunction of literals
• Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
• World (model, interpretation): assignment of truth values to all ground predicates
Inference in First-Order Logic
• Traditionally done by theorem proving (e.g.: Prolog)
• Propositionalization followed by model checking turns out to be faster (often by a lot)
• Propositionalization: create all ground atoms and clauses
• Model checking: satisfiability testing
• Two main approaches:
  • Backtracking (e.g.: DPLL)
  • Stochastic local search (e.g.: WalkSAT)
Satisfiability
• Input: set of clauses (convert the KB to conjunctive normal form (CNF))
• Output: truth assignment that satisfies all clauses, or failure
• The paradigmatic NP-complete problem
• Solution: search
• Key point: most SAT problems are actually easy
• Hard region: narrow range of #Clauses / #Variables
Stochastic Local Search
• Uses complete assignments instead of partial
• Start with random state
• Flip variables in unsatisfied clauses
• Hill-climbing: minimize # unsatisfied clauses
• Avoid local minima: random flips
• Multiple restarts
The WalkSAT Algorithm
for i ← 1 to max-tries do
  solution ← random truth assignment
  for j ← 1 to max-flips do
    if all clauses satisfied then
      return solution
    c ← random unsatisfied clause
    with probability p
      flip a random variable in c
    else
      flip the variable in c that maximizes # satisfied clauses
return failure
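A runnable Python sketch of WalkSAT as described above, with clauses given as lists of DIMACS-style integer literals (positive k means variable k is true); the parameter defaults are illustrative.

```python
import random

def walksat(clauses, num_vars, max_tries=10, max_flips=1000, p=0.5, seed=0):
    """Return a satisfying assignment {var: bool}, or None on failure."""
    rng = random.Random(seed)

    def satisfied(clause, assign):
        return any((lit > 0) == assign[abs(lit)] for lit in clause)

    def num_satisfied(assign):
        return sum(satisfied(c, assign) for c in clauses)

    for _ in range(max_tries):
        # solution <- random truth assignment
        assign = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, assign)]
            if not unsat:
                return assign                  # all clauses satisfied
            clause = rng.choice(unsat)         # random unsatisfied clause
            if rng.random() < p:
                var = abs(rng.choice(clause))  # random-walk flip
            else:                              # greedy flip: maximize # satisfied clauses
                def gain(v):
                    assign[v] = not assign[v]
                    g = num_satisfied(assign)
                    assign[v] = not assign[v]
                    return g
                var = max((abs(lit) for lit in clause), key=gain)
            assign[var] = not assign[var]
    return None

# Example: (x1 v ~x2) ^ (x2 v x3) ^ (~x1 v ~x3)
print(walksat([[1, -2], [2, 3], [-1, -3]], num_vars=3))
```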
Overview
• Motivation
• Foundational areas
  • Probabilistic inference
  • Statistical learning
  • Logical inference
  • Inductive logic programming
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Rule Induction
• Given: set of positive and negative examples of some concept
  • Example: (x1, x2, …, xn, y)
  • y: concept (Boolean)
  • x1, x2, …, xn: attributes (assume Boolean)
• Goal: induce a set of rules that cover all positive examples and no negative ones
  • Rule: xa ^ xb ^ … ⇒ y (xa: literal, i.e., xi or its negation)
  • Same as Horn clause: Body ⇒ Head
  • Rule r covers example x iff x satisfies the body of r
• Eval(r): accuracy, info gain, coverage, support, etc.
Learning a Single Rule
head ← y
body ← Ø
repeat
  for each literal x
    r_x ← r with x added to body
    compute Eval(r_x)
  body ← body ^ best x
until no x improves Eval(r)
return r
Learning a Set of Rules
R ← Ø
S ← examples
repeat
  learn a single rule r
  R ← R ∪ { r }
  S ← S − positive examples covered by r
until S contains no positive examples
return R
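A compact Python sketch combining the two pseudocode boxes above (sequential covering with greedy single-rule growth); the dictionary-based example representation and the accuracy-based Eval are illustrative assumptions, not part of the tutorial.

```python
def learn_rules(examples, attributes, eval_rule):
    """Sequential covering: learn one rule, drop the positives it covers, repeat."""
    def covers(rule, ex):
        return all(ex[a] == v for a, v in rule)

    def learn_one_rule(exs):
        rule = []  # body starts empty
        while True:
            used = {a for a, _ in rule}
            candidates = [rule + [(a, v)]
                          for a in attributes if a not in used
                          for v in (True, False)]
            if not candidates:
                return rule
            best = max(candidates, key=lambda r: eval_rule(r, exs))
            if eval_rule(best, exs) <= eval_rule(rule, exs):
                return rule  # no literal improves Eval(r)
            rule = best

    rules, remaining = [], list(examples)
    while any(ex["y"] for ex in remaining):  # positive examples left to cover
        r = learn_one_rule(remaining)
        rules.append(r)
        remaining = [ex for ex in remaining if not (covers(r, ex) and ex["y"])]
    return rules

def accuracy(rule, exs):
    """Eval(r): fraction of covered examples that are positive."""
    covered = [ex for ex in exs if all(ex[a] == v for a, v in rule)]
    return sum(ex["y"] for ex in covered) / len(covered) if covered else 0.0

data = [{"a": True,  "b": True,  "y": True},
        {"a": True,  "b": False, "y": True},
        {"a": False, "b": True,  "y": False},
        {"a": False, "b": False, "y": False}]
print(learn_rules(data, ["a", "b"], accuracy))  # e.g. [[("a", True)]]
```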
First-Order Rule Induction
• y and the xi are now predicates with arguments. E.g.: y is Ancestor(x,y), xi is Parent(x,y)
• Literals to add are predicates or their negations
• Literal to add must include at least one variable already appearing in the rule
• Adding a literal changes # groundings of the rule. E.g.: Ancestor(x,z) ^ Parent(z,y) ⇒ Ancestor(x,y)
• Eval(r) must take this into account. E.g.: multiply by # positive groundings of the rule still covered after adding the literal
Overview
• Motivation
• Foundational areas
• Markov logic
• NLP applications
  • Basics
  • Supervised learning
  • Unsupervised learning
Markov Logic
• Syntax: weighted first-order formulas
• Semantics: feature templates for Markov networks
• Intuition: soften logical constraints
• Give each formula a weight (higher weight ⇒ stronger constraint)
Example: Coreference Resolution Barack Obama, the 44th President of the United States, is the first African American to hold the office. ……
Example: Coreference Resolution Two mention constants: A and B Apposition(A,B) Head(A,“President”) Head(B,“President”) MentionOf(A,Obama) MentionOf(B,Obama) Head(A,“Obama”) Head(B,“Obama”) Apposition(B,A)
Markov Logic Networks
• An MLN is a template for ground Markov nets
• Probability of a world x: P(x) = (1/Z) exp(Σ_i w_i n_i(x)), where w_i is the weight of formula i and n_i(x) is the no. of true groundings of formula i in x
• Typed variables and constants greatly reduce the size of the ground Markov net
• Functions, existential quantifiers, etc.
• Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
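To ground the formula P(x) = (1/Z) exp(Σ_i w_i n_i(x)) in code, here is a toy Python sketch with two hypothetical weighted formulas over two constants (in the spirit of the classic smokers example); it counts true groundings per formula and normalizes by enumerating every world, which is only feasible at this tiny scale.

```python
import itertools
import math

people = ["Anna", "Bob"]
atoms = ([("Smokes", (p,)) for p in people] +
         [("Cancer", (p,)) for p in people] +
         [("Friends", (a, b)) for a in people for b in people])

# Each weighted formula reports n_i(x): its number of true groundings in world x.
def n_smoking_causes_cancer(w):   # Smokes(x) => Cancer(x)
    return sum((not w[("Smokes", (p,))]) or w[("Cancer", (p,))] for p in people)

def n_friends_smoke_alike(w):     # Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not w[("Friends", (a, b))]) or
               (w[("Smokes", (a,))] == w[("Smokes", (b,))])
               for a in people for b in people)

formulas = [(1.5, n_smoking_causes_cancer), (1.1, n_friends_smoke_alike)]

def log_score(world):
    return sum(wt * n(world) for wt, n in formulas)  # sum_i w_i * n_i(x)

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
Z = sum(math.exp(log_score(w)) for w in worlds)      # only 2^8 worlds here

def prob(world):
    return math.exp(log_score(world)) / Z            # P(x) = exp(...) / Z

example = {a: False for a in atoms}
example[("Smokes", ("Anna",))] = True
print(prob(example))
```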
Relation to Statistical Models
• Special cases: Markov networks, Markov random fields, Bayesian networks, log-linear models, exponential models, max. entropy models, Gibbs distributions, Boltzmann machines, logistic regression, hidden Markov models, conditional random fields
• Obtained by making all predicates zero-arity
• Markov logic allows objects to be interdependent (non-i.i.d.)
Relation to First-Order Logic
• Infinite weights ⇒ first-order logic
• Satisfiable KB, positive weights ⇒ satisfying assignments = modes of the distribution
• Markov logic allows contradictions between formulas
MAP/MPE Inference
• Problem: find the most likely state of the world given evidence: arg max_y P(y | x), where y is the query and x the evidence