Markov Logic in Natural Language Processing Hoifung Poon Dept. of Computer Science & Eng. University of Washington
Overview • Motivation • Foundational areas • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
Holy Grail of NLP: Automatic Language Understanding • Map text to meaning • Natural language search • Answer questions • Knowledge discovery • ……
Reality: Increasingly Fragmented • Morphology • Tagging • Parsing • Semantics • Information Extraction • ……
Time for a New Synthesis? • Speed up progress • New opportunities to improve performance • But we need a new tool for this …
Languages Are Structural • Example words: "governments", "lm$pxtm" (according to their families)
Languages Are Structural • Morphology: "govern-ment-s", "l-m$px-t-m" (according to their families) • Syntax: parse tree for "IL-4 induces CD11B" (S → NP VP, VP → V NP) • Coreference: "George Walker Bush was the 43rd President of the United States. …… Bush was the eldest son of President G. H. W. Bush and Barbara Bush. …… In November 1977, he met Laura Welch at a barbecue." • Event extraction: "Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 ……" (nested events over involvement, activation, and up-regulation, with Theme, Cause, and Site arguments such as IL-10, gp41, human monocyte, p70(S6)-kinase)
Processing Is Complex • Morphology • POS Tagging • Chunking • Semantic Role Labeling • Syntactic Parsing • Coreference Resolution • Information Extraction • ……
Pipeline Is Suboptimal • Morphology → POS Tagging → Chunking → Semantic Role Labeling → Syntactic Parsing → Coreference Resolution → Information Extraction → ……
First-Order Logic • Main theoretical foundation of computer science • General language for describing complex structures and knowledge • Trees, graphs, dependencies, hierarchies, etc. easily expressed • Inference algorithms (satisfiability testing, theorem proving, etc.)
Languages Are Statistical • Paraphrases: "Microsoft buys Powerset", "Microsoft acquires Powerset", "Powerset is acquired by Microsoft Corporation", "The Redmond software giant buys Powerset", "Microsoft's purchase of Powerset, …", …… • Attachment ambiguity: "I saw the man with the telescope" (who has the telescope?) • Entity ambiguity: "Here in London, Frances Deek is a retired teacher …" vs. "In the Israeli town …, Karen London says … Now London says …" (London: PERSON or LOCATION?) • Coreference ambiguity: "G. W. Bush …… Laura Bush …… Mrs. Bush ……" (which one?)
Languages Are Statistical • Languages are ambiguous • Our information is always incomplete • We need to model correlations • Our predictions are uncertain • Statistics provides the tools to handle this
Probabilistic Graphical Models • Mixture models • Hidden Markov models • Bayesian networks • Markov random fields • Maximum entropy models • Conditional random fields • Etc.
The Problem • Logic is deterministic, requires manual coding • Statistical models assume i.i.d. data, objects = feature vectors • Historically, statistical and logical NLP have been pursued separately • We need to unify the two!
Also, Supervision Is Scarce • Supervised learning needs training examples • Tons of texts … but most are not annotated • Labeling is expensive (cf. the Penn Treebank) ⇒ Need to leverage indirect supervision
A Promising Solution: Statistical Relational Learning • Emerging direction in machine learning • Unifies logical and statistical approaches • Principal way to leverage direct and indirect supervision
Key: Joint Inference • Models complex interdependencies • Propagates information from more certain decisions to resolve ambiguities in others • Advantages: • Better and more intuitive models • Improve predictive accuracy • Compensate for lack of training examples • SRL can have even greater impact when direct supervision is scarce
Challenges in ApplyingStatistical Relational Learning • Learning is much harder • Inference becomes a crucial issue • Greater complexity for user
Progress to Date • Probabilistic logic [Nilsson, 1986] • Statistics and beliefs [Halpern, 1990] • Knowledge-based model construction[Wellman et al., 1992] • Stochastic logic programs [Muggleton, 1996] • Probabilistic relational models [Friedman et al., 1999] • Relational Markov networks [Taskar et al., 2002] • Etc. • This talk: Markov logic [Domingos & Lowd, 2009]
Markov Logic: A Unifying Framework • Probabilistic graphical models and first-order logic are special cases • Unified inference and learning algorithms • Easy-to-use software: Alchemy • Broad applicability • Goal of this tutorial: Quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications
Overview • Motivation • Foundational areas • Probabilistic inference • Statistical learning • Logical inference • Inductive logic programming • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
Markov Networks • Undirected graphical models, e.g., over the variables Smoking, Cancer, Asthma, Cough • Potential functions defined over cliques: P(x) = (1/Z) ∏_c Φ_c(x_c), with partition function Z = Σ_x ∏_c Φ_c(x_c)
Markov Networks • Log-linear model: P(x) = (1/Z) exp( Σ_i w_i f_i(x) ), where w_i is the weight of feature i and f_i(x) is feature i
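To make the log-linear form concrete, here is a minimal Python sketch (not part of the original tutorial; the variables, features, and weights are made-up examples) that computes P(x) = exp(Σ_i w_i f_i(x)) / Z by brute-force enumeration:

    import itertools
    import math

    # Toy Markov network over four binary variables (hypothetical example).
    VARS = ["Smoking", "Cancer", "Asthma", "Cough"]

    # Each feature f_i is a Boolean function of a complete assignment, paired with its weight w_i.
    FEATURES = [
        (1.5, lambda s: s["Smoking"] == s["Cancer"]),  # smoking and cancer tend to co-occur
        (0.8, lambda s: s["Asthma"] == s["Cough"]),    # asthma and cough tend to co-occur
    ]

    def score(state):
        # Unnormalized log-probability: sum_i w_i * f_i(x).
        return sum(w for w, f in FEATURES if f(state))

    def all_states():
        for values in itertools.product([False, True], repeat=len(VARS)):
            yield dict(zip(VARS, values))

    # Partition function Z = sum_x exp(sum_i w_i f_i(x)).
    Z = sum(math.exp(score(s)) for s in all_states())

    def probability(state):
        # P(x) = exp(sum_i w_i f_i(x)) / Z
        return math.exp(score(state)) / Z

    print(probability({"Smoking": True, "Cancer": True, "Asthma": False, "Cough": False}))

Enumerating Z is only feasible for tiny networks, which is why the next slides turn to approximate inference.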
Inference in Markov Networks • Goal: compute marginals & conditionals of P(X) = (1/Z) exp( Σ_i w_i f_i(X) ) • Exact inference is #P-complete • Conditioning on the Markov blanket is easy: P(x | MB(x)) = exp( Σ_i w_i f_i(x, MB(x)) ) / [ exp( Σ_i w_i f_i(x=0, MB(x)) ) + exp( Σ_i w_i f_i(x=1, MB(x)) ) ], where the sum ranges over the features involving x • Gibbs sampling exploits this
MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
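A compact Python sketch of this sampler, reusing the same kind of toy log-linear network as above (variables, weights, and features are invented for illustration):

    import math
    import random

    # Toy log-linear Markov network (hypothetical variables, weights, and features).
    VARS = ["Smoking", "Cancer", "Asthma", "Cough"]
    FEATURES = [
        (1.5, lambda s: s["Smoking"] == s["Cancer"]),
        (0.8, lambda s: s["Asthma"] == s["Cough"]),
    ]

    def score(state):
        return sum(w for w, f in FEATURES if f(state))

    def gibbs(query, num_samples=10000, burn_in=1000):
        # Estimate P(query = True) as the fraction of sampled states in which it is true.
        state = {v: random.random() < 0.5 for v in VARS}  # random truth assignment
        hits = 0
        for i in range(burn_in + num_samples):
            for x in VARS:
                # Only features involving x (its Markov blanket) actually matter here;
                # the full score is recomputed for brevity (the other terms cancel).
                state[x] = True
                p_true = math.exp(score(state))
                state[x] = False
                p_false = math.exp(score(state))
                state[x] = random.random() < p_true / (p_true + p_false)
            if i >= burn_in and state[query]:
                hits += 1
        return hits / num_samples

    print(gibbs("Cancer"))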
Other Inference Methods • Belief propagation (sum-product) • Mean field / Variational approximations
MAP/MPE Inference • Goal: Find the most likely state of the world given evidence: argmax_y P(y | x), where y is the query and x is the evidence
MAP Inference Algorithms • Iterated conditional modes • Simulated annealing • Graph cuts • Belief propagation (max-product) • LP relaxation
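As a minimal illustration of the first of these, a Python sketch of iterated conditional modes on a toy log-linear network (all names, weights, and the evidence are hypothetical):

    import random

    # Toy log-linear network (hypothetical weights and features).
    VARS = ["Smoking", "Cancer", "Asthma", "Cough"]
    FEATURES = [
        (1.5, lambda s: s["Smoking"] == s["Cancer"]),
        (0.8, lambda s: s["Asthma"] == s["Cough"]),
    ]

    def score(state):
        # Unnormalized log-probability; the MAP state maximizes this.
        return sum(w for w, f in FEATURES if f(state))

    def icm(evidence):
        # Iterated conditional modes: greedily set each non-evidence variable to the
        # value that maximizes the score, and repeat until nothing changes.
        state = dict(evidence)
        for v in VARS:
            state.setdefault(v, random.random() < 0.5)
        changed = True
        while changed:
            changed = False
            for v in VARS:
                if v in evidence:
                    continue
                best = max([True, False], key=lambda val: score({**state, v: val}))
                if best != state[v]:
                    state[v] = best
                    changed = True
        return state

    print(icm({"Smoking": True}))  # a local MAP optimum given the evidence Smoking = True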
Overview • Motivation • Foundational areas • Probabilistic inference • Statistical learning • Logical inference • Inductive logic programming • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
Generative Weight Learning • Maximize likelihood • Gradient: ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)] (no. of times feature i is true in the data minus expected no. of times feature i is true according to the model) • Use gradient ascent or L-BFGS • No local maxima • Requires inference at each step (slow!)
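A toy Python sketch of this learning loop, with expected counts computed exactly by enumeration (only feasible for very small models; the data and features are invented):

    import itertools
    import math

    # Tiny fully observed model (hypothetical variables, features, and data).
    VARS = ["Smoking", "Cancer"]
    features = [
        lambda s: s["Smoking"],
        lambda s: s["Smoking"] == s["Cancer"],
    ]
    weights = [0.0, 0.0]

    data = [
        {"Smoking": True,  "Cancer": True},
        {"Smoking": True,  "Cancer": True},
        {"Smoking": False, "Cancer": False},
    ]

    def counts(state):
        # n_i(x): how often each feature is true in this state (0 or 1 here).
        return [1.0 if f(state) else 0.0 for f in features]

    def all_states():
        return [dict(zip(VARS, vals)) for vals in itertools.product([False, True], repeat=len(VARS))]

    def expected_counts(w):
        # E_w[n_i(x)], computed exactly by enumeration -- this is the inference step
        # that makes each gradient update expensive in realistic models.
        states = all_states()
        scores = [math.exp(sum(wi * ci for wi, ci in zip(w, counts(s)))) for s in states]
        Z = sum(scores)
        exp_c = [0.0] * len(features)
        for s, sc in zip(states, scores):
            for i, c in enumerate(counts(s)):
                exp_c[i] += (sc / Z) * c
        return exp_c

    eta = 0.1
    for step in range(200):
        # Gradient of the average log-likelihood: observed counts minus expected counts.
        obs = [sum(counts(s)[i] for s in data) / len(data) for i in range(len(features))]
        exp_c = expected_counts(weights)
        weights = [w + eta * (o - e) for w, o, e in zip(weights, obs, exp_c)]

    print(weights)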
Pseudo-Likelihood • Likelihood of each variable given its neighbors in the data: PL(x) = ∏_i P(x_i | neighbors(x_i)) • Does not require inference at each step • Widely used in vision, spatial statistics, etc. • But PL parameters may not work well for long inference chains
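A short Python sketch of the pseudo-likelihood objective for a single assignment, on the same kind of toy log-linear model (weights and features are hypothetical); note that no partition function is needed:

    import math

    # Toy log-linear model (hypothetical variables, weights, and features).
    VARS = ["Smoking", "Cancer", "Asthma", "Cough"]
    FEATURES = [
        (1.5, lambda s: s["Smoking"] == s["Cancer"]),
        (0.8, lambda s: s["Asthma"] == s["Cough"]),
    ]

    def score(state):
        return sum(w for w, f in FEATURES if f(state))

    def log_pseudo_likelihood(state):
        # log PL(x) = sum_i log P(x_i | rest of x); each conditional only needs the
        # two values of one variable, so Z never has to be computed.
        total = 0.0
        for v in VARS:
            s_true = math.exp(score({**state, v: True}))
            s_false = math.exp(score({**state, v: False}))
            p_v = (s_true if state[v] else s_false) / (s_true + s_false)
            total += math.log(p_v)
        return total

    print(log_pseudo_likelihood({"Smoking": True, "Cancer": True, "Asthma": False, "Cough": False}))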
Discriminative Weight Learning • Maximize conditional likelihood of query (y) given evidence (x) • Gradient: ∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)] (no. of true groundings of clause i in the data minus expected no. of true groundings according to the model) • Approximate expected counts by counts in the MAP state of y given x
Voted Perceptron • Originally proposed for training HMMs discriminatively • Assumes network is linear chain • Can be generalized to arbitrary networks
w_i ← 0
for t ← 1 to T do
    y_MAP ← Viterbi(x)
    w_i ← w_i + η [ count_i(y_Data) − count_i(y_MAP) ]
return w_i / T
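A simplified Python sketch of this training loop; to keep it short, Viterbi is replaced by brute-force enumeration over a tiny tag set, and the features and data are invented:

    import itertools

    LABELS = ["DET", "NOUN", "VERB"]  # toy tag set

    def feature_counts(words, tags):
        # count_i(y): counts of word/tag and tag/tag features for a tagged sentence.
        counts = {}
        prev = "<S>"
        for w, t in zip(words, tags):
            for feat in (("word-tag", w, t), ("tag-tag", prev, t)):
                counts[feat] = counts.get(feat, 0) + 1
            prev = t
        return counts

    def predict(weights, words):
        # Stand-in for Viterbi: enumerate all tag sequences (only viable for tiny inputs).
        def total(tags):
            return sum(weights.get(f, 0.0) * c for f, c in feature_counts(words, tags).items())
        return list(max(itertools.product(LABELS, repeat=len(words)), key=total))

    def train(data, T=10, eta=1.0):
        weights, summed = {}, {}
        for _ in range(T):
            for words, gold in data:
                pred = predict(weights, words)
                if pred != gold:
                    # w_i <- w_i + eta * [count_i(y_data) - count_i(y_map)]
                    for f, c in feature_counts(words, gold).items():
                        weights[f] = weights.get(f, 0.0) + eta * c
                    for f, c in feature_counts(words, pred).items():
                        weights[f] = weights.get(f, 0.0) - eta * c
            for f, w in weights.items():
                summed[f] = summed.get(f, 0.0) + w
        return {f: w / T for f, w in summed.items()}  # averaged weights, as in w_i / T

    data = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"])]
    w = train(data)
    print(predict(w, ["the", "dog", "barks"]))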
Overview • Motivation • Foundational areas • Probabilistic inference • Statistical learning • Logical inference • Inductive logic programming • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
First-Order Logic • Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y) • Literal: Predicate or its negation • Clause: Disjunction of literals • Grounding: Replace all variables by constants. E.g.: Friends(Anna, Bob) • World (model, interpretation): Assignment of truth values to all ground predicates
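A small Python sketch of grounding: enumerate every ground atom of a few example predicates over a set of constants (the predicates and constants are illustrative only):

    import itertools

    constants = ["Anna", "Bob"]
    predicates = {"Smokes": 1, "Friends": 2}  # predicate name -> arity

    # Grounding: replace variables by constants in every possible way.
    ground_atoms = [
        "{}({})".format(pred, ",".join(args))
        for pred, arity in predicates.items()
        for args in itertools.product(constants, repeat=arity)
    ]
    print(ground_atoms)  # Smokes(Anna), Smokes(Bob), Friends(Anna,Anna), Friends(Anna,Bob), ...

    # A world (model, interpretation) assigns a truth value to every ground atom:
    world = {atom: False for atom in ground_atoms}
    world["Friends(Anna,Bob)"] = True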
Inference in First-Order Logic • Traditionally done by theorem proving(e.g.: Prolog) • Propositionalization followed by model checking turns out to be faster (often by a lot) • Propositionalization:Create all ground atoms and clauses • Model checking: Satisfiability testing • Two main approaches: • Backtracking (e.g.: DPLL) • Stochastic local search (e.g.: WalkSAT)
Satisfiability • Input: Set of clauses (convert KB to conjunctive normal form (CNF)) • Output: Truth assignment that satisfies all clauses, or failure • The paradigmatic NP-complete problem • Solution: Search • Key point: Most SAT problems are actually easy • Hard region: Narrow range of #Clauses / #Variables
Stochastic Local Search • Uses complete assignments instead of partial • Start with random state • Flip variables in unsatisfied clauses • Hill-climbing: Minimize # unsatisfied clauses • Avoid local minima: Random flips • Multiple restarts
The WalkSAT Algorithm
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if all clauses satisfied then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes # satisfied clauses
return failure
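A compact Python rendering of this procedure, using the common convention that a clause is a list of signed variable indices (the example clauses at the bottom are made up):

    import random

    def walksat(clauses, num_vars, max_tries=10, max_flips=1000, p=0.5):
        # Each clause is a list of nonzero ints: +v means variable v, -v its negation.
        def satisfied(clause, assign):
            return any(assign[abs(lit)] == (lit > 0) for lit in clause)

        for _ in range(max_tries):
            assign = {v: random.random() < 0.5 for v in range(1, num_vars + 1)}  # random assignment
            for _ in range(max_flips):
                unsat = [c for c in clauses if not satisfied(c, assign)]
                if not unsat:
                    return assign                     # all clauses satisfied
                clause = random.choice(unsat)         # a random unsatisfied clause
                if random.random() < p:
                    var = abs(random.choice(clause))  # random-walk move
                else:
                    # Greedy move: flip the variable in the clause that maximizes
                    # the number of satisfied clauses.
                    def n_sat_if_flipped(v):
                        assign[v] = not assign[v]
                        n = sum(satisfied(c, assign) for c in clauses)
                        assign[v] = not assign[v]
                        return n
                    var = max((abs(lit) for lit in clause), key=n_sat_if_flipped)
                assign[var] = not assign[var]
        return None                                   # failure

    # (x1 v x2) ^ (~x1 v x3) ^ (~x2 v ~x3)  -- toy instance
    print(walksat([[1, 2], [-1, 3], [-2, -3]], num_vars=3))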
Overview • Motivation • Foundational areas • Probabilistic inference • Statistical learning • Logical inference • Inductive logic programming • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
Rule Induction • Given: Set of positive and negative examples of some concept • Example: (x1, x2, … , xn, y) • y: concept (Boolean) • x1, x2, … , xn: attributes (assume Boolean) • Goal: Induce a set of rules that cover all positive examples and no negative ones • Rule: xa ∧ xb ∧ … ⇒ y (xa: literal, i.e., xi or its negation) • Same as Horn clause: Body ⇒ Head • Rule r covers example x iff x satisfies the body of r • Eval(r): Accuracy, info gain, coverage, support, etc.
Learning a Single Rule
head ← y
body ← Ø
repeat
    for each literal x
        r_x ← r with x added to body
        Eval(r_x)
    body ← body ∧ best x
until no x improves Eval(r)
return r
Learning a Set of Rules
R ← Ø
S ← examples
repeat
    learn a single rule r
    R ← R ∪ { r }
    S ← S − positive examples covered by r
until S = Ø
return R
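A toy Python sketch of the two procedures above over Boolean attribute vectors, using plain accuracy as Eval(r); the data are invented, and the sketch assumes each learned rule covers at least one remaining positive example:

    # Examples are (attribute tuple, label) pairs; a literal (i, True) means x_i,
    # and (i, False) means its negation.

    def covers(rule, x):
        return all(x[i] == pos for i, pos in rule)

    def eval_rule(rule, examples):
        # Eval(r): accuracy on the examples the rule covers (0 if it covers nothing).
        covered = [(x, y) for x, y in examples if covers(rule, x)]
        return sum(y for _, y in covered) / len(covered) if covered else 0.0

    def learn_single_rule(examples, n_attrs):
        rule = []                                      # body <- empty
        while True:
            candidates = [(i, pos) for i in range(n_attrs) for pos in (True, False)
                          if (i, pos) not in rule]
            best = max(candidates, key=lambda lit: eval_rule(rule + [lit], examples))
            if eval_rule(rule + [best], examples) <= eval_rule(rule, examples):
                return rule                            # no literal improves Eval(r)
            rule.append(best)                          # body <- body ^ best literal

    def learn_rule_set(examples, n_attrs):
        rules, remaining = [], list(examples)
        while any(y for _, y in remaining):            # positive examples left uncovered
            r = learn_single_rule(remaining, n_attrs)
            rules.append(r)
            remaining = [(x, y) for x, y in remaining  # S <- S - covered positives
                         if not (y and covers(r, x))]
        return rules

    data = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 1), 0), ((0, 1, 0), 0)]
    print(learn_rule_set(data, n_attrs=3))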
First-Order Rule Induction • y and the xi are now predicates with arguments. E.g.: y is Ancestor(x,y), xi is Parent(x,y) • Literals to add are predicates or their negations • Literal to add must include at least one variable already appearing in the rule • Adding a literal changes # groundings of the rule. E.g.: Ancestor(x,z) ∧ Parent(z,y) ⇒ Ancestor(x,y) • Eval(r) must take this into account. E.g.: Multiply by # positive groundings of the rule still covered after adding the literal
Overview • Motivation • Foundational areas • Markov logic • NLP applications • Basics • Supervised learning • Unsupervised learning
Markov Logic • Syntax: Weighted first-order formulas • Semantics: Feature templates for Markov networks • Intuition: Soften logical constraints • Give each formula a weight (Higher weight ⇒ Stronger constraint)
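To see how weights soften constraints, here is a small Python sketch (not Alchemy itself; the formulas, constants, and weights are invented) that scores each possible world by exp(Σ_i w_i n_i), where n_i is the number of true groundings of formula i:

    import itertools
    import math

    constants = ["Anna", "Bob"]

    # Weighted first-order formulas, written here as Python functions over one grounding.
    def smoking_causes_cancer(world, x):
        # Smokes(x) => Cancer(x)
        return (not world[("Smokes", x)]) or world[("Cancer", x)]

    def friends_smoke_alike(world, x, y):
        # Friends(x,y) => (Smokes(x) <=> Smokes(y))
        return (not world[("Friends", x, y)]) or (world[("Smokes", x)] == world[("Smokes", y)])

    formulas = [
        (1.5, smoking_causes_cancer, 1),  # (weight, formula, number of logical variables)
        (1.1, friends_smoke_alike, 2),
    ]

    atoms = ([("Smokes", c) for c in constants] + [("Cancer", c) for c in constants] +
             [("Friends", a, b) for a in constants for b in constants])

    def log_weight(world):
        # sum over formulas of weight * (number of true groundings in this world)
        return sum(w for w, f, n in formulas
                   for args in itertools.product(constants, repeat=n) if f(world, *args))

    worlds = [dict(zip(atoms, vals)) for vals in itertools.product([False, True], repeat=len(atoms))]
    Z = sum(math.exp(log_weight(w)) for w in worlds)

    # A world that violates Smokes(Anna) => Cancer(Anna) becomes less probable, not impossible:
    w1 = {a: False for a in atoms}
    w1[("Smokes", "Anna")] = True
    print(math.exp(log_weight(w1)) / Z)

Raising a formula's weight makes violating worlds exponentially less likely; in the limit of infinite weight the formula behaves like a hard logical constraint.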
Example: Coreference Resolution Barack Obama, the 44th President of the United States, is the first African American to hold the office. ……