Statistical Relational Learning for Knowledge Extraction from the Web
Hoifung Poon
Dept. of Computer Science & Eng., University of Washington

"Drowning in Information, Starved for Knowledge"
Great Vision: Knowledge Extraction from the Web
Craven et al., "Learning to Construct Knowledge Bases from the World Wide Web," Artificial Intelligence, 1999.
• Also need:
  • Knowledge representation and reasoning
  • Close the loop: Apply knowledge to extraction
• Machine reading [Etzioni et al., 2007]
Machine Reading: Text → Knowledge
Rapidly Growing Interest
• AAAI-07 Spring Symposium on Machine Reading
• DARPA Machine Reading Program (2009-2014)
• NAACL-10 Workshop on Learning By Reading
• Etc.
Great Impact
• Scientific inquiry and commercial applications
  • Literature-based discovery, robot scientists
  • Question answering, semantic search
  • Drug design, medical diagnosis
• Break the knowledge acquisition bottleneck for AI and natural language understanding
• Automatically semantify the Web
• Etc.
This Talk
• Statistical relational learning offers promising solutions to machine reading
• Markov logic is a leading unifying framework
• A success story: USP
  • Unsupervised, end-to-end machine reading
  • Extracts five times as many correct answers as the state of the art, with the highest accuracy (91%)
USP: Question-Answer Example
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Overview
• Machine reading: Challenges
• Statistical relational learning
• Markov logic
• USP: Unsupervised Semantic Parsing
• Research directions
Key Challenges
• Complexity
• Uncertainty
• Pipeline accumulates errors
• Supervision is scarce
Languages Are Structural
• governments  lm$pxtm (Hebrew: according to their families)
• IL-4 induces CD11B
• Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 ……
• George Walker Bush was the 43rd President of the United States. …… Bush was the eldest son of President G. H. W. Bush and Barbara Bush. …… In November 1977, he met Laura Welch at a barbecue.
Languages Are Structural (the same examples, analyzed)
• govern-ment-s  l-m$px-t-m (morphological segmentation)
• [Figure: syntactic parse tree for "IL-4 induces CD11B": S → NP VP; VP → V NP]
• [Figure: nested event structure for the gp41 sentence: involvement(Theme: up-regulation, Cause: activation); up-regulation(Theme: IL-10, Cause: gp41, Site: human monocyte); activation(Theme: p70(S6)-kinase)]
• [Figure: coreference links among the mentions in the Bush passage]
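To make the nested structure concrete, here is a toy rendering of the gp41 event as plain data; the role names come from the slide, while the dictionary encoding itself is only an illustration, not USP's internal representation:

```python
# Toy encoding of the slide's nested event structure (illustrative only).
event = {
    "predicate": "involvement",
    "Theme": {
        "predicate": "up-regulation",
        "Theme": "IL-10",
        "Cause": "gp41",
        "Site": "human monocyte",
    },
    "Cause": {
        "predicate": "activation",
        "Theme": "p70(S6)-kinase",
    },
}
```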
Knowledge Is Heterogeneous
• Individuals. E.g.: Socrates is a man
• Types. E.g.: Man is mortal
• Inference rules. E.g.: Syllogism
• Ontological relations
• Etc.
[Figure: ontology fragment: HUMAN ISA MAMMAL; EYE ISPART FACE]
Complexity
• Can handle using first-order logic
  • Trees, graphs, dependencies, hierarchies, etc. easily expressed
  • Inference algorithms (satisfiability testing, theorem proving, etc.)
• But … logic is brittle with uncertainty
Languages Are Ambiguous
• Paraphrase: Microsoft buys Powerset / Microsoft acquires Powerset / Powerset is acquired by Microsoft Corporation / The Redmond software giant buys Powerset / Microsoft's purchase of Powerset, …
• Attachment: I saw the man with the telescope. Does [with the telescope] attach to "the man" (NP) or to "saw" (ADVP)?
• Entity type: Here in London, Frances Deek is a retired teacher … In the Israeli town …, Karen London says … Now London says … London: PERSON or LOCATION?
• Coreference: G. W. Bush …… Laura Bush …… Mrs. Bush …… Which one?
Knowledge Has Uncertainty
• We need to model correlations
• Our information is always incomplete
• Our predictions are uncertain
Uncertainty
• Statistics provides the tools to handle this
  • Mixture models
  • Hidden Markov models
  • Bayesian networks
  • Markov random fields
  • Maximum entropy models
  • Conditional random fields
  • Etc.
• But … statistical models assume i.i.d. (independent and identically distributed) data: objects are reduced to feature vectors
Pipeline is Suboptimal
• E.g., NLP pipeline: Tokenization → Morphology → Chunking → Syntax → …
• Accumulates and propagates errors
• Wanted: Joint inference
  • Across all processing stages
  • Among all interdependent objects
Supervision is Scarce
• Tons of text … but most is not annotated
• Labeling is expensive (cf. the Penn Treebank)
• Need to leverage indirect supervision
Redundancy
• Key source of indirect supervision
• State-of-the-art systems depend on this. E.g., TextRunner [Banko et al., 2007]
• But … the Web is heterogeneous: Long tail
• Redundancy is only present in the head regime
Overview
• Machine reading: Challenges
• Statistical relational learning
• Markov logic
• USP: Unsupervised Semantic Parsing
• Research directions
Statistical Relational Learning
• Burgeoning field in machine learning
• Offers promising solutions for machine reading
• Unifies statistical and logical approaches
• Replaces the pipeline with joint inference
• Principled framework to leverage both direct and indirect supervision
Machine Reading: A Vision
Challenge: Long tail
Challenges in Applying Statistical Relational Learning
• Learning is much harder
• Inference becomes a crucial issue
• Greater complexity for the user
Progress to Date
• Probabilistic logic [Nilsson, 1986]
• Statistics and beliefs [Halpern, 1990]
• Knowledge-based model construction [Wellman et al., 1992]
• Stochastic logic programs [Muggleton, 1996]
• Probabilistic relational models [Friedman et al., 1999]
• Relational Markov networks [Taskar et al., 2002]
• Markov logic [Domingos & Lowd, 2009]: the leading unifying framework
• Etc.
Overview
• Machine reading
• Statistical relational learning
• Markov logic
• USP: Unsupervised Semantic Parsing
• Research directions
Markov Networks
• Undirected graphical models
[Figure: example network over Smoking, Cancer, Asthma, Cough]
• Log-linear model: P(x) = (1/Z) exp( Σ_i w_i f_i(x) ), where w_i is the weight of feature i and f_i(x) is feature i
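As a concrete illustration of the log-linear form, here is a minimal sketch over the slide's four variables; the two features and their weights are invented for the example, and the brute-force normalizer is only tractable for toy models:

```python
import itertools
import math

VARS = ["Smoking", "Cancer", "Asthma", "Cough"]

# Hypothetical features, chosen only for illustration.
def features(x):
    return [
        1.0 if x["Smoking"] and x["Cancer"] else 0.0,  # f_1: smoking with cancer
        1.0 if x["Asthma"] and x["Cough"] else 0.0,    # f_2: asthma with cough
    ]

weights = [1.5, 1.1]

def unnormalized(x):
    # exp(sum_i w_i f_i(x))
    return math.exp(sum(w * f for w, f in zip(weights, features(x))))

# Z sums over all 2^4 joint assignments (tractable only for tiny models).
worlds = [dict(zip(VARS, vals))
          for vals in itertools.product([False, True], repeat=4)]
Z = sum(unnormalized(x) for x in worlds)

x = {"Smoking": True, "Cancer": True, "Asthma": False, "Cough": False}
print(unnormalized(x) / Z)  # P(x) = exp(sum_i w_i f_i(x)) / Z
```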
First-Order Logic
• Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x,y)
• Grounding: Replace all variables by constants. E.g.: Friends(Anna, Bob)
• World (model, interpretation): Assignment of truth values to all ground predicates
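A minimal sketch of grounding, assuming a toy domain of two constants (the predicate set here is invented for illustration):

```python
from itertools import product

constants = ["Anna", "Bob"]
predicates = {"Smokes": 1, "Friends": 2}  # predicate name -> arity

# Replace every variable with every combination of constants.
ground_atoms = [
    f"{pred}({','.join(args)})"
    for pred, arity in predicates.items()
    for args in product(constants, repeat=arity)
]
print(ground_atoms)
# ['Smokes(Anna)', 'Smokes(Bob)', 'Friends(Anna,Anna)', 'Friends(Anna,Bob)', ...]

# A "world" assigns a truth value to every ground atom.
world = {atom: False for atom in ground_atoms}
world["Friends(Anna,Bob)"] = True
```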
Markov Logic
• Intuition: Soften logical constraints
• Syntax: Weighted first-order formulas
• Semantics: Feature templates for Markov networks
• A Markov Logic Network (MLN) is a set of pairs (F_i, w_i), where
  • F_i is a formula in first-order logic
  • w_i is a real number
• P(x) = (1/Z) exp( Σ_i w_i n_i(x) ), where n_i(x) is the number of true groundings of F_i in world x
Example: Friends & Smokers Probabilistic graphical models andfirst-order logic are special cases Two constants: Anna (A) and Bob (B) Friends(A,B) Friends(A,A) Smokes(A) Smokes(B) Friends(B,B) Cancer(A) Cancer(B) Friends(B,A) 35
Efficient Inference
• Logical or statistical inference alone is already hard
• But … approximate inference suffices to perform well in most cases
• Combine ideas from both camps. E.g., MC-SAT = MCMC + SAT solver
• Can also leverage sparsity in relational domains
More: Poon & Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies", in Proc. AAAI-2006.
More: Poon, Domingos & Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC", in Proc. AAAI-2008.
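A much-simplified sketch of the MC-SAT loop; for clarity, the near-uniform SAT-sampling step is replaced with naive rejection sampling, which only works for tiny models (the real algorithm uses a SAT sampler such as SampleSAT):

```python
import math
import random

def mc_sat(atoms, clauses, n_steps):
    """atoms: list of names; clauses: list of (weight, test) where test
    maps a truth assignment to True/False. Returns sampled worlds."""
    state = {a: random.random() < 0.5 for a in atoms}
    samples = []
    for _ in range(n_steps):
        # Keep each currently satisfied clause with probability 1 - exp(-w);
        # the next state must satisfy every kept clause.
        kept = [test for w, test in clauses
                if test(state) and random.random() < 1.0 - math.exp(-w)]
        while True:  # rejection sampling stands in for SampleSAT
            state = {a: random.random() < 0.5 for a in atoms}
            if all(test(state) for test in kept):
                break
        samples.append(dict(state))
    return samples
```

Averaging clause counts over the returned samples gives the expectations needed for weight learning below.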
Weight Learning
• Probability model P(X)
• X: Observable in training data
• Maximize likelihood of observed data
• Regularization to prevent overfitting
Weight Learning
• Gradient descent: ∂/∂w_i log P(x) = n_i(x) − E[n_i(x)]
  • n_i(x): number of times clause i is true in the data
  • E[n_i(x)]: expected number of times clause i is true according to the MLN
• Requires inference: use MC-SAT to compute the expectation
• Can also leverage second-order information [Lowd & Domingos, 2007]
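A minimal sketch of this update rule; `n_data` and `expected_counts` are illustrative names, with the latter standing in for MC-SAT-based estimation of E[n_i(x)]:

```python
def learn_weights(n_data, expected_counts, n_iters=100, lr=0.01):
    """n_data[i]: count of true groundings of clause i in the training data.
    expected_counts(weights): E[n_i] under the current MLN, e.g. averaged
    clause counts over MC-SAT samples."""
    weights = [0.0] * len(n_data)
    for _ in range(n_iters):
        expected = expected_counts(weights)   # requires inference
        weights = [w + lr * (nd - e)          # d log P(x) / d w_i
                   for w, nd, e in zip(weights, n_data, expected)]
    return weights
```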
Unsupervised Learning: How?
• I.I.D. learning: A more sophisticated model requires more labeled data
• Statistical relational learning: A more sophisticated model may require less labeled data
  • Ambiguity varies among objects
  • Joint inference propagates information from unambiguous objects to ambiguous ones
• One formula is worth a thousand labels
• Small amount of domain knowledge → large-scale joint inference
Unsupervised Weight Learning
• Probability model P(X, Z)
  • X: Observed in training data
  • Z: Hidden variables
• E.g., clustering with mixture models
  • Z: Cluster assignment
  • X: Observed features
• Maximize likelihood of observed data by summing out the hidden variables Z
Unsupervised Weight Learning
• Gradient descent: ∂/∂w_i log P(x) = E_{Z|x}[n_i] − E_{X,Z}[n_i]
  • E_{Z|x}[n_i]: sums over z, conditioned on the observed x
  • E_{X,Z}[n_i]: sums over both x and z
• Use MC-SAT to compute both expectations
• May also combine with contrastive estimation
More: Poon, Cherry, & Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models", in Proc. NAACL-2009. (Best Paper Award)
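The supervised sketch above adapts directly to this hidden-variable gradient; both helpers here are illustrative stand-ins for MC-SAT runs, one with the observed x clamped and one over the full joint:

```python
def unsup_gradient(weights, cond_counts, full_counts):
    """cond_counts(w): E_{Z|x}[n_i], with x clamped to the observed data.
    full_counts(w): E_{X,Z}[n_i], sampling both x and z."""
    c = cond_counts(weights)
    f = full_counts(weights)
    return [ci - fi for ci, fi in zip(c, f)]
```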
Markov Logic
• Unified inference and learning algorithms
• Can handle millions of variables, billions of features, tens of thousands of parameters
• Easy-to-use software: Alchemy
• Many successful applications. E.g.: information extraction, coreference resolution, semantic parsing, ontology induction
Pipeline → Joint Inference
• Combine segmentation and entity resolution for information extraction
• Extract complex and nested bio-events from PubMed abstracts
More: Poon & Domingos, "Joint Inference for Information Extraction", in Proc. AAAI-2007.
More: Poon & Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", in Proc. NAACL-2010.
Unsupervised Learning: Example
• Coreference resolution: Accuracy comparable to the previous supervised state of the art
More: Poon & Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
Overview
• Machine reading: Challenges
• Statistical relational learning
• Markov logic
• USP: Unsupervised Semantic Parsing
• Research directions
Unsupervised Semantic Parsing
• USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
  • First unsupervised approach for semantic parsing
  • End-to-end machine reading system: Read text, answer questions
• OntoUSP = USP + Ontology Induction [Poon & Domingos, ACL-10]
• Encoded in a few Markov logic formulas
Semantic Parsing
• Goal: Map sentences to logical form. E.g.: Microsoft buys Powerset ⇒ BUY(MICROSOFT, POWERSET)
• Challenge: Many surface forms with the same meaning
  • Microsoft buys Powerset
  • Microsoft acquires semantic search engine Powerset
  • Powerset is acquired by Microsoft Corporation
  • The Redmond software giant buys Powerset
  • Microsoft's purchase of Powerset, …
Limitations of Existing Approaches
• Manual grammar or supervised learning: Applicable to restricted domains only
• For general text:
  • Not clear what predicates and objects to use
  • Hard to produce consistent meaning annotations
• Also, often learn both syntax and semantics:
  • Fail to leverage advanced syntactic parsers
  • Make semantic parsing harder
USP: Key Idea #1
• Target predicates and objects can be learned
• Viewed as clusters of syntactic or lexical variations of the same meaning
  • BUY(-,-): buys, acquires, 's purchase of, … (cluster of various expressions for acquisition)
  • MICROSOFT: Microsoft, the Redmond software giant, … (cluster of various mentions of Microsoft)
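A toy sketch of the idea: once expressions are clustered, every paraphrase maps to the same predicate or object symbol. The clusters below are hand-written for illustration; USP learns them from text without supervision:

```python
# Hand-written clusters standing in for what USP learns (illustrative only).
clusters = {
    "BUY": {"buys", "acquires", "'s purchase of", "is acquired by"},
    "MICROSOFT": {"Microsoft", "Microsoft Corporation",
                  "the Redmond software giant"},
    "POWERSET": {"Powerset", "semantic search engine Powerset"},
}

def canonicalize(phrase):
    """Map a surface phrase to its cluster label, if it has one."""
    for label, variants in clusters.items():
        if phrase in variants:
            return label
    return phrase

# All of the slide's paraphrases collapse to the same logical form.
subj, verb, obj = "the Redmond software giant", "acquires", "Powerset"
print(f"{canonicalize(verb)}({canonicalize(subj)}, {canonicalize(obj)})")
# -> BUY(MICROSOFT, POWERSET)
```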