Representation, Inference and Learning in Relational Probabilistic Languages • Lise Getoor, University of Maryland, College Park • Avi Pfeffer, Harvard University • IJCAI 2005 Tutorial
Introduction • Probability is good • First-order representations are good • Variety of approaches that combine them • We won’t cover all of them in detail • apologies if we leave out your favorite • We will cover three broad classes of approaches, and present exemplars of each approach • We will highlight common issues, themes, and techniques that recur in different approaches
Running Example • There are papers, researchers, citations, reviews… • Papers have a quality, and may or may not be accepted • Authors may be smart and good writers • Papers have topics, and cite other papers which may or may not be on the same topic • Papers are reviewed by reviewers, who have moods that are influenced by the quality of the writing
Some Queries • What is the probability that a researcher is famous, given that one of her papers was accepted despite the fact that a reviewer was in a bad mood? • What is the probability that a paper is accepted, given that another paper by the same author is accepted? • What is the probability that a paper is an AI paper, given that it is cited by an AI paper? • What is the probability that a student of a famous advisor has seven high quality papers?
Sample Domains • Web Pages and Link Analysis • Battlespace Awareness • Epidemiological Studies • Citation Networks • Communication Networks (Cellphone Fraud Detection) • Intelligence Analysis (Terrorist Networks) • Financial Transactions (Money Laundering) • Computational Biology • Object Recognition and Scene Analysis • Natural Language Processing (e.g. Information Extraction and Semantic Parsing)
Roadmap • Motivation • Background: Bayesian network inference and learning • Rule-based Approaches • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches
Bayesian Networks [Pearl 87] • nodes = domain variables: Smart, Good Writer, Reviewer Mood, Quality, Review Length, Accepted • edges = direct causal influence • Each node has a conditional probability table (CPT), e.g. P(Q | W, S), with one row per joint assignment to W and S; the rows shown are (0.6, 0.4), (0.3, 0.7), (0.4, 0.6), (0.1, 0.9) • Network structure encodes conditional independencies: I(Review-Length, Good-Writer | Reviewer-Mood)
BN Semantics (variables S, W, M, Q, L, A) • conditional independencies in BN structure + local CPTs = full joint distribution over domain • Compact & natural representation: • if each of the n nodes has at most k parents: O(2^k n) vs. O(2^n) params • natural parameters
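The parameter saving can be checked with a little arithmetic. The sketch below assumes binary variables; the function names and the bound k = 2 are illustrative choices, not taken from the slides (the only CPT shown, P(Q | W, S), has two parents).

```python
def cpt_entries(n, k):
    """Upper bound on CPT rows for n binary nodes with at most k parents each."""
    return n * 2 ** k


def full_joint_entries(n):
    """Number of entries in the full joint table over n binary variables."""
    return 2 ** n


# Six variables (S, W, M, Q, L, A), assuming at most two parents per node.
print(cpt_entries(6, 2), "vs.", full_joint_entries(6))
```

The gap widens quickly: the CPT bound grows linearly in n for fixed k, while the full joint table grows exponentially in n.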
Variable Elimination [Zhang & Poole 96, Dechter 98] • Computes a query probability by manipulating factors • A factor is a function from values of variables to nonnegative real numbers • Example factor over (mood, good writer): (pissy, false) → 1; (pissy, true) → 0; (good, false) → 0.7; (good, true) → 0.3
Variable Elimination • To compute the query, eliminate the non-query variables one at a time: multiply together the factors that mention the variable, then sum it out to obtain a new factor • e.g. sum out l to obtain a new factor • then multiply the factors that mention w together and sum out w to obtain a new factor • continue until only factors over the query variable remain
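The elimination steps (sum out l, then multiply the factors mentioning w and sum out w) can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation: the three-variable chain w → m → l and all of its numbers are assumptions made up for the example, and `Factor`, `multiply`, `sum_out`, and `eliminate` are hypothetical helper names.

```python
from itertools import product


class Factor:
    """Maps each joint assignment of its variables to a nonnegative number."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)  # keys: value tuples aligned with self.variables


def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their variables."""
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    table = {}
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        f_key = tuple(assignment[v] for v in f.variables)
        g_key = tuple(assignment[v] for v in g.variables)
        table[values] = f.table[f_key] * g.table[g_key]
    return Factor(variables, table)


def sum_out(f, var):
    """Sum a variable out of a factor, yielding a factor over the rest."""
    idx = f.variables.index(var)
    table = {}
    for values, weight in f.table.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + weight
    return Factor(f.variables[:idx] + f.variables[idx + 1:], table)


def eliminate(factors, order, domains):
    """Eliminate the variables in `order`; return the product of what remains."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f.variables]
        if not touching:
            continue
        rest = [f for f in factors if var not in f.variables]
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f, domains)
        factors = rest + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result


# Hypothetical chain w -> m -> l over booleans (numbers are made up).
domains = {"w": (True, False), "m": (True, False), "l": (True, False)}
p_w = Factor(["w"], {(True,): 0.6, (False,): 0.4})
p_m_given_w = Factor(["w", "m"], {(True, True): 0.8, (True, False): 0.2,
                                  (False, True): 0.3, (False, False): 0.7})
p_l_given_m = Factor(["m", "l"], {(True, True): 0.4, (True, False): 0.6,
                                  (False, True): 0.9, (False, False): 0.1})

# Query P(m): sum out l first, then multiply the w-factors and sum out w.
marginal = eliminate([p_w, p_m_given_w, p_l_given_m], ["l", "w"], domains)
print(marginal.variables, marginal.table)
```

With these made-up numbers the result is approximately P(m = true) = 0.6. Summing out l first produces a factor of ones over m, which shows why variables downstream of the query can be dropped cheaply.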
Some Other Inference Algorithms • Exact • Junction Tree [Lauritzen & Spiegelhalter 88] • Cutset Conditioning [Pearl 87] • Approximate • Loopy Belief Propagation [McEliece et al 98] • Likelihood Weighting [Shwe & Cooper 91] • Markov Chain Monte Carlo [eg MacKay 98] • Gibbs Sampling [Geman & Geman 84] • Metropolis-Hastings [Metropolis et al 53, Hastings 70] • Variational Methods [Jordan et al 98] • etc.
Parameter Estimation in BNs • Assume known dependency structure G • Goal: estimate BN parameters θ • entries in local probability models • θ is good if it’s likely to generate observed data • MLE Principle: choose θ* so as to maximize the likelihood of the observed data • Alternative: incorporate a prior
Learning With Complete Data • Fully observed data: data consists of a set of instances, each with a value for all BN variables • With fully observed data, we can compute N[q, w, s] = number of instances with Q = q, W = w, and S = s • and similarly for other counts • We then estimate P(q | w, s) = N[q, w, s] / N[w, s]
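The counting estimate for complete data can be sketched as follows. The helper name `mle_cpt` and the four training instances are hypothetical, chosen only to illustrate estimating a CPT such as P(Q | W, S) by dividing counts.

```python
from collections import Counter


def mle_cpt(instances, child, parents):
    """Maximum-likelihood CPT: count(child, parents) / count(parents)."""
    joint_counts = Counter()
    parent_counts = Counter()
    for instance in instances:
        u = tuple(instance[p] for p in parents)
        joint_counts[(instance[child], u)] += 1
        parent_counts[u] += 1
    return {(x, u): n / parent_counts[u] for (x, u), n in joint_counts.items()}


# Made-up, fully observed instances over W (good writer), S (smart), Q (quality).
data = [
    {"w": True, "s": True, "q": True},
    {"w": True, "s": True, "q": True},
    {"w": True, "s": True, "q": False},
    {"w": False, "s": True, "q": False},
]

cpt = mle_cpt(data, child="q", parents=["w", "s"])
print(cpt[(True, (True, True))])  # estimate of P(q = true | w = true, s = true)
```

Parent configurations that never occur in the data get no entry at all, which is one practical reason to incorporate a prior.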
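Learning with Missing Data: Expectation-Maximization (EM) • Can’t compute the counts directly, because some values are unobserved • But: • Given parameter values, can compute expected counts, e.g. E[N[q, w, s]] = sum over instances d of P(q, w, s | observed part of d, θ) (this requires BN inference) • Given expected counts, estimate parameters, e.g. P(q | w, s) = E[N[q, w, s]] / E[N[w, s]] • Begin with arbitrary parameter values • Iterate these two steps • Converges to local maximum of likelihood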
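The two alternating steps can be sketched on the smallest possible case. Everything here is an assumption made for illustration: a two-node network W → Q over booleans, a dataset in which W is sometimes missing (None) and Q is always observed, and the function name `em`. Inference in the E-step is just Bayes' rule because the network is so small.

```python
import math


def em(data, p_w, p_q_given_w, iterations=20):
    """EM for W -> Q. data: list of (w, q), where w may be None (missing).

    p_w is P(W = true); p_q_given_w[w] is P(Q = true | W = w).
    Returns the final parameters and the log-likelihood seen at each iteration.
    """
    history = []
    for _ in range(iterations):
        # E-step: expected counts under the current parameters.
        n_w = {True: 0.0, False: 0.0}         # expected count of W = w
        n_w_q_true = {True: 0.0, False: 0.0}  # expected count of W = w, Q = true
        log_likelihood = 0.0
        for w, q in data:
            joint = {}
            for value in (True, False):
                prior = p_w if value else 1.0 - p_w
                cond = p_q_given_w[value] if q else 1.0 - p_q_given_w[value]
                joint[value] = prior * cond
            if w is None:
                total = joint[True] + joint[False]
                posterior = {v: joint[v] / total for v in (True, False)}
                log_likelihood += math.log(total)
            else:
                posterior = {w: 1.0, (not w): 0.0}
                log_likelihood += math.log(joint[w])
            for value in (True, False):
                n_w[value] += posterior[value]
                if q:
                    n_w_q_true[value] += posterior[value]
        history.append(log_likelihood)
        # M-step: re-estimate parameters from the expected counts.
        p_w = n_w[True] / (n_w[True] + n_w[False])
        p_q_given_w = {v: n_w_q_true[v] / n_w[v] for v in (True, False)}
    return p_w, p_q_given_w, history


# Made-up data: six complete instances and three with W missing.
data = [(True, True), (True, True), (True, False),
        (False, False), (False, False), (False, True),
        (None, True), (None, True), (None, False)]

p_w, p_q_given_w, history = em(data, p_w=0.5,
                               p_q_given_w={True: 0.6, False: 0.4})
print(p_w, p_q_given_w, history[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property behind convergence to a local maximum.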
Structure search • Begin with an empty network • Consider all acyclic neighbors reached by a search operator: • add an edge • remove an edge • reverse an edge • For each neighbor s: • compute ML parameter values • compute score(s) • Choose the neighbor with the highest score • Continue until we reach a local maximum
Limitations of BNs • Inability to generalize across collection of individuals within a domain • if you want to talk about multiple individuals in a domain, you have to talk about each one explicitly, with its own local probability model • Domains have fixed structure: e.g. one author, one paper and one reviewer • if you want to talk about domains with multiple inter-related individuals, you have to create a special purpose network for the domain • For learning, all instances have to have the same set of entities
First Order Approaches • Advantages of first order probabilistic models • represent world in terms of individuals and relationships between them • ability to generalize about many instances in same domain • allow compact parameterization • support reasoning about general classes of individuals rather than the individuals themselves • allow representation of high level structure, in which objects interact weakly with each other
Three Different Approaches • Rule-based approaches focus on facts • what is true in the world? • what facts do other facts depend on? • Frame-based approaches focus on objects and relationships • what types of objects are there, and how are they related to each other? • how does a property of an object depend on other properties (of the same or other objects)? • Programming language approaches focus on processes • how is the world generated? • how does one event influence another event?
Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches
Flavors • Goldman & Charniak [93] • Breese [92] • Probabilistic Horn Abduction [Poole 93] • Probabilistic Logic Programming [Ngo & Haddawy 96] • Relational Bayesian Networks [Jaeger 97] • Bayesian Logic Programs [Kersting & de Raedt 00] • Stochastic Logic Programs [Muggleton 96] • PRISM [Sato & Kameya 97] • CLP(BN) [Costa et al. 03] • etc.
Intuitive Approach • In logic programming, accepted(P) :- author(P,A), famous(A). means: for all P, A, if A is the author of P and A is famous, then P is accepted • This is a categorical inference • But this will not be true in many cases
Fudge Factors • Use accepted(P) :- author(P,A), famous(A). (0.6) • This means: for all P, A, if A is the author of P and A is famous, then P is accepted with probability 0.6 • But what does this mean when there are other possible causes of a paper being accepted? • e.g. accepted(P) :- high_quality(P). (0.8)
Intuitive Meaning • accepted(P) :- author(P,A), famous(A). (0.6) means: for all P, A, if A is the author of P and A is famous, then P is accepted with probability 0.6, provided no other possible cause of the paper being accepted holds • If more than one possible cause holds, a combining rule is needed to combine the probabilities
Meaning of Disjunction • In logic programming, accepted(P) :- author(P,A), famous(A). accepted(P) :- high_quality(P). means: for all P, A, if A is the author of P and A is famous, or if P is high quality, then P is accepted
Intuitive Meaning of Probabilistic Disjunction • For us, accepted(P) :- author(P,A), famous(A). (0.6) accepted(P) :- high_quality(P). (0.8) means: for all P, A, if (A being the author of P and A being famous successfully causes P to be accepted) or (P being high quality successfully causes P to be accepted), then P is accepted • If A is the author of P and A is famous, this successfully causes P to be accepted with probability 0.6 • If P is high quality, this successfully causes P to be accepted with probability 0.8
Noisy-Or • Multiple possible causes of an effect • Each cause, if it is true, successfully causes the effect with a given probability • Effect is true if any of the possible causes is true and successfully causes it • All causes act independently to produce the effect (causal independence) • Note: accepted(P) :- author(P,A), famous(A). (0.6) may produce multiple possible causes for different values of A • Leak probability: effect may happen with no cause • e.g. accepted(P). (0.1)
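Noisy-Or • Figure: accepted(p1) is a noisy-or of three possible causes: author(p1,alice) and famous(alice), with probability 0.6; author(p1,bob) and famous(bob), with probability 0.6; high_quality(p1), with probability 0.8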
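Computing Noisy-Or Probabilities • What is P(accepted(p1)) given that Alice is an author and Alice is famous, and that the paper is high quality, but no other possible cause is true? • P(accepted(p1)) = 1 − (1 − 0.6)(1 − 0.8)(1 − 0.1) = 0.928, where the last factor, (1 − 0.1), accounts for the leak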
Combination Rules • Other combination rules are possible • E.g. max • In our case, P(accepted(p1)) = max {0.6,0.8,0.1} = 0.8 • Harder to interpret in terms of logic program
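Both combination rules are short enough to write down directly. In this sketch the function names `noisy_or` and `max_rule` are hypothetical; the probabilities 0.6, 0.8 and the leak 0.1 come from the running example.

```python
from math import prod


def noisy_or(active_cause_probs, leak=0.0):
    """The effect fails only if the leak and every active cause all fail."""
    return 1.0 - (1.0 - leak) * prod(1.0 - p for p in active_cause_probs)


def max_rule(active_cause_probs, leak=0.0):
    """Alternative combination rule: take the strongest single cause."""
    return max([leak, *active_cause_probs])


# Alice is a famous author (0.6), the paper is high quality (0.8), leak is 0.1.
print(noisy_or([0.6, 0.8], leak=0.1))  # approximately 0.928
print(max_rule([0.6, 0.8], leak=0.1))  # 0.8
```

Noisy-or is never smaller than max on the same inputs, since every additional active cause can only lower the probability that all causes fail.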
Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches
Knowledge-Based Model Construction (KBMC) • Construct a Bayesian network, given a query Q and evidence E • query and evidence are sets of ground atoms, i.e., predicates with no variable symbols • e.g. author(p1,alice) • Construct network by searching for possible proofs of the query and evidence variables • Use standard BN inference techniques on constructed network
KBMC Example • smart(alice). (0.8) • smart(bob). (0.9) • author(p1,alice). (0.7) • author(p1,bob). (0.3) • high_quality(P) :- author(P,A), smart(A). (0.5) • high_quality(P). (0.1) • accepted(P) :- high_quality(P). (0.9) • Query is accepted(p1). • Evidence is smart(bob).
Backward Chaining • Start with evidence variable smart(bob) • Network so far: smart(bob)
Backward Chaining • Rule for smart(bob) has no antecedents – stop backward chaining • Network so far: smart(bob)
Backward Chaining • Begin with query variable accepted(p1) • Network so far: smart(bob), accepted(p1)
Backward Chaining • Rule for accepted(p1) has antecedent high_quality(p1) – add high_quality(p1) to network, and make it a parent of accepted(p1) • Network so far: smart(bob), high_quality(p1), accepted(p1)
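Backward Chaining • All of accepted(p1)’s parents have been found – create its conditional probability table (CPT) • From the rule accepted(P) :- high_quality(P). (0.9): P(accepted(p1) = true | high_quality(p1) = true) = 0.9 and P(accepted(p1) = false | high_quality(p1) = true) = 0.1 • With no other cause or leak: P(accepted(p1) = true | high_quality(p1) = false) = 0 and P(accepted(p1) = false | high_quality(p1) = false) = 1 • Network so far: smart(bob), high_quality(p1), accepted(p1)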
Backward Chaining • high_quality(p1) :- author(p1,A), smart(A) has two groundings: A=alice and A=bob • Network so far: smart(bob), high_quality(p1), accepted(p1)
Backward Chaining • For grounding A=alice, add author(p1,alice) and smart(alice) to network, and make them parents of high_quality(p1) • Network so far: smart(alice), smart(bob), author(p1,alice), high_quality(p1), accepted(p1)
Backward Chaining • For grounding A=bob, add author(p1,bob) to network; smart(bob) is already in network • Make both parents of high_quality(p1) • Network so far: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)
Backward Chaining • Create CPT for high_quality(p1) – make noisy-or, and don’t forget leak probability • Network so far: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)
Backward Chaining • author(p1,alice), smart(alice) and author(p1,bob) have no antecedents – stop backward chaining • Network so far: smart(alice), smart(bob), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)
Backward Chaining • Assert evidence smart(bob) = true, and compute P(accepted(p1) | smart(bob) = true) • Final network: smart(alice), smart(bob) (observed true), author(p1,alice), author(p1,bob), high_quality(p1), accepted(p1)
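The final query on the constructed network can be checked by brute-force enumeration. This sketch takes its numbers from the KBMC example program (fact probabilities, rule weight 0.5, leak 0.1, acceptance weight 0.9) and assumes the noisy-or reading of the two high_quality rules; the names `priors`, `p_high_quality`, and `query` are hypothetical, and enumeration stands in for whatever standard BN inference method is used in practice.

```python
from itertools import product

# Root variables of the constructed network with their prior probabilities.
priors = {
    "smart(alice)": 0.8,
    "smart(bob)": 0.9,
    "author(p1,alice)": 0.7,
    "author(p1,bob)": 0.3,
}
RULE_P = 0.5    # high_quality(P) :- author(P,A), smart(A). (0.5)
LEAK_P = 0.1    # high_quality(P). (0.1)
ACCEPT_P = 0.9  # accepted(P) :- high_quality(P). (0.9)


def p_high_quality(world):
    """Noisy-or over the two groundings (A=alice, A=bob) plus the leak."""
    p_all_fail = 1.0 - LEAK_P
    for a in ("alice", "bob"):
        if world[f"author(p1,{a})"] and world[f"smart({a})"]:
            p_all_fail *= 1.0 - RULE_P
    return 1.0 - p_all_fail


def query(evidence):
    """P(accepted(p1) = true | evidence), enumerating the root variables."""
    names = list(priors)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        weight = 1.0
        for name in names:
            weight *= priors[name] if world[name] else 1.0 - priors[name]
        denominator += weight
        # accepted(p1) has no leak, so it can only be true via high_quality(p1).
        numerator += weight * ACCEPT_P * p_high_quality(world)
    return numerator / denominator


print(query({"smart(bob)": True}))  # approximately 0.404
```

Under these assumptions the evidence helps only modestly, because Bob is an author of p1 with probability just 0.3.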
Roadmap • Motivation • Background • Rule-based Approaches • Basic Approach • Knowledge-Based Model Construction • Issues • First-Order Variable Elimination • Learning • Frame-based Approaches • Undirected Relational Approaches • Programming Language Approaches
The Role of Context • Context is deterministic knowledge known prior to the network being constructed • May be defined by its own logic program • Is not a random variable in the BN • Used to determine the structure of the constructed BN • If a context predicate P appears in the body of a rule R, only backward chain on R if P is true