Bayesian Networks Nariman Farsad N. Farsad
Overview
• Review previously presented models
• Introduce Bayesian networks
• Evaluation, sampling and inference
• Examples
• Applications in NLP
• Conclusion
Probabilistic Model
• An outcome is captured by n random variables (RVs)
• Each RV can take m different values
• A random configuration assigns one value to each of the n variables
Computational Tasks
Joint Distribution Model
• Modeled using the joint distribution over all n variables
• Issues
  • Memory cost to store the table
    • Number of parameters: m^n
    • For n = m = 10, that is 10 billion numbers to store
  • Runtime cost of the many summations
  • The sparse data problem in learning
Fully Independent Model
• Represented by the product of the n individual marginal distributions
• Solves most problems of joint distribution modeling
  • Number of parameters: n·m
• But!
  • The independence assumption is too strong
  • Not accurate
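Naïve Bayes Model
• Represented by a class variable together with the remaining variables, each assumed conditionally independent of the others given the class
• Efficient
  • Number of parameters: on the order of n·m^2
• Good accuracy for some applications, such as text classification
• Still oversimplified for other applications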
Question?
What if we want a better compromise between the model’s computability and the model’s accuracy?
Conditional Independence
• Independence of two random variables
• Conditional independence of two random variables given a third
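For reference, the standard definitions can be written as follows. The variable names X, Y and Z are chosen here for illustration and are not taken from the slide.

```latex
\begin{align*}
X \perp Y \quad &\Longleftrightarrow \quad P(X, Y) = P(X)\,P(Y) \\
X \perp Y \mid Z \quad &\Longleftrightarrow \quad P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)
\end{align*}
```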
Answer: Bayesian Networks
A Bayesian network is defined by a directed acyclic graph (DAG) and a collection of conditional probability tables. Nodes in the graph represent random variables, and the directed edges encode the conditional independence assumptions. The edges are interpreted in the following way: if Vj (1 ≤ j ≤ n) is a random variable and Vπ(j) are the parent variables of Vj, i.e., all source nodes of edges whose destination node is Vj, then Vj, given its parents Vπ(j), is independent of every other variable that is not one of its descendants; i.e., P(Vj | Vπ(j), non-descendants of Vj) = P(Vj | Vπ(j)).
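A consequence of this local assumption is the standard factorization of the joint distribution into one conditional table per node. It is stated here in the slide's own notation, with the convention that a node without parents contributes its plain marginal:

```latex
P(V_1, \dots, V_n) \;=\; \prod_{j=1}^{n} P\bigl(V_j \mid V_{\pi(j)}\bigr)
```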
A Graphical Example [figure: example DAG over V1, V2, V3, V4]
Representational Power (1): full joint distribution model [figure: network over V1, V2, V3, V4]
Representational Power (2): fully independent model [figure: network over V1, V2, V3, V4]
Representational Power (3): Naïve Bayes model [figure: network over V1, V2, V3, V4]
Representational Power (4): HMM model [figure: network over states X1, X2, X3 and observations o1, o2, o3]
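Since the figures are not reproduced here, the factorizations that each of these four graphs would encode can be sketched as below. The variable ordering in the full joint model and the choice of V1 as the class variable in the Naïve Bayes model are assumptions made for illustration. The HMM line uses the usual structure in which each state depends on the previous state and each observation depends on its own state.

```latex
\begin{align*}
\text{Full joint:} \quad & P(V_1)\,P(V_2 \mid V_1)\,P(V_3 \mid V_1, V_2)\,P(V_4 \mid V_1, V_2, V_3) \\
\text{Fully independent:} \quad & P(V_1)\,P(V_2)\,P(V_3)\,P(V_4) \\
\text{Na\"ive Bayes:} \quad & P(V_1)\,P(V_2 \mid V_1)\,P(V_3 \mid V_1)\,P(V_4 \mid V_1) \\
\text{HMM:} \quad & P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\,P(o_1 \mid X_1)\,P(o_2 \mid X_2)\,P(o_3 \mid X_3)
\end{align*}
```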
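Computational Tasks (1)
• Evaluation
  • Compute the probability of a complete configuration as the product of one table entry per node
• Simulation
  • Draw a value for each variable given the values already drawn for its parents
  • Conjoin the draws to form a complete configuration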
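The simulation step can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: it uses the burglary network and the probability values that appear in the worked inference example of this deck, and the function name `sample_configuration` is invented here. Variables are drawn in an order in which parents always come before children.

```python
import random

# Table entries as used in the deck's burglary example (probability of True).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # keyed by (B, E)
P_J = {True: 0.9, False: 0.05}   # keyed by A
P_M = {True: 0.7, False: 0.01}   # keyed by A


def sample_configuration(rng):
    """Draw each variable given its already-drawn parents, then conjoin."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    j = rng.random() < P_J[a]
    m = rng.random() < P_M[a]
    return {"B": b, "E": e, "A": a, "J": j, "M": m}


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_configuration(rng) for _ in range(100_000)]
freq_j = sum(s["J"] for s in samples) / len(samples)
print(f"estimated P(J = T) from samples: {freq_j:.4f}")
```

With enough samples, the relative frequency of J = T should approach the exact marginal derived in the inference slides.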
Computational Tasks (2)
• Inference
  • Use the tables to calculate the required marginal and conditional probabilities (brute force)
  • In tree Bayesian networks, use message passing algorithms
• Learning
  • Given a network graph and complete observations, use MLE (i.e., counting)
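Learning by counting can be sketched as below. The helper name `estimate_cpt` and the seven toy observations are made up for illustration. The point is only that, with complete data, the maximum likelihood estimate of each table entry is a ratio of counts.

```python
from collections import Counter


def estimate_cpt(samples, child, parents):
    """MLE of P(child | parents) from complete observations, by counting."""
    joint_counts = Counter()
    parent_counts = Counter()
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint_counts[(pa, s[child])] += 1
        parent_counts[pa] += 1
    # Each entry is count(parents, child) / count(parents).
    return {key: n / parent_counts[key[0]] for key, n in joint_counts.items()}


# Toy, invented observations of an alarm variable A and a call variable J.
data = [
    {"A": True, "J": True},
    {"A": True, "J": True},
    {"A": True, "J": False},
    {"A": False, "J": False},
    {"A": False, "J": False},
    {"A": False, "J": True},
    {"A": False, "J": False},
]

cpt = estimate_cpt(data, child="J", parents=["A"])
print(cpt[((True,), True)])   # estimate of P(J = T | A = T)
print(cpt[((False,), True)])  # estimate of P(J = T | A = F)
```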
Number of Free Parameters
• Number of free parameters for each node: (m − 1)·m^k
  • k is the number of parents of that node
• Examples
  • Fully independent model: n·(m − 1)
  • Joint distribution model: m^n − 1
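The per-node count can be checked numerically with a small sketch. It assumes every variable takes m values, so a node with k parents needs (m − 1)·m^k free parameters. The function name `free_parameters` and the dictionary encoding of the graphs are chosen here for illustration.

```python
def free_parameters(parents, m):
    """Total free parameters of a network in which every variable takes m values.

    `parents` maps each node to the list of its parent nodes.
    """
    return sum((m - 1) * m ** len(pa) for pa in parents.values())


n, m = 10, 10

# No edges at all: every node has zero parents.
independent = {i: [] for i in range(n)}
# Complete DAG: node i has all earlier nodes as parents.
full_joint = {i: list(range(i)) for i in range(n)}
# Naive Bayes with node 0 as the (assumed) class variable.
naive_bayes = {0: [], **{i: [0] for i in range(1, n)}}

print(free_parameters(independent, m))  # 90
print(free_parameters(full_joint, m))   # 9999999999
print(free_parameters(naive_bayes, m))  # 819
```

The complete DAG recovers the m^n − 1 figure of the joint distribution model, which is why a Bayesian network with few parents per node is so much cheaper to store.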
Computational Example (1)
• You have a new burglar alarm installed.
• It is reliable at detecting burglaries, but it also responds to minor earthquakes.
• Two neighbors (John and Mary) call in case they hear the alarm.
• John sometimes confuses the phone ringing with the alarm.
• Mary does not hear the alarm well.
Computational Example (2) [figure: network with edges Burglary → Alarm, Earthquake → Alarm, Alarm → John Calls, Alarm → Mary Calls]
Evaluation Example (1) [figure: network over B, E, A, J, M]
Evaluation Example (2)
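Because the content of the evaluation slides did not survive extraction, the sketch below only illustrates what evaluation means for this network: the probability of one complete configuration is the product of one table entry per node. The table values are the ones used in the inference slides of this deck. The helper names `bern` and `joint_probability`, and the particular configuration evaluated, are chosen here for illustration.

```python
# Table entries as used in the deck's inference slides (probability of True).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # keyed by (B, E)
P_J = {True: 0.9, False: 0.05}   # keyed by A
P_M = {True: 0.7, False: 0.01}   # keyed by A


def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint_probability(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


# Example: no burglary, no earthquake, alarm rings, both neighbors call.
p = joint_probability(b=False, e=False, a=True, j=True, m=True)
print(p)  # 0.999 * 0.998 * 0.001 * 0.9 * 0.7
```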
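Inference (1)
Suppose we are interested in calculating P(B = T | J = T), the probability of a burglary given that John calls.
We can calculate it using P(B = T | J = T) = P(B = T, J = T) / P(J = T)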
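Inference (2)
Marginal probability: P(B = T, J = T) is obtained by summing the joint distribution over all values of the remaining variables E, A and M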
Inference (3)
P(B = T, J = T)
= P(B = T)P(E = T)P(A = T|B = T,E = T)P(J = T|A = T)P(M = T|A = T)
+ P(B = T)P(E = T)P(A = T|B = T,E = T)P(J = T|A = T)P(M = F|A = T)
+ P(B = T)P(E = T)P(A = F|B = T,E = T)P(J = T|A = F)P(M = T|A = F)
+ P(B = T)P(E = T)P(A = F|B = T,E = T)P(J = T|A = F)P(M = F|A = F)
+ P(B = T)P(E = F)P(A = T|B = T,E = F)P(J = T|A = T)P(M = T|A = T)
+ P(B = T)P(E = F)P(A = T|B = T,E = F)P(J = T|A = T)P(M = F|A = T)
+ P(B = T)P(E = F)P(A = F|B = T,E = F)P(J = T|A = F)P(M = T|A = F)
+ P(B = T)P(E = F)P(A = F|B = T,E = F)P(J = T|A = F)P(M = F|A = F)
= 0.001 · 0.002 · 0.95 · 0.9 · 0.7
+ 0.001 · 0.002 · 0.95 · 0.9 · 0.3
+ 0.001 · 0.002 · 0.05 · 0.05 · 0.01
+ 0.001 · 0.002 · 0.05 · 0.05 · 0.99
+ 0.001 · 0.998 · 0.94 · 0.9 · 0.7
+ 0.001 · 0.998 · 0.94 · 0.9 · 0.3
+ 0.001 · 0.998 · 0.06 · 0.05 · 0.01
+ 0.001 · 0.998 · 0.06 · 0.05 · 0.99
= 8.49017 · 10^−4
Inference (4)
• To calculate P(J = T)
• Note that P(J = T) = P(B = T, J = T) + P(B = F, J = T)
• Using a similar method we can calculate P(B = F, J = T)
Inference (5)
P(B = F, J = T)
= P(B = F)P(E = T)P(A = T|B = F,E = T)P(J = T|A = T)P(M = T|A = T)
+ P(B = F)P(E = T)P(A = T|B = F,E = T)P(J = T|A = T)P(M = F|A = T)
+ P(B = F)P(E = T)P(A = F|B = F,E = T)P(J = T|A = F)P(M = T|A = F)
+ P(B = F)P(E = T)P(A = F|B = F,E = T)P(J = T|A = F)P(M = F|A = F)
+ P(B = F)P(E = F)P(A = T|B = F,E = F)P(J = T|A = T)P(M = T|A = T)
+ P(B = F)P(E = F)P(A = T|B = F,E = F)P(J = T|A = T)P(M = F|A = T)
+ P(B = F)P(E = F)P(A = F|B = F,E = F)P(J = T|A = F)P(M = T|A = F)
+ P(B = F)P(E = F)P(A = F|B = F,E = F)P(J = T|A = F)P(M = F|A = F)
= 0.999 · 0.002 · 0.29 · 0.9 · 0.7
+ 0.999 · 0.002 · 0.29 · 0.9 · 0.3
+ 0.999 · 0.002 · 0.71 · 0.05 · 0.01
+ 0.999 · 0.002 · 0.71 · 0.05 · 0.99
+ 0.999 · 0.998 · 0.001 · 0.9 · 0.7
+ 0.999 · 0.998 · 0.001 · 0.9 · 0.3
+ 0.999 · 0.998 · 0.999 · 0.05 · 0.01
+ 0.999 · 0.998 · 0.999 · 0.05 · 0.99
= 5.12899587 · 10^−2
Inference (6)
Therefore
P(J = T) = P(B = T, J = T) + P(B = F, J = T) = 8.49017 · 10^−4 + 5.12899587 · 10^−2 = 0.0521389757
and finally
P(B = T | J = T) = P(B = T, J = T) / P(J = T) = 8.49017 · 10^−4 / 0.0521389757 ≈ 0.0163
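The brute-force calculation on the preceding slides can be reproduced with a short enumeration sketch. It sums the joint probability over every configuration consistent with the evidence, using the same table values as the slides. The function names `joint` and `marginal` are chosen here for illustration.

```python
from itertools import product

# Table entries from the slides (probability of True).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # keyed by (B, E)
P_J = {True: 0.9, False: 0.05}   # keyed by A
P_M = {True: 0.7, False: 0.01}   # keyed by A

NAMES = ["B", "E", "A", "J", "M"]


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(c):
    """Joint probability of a complete configuration given as a dict."""
    return (bern(P_B, c["B"]) * bern(P_E, c["E"])
            * bern(P_A[(c["B"], c["E"])], c["A"])
            * bern(P_J[c["A"]], c["J"]) * bern(P_M[c["A"]], c["M"]))


def marginal(evidence):
    """Sum the joint over all configurations that agree with `evidence`."""
    total = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        config = dict(zip(NAMES, values))
        if all(config[k] == v for k, v in evidence.items()):
            total += joint(config)
    return total


p_bj = marginal({"B": True, "J": True})   # P(B = T, J = T)
p_j = marginal({"J": True})               # P(J = T)
posterior = p_bj / p_j                    # P(B = T | J = T)
print(p_bj, p_j, posterior)
```

The enumeration visits all 2^5 configurations, which is exactly why this approach stops being practical as the number of variables grows.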
Take-Away Message
• Inference is hard!
  • In fact, inference in a general Bayesian network can be NP-hard
• We can do better in tree Bayesian networks
  • A message passing algorithm, also known as the sum-product algorithm, can be used
What about NLP?
• Relatively new to NLP
• Selected example for presentation
  • Weissenbacher, D. 2006. Bayesian network, a model for NLP? In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations (Trento, Italy, April 05 - 06, 2006). Association for Computational Linguistics, Morristown, NJ, 195-198.
Anaphoric Pronoun
• A pronoun that refers to a linguistic expression previously introduced in the text
• Examples
  • “Nonexpression of the locus even when it is present suggests that these chromosomes …” Here the pronoun it is anaphoric
  • “Thus, it is not unexpected that this versatile cellular …” Here the pronoun it is non-anaphoric
What does the paper do?
Attempts to identify non-anaphoric occurrences of it using Bayesian networks.
History of Other Algorithms (1)
• First pronoun classifier proposed in 1987 by Paice
• Relied on a set of first-order logic rules
  • Non-anaphoric constructions start with it and end with a delimiter such as to, that, whether, …
  • The pronoun must not be immediately preceded by certain words such as before, from, to, …
  • The distance between the pronoun and the delimiter must be shorter than 25 words
  • The lexical items occurring between the pronoun and the delimiter must not contain certain words belonging to specific sets
• Lots of false positives
History of Other Algorithms (2)
• To solve the false positive problem, Lappin proposed in 1994
  • More constrained rules in the form of finite state automata
  • Helped in finding specific sequences such as: It is not/may be <Modaladj>; It is <Cogved> that <Subject>
• Solved the false positive problem but introduced lots of false negatives
History of Other Algorithms (3)
• Evans in 2001
  • Proposed a machine learning approach based on surface clues
  • 35 syntactic and contextual surface clues considered for learning, for example
    • Pronoun position in the sentence
    • Lemma of the following verb
  • After learning, k-nearest neighbors (KNN) was used for classification
• Accuracy was good but not great
History of Other Algorithms (4)
• Clemente in 2004
  • Used a similar machine learning approach
  • Used 21 of the most relevant surface clues
  • Classified new instances with an SVM
• Achieved better accuracy
Room for Improvement?
• Each of the proposed models has its own strengths and weaknesses
• Is it possible to combine the strengths of these systems to create a better system?
The Answer
How does it work?
• A priori probability values are calculated using frequency counts in the training corpus
• The a posteriori probability is calculated from the observations and the a priori probability values
• A 50% threshold on the a posteriori probability is used to label the pronoun as non-anaphoric
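The three steps can be illustrated with a deliberately tiny sketch. It is not the network from the paper: it assumes a single hypothetical binary clue (for instance, "a delimiter follows the pronoun"), and every count below is invented purely to show the arithmetic of estimating probabilities from frequencies, applying Bayes' rule, and thresholding at 50%.

```python
def posterior_non_anaphoric(prior, p_clue_if_non_anaphoric, p_clue_if_anaphoric):
    """Bayes' rule for P(non-anaphoric | clue observed), one binary clue."""
    numerator = prior * p_clue_if_non_anaphoric
    denominator = numerator + (1.0 - prior) * p_clue_if_anaphoric
    return numerator / denominator


# Invented toy counts standing in for a training corpus.
n_pronouns = 200                  # occurrences of "it"
n_non_anaphoric = 60              # of which non-anaphoric
n_clue_and_non_anaphoric = 45     # clue present among non-anaphoric cases
n_clue_and_anaphoric = 14         # clue present among anaphoric cases

# Step 1: a priori values from frequency counts.
prior = n_non_anaphoric / n_pronouns
p_clue_non = n_clue_and_non_anaphoric / n_non_anaphoric
p_clue_ana = n_clue_and_anaphoric / (n_pronouns - n_non_anaphoric)

# Step 2: a posteriori probability once the clue has been observed.
posterior = posterior_non_anaphoric(prior, p_clue_non, p_clue_ana)

# Step 3: label with the 50% threshold.
is_non_anaphoric = posterior > 0.5
print(f"posterior = {posterior:.4f}, non-anaphoric = {is_non_anaphoric}")
```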
Results
Conclusion
• Bayesian networks are very powerful and flexible in probabilistic modeling
• Inference can be NP-hard
• For tree Bayesian networks there exist efficient algorithms
• Relatively new to NLP
• Initial results seem promising
Questions?