
Introduction to Probabilistic Models for Computational Biology

Lecture 2 – Oct 3, 2011. CSE 527 Computational Biology, Fall 2011. Instructor: Su-In Lee. TA: Christopher Miles. Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022.


Presentation Transcript


  1. Introduction to Probabilistic Models for Computational Biology • Lecture 2 – Oct 3, 2011 • CSE 527 Computational Biology, Fall 2011 • Instructor: Su-In Lee • TA: Christopher Miles • Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

  2. Review: Gene Regulation • DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC • Gene regulation: a “transcription factor binding site” acts as a switch • “Gene Expression”: a gene is transcribed into RNA (e.g. AUGCGCGUC), which is translated into protein (e.g. MRV); RNA is also subject to degradation • Genes regulate each others’ expression and activity, forming a genetic regulatory network. [Figure: DNA sequence with genes, their RNA transcripts (AUGAUUGAU, AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC) and proteins (MID, MWIV, MRV, MLRTY), connected in a genetic regulatory network]

  3. Review: Variations in the DNA • DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC • “Single nucleotide polymorphism (SNP)” • Sequence variations perturb the regulatory network. [Figure: SNPs marked along the DNA sequence, with the affected RNA transcripts and proteins crossed out in the genetic regulatory network]

  4. Outline • Probabilistic models in biology • Model selection problems • Mathematical foundations • Bayesian networks • Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press • Learning from data • Maximum likelihood estimation • Expectation and maximization

  5. Example 1 • How are a change in a nucleotide in DNA, blood pressure and heart disease related? • There can be several “models”… [Figure: three alternative graphs over DNA alteration, Blood pressure and Heart disease, separated by OR]

  6. Example 2 • How do genes A, B and C regulate each other’s expression levels (mRNA levels)? • There can be several models… [Figure: alternative graphs over genes A, B and C, separated by OR]

  7. [Figure: data table with N instances (Exp 1, Exp 2, …, Exp N) of Gene A, Gene B and Gene C, next to candidate graphs Model I, Model II, Model III, …] • Probabilistic graphical models • A graphical representation of statistical dependencies. • Statistical dependencies between expression levels of genes A, B, C? • Probability that model x is true given the data • Model selection: argmax_x P(model x is true | Data)
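The model-selection rule on this slide can be sketched numerically. The snippet below is only an illustration: the log-likelihood values, the uniform prior, and the helper name `posterior` are assumptions made for the example, not numbers or code from the lecture.

```python
import math

# Hypothetical log P(Data | model) for three candidate models (made-up numbers).
log_likelihood = {"Model I": -120.4, "Model II": -118.9, "Model III": -125.0}
# Assumed uniform prior P(model) over the candidates.
log_prior = {m: math.log(1.0 / len(log_likelihood)) for m in log_likelihood}

def posterior(log_lik, log_pri):
    """Return P(model | Data) by Bayes' rule, normalizing in log space."""
    joint = {m: log_lik[m] + log_pri[m] for m in log_lik}
    top = max(joint.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - top) for v in joint.values())
    return {m: math.exp(v - top) / z for m, v in joint.items()}

post = posterior(log_likelihood, log_prior)
# Model selection: argmax_x P(model x is true | Data)
best_model = max(post, key=post.get)
```

With a uniform prior the argmax coincides with the maximum-likelihood model; a non-uniform prior could change the ranking.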

  8. Outline • Probabilistic models in biology • Model selection problem • Mathematical foundations • Bayesian networks • Learning from data • Maximum likelihood estimation • Expectation and maximization

  9. Probability Theory Review • Assume random variables A and B with Val(A)={a1,a2,a3}, Val(B)={b1,b2} • Conditional probability • Definition • Chain rule • Bayes’ rule • Probabilistic independence
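The equations for this slide did not survive in the transcript. For reference, the standard textbook forms of the four items listed are sketched below for the random variables A and B; these are the usual definitions, not a verbatim copy of the slide.

```latex
\begin{align}
P(A \mid B) &= \frac{P(A, B)}{P(B)}, \quad P(B) > 0 && \text{(definition of conditional probability)} \\
P(A, B) &= P(A)\, P(B \mid A) && \text{(chain rule)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} && \text{(Bayes' rule)} \\
P(A, B) &= P(A)\, P(B) && \text{(independence, } A \perp B \text{)}
\end{align}
```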

  10. Probabilistic Representation • Joint distribution P over {x1,…, xn} • xi is binary • 2^n − 1 entries • If x’s are independent • P(x) = p(x1) … p(xn)
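A small sketch of the counting argument on this slide: a full joint over n binary variables needs 2^n − 1 free entries, while the independent case is determined by n marginals. The marginal values and function names below are hypothetical.

```python
from itertools import product

def joint_table_size(n):
    """Free entries in a full joint table over n binary variables."""
    return 2 ** n - 1

def independent_joint(marginals):
    """Build P(x) = p(x1) ... p(xn) from p(xi = 1) for each variable."""
    table = {}
    for x in product([0, 1], repeat=len(marginals)):
        p = 1.0
        for xi, pi in zip(x, marginals):
            p *= pi if xi == 1 else 1.0 - pi
        table[x] = p
    return table

marginals = [0.2, 0.5, 0.9]  # hypothetical p(xi = 1), three parameters in total
table = independent_joint(marginals)
```

Here three numbers generate all eight joint entries, whereas an unconstrained joint over three binary variables would need seven.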

  11. Conditional Parameterization • The Diabetes example • Genetic risk (G), Diabetes (D) • Val (G) = {g1,g0}, Val (D) = {d1,d0} • P(G,D) = P(G) P(D|G) • P(G): Prior distribution • P(D|G): Conditional probability distribution (CPD) [Figure: Genetic risk → Diabetes]
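The factorization P(G,D) = P(G) P(D|G) can be illustrated with a few lines of code. All probabilities below are invented for the example; only the structure follows the slide.

```python
# Hypothetical prior P(G) and CPD P(D | G); the numbers are illustrative only.
p_g = {"g1": 0.1, "g0": 0.9}
p_d_given_g = {
    "g1": {"d1": 0.4, "d0": 0.6},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint distribution from the conditional parameterization.
joint = {(g, d): p_g[g] * p_d_given_g[g][d]
         for g in p_g for d in ("d1", "d0")}

# Marginal P(d1), then Bayes' rule for P(g1 | d1).
p_d1 = joint[("g1", "d1")] + joint[("g0", "d1")]
p_g1_given_d1 = joint[("g1", "d1")] / p_d1
```

The prior and the CPD together use three free parameters, the same number as the full joint over two binary variables, so the saving only appears once more variables are added.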

  12. Naïve Bayes Model - Example • Elaborating the diabetes example, • Genetic Risk (G), Diabetes (D), Hypertension (H) • Val (G) = {g1,g0}, Val (D) = {d1,d0}, Val (H) = {h1,h0} • Joint distribution: 8 entries (7 independent parameters) • If D and H are independent given G, • P(G,D,H) = P(G)P(D|G)P(H|G) • 5 independent parameters (1 for P(G), 2 each for P(D|G) and P(H|G)); more compact than the joint [Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
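As a sketch of this example, the code below builds the eight-entry joint over G, D and H from five parameters and checks numerically that D and H are conditionally independent given G. The parameter values and helper names are assumptions for illustration.

```python
from itertools import product

# Five hypothetical parameters: P(g1), P(d1 | g), P(h1 | g) for g in {0, 1}.
p_g1 = 0.1
p_d1_given = {1: 0.4, 0: 0.05}
p_h1_given = {1: 0.5, 0: 0.2}

def bern(p, v):
    """Probability of a binary outcome v when P(v = 1) = p."""
    return p if v == 1 else 1.0 - p

# P(G, D, H) = P(G) P(D | G) P(H | G)
joint = {(g, d, h): bern(p_g1, g) * bern(p_d1_given[g], d) * bern(p_h1_given[g], h)
         for g, d, h in product([0, 1], repeat=3)}

def cond_indep_given_g(joint, tol=1e-12):
    """Check P(D, H | G) = P(D | G) P(H | G) for every value of G, D, H."""
    for g in (0, 1):
        pg = sum(p for (gg, _, _), p in joint.items() if gg == g)
        for d, h in product([0, 1], repeat=2):
            p_dh = joint[(g, d, h)] / pg
            p_d = sum(joint[(g, d, hh)] for hh in (0, 1)) / pg
            p_h = sum(joint[(g, dd, h)] for dd in (0, 1)) / pg
            if abs(p_dh - p_d * p_h) > tol:
                return False
    return True
```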

  13. Naïve Bayes Model • A class C where Val (C) = {c1,…,ck}. • Finding variables x1,…,xn • Naïve Bayes assumption • The findings are conditionally independent given the individual’s class. • The model factorizes as: P(C, x1,…,xn) = P(C) ∏i P(xi | C) • The Diabetes example • class: Genetic risk, findings: Diabetes, Hypertension

  14. Naïve Bayes Model - Example • Medical diagnosis system • Class C: disease • Findings X: symptoms • Computing the confidence: • Drawbacks • Strong assumptions
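The slide's own confidence formula is not preserved in the transcript. One common way to compute a naïve Bayes confidence is the class posterior P(C | x1,…,xn), sketched below under that assumption. The disease names, symptom names and probabilities are all hypothetical.

```python
import math

def naive_bayes_posterior(prior, likelihood, findings):
    """Posterior P(class | findings) under the naive Bayes assumption.

    prior:      {class: P(class)}
    likelihood: {class: {finding: P(finding = 1 | class)}}
    findings:   {finding: 0 or 1}
    """
    log_score = {}
    for c, pc in prior.items():
        s = math.log(pc)
        for name, value in findings.items():
            p1 = likelihood[c][name]
            s += math.log(p1 if value == 1 else 1.0 - p1)
        log_score[c] = s
    top = max(log_score.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in log_score.values())
    return {c: math.exp(s - top) / z for c, s in log_score.items()}

# Hypothetical two-class diagnosis problem with two binary symptoms.
prior = {"flu": 0.1, "healthy": 0.9}
likelihood = {
    "flu": {"fever": 0.8, "cough": 0.7},
    "healthy": {"fever": 0.05, "cough": 0.1},
}
post = naive_bayes_posterior(prior, likelihood, {"fever": 1, "cough": 1})
```

The strong-assumption drawback shows up here directly: correlated symptoms are counted as independent evidence, which tends to make the posterior overconfident.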

  15. Bayesian Network • Directed acyclic graph (DAG) • Node: a random variable • Edge: direct influence of one node on another • The Diabetes example revisited • Genetic risk (G), Diabetes (D), Hypertension (H) • Val (G) = {g1,g0}, Val (D) = {d1,d0}, Val (H) = {h1,h0} [Figure: graph over Genetic risk, Diabetes and Hypertension]

  16. Bayesian Network Semantics • A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1,…,Xn. • Pa(Xi): parents of Xi in G • NonDescendants(Xi): variables in G that are not descendants of Xi. • G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by I_L(G): For each variable Xi: (Xi ⊥ NonDescendants(Xi) | Pa(Xi)) [Figure: example DAG over nodes x1,…,x11]

  17. The Genetics Example • Variables • B: blood type (a phenotype) • G: genotype of the gene that encodes a person’s blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
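In this example the CPD P(B | G) is deterministic, because the ABO genotype fixes the blood type. A minimal sketch, assuming the usual dominance rules (A and B are codominant, O is recessive) and using hypothetical function names:

```python
# The six unordered genotypes listed on the slide.
GENOTYPES = [("A", "A"), ("A", "B"), ("A", "O"),
             ("B", "B"), ("B", "O"), ("O", "O")]

def blood_type(genotype):
    """Map an ABO genotype to its phenotype: A and B codominant, O recessive."""
    alleles = set(genotype)
    if {"A", "B"} <= alleles:
        return "AB"
    if "A" in alleles:
        return "A"
    if "B" in alleles:
        return "B"
    return "O"

def p_b_given_g(b, genotype):
    """Deterministic CPD P(B = b | G = genotype): 1 for the implied type, else 0."""
    return 1.0 if blood_type(genotype) == b else 0.0

phenotypes = {g: blood_type(g) for g in GENOTYPES}
```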

  18. Bayesian Network Joint Distribution • Let G be a Bayesian network graph over the variables X1,…,Xn. We say that a distribution P factorizes according to G if P can be expressed as: P(X1,…,Xn) = ∏i P(Xi | Pa(Xi)) • A Bayesian network is a pair (G,P) where P factorizes over G, and where P is specified as a set of CPDs associated with G’s nodes.

  19. The Student Example • More complex scenario • Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G) • Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1,i0}, Val (S) = {s1,s0}, Val (G) = {g1,g2,g3} • Joint distribution has 2·2·2·2·3 = 48 entries, i.e. 47 independent parameters

  20. The Student Bayesian network • Joint distribution • P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G) [Figure: the student network, from Koller & Friedman]
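To make the saving concrete, the sketch below counts independent parameters for the student network, assuming the structure used in Koller & Friedman (I and D have no parents, G depends on I and D, S on I, L on G). The dictionary layout and function names are illustrative choices.

```python
from math import prod

# Number of values each variable can take (from the student example).
card = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}
# Assumed parent sets of the student network.
parents = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

def full_joint_params(card):
    """Independent parameters of an unconstrained joint table."""
    return prod(card.values()) - 1

def network_params(card, parents):
    """Sum over nodes of (|Val(X)| - 1) free values per parent configuration."""
    return sum((card[x] - 1) * prod(card[p] for p in parents[x]) for x in card)

n_joint = full_joint_params(card)
n_network = network_params(card, parents)
```

Under these assumptions the factorized form needs 15 parameters against 47 for the full joint.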

  21. Parameter Estimation • Assumptions • Fixed network structure • Fully observed instances of the network variables: D={d[1],…,d[M]} • For example, {i0,d1,g1,l0,s0} • Maximum likelihood estimation (MLE) of the “parameters” of the Bayesian network [Figure: the student network with its CPDs, from Koller & Friedman]
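With a fixed structure and fully observed data, the maximum likelihood estimate of each CPD entry reduces to a ratio of counts. The sketch below shows this for one node; the tiny data set and the function name `mle_cpd` are made up for illustration.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """MLE of P(child | parents): count(parents, child) / count(parents).

    data is a list of fully observed instances, each a dict from
    variable name to value. Unseen combinations are simply absent.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(pa, row[child])] += 1
        parent_counts[pa] += 1
    return {(pa, v): n / parent_counts[pa]
            for (pa, v), n in joint_counts.items()}

# Hypothetical observations of Intelligence (I) and SAT (S).
data = [
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"},
]
cpd_s = mle_cpd(data, "S", ["I"])
```

With so few samples some estimates land at exactly 0 or 1, which is one reason smoothing or Bayesian priors are often preferred in practice.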

  22. Outline • Probabilistic models in biology • Model selection problem • Mathematical foundations • Bayesian networks • Learning from data • Maximum likelihood estimation • Expectation and maximization

  23. Acknowledgement • Profs Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”
