1 / 32

Statistical modeling and classification in Biological Sequence Space

This article discusses the use of statistical modeling and classification in predicting and validating biological sequences, such as DNA, RNA, and proteins. It covers gene identification, noncoding RNA genes, protein domains, and more. Different models and methods are discussed, including HMMs, NNs, SCFG, and SVM. The importance of biological confirmation is emphasized.

atkinsond
Download Presentation

Statistical modeling and classification in Biological Sequence Space

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical modeling and classification in Biological Sequence Space April 26, 04; 9.520 Gene Yeo Poggio, Burge @MIT

  2. Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes

  3. Biological sequences • DNA, RNA and proteins: macromolecules built up from smaller units. • DNA: units are the nucleotide residues A, C, G and T • RNA: units are the nucleotide residues A, C, G and U • Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. • To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

  4. Statistical models can be descriptive and/or predictive. • Given known biological signal-> describe the signal with statistical modeling & find unknown examples of the same signal • Gene-finding (protein-coding genes) • Noncoding RNA genes • Protein domains • Warning: although successful, models are not to be taken literally. • Most important: biological confirmation of predictions is almost always necessary.

  5. Different models RNA gene (Covariation,SCFG,NN,SVM) Protein structure (a variety of methods) Complexity Protein gene(HMM,NN) Splice site motif (WMM, MM, SVM, NN) DNA RNA Protein

  6. A case study in computational biology: modeling signals in genes With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

  7. DNA transcription mRNA translation Protein What is a (protein-coding) gene? CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE

  8. Some facts about human genes Comprise about 3% of the genome Average gene length: ~ 8,000 bp Average of 5-6 exons/gene Average exon length: ~200 bp Average intron length: ~2,000 bp ~8% genes have a single exon

  9. The idea behind a HMM genefinder • States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). • Observationsembody state-dependent statistics, such as base composition, dependence, and signal features.

  10. GENSCAN (Burge & Karlin) 6 2 0 0 1 A G G A C A G G T A C G G C T G T C A T C A C T T A G A C C T C A C C C T G T G G A G C C A C A C C 6 2 0 5 1 C T A G G G T T G G C C A A T C T A C T C C C A G G A G C A G G G A G G G C A G G A G C C A G G G C 6 2 1 0 1 T G G G C A T A A A A G T C A G G G C A G A G C C A T C T A T T G C T T A C A T T T G C T T C T G A 6 2 1 5 1 C A C A A C T G T G T T C A C T A G C A A C C T C A A A C A G A C A C C A T G G T G C A C C T G A C E E E 6 2 2 0 1 T C C T G A G G A G A A G T C T G C C G T T A C T G C C C T G T G G G G C A A G G T G A A C G T G G 0 1 2 6 2 2 5 1 A T G A A G T T G G T G G T G A G G C C C T G G G C A G G T T G G T A T C A A G G T T A C A A G A C 6 2 3 0 1 A G G T T T A A G G A G A C C A A T A G A A A C T G G G C A T G T G G A G A C A G A G A A G A C T C 6 2 3 5 1 T T G G G T T T C T G A T A G G C A C T G A C T C T C T C T G C C T A T T G G T C T A T T T T C C C 6 2 4 0 1 A C C C T T A G G C T G C T G G T G G T C T A C C C T T G G A C C C A G A G G T T C T T T G A G T C I I I 0 1 2 6 2 4 5 1 C T T T G G G G A T C T G T C C A C T C C T G A T G C T G T T A T G G G C A A C C C T A A G G T G A 6 2 5 0 1 A G G C T C A T G G C A A G A A A G T G C T C G G T G C C T T T A G T G A T G G C C T G G C T C A C 6 2 5 5 1 C T G G A C A A C C T C A A G G G C A C C T T T G C C A C A C T G A G T G A G C T G C A C T G T G A 6 2 6 0 1 C A A G C T G C A C G T G G A T C C T G A G A A C T T C A G G G T G A G T C T A T G G G A C C C T T E E i t 6 2 6 5 1 G A T G T T T T C T T T C C C C T T C T T T T C T A T G G T T A A G T T C A T G T C A T A G G A A G 6 2 7 0 1 G G G A G A A G T A A C A G G G T A C A G T T T A G A A T G G G A A A C A G A C G A A T G A T T G C 6 2 7 5 1 A T C A G T G T G G A A G T C T C A G G A T C G T T T T A G T T T C T T T T A T T T G C T G T T C A 6 2 8 0 1 T A A C A A T T G T T T T C T T T T G T T T A A T T C T T G C T T T C T T T T T T T T T C T T C T C 6 2 8 5 1 C G C A A T T T T T A C T A T T A T A C T T A A T G C C T T A A C A T T G T G T A T A A C A A A A G E 5'UTR s 3'UTR 6 2 9 0 1 G A A A T A T C T C T G A G A T A C A T T A A G T A A C T T A A A A A A A A A C T T T A C A C A G T 6 2 9 5 1 C T G C C T A G T A C A T T A C T A T T T G G A A T A T A T G T G T G C T T A T T T G C A T A T T C 6 3 0 0 1 A T A A T C T C C C T A C T T T A T T T T C T T T T A T T T T T A A T T G A T A C A T A A T C A T T 6 3 0 5 1 A T A C A T A T T T A T G G G T T A A A G T G T A A T G T T T T A A T A T G T G T A C A C A T A T T poly-A 6 3 1 0 1 G A C C A A A T C A G G G T A A T T T T G C A T T T G T A A T T T T A A A A A A T G C T T T C T T C promoter 6 3 1 5 1 T T T T A A T A T A C T T T T T T G T T T A T C T T A T T T C T A A T A C T T T C C C T A A T C T C 6 3 2 0 1 T T T C T T T C A G G G C A A T A A T G A T A C A A T G T A T C A T G C C T C T T T G C A C C A T T 6 3 2 5 1 C T A A A G A A T A A C A G T G A T A A T T T C T G G G T T A A G G C A A T A G C A A T A T T T C T 6 3 3 0 1 G C A T A T A A A T A T T T C T G C A T A T A A A T T G T A A C T G A T G T A A G A G G T T T C A T Forward (+) strand Forward (+) strand intergenic 6 3 3 5 1 A T T G C T A A T A G C A G C T A C A A T C C A G C T A C C A T T C T G C T T T T A T T T T A T G G region Reverse (-) strand Reverse (-) strand 6 3 4 0 1 T T G G G A T A A G G C T G G A T T A T T C T G A G T C C A A G C T A G G C C C T T T T G C T A A T 6 3 4 5 1 C A T G T T C A T A C C T C T T A T C T T C C T C C C A C A G C T C C T G G G C A A C G T G C T G G 6 3 5 0 1 T C T G T G T G C T G G C C C A T C A C T T T G G C A A A G A A T T C A C C C C A C C A G T G C A G 6 3 5 5 1 G C T G C C T A T C A G A A A G T G G T G G C T G G T G T G G C T A A T G C C C T G G C C C A C A A 6 3 6 0 1 G T A T C A C T A A G C T C G C T T T C T T G C T G T C C A A T T T C T A T T A A A G G T T C C T T 6 3 6 5 1 T G T T C C C T A A G T C C A A C T A C T A A A C T G G G G G A T A T T A T G A A G G G C C T T G A 6 3 7 0 1 G C A T C T G G A T T C T G C C T A A T A A A A A A C A T T T A T T T T C A T T G C A A T G A T G T

  11. Splice sites can be an important signal

  12. Regular expressions can be limiting C A A G AGGT AGT 5’ splice junction in eukaryotes T C N AGC T C 3’ splice junction ≥11 Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites. Position-specific distributionscame to represent the variability in motif composition.

  13. S = S1 S2 S3 S4 S5 S6 S7 S8 S9 Odds Ratio R = = Score s = log2R P(S|+) P-3(S1)P-2(S2)P-1(S3) ••• P5(S8)P6(S9) P(S|-) Pbg(S1)Pbg(S2)Pbg(S3) ••• Pbg(S8)Pbg(S9) Position-specific scoring matrix (PSSM) Pos -3 -2 -1 +1 +2 +3 +4 +5 +6 A 0.3 0.6 0.1 0.0 0.0 0.4 0.7 0.1 0.1 C 0.4 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.2 G 0.2 0.2 0.8 1.0 0.0 0.4 0.1 0.8 0.2 T 0.1 0.1 0.1 0.0 1.0 0.1 0.1 0.0 0.5

  14. Ok, so we got the genes • molecular biology (transcription, splicing) • signals are modeled as states (HMM) or separately, i.e.PSSMs • Here’s another catch, there isn’t just one version of each gene. • But sometimes several

  15. Eg. alternative splicing - CD44 Human chromosome 11p… Zhu et al Science… (2003)

  16. Alternative splicing • is a major determinant of protein diversity (Lander 2001, Zavolan 2003) • 30-50% of human diseases involve alt. splicing

  17. Defining constitutive and alternative exons Constitutive exon Skipped exon 3’ alternative exon 5’ alternative exon Intron retention Mutually exclusive exons

  18. Conserved alternative, skipped exon - FXR1 Fragile X Related Gene, FXR1

  19. Another example of genes containing CSE: DMWD Myotonic Dystrophy-containing WD Repeat, DMWD

  20. Predicting new alternatively spliced exons The problem is ‘ill-posed’ High-dimensional space Not overfit data Simple feature selection Unbalanced data set sizes Labels are more “flexible”

  21. Eg. of experimentally validated

  22. Biological sequence space: challenges • Models that “represent” as much of the biology as possible. • Biologically motivated features are important • Validating attributes: • Conservation of events are key in computational biology • Higher-level consistency with known biology • Experimental validation of predictions are essential

  23. Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes

  24. If time permits Modeling higher order interactions: Yeast Phe tRNA Secondary Structure Tertiary Structure

  25. The Hammerhead Ribozyme Secondary structure Tertiary structure

  26. Method of Covariation / Compensatory changes One example on how to model and predict RNA 2o Structure • Covariation (using comparative genomics) Seq1: A C G A A A G U Seq2: U A G U A A U A Seq3: A G G U G A C U Seq4: C G G C A A U G Seq5: G U G G G A A C

  27. is maximal (2 bits) if x and y individually appear at random (A,C,G,U equally likely), but are perfectly correlated (e.g., always complementary) Mutual information statisticfor pair of columns in a multiple alignment = fraction of seqs w/ nt. x in col. i, nt. y in col. j = fraction of seqs w/ nt. x in col. i sum over x, y = A, C, G, U

  28. Inferring 2o Structure from Covariation

  29. Stochastic Context-Free Grammars (SCFGs) • A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA) Ref: Durbin et al. “Biological Sequence Analysis” 1998

  30. An SCFG Model of RNA 2o Structure “Production Rules”: • P  aWb (“pair”) • L  aW (“left bulge/loop”) • R  Wa (“right bulge/loop”) • B  SS (“bifurcation”) • S  W (“start”) • E   (“end”)

  31. last page • some of the slides were obtained from various places: • available online slides on the web (primarily from lectures by terry speed). • slides from chris burge, dirk holste

More Related