This article discusses the use of statistical modeling and classification in predicting and validating biological sequences, such as DNA, RNA, and proteins. It covers gene identification, noncoding RNA genes, protein domains, and more. Different models and methods are discussed, including HMMs, NNs, SCFG, and SVM. The importance of biological confirmation is emphasized.
Statistical modeling and classification in Biological Sequence Space April 26, 04; 9.520 Gene Yeo Poggio, Burge @MIT
Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes
Biological sequences • DNA, RNA and proteins: macromolecules built up from smaller units. • DNA: units are the nucleotide residues A, C, G and T • RNA: units are the nucleotide residues A, C, G and U • Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. • To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.
Statistical models can be descriptive and/or predictive. • Given a known biological signal: describe the signal with statistical modeling, then find unknown examples of the same signal • Gene-finding (protein-coding genes) • Noncoding RNA genes • Protein domains • Warning: although successful, models are not to be taken literally. • Most important: biological confirmation of predictions is almost always necessary.
Different models [Figure: model classes plotted by complexity against molecule type (DNA, RNA, protein)] • RNA gene (covariation, SCFG, NN, SVM) • Protein structure (a variety of methods) • Protein gene (HMM, NN) • Splice site motif (WMM, MM, SVM, NN)
A case study in computational biology: modeling signals in genes With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.
What is a (protein-coding) gene? • DNA: CCTGAGCCAACTATTGATGAA • transcription • mRNA: CCUGAGCCAACUAUUGAUGAA • translation • Protein: PEPTIDE
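The DNA → mRNA → protein example above can be reproduced with a minimal sketch. It assumes the DNA shown is the coding (sense) strand, so transcription reduces to replacing T with U, and it uses the standard genetic code. The names `transcribe`, `translate`, and `CODON_TABLE` are illustrative and not from the slides.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (assumption: sense strand)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 0, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

A real gene would also need the correct reading frame and start codon, which this sketch does not search for.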
Some facts about human genes • Comprise about 3% of the genome • Average gene length: ~8,000 bp • Average of 5-6 exons/gene • Average exon length: ~200 bp • Average intron length: ~2,000 bp • ~8% of genes have a single exon
The idea behind an HMM genefinder • States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, poly-A, ...). • Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
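To make the state/observation idea concrete, here is a toy sketch of Viterbi decoding over three states. Every number in it (start, transition, and emission probabilities) is invented for illustration and is not taken from GENSCAN or any trained genefinder. A real genefinder uses many more states, explicit length distributions, and signal models.

```python
import math

# Hypothetical three-state model; all probabilities below are made up.
STATES = ("intergenic", "exon", "intron")
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.90, "intron": 0.05},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
# State-dependent base composition (the "observations").
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.25, "C": 0.15, "G": 0.20, "T": 0.40},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    scores = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda r: scores[r] + log(TRANS[r][s]))
            new_scores[s] = scores[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("AT" * 10 + "GC" * 20 + "AT" * 10)
print("".join(s[0] if s != "intron" else "n" for s in path))  # i=intergenic, e=exon, n=intron
```

Working in log space avoids numerical underflow on long sequences, which matters once the input is a genomic region and not a 60-base toy.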
GENSCAN (Burge & Karlin) [Figure: GENSCAN state diagram shown over an annotated genomic DNA sequence (positions 62001 onward). State labels: E0, E1, E2 (exons), I0, I1, I2 (introns), Einit, Eterm, Esngl, 5'UTR, 3'UTR, poly-A, promoter, intergenic region; one set of states each for the forward (+) strand and the reverse (-) strand.]
Regular expressions can be limiting • 5’ splice junction in eukaryotes: (C/A)AG GT(A/G)AGT • 3’ splice junction: (T/C)≥11 N(C/T)AG • Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites. • Position-specific distributions came to represent the variability in motif composition.
Position-specific scoring matrix (PSSM)
S = S1 S2 S3 S4 S5 S6 S7 S8 S9
Odds ratio: $R = \frac{P(S|+)}{P(S|-)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)}$
Score: $s = \log_2 R$
Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
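The log-odds score can be computed directly from the matrix on this slide. The sketch below uses those position-specific probabilities as given. The background model is not stated in the slides, so a uniform background of 0.25 per base is assumed here, and the name `pssm_score` is illustrative.

```python
import math

# Position-specific probabilities from the slide, columns -3..-1, +1..+6.
PSSM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
# Assumption: uniform background; a real model would estimate this from data.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def pssm_score(site):
    """Return s = log2(P(S|+) / P(S|-)) for a 9-nt candidate site."""
    if len(site) != 9:
        raise ValueError("expected a 9-nucleotide site")
    score = 0.0
    for position, base in enumerate(site):
        p = PSSM[base][position]
        if p == 0.0:
            # A base never seen at this position makes the odds ratio zero.
            return float("-inf")
        score += math.log2(p / BACKGROUND[base])
    return score


print(pssm_score("CAGGTAAGT"))  # positive: resembles a 5' splice site
print(pssm_score("CAGATAAGT"))  # -inf: A at +1 has probability 0.0
```

Under these assumptions the consensus-like site CAGGTAAGT scores roughly 12.5 bits. In practice zero entries are usually replaced with pseudocounts so that a single unseen base does not rule a site out entirely.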
OK, so we have the genes • molecular biology (transcription, splicing) • signals are modeled as states (HMM) or separately, e.g., PSSMs • Here’s another catch: there isn’t always just one version of each gene; sometimes there are several.
Eg. alternative splicing - CD44 Human chromosome 11p… Zhu et al Science… (2003)
Alternative splicing • is a major determinant of protein diversity (Lander 2001, Zavolan 2003) • 30-50% of human diseases involve alt. splicing
Defining constitutive and alternative exons • Constitutive exon • Skipped exon • 3’ alternative exon • 5’ alternative exon • Intron retention • Mutually exclusive exons
Conserved alternative, skipped exon - FXR1 Fragile X Related Gene, FXR1
Another example of genes containing CSE: DMWD Myotonic Dystrophy-containing WD Repeat, DMWD
Predicting new alternatively spliced exons • The problem is ‘ill-posed’ • High-dimensional space • Avoid overfitting the data • Simple feature selection • Unbalanced data set sizes • Labels are more “flexible”
Biological sequence space: challenges • Models that “represent” as much of the biology as possible. • Biologically motivated features are important • Validating attributes: • Conservation of events is key in computational biology • Higher-level consistency with known biology • Experimental validation of predictions is essential
Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes
If time permits Modeling higher order interactions: Yeast Phe tRNA Secondary Structure Tertiary Structure
The Hammerhead Ribozyme Secondary structure Tertiary structure
Method of Covariation / Compensatory changes One example of how to model and predict RNA 2° structure • Covariation (using comparative genomics)
Seq1: A C G A A A G U
Seq2: U A G U A A U A
Seq3: A G G U G A C U
Seq4: C G G C A A U G
Seq5: G U G G G A A C
Mutual information statistic for a pair of columns i, j in a multiple alignment:
$M_{ij} = \sum_{x,y} f_{x_i y_j} \log_2 \frac{f_{x_i y_j}}{f_{x_i}\, f_{y_j}}$
• $f_{x_i y_j}$ = fraction of seqs w/ nt. x in col. i, nt. y in col. j • $f_{x_i}$ = fraction of seqs w/ nt. x in col. i • sum over x, y = A, C, G, U • $M_{ij}$ is maximal (2 bits) if x and y individually appear at random (A, C, G, U equally likely), but are perfectly correlated (e.g., always complementary)
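As a worked example, the mutual information statistic can be evaluated on the five-sequence alignment from the covariation slide. This is a minimal sketch using plain observed frequencies with no pseudocounts or small-sample correction. The function name and the 0-based column indices are choices made here for illustration.

```python
import math
from collections import Counter

# The five aligned sequences from the covariation slide.
ALIGNMENT = [
    "ACGAAAGU",
    "UAGUAAUA",
    "AGGUGACU",
    "CGGCAAUG",
    "GUGGGAAC",
]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j (0-based)."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)
    total = 0.0
    for (x, y), count in pair_counts.items():
        f_xy = count / n
        f_x = i_counts[x] / n
        f_y = j_counts[y] / n
        total += f_xy * math.log2(f_xy / (f_x * f_y))
    return total


# Columns 1 and 8 (indices 0 and 7) are always complementary: A-U, U-A, A-U, C-G, G-C.
print(round(mutual_information(ALIGNMENT, 0, 7), 3))  # 1.922
# Columns 3 and 6 (indices 2 and 5) are invariant, so they carry no covariation signal.
print(mutual_information(ALIGNMENT, 2, 5))            # 0.0
```

The first value falls short of the 2-bit maximum only because the four nucleotides are not equally frequent in column 1. With just five sequences such estimates are noisy, which is one reason real analyses need larger alignments.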
Stochastic Context-Free Grammars (SCFGs) • A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA) Ref: Durbin et al. “Biological Sequence Analysis” 1998
An SCFG Model of RNA 2° Structure “Production Rules”: • P → aWb (“pair”) • L → aW (“left bulge/loop”) • R → Wa (“right bulge/loop”) • B → SS (“bifurcation”) • S → W (“start”) • E → ε (“end”)
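To see how these production rules generate a paired structure, the sketch below applies a fixed sequence of rules and builds the resulting RNA string. It is deliberately non-stochastic and omits the bifurcation rule, so it can only produce a single stem-loop. The `derive` function and its tuple encoding of rule applications are hypothetical conventions chosen here, not part of any SCFG library.

```python
def derive(rule_applications):
    """Expand a linear derivation using pair (P), left (L), right (R), start (S), end (E).

    Each application is a tuple: ("P", a, b), ("L", a), ("R", a), ("S",) or ("E",).
    Bifurcation (B -> SS) is not supported in this simplified sketch.
    """
    left = []   # symbols emitted to the left of the remaining nonterminal
    right = []  # symbols emitted to the right, in order of emission
    for application in rule_applications:
        kind = application[0]
        if kind == "P":      # P -> a W b: emit a base pair around W
            left.append(application[1])
            right.append(application[2])
        elif kind == "L":    # L -> a W: emit an unpaired base on the left
            left.append(application[1])
        elif kind == "R":    # R -> W a: emit an unpaired base on the right
            right.append(application[1])
        elif kind == "S":    # S -> W: emits nothing
            continue
        elif kind == "E":    # E -> epsilon: derivation ends
            break
        else:
            raise ValueError(f"unsupported rule: {kind}")
    # Right-side symbols were emitted outermost first, so reverse them.
    return "".join(left) + "".join(reversed(right))


hairpin = derive([
    ("S",),
    ("P", "G", "C"),  # outer base pair G-C
    ("P", "A", "U"),  # inner base pair A-U
    ("L", "G"), ("L", "A"), ("L", "A"),  # three-base loop
    ("E",),
])
print(hairpin)  # GAGAAUC
```

The pair rule is what lets the grammar tie together positions that are far apart in the sequence: the G at position 1 and the C at position 7 are emitted by the same rule application. In a full SCFG each rule also carries a probability.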
Last page • Some of the slides were obtained from various places: • slides available online (primarily from lectures by Terry Speed) • slides from Chris Burge and Dirk Holste