Statistical modeling and classification in Biological Sequence Space

Statistical modeling and classification in Biological Sequence Space April 26, 04; 9.520 Gene Yeo Poggio, Burge @MIT

Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes

Biological sequences • DNA, RNA and proteins: macromolecules built up from smaller units. • DNA: units are the nucleotide residues A, C, G and T • RNA: units are the nucleotide residues A, C, G and U • Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. • To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

Statistical models can be descriptive and/or predictive. • Given known biological signal-> describe the signal with statistical modeling & find unknown examples of the same signal • Gene-finding (protein-coding genes) • Noncoding RNA genes • Protein domains • Warning: although successful, models are not to be taken literally. • Most important: biological confirmation of predictions is almost always necessary.

Different models RNA gene (Covariation,SCFG,NN,SVM) Protein structure (a variety of methods) Complexity Protein gene(HMM,NN) Splice site motif (WMM, MM, SVM, NN) DNA RNA Protein

A case study in computational biology: modeling signals in genes With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

DNA transcription mRNA translation Protein What is a (protein-coding) gene? CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE

Some facts about human genes Comprise about 3% of the genome Average gene length: ~ 8,000 bp Average of 5-6 exons/gene Average exon length: ~200 bp Average intron length: ~2,000 bp ~8% genes have a single exon

The idea behind a HMM genefinder • States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). • Observationsembody state-dependent statistics, such as base composition, dependence, and signal features.

GENSCAN (Burge & Karlin) 6 2 0 0 1 A G G A C A G G T A C G G C T G T C A T C A C T T A G A C C T C A C C C T G T G G A G C C A C A C C 6 2 0 5 1 C T A G G G T T G G C C A A T C T A C T C C C A G G A G C A G G G A G G G C A G G A G C C A G G G C 6 2 1 0 1 T G G G C A T A A A A G T C A G G G C A G A G C C A T C T A T T G C T T A C A T T T G C T T C T G A 6 2 1 5 1 C A C A A C T G T G T T C A C T A G C A A C C T C A A A C A G A C A C C A T G G T G C A C C T G A C E E E 6 2 2 0 1 T C C T G A G G A G A A G T C T G C C G T T A C T G C C C T G T G G G G C A A G G T G A A C G T G G 0 1 2 6 2 2 5 1 A T G A A G T T G G T G G T G A G G C C C T G G G C A G G T T G G T A T C A A G G T T A C A A G A C 6 2 3 0 1 A G G T T T A A G G A G A C C A A T A G A A A C T G G G C A T G T G G A G A C A G A G A A G A C T C 6 2 3 5 1 T T G G G T T T C T G A T A G G C A C T G A C T C T C T C T G C C T A T T G G T C T A T T T T C C C 6 2 4 0 1 A C C C T T A G G C T G C T G G T G G T C T A C C C T T G G A C C C A G A G G T T C T T T G A G T C I I I 0 1 2 6 2 4 5 1 C T T T G G G G A T C T G T C C A C T C C T G A T G C T G T T A T G G G C A A C C C T A A G G T G A 6 2 5 0 1 A G G C T C A T G G C A A G A A A G T G C T C G G T G C C T T T A G T G A T G G C C T G G C T C A C 6 2 5 5 1 C T G G A C A A C C T C A A G G G C A C C T T T G C C A C A C T G A G T G A G C T G C A C T G T G A 6 2 6 0 1 C A A G C T G C A C G T G G A T C C T G A G A A C T T C A G G G T G A G T C T A T G G G A C C C T T E E i t 6 2 6 5 1 G A T G T T T T C T T T C C C C T T C T T T T C T A T G G T T A A G T T C A T G T C A T A G G A A G 6 2 7 0 1 G G G A G A A G T A A C A G G G T A C A G T T T A G A A T G G G A A A C A G A C G A A T G A T T G C 6 2 7 5 1 A T C A G T G T G G A A G T C T C A G G A T C G T T T T A G T T T C T T T T A T T T G C T G T T C A 6 2 8 0 1 T A A C A A T T G T T T T C T T T T G T T T A A T T C T T G C T T T C T T T T T T T T T C T T C T C 6 2 8 5 1 C G C A A T T T T T A C T A T T A T A C T T A A T G C C T T A A C A T T G T G T A T A A C A A A A G E 5'UTR s 3'UTR 6 2 9 0 1 G A A A T A T C T C T G A G A T A C A T T A A G T A A C T T A A A A A A A A A C T T T A C A C A G T 6 2 9 5 1 C T G C C T A G T A C A T T A C T A T T T G G A A T A T A T G T G T G C T T A T T T G C A T A T T C 6 3 0 0 1 A T A A T C T C C C T A C T T T A T T T T C T T T T A T T T T T A A T T G A T A C A T A A T C A T T 6 3 0 5 1 A T A C A T A T T T A T G G G T T A A A G T G T A A T G T T T T A A T A T G T G T A C A C A T A T T poly-A 6 3 1 0 1 G A C C A A A T C A G G G T A A T T T T G C A T T T G T A A T T T T A A A A A A T G C T T T C T T C promoter 6 3 1 5 1 T T T T A A T A T A C T T T T T T G T T T A T C T T A T T T C T A A T A C T T T C C C T A A T C T C 6 3 2 0 1 T T T C T T T C A G G G C A A T A A T G A T A C A A T G T A T C A T G C C T C T T T G C A C C A T T 6 3 2 5 1 C T A A A G A A T A A C A G T G A T A A T T T C T G G G T T A A G G C A A T A G C A A T A T T T C T 6 3 3 0 1 G C A T A T A A A T A T T T C T G C A T A T A A A T T G T A A C T G A T G T A A G A G G T T T C A T Forward (+) strand Forward (+) strand intergenic 6 3 3 5 1 A T T G C T A A T A G C A G C T A C A A T C C A G C T A C C A T T C T G C T T T T A T T T T A T G G region Reverse (-) strand Reverse (-) strand 6 3 4 0 1 T T G G G A T A A G G C T G G A T T A T T C T G A G T C C A A G C T A G G C C C T T T T G C T A A T 6 3 4 5 1 C A T G T T C A T A C C T C T T A T C T T C C T C C C A C A G C T C C T G G G C A A C G T G C T G G 6 3 5 0 1 T C T G T G T G C T G G C C C A T C A C T T T G G C A A A G A A T T C A C C C C A C C A G T G C A G 6 3 5 5 1 G C T G C C T A T C A G A A A G T G G T G G C T G G T G T G G C T A A T G C C C T G G C C C A C A A 6 3 6 0 1 G T A T C A C T A A G C T C G C T T T C T T G C T G T C C A A T T T C T A T T A A A G G T T C C T T 6 3 6 5 1 T G T T C C C T A A G T C C A A C T A C T A A A C T G G G G G A T A T T A T G A A G G G C C T T G A 6 3 7 0 1 G C A T C T G G A T T C T G C C T A A T A A A A A A C A T T T A T T T T C A T T G C A A T G A T G T

Splice sites can be an important signal

Regular expressions can be limiting C A A G AGGT AGT 5’ splice junction in eukaryotes T C N AGC T C 3’ splice junction ≥11 Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites. Position-specific distributionscame to represent the variability in motif composition.

S = S1 S2 S3 S4 S5 S6 S7 S8 S9 Odds Ratio R = = Score s = log2R P(S|+) P-3(S1)P-2(S2)P-1(S3) ••• P5(S8)P6(S9) P(S|-) Pbg(S1)Pbg(S2)Pbg(S3) ••• Pbg(S8)Pbg(S9) Position-specific scoring matrix (PSSM) Pos -3 -2 -1 +1 +2 +3 +4 +5 +6 A 0.3 0.6 0.1 0.0 0.0 0.4 0.7 0.1 0.1 C 0.4 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.2 G 0.2 0.2 0.8 1.0 0.0 0.4 0.1 0.8 0.2 T 0.1 0.1 0.1 0.0 1.0 0.1 0.1 0.0 0.5

Ok, so we got the genes • molecular biology (transcription, splicing) • signals are modeled as states (HMM) or separately, i.e.PSSMs • Here’s another catch, there isn’t just one version of each gene. • But sometimes several

Eg. alternative splicing - CD44 Human chromosome 11p… Zhu et al Science… (2003)

Alternative splicing • is a major determinant of protein diversity (Lander 2001, Zavolan 2003) • 30-50% of human diseases involve alt. splicing

Defining constitutive and alternative exons Constitutive exon Skipped exon 3’ alternative exon 5’ alternative exon Intron retention Mutually exclusive exons

Conserved alternative, skipped exon - FXR1 Fragile X Related Gene, FXR1

Another example of genes containing CSE: DMWD Myotonic Dystrophy-containing WD Repeat, DMWD

Predicting new alternatively spliced exons The problem is ‘ill-posed’ High-dimensional space Not overfit data Simple feature selection Unbalanced data set sizes Labels are more “flexible”

Eg. of experimentally validated

Biological sequence space: challenges • Models that “represent” as much of the biology as possible. • Biologically motivated features are important • Validating attributes: • Conservation of events are key in computational biology • Higher-level consistency with known biology • Experimental validation of predictions are essential

Framework/Issues • “Build” models around known biology • In the process, extend knowledge about known biology • “Predict” new examples • “Validate” predictions by • prediction accuracy • experimental validation • higher-level traits of predictions • conservation in other genomes

If time permits Modeling higher order interactions: Yeast Phe tRNA Secondary Structure Tertiary Structure

The Hammerhead Ribozyme Secondary structure Tertiary structure

Method of Covariation / Compensatory changes One example on how to model and predict RNA 2o Structure • Covariation (using comparative genomics) Seq1: A C G A A A G U Seq2: U A G U A A U A Seq3: A G G U G A C U Seq4: C G G C A A U G Seq5: G U G G G A A C

is maximal (2 bits) if x and y individually appear at random (A,C,G,U equally likely), but are perfectly correlated (e.g., always complementary) Mutual information statisticfor pair of columns in a multiple alignment = fraction of seqs w/ nt. x in col. i, nt. y in col. j = fraction of seqs w/ nt. x in col. i sum over x, y = A, C, G, U

Inferring 2o Structure from Covariation

Stochastic Context-Free Grammars (SCFGs) • A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA) Ref: Durbin et al. “Biological Sequence Analysis” 1998

An SCFG Model of RNA 2o Structure “Production Rules”: • P  aWb (“pair”) • L  aW (“left bulge/loop”) • R  Wa (“right bulge/loop”) • B  SS (“bifurcation”) • S  W (“start”) • E   (“end”)

last page • some of the slides were obtained from various places: • available online slides on the web (primarily from lectures by terry speed). • slides from chris burge, dirk holste

Statistical modeling and classification in Biological Sequence Space

Statistical modeling and classification in Biological Sequence Space

Presentation Transcript

Biological classification

Biological Classification

Sequence Classification: Chunking

Biological Classification

BIOLOGICAL CLASSIFICATION

Biological Sequence Comparison and Alignment

Linnaeus And Biological Classification

Biological sequence analysis

Statistical modeling and classification in Biological Sequence Space

Statistical Classification

Sequence Classification Using Statistical Pattern Recognition

Biological Sequence Analysis

Biological Sequence Analysis

Statistical Modeling

Statistical modeling, classification, and sensor management

Biological Classification

Biological Sequence Analysis

Biological Classification

Statistical Modeling

Background: Biological Classification

Statistical Classification

Biological Classification