Dynamic Programming (cont’d)

Dynamic Programming (cont’d) CS 466 Saurabh Sinha

This is more likely. This is less likely. Affine Gap Penalties • In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: ATA__GGC ATGATCGC ATA_G_GC ATGATCGC Normal scoring would give the same score for both alignments

Accounting for Gaps • Gaps- contiguous sequence of spaces in one of the rows • Score for a gap of length x is: -(ρ +σx) where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large relative to σ: gap extension penalty because you do not want to add too much of a penalty for extending the gap.

Affine gap penalty in DP • When computing si,j, need to look at si,j-1, si,j-2, si,j-3,…. and si-1,j, si-2,j, … • Each cell needs O(n) time for update • O(n2) cells • Therefore, O(n3) algorithm • We can still do this in O(n2) time

Affine Gap Penalty Recurrences Continue Gap in w (deletion) si,j = s i-1,j - σ max s i-1,j –(ρ+σ) si,j = s i,j-1 - σ max s i,j-1 –(ρ+σ) si,j = si-1,j-1 + δ(vi, wj) max s i,j s i,j Start Gap in w (deletion): from middle Continue Gap in v (insertion) Start Gap in v (insertion):from middle Match or Mismatch End deletion: from top End insertion: from bottom

Optional Reading Section 6.10 (J & P)Multiple Alignment

Gene Prediction

Gene Prediction: Computational Challenge • Gene: A sequence of nucleotides coding for protein • Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

The Genetic Code SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

Codons • In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations • Systematically deleted nucleotides from DNA • Single and double deletions dramatically altered protein product • Effects of triple deletions were minor • Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

Great Discovery Provoking Wrong Assumption • In 1964, Charles Yanofsky and Sydney Brenner proved colinearity in the order of codons with respect to amino acids in proteins • As a result, it was incorrectly assumed that the triplets encoding for amino acid sequences form contiguous strips of information.

Exons and Introns • In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) • This makes computational gene prediction in eukaryotes even more difficult • Prokaryotes don’t have introns - Genes in prokaryotes are continuous

Splicing intron1 intron2 exon2 exon3 exon1 transcription splicing translation exon = coding intron = non-coding Batzoglou

Gene prediction • More difficult in eukaryotes than in prokaryotes (due to introns). • In human genome, ~3% of DNA sequence is genes • Lot of “junk” DNA between genes, and even inside genes (between exons). • Gene prediction must deal with this.

Gene prediction: broadly speaking • Statistical approaches:look for features than appear frequently in genes and infrequently elsewhere • Similarity based approaches: a newly sequenced gene may be similar to a known gene. • even this is not so simple. The exon structures may be different between otherwise similar genes

Statistical approaches

Open Reading Frames (ORFs) • Let us consider gene prediction in prokaryotes (no introns) • Detect potential coding regions by looking at ORFs • A region of length n is comprised of (n/3) codons • Stop codons break genome into segments between consecutive Stop codons • The subsegments of these that start from the Start codon (ATG) are ORFs ATG TGA Genomic Sequence Open reading frame

ORFs • 6 reading frames in any given sequence • 6 ways to map the DNA sequence to codon sequence (+1,+2,+3,-1,-2,-3) • 3 on either strand • Look at all 6 reading frames for ORFs

Long vs.Short ORFs • Long open reading frames may be a gene • At random, we should expect one stop codon every (64/3) ~= 21 codons • However, genes are usually much longer than this • A basic approach is to scan for ORFs whose length exceeds certain threshold • This is naïve because some genes (e.g. some neural and immune system genes) are relatively short

Codon usage • In a given sequence (e.g., an ORF), compute frequency distribution of codons (64 element array): codon usage array • Codon usage array for coding sequences is different from that for non-coding sequences • If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene

Codon usage • Codons coding for “Arg” in human: • CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3% • In a coding sequence, codon CGC is 12 times more likely than codon AGG • An ORF preferring CGC over AGG is likely to be a gene

Codon Usage in Human Genome

Codon usage • One way to test if an ORF is a gene is to compute • Pr(ORF sequence under a coding sequence model) • Pr(ORF sequence under a non-coding model) • Ratio of the two. • These methods work best in prokaryotes • The exon-intron trouble is not handled yet

Promoter Structure in Prokaryotes (E.Coli) • Transcription starts at offset 0. • Pribnow Box (-10) • Gilbert Box (-30) • Ribosomal Binding Site (+10)

Ribosomal Binding Site

Splicing Signals: an additional statistical clue, for eukaryotes Exons are interspersed with introns and typically flanked by GT and AG

Donor site 5’ 3’ Position % Splice site detection From lectures by Serafim Batzoglou (Stanford)

Consensus splice sites

Statistical approaches: summary • Codon usage • Promoter motifs • Ribosome binding site • Splicing sites

Similarity based approaches

Similarity based approaches • Some genomes may be very well-studied, with many genes having been experimentally verified. • Closely-related organisms may have similar genes • Unknown genes in one species may be compared to genes in some closely-related species

The basic approach • Given a protein sequence, and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence • Deals with the exon-intron problem • First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment) • Then find the “optimal” subset of non-overlapping fragments

Exon chaining • Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon • The “goodness” of the match is the “weight” assigned to this putative exon • Thus, we have a set of weighted intervals (l,r,w): for a fragment from l to r, with weight w representing how well it matches (a portion of) the protein

Exon Chaining Problem • Input: A set of weighted intervals (l,r,w) • Output: A maximum weight chain of non-overlapping intervals from this set

Exon Chaining Problem: Graph Representation edge from every li to ri edge between every two successive vertices • This problem can be solved with dynamic programming in O(n) time. 21

Assumptions • No two intervals have a common boundary point. So the (li,ri) define 2n distinct points, if there are n intervals

Exon Chaining Algorithm ExonChaining (G, n) //Graph, number of intervals fori ←to 2n si← 0 fori ← 1 to 2n if vertex vi in G corresponds to right end of the interval I j← index of vertex for left end of the interval I w← weight of the interval I si← max {sj + w, si-1} else si← si-1 returns2n

Not very helpful • A chain is a set of non-overlapping exons in order (left to right) • But the matching protein portions may not be in the same order !

Spliced Alignment • Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). • This set is further filtered in a such a way that attempt to retain all true exons, with some false ones. • Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized

Spliced Alignment Problem: Formulation • Input: Genomic sequences G, target sequence T, and a set of candidate exons (blocks) B. • Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized Γ* - concatenation of all exons from chain Γ

Dynamic programming • Genomic sequence G = g1g2…gn • Target sequence T = t1t2…tm • As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T • Problem is, there are many i-prefixes possible (since multiple blocks may include position i)

Idea • Find the optimal alignment score of the i-prefix of G and the j-prefix of T assuming that this alignment uses a particular block B at position i • S(i, j, B) • For every block B that includes i

Recurrence If i is not the starting vertex of block B: • S(i, j, B) = max { S(i – 1, j, B) – indel penalty S(i, j – 1, B) – indel penalty S(i – 1, j – 1, B) + δ(gi, tj) } If i is the starting vertex of block B: • S(i, j, B) = max { S(i, j – 1, B) – indel penalty maxall blocks B’ preceding block B S(end(B’), j, B’) – indel penalty maxall blocks B’ preceding block B S(end(B’), j – 1, B’) + δ(gi, tj) }

Dynamic Programming (cont’d)