Gene Prediction: Similarity-Based Methods. (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm.
The Gene Prediction Problem • Given genome sequences, determine where the genes are • The problem is easier for prokaryotes (no introns) • The problem is significantly harder for eukaryotes (introns and alternative splicing)
Exons vs. Introns • Exon: A portion of the gene that appears in both the primary and the mature mRNA transcripts. • Intron: A portion of the gene that is transcribed but excised prior to translation.
Definition of a Gene • Regulatory regions: up to 50 kb upstream of the +1 site • Exons: protein-coding and untranslated regions (UTRs); 1 to 178 exons per gene (mean 8.8); 8 bp to 17 kb per exon (mean 145 bp) • Introns: splice acceptor and donor sites, junk DNA; average 1 kb to 50 kb per intron • Gene size: largest 2.4 Mb (dystrophin); mean 27 kb
Different Views of a Gene [Figure: one gene shown as DNA (ATGCTTGCCAAAT…TCG…), as pre-mRNA with exons e1, e2, e3 separated by introns, as mRNA with e1, e2, e3 joined, and as protein (MSRTAQ…)]
Approaches to Gene Prediction • Similarity-based approaches: • Exploit the fact that many genes are conserved across species • Can be highly reliable • Only good for finding genes similar to already known genes • Statistical approaches: • Exploit statistical characteristics of coding and non-coding regions, plus other knowledge about genes • Can potentially detect new genes • May not be reliable • The two can and should be combined • Currently there are no principled approaches for doing this
Outline • The idea of similarity-based approach to gene prediction • Exon Chaining Problem • Spliced Alignment Problem
Using Known Genes to Predict New Genes • Some organisms' genomes may be very well documented, with many genes having been experimentally verified • Closely related organisms may have similar genes • Unknown genes in one species may be compared to genes in some closely related species
Comparing Genes in Two Genomes • Small islands of similarity corresponding to similarities between exons
Reverse Translation • Given a known protein, find a gene in the genome which codes for it • One might infer the coding DNA of the given protein by reversing the translation process • Inexact: most amino acids map to more than one codon • This problem is essentially reduced to an alignment problem
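To make the "inexact" point concrete, here is a minimal Python sketch that counts how many distinct coding DNA sequences could encode a given protein. It assumes the standard genetic code and ignores the stop codon; the names `CODON_COUNTS` and `count_reverse_translations` are illustrative and not from the lecture.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not counted here).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_reverse_translations(protein: str) -> int:
    """Return the number of DNA sequences that translate to `protein`.

    Each residue contributes an independent choice among its codons,
    so the total is the product of the per-residue codon counts.
    """
    return prod(CODON_COUNTS[aa] for aa in protein)


if __name__ == "__main__":
    # The six-residue prefix MSRTAQ already has over a thousand encodings.
    print(count_reverse_translations("MSRTAQ"))  # 1 * 6 * 6 * 4 * 4 * 2 = 1152
```

Because the count grows multiplicatively with protein length, enumerating candidate DNA sequences is impractical, which is why the problem is treated as an alignment problem instead.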
Comparing Genomic DNA Against mRNA [Figure: an mRNA (codon sequence) aligned against a portion of the genome laid out as exon1, intron1, exon2, intron2, exon3]
Using Similarities to Find the Exon Structure [Figure: a known frog gene aligned against the human genome] • The known frog gene is aligned to different locations in the human genome • Find the “best” path to reveal the exon structure of the human gene
Finding Local Alignments [Figure: known frog genes aligned against the human genome] • Use local alignments to find all islands of similarity
Chaining Local Alignments • Find substrings that match a given gene sequence (candidate exons) • Represent each candidate exon as a triple (l, r, w): left position, right position, and weight (the score of its local alignment) • Look for a maximum-weight chain of substrings • Chain: a set of non-overlapping, nonadjacent intervals
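Exon Chaining Problem [Figure: weighted intervals (weights 5, 5, 15, 9, 11, 4, 3) drawn over an axis with endpoints at 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32] • Locate the beginning and end of each interval (2n points for n intervals) • Find the “best” path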
Exon Chaining Problem: Formulation • Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons • Input: a set of weighted intervals (putative exons) • Output: a chain of intervals from this set with maximum total weight
Exon Chaining: Graph Representation • This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted.
Exon Chaining Algorithm
ExonChaining(G, n) // graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
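The dynamic program can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the function name `exon_chaining`, the tuple layout `(left, right, weight)`, and the example intervals are assumptions made here. The sketch sorts the endpoints itself, so it runs in O(n log n) overall, and it treats intervals that share an endpoint as overlapping.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of intervals.

    `intervals` is a list of (left, right, weight) tuples with left < right.
    """
    # Build the 2n endpoint events. At equal coordinates a left end
    # (kind 0) sorts before a right end (kind 1), so two intervals that
    # share an endpoint can never be placed in the same chain.
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()

    # s[i] is the best chain weight using only the first i endpoints.
    s = [0] * (len(events) + 1)
    left_pos = {}  # interval index -> position of its left endpoint
    for i, (_coord, kind, idx) in enumerate(events, start=1):
        if kind == 0:
            # Left end: nothing new can be completed here.
            left_pos[idx] = i
            s[i] = s[i - 1]
        else:
            # Right end: either take this interval or skip it.
            j = left_pos[idx]
            w = intervals[idx][2]
            s[i] = max(s[j] + w, s[i - 1])
    return s[-1]


if __name__ == "__main__":
    # Hypothetical putative exons, not the ones from the slide figure.
    example = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10), (9, 10, 1), (13, 15, 7)]
    print(exon_chaining(example))  # best chain: (1,5) + (6,12) + (13,15) = 22
```

The sketch returns only the score; recovering the chain itself would need a traceback over `s`, which is omitted here for brevity.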
Exon Chaining: Deficiencies • Poor definition of the putative exon endpoints • Optimal chain of intervals may not correspond to any valid alignment • First interval may correspond to a suffix, whereas second interval may correspond to a prefix • Combination of such intervals is not a valid alignment
Spliced Alignment • Proposed in 1996 by Mikhail Gelfand and colleagues • Goal: Use a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome • Method • Begin by selecting either all putative exons between potential acceptor and donor sites or all substrings similar to the target protein (as in the Exon Chaining Problem) • Find a chain of putative exons that has the highest similarity to the target protein
Spliced Alignment Problem: Formulation • Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence • Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B • Output: A chain of exons Γ from B such that the global alignment score s(Γ*, T) is maximum among all chains of blocks from B, where Γ* is the string formed by concatenating the strings in Γ • Essentially an alignment problem…
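The formulation can be made concrete with a brute-force Python sketch that tries every chain of blocks and scores its concatenation against the target. This is only an illustration of the problem statement, not the efficient spliced alignment algorithm: it takes exponential time in the number of blocks. The function names, the half-open `(start, end)` block encoding, the scoring values (+1 match, -1 mismatch, -1 gap), and the example sequences are all assumptions made here.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def is_chain(blocks):
    """True if sorted half-open blocks are non-overlapping and nonadjacent."""
    return all(b1[1] < b2[0] for b1, b2 in zip(blocks, blocks[1:]))


def spliced_alignment_bruteforce(G, T, B):
    """Return (best score, best chain) over all chains of blocks from B."""
    ordered = sorted(B)
    # Baseline: the empty chain, whose concatenation is the empty string.
    best_score, best_chain = global_alignment_score("", T), ()
    for k in range(1, len(ordered) + 1):
        for chain in combinations(ordered, k):
            if not is_chain(chain):
                continue
            gamma_star = "".join(G[s:e] for s, e in chain)
            score = global_alignment_score(gamma_star, T)
            if score > best_score:
                best_score, best_chain = score, chain
    return best_score, best_chain


if __name__ == "__main__":
    # Hypothetical genomic sequence, target, and candidate blocks.
    G = "TTATGCCCCGTACCACGTT"
    T = "ATGGTACG"
    B = [(2, 5), (5, 9), (9, 12), (14, 17), (15, 17)]
    print(spliced_alignment_bruteforce(G, T, B))
```

In this example the chain (2, 5), (9, 12), (15, 17) concatenates to exactly the target, so it scores 8, whereas the heavier block (14, 17) would introduce an extra base. Unlike exon chaining, the blocks are judged together against T rather than by independent weights.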
The solution to the spliced alignment problem will be discussed later when we talk about sequence alignment…
What You Should Know • Why splicing causes difficulty in gene prediction • The formulation and algorithm for Exon Chaining • Why Spliced Alignment is a better formulation than Exon Chaining