Gene Prediction: Similarity-Based Methods

Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics), Sept. 15, 2005. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign. Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

The Gene Prediction Problem • Given genome sequences, determine where the genes are • The problem is easier for prokaryotes (no introns) • The problem is significantly harder for eukaryotes (introns and alternative splicing)

Splicing Causes Problem…

Exons vs. Introns • Exon: A portion of the gene that appears in both the primary and the mature mRNA transcripts. • Intron: A portion of the gene that is transcribed but excised prior to translation.

Definition of a Gene • Regulatory regions: up to 50 kb upstream of the +1 site • Exons: protein-coding and untranslated regions (UTRs); 1 to 178 exons per gene (mean 8.8); 8 bp to 17 kb per exon (mean 145 bp) • Introns: splice acceptor and donor sites, junk DNA; average 1 kb – 50 kb per intron • Gene size: largest 2.4 Mb (dystrophin); mean 27 kb.

Different Views of a Gene [Figure: the same gene shown as DNA (ATGCTTGCCAAAT…TCG…), as pre-mRNA with exons e1, e2, e3 separated by introns, as mRNA with exons e1, e2, e3 joined, and as protein (MSRTAQ…)]

Approaches to Gene Prediction • Similarity-based approaches: • Exploit the fact that many genes are conserved across species • Can be highly reliable • Only good for finding genes that resemble already-known genes • Statistical approaches: • Exploit statistical characteristics of coding and non-coding regions, plus other knowledge about genes • Can potentially detect new genes • May not be reliable • The two can/should be combined • Currently no principled approaches for doing this

Outline • The idea of the similarity-based approach to gene prediction • Exon Chaining Problem • Spliced Alignment Problem

Using Known Genes to Predict New Genes • Some organisms' genomes may be very well-documented, with many genes having been experimentally verified • Closely related organisms may have similar genes • Unknown genes in one species may be compared to genes in some closely related species

Comparing Genes in Two Genomes • Small islands of similarity corresponding to similarities between exons

Reverse Translation • Given a known protein, find a gene in the genome that codes for it • One might infer the coding DNA of the given protein by reversing the translation process • Inexact: most amino acids map to more than one codon • This problem is essentially reduced to an alignment problem
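
To make the "inexact" point concrete, the sketch below counts how many distinct coding DNA sequences could encode a given protein. It assumes the standard genetic code (written in the usual TCAG ordering); the helper names `codons_for` and `count_reverse_translations` are illustrative and do not come from the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return every codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def count_reverse_translations(protein):
    """Count the DNA sequences that translate to `protein` (stop codon excluded)."""
    total = 1
    for aa in protein:
        total *= len(codons_for(aa))
    return total


# The six-residue prefix MSRTAQ already has 1 * 6 * 6 * 4 * 4 * 2 candidates.
print(count_reverse_translations("MSRTAQ"))  # 1152
```

Because the count grows multiplicatively with protein length, enumerating candidates is impractical, which is one way to see why the problem is recast as alignment.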

Comparing Genomic DNA Against mRNA [Figure: an mRNA (codon sequence) aligned against a portion of the genome containing exon1, intron1, exon2, intron2, exon3]

Using Similarities to Find the Exon Structure [Figure: known frog gene plotted against the human genome] • The known frog gene is aligned to different locations in the human genome • Find the “best” path to reveal the exon structure of the human gene

Finding Local Alignments [Figure: known frog genes plotted against the human genome] • Use local alignments to find all islands of similarity
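
As a rough illustration of how one island of similarity might be located, here is a minimal Smith-Waterman-style sketch. It reports only the single best local alignment, not all islands, and the scoring values (match +1, mismatch -1, indel -1) as well as the function name `best_local_alignment` are assumptions made for this example.

```python
def best_local_alignment(genome, gene, match=1, mismatch=-1, indel=-1):
    """Return (left, right, weight) for the best-scoring local alignment.

    `left` and `right` are half-open coordinates in `genome`; `weight` is the
    alignment score. Scoring parameters are illustrative assumptions.
    """
    n, m = len(genome), len(gene)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: genome index where the alignment ending at cell (i, j) begins.
    start = [[i] * (m + 1) for i in range(n + 1)]
    best = (0, 0, 0)  # (weight, left, right)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if genome[i - 1] == gene[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            up = score[i - 1][j] + indel
            left = score[i][j - 1] + indel
            s = max(0, diag, up, left)
            score[i][j] = s
            if s == 0:
                start[i][j] = i  # restart: empty alignment
            elif s == diag:
                start[i][j] = start[i - 1][j - 1]
            elif s == up:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
            if s > best[0]:
                best = (s, start[i][j], i)

    weight, l, r = best
    return l, r, weight


# Toy data (made up): the gene occurs exactly once, at genome[4:10].
print(best_local_alignment("TTTTACGTACGGGG", "ACGTAC"))  # (4, 10, 6)
```

The returned triple has the same (left, right, weight) shape as a candidate exon, so repeated searches over different regions could feed a chaining step.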

Chaining Local Alignments • Find substrings that match a given gene sequence (candidate exons) • Describe each candidate exon as (l, r, w): left end, right end, and weight (the score of its local alignment) • Look for a maximum-weight chain of substrings • Chain: a set of non-overlapping, non-adjacent intervals

Exon Chaining Problem [Figure: weighted intervals (labels 5, 5, 15, 9, 11, 4, 3) drawn over a coordinate axis marked 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32] • Locate the beginning and end of each interval (2n points) • Find the “best” path

Exon Chaining Problem: Formulation • Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons • Input: A set of weighted intervals (putative exons) • Output: A maximum-weight chain of intervals from this set

Exon Chaining: Graph Representation • This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted.

Exon Chaining Algorithm
ExonChaining (G, n) // Graph, number of intervals; take s0 = 0
  for i ← 1 to 2n
    si ← 0
  for i ← 1 to 2n
    if vertex vi in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      si ← max {sj + w, si-1}
    else
      si ← si-1
  return s2n
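
The pseudocode can be sketched in runnable form as follows. This is a minimal version under stated assumptions: intervals are given directly as (left, right, weight) triples instead of a graph, endpoints are sorted inside the function (so the whole call costs O(n log n)), intervals that share an endpoint are treated as conflicting, and a traceback is added to recover the chain itself. The function name and the toy intervals are made up for illustration.

```python
def exon_chaining(intervals):
    """Return (best_weight, chain) for a list of (left, right, weight) intervals."""
    # Build the 2n endpoints. kind 0 = left end, kind 1 = right end, so at equal
    # coordinates left ends sort first and touching intervals cannot be chained.
    points = []
    for idx, (left, right, _) in enumerate(intervals):
        points.append((left, 0, idx))
        points.append((right, 1, idx))
    points.sort()

    s = [0] * (len(points) + 1)  # s[0] = 0 is the empty chain
    left_pos = {}                # interval index -> position of its left end
    for i, (_, kind, idx) in enumerate(points, start=1):
        if kind == 0:
            left_pos[idx] = i
            s[i] = s[i - 1]
        else:
            j = left_pos[idx]
            w = intervals[idx][2]
            s[i] = max(s[j] + w, s[i - 1])

    # Traceback: an interval was used wherever its right end improved the score.
    chain = []
    i = len(points)
    while i > 0:
        _, kind, idx = points[i - 1]
        if kind == 1 and s[i] > s[i - 1]:
            chain.append(intervals[idx])
            i = left_pos[idx]
        else:
            i -= 1
    chain.reverse()
    return s[-1], chain


# Toy input (made up): the best chain skips the overlapping interval (3, 8, 6).
candidates = [(1, 4, 5), (3, 8, 6), (5, 9, 4), (10, 12, 3)]
print(exon_chaining(candidates))  # (12, [(1, 4, 5), (5, 9, 4), (10, 12, 3)])
```

The main loop mirrors the recurrence directly: a right end either closes its interval on top of the best score at its left end, or inherits the score of the previous vertex.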

Exon Chaining: Deficiencies • Poor definition of the putative exon endpoints • Optimal chain of intervals may not correspond to any valid alignment • First interval may correspond to a suffix, whereas second interval may correspond to a prefix • Combination of such intervals is not a valid alignment

Spliced Alignment • Proposed in 1996 by Mikhail Gelfand and colleagues • Goal: Use a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome • Method: • Begin either by selecting all putative exons between potential acceptor and donor sites, or by finding all substrings similar to the target protein (as in the Exon Chaining Problem) • Find a chain of putative exons that has the highest similarity to the target protein

Spliced Alignment Problem: Formulation • Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence • Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B • Output: A chain of exons Γ such that the global alignment score s(Γ*, T) is maximum among all chains of blocks from B, where Γ* is the string formed by concatenating the strings in Γ • Essentially an alignment problem…
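
The formulation can be checked on tiny inputs with a brute-force sketch: enumerate every chain of blocks, concatenate it, and score it against T by global alignment. This is exponential in |B| and is not the dynamic-programming solution the lecture defers to later; it only illustrates the objective. Blocks are assumed to be half-open (start, end) index pairs into G, scoring is assumed to be match +1, mismatch -1, indel -1, and all names and data are made up.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, indel=-1):
    """Needleman-Wunsch global alignment score (illustrative scoring values)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def spliced_alignment_bruteforce(genome, target, blocks):
    """Return (best_score, best_chain) over all non-overlapping chains of blocks."""
    ordered = sorted(blocks)
    best_score, best_chain = None, []
    for size in range(1, len(ordered) + 1):
        for chain in combinations(ordered, size):
            # Keep only chains whose consecutive blocks do not overlap.
            if any(x[1] > y[0] for x, y in zip(chain, chain[1:])):
                continue
            spliced = "".join(genome[s:e] for s, e in chain)
            score = global_score(spliced, target)
            if best_score is None or score > best_score:
                best_score, best_chain = score, list(chain)
    return best_score, best_chain


# Toy data (made up): three true exons, one intron block, one spurious block.
G = "ATGCCCCCGGATTTTTTAA"
T = "ATGGGATAA"
B = [(0, 3), (3, 8), (8, 11), (16, 19), (2, 9)]
print(spliced_alignment_bruteforce(G, T, B))  # (9, [(0, 3), (8, 11), (16, 19)])
```

Unlike exon chaining, the score here depends on the concatenated string as a whole, so a chain is only as good as the single alignment it produces against T.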

Lewis Carroll Example

The solution to the spliced alignment problem will be discussed later when we talk about sequence alignment…

What You Should Know • Why splicing causes difficulty in gene prediction • The formulation and algorithm for Exon Chaining • Why Spliced Alignment is a better formulation than Exon Chaining
