A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence. Paola Bonizzoni Graziano Pesole * Raffaella Rizzi DISCo, University of Milan-Bicocca, Italy * Department of Physiology and Biochemistry, University of Milan, Italy
Supported by FIRB Bioinformatics: Genomics and Proteomics
Outline • Gene structure and alternative splicing (AS) • Problem definition and algorithm • ASPic program • Experimental results and discussion
Mechanism of Splicing [Figure: DNA (5’–3’ double strand) is transcribed into pre-mRNA containing exon 1, exon 2 and exon 3; splicing by the spliceosome yields the mRNA splicing product (exon 1, exon 2, exon 3); an EST (Expressed Sequence Tag) is a cDNA sequence.]
Modes of Alternative Splicing [Figure: a genomic sequence with exons 1, 2, 3 separated by introns, shown under a first, second and third splicing mode.]
Modes of Alternative Splicing [Figure: exons 1, 2, 3 with two further modes: competing 5’–3’, and exclusive exons (exon 2b).]
Why is AS important? • AS occurs in 59% of human genes (Graveley, 2001) • AS expands protein diversity (a single gene generates multiple transcripts) • AS is tissue-specific (Graveley, 2001) • AS is related to human diseases
Motivations Regulation of AS is still an open problem. We NEED tools to • predict alternative splicing forms • analyze this mechanism through a representation of splicing forms
What is available? Fast programs that produce a single EST alignment to a genomic sequence: Spidey (Wheelan et al., 2001), Squall (Ogasawara & Morishita, 2002). But predicting the exon-intron gene structure is a complicated goal because • sequencing errors in ESTs make it difficult to locate splice sites by alignment • duplications and repeated sequences may produce more than one possible EST alignment
Open Problems • Formal definition of the AS prediction problem … • Combined analysis of the EST alignments related to the same gene, by agreeing ESTs to a common exon-intron gene structure • Optimization criteria
Formal Definitions
• Def1: Genomic sequence, $G = I_1 f_1 I_2 f_2 I_3 f_3 \dots I_n f_n I_{n+1}$, where $I_i$ ($i = 1, 2, \dots, n+1$) are introns and $f_i$ ($i = 1, 2, \dots, n$) are exons
• Def2: Exon factorization of $G$, $G_E = f_1 f_2 f_3 \dots f_n$
• Def3: EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ s.t. there exists $1 \le i_1 < i_2 < \dots < i_k \le n$:
  • $s_t = f_{i_t}$ for $t = 2, 3, \dots, k-1$
  • $s_1$ is a suffix of $f_{i_1}$ and $s_k$ is a prefix of $f_{i_k}$
• Def3 (with sequencing errors): the same indices $1 \le i_1 < i_2 < \dots < i_k \le n$ must exist, with equality relaxed to a bounded edit distance:
  • $\mathrm{edit}(s_t, f_{i_t}) \le \mathrm{error}$ for $t = 2, 3, \dots, k-1$
  • $\mathrm{edit}(s_1, \mathrm{suff}(f_{i_1})) \le \mathrm{error}$ and $\mathrm{edit}(s_k, \mathrm{pref}(f_{i_k})) \le \mathrm{error}$
• Splice variant: $s_t = \mathrm{suff}(f_{i_t})$ or $s_t = \mathrm{pref}(f_{i_t})$
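The exact (error-free) form of Def3 can be illustrated with a minimal Python sketch. The function names `factor_matches` and `compatible_indices`, the toy sequences, and the treatment of a single-factor EST (accepted if it lies inside one exon) are assumptions made for illustration; they are not taken from ASPic.

```python
from typing import List, Optional


def factor_matches(factor: str, exon: str, is_first: bool, is_last: bool) -> bool:
    """Check one EST factor against one exon under the exact form of Def3."""
    if is_first and is_last:
        # Assumption: an EST with a single factor must lie inside one exon.
        return factor in exon
    if is_first:
        return exon.endswith(factor)    # s_1 is a suffix of f_{i_1}
    if is_last:
        return exon.startswith(factor)  # s_k is a prefix of f_{i_k}
    return factor == exon               # internal factors equal a whole exon


def compatible_indices(est_factors: List[str], exons: List[str]) -> Optional[List[int]]:
    """Return increasing 0-based exon indices i_1 < ... < i_k, or None.

    Picking the leftmost usable exon for each factor is enough, because an
    earlier choice never rules out a later one.
    """
    indices: List[int] = []
    next_start = 0
    k = len(est_factors)
    for t, factor in enumerate(est_factors):
        found = None
        for i in range(next_start, len(exons)):
            if factor_matches(factor, exons[i], t == 0, t == k - 1):
                found = i
                break
        if found is None:
            return None
        indices.append(found)
        next_start = found + 1
    return indices


# Toy exon factorization (hypothetical sequences).
exons = ["ACGTAC", "GGA", "TTCAGT", "CCATG"]
print(compatible_indices(["TAC", "TTCAGT", "CCA"], exons))  # [0, 2, 3]
print(compatible_indices(["TAC", "TTC", "CCA"], exons))     # None
```

In the first call the EST skips the second exon, which is the kind of exon skipping that the factorization is meant to expose. The error-tolerant form of Def3 would replace the equality tests with an edit-distance bound.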
The Problem
Input
- A genomic sequence G
- A set of EST sequences S = {S1, S2, …, Sn}
Output
- An exon factorization GE of G (GE = f1, f2, …, fn) and a set of EST factorizations compatible with GE
Objective: minimize n, the number of exons in GE
Example [Figure: genomic sequence G and EST set S = {S1, S2, S3}, with EST factors S1 = A2 D1 C1, S2 = A1A2 B D1, S3 = A2 D1D2 C1C2; the figure contrasts an exon factorization with 7 exons and one with 4 exons.]
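One possible reading of the 7-versus-4 count in this example is sketched below in Python. It assumes that the labels can be treated as plain strings, that the 7-exon factorization takes every distinct EST factor as its own exon, and that the 4-exon factorization absorbs any factor that is a prefix or suffix of a longer one (the splice-variant case of Def3). The function names are hypothetical, and the slide itself does not state how the two counts were obtained.

```python
# EST factors as listed on the example slide.
ests = {
    "S1": ["A2", "D1", "C1"],
    "S2": ["A1A2", "B", "D1"],
    "S3": ["A2", "D1D2", "C1C2"],
}


def naive_exons(ests):
    """Every distinct EST factor becomes its own exon."""
    return sorted({f for factors in ests.values() for f in factors})


def merged_exons(ests):
    """Drop factors that are a proper prefix or suffix of another factor."""
    factors = naive_exons(ests)
    return [
        f for f in factors
        if not any(g != f and (g.startswith(f) or g.endswith(f)) for g in factors)
    ]


print(len(naive_exons(ests)))  # 7
print(merged_exons(ests))      # ['A1A2', 'B', 'C1C2', 'D1D2']
```

Under these assumptions the smaller factorization still explains all three ESTs, which is what the objective "minimize n" rewards.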
Results • MEFC is MAX-SNP-hard (linear reduction from NODE-COVER) • Heuristic algorithm: iterate the process of factorizing each EST, backtracking to recompute previous EST factors if they are not compatible with GE
The algorithm
Iterative jth step: partial EST factorization of Si (compute factor sij)
[Figure: ESTs Si-1 (factors si-1 1, …, si-1 j-1, si-1 j, …, si-1 n) and Si (factors si1, …, si j-1, sij) placed on exons e1, e2, …, em of G.]
After placing all the factors sij for the set S, place the external factors.
if (Compatible(em, exon_list)) then add em to exon_list; otherwise try to place sij elsewhere; if not possible then backtrack.
The algorithm (more details): compute factor sij
[Figure: factor sij of Si split into components c1, c2, c3, c4, c5 and placed on an exon of G delimited by ag on the left and gt on the right.]
• sij can be divided into n components ck (k=1,2,…,n)
• At least one of these components for k from 1 to (n-1) is error-free and can be placed on G
• The algorithm searches for a perfect match of c1 on G
• Suppose that c1 has no perfect match on G
• Then the algorithm searches for a perfect match of c2 on G
• Suppose that c2 has a perfect match on G
• Find the canonical ag pattern on the left
• Find the rightmost gt pattern such that the edit distance between sijy and the genomic substring from ag to gt is bounded
• Then the entire factor sij can be placed on G
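The seed-and-extend idea in these steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the ASPic implementation: the names `edit_distance` and `place_factor`, the equal-size split into components, the use of only the first perfect match of a component, the search windows around the expected exon ends, and the toy genome are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def place_factor(factor: str, genome: str, n_components: int = 4, max_errors: int = 1):
    """Return (start, end) so that genome[start:end] is the placed exon, or None."""
    size = len(factor) // n_components
    if size == 0:
        return None
    # Try components c_1 .. c_{n-1} as error-free seeds, in order.
    for k in range(n_components - 1):
        seed = factor[k * size:(k + 1) * size]
        hit = genome.find(seed)  # assumption: only the first perfect match is used
        if hit < 0:
            continue
        expected_start = hit - k * size
        # Look for the canonical "ag" to the left of the seed.
        ag_lo = max(0, expected_start - 2 - max_errors)
        ag_hi = min(hit - 2, expected_start - 2 + max_errors)
        for ag in range(ag_hi, ag_lo - 1, -1):
            if genome[ag:ag + 2] != "ag":
                continue
            start = ag + 2
            # Scan "gt" candidates from the rightmost one to the left.
            gt_hi = min(len(genome) - 2, start + len(factor) + max_errors)
            gt_lo = max(hit + size, start + len(factor) - max_errors)
            for gt in range(gt_hi, gt_lo - 1, -1):
                if (genome[gt:gt + 2] == "gt"
                        and edit_distance(factor, genome[start:gt]) <= max_errors):
                    return start, gt
    return None


# Toy genome: intron ending in "ag", a 12-base exon, then an intron starting with "gt".
genome = "ttcccag" + "acctgcatcgac" + "gtaagt" + "cc"
factor = "agctgcatcgac"  # the exon with one substitution in its first component
print(place_factor(factor, genome, n_components=4, max_errors=1))  # (7, 19)
print(place_factor(factor, genome, n_components=4, max_errors=0))  # None
```

In the toy run the first component has no perfect match because it carries the error, so the second component acts as the seed, mirroring the c1/c2 case described on the slide.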
ASPic (Alternative Splicing PredICtion)
Input
- A minimum exon length
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)
Output
- A text file with all EST alignments
- An HTML file with the exon factorization of the genomic sequence
ASPic data validation ASPic INPUT: • Genomic sequences from the ASAP database • EST clusters of human chromosome 1 from the UniGene database Validation database: ASAP (Lee et al., 2003)
Experimental Results [Table columns: Genomic sequence (official gene name) | Introns detected by ASAP | ASAP introns detected by ASPic | Novel introns detected by ASPic | Genomic shift detected by ASPic]
Execution times (Pentium IV, 1600 MHz, 256 MB, running Linux)
An example of data (gene HNRPR) ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences. Positions are counted from 0 by ASPic and from 1 by ASAP.
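Because the two tools count positions from different origins, intron coordinates have to be shifted before they are compared. The short Python sketch below assumes that both tools report inclusive start and end positions, so that a 0-based pair maps to a 1-based pair by adding 1 to each end; that convention, the function names, and every coordinate other than 2144-5333 are hypothetical.

```python
def to_one_based(intron):
    """Shift a 0-based (start, end) pair to 1-based; both ends assumed inclusive."""
    start, end = intron
    return start + 1, end + 1


def novel_introns(aspic_introns, asap_introns):
    """ASPic introns (0-based) with no counterpart among ASAP introns (1-based)."""
    known = set(asap_introns)
    return [intron for intron in aspic_introns if to_one_based(intron) not in known]


aspic = [(2144, 5333), (100, 200)]  # second entry is made up for the example
asap = [(101, 201)]                 # made-up ASAP entry matching (100, 200)
print(to_one_based((2144, 5333)))   # (2145, 5334)
print(novel_introns(aspic, asap))   # [(2144, 5333)]
```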
An example of data (gene HNRPR, intron 2144-5333) [Figure: for each EST ID, the left and right ends of the two exons, given for the EST exons and for the genomic exons.]
Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole. Software design lead: Raffaella Rizzi. Website: Gabriele Ravanelli. Graphical representation: Francesco Perego, Anna Redondi. Data analysis: Francesca Rossin. Other contributions: Gianluca Dellavedova