Overview

Overview
• Biological motivation
• Methods in gene prediction
• Mapping of large EST data sets
• Applications of EST data mining

ESTomics Sorin Istrail

Biological motivation
• Model of eukaryotic gene transcription and translation
[Figure: the gene on the DNA coding strand with its RNA polymerase II promoter (upstream binding sites for Sp1, Oct1 and C/EBP; TATA box; initiator). Transcription yields the primary transcript (cap, 5’ UTR, Exon 1, intron delimited by GT and AG, Exon 2, 3’ UTR, AAUAAA signal, poly(A) tail (A)n). Splicing yields the mRNA (5’ UTR to 3’ UTR). Translation yields the protein (peptide).]

Biological motivation
• Expressed Sequence Tags (ESTs) are cDNA fragments
  • 500 bp long on average
  • may span one or more exons
• cDNA: single-stranded DNA complementary to an RNA, synthesized from it by reverse transcription
[Figure: gene on the DNA coding strand (Exons 1 to 4 separated by introns; Exon 4 is non-coding), the primary transcript, the spliced mRNA with its 5’ and 3’ UTRs, and the ESTs derived from it.]

Methods in gene finding
• Ab initio analysis of genomic sequences (GenScan, Burge and Karlin 1997; HMMer, Haussler et al. 1993, Krogh et al. 1994; FGenesH, Solovyev and Salamov 1994)
• Comparison of protein and genomic sequences (Procrustes, Gelfand et al. 1996; Genewise, Birney and Durbin)
• Comparison of expressed DNA (ESTs, cDNA, mRNA) and genomic sequences (EST_GENOME, Mott 1997; SIM4, Florea et al. 1998)
• Cross-species genomic sequence comparisons (ROSETTA, Batzoglou et al. 2000; CEM, Bafna and Huson 2000)

Ab initio gene finders
• Use information embedded in the genomic sequence to predict the exon model
  • polyadenylation signal (AATAAA)
  • differential codon usage in coding versus non-coding sections of the gene
  • upstream regulatory signals (TATA boxes) and local characteristics of the sequence (CpG islands)
  • splice recognition signals (e.g., GT-AG)
• Markov models are the predominant predictive method
• Caveats
  • not effective in detecting alternatively spliced forms, or interleaved or overlapping genes
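
One of the local sequence characteristics listed above, CpG islands, can be illustrated with a small calculation. The sketch below uses the commonly quoted observed/expected CpG ratio, (#CG x N) / (#C x #G), together with GC content; neither the formula nor the function names come from the slides, and the window size is a hypothetical choice.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the true dinucleotide count.
    return seq.count("CG") * len(seq) / (c * g)


def scan_windows(seq: str, window: int = 200, step: int = 50):
    """Yield (start, GC content, CpG o/e) for sliding windows (sizes are assumptions)."""
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        yield start, gc_content(chunk), cpg_observed_expected(chunk)
```

A window with high GC content and a high observed/expected ratio would be a candidate CpG island; the exact thresholds are a matter of convention and are left to the caller.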

The GenScan method
• High-level organization
  • each of the basic functional units of a gene is associated with a state in the HMM
• Lower-level organization
  • separate sequence prediction module for each of the higher-level elements
    • exons (marginal, internal, phase-specific): inhomogeneous 3-periodic fifth-order Markov model
    • introns and intergenic regions: homogeneous fifth-order Markov model
    • 5’ and 3’ UTRs: homogeneous fifth-order Markov model
    • polyadenylation signal
    • donor and acceptor splice sites: weight array model (WAM) and Maximal Dependence Decomposition (MDD), i.e., a decision tree-based weighted position matrix
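
The difference between a homogeneous and a 3-periodic fifth-order Markov model can be sketched as follows. This is not GenScan's code: the function names, the pseudocount smoothing, and the assumption that coding training sequences start in frame 0 are all illustrative choices. With `period=1` a single transition table is used; with `period=3` a separate table is kept for each codon position.

```python
import math
from collections import defaultdict


def train(seqs, order=5, period=1, pseudocount=1.0):
    """Count (position mod period, preceding `order` bases) -> next base."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[(i % period, s[i - order:i])][s[i]] += 1.0
    return {"counts": counts, "order": order,
            "period": period, "pseudocount": pseudocount}


def log_likelihood(model, seq):
    """Sum of log transition probabilities, smoothed with the pseudocount."""
    seq = seq.upper()
    order, period = model["order"], model["period"]
    pc, counts = model["pseudocount"], model["counts"]
    total = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get((i % period, seq[i - order:i]), {})
        denom = sum(ctx.values()) + 4 * pc  # four-letter DNA alphabet
        total += math.log((ctx.get(seq[i], 0.0) + pc) / denom)
    return total


# Toy training data (made up): a "coding" model and a "background" model.
coding = train(["ATGGCC" * 20], period=3)
background = train(["ATATAT" * 20], period=1)
query = "ATGGCC" * 5
log_odds = log_likelihood(coding, query) - log_likelihood(background, query)
```

A positive `log_odds` means the query looks more like the coding training data than the background; a gene finder would combine such scores with the signal models inside the HMM.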

GenScan’s HMM for sequence generation
[Figure: state diagram of the GenScan HMM, from Burge and Karlin (1997), “Prediction of complete gene structures in human genomic DNA”, JMB 268, p. 86. The intergenic state N joins mirrored forward (+) and reverse (-) strand halves. States on each strand: P (promoter), F (5’ UTR), Einit (initial exon), E0, E1, E2 (internal exons by phase), I0, I1, I2 (introns by phase), Eterm (terminal exon), Esngl (single-exon gene), T (3’ UTR), A (polyA signal).]

Protein-genomic sequence comparisons
• Use sequence similarity between the protein and the protein-coding regions of the genomic sequence for gene model prediction
• Algorithmic techniques
  • dynamic programming-based sequence alignment algorithms
  • specialized recognition modules for splice junction prediction
  • profile HMMs
• Examples
  • Procrustes (Gelfand et al. 1996)
    • combinatorial pairing of putative splice junctions to form introns
    • uses protein-genomic sequence similarity to validate the correct pairings
  • Genewise (Birney and Durbin)
    • HMM-based sequence profiles
    • uses similarity between the query protein and a database of protein families organized in profiles (Pfam)
• Caveats
  • prediction limited to coding regions (excluding 5’ and 3’ UTRs)
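
The "combinatorial pairing of putative splice junctions" idea can be sketched in a few lines. This is a toy illustration, not the Procrustes algorithm: it pairs every GT with every downstream AG, and it validates a pairing against a target nucleotide sequence with `difflib` as a stand-in for the protein-genomic similarity score. All names, the minimum intron length, and the example sequences are assumptions.

```python
from difflib import SequenceMatcher


def candidate_introns(genomic, min_len=4):
    """All (start, end) half-open spans that begin with GT and end with AG."""
    donors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


def best_single_intron(genomic, target):
    """Pick the candidate intron whose removal best matches the target sequence."""
    def score(intron):
        start, end = intron
        spliced = genomic[:start] + genomic[end:]
        return SequenceMatcher(None, spliced, target).ratio()
    return max(candidate_introns(genomic), key=score)


# Made-up example: two exons separated by a GT...AG intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
target = "ATGGCCGAATAA"
intron = best_single_intron(genomic, target)
```

The number of candidate pairings grows quickly with sequence length, which is why the similarity to a known protein is needed to select the correct ones.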

cDNA-genomic sequence comparisons
• Use similarities between the cDNA (ESTs, mRNAs) and the genomic sequences to predict the gene model
• Algorithmic techniques
  • dynamic programming-based sequence alignment algorithms
  • specialized module for splice junction detection (pattern matching techniques, or statistical modeling)
• Examples
  • EST_GENOME (Mott 1997)
    • dynamic programming alignment with an affine scoring scheme
    • uniform scoring for large indels (introns)
  • SIM4 (Florea et al. 1998)
    • incremental exon detection and refinement with ‘blast’-like and greedy sequence comparison techniques
    • pattern matching prediction of splice junctions
• Caveats
  • accuracy depends on the quality of the data source (e.g., cannot detect genomic contamination by unspliced introns, or spurious priming)

Cross-species genomic sequence comparison
• Use the sequence similarity and the ordering of homologous regions between genomic sequences from related organisms to infer their common gene model
• Algorithmic techniques
  • dynamic programming-based sequence comparison algorithms
  • statistical modeling of the splice junctions and other common transcriptional elements
• Examples
  • ROSETTA (Batzoglou et al. 2000), CEM (Conserved Exon Method; Bafna and Huson 2000)
    • progressive sequence alignment between the various categories of orthologous regions (based on the expected sequence similarity)
    • statistical methods for splice signal recognition (?)
• Caveats
  • accuracy depends on the specificity of sequence similarity and the presence of delimiting transcriptional signals at that locus (similarity may extend past the gene boundaries)

Automatic gene annotation with Otto

Components of the automatic gene annotation
• Bn - blastn (dbEST, CHGI, CMGI, RefSeq)
• S4 - SIM4 (dbEST, CHGI, CMGI, RefSeq)
• Genewise (nr)
• GenScan
• FGenesH
• repeat - RepeatMasker
• etc.
• Otto - automatic gene predictions by Otto
• Promoted - curated transcripts

Using large EST data sets for gene prediction
[Figure: EST (exon) matches along the genomic axis, the EST exon models, and the clustered exon models, each shown separately for the forward (+) and reverse (-) strands.]

Using large EST data sets for gene prediction
• Each EST may span one or more of a gene’s exons
• Overlapping ESTs and mRNAs on the genome can be used to infer gene models
• Large data sets must be used for completeness
  • dbEST (~3.7 million ESTs)
  • UniGene (~90,000 ESTs and mRNA transcripts, grouped by similarity)
  • proprietary data sets (LifeSeq, CHGI)
• Analyzing such large data sets is time- and resource-consuming
• Strategy for EST data mining
  • determine the occurrences of a large set of cDNA sequences in a target genome (mapping)
  • group the overlapping EST matches on the genome to infer the underlying gene model (clustering)
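
The clustering step of this strategy can be sketched as a sweep over genomic intervals. The record layout `(est_id, strand, start, end)` with half-open coordinates and the function name are assumptions made for illustration; a real pipeline would cluster at the exon level rather than on whole-match spans.

```python
from collections import defaultdict


def cluster_matches(matches):
    """Group EST matches that overlap on the same strand of the genomic axis.

    matches: list of (est_id, strand, start, end), half-open coordinates.
    Returns a list of clusters with the merged span and the member ESTs.
    """
    by_strand = defaultdict(list)
    for match in matches:
        by_strand[match[1]].append(match)

    clusters = []
    for strand, group in by_strand.items():
        current = None
        for est_id, _, start, end in sorted(group, key=lambda m: m[2]):
            if current is not None and start < current["end"]:
                # Overlaps the open cluster: extend it.
                current["end"] = max(current["end"], end)
                current["ests"].append(est_id)
            else:
                current = {"strand": strand, "start": start,
                           "end": end, "ests": [est_id]}
                clusters.append(current)
    return clusters


# Made-up matches: two overlapping ESTs, one isolated EST, one on the other strand.
example = [("e1", "+", 100, 500), ("e2", "+", 400, 900),
           ("e3", "+", 2000, 2300), ("e4", "-", 450, 800)]
clusters = cluster_matches(example)
```

Keeping the strands separate matters because, as discussed later, genes on opposite strands can overlap or interleave.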

Mapping ESTs to a target genome
• Mapping: determine, for a given EST, the exact genomic location(s) and exon model(s), i.e.,
  • exon coordinates in the genomic sequence
  • genomic match strand (forward, or reverse complement)
  • percent sequence identity values (at the exon and EST levels)
  • spliced EST-genomic sequence alignment
• Validation: criteria for validating putative EST occurrences on the genome
  • EST coverage
  • similarity between the EST and genomic sequences
  • e.g., >80% of the EST must match the genome, at >90% sequence identity
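
The example validation rule (more than 80% of the EST matched, at more than 90% identity) can be written as a small filter. The exon record `(est_start, est_end, identical_bases)` in half-open EST coordinates is a hypothetical layout, not the output format of any particular aligner.

```python
def validate_match(est_length, exons, min_coverage=0.80, min_identity=0.90):
    """Return True if a spliced EST-genomic match passes both thresholds.

    exons: list of (est_start, est_end, identical_bases) per aligned exon,
           with half-open coordinates on the EST.
    """
    aligned = sum(end - start for start, end, _ in exons)
    if est_length == 0 or aligned == 0:
        return False
    identical = sum(n for _, _, n in exons)
    coverage = aligned / est_length   # fraction of the EST that is aligned
    identity = identical / aligned    # EST-level percent identity
    return coverage > min_coverage and identity > min_identity


# Made-up examples for a 500 bp EST.
good = validate_match(500, [(0, 200, 195), (200, 450, 240)])   # 90% covered, ~97% identity
partial = validate_match(500, [(0, 300, 290)])                 # only 60% covered
```

Coverage is computed over the whole EST here; a variant might first exclude polyA tails or low-quality ends from `est_length`.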

Technical challenges
• cDNA
  • Sequencing errors and polymorphisms
  • Interspecies contamination
  • Low quality EST data
• Gene model
  • Multiple gene homologues
  • Alternative splicing
  • Interleaving and overlapping of genes
• Genomic sequence
  • Repetitive elements
  • Genomic contamination
  • Genomic sequence representation
• Large data size
  • ~3 billion bp in the human genome
  • ~2.8 billion bp in dbEST

Source: primary cDNA data
• Sequencing errors and polymorphisms (e.g., SNPs): substitutions and indels
• Vector contamination
• Low quality of EST data, e.g., ....NNNATNACNACAGNNTAANC...
• Interspecies contamination
• PolyA tails, e.g., AAAAAATAAAAAAAAAAAA...
[Figure: ESTs from a cDNA library compared with the genome, illustrating each error source: a substitution and an indel, vector sequence attached to an EST, a low-quality read with ambiguous (N) bases, interspecies contamination, and polyA tails.]

Source: underlying gene model
• Multiple gene homologues
  • generate multiple EST matches
  • need to distinguish the true match based on sequence similarity
  • complicated by sequencing errors in cDNA data
[Figure: an EST matching the ortholog (true match) as well as Paralogs 1, 2 and 3.]

Source: underlying gene model
• Alternative splicing
  • a single gene gives rise to more than one mRNA sequence and protein product
  • may occur as a result of tissue specificity, or to activate different regulatory pathways
  • cannot be identified by ab initio methods
[Figure: a genomic sequence with exons 1, 2 and 3 separated by GT...AG introns, and two mRNA transcripts (mRNA transcript 1, mRNA transcript 2) built from different combinations of these exons.]

Source: underlying gene model
• Interleaving and overlapping of genes
  • genes located in the introns of another gene
  • overlapping exons from different genes
  • difficult to detect with ab initio methods
[Figure: Gene 1 and Gene 2 interleaved on the genomic sequence.]

Source: genomic sequence
• Repetitive elements
  • classes:
    • LINEs (Long Interspersed Nuclear Elements), 7,000 bp
    • SINEs (Short Interspersed Nuclear Elements), 300 bp, e.g., Alu
    • low complexity regions, e.g., ACACACACACACACAC
    • tandem repeats, e.g., CAGCAGCAGCAG
  • occur in large numbers in the genome
  • considerably increase the size of the computation
[Figure: repeated copies of CGGATAGACATAAC and of CAGCAGCAGCAGCA along the genomic sequence.]

Source: genomic sequence
• Genomic contamination
  • unspliced introns (A)
  • internal priming (B)
  • these artifacts can only be resolved by clustering the ESTs on the genomic axis, or in conjunction with other prediction methods
[Figure: (A) an EST containing an unspliced intron, aligned to the genome; (B) an EST primed at a false (non-genic) primer site, AATATAAA, on the genome.]

Source: genomic sequence
• Genomic sequence representation
  • ideal view: one sequence per chromosome
  • public sequences: BACs, contigs, ordered and oriented to approximate full chromosomes
    • possible mis-ordering and mis-orienting
    • incomplete genomic sequence
[Figure: a chromosome sequence (Chr) and the BACs (~150 kb) that cover it, with a gap.]

Source: genomic sequence
• Celera genome assembly
  • generated using the Whole Genome Shotgun (WGS) method and a compartmentalized sequence assembler
  • sequence = partially ordered and oriented collection of scaffolds
  • scaffolds = ordered and oriented collection of contigs
  • known mean and distribution of gap lengths
[Figure: fragments, BACs (finished or unordered collections of contigs), shared fragments, contig ordering and orienting with mate-pairs, scaffolds, and gaps (shown as N in the sequence, e.g., ...GCTTACGGNNNCATTCGAGC...).]

Source: genomic sequence
[Figure: layers of the genomic sequence representation: fragments, BACtigs, contigs, scaffold, genomic sequence.]

Strategies for large scale EST mapping
• Direct mapping with an exact cDNA-genomic sequence alignment method (SIM4, EST_GENOME)
  • divide the genome into n overlapping fragments
  • align the EST against each of the genomic fragments
• Time required
  • SIM4: 0.3 s per EST per Mb (1 EST vs. genome in 15 minutes)
  • EST_GENOME: even slower
• Too expensive!
[Figure: an EST aligned in turn against overlapping 1 Mb fragments of the genome.]
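
The "too expensive" conclusion follows from simple arithmetic on the figures quoted in these slides (0.3 s per EST per Mb, a genome of about 3 billion bp, about 3.7 million ESTs in dbEST). The sketch below assumes a single CPU and ignores fragment overlap; the constant and function names are illustrative.

```python
SIM4_SECONDS_PER_EST_PER_MB = 0.3   # rate quoted on the slide
GENOME_MB = 3_000                   # ~3 billion bp
DBEST_SIZE = 3_700_000              # ~3.7 million ESTs


def direct_mapping_cost(n_ests, genome_mb=GENOME_MB,
                        rate=SIM4_SECONDS_PER_EST_PER_MB):
    """Return (seconds per EST, total seconds) for direct mapping."""
    per_est_seconds = rate * genome_mb
    return per_est_seconds, per_est_seconds * n_ests


per_est, total = direct_mapping_cost(DBEST_SIZE)
print(f"{per_est / 60:.0f} minutes per EST")
print(f"{total / (3600 * 24 * 365):.0f} CPU-years for all of dbEST")
```

At 900 s per EST, mapping all of dbEST this way would take on the order of 100 CPU-years, which motivates the faster strategies that follow.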

Strategies for large scale EST mapping
• Mapping of ESTs to the genome via the (predicted) mRNA transcripts
  • map each of the ESTs onto the set of (predicted) mRNA transcripts, or genes with known genomic locations
  • align the EST against the genomic fragment containing the gene for the EST with an exact alignment method
• Faster than direct mapping
• Can be used to improve existing gene models, but not to discover new ones
[Figure: ESTs 1 to 4 placed on an mRNA transcript (5’ to 3’), which corresponds on the genome to Exon 1 (5’ UTR), Exons 2 to 4 (coding) and Exon 5 (3’ UTR).]

Strategies for large scale EST mapping
• Two-stage mapping of ESTs to the genome
  • Stage 1: detect potential EST matches on the genome with a fast similarity search program (signal finding)
    • blastn, MUMmer, tfastx
  • Stage 2: align the EST against the bounded genomic region containing the signal with an exact alignment method (polishing)
    • SIM4, EST_GENOME
[Figure: (1) an EST signal located on the genome; (2) the EST aligned within the bounded genomic regions containing the EST signal.]
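
The signal-finding stage can be illustrated with an exact k-mer index in place of blastn or MUMmer. This is a simplified stand-in under stated assumptions: forward strand only, exact k-mer hits, and hypothetical parameters `k`, `flank`, `max_gap` and `min_hits`. The bounded regions it returns are what the polishing stage would hand to an exact aligner.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_signal_regions(est, index, k, genome_length,
                        flank=50, max_gap=1000, min_hits=2):
    """Group k-mer hits into bounded genomic regions (start, end), half-open."""
    hits = sorted(pos for j in range(len(est) - k + 1)
                  for pos in index.get(est[j:j + k], ()))
    regions, group = [], []

    def close(group):
        if len(group) >= min_hits:
            regions.append((max(0, group[0] - flank),
                            min(genome_length, group[-1] + k + flank)))

    for pos in hits:
        if group and pos - group[-1] > max_gap:
            close(group)   # hits too far apart: start a new region
            group = []
        group.append(pos)
    close(group)
    return regions


# Made-up genome: two exons separated by a 300 bp "intron" of T's.
genome = "T" * 200 + "ACGTGCATGCCGTA" + "T" * 300 + "GGATCCGATCGA" + "T" * 200
est = "ACGTGCATGCCGTA" + "GGATCCGATCGA"
index = build_kmer_index(genome, k=8)
regions = find_signal_regions(est, index, k=8, genome_length=len(genome))
```

Because `max_gap` is larger than the intron, hits from both exons fall into one region, so the exact aligner sees the whole spliced match at once.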

Repeat detection and resolution
• Repeats represent ~40% of the sequence of the human genome
• Some repeats can be found in the 3’ UTRs of genes
• Spurious priming can produce repetitive ESTs
• In tests using dbEST, 1% of the ESTs accounted for 99% of the EST signals found
• Resolution strategies
  • repeat-mask the genome prior to mapping using, e.g., RepeatMasker
  • repeat-mask the EST data prior to mapping
  • selectively mask only those ESTs with large numbers of occurrences, during mapping
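
The third resolution strategy, setting aside only the ESTs that produce very many signals, can be sketched as a counting filter. The `(est_id, genomic_position)` signal record and the threshold of 100 occurrences are hypothetical; the slides do not state the cutoff that was used.

```python
from collections import Counter


def split_repetitive_ests(signals, max_occurrences=100):
    """Separate signals of ESTs that hit the genome too many times.

    signals: list of (est_id, genomic_position) pairs from signal finding.
    Returns (kept_signals, repetitive_est_ids).
    """
    counts = Counter(est_id for est_id, _ in signals)
    repetitive = {est_id for est_id, n in counts.items() if n > max_occurrences}
    kept = [s for s in signals if s[0] not in repetitive]
    return kept, repetitive


# Made-up data: one EST with 500 signals, two ESTs with one signal each.
signals = [("rep", i) for i in range(500)] + [("e1", 10), ("e2", 20)]
kept, repetitive = split_repetitive_ests(signals)
```

The flagged ESTs can then be masked and re-run, rather than polishing hundreds of spurious regions for each of them.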

EST data mining
• Gene prediction by genomic EST clustering (previously discussed)
• Generation of gene indices by EST clustering and assembly
• 5’ and 3’ UTR reconstruction
• Detection of alternatively spliced gene variants

Gene indices
• Quality- and vector-trim the EST sequences
• Cluster the ESTs in groups based on sequence similarity
• Assemble the ESTs in each cluster using a multiple alignment program
• For each cluster, select a consensus sequence = EST assembly
• Each EST assembly is a potential mRNA transcript
• Detect potential splice variants by pairwise comparisons between highly similar EST assemblies

5’ and 3’ UTR reconstruction
• Map the ESTs on the genomic axis
• Cluster the EST matches along the genomic axis in the area surrounding the predicted transcripts, in a manner consistent with the GenBank annotation
• Determine putative 3’ mRNA transcript ends in the vicinity of the 3’-most EST-genomic alignments
• Use genomic information (e.g., polyadenylation signals, AATAAA) to validate the 3’ UTR ends
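
The last step, validating a putative 3’ end with a polyadenylation signal, can be sketched as a window search. This is a minimal illustration for the forward strand only; the 40 bp window and the function name are assumptions, chosen because the signal typically lies a few tens of bases upstream of the cleavage site.

```python
def find_polya_signal(genomic, transcript_end, window=40, signal="AATAAA"):
    """Return the position of the signal closest to the putative 3' end, or None.

    genomic: forward-strand genomic sequence.
    transcript_end: half-open end coordinate of the 3'-most EST alignment.
    """
    start = max(0, transcript_end - window)
    region = genomic[start:transcript_end].upper()
    offset = region.rfind(signal)   # rfind picks the occurrence nearest the end
    return None if offset == -1 else start + offset


# Made-up sequence with an AATAAA 26 bp upstream of the putative end.
genomic = "C" * 100 + "AATAAA" + "C" * 20 + "G" * 50
supported = find_polya_signal(genomic, transcript_end=126)
unsupported = find_polya_signal(genomic, transcript_end=60)
```

A transcript end with no signal in the window is not necessarily wrong, but it is a candidate for internal priming and deserves a closer look.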

Detection of alternative splices
• Using EST consensus information
  • cluster the ESTs to create gene indices
  • determine the consensus sequence for each cluster
  • compare highly similar consensus sequences to detect putative alternatively spliced exons (indel blocks)
• Using the EST-genomic sequence alignments
  • cluster the EST matches along the genomic axis to infer possible exon models
  • determine (internal) exons that are present in some, but not all, ESTs in the cluster (alternatively spliced)
  • collect EST evidence for alternatively spliced variants
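
The alignment-based approach can be sketched for the simplest case, a skipped exon: an exon is flagged when at least one EST in the cluster contains it and at least one other EST has an intron that spans it. The data layout (a dict of sorted, half-open exon coordinates per EST) and the use of exact coordinate matches to identify an exon are simplifying assumptions.

```python
def exon_skipping_evidence(est_exons):
    """Collect ESTs that include or skip each exon in a genomic cluster.

    est_exons: dict est_id -> sorted list of (start, end) exon coordinates.
    Returns {exon: {"include": set of est_ids, "skip": set of est_ids}}
    for exons with evidence of both inclusion and skipping.
    """
    all_exons = {exon for exons in est_exons.values() for exon in exons}
    evidence = {}
    for exon in sorted(all_exons):
        include = {est for est, exons in est_exons.items() if exon in exons}
        # An EST skips the exon if one of its introns fully contains it.
        skip = {est for est, exons in est_exons.items()
                if any(left[1] <= exon[0] and exon[1] <= right[0]
                       for left, right in zip(exons, exons[1:]))}
        if include and skip:
            evidence[exon] = {"include": include, "skip": skip}
    return evidence


# Made-up cluster: EST "b" splices directly from the first to the last exon.
cluster = {
    "a": [(100, 200), (300, 400), (500, 600)],
    "b": [(100, 200), (500, 600)],
    "c": [(300, 400), (500, 600)],
}
evidence = exon_skipping_evidence(cluster)
```

Real alignments rarely agree to the base, so a practical version would match exons by their splice boundaries with some tolerance and would also look for alternative donor and acceptor sites.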

References
• Lewin B. (2000) Genes VII, Oxford University Press Inc., New York, ISBN 0-19-879276-X.
• Burge C, and Karlin S. (1997) Prediction of complete gene structures in human genomic DNA, J Mol Biol. 268(1):78-94.
• Kulp D, Haussler D, Reese MG, and Eeckman FH. (1996) A generalized hidden Markov model for the recognition of human genes in DNA, Proc Int Conf Intell Syst Mol Biol. 4:134-42.
• Krogh A, Mian IS, and Haussler D. (1994) A hidden Markov model that finds genes in E. coli DNA, Nucleic Acids Res. 22(22):4768-78.
• Solovyev VV, Salamov AA, and Lawrence CB. (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames, Nucleic Acids Res. 22(24):5156-63.
• Salamov AA, and Solovyev VV. (2000) Ab initio gene finding in Drosophila genomic DNA, Genome Res. 10(4):516-22.
• Gelfand MS, Mironov AA, and Pevzner PA. (1996) Gene recognition via spliced sequence alignment, Proc Natl Acad Sci USA 93(17):9061-6.
• Mott R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA, Comput Appl Biosci. 13(4):477-8.
• Florea L, Hartzell G, Zhang Z, Rubin GM, and Miller W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res. 8(9):967-74.
• Florea L, and Walenz B. (in preparation) ESTMapper: Massive EST Mapping.
• Batzoglou S, Pachter L, Mesirov JP, Berger B, and Lander ES. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction, Genome Res. 10(7):950-8.
• Bafna V, and Huson DH. (2000) The conserved exon method for gene finding, Proc Int Conf Intell Syst Mol Biol. 8:3-12.
• Quackenbush J, Liang F, Holt I, Pertea G, and Upton J. (2000) The TIGR gene indices: reconstruction and representation of expressed gene sequences, Nucleic Acids Res. 28(1):141-5.
• Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. (2001) The sequence of the human genome, Science 291(5507):1304-51.
• Gautheret D, Poirot O, Lopez F, Audic S, and Claverie JM. (1998) Alternate polyadenylation in human mRNAs: a large-scale analysis by EST clustering, Genome Res. 8(5):524-30.
• Kan Z, Rouchka EC, Gish WR, and States DJ. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs, Genome Res. 11(5):889-900.
• Kan Z, Gish W, Rouchka E, Glasscock J, and States D. (2000) UTR reconstruction and analysis using genomically aligned EST sequences, Proc Int Conf Intell Syst Mol Biol. 8:218-27.
• Ji H, Zhou Q, Wen F, Xia H, Lu X, and Li Y. (2001) AsMamDB: an alternative splice database of mammals, Nucleic Acids Res. 29(1):260-3.
