

From ESTs to partial genomes
The Environmental Genomics Thematic Programme Data Centre
Alasdair Anthony, Ralf Schmid, James Wasmuth, John Parkinson and Mark Blaxter
Nematode Genomics, Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT


Presentation Transcript


INTRODUCTION

• Expressed sequence tags (ESTs) offer a low cost approach to gene discovery
• For a range of non-model organisms, ESTs represent the only sequence information available
• Using these data to create 'partial genomes' means the data can be interpreted in a genomic context
• To facilitate the creation of partial genomes, we have created a suite of software tools, designed to form a complete EST pipeline
• The first tool in the pipeline, trace2dbest, processes raw chromatograms into high quality sequence objects
• These sequences are then used to build a partial genome, using the PartiGene tool
• The partial genome is held in an SQL database, which can be made accessible through the web
• A further software tool, prot4EST, provides robust translation of the error prone sequences

trace2dbest

• trace2dbest is an interactive utility for processing raw EST data

Raw chromatogram

Base calling (phred [1])
• The basecalling program phred is used to produce a quality scored sequence

  acatcgaatcgatacatgACGTAGCAGATCAGTACATGATACACGTCGTCGTCTGCATGCTTGCCACGTCCAGTTTGGCCATTAGTACGCCCGCTGACCTGACTCTGACCATTGACCACTGATGTCCATGATTccatgacatcttgatcgtgatcga

Trimming
• trace2dbest then performs a series of trimming steps
• cross_match is used to identify leading and trailing vector sequence
• Next, user defined leader and adapter sequences are trimmed
• poly(A) tails are identified based on user defined parameters and trimmed

dbEST EST file

  TYPE: EST
  STATUS: New
  CONT_NAME: Blaxter ML
  CITATION: Expressed Sequence Tags from the humus earthworm L. rubellus
  LIBRARY: Earthworm Lambda Zap Express Library
  EST#: Lr_adE_01H01_T3
  CLONE: Lr_adE_01H01
  SOURCE:
  PCR_F: T3
  PCR_B: T7PL
  PLATE: 01
  ROW: H
  COLUMN: 01
  SEQ_PRIMER: T3
  P_END: 5'
  HIQUAL_START: 1
  HIQUAL_STOP: 478
  DNA_TYPE: cDNA
  PUBLIC:
  PUT_ID: gb|AAA74396.1| cytochrome c oxidase subunit IV
  COMMENT: Sequencing was performed in Edinburgh
  SEQUENCE:
  CCAACACCGTCATGTCCGGAGACACGACCATGTTCCCAGGTATCGCCGATCGTATGCAGA
  AGGAGATCACGAGCATGGCTCCAAGCACGATGAAGATCAAGATCATCGCTCCACCCGAGC
  GCAAGTACTCCGTATGGATCGGTGGGTCCATCCTGGCTTCCCTGTCCACCTTCCAGCAGA
  TGTGGATCAGCAAGCAGGAGTACGACGAGTCCGGCCCATCCATCGTCCACAGGAAGTGCT
  TCTAAATGCACCGCCGACAACGAGTTACCAAGGGCGACAGAAAGAACCCGCTAACGCGAG
  CACACACACGCAAGCAAACACACAGCGTGCACGTACATACAACATCACACAACCCATCTC
  TATGACTCACACACCTTTTCAACCGAACTTTATCCAAATTACGCAAACCGAAGTTTCGAT
  TTTATTTCGTCCTTGTGGACACAAAAGTAATTTAAAAATCTCTGTACGCCTTAATTTGAG
  GCTATAGTTTGCTTTTGTAACTTAAGGCGATCACAGATTCTAGATGCAATCGTGACTTTA
  TATTTTACGATTTAT
  ||

PartiGene

• PartiGene represents the core of the partial genome creation process
• All ESTs from a particular species are clustered and assembled to form putative gene objects
• These genes can then be annotated and the information presented as a web based resource

1 Collate sequences
• Sequences downloaded from public database
[Figure labels: dbEST; cDNA library information]

2 Cluster
• Sequences clustered on the basis of similarity (BLAST) using CLOBB [2]

3 Assemble
• Clusters assembled to form contigs using phrap (Green, P. unpublished)

4 Partial genome
[Figure labels: Partial Genome Sequences; Gene A; Gene B; Gene C]

5 Annotation
• Translation (prot4EST)
• BLAST
• Under development: putative location, functional prediction, structure prediction, domain identification

6 Web front ends
• Nembase was created using php to submit queries to the PartiGene database
[Figure: Example PartiGene HTML results output]

prot4EST

• Poor sequence quality, identification of the coding region and frame-shifts make EST translation problematic
• prot4EST integrates current translation solutions: BLASTX, DECoder [3] and ESTScan [4]
• Fully compatible with PartiGene

[Figure: prot4EST flowchart. Box and arrow labels: BLASTN against RNA database; sequence similarity (E<e-65); RNA sequences; no match; BLASTX against mitochondrially encoded proteins; sequence similarity (E<e-8); BLASTX against SWISSProt; sequence similarity (E<e-8); Join and extend HSPs; Run ESTScan; Run DECoder; Parse results; length and quality filters; fails filters; Identify longest ORF from six frame translation; High quality sequence >= 30 residues long; Peptide prediction]

SUMMARY

• The PartiGene process has been used to create several species specific databases, including nembase (http://www.nematodes.org) and lumbribase (http://www.earthworms.org).
• The software is freely available under a GNU license at http://nema.cap.ed.ac.uk/PartiGene
• The software is under continued development: SimiTri (a tool allowing phylogenetic comparisons) is due to be integrated into the pipeline soon, and an additional module, annot8er, is also under development.

References:
1. Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175-194
2. Parkinson, J., Guiliano, D.B. & Blaxter, M. (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics. 3, 31
3. Fukunishi, Y. & Hayashizaki, Y. (2001) Amino-acid translation for cDNA with frame-shift error. Physiol. Genomics. 5, 81-87
4. Iseli, C., Jongeneel, C.V. & Bucher, P. (1999) ESTScan: A Program for detecting, evaluating and reconstructing potential coding regions in EST sequences. ISMB7, 138-158

Acknowledgments: the authors would like to thank Ann Hedley and the rest of the Environmental Genomics Data Centre team for their help. The project is funded by NERC.
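
The last step named in the prot4EST flowchart, identifying the longest ORF from a six frame translation, can be illustrated with a short Python sketch. This is not the prot4EST code. It rests on several assumptions: the function names are hypothetical, the standard genetic code is used (mitochondrial sequences would need a different table), an "ORF" is taken to be the longest stretch without a stop codon because ESTs are often partial and may lack a start codon, and the default minimum of 30 residues is borrowed from the ">= 30 residues long" label on the poster.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# nuclear transcripts; other tables are needed for e.g. mitochondria).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Yield (frame label, protein) for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            yield f"{strand}{offset + 1}", translate(strand_seq[offset:])


def longest_orf(seq, min_residues=30):
    """Return (frame, peptide) for the longest stop-free stretch, or None.

    Hypothetical helper: stretches shorter than min_residues are ignored,
    and ties are resolved in favour of the first frame examined.
    """
    best = None
    for frame, protein in six_frames(seq):
        for segment in protein.split("*"):
            if len(segment) < min_residues:
                continue
            if best is None or len(segment) > len(best[1]):
                best = (frame, segment)
    return best


if __name__ == "__main__":
    example = "CTAACTAGC" * 12  # synthetic sequence, not real EST data
    print(longest_orf(example))
```

Such a fallback is only a last resort: it ignores frame-shifts, which is why the poster places it after the similarity based and model based methods.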
