
Using EST Evidence to Automatically Predict Alternatively Spliced Genes

Bob Zimmermann. Master’s Thesis Defense, 12-12-2006.


Presentation Transcript


  1. Using EST Evidence to Automatically Predict Alternatively Spliced Genes Bob Zimmermann Master’s Thesis Defense 12-12-2006

  2. What is Genome Annotation*? • Assert the locations of protein-coding genes in genomic DNA • Very small part of the genome: 1-2% in Hs • Assert what kinds of proteins are encoded • Give a “profile” of a genome’s products

  3. How are Proteins Produced?

  4. How is RNA Processed?

  5. What is Alternative Splicing?

  6. “Gene” vs. “Transcript” • Gene - region of genomic transcription (locus) • Transcript - specific splice of a gene • Genes have one or more “isoforms” • These can lead to different protein products • Greatly influences genome diversity • 50-70% of Hs genes are alternatively spliced • Stated goal: find all transcripts of all genes

  7. Way 1: Sequencing and Alignment Pros: • Reliable • Inexpensive Cons: • Bias toward overexpressed genes • Bias toward short genes [Figure labels: RNA; cDNA (>gi|7732|emb|X54360.1|DMCID AGTCCACTCGTAAGAAACATAGGAATAAGACGCAGCATTC . . .); genome]

  8. Way 2: Gene Prediction Pros: • No intrinsic bias • Cheap (like computers) Cons: • Won’t predict alt splices • Not extraordinarily sensitive

  9. Way 3: Hand Annotation Pros: • Most reliable, most trusted • Most likely to pinpoint subtlety Cons: • Expensive • Won’t scale with the rate of sequencing

  10. Can we reduce human labor? • Sequencing fl-cDNAs for all transcripts is not cost-effective • N-SCAN gets ~24k genes (most loci) in Hs • But there are at least 85k transcripts • Possible solution: ESTs, which are short, partial sequences of transcripts • Inexpensive • Lots of data, fast • Scales well with the rate of sequencing

  11. EST Data is Messy and Redundant

  12. Simplify: N-SCAN_EST (Wei et al. 2006) NNNNNEEEEEIEEEEEEIIIIIIIIEEEEEIIIEEEEEENNNN
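
The label string on this slide can be read as one character per genomic base summarizing the EST evidence there. The following is a minimal sketch of how such a string might be built, assuming N means no EST evidence, E means a base covered by an aligned EST block, and I means a base in a gap between blocks of the same alignment. The function name `build_estseq`, the half-open coordinate convention, and the "first label wins" rule for conflicting alignments are illustrative assumptions, not the N-SCAN_EST implementation.

```python
def build_estseq(region_length, alignments):
    """Summarize EST alignments as one label per base: N, E or I.

    alignments: list of alignments, each a list of (start, end) blocks
    in half-open genomic coordinates (hypothetical input format).
    """
    labels = ["N"] * region_length
    for blocks in alignments:
        blocks = sorted(blocks)
        for i, (start, end) in enumerate(blocks):
            # Bases covered by an aligned block are exon evidence.
            for pos in range(start, end):
                if labels[pos] == "N":  # assumption: first label wins
                    labels[pos] = "E"
            # The gap before the next block of the same EST is intron evidence.
            if i + 1 < len(blocks):
                next_start = blocks[i + 1][0]
                for pos in range(end, next_start):
                    if labels[pos] == "N":
                        labels[pos] = "I"
    return "".join(labels)


if __name__ == "__main__":
    # One spliced EST with blocks [2, 5) and [8, 10) on a 12-base region.
    print(build_estseq(12, [[(2, 5), (8, 10)]]))
```

A real system needs an explicit policy for bases where ESTs disagree; the rule used here only keeps the sketch short.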

  13. Better: Reduce Alignments • PASA (Haas et al. 2003) • How can we harness this?
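
To give a feel for what "reducing" redundant alignments can mean, here is a deliberately simplified sketch that collapses spliced alignments sharing the same intron chain into one representative spanning their combined extent. This is a stand-in technique chosen for brevity and is not PASA's assembly algorithm; the function names and the (start, end) block format are hypothetical.

```python
from collections import defaultdict


def intron_chain(blocks):
    """Return the introns implied by an alignment's sorted blocks."""
    blocks = sorted(blocks)
    return tuple((blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1))


def collapse_by_intron_chain(alignments):
    """Merge alignments with identical intron chains (simplified stand-in)."""
    groups = defaultdict(list)
    for blocks in alignments:
        groups[intron_chain(blocks)].append(sorted(blocks))

    collapsed = []
    for chain, members in groups.items():
        if not chain:
            # Unspliced alignments share an empty chain; keep distinct ones as-is.
            for blocks in members:
                if blocks not in collapsed:
                    collapsed.append(blocks)
            continue
        start = min(blocks[0][0] for blocks in members)
        end = max(blocks[-1][1] for blocks in members)
        # Rebuild exon blocks from the outer bounds and the shared introns.
        bounds = [start] + [coord for intron in chain for coord in intron] + [end]
        collapsed.append([(bounds[i], bounds[i + 1]) for i in range(0, len(bounds), 2)])
    return sorted(collapsed)


if __name__ == "__main__":
    ests = [[(0, 10), (20, 30)], [(5, 10), (20, 35)], [(0, 10), (25, 30)]]
    for transcript in collapse_by_intron_chain(ests):
        print(transcript)
```

In the usage example, the first two ESTs agree on the intron from 10 to 20 and are merged, while the third implies a different splice and stays separate as a candidate alternative isoform.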

  14. Method 1: MultiPass ESTSEQ • Run multiple times on the same target regions …in only 1870 hours!

  15. Method 1: MultiPass ESTSEQ

  16. Method 2: AltSplice ESTSEQ NEEEEIIIIOOOOIIIIIEEEE EEEMMIIIIEEEENNEEIIIIIIIIMMEEEE

  17. Method 2: AltSplice ESTSEQ

  18. Way 3: Annotation Update • As a post-processing step, use ESTs to find alternate isoforms and correct the N-SCAN_EST prediction (Haas et al. 2003)

  19. Fantastic • This is a very strong result, but: • A lot of this can come from fl-cDNAs alone • The amount of sequence we used is costly • 255k ESTs • 21k fl-cDNAs • So where do we fit in? • With no ESTs, PASA can’t predict anything • We can

  20. 10,000 ESTs: 0.49 Sn • 100,000 ESTs: 0.48 Sn
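
Sn on this slide is read here as sensitivity, the fraction of annotated features that the predictions recover: Sn = TP / (TP + FN). The slide does not say whether it was measured at the exon, transcript or gene level, so the sketch below assumes transcript-level exact matching purely for illustration. The function name and the tuple-of-exons transcript representation are hypothetical.

```python
def sensitivity(predicted, annotated):
    """Fraction of annotated transcripts that are predicted exactly.

    Each transcript is a tuple of (start, end) exon coordinates
    (an assumed representation for this sketch).
    """
    annotated = set(annotated)
    if not annotated:
        raise ValueError("sensitivity is undefined without annotated transcripts")
    true_positives = len(annotated & set(predicted))
    return true_positives / len(annotated)


if __name__ == "__main__":
    annotated = {((1, 5), (8, 10)), ((1, 5), (12, 15))}
    predicted = {((1, 5), (8, 10)), ((20, 25),)}
    print(sensitivity(predicted, annotated))
```

Here one of the two annotated transcripts is matched exactly, so the function returns 0.5; the unmatched prediction affects specificity, not sensitivity.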

  21. Conclusions • Multiple-pass prediction with different sequences predicts fancy altsplices • Heuristically reincorporating ESTs greatly improves annotations • With an order of magnitude fewer ESTs, this performs as well as N-SCAN_EST • Small sequencing projects can yield strong annotations

  22. Caveats and Future Work • We used all annotations in Dm to train • Related species perform well • The multipass method can be efficient • Instead of naïvely running on all ESTs, only conflicting regions could be rerun

  23. Acknowledgements • Research • Brent lab: • Sam - N-SCAN • Chaochun - N-SCAN_EST • Jeltje - ideas for MP_EST and PASA • Randy - endless parameter estimation conversations • Mani Arumugam, Beth Frazier, Aaron Tenney, Charles Comstock, Suman Kumar and Laura Langton • BDGP for vector sequences and nice ESTs • Committee: Gary Stormo, Jeremy Buhler • Love and support • Laura, my friends, my cats
