Hitchhiker’s Guide to PASA: An RFC on Integrating an Alien Program to Brent Lab • Bob Zimmermann • 11-07-06
Again, What Is PASA? • Program to Assemble Spliced Alignments • “PASA” == Core algorithm • Given genomic coords, what is a likely candidate for a full length transcript? • “PASA Pipeline” == Heuristic Suite • Mish mash of programs; the topic of today’s presentation
PASA’s Original Goals • Improvement of annotations thru ESTs • How can ESTs imply additional transcripts? • How do we define likely updates? • How do we align? • Etc., etc. • A lot of this work is usually hand-done • But it doesn’t all need to be • Greatly reduces overhead
Preliminaries • PASA Pipeline uses MySQL: • Annotations are loaded via adapters • All ESTs + alignments are stored in the db • Storage problem? • Web portal displays results via db • Annotations are compared in the db • Atom is a set of assembled EST alignments • Takes the form of a database
To Illustrate (user) conf, ESTs, genome annotation updates… (PASA Pipeline) assemblies (a PASA db)
So What Happens? • Three Major Phases: • Alignment • Assembly • Update
Alignment • Typically, BLAT is the first pass. • GMAP is an alternative • Should these not pass validation… • Reasonable intron length • Percent ID • Single Exon • Sim4 is run • Bail on the EST otherwise
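The fallback logic on this slide can be sketched roughly as follows. This is a minimal illustration, not PASA's actual code: the `Alignment` class, the function names, and the threshold values (100,000 bp maximum intron, 95% identity) are all assumptions made up for the example, and the aligners are passed in as plain callables rather than real BLAT or sim4 invocations. The slide lists "Single Exon" among the checks without saying how it is applied, so it is left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Alignment:
    """Hypothetical container for one EST-to-genome alignment."""
    exons: List[Tuple[int, int]]  # genomic (start, end), 1-based inclusive, sorted
    percent_id: float


def max_intron_length(aln: Alignment) -> int:
    """Length of the longest gap between consecutive exons (0 if single exon)."""
    return max(
        (nxt[0] - cur[1] - 1 for cur, nxt in zip(aln.exons, aln.exons[1:])),
        default=0,
    )


def passes_validation(
    aln: Alignment,
    max_intron: int = 100_000,     # assumed threshold, not from the slides
    min_percent_id: float = 95.0,  # assumed threshold, not from the slides
) -> bool:
    """Apply the intron-length and percent-ID checks named on the slide."""
    return aln.percent_id >= min_percent_id and max_intron_length(aln) <= max_intron


Aligner = Callable[[str], Optional[Alignment]]


def align_with_fallback(
    est_id: str,
    first_pass: Aligner,
    fallback: Aligner,
    validate: Callable[[Alignment], bool] = passes_validation,
) -> Optional[Alignment]:
    """Try the first-pass aligner, then the fallback; return None to bail on the EST."""
    for aligner in (first_pass, fallback):
        aln = aligner(est_id)
        if aln is not None and validate(aln):
            return aln
    return None
```

In this sketch `first_pass` would wrap BLAT (or GMAP) and `fallback` would wrap sim4; returning `None` corresponds to bailing on the EST.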
Assembly • “Maximal” assemblies • Most consistent alignments…RTFP • Subject to more validation • FL-cDNAs are considered putative novel • ESTs are possible extensions • Alignment ORFs are guessed • Longest ORF--should we think about this?
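To make the "longest ORF" guess concrete, here is a minimal sketch of one common way to do it: scan the three forward reading frames for the longest ATG-to-stop stretch. It is an assumption that PASA does something along these lines; the function name and return shape are invented for illustration, and the reverse strand and ORFs without a stop codon are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end, orf) for the longest forward-strand ORF.

    start is 0-based, end is exclusive, and the ORF runs from an ATG
    through the first in-frame stop codon. Returns (0, 0, "") if none.
    """
    seq = seq.upper()
    best = (0, 0, "")
    for frame in range(3):
        start = None  # position of the currently open ATG in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best[2]):
                    best = (start, i + 3, seq[start:i + 3])
                start = None  # close this ORF and look for the next ATG
    return best
```

The open question on the slide still applies: picking the longest ORF is a heuristic, and a shorter ORF may be the real coding region.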
Comparison • User supplies an annotation set • Pipeline marks “good” updates • Percent overlap • Percent ID (non-FL-cDNA assemblies) • Min ORF size • Max UTRs • All tweakable • SO: Better predictions, more annotations! • Another chance for us to rule the school
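A rough sketch of how tweakable thresholds like these might gate an update. Everything here is hypothetical: the slides do not define how percent overlap is computed, so this example assumes it is the intersection of two genomic spans as a percentage of the shorter span, and the default thresholds are placeholders. "Max UTRs" is omitted because the slide does not say what is counted.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # genomic (start, end), 1-based inclusive


@dataclass
class Thresholds:
    """Placeholder defaults; the real pipeline's values are not given in the slides."""
    min_percent_overlap: float = 50.0
    min_percent_id: float = 95.0
    min_orf_len: int = 300


def span_overlap_percent(a: Span, b: Span) -> float:
    """Intersection of two spans as a percentage of the shorter span."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return 100.0 * inter / shorter


def is_good_update(
    assembly: Span,
    annotation: Span,
    percent_id: float,
    orf_len: int,
    t: Thresholds = Thresholds(),
) -> bool:
    """True if the assembly clears every threshold against the annotation."""
    return (
        span_overlap_percent(assembly, annotation) >= t.min_percent_overlap
        and percent_id >= t.min_percent_id
        and orf_len >= t.min_orf_len
    )
```

Keeping the thresholds in one object mirrors the "all tweakable" point: a caller can pass a different `Thresholds` instance without touching the comparison logic.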
What do we want with this? • I’m working on it: • Use to augment our predictions • Long pipeline: ESTs, FL-cDNAs -> iPE -> N-SCAN -> PASA • Use to generate EST sequences • Different fork in the pipeline: Alignment -> Assembly -> ESTSEQ -> N-SCAN • An awesome alignment tool • Can incorporate Pairagon, etc. • Ideas?
Caveats! • PASA’s algorithm is cubic in # of ESTs • Awful for human • Brian wrote a faster algorithm • Still running on human (3-4 days?) • Who trusts ESTs anyway? • seqclean tool can get rid of some junk • Any number of criteria can be added • Maybe N-SCAN tips the scales back?
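As a rough worked illustration of what cubic scaling means in practice, assume the leading term dominates, with $n$ the number of ESTs and $c$ an unknown constant (neither symbol comes from the slides):

$$T(n) \approx c\,n^{3} \quad\Rightarrow\quad \frac{T(2n)}{T(n)} \approx 2^{3} = 8, \qquad \frac{T(10n)}{T(n)} \approx 10^{3} = 1000$$

Under that assumption, doubling the EST set costs about eight times the assembly time, and a tenfold larger set costs about a thousand times as much, which is why a human-sized EST collection is painful.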
More Caveats • Missing pieces: • Alignment to estseq • Update GTFs • Use Pairagon • Brian is not versioning well • But he might make me a developer (good)