Trace recalling on MGC traces
Dec 21, 2005
Why?
• There are almost certainly alternatively spliced targets in the MGC set that we would like to find
• Trace recalling might yield more hits and confirmed hits, because it aligns the ambiguity sequence to the genome
How?
• Pipeline begins with the .blat files generated by Mike
• Each file is the result of BLATing an MGC trace (or the assembled fwd/rev reads) against the human genome
• The alignments represent the set of loci from which the trace sequence could have originated
How?
• Extract the BLAT-aligned sequence plus 1000 bp of flanking sequence from the human genome
• Run trace recalling between each trace and the corresponding extracted genomic loci
• Adjust the score of the first alignment (ambiguity sequence to genome) by adding back the intron penalties
• This lessens the bias toward processed pseudogenes, which are intronless and so incur no intron penalty
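The extraction window and the score adjustment could be sketched as below. This is a minimal illustration, not the pipeline's code: the function names, the 0-based half-open coordinates, and the fixed per-intron penalty are all assumptions (the actual aligner may charge introns differently).

```python
FLANK = 1000  # bp of flanking genomic sequence, as stated on the slide


def extraction_window(start, end, chrom_len, flank=FLANK):
    """Return the window [lo, hi) to extract, clamped to the chromosome.

    Coordinates are assumed to be 0-based and half-open.
    """
    lo = max(0, start - flank)
    hi = min(chrom_len, end + flank)
    return lo, hi


def adjusted_score(raw_score, n_introns, intron_penalty):
    """Add back the penalty charged for each intron in the first alignment.

    Assumes a hypothetical fixed penalty per intron.
    """
    return raw_score + n_introns * intron_penalty
```

With a hypothetical penalty of 40, a spliced locus with raw score 900 and 3 introns adjusts to 1020, which now beats an intronless pseudogene locus with raw score 950.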
How?
• Select the “correct locus” as the locus that aligns with the highest adjusted score
• For the rest of the analysis, this is the only locus considered
• Apply the hit criteria to the first alignment file for the correct locus:
  • Spliced alignment (at least 1 intron)
  • > 60% of splice sites have at least 8 matches in a 10 bp window around the splice site
  • Overall percent identity > 75%
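The locus selection and hit criteria above could be sketched as follows. The `LocusAlignment` record and its field names are hypothetical; only the thresholds (at least 1 intron, > 60% of splice sites with at least 8 matches in the 10 bp window, identity > 75%) come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocusAlignment:
    locus_id: str
    adjusted_score: float
    percent_identity: float         # 0-100
    splice_site_matches: List[int]  # matches in the 10 bp window at each splice site
    n_introns: int


def select_correct_locus(alignments):
    """Pick the locus with the highest adjusted score."""
    return max(alignments, key=lambda a: a.adjusted_score)


def is_hit(aln, min_window_matches=8, min_site_fraction=0.60, min_identity=75.0):
    """Apply the three hit criteria to the alignment at the correct locus."""
    if aln.n_introns < 1 or not aln.splice_site_matches:
        return False  # not a spliced alignment
    good = sum(1 for m in aln.splice_site_matches if m >= min_window_matches)
    if good / len(aln.splice_site_matches) <= min_site_fraction:
        return False  # too few well-supported splice sites
    return aln.percent_identity > min_identity
```

Both thresholds are strict inequalities here, matching the ">" on the slide.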
How?
• The last step classifies each trace as a hit or a non-hit
• Lift the alignment coordinates from the extracted genomic fragment back to genomic coordinates
• A hit becomes confirmed if it overlaps the targeted predicted gene by at least 1 bp
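The coordinate lift and the 1 bp overlap test could look roughly like this. It is a sketch under stated assumptions: 0-based half-open coordinates, a fragment extracted from the forward strand, and hypothetical function names.

```python
def lift_to_genome(frag_start, aln_start, aln_end):
    """Shift fragment-relative alignment coordinates to genomic coordinates.

    frag_start is the genomic position of the first base of the extracted fragment.
    """
    return frag_start + aln_start, frag_start + aln_end


def overlaps(a_start, a_end, b_start, b_end, min_overlap=1):
    """True if the two half-open intervals share at least min_overlap bases."""
    return min(a_end, b_end) - max(a_start, b_start) >= min_overlap


def classify(hit, aln_genomic, target_gene):
    """Return 'non-hit', 'hit', or 'confirmed hit' for one trace."""
    if not hit:
        return "non-hit"
    return "confirmed hit" if overlaps(*aln_genomic, *target_gene) else "hit"
```

For example, an alignment at 1000 to 2500 within a fragment starting at genomic position 4000 lifts to 5000 to 6500, and is confirmed against a predicted gene spanning 6499 to 7000 because the two share exactly one base.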
How?
• As part of trace recalling, each read/genomic fragment is flagged if an alternate splice is observed
• Compare the alignment of the ambiguity sequence with the alignment of the recalled sequence to determine whether there is an alternate splice
• Example
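One simple way to picture the comparison: reduce each alignment to the introns it implies and flag the pair when the two intron sets differ. This is only an illustrative assumption about how the flag could be computed, not the actual trace recalling logic. The helper names are hypothetical, and a real comparison would also need to restrict both alignments to the region they share.

```python
def introns_from_exons(exons):
    """Introns implied by a sorted list of (start, end) exon blocks."""
    return {(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)}


def has_alternate_splice(ambig_exons, recalled_exons):
    """Flag the pair if the two alignments imply different introns."""
    return introns_from_exons(ambig_exons) != introns_from_exons(recalled_exons)


# Ambiguity-sequence alignment includes a middle exon that the
# recalled-sequence alignment skips, so the pair would be flagged.
ambig = [(0, 100), (200, 300), (400, 500)]
recalled = [(0, 100), (400, 500)]
flagged = has_alternate_splice(ambig, recalled)
```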
How?
• The analysis splits at this point into:
  • Comparing the hit, confirmed hit, and non-hit status of reads with the original pipeline
  • Trying to find alternate splices in the whole set of traces
Results
• Comparison of hit, confirmed hit, and non-hit status*
  • 120 experiments went from non-hit to confirmed hit
  • 37 experiments went from non-confirmed hit to confirmed hit
* This part isn’t quite done, and some of the non-hit to confirmed hit cases look a little funny
Results
• Finding alternate splices
• Trace recalling identifies 622 alternate splicing events in the MGC set
  • Retained intron: 148
  • Alt 3’ ss: 40
  • Alt 5’ ss: 36
  • Alt splice both sides: 56
  • Alternate exon: 103
  • Clean alternate exon: 189
  • Mutex exon: 26
  • Clean mutex exon: 23
• The projector in Bryan 509 working: priceless
Results
• 288 of these (189 + 23 + 40 + 36) are what I consider the “hard” altsplices to get (clean alt exon, clean mutex exon, individual 3’ or 5’ splice sites)
• Wanted to validate these predictions somehow
• Would normally go back to the known gene, but if there were a known gene it wouldn’t be an MGC target!
Results
• Look at cases where the same type of altsplice is observed on both reads
• In a total of 72 experiments, the same altsplice is observed on both reads (high-confidence altsplices)
• Example
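The both-reads check can be sketched as a set intersection over the altsplice types flagged on each read. The string labels and the function name are hypothetical; the sketch matches on type only and ignores genomic position.

```python
def high_confidence_types(fwd_types, rev_types):
    """Altsplice types flagged on both the forward and the reverse read."""
    return sorted(set(fwd_types) & set(rev_types))


# Same type on both reads: high confidence.
agree = high_confidence_types(["clean alternate exon"], ["clean alternate exon"])

# Near-miss labels do not match, so this pair stays low confidence.
near_miss = high_confidence_types(["alternate exon"], ["clean alternate exon"])
```

Exact label matching is deliberately strict, so a pair flagged as an “alternate exon” on one read and a “clean alternate exon” on the other is not counted as validated.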
Results
• Breakdown of validated altsplices by type
  • Clean alternate exon: 39
  • Alternate exon: 6
  • Retained intron: 16
  • Alt 5’ ss: 3
  • Alt 3’ ss: 6
  • Clean mutex exon: 2
Results
• Flagged altsplices that were not validated (low-confidence altsplices) could be:
  • Mistakes
  • Reads didn’t overlap
  • Didn’t see both sides of an alternate splice
  • One good read and one read that totally failed
  • Might be slightly different types (e.g., a clean alternate exon and an alternate exon)
• Examples
A slight misalignment causes one read to be flagged as an “alternate exon” and the other to be flagged as a “clean alternate exon”… the black one is probably right
Recalled sequence picks up where you would expect if the single trace part were corrupted by noise
12 bp alternate splice site
Results
• Looked at 40 examples of low confidence hits
  • 26 of them looked like they fell into one of the last 3 categories from before
  • 14 looked like actual miscalled alternate splices
To do
• Modify trace recalling in a way that might clean up the alignments a bit more
• Define something like the hit criteria for MGC alignments that takes into account the number of matches in the trace recalling alignments (look at old E-value stuff)