SPIDA combines comparative sequence analysis with EST alignment to identify coding regions. The tool aims to improve UTR annotation, utilize EST resources effectively, and handle inaccurate data. It provides novel CDS annotation and eliminates false positives using substitution periodicity index. Discover the SPIDA methodology for accurate exon validation and reading frame determination.
SPIDA: Substitution Periodicity Index and Domain Analysis. Combining comparative sequence analysis with EST alignment to identify coding regions. Damian Keefe, Birney Group, EBI
SPIDA - Motivation • Improve UTR annotation • Make use of the ever expanding EST resource - good UTR source • Make use of the ever increasing number of comparative genomes • Cope with inaccurate and partial data sets • Complement existing Ensembl methodologies • Novelty - not probabilistic modeling
SPIDA: The Basic Idea • Use comparative data to determine whether ESTs are mapped to the correct place by looking for coding signal, i.e. a 3-periodic substitution pattern. • Provide CDS annotation: determine the translation frame from the mutation pattern. • Eliminate false positives and pseudogenes by requiring evidence from at least two species separated by ~50 MY of independent evolution. • Annotate further using Pfam (Verify?).
Substitution periodicity index
acgtacgtacgtacgtacgtacgtacgt
0120120120120120120120120120
aagtaggtacgaacgtccgttagcacgt
Total for frame: f0 = 1, f1 = 2, f2 = 5
S0 = 1/((2+5)/2) = 0.29
S1 = 2/((1+5)/2) = 0.67
S2 = 5/((1+2)/2) = 3.33
If denominator = 0, set denominator = 1.
SPI = max(S0, S1, S2)
If we know the ‘wobbly’ frame, we also know the translation frame.
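The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the SPIDA code: the function name `spi_from_counts` is hypothetical, and the per-frame counts (1, 2, 5) are taken as given from the slide.

```python
from typing import Sequence, Tuple


def spi_from_counts(counts: Sequence[int]) -> Tuple[float, int, Tuple[float, ...]]:
    """Return (SPI, wobble_frame, per-frame scores) from three substitution counts.

    Hypothetical helper: each frame's count is divided by the mean of the
    other two frames' counts, and SPI is the largest of the three ratios.
    """
    if len(counts) != 3:
        raise ValueError("expected one substitution count per frame")
    scores = []
    for frame in range(3):
        others = [counts[i] for i in range(3) if i != frame]
        denominator = sum(others) / 2  # mean of the other two frames
        if denominator == 0:
            denominator = 1  # rule stated on the slide
        scores.append(counts[frame] / denominator)
    spi = max(scores)
    return spi, scores.index(spi), tuple(scores)


spi, frame, scores = spi_from_counts([1, 2, 5])
print(round(spi, 2), frame, [round(s, 2) for s in scores])
# prints: 3.33 2 [0.29, 0.67, 3.33]
```

The frame with the highest score is the ‘wobbly’ one, which is what fixes the translation frame.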
SPI: Current Implementation
Multiple Pairwise SPI is calculated at whole-exon and 48 bp window resolutions on each exon. The ORF status of each frame in each window is also determined. Heuristics are then applied (v. preliminary!). A window passes the threshold at a minimum of 7 mutations and a minimum SPI of 7. (This is a quick way of requiring that the probability of this pattern of mutations occurring by chance is less than 1/100.) Exons or windows are grown by amino-acid sequence walking in the SPI frame to a stop codon in both directions. The resulting ORF is a CP3O: a Conserved with a Periodicity of 3 Open Reading Frame. The same CP3O must be generated by more than one species if it is to be accepted. Mouse and/or rat count as one species.
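The growing step can be pictured with a minimal sketch. It rests on several assumptions that are not stated on the slide: the function name `extend_to_stops` is hypothetical, the sequence is read on the forward strand only, the ‘wobbly’ frame is taken to be the third codon position (so codons start one base after it), and stop codons inside the starting window are not checked here.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def extend_to_stops(seq: str, start: int, end: int, wobble_frame: int) -> tuple:
    """Grow the half-open window [start, end) codon by codon to a stop on each side.

    Hypothetical sketch: returns (orf_start, orf_end), excluding the stop codons.
    """
    seq = seq.lower()
    phase = (wobble_frame + 1) % 3  # assumption: codons start just after the wobble base
    # Snap the window inwards onto codon boundaries in this frame.
    first = start + (phase - start) % 3
    last = end - (end - phase) % 3
    if last - first < 3:
        raise ValueError("window holds no complete codon in this frame")

    # Walk 5': halt just after the nearest upstream in-frame stop (or the sequence edge).
    orf_start = first
    while orf_start - 3 >= 0 and seq[orf_start - 3:orf_start] not in STOP_CODONS:
        orf_start -= 3
    # Walk 3': halt just before the nearest downstream in-frame stop (or the sequence edge).
    orf_end = last
    while orf_end + 3 <= len(seq) and seq[orf_end:orf_end + 3] not in STOP_CODONS:
        orf_end += 3
    return orf_start, orf_end


toy = "cc" + "taa" + "atg" + "gct" + "gca" + "gct" + "tga" + "gg"
print(extend_to_stops(toy, 8, 14, wobble_frame=1))  # prints: (5, 17)
```

In the toy sequence the window covers two codons and grows outwards to the flanking in-frame stops, giving the candidate ORF `atggctgcagct`.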
SPIDA for EGASP: Overview • Map dbEST to Genome - Exonerate • Filter then flatten ‘transcript’ structures, preserving all exon boundaries • SPI analysis of exons to give validation and reading frame. (TBA alignments) • Extend CDS (cp3o) from validated exons • Remove ‘transcripts’ with inconsistencies • Pfam_fs search of translated CDS • Reject CDS from single-exon ESTs with max Pfam-e > 1.0 • Report unique exons
Degree of Automation
Given:
• an Ensembl database containing EST mappings
• a database containing multispecies alignments
• the Ensembl computational infrastructure
• hmmpfam + Pfam_fs
• MySQL
The entire procedure is, and will continue to be, completely automatic.
Confessions & Cockups • The first time the script ran all the way through was April 15th • I didn’t have time to run it again or check the results against the design set. • EST selection was too conservative, so too few exons were found. • I wasn’t aware that the EST mapping by exonerate had placed a substantial proportion of the mappings on the wrong strand.
Analysis in the light of known shortcomings
1. The SPI calculation is strand-agnostic, so we ignore strand.
2. After the filtering procedure, the remaining ESTs overlapped 5900 of the 9313 unique Vega exons (including putatives and pseudogenes) by at least one base. So, at most, SPIDA might have found/confirmed 5900 exons given the input data.
3. A number of filters in the SPIDA procedure were designed to identify and remove exons which looked ‘wrong’. These filters will have removed a number of exons which were correctly identified as conserved and three-periodic, but were mapped to the wrong strand. It is not possible to compensate for these filters in the following analysis, but the Sn value is probably too low.
Analysis in the light of known shortcomings
Of the 5900 Vega exons covered by the filtered ESTs, SPIDA confirmed 5033 as having 3-periodicity. 5025 of these were either known-validated (5019) or putative (6). Eight were from pseudogenes. This suggests a minimum sensitivity in excess of 80% (5033/5900 ≈ 85%).
In total SPIDA confirmed 5037 exons, i.e. 4 false positives. However, the highly significant (very low) e-values observed in Pfam domain analysis of 3 of these exons suggest they may not be false. Either way, the specificity of SPIDA is in excess of 99%.
+----------+---------------------------------------+----------+
| tscpt_id | domain description                    | domain_e |
+----------+---------------------------------------+----------+
| 19859    | ATP:guanido phosphotransferase N-ter  | 7e-55    |
| 27325    | Elongation factor Tu GTP binding doma | 1.3e-55  |
| 28630    | Protein of unknown function (DUF431)  | 1.7      |
| 15070    | Fibronectin type III domain           | 7.5e-34  |
+----------+---------------------------------------+----------+
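The arithmetic behind these figures can be checked with a short sketch. It assumes that specificity is meant in the sense commonly used for gene-prediction evaluation, TP / (TP + FP); the slide does not state the definition, and the function names are hypothetical.

```python
def sensitivity(true_positives: int, actual_positives: int) -> float:
    """Fraction of the exons that could have been found which were found."""
    return true_positives / actual_positives


def specificity(true_positives: int, predicted_positives: int) -> float:
    """Assumed definition: TP / (TP + FP), i.e. the fraction of predictions that are correct."""
    return true_positives / predicted_positives


covered = 5900          # Vega exons overlapped by the filtered ESTs
confirmed_vega = 5033   # of those, confirmed by SPIDA as 3-periodic
confirmed_total = 5037  # all exons SPIDA confirmed

print(covered - confirmed_vega)                             # prints: 867 (missed exons)
print(confirmed_total - confirmed_vega)                     # prints: 4 (false positives)
print(f"{sensitivity(confirmed_vega, covered):.1%}")        # prints: 85.3%
print(f"{specificity(confirmed_vega, confirmed_total):.2%}")  # prints: 99.92%
```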
Why did SPIDA miss 867 exons?
• There were 343 pseudogene exons (137 processed pseudogenes). It is not supposed to find these.
• There were a similar number of Vega putative exons; wet lab analysis suggests 85% of these may not be real.
• Some were filtered out because they looked wrong (probably mapped to the wrong strand).
• There was no informative alignment.
• There was an alignment, but the mutation rate was too low to get SPI above threshold.
• There was an alignment, but the exon wasn't entirely ORF and no windows achieved the SPI threshold.
Issues for the Future • Mappings to the wrong strand • Transcripts with validated exons but inconsistent frames. • Partial-exon ESTs. • Joining ESTs • Better Multi-Species Alignments • Gene-finder or screening tool? • Re-engineer the software for speed and compute farm deployment.
Acknowledgements Ewan NIH-NHGRI EBI Ensembl Team Havana Team Sanger ISG Team Elliott Margulies and team. Ben Paton Guy Slater
SPI: Variations
• Pairwise: The basic idea, calculated from an aligned sequence pair.
• Multiple Pairwise: Heuristic evaluation of the results of Pairwise SPI calculated on several species for the same human ‘exon’.
• MultiSpecies: Use the substitutions from more than one aligned species to calculate SPI.
acgtacgtacgtacgtacgtacgtacgt
0120120120120120120120120120
aagtaggtacgaacgtccgttagcacgt
acgtaagtgcgtacgtacgttcgtacgt
Total for frame: f0 = 1, f1 = 2, f2 = 7
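The MultiSpecies variant can be sketched by pooling substitutions from every aligned species before scoring. The function names below are hypothetical, and skipping gapped columns is an assumption (the slide shows an ungapped alignment only).

```python
from typing import Iterable, List


def pooled_frame_counts(reference: str, others: Iterable[str]) -> List[int]:
    """Count substitutions per frame (position mod 3), pooled over all aligned species."""
    counts = [0, 0, 0]
    for other in others:
        if len(other) != len(reference):
            raise ValueError("aligned sequences must have equal length")
        for pos, (ref_base, base) in enumerate(zip(reference.lower(), other.lower())):
            if "-" in (ref_base, base):
                continue  # assumption: gapped columns carry no substitution signal
            if ref_base != base:
                counts[pos % 3] += 1
    return counts


def spi(counts: List[int]) -> float:
    """Largest ratio of one frame's count to the mean of the other two (0 -> 1)."""
    scores = []
    for frame in range(3):
        denominator = (sum(counts) - counts[frame]) / 2
        scores.append(counts[frame] / (denominator or 1))
    return max(scores)


reference = "acgtacgtacgtacgtacgtacgtacgt"
species = ["aagtaggtacgaacgtccgttagcacgt", "acgtaagtgcgtacgtacgttcgtacgt"]
counts = pooled_frame_counts(reference, species)
print(counts, round(spi(counts), 2))  # prints: [1, 2, 7] 4.67
```

Pooling raises the number of informative substitutions per exon, which is the point of using more than one species.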
SPI: Possible Modes of Use
• Whole Exon: Use Blast or Exonerate to map ESTs to the genome. SPI gives validation and translation frame. Only good for mid-rank, i.e. non-UTR, exons.
• Windowed Exon: As above, but SPI (and mutation rate) is calculated in a sliding window. Detect CDS starts and ends by extending above-threshold windows: 3’ to a stop, 5’ to a step up in mutation rate.
• Windowed Alignment: Scan raw comparative alignments. For each above-threshold window, extend 3’ and 5’ to a splice-site / poly-pyrimidine / branch signal / tataa signal.
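A windowed scan might look like the sketch below. The 48 bp window and the thresholds (at least 7 mutations, SPI of at least 7) are the values quoted for the current implementation; the 3 bp step, the function name `windowed_spi`, and the use of sequence coordinates (rather than window offsets) to assign frames are assumptions made for illustration.

```python
from typing import List, Tuple


def windowed_spi(ref: str, other: str, window: int = 48, step: int = 3,
                 min_subs: int = 7, min_spi: float = 7.0
                 ) -> List[Tuple[int, int, int, float]]:
    """Return (start, end, wobble_frame, spi) for each window passing both thresholds."""
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(0, len(ref) - window + 1, step):
        counts = [0, 0, 0]
        for pos in range(start, start + window):
            if ref[pos] != other[pos]:
                counts[pos % 3] += 1  # frame from sequence coordinate, so windows are comparable
        scores = []
        for frame in range(3):
            denominator = (sum(counts) - counts[frame]) / 2
            scores.append(counts[frame] / (denominator or 1))  # 0 -> 1, as on the SPI slide
        best = max(scores)
        if sum(counts) >= min_subs and best >= min_spi:
            hits.append((start, start + window, scores.index(best), best))
    return hits


# Toy pair: substitutions only at every second third-codon position in the first 48 bp.
ref = "acg" * 32
other = list(ref)
for pos in range(2, 48, 6):
    other[pos] = "t"
print([hit[:3] for hit in windowed_spi(ref, "".join(other))])
# prints: [(0, 48, 2), (3, 51, 2), (6, 54, 2)]
```

Adjacent above-threshold windows that agree on the frame are the natural seeds for the extension step described in the modes above.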
Filtering for EGASP
• ESTs mapping to ENCODE: 70961 (185768 ‘exons’).
• After removing all mappings with indels: 34905 (84201).
• Keeping only the best alignment for each EST, and only EST mappings which begin at the first base of the EST: 31570 (71865).
• After flattening: 8927 ESTs = 8977 different transcripts = 16981 exons.
• After SPI analysis: 3640 CP3Os.
• After selecting one CP3O per transcript: 1908 CP3Os = 7407 ‘exons’.
• After removing 5' fragments with no Met: 7389 exons.
• Unique start, end, chr, strand combinations: 5240 ‘exons’.
• After liftOver: 5193 ‘exons’.