130 likes | 152 Views
DOGFISH-C system detects splice sites, coding starts/stops in multispecies alignments using contextual information. Training data includes vertebrate alignments, gene set from March 2005. Derives per-site probability estimates using weight matrices and diagnostic k-mers. Integrates species estimates into single result. Predicts CDSs using boundary site estimates and species alignments. Concludes full DOGFISH could enhance performance as a post-processing step.
E N D
CDS predictions using DOGFISH-C David Carter dmc@sanger.ac.uk Wellcome Trust Sanger Institute 6th May 2005
DOGFISH • Detection Of Genomic Features In Sequence Homologies • A four-component system to detect splice sites, coding starts/stops etc in multiple-species alignments DOGFISH-C • “Contextual” component only • Plus simple best-path CDS finder to derive single transcripts
DOGFISH components What’s in an alignment? • Taxonomic information: mutations, or lack of, at a given position. Evolutionary models. • Contextual information: does each sequence look right in itself? (DOGFISH-C) • Indels: where are the gaps? • Which species are present at all? Derive an estimate from each “view”, and combine into a single result.
Training data • UCSC MultiZ 8-species vertebrate alignments (minus chimp, plus frog) • VEGA gene set from March 2005 • DOGFISH trained to discriminate true sites from equal numbers of decoys taken at random from within genes • Final best-path search tuned using genes from 13 Encode training regions
Deriving per-site probability estimates Candidate site is represented by 100 bases each side of site itself, and 100 each side of every informant species position it’s aligned with: up to 8 x 200 bases in total. • Step 1: derive many statistics per species • 1a: position-specific weight matrices • 1b: significant k-mers in subregions • Step 2: derive one estimate per species • Step 3: integrate into a single estimate
Step 1a: Position-specific weight matrices • Train 6th-order position-specific weight matrices: one for each coding phase for true sites, and one for decoys. • Given a candidate sequence for a given species, find the overall best-scoring true-site model, i.e. find the most likely phase • At each position, take logodds between best true-site model and decoy model, giving 200 logodds scores.
Step 1b:Diagnostic k-mers-in-regions • As well as applying weight matrices, count occurrences of 200 “diagnostic” k-mers (k=1 to 6) within specific regions of the 200-base window • “Diagnostic” means frequency differs between true and decoy sites: e.g. AG is rare in positions -30 to -1 for true acceptor sites but not decoys. • Captures more subtle, less position-specific effects.
Step 2: convert 400 scores per species to one estimate per species • Now we have 200 positional logodds scores and 200 k-mer counts for our 200-base sequence, but we want a single probability estimate (that this site is a true one). • Train and run a relevance vector machine (RVM): decides which are the useful (“relevant”) statistics and what weight to give each one. • This gives better results than just adding the scores (as we would if we made the independence assumptions made in e.g. HMMs)
Step 3: convert up to 8 per-species estimates to one overall estimate • Now we have an estimate for each species that aligned to the target. • Boost estimates of species that did align, and introduce low “default” estimate for those that didn’t; more distant species have larger boosts and milder defaults. • Train and run another RVM that takes (exactly) 8 inputs and outputs the single DOGFISH-C estimate for this site.
Error rates (%) on balanced test set “Error” means estimate < 0.5 for true site, or > 0.5 for decoy
Predicting CDS’s (in a hurry) • A candidate CDS is any sequence • [ATG|AG] … <ORF> … [Stop|GT] • Use the DOGFISH-C candidate site estimates for the two boundary sites • Introduce further statistic based on which species get an alignment with “convincing” length across the candidate CDS • CDS estimate = • 5’-site estimate * 3’-site estimate * aligned-species estimate • Hand-tune a few more parameters (missing tea break) • Apply DP search to look for best legal CDS sequence (so single transcript only) across the Encode region
CDS prediction results (my figures) on 31 unseen Encode regions, May 3rd 2005.
Conclusions/Plans/Thanks • “Full” DOGFISH could well boost performance as a post processing step • Detect transcription start sites! • Alternative transcripts • Thanks to: Richard Durbin; Thomas Down (RVM expert); Patrick Meidl (Vega); organizers; ...