
CDS predictions using DOGFISH-C

CDS predictions using DOGFISH-C. David Carter (dmc@sanger.ac.uk), Wellcome Trust Sanger Institute, 6th May 2005. DOGFISH: Detection Of Genomic Features In Sequence Homologies. A four-component system to detect splice sites, coding starts/stops, etc. in multiple-species alignments.


Presentation Transcript


  1. CDS predictions using DOGFISH-C David Carter dmc@sanger.ac.uk Wellcome Trust Sanger Institute 6th May 2005

  2. DOGFISH • Detection Of Genomic Features In Sequence Homologies • A four-component system to detect splice sites, coding starts/stops, etc. in multiple-species alignments. DOGFISH-C • “Contextual” component only • Plus a simple best-path CDS finder to derive single transcripts

  3. DOGFISH components: what’s in an alignment? • Taxonomic information: mutations, or the lack of them, at a given position. Evolutionary models. • Contextual information: does each sequence look right in itself? (DOGFISH-C) • Indels: where are the gaps? • Which species are present at all? Derive an estimate from each “view”, and combine them into a single result.

  4. Training data • UCSC MultiZ 8-species vertebrate alignments (minus chimp, plus frog) • VEGA gene set from March 2005 • DOGFISH trained to discriminate true sites from equal numbers of decoys taken at random from within genes • Final best-path search tuned using genes from 13 Encode training regions

  5. Deriving per-site probability estimates. A candidate site is represented by 100 bases on each side of the site itself, and 100 on each side of every informant-species position it is aligned with: up to 8 × 200 bases in total. • Step 1: derive many statistics per species • 1a: position-specific weight matrices • 1b: significant k-mers in subregions • Step 2: derive one estimate per species • Step 3: integrate into a single estimate
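A minimal sketch of the 200-base window described above. The function name, the choice to treat the site as the boundary just before index `site`, and the "N" padding at sequence ends are assumptions made for illustration; the slides do not say how DOGFISH-C handles sites near the end of a sequence.

```python
def site_window(seq: str, site: int, flank: int = 100, pad: str = "N") -> str:
    """Return `flank` bases on each side of the boundary before index `site`.

    Assumption: windows that run off either end of `seq` are padded with `pad`
    so that every window has exactly 2 * flank characters.
    """
    left = seq[max(0, site - flank):site]
    right = seq[site:site + flank]
    return left.rjust(flank, pad) + right.ljust(flank, pad)


# One window per aligned species; species with no alignment are simply absent.
windows = {
    "human": site_window("ACGTACGT", 4, flank=3),
    "mouse": site_window("ACGT", 1, flank=3),
}
```

With the default `flank=100`, each species contributes one 200-character window, which matches the "up to 8 × 200 bases" figure.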

  6. Step 1a: Position-specific weight matrices • Train 6th-order position-specific weight matrices: one for each coding phase for true sites, and one for decoys. • Given a candidate sequence for a given species, find the overall best-scoring true-site model, i.e. find the most likely phase • At each position, take logodds between best true-site model and decoy model, giving 200 logodds scores.
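The scoring in Step 1a can be sketched as follows. This is an illustration only: the model representation (one dict per position, keyed by `(context, base)`), the probability floor for unseen events, and the function names are assumptions, and training is not shown. The phase is picked by the highest total log-probability, which is one reading of "overall best-scoring true-site model".

```python
import math


def position_logprobs(seq, model, order):
    """Log-probability of each base given its `order` preceding bases."""
    out = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        # Floor for events never seen in training (an assumed smoothing choice).
        p = model[i].get((context, base), 1e-6)
        out.append(math.log(p))
    return out


def pwm_logodds(seq, phase_models, decoy_model, order=6):
    """Per-position log-odds of the best true-site phase model vs. the decoy model."""
    per_phase = [position_logprobs(seq, m, order) for m in phase_models]
    best = max(per_phase, key=sum)  # most likely coding phase overall
    decoy = position_logprobs(seq, decoy_model, order)
    return [t - d for t, d in zip(best, decoy)]


# Toy order-0 models over a 2-base window (hypothetical probabilities).
phase0 = [{("", "A"): 0.5}, {("", "G"): 0.5}]
phase1 = [{("", "A"): 0.25}, {("", "G"): 0.25}]
decoy = [{("", "A"): 0.25}, {("", "G"): 0.25}]
scores = pwm_logodds("AG", [phase0, phase1], decoy, order=0)
```

On a real 200-base window this returns the 200 log-odds scores that feed Step 2.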

  7. Step 1b: Diagnostic k-mers-in-regions • As well as applying weight matrices, count occurrences of 200 “diagnostic” k-mers (k=1 to 6) within specific regions of the 200-base window • “Diagnostic” means frequency differs between true and decoy sites: e.g. AG is rare in positions -30 to -1 for true acceptor sites but not decoys. • Captures more subtle, less position-specific effects.
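Counting a k-mer within a region of the window might look like the sketch below. The feature representation `(kmer, start, end)` and the decision to count overlapping occurrences are assumptions; the slides do not say how the 200 diagnostic k-mers and their regions are selected or encoded.

```python
def count_kmer_in_region(window, kmer, start, end):
    """Count occurrences (overlaps included) of `kmer` lying fully inside window[start:end]."""
    region = window[start:end]
    k = len(kmer)
    return sum(1 for i in range(len(region) - k + 1) if region[i:i + k] == kmer)


def kmer_features(window, features):
    """One count per (kmer, start, end) feature."""
    return [count_kmer_in_region(window, kmer, s, e) for kmer, s, e in features]


# Assuming the site sits at index 100 of a 200-base window, positions -30 to -1
# correspond to window[70:100]; the acceptor-site "AG" example would then be
# the feature ("AG", 70, 100).
counts = kmer_features("TTAGTT", [("AG", 0, 3), ("AG", 2, 6), ("T", 0, 6)])
```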

  8. Step 2: convert 400 scores per species to one estimate per species • Now we have 200 positional logodds scores and 200 k-mer counts for our 200-base sequence, but we want a single probability estimate (that this site is a true one). • Train and run a relevance vector machine (RVM): decides which are the useful (“relevant”) statistics and what weight to give each one. • This gives better results than just adding the scores (as we would if we made the independence assumptions made in e.g. HMMs)
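To make the contrast in the last bullet concrete, here is a sketch comparing a plain sum of log-odds with a sparse weighted combination passed through a logistic function, which is the general form of a trained RVM classifier's output. The weights, bias, and function names are hypothetical, and RVM training itself is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def naive_estimate(logodds):
    """Independence assumption: just add the log-odds scores."""
    return sigmoid(sum(logodds))


def weighted_estimate(features, weights, bias=0.0):
    """Sparse linear model: only "relevant" features carry a nonzero weight.

    `weights` maps feature index -> weight (hypothetical values here);
    every feature not listed is effectively pruned.
    """
    z = bias + sum(w * features[i] for i, w in weights.items())
    return sigmoid(z)


features = [1.0, 2.0, 3.0]  # stand-in for the 400 per-species statistics
p = weighted_estimate(features, {0: 1.0, 2: -1.0}, bias=2.0)
```

In the weighted form, a statistic that is redundant with others can receive a small or zero weight instead of being counted at full strength.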

  9. Step 3: convert up to 8 per-species estimates to one overall estimate • Now we have an estimate for each species that aligned to the target. • Boost estimates of species that did align, and introduce low “default” estimate for those that didn’t; more distant species have larger boosts and milder defaults. • Train and run another RVM that takes (exactly) 8 inputs and outputs the single DOGFISH-C estimate for this site.
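Building the fixed-length input for the second RVM could be sketched as below. Everything numeric here is hypothetical: the slides give neither the boost and default values nor the form of the boost, so this sketch assumes the boost is added in log-odds space and takes the per-species values as parameters.

```python
import math


def logit(p, eps=1e-9):
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the log finite
    return math.log(p / (1.0 - p))


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def species_inputs(estimates, species_order, boosts, defaults):
    """Return one input per species, in a fixed order.

    Aligned species: estimate boosted in log-odds space (assumed form).
    Missing species: that species' default estimate.
    """
    inputs = []
    for sp in species_order:
        if sp in estimates:
            inputs.append(sigmoid(logit(estimates[sp]) + boosts[sp]))
        else:
            inputs.append(defaults[sp])
    return inputs


# Toy two-species example with made-up parameters.
order = ["human", "mouse"]
boosts = {"human": 0.0, "mouse": 1.0}
defaults = {"human": 0.1, "mouse": 0.3}
vec = species_inputs({"human": 0.5}, order, boosts, defaults)
```

With eight entries in `species_order`, the result always has exactly 8 values, whichever species aligned.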

  10. Error rates (%) on balanced test set. “Error” means an estimate < 0.5 for a true site, or > 0.5 for a decoy.
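The error definition on this slide translates directly into a short function. The function name and the toy inputs are illustrative; note that, following the definition as stated, an estimate of exactly 0.5 counts as an error for neither class.

```python
def error_rate(estimates, labels, threshold=0.5):
    """Percentage of sites on the wrong side of the threshold.

    labels[i] is True for a true site and False for a decoy.
    """
    errors = sum(
        1
        for p, is_true in zip(estimates, labels)
        if (is_true and p < threshold) or (not is_true and p > threshold)
    )
    return 100.0 * errors / len(labels)


# Toy balanced set: two true sites, two decoys.
rate = error_rate([0.9, 0.4, 0.2, 0.6], [True, True, False, False])
```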

  11. Predicting CDSs (in a hurry) • A candidate CDS is any sequence of the form [ATG|AG] … <ORF> … [Stop|GT] • Use the DOGFISH-C candidate site estimates for the two boundary sites • Introduce a further statistic based on which species get an alignment with “convincing” length across the candidate CDS • CDS estimate = 5’-site estimate * 3’-site estimate * aligned-species estimate • Hand-tune a few more parameters (missing tea break) • Apply DP search to look for the best legal CDS sequence (so single transcript only) across the Encode region
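A simplified sketch of the CDS estimate and a best-path search over candidates. The product formula is from the slide; the rest is assumed for illustration: the `Candidate` type, scoring a chain by the sum of log-odds of its CDS estimates, and a notion of "legal" that only checks ATG first, GT followed by a later AG, and Stop last. Reading-frame consistency, intron length limits, and the hand-tuned parameters are not modelled.

```python
import math
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int
    end: int
    start_type: str  # "ATG" or "AG"
    end_type: str    # "Stop" or "GT"
    p5: float        # 5'-site estimate
    p3: float        # 3'-site estimate
    p_aligned: float # aligned-species estimate


def cds_estimate(c):
    return c.p5 * c.p3 * c.p_aligned


def logodds(p, eps=1e-9):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def best_path(cands):
    """Highest-scoring chain ATG..GT, AG..GT, ..., AG..Stop (or a single ATG..Stop)."""
    cands = sorted(cands, key=lambda c: c.start)
    best = [None] * len(cands)  # best chain score ending at candidate i
    back = [None] * len(cands)  # predecessor index for backtracking
    for i, c in enumerate(cands):
        s = logodds(cds_estimate(c))
        if c.start_type == "ATG":
            best[i] = s
            continue
        # An AG-initial CDS needs an earlier chain ending in GT before it.
        prev = [(best[j], j) for j in range(i)
                if best[j] is not None
                and cands[j].end_type == "GT" and cands[j].end < c.start]
        if prev:
            b, j = max(prev)
            best[i], back[i] = s + b, j
    finals = [(best[i], i) for i, c in enumerate(cands)
              if best[i] is not None and c.end_type == "Stop"]
    if not finals:
        return []
    _, i = max(finals)
    path = []
    while i is not None:
        path.append(cands[i])
        i = back[i]
    return path[::-1]


a = Candidate(0, 10, "ATG", "GT", 0.9, 0.9, 0.9)
b = Candidate(20, 30, "AG", "Stop", 0.9, 0.9, 0.9)
c = Candidate(0, 30, "ATG", "Stop", 0.5, 0.5, 0.9)
d = Candidate(5, 25, "AG", "GT", 0.99, 0.99, 0.99)
path = best_path([c, b, d, a])
```

Here the two-exon chain `a`, `b` beats the single low-scoring CDS `c`, and `d` is never used because no GT-terminated CDS ends before it starts.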

  12. CDS prediction results (my figures) on 31 unseen Encode regions, May 3rd 2005.

  13. Conclusions/Plans/Thanks • “Full” DOGFISH could well boost performance as a post processing step • Detect transcription start sites! • Alternative transcripts • Thanks to: Richard Durbin; Thomas Down (RVM expert); Patrick Meidl (Vega); organizers; ...
