CDS predictions using DOGFISH-C

CDS predictions using DOGFISH-C David Carter dmc@sanger.ac.uk Wellcome Trust Sanger Institute 6th May 2005

DOGFISH • Detection Of Genomic Features In Sequence Homologies • A four-component system to detect splice sites, coding starts/stops etc in multiple-species alignments DOGFISH-C • “Contextual” component only • Plus simple best-path CDS finder to derive single transcripts

DOGFISH components What’s in an alignment? • Taxonomic information: mutations, or lack of, at a given position. Evolutionary models. • Contextual information: does each sequence look right in itself? (DOGFISH-C) • Indels: where are the gaps? • Which species are present at all? Derive an estimate from each “view”, and combine into a single result.

Training data • UCSC MultiZ 8-species vertebrate alignments (minus chimp, plus frog) • VEGA gene set from March 2005 • DOGFISH trained to discriminate true sites from equal numbers of decoys taken at random from within genes • Final best-path search tuned using genes from 13 Encode training regions

Deriving per-site probability estimates Candidate site is represented by 100 bases each side of site itself, and 100 each side of every informant species position it’s aligned with: up to 8 x 200 bases in total. • Step 1: derive many statistics per species • 1a: position-specific weight matrices • 1b: significant k-mers in subregions • Step 2: derive one estimate per species • Step 3: integrate into a single estimate

Step 1a: Position-specific weight matrices • Train 6th-order position-specific weight matrices: one for each coding phase for true sites, and one for decoys. • Given a candidate sequence for a given species, find the overall best-scoring true-site model, i.e. find the most likely phase • At each position, take logodds between best true-site model and decoy model, giving 200 logodds scores.

Step 1b:Diagnostic k-mers-in-regions • As well as applying weight matrices, count occurrences of 200 “diagnostic” k-mers (k=1 to 6) within specific regions of the 200-base window • “Diagnostic” means frequency differs between true and decoy sites: e.g. AG is rare in positions -30 to -1 for true acceptor sites but not decoys. • Captures more subtle, less position-specific effects.

Step 2: convert 400 scores per species to one estimate per species • Now we have 200 positional logodds scores and 200 k-mer counts for our 200-base sequence, but we want a single probability estimate (that this site is a true one). • Train and run a relevance vector machine (RVM): decides which are the useful (“relevant”) statistics and what weight to give each one. • This gives better results than just adding the scores (as we would if we made the independence assumptions made in e.g. HMMs)

Step 3: convert up to 8 per-species estimates to one overall estimate • Now we have an estimate for each species that aligned to the target. • Boost estimates of species that did align, and introduce low “default” estimate for those that didn’t; more distant species have larger boosts and milder defaults. • Train and run another RVM that takes (exactly) 8 inputs and outputs the single DOGFISH-C estimate for this site.

Error rates (%) on balanced test set “Error” means estimate < 0.5 for true site, or > 0.5 for decoy

Predicting CDS’s (in a hurry) • A candidate CDS is any sequence • [ATG|AG] … <ORF> … [Stop|GT] • Use the DOGFISH-C candidate site estimates for the two boundary sites • Introduce further statistic based on which species get an alignment with “convincing” length across the candidate CDS • CDS estimate = • 5’-site estimate * 3’-site estimate * aligned-species estimate • Hand-tune a few more parameters (missing tea break) • Apply DP search to look for best legal CDS sequence (so single transcript only) across the Encode region

CDS prediction results (my figures) on 31 unseen Encode regions, May 3rd 2005.

Conclusions/Plans/Thanks • “Full” DOGFISH could well boost performance as a post processing step • Detect transcription start sites! • Alternative transcripts • Thanks to: Richard Durbin; Thomas Down (RVM expert); Patrick Meidl (Vega); organizers; ...

CDS predictions using DOGFISH-C

CDS predictions using DOGFISH-C

Presentation Transcript

Dogfish Sharks

CDs

Better climate predictions using hindsight

Spiny Dogfish

Dogfish Dissection

Dogfish Shark aka Bruce

Extracting and Using CDS Data

CDS 130

Dogfish Shark Dissection Images

Dogfish Shark Dissection

Dogfish Shark

DOGFISH SHARK

CDS/ISIS

Using Ratios (Proportions) to Make Predictions

CDS/ISIS

Dose Predictions using LWR

Using Relationships to Make Predictions

Using Prior Knowledge to Make Predictions

Dogfish Sharks