A knowledge-based approach to integrated genome annotation Michael Brent Washington University
Outline of our process • MGC validated clones + RefSeq NM’s • Remove all with frame shifts • Fill with spliced Hs mRNA & EST • Threaded de novo predictions • Paragon aligner, BLAT, N-SCAN +EST ENCODE Workshop
Paragon aligner Manimozhiyan Arumugam with Chaochun Wei
Better EST/cDNA-to-genome alignment • Idea • Go beyond minimizing mismatches and gaps • Accurate probabilities in correct alignments • Estimate parameters for each sequence set
Better EST/cDNA alignment • Two sources of mismatches & gaps • Error (sequencing, RT) • Quals give local probs. Not used here. • Polymorphism (RNA vs. genome strains) • Gap vs. indel rates are different • Parameters must vary with sequence quality & source strains/polymorphism rates • E.g. prefer non-matches in low quality bases
Better EST/cDNA alignment • Introns • Accurate probabilities in correct alignments • GT/AG vs. GC/AG vs. AT/AC • Absolutely no junk splice sites • Not clear what to do with polymorphic sites • Long introns are rarer than short introns
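The intron points above can be illustrated with a minimal sketch. It is not Paragon's actual model: the splice-pair probabilities are made-up placeholder values, the geometric length distribution is a stand-in chosen only because it makes long introns rarer than short ones, and `intron_log_prob` and `mean_length` are hypothetical names.

```python
import math

# Illustrative (made-up) probabilities for donor/acceptor dinucleotide pairs.
# Any pair not listed is treated as a junk splice site and disallowed.
SPLICE_PAIR_PROB = {
    ("GT", "AG"): 0.98,
    ("GC", "AG"): 0.015,
    ("AT", "AC"): 0.005,
}


def intron_log_prob(intron_seq, mean_length=1000.0):
    """Log-probability of a candidate intron under a toy model.

    Returns -inf when the boundary dinucleotides are not an allowed pair.
    """
    if len(intron_seq) < 4:
        return float("-inf")
    donor = intron_seq[:2].upper()
    acceptor = intron_seq[-2:].upper()
    pair_prob = SPLICE_PAIR_PROB.get((donor, acceptor))
    if pair_prob is None:
        return float("-inf")  # absolutely no junk splice sites
    # Assumed geometric length distribution: each extra base costs log(1 - p_stop),
    # so longer introns are less probable than shorter ones.
    p_stop = 1.0 / mean_length
    length_lp = (len(intron_seq) - 1) * math.log(1.0 - p_stop) + math.log(p_stop)
    return math.log(pair_prob) + length_lp
```

In a pair HMM, a term like this would be added to the alignment score whenever a path enters and leaves an intron state, so a candidate alignment that needs a non-canonical or very long intron has to be justified by better matches elsewhere.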
Small exon in finished cDNA
STANDARD TOOL (EST_GENOME)
GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100
GENOME     401 CCGGGACTACCTCATGAGGTGACG-Agcgcc.......tgtagCACTTCT 16339
               ||||||||||||||||| || ||| |>>>>> 15907 >>>>>  |||||
BC000810   101 CCGGGACTACCTCATGA-GT-ACGCA.................--CTTCT 129
GENOME   16340 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 16389
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810   130 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 179
OUR PAIR HMM
GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100
GENOME     401 CCGGGACTACCTCATGAGGTGAC.......AATAGTACGGTAAG...... 13006
               ||||||||||||||||||>>>>> 12584 >>>>>||||>>>>> 3326
BC000810   101 CCGGGACTACCTCATGAG.................TACG........... 122
GENOME   13007 TGTAGCACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 13046
               >>>>>|||||||||||||||||||||||||||||||||||||||||||||
BC000810   123 .....CACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 167
Blind test • Test set • 100 alignment pairs of MGC clones to genome • Paragon & EST_GENOME differ on all of them • Output format identical • Evaluation • Curator attempting to explain discrepancies • Result • 37 cases where biological evidence favors one alignment • In 31 of those 37, the Paragon alignment is supported
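One way to gauge how lopsided 31 of 37 is would be a one-sided sign test. This is a sketch that was not part of the talk, and it assumes the 37 decided cases are independent and that, if the two aligners were equally good, each case would favor either one with probability 0.5. The function name `one_sided_sign_test` is hypothetical.

```python
from math import comb


def one_sided_sign_test(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials


# 31 of the 37 decided cases supported the Paragon alignment.
p_value = one_sided_sign_test(31, 37)
print(f"one-sided p-value: {p_value:.2e}")
```

The 63 pairs where the curator found no decisive evidence are left out of this calculation, which is the usual convention for a sign test but does limit what the number says about the full test set.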
Future directions • UTR vs. ORF • Polymorphism is more common in UTR • And 3rd position in ORF • Conservation • Use alignments to distinguish true from false • Splice sites, introns • Codons • Polymorphisms (analogous to quality values)
Conceptual shift • Traditional view • cDNA data “speaks for itself”. Theory neutral. • Alignment = counting matches, mismatches, gaps • cDNA = genome annotation
Conceptual shift • Our view • More knowledge = better alignments & annotations • cDNA is very useful evidence re: gene structure • Need to align it correctly • Need to determine its completeness • If not complete, predict the remainder • Gene prediction & cDNA alignment are the same problem • cDNA/EST just adds another information source
N-SCAN_EST Chaochun Wei
TWINSCAN/N-SCAN_EST • Goal: • Integrate EST information with TWINSCAN to • improve accuracy where EST evidence exists • without losing the ability to predict novel genes.
Twinscan_est
Generating EST-alignment Sequence
Modeling EST alignment sequence • Probability models • In each HMM state • Separate models for EST alignment sequence • Probabilities of DNA, conservation sequence, and EST sequence are multiplied. • Very similar to models of genomic alignments
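The multiplication of the three tracks within a state can be sketched as a sum of log-probabilities. Everything specific here is an assumption for illustration: the symbol alphabets (`|`, `:`, `.` for the conservation sequence and `E`, `I`, `N` for the EST sequence), the probability values, and the position-independent models, which stand in for whatever richer models the real system uses.

```python
import math


def state_emission_log_prob(dna, cons, est, models):
    """Log-probability that one HMM state emits the three parallel tracks.

    Summing logs is equivalent to multiplying the DNA, conservation,
    and EST probabilities at every position.
    """
    if not (len(dna) == len(cons) == len(est)):
        raise ValueError("all three tracks must have the same length")
    total = 0.0
    for d, c, e in zip(dna, cons, est):
        total += math.log(models["dna"][d])
        total += math.log(models["cons"][c])
        total += math.log(models["est"][e])
    return total


# Hypothetical parameters for a coding-exon state (made-up numbers).
exon_models = {
    "dna": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "cons": {"|": 0.7, ":": 0.2, ".": 0.1},   # match, mismatch, unaligned
    "est": {"E": 0.8, "I": 0.05, "N": 0.15},  # EST exon, EST intron, no EST
}

log_p = state_emission_log_prob("ATG", "||:", "EEN", exon_models)
```

Because the EST track has its own "no EST" symbol with nonzero probability in every state, regions without EST coverage are still scored by the DNA and conservation models, which is how novel genes remain predictable.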
Multi-genome methods: N-SCAN Samuel Gross with Randall Brown
N-SCAN: Using multi-genome alignments • Motivation • Many genomes should give stronger signal of negative selection than two • Lots of genomes are being sequenced • Methods • Extend Twinscan to a phylogenetic tree model • At each site, mutation rate & pattern of tolerated substitutions depend on function
Example • A multiple alignment that (A) is and (B) is not typical of the splice boundary shown
Using mutation patterns for improving gene prediction • Tree hidden Markov model • Each state • generates columns of a multiple alignment • by a substitution process • along the branches of a phylogenetic tree
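How a state could assign a probability to one alignment column can be sketched with Felsenstein's pruning algorithm. This is a generic illustration, not the N-SCAN model: it swaps in a Jukes-Cantor substitution process with a per-state rate multiplier, uses only the four bases, and the tree shape, branch lengths, and species names are made up.

```python
import math

BASES = "ACGT"

# Hypothetical tree: internal nodes are lists of (child, branch_length) pairs,
# leaves are species names.
TREE = [([("human", 0.1), ("mouse", 0.3)], 0.1), ("chicken", 0.6)]


def jc69(branch_length):
    """Jukes-Cantor transition probabilities P[parent_base][child_base]."""
    e = math.exp(-4.0 * branch_length / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}


def partial(node, column, rate):
    """Return {base: Pr(leaves below node | node carries base)}."""
    if isinstance(node, str):  # leaf: probability 1 for the observed base
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        trans = jc69(rate * length)
        child_partial = partial(child, column, rate)
        for a in BASES:
            result[a] *= sum(trans[a][b] * child_partial[b] for b in BASES)
    return result


def column_likelihood(tree, column, rate):
    """Probability of one alignment column, with a uniform prior at the root."""
    root = partial(tree, column, rate)
    return sum(0.25 * root[b] for b in BASES)


conserved = {"human": "A", "mouse": "A", "chicken": "A"}
slow = column_likelihood(TREE, conserved, rate=0.3)  # e.g. a constrained state
fast = column_likelihood(TREE, conserved, rate=1.0)  # e.g. a neutral state
```

A fully conserved column gets a higher probability under the slow-rate state than under the fast-rate one, which is the kind of signal a tree HMM can use to tell functional from non-functional sites.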
Challenges • Columns are not correct, orthologous • Sequencing error • Alignment error • Change of function (I am not a mouse!)
Differences from EXONIPHY • Approach • Estimate models of actual alignments, not evolutionary processes • Model • Independent substitution probabilities on each branch of the tree • 6 characters: A, C, G, T, gap, unaligned • Condition backwards from target genome
Using mutation patterns for improving gene prediction • Traditional factorization • Pr(a2) Pr(a1|a2) Pr(h|a1) Pr(m|a1) Pr(c|a2) • N-SCAN factorization • Pr(h) Pr(a1|h) Pr(a2|a1) Pr(m|a1) Pr(c|a2)
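The two factorizations describe the same joint distribution when the re-rooted conditionals are obtained from the original ones by Bayes' rule. The sketch below checks this numerically. It assumes a reading the slide does not spell out: `h`, `m`, and `c` are observed genomes with `h` the target, and `a1` and `a2` are ancestral nodes. All distributions are random placeholders.

```python
import itertools
import random

BASES = "ACGT"


def random_dist(rng):
    """A random probability distribution over the four bases."""
    weights = [rng.random() + 0.1 for _ in BASES]
    total = sum(weights)
    return {b: w / total for b, w in zip(BASES, weights)}


def random_cond(rng):
    """cond[parent][child] = Pr(child | parent)."""
    return {p: random_dist(rng) for p in BASES}


rng = random.Random(0)
p_a2 = random_dist(rng)          # root prior in the traditional form
a1_given_a2 = random_cond(rng)
h_given_a1 = random_cond(rng)
m_given_a1 = random_cond(rng)
c_given_a2 = random_cond(rng)

# Re-root at h by reversing the a2 -> a1 and a1 -> h edges with Bayes' rule.
p_a1 = {a1: sum(p_a2[a2] * a1_given_a2[a2][a1] for a2 in BASES) for a1 in BASES}
a2_given_a1 = {a1: {a2: p_a2[a2] * a1_given_a2[a2][a1] / p_a1[a1] for a2 in BASES}
               for a1 in BASES}
p_h = {h: sum(p_a1[a1] * h_given_a1[a1][h] for a1 in BASES) for h in BASES}
a1_given_h = {h: {a1: p_a1[a1] * h_given_a1[a1][h] / p_h[h] for a1 in BASES}
              for h in BASES}


def traditional(h, m, c, a1, a2):
    return (p_a2[a2] * a1_given_a2[a2][a1] * h_given_a1[a1][h]
            * m_given_a1[a1][m] * c_given_a2[a2][c])


def rerooted(h, m, c, a1, a2):
    return (p_h[h] * a1_given_h[h][a1] * a2_given_a1[a1][a2]
            * m_given_a1[a1][m] * c_given_a2[a2][c])


max_diff = max(abs(traditional(*v) - rerooted(*v))
               for v in itertools.product(BASES, repeat=5))
```

In the re-rooted form the first factor depends only on the target sequence, and every other factor is conditioned, directly or through the ancestors, on it.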
Preliminary study in human
Preliminary study in human
Fin