A knowledge-based approach to integrated genome annotation

A knowledge-based approach to integrated genome annotation Michael Brent Washington University

EST-, mRNA-, and protein-based methods

Outline of our process
• MGC validated clones + RefSeq NM’s
• Remove all with frame shifts
• Fill with spliced Hs mRNA & EST
• Threaded de novo predictions
Tools: Paragon aligner, BLAT, N-SCAN +EST
ENCODE Workshop

Paragon aligner Manimozhiyan Arumugam with Chaochun Wei

Better EST/cDNA-to-genome alignment
• Idea
  • Go beyond minimizing mismatches and gaps
  • Accurate probabilities in correct alignments
  • Estimate parameters for each sequence set

Better EST/cDNA alignment
• Two sources of mismatches & gaps
  • Error (sequencing, RT)
    • Quality values give local probabilities. Not used here.
  • Polymorphism (RNA vs. genome strains)
    • Gap vs. indel rates are different
• Parameters must vary with sequence quality & source strains/polymorphism rates
  • E.g. prefer non-matches in low-quality bases
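
The two mismatch sources above can be folded into one per-base mismatch probability. The sketch below is an illustration only, not Paragon's parameterization: it assumes sequencing/RT errors and polymorphisms occur independently, and the function names and example rates are hypothetical. The Phred conversion is the standard definition; the slide notes that quality values are not used in the current aligner.

```python
def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def mismatch_prob(error_rate: float, polymorphism_rate: float) -> float:
    """Probability that an aligned base differs from the genome.

    Assumption: errors and polymorphisms are independent, and the rare case
    where an error exactly cancels a polymorphism is ignored.
    """
    return 1 - (1 - error_rate) * (1 - polymorphism_rate)


# Hypothetical rates: a Q20 base aligned to a genome from a different strain.
low_quality = mismatch_prob(phred_to_error_prob(20), polymorphism_rate=0.001)
# The same polymorphism rate with a Q40 base.
high_quality = mismatch_prob(phred_to_error_prob(40), polymorphism_rate=0.001)
print(low_quality, high_quality)
```

Under these assumptions a mismatch is more probable, and so less heavily penalized, at the low-quality base, which is the behavior the last bullet asks for.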

Better EST/cDNA alignment
• Introns
  • Accurate probabilities in correct alignments
  • GT/AG vs. GC/AG vs. AT/AC
  • Absolutely no junk splice sites
  • Not clear what to do with polymorphic sites
  • Long introns are rarer than short introns
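
One way to picture the intron bullets is a toy scoring function: a prior over the three allowed splice-site pairs, zero probability for anything else, and a length term that decreases with intron length. This is a minimal sketch under stated assumptions, not the Paragon model: the probabilities in `SPLICE_PAIR_PROB`, the geometric length distribution, and the `mean_length` default are all hypothetical placeholders for values that would be estimated from each sequence set.

```python
import math

# Hypothetical priors for illustration; only these three pairs are allowed.
SPLICE_PAIR_PROB = {
    ("GT", "AG"): 0.980,
    ("GC", "AG"): 0.015,
    ("AT", "AC"): 0.005,
}


def intron_log_prob(intron: str, mean_length: float = 5000.0) -> float:
    """Log-probability of an intron: splice-pair prior times a geometric length term."""
    if len(intron) < 4:
        return float("-inf")
    donor, acceptor = intron[:2].upper(), intron[-2:].upper()
    pair_prob = SPLICE_PAIR_PROB.get((donor, acceptor), 0.0)
    if pair_prob == 0.0:
        # "Absolutely no junk splice sites": other dinucleotide pairs are impossible.
        return float("-inf")
    # Geometric length distribution (an assumption): longer introns are rarer.
    p_stop = 1.0 / mean_length
    length_log_prob = (len(intron) - 1) * math.log(1 - p_stop) + math.log(p_stop)
    return math.log(pair_prob) + length_log_prob


short_intron = "GT" + "A" * 100 + "AG"
long_intron = "GT" + "A" * 10000 + "AG"
print(intron_log_prob(short_intron), intron_log_prob(long_intron))
```

A hard zero for non-canonical pairs is what keeps the aligner from opening an intron at an arbitrary position just to absorb a few mismatches.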

Small exon in finished cDNA

STANDARD TOOL (EST_GENOME)

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGACG-Agcgcc.......tgtagCACTTCT 16339
               ||||||||||||||||| || ||| |>>>>> 15907 >>>>>  |||||
BC000810   101 CCGGGACTACCTCATGA-GT-ACGCA.................--CTTCT 129

GENOME   16340 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 16389
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810   130 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 179

OUR PAIR HMM

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGAC.......AATAGTACGGTAAG...... 13006
               ||||||||||||||||||>>>>> 12584 >>>>>||||>>>>> 3326
BC000810   101 CCGGGACTACCTCATGAG.................TACG........... 122

GENOME   13007 TGTAGCACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 13046
               >>>>>|||||||||||||||||||||||||||||||||||||||||||||
BC000810   123 .....CACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 167


Blind test
• Test set
  • 100 alignment pairs of MGC clones to genome
  • Paragon & EST_GENOME differ on all of them
  • Output format identical
• Evaluation
  • Curator attempting to explain discrepancies
• Result
  • 37 cases where biological evidence favors one alignment
  • In 31/37 the Paragon alignment is supported
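
To put the 31/37 result in perspective, one can ask how likely such a split would be if the two aligners were equally good. The sketch below is an added illustration, not an analysis from the talk: it assumes the 37 decided cases are independent and applies a one-sided sign test with a 50/50 null; the function name is hypothetical.

```python
from math import comb


def one_sided_sign_test(successes: int, trials: int) -> float:
    """P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials


supported = 31   # cases where the Paragon alignment is supported
decided = 37     # cases where biological evidence favors one alignment
p_value = one_sided_sign_test(supported, decided)
print(f"{supported / decided:.1%} of decided cases favor Paragon; p = {p_value:.2e}")
```

The remaining 63 of the 100 pairs are not counted either way, since the curator found no evidence favoring one alignment.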

Future directions
• UTR vs. ORF
  • Polymorphism is more common in UTR
  • And 3rd position in ORF
• Conservation
  • Use alignments to distinguish true from false
  • Splice sites, introns
  • Codons
  • Polymorphisms (analogous to quality values)

Conceptual shift
• Traditional view
  • cDNA data “speaks for itself”. Theory neutral.
  • Alignment = counting matches, mismatches, gaps
  • cDNA = genome annotation

Conceptual shift
• Our view
  • More knowledge = better alignments & annotations
  • cDNA is very useful evidence re: gene structure
  • Need to align it correctly
  • Need to determine its completeness
  • If not complete, predict the remainder
  • Gene prediction & cDNA alignment are the same problem
  • cDNA/EST just adds another information source

N-SCAN_EST Chaochun Wei

TWINSCAN/N-SCAN_EST
• Goal: integrate EST information with TWINSCAN to
  • improve accuracy where EST evidence exists
  • without losing the ability to predict novel genes.

Twinscan_est

Generating EST-alignment Sequence

Modeling EST alignment sequence
• Probability models
  • In each HMM state
  • Separate models for EST alignment sequence
  • Probabilities of DNA, conservation sequence, and EST sequence are multiplied.
  • Very similar to models of genomic alignments
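
The multiplication of the three model probabilities can be sketched as a sum of logs. This is a minimal illustration, not the N-SCAN_EST implementation: the function name and every probability value below are hypothetical, chosen only to show how EST evidence shifts the score of one state relative to another.

```python
import math


def joint_log_emission(p_dna: float, p_conservation: float, p_est: float) -> float:
    """Log of the product of the three per-state model probabilities."""
    probs = (p_dna, p_conservation, p_est)
    if any(p <= 0.0 for p in probs):
        return float("-inf")  # an impossible observation under any one model
    return sum(math.log(p) for p in probs)


# Hypothetical per-position probabilities for a site covered by an EST alignment.
exon_score = joint_log_emission(p_dna=0.30, p_conservation=0.60, p_est=0.80)
intron_score = joint_log_emission(p_dna=0.25, p_conservation=0.20, p_est=0.10)
print(exon_score, intron_score)
```

Because the EST term is just another factor, a position with no EST evidence can use an uninformative EST probability in every state, which leaves the ranking to the DNA and conservation models and preserves the ability to predict novel genes.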

Multi-genome methods: N-SCAN Samuel Gross with Randall Brown

N-SCAN: Using multi-genome alignments
• Motivation
  • Many genomes should give stronger signal of negative selection than two
  • Lots of genomes are being sequenced
• Methods
  • Extend Twinscan to a phylogenetic tree model
  • At each site, mutation rate & pattern of tolerated substitutions depend on function

Example
• A multiple alignment that (A) is and (B) is not typical of the splice boundary shown

Using mutation patterns for improving gene prediction
• Tree hidden Markov model
  • Each state
    • generates columns of a multiple alignment
    • by a substitution process
    • along the branches of a phylogenetic tree
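
The probability that one state assigns to one alignment column can be sketched with Felsenstein's pruning algorithm. This is a generic textbook sketch, not N-SCAN's model: the three-species tree, the `simple_matrix` substitution model, the uniform root prior, and all numeric values are hypothetical, and the alphabet is limited to the four nucleotides for brevity.

```python
ALPHABET = "ACGT"


def simple_matrix(p_same: float) -> dict:
    """Hypothetical branch model: keep the base with p_same, else change uniformly."""
    p_diff = (1 - p_same) / 3
    return {a: {b: (p_same if a == b else p_diff) for b in ALPHABET} for a in ALPHABET}


def build_tree(matrix: dict) -> list:
    """Toy tree ((human, mouse), chicken); every branch uses the same matrix."""
    return [([("human", matrix), ("mouse", matrix)], matrix), ("chicken", matrix)]


def partial_likelihoods(node, column: dict) -> dict:
    """P(observed leaves below node | base at node), for each possible base."""
    if isinstance(node, str):  # leaf: an indicator on the observed base
        return {a: 1.0 if column[node] == a else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child, matrix in node:  # internal node: list of (child, branch matrix)
        below = partial_likelihoods(child, column)
        for a in ALPHABET:
            result[a] *= sum(matrix[a][b] * below[b] for b in ALPHABET)
    return result


def column_prob(tree, column: dict) -> float:
    """Probability of one alignment column, with a uniform prior at the root."""
    likelihoods = partial_likelihoods(tree, column)
    return sum(0.25 * likelihoods[a] for a in ALPHABET)


conserved_state = build_tree(simple_matrix(0.9))  # e.g. a state under selection
neutral_state = build_tree(simple_matrix(0.5))    # e.g. a fast-evolving state
column = {"human": "A", "mouse": "A", "chicken": "A"}
print(column_prob(conserved_state, column), column_prob(neutral_state, column))
```

In a tree HMM each state would carry its own set of branch matrices, so a fully conserved column like the one above scores higher in states modeling constrained sequence.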

Challenges
• Columns may not be correct or orthologous
  • Sequencing error
  • Alignment error
  • Change of function (I am not a mouse!)

Differences from EXONIPHY
• Approach
  • Estimate models of actual alignments, not evolutionary processes
• Model
  • Independent substitution probabilities on each branch of the tree
  • 6 characters: A, C, G, T, gap, unaligned
  • Condition backwards from target genome

Preliminary study in human

Fin
