
Recent Advances In De Novo Gene Prediction.

Jeltje van Baren, Laboratory for Computational Genomics (Brent lab), Washington University in Saint Louis.


Presentation Transcript


  1. Recent Advances In De Novo Gene Prediction. Jeltje van Baren, Laboratory for Computational Genomics (Brent lab), Washington University in Saint Louis

  2. [Figure labels: Genome; Program to optimize HMM-based probability model; Genome with genes]
     GGCACGAGGCAGGGCCGGTGGGATCCCCTCGGGCTCCCGCCTTAGCATGCTGGCCGGGACATCTGGTGAA
     CATGGCCTCTGCTACTGCGGCAGCAGCACGACGGGGCCTCGGCCGGGCTCTCCCTCTCCTCTGGCGTGGC
     TACCAGACCGAGCGGGGCGTTTACGGCTACCGGCCGAGGAAGCCCGAGAGCCGCGAGCCCCAGGGCGCCC
     TGGAGCGCCCCCCAGTTGATCATGGCCTTGCCAGGTTGGTGACAGTATATTGTGAGCATGGTCATAAAGC
     TGCCAAAATCAACCCCCTCTTCACCGGACAAGCCCTGCTGGAGAATGTGCCTGAAATCCAAGCCCTGGTG
     CAGACACTGCAGGGACCCTTCCACACGGCAGGATTATTGAACATGGGGAAGGAAGAGGCCTCACTTGAGG
     AAGTGTTAGTCTATCTCAATCAAATCTACTGTGGGCAGATTTCTATTGAAACCTCCCAACTTCAGAGCCA
     GGATGAGAAAGACTGGTTTGCCAAGCGGTTTGAGGAACTGCAAAAGGAGACGTTTACCACAGAAGAGCGA
     AAACATCTGTCGAAACTAATGCTGGAATCTCAGGAGTTTGACCACTTTCTGGCCACCAAGTTCTCGACAG
     TGAAGCGATATGGAGGCGAAGGGGCTGAAAGCATGATGGGCTTTTTCCACGAGCTGCTGAAAATGTCGGC…

  3. Gene finding
     Data driven (e.g. GeneWise/Ensembl)
     • Data gathering: sequence cDNA libraries
     • Gene modeling: predict genomic layout of:
       • Sequenced transcripts
       • Hypothetical similar transcripts

  4. Gene finding
     Data driven (e.g. GeneWise/Ensembl)
     • Data gathering: sequence cDNA libraries
     • Gene modeling: predict genomic layout of:
       • Sequenced transcripts
       • Hypothetical similar transcripts
     Hypothesis driven (e.g. TWINSCAN)
     • Gene modeling: predict genomic layout of hypothetical transcripts de novo

  5. Gene finding
     Data driven (e.g. GeneWise/Ensembl)
     • Data gathering: sequence cDNA libraries
     • Gene modeling: predict genomic layout of:
       • Sequenced transcripts
       • Hypothetical similar transcripts
     Hypothesis driven (e.g. TWINSCAN)
     • Gene modeling: predict genomic layout of hypothetical transcripts de novo
     • Data gathering: PCR amplify & sequence cDNA

  6. Single genome predictors
     GENSCAN, GENIE, HMMGENE, GENEID, FGENESH
     • Patterns in the DNA sequence
     • Signals for splice sites, start & stop, translation, …
     [Background: the genomic sequence shown on slide 2]

  7. Dual-genome de novo predictors
     • Patterns in the DNA sequence
     • Signals for splice sites, start & stop, translation, …
     • Patterns of natural selection

     Human protein coding region
     human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
           |||||||||||||||||||| || ||||| || || |||
     mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

     Human intron (non-coding spacer)
     human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
           ||||||     ||  | |||||||||  || ||   ||
     mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
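The contrast between the two alignments on this slide can be checked with a few lines of Python. The sketch below is only an illustration, not the scoring used by any particular predictor: it counts matching, mismatching and gapped columns and bins mismatches by column index modulo 3. The function name `column_stats` is made up for this example, and reading "index % 3 == 2" as "third codon position" assumes the coding fragment starts in frame, which the slide does not state.

```python
from collections import Counter

# Alignments copied from the slide; '-' marks a gap.
CODING = ("TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC",
          "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC")
INTRON = ("CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC",
          "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT")


def column_stats(human, mouse):
    """Count match/mismatch/gap columns and bin mismatches by column index % 3."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    phase = Counter()
    for i, (h, m) in enumerate(zip(human, mouse)):
        if h == "-" or m == "-":
            counts["gap"] += 1
        elif h == m:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
            phase[i % 3] += 1  # column index, not a verified codon position
    return counts, phase


coding_counts, coding_phase = column_stats(*CODING)
intron_counts, intron_phase = column_stats(*INTRON)

print("coding:", dict(coding_counts), "mismatch phase:", dict(coding_phase))
print("intron:", dict(intron_counts), "mismatch phase:", dict(intron_phase))
```

In the coding alignment every mismatch lands on the same phase and there are no gaps, while the intron alignment has gaps and mismatches spread over all three phases. That periodicity is the kind of selection signature a dual-genome predictor can exploit.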

  8. Dual-genome de novo predictors
     • Very effective in small genomes
     • Mammalian genomes: small coding fraction
     • Closely related informants: too much sequence similarity
     • Distantly related informants: not enough information
     • When is a base under selective constraint?

     human TTATCCA-CAGACCAGATAGATACATGTCTGCCACCCTC
           ||||||| |||||||||||| ||  |||| || || |||
     mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

  9. Multi genome predictors
     • When multiple genomes are used, evolutionary distance must be taken into account.
     • EHMM (Pedersen & Hein)
     • phylo HMM (Siepel & Haussler)
     • phylo HMM: combination of continuous-time Markov chains, describing the evolution of a residue, and a discrete HMM for gene prediction.
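To make the "continuous-time Markov chain plus discrete HMM" idea concrete, here is a minimal sketch of how an HMM state could score one alignment column of two species. It uses the Jukes-Cantor (JC69) substitution model as a deliberately simple stand-in; the published phylo HMMs use richer substitution models and full phylogenetic trees. The state names and branch lengths in `STATE_BRANCH` are hypothetical values chosen for illustration only.

```python
import math

BASES = "ACGT"


def jc69_prob(src, dst, t):
    """P(dst after time t | src) under Jukes-Cantor.

    t is measured in expected substitutions per site.
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if src == dst else 0.25 - 0.25 * decay


def column_likelihood(x, y, t_x, t_y):
    """Likelihood of bases x and y in two species joined at an unobserved root.

    The root base is summed out with a uniform prior of 1/4 per base.
    """
    return sum(0.25 * jc69_prob(r, x, t_x) * jc69_prob(r, y, t_y) for r in BASES)


# Hypothetical branch lengths per HMM state: a constrained (coding-like)
# state evolves slowly, a neutral (intron-like) state evolves faster.
STATE_BRANCH = {"constrained": 0.05, "neutral": 0.30}


def emission(state, x, y):
    """Emission probability of alignment column (x, y) in the given HMM state."""
    t = STATE_BRANCH[state]
    return column_likelihood(x, y, t, t)


for state in STATE_BRANCH:
    print(state, "A/A:", emission(state, "A", "A"), "A/G:", emission(state, "A", "G"))
```

A conserved column is more probable under the slowly evolving state and a substituted column under the fast one. This is how evolutionary distance enters the HMM emissions instead of being ignored.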

  10. Remaining challenges
      Things we should predict
      • Alternative splicing
      • Overlapping genes
      • ‘Unusual’ genes
      • UTRs
      Things we should not predict
      • Pseudogenes

  11. UTR modelling
      The problems:
      • no ORF sequence pattern
      • less conservation than coding
      • transcription start sites are hard to recognize
      [Figure label: TSS]

  12. Spliced 5’ UTRs
      [Figure: 5’ UTR structures with labels Pr, TSS and ATG…, marked ~64%, ~27% and ~9%]

  13. Spliced 5’ UTRs
      [Figure: 5’ UTR structures with segment lengths, in slide order: ~64%: ~127 nt; ~27%: ~133 nt, ~10 kb, ~62 nt; ~9%: ~133 nt, ~115 nt, ~62 nt]

  14. 5’ UTR modelling
      [Figure: model diagram with labels Prom, 5’UTR, CDS, intron]

  15. 5’ UTR modelling
      [Figure: model diagram with labels Prom, 5’UTR, UTRexon, UTRexon, intron, intron, CDS]

  16. Predicting 5’ UTRs in human
      • Method can correctly predict whether a UTR is spliced 89% of the time (if the ATG is predicted correctly, about 57%)
      [Figure labels: 36% correct, 35% correct]

  17. Pseudogenes
      • A pseudogene is a nonfunctional copy of a real gene elsewhere in the genome
      • The human genome contains between 7,800 and 13,300 pseudogenes
      • Nonprocessed pseudogenes are present in RefSeq as annotated single exon genes.

  18. Pseudogenes and gene prediction

  19. Pseudogenes and gene prediction

  20. Pseudogenes and gene prediction

  21. Pseudogenes and gene prediction

  22. Finding pseudogenes
      • Identify parent gene in transcript database
      • Database can also consist of gene predictions: iterative rounds of gene prediction and pseudogene removal
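The iterative scheme named on this slide can be sketched as a simple loop: predict genes, flag pseudogenes using the current predictions as the parent database, mask them, and predict again until nothing new is flagged. Everything in the sketch is hypothetical scaffolding: `predict` and `find_pseudogenes` stand in for a real gene predictor and a real pseudogene detector, and the toy loci exist only so the loop can run.

```python
def iterative_prediction(predict, find_pseudogenes, max_rounds=10):
    """Alternate gene prediction and pseudogene masking until no new pseudogene appears.

    predict(masked) -> set of predicted loci, given the set of masked loci.
    find_pseudogenes(genes) -> subset of genes judged to be pseudogenes.
    Both callables are placeholders for real tools.
    """
    masked = set()
    for _ in range(max_rounds):
        genes = predict(masked)
        pseudo = find_pseudogenes(genes)
        if not pseudo:
            return genes, masked
        masked |= pseudo
    # Round limit reached: predict once more so the result reflects all masking.
    return predict(masked), masked


# Toy data (hypothetical): each pseudogene locus maps to its parent gene.
PARENT = {"psiA1": "geneA", "psiB1": "geneB"}
ALL_LOCI = {"geneA", "geneB", "psiA1", "psiB1"}


def toy_predict(masked):
    """Pretend predictor: calls every locus that has not been masked."""
    return ALL_LOCI - masked


def toy_find_pseudogenes(genes):
    """Pretend detector: flags a locus whose parent is among the predictions."""
    return {g for g in genes if PARENT.get(g) in genes}


final_genes, masked_loci = iterative_prediction(toy_predict, toy_find_pseudogenes)
print(sorted(final_genes), sorted(masked_loci))
```

Passing the predictor and detector in as callables keeps the loop independent of any specific tool, which matches the slide's point that the parent database can itself be a set of predictions.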

  23. Results of pseudogene removal
      [Figure: region 5240000 to 5250000 with tracks MADHIP, Prediction, Prediction after masking, masked pseudogene]
      Gene number: from 30k to 27k

  24. Brent lab people: Randy Brown, Sam Gross, Mikhail Velikanov, Paul Flicek, Chaochun Wei, Aaron Tenney, Manimozhiyan Arumugam, Evan Keibler, Michael Stevens, Michael Brent

  25. RefSeq pseudogene

  26. Pseudogene removal: intron alignment
      [Figure labels: Predicted gene, BLAST, Known gene]
      • Align prediction to genomic region of known gene & match intron locations
      • If the intron positions do not line up, the exon is a putative pseudogene
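The intron-matching test can be illustrated with a small sketch. It assumes a gapless alignment in which coding position p of the prediction corresponds to position p + offset of the known gene; a real pipeline would derive that mapping from the BLAST and genomic alignments. The function names, the `tol` tolerance and the `min_fraction` threshold are all hypothetical choices for this example.

```python
from itertools import accumulate


def intron_positions(exon_lengths):
    """Coding-sequence offsets at which introns interrupt the exons."""
    return set(list(accumulate(exon_lengths))[:-1])


def shared_intron_fraction(pred_exons, parent_exons, offset=0, tol=0):
    """Fraction of parent introns inside the aligned span that the prediction shares.

    Returns None when the aligned span contains no parent intron,
    in which case the test is uninformative.
    """
    span_start, span_end = offset, offset + sum(pred_exons)
    pred_introns = {p + offset for p in intron_positions(pred_exons)}
    covered = [q for q in intron_positions(parent_exons) if span_start < q < span_end]
    if not covered:
        return None
    hits = sum(any(abs(q - p) <= tol for p in pred_introns) for q in covered)
    return hits / len(covered)


def is_putative_pseudogene(pred_exons, parent_exons, offset=0, tol=0, min_fraction=0.5):
    """Flag the prediction when too few parent introns line up (threshold is illustrative)."""
    fraction = shared_intron_fraction(pred_exons, parent_exons, offset, tol)
    return fraction is not None and fraction < min_fraction


parent = [120, 90, 150]  # known gene: introns after coding positions 120 and 210
print(is_putative_pseudogene([300], parent, offset=30))          # single exon spanning both introns
print(is_putative_pseudogene([90, 90, 120], parent, offset=30))  # introns line up with the parent
```

A single-exon prediction that spans the parent's intron positions without any intron of its own is flagged, while a prediction whose introns coincide with the parent's is kept.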

  27. Using conserved synteny in pseudogene finding
      • If a (predicted) gene has a homolog elsewhere in the genome:
        • Map homolog to mouse and human genome
        • If there is a better hit in the human genome than in the mouse conserved syntenic region: possible pseudogene
      [Figure labels: human, mouse, ?]
