Recent Advances In De Novo Gene Prediction. Jeltje van Baren Laboratory for Computational Genomics (Brent lab) Washington University in Saint Louis.
GGCACGAGGCAGGGCCGGTGGGATCCCCTCGGGCTCCCGCCTTAGCATGCTGGCCGGGACATCTGGTGAA
CATGGCCTCTGCTACTGCGGCAGCAGCACGACGGGGCCTCGGCCGGGCTCTCCCTCTCCTCTGGCGTGGC
TACCAGACCGAGCGGGGCGTTTACGGCTACCGGCCGAGGAAGCCCGAGAGCCGCGAGCCCCAGGGCGCCC
TGGAGCGCCCCCCAGTTGATCATGGCCTTGCCAGGTTGGTGACAGTATATTGTGAGCATGGTCATAAAGC
TGCCAAAATCAACCCCCTCTTCACCGGACAAGCCCTGCTGGAGAATGTGCCTGAAATCCAAGCCCTGGTG
CAGACACTGCAGGGACCCTTCCACACGGCAGGATTATTGAACATGGGGAAGGAAGAGGCCTCACTTGAGG
AAGTGTTAGTCTATCTCAATCAAATCTACTGTGGGCAGATTTCTATTGAAACCTCCCAACTTCAGAGCCA
GGATGAGAAAGACTGGTTTGCCAAGCGGTTTGAGGAACTGCAAAAGGAGACGTTTACCACAGAAGAGCGA
AAACATCTGTCGAAACTAATGCTGGAATCTCAGGAGTTTGACCACTTTCTGGCCACCAAGTTCTCGACAG
TGAAGCGATATGGAGGCGAAGGGGCTGAAAGCATGATGGGCTTTTTCCACGAGCTGCTGAAAATGTCGGC…
[Figure: Genome → Program to optimize HMM-based probability model → Genome with genes]
Gene finding
Data driven
• Gene modeling: predict genomic layout of sequenced transcripts and of hypothetical similar transcripts (e.g. GeneWise/Ensembl)
• Data gathering: sequence cDNA libraries
Hypothesis driven
• Gene modeling: predict genomic layout of hypothetical transcripts de novo (e.g. TWINSCAN)
• Data gathering: PCR amplify & sequence cDNA
Single genome predictors
GENSCAN, GENIE, HMMGENE, GENEID, FGENESH
• Patterns in the DNA sequence
• Signals for splice sites, start & stop, translation, …
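HMM-based predictors label each base of a sequence by finding the most probable path through the model's states. The block below is a minimal sketch of that decoding step (the Viterbi algorithm) on a toy two-state model. The state names, the GC-rich/AT-rich emission probabilities, and all parameter values are invented for illustration; real gene finders use far richer models with explicit states for exons, introns, and splice signals.

```python
import math

# Toy model (all values are illustrative assumptions, not taken from any real gene finder).
STATES = ["coding", "noncoding"]
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},      # GC-rich
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # AT-rich
}


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty sequence."""
    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + math.log(emit_p[s][seq[i]])
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("GGGGGGAAAAAA", STATES, START, TRANS, EMIT))
```

Working in log space avoids numerical underflow on long sequences, which matters once the input is a chromosome rather than a dozen bases.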
Dual-genome de novo predictors
• Patterns in the DNA sequence
• Signals for splice sites, start & stop, translation, …
• Patterns of natural selection
Human protein coding region
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Human intron (non-coding spacer)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
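The difference between the two alignments can be made concrete by counting matches, mismatches, and gap columns. The sketch below does this for the human/mouse pairs shown on the slide; the function name and the choice to report 0-based positions are assumptions made for illustration.

```python
def alignment_stats(seq_a, seq_b):
    """Count matches, mismatches and gap columns in a pairwise alignment.

    Returns (matches, mismatches, gaps, mismatch_positions), where the
    positions are 0-based alignment columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    mismatch_positions = []
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
            mismatch_positions.append(i)
    return matches, mismatches, gaps, mismatch_positions


# Alignments copied from the slide.
CODING_HUMAN = "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC"
CODING_MOUSE = "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC"
INTRON_HUMAN = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
INTRON_MOUSE = "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT"

if __name__ == "__main__":
    print(alignment_stats(CODING_HUMAN, CODING_MOUSE))
    print(alignment_stats(INTRON_HUMAN, INTRON_MOUSE))
```

On the coding example the five mismatches fall at 0-based columns 20, 23, 29, 32, and 35, i.e. every third position and no gaps. If the reading frame starts at the first column (an assumption here), these are third codon positions, which is the pattern expected under selection on the protein sequence. The intron example instead shows 8 mismatches and 7 gap columns with no periodicity.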
Dual-genome de novo predictors
• Very effective in small genomes
• Mammalian genomes: small coding fraction
• Closely related informants: too much sequence similarity
• Distantly related informants: not enough information
• When is a base under selective constraint?
human  TTATCCA-CAGACCAGATAGATACATGTCTGCCACCCTC
       ||||||| |||||||||||| ||  |||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Multi genome predictors
• When multiple genomes are used, evolutionary distance must be taken into account.
• EHMM (Pedersen & Hein)
• phylo HMM (Siepel & Haussler)
• phylo HMM: combination of continuous-time Markov chains for describing the evolution of a residue and a discrete HMM for gene prediction.
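To illustrate how a continuous-time Markov chain brings evolutionary distance into the model, the sketch below computes substitution probabilities along a branch of length t. It uses the Jukes-Cantor (JC69) model purely because it has a simple closed form; that choice is an assumption of this example, and the cited methods are not claimed to use it.

```python
import math

BASES = "ACGT"


def jc69_transition_matrix(t):
    """Substitution probabilities P[a][b] after branch length t under JC69.

    t is measured in expected substitutions per site. Under JC69 all bases
    are equally frequent and all substitutions equally likely.
    """
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability the base is unchanged
    diff = 0.25 - 0.25 * decay   # probability of each specific other base
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}


def column_probability(base_x, base_y, t):
    """Probability of observing an aligned pair (x, y) separated by distance t."""
    stationary = 0.25  # uniform base frequencies under JC69
    return stationary * jc69_transition_matrix(t)[base_x][base_y]


if __name__ == "__main__":
    for t in (0.0, 0.1, 0.5, 2.0):
        p = jc69_transition_matrix(t)
        print(t, round(p["A"]["A"], 4), round(p["A"]["G"], 4))
```

At t = 0 every base is unchanged, and as t grows each row approaches 0.25 for all four bases. This mirrors the slide's point about informants: a very short distance gives columns that are identical everywhere, and a very long one gives columns that carry no information.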
Remaining challenges
Things we should predict
• Alternative splicing
• Overlapping genes
• ‘Unusual’ genes
• UTRs
Things we should not predict
• Pseudogenes
UTR modelling
The problems:
• no ORF sequence pattern
• less conservation than coding
• transcription start sites are hard to recognize
[Figure label: TSS]
Spliced 5’ UTRs
[Figure: three 5’ UTR structures drawn from the promoter (Pr) and TSS to the ATG, with their frequencies and segment lengths]
• ~64%: ~127 nt
• ~27%: ~133 nt, ~10 kb, ~62 nt
• ~9%: ~133 nt, ~115 nt, ~62 nt
5’ UTR modelling
[Figure: model diagram with the elements Prom, 5’UTR, UTRexon, intron, and CDS]
Predicting 5’ UTRs in human
• Method can correctly predict whether a UTR is spliced 89% of the time (if the ATG is predicted correctly, about 57%)
[Figure labels: 36% correct, 35% correct]
Pseudogenes
• A pseudogene is a nonfunctional copy of a real gene elsewhere in the genome
• The human genome contains between 7,800 and 13,300 pseudogenes
• Nonprocessed pseudogenes are present in RefSeq as annotated single-exon genes.
Finding pseudogenes
• Identify parent gene in transcript database
• Database can also consist of gene predictions: iterative rounds of gene prediction and pseudogene removal
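The iterative scheme can be sketched as a loop that alternates prediction and masking until no new pseudogenes turn up. Everything in the block below is hypothetical scaffolding: the three callables stand in for a gene predictor, a pseudogene detector, and a masking step, and the toy genome at the bottom exists only to make the loop runnable.

```python
def iterative_prediction(genome, predict_genes, find_pseudogenes, mask, max_rounds=10):
    """Alternate gene prediction and pseudogene masking until nothing new is found.

    predict_genes(genome)     -> list of predicted genes
    find_pseudogenes(genes)   -> predictions judged to be pseudogenes
    mask(genome, pseudogenes) -> genome with those regions masked
    All three are placeholders supplied by the caller.
    """
    genes = predict_genes(genome)
    for _ in range(max_rounds):
        pseudogenes = find_pseudogenes(genes)
        if not pseudogenes:
            break  # converged: no prediction looks like a pseudogene
        genome = mask(genome, pseudogenes)
        genes = predict_genes(genome)  # re-predict on the masked sequence
    return genes


# Toy stand-ins: a "genome" is a dict mapping locus name -> is_pseudogene flag.
TOY_GENOME = {"geneA": False, "geneB": False, "psi1": True}


def toy_predict(genome):
    return sorted(genome)  # the toy predictor calls every unmasked locus a gene


def toy_find(genes):
    return [g for g in genes if TOY_GENOME[g]]


def toy_mask(genome, pseudogenes):
    return {k: v for k, v in genome.items() if k not in pseudogenes}


if __name__ == "__main__":
    print(iterative_prediction(TOY_GENOME, toy_predict, toy_find, toy_mask))
```

The `max_rounds` cap is a safety measure added for this sketch so the loop terminates even if the detector keeps flagging predictions.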
Results of pseudogene removal
[Figure: genome browser view, coordinates 5240000 to 5250000, with tracks MADHIP, Prediction, Prediction after masking, and masked pseudogene]
• Gene number from 30k to 27k
Brent lab people
Randy Brown, Sam Gross, Mikhail Velikanov, Paul Flicek, Chaochun Wei, Aaron Tenney, Manimozhiyan Arumugam, Evan Keibler, Michael Stevens, Michael Brent
Pseudogene removal: intron alignment
[Figure: predicted gene matched to a known gene by BLAST]
• Align prediction to genomic region of known gene & match intron locations
• If the intron positions do not line up, the exon is a putative pseudogene
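The intron-matching test can be sketched by converting each gene structure into the transcript coordinates of its exon-exon junctions and comparing the two lists. The block below is a simplified illustration: it assumes the prediction and the known gene align end to end without gaps, so junction positions are directly comparable, and it flags a prediction only when one of its introns has no counterpart in the known gene. Function names, the `tolerance` parameter, and the coordinates are all hypothetical.

```python
def junction_positions(exons):
    """Transcript coordinates of exon-exon junctions.

    exons: list of (start, end) half-open genomic intervals in transcript order.
    A junction position is the spliced length accumulated before each intron.
    """
    positions = []
    spliced_length = 0
    for start, end in exons[:-1]:  # no junction after the last exon
        spliced_length += end - start
        positions.append(spliced_length)
    return positions


def is_putative_pseudogene(predicted_exons, known_exons, tolerance=0):
    """True if any predicted intron fails to line up with a known intron.

    Assumes both transcripts share one coordinate system (ungapped,
    end-to-end alignment), which a real pipeline would have to establish.
    """
    known = junction_positions(known_exons)
    for position in junction_positions(predicted_exons):
        if not any(abs(position - k) <= tolerance for k in known):
            return True
    return False


if __name__ == "__main__":
    known = [(100, 200), (300, 400), (500, 600)]            # junctions at 100, 200
    same_structure = [(1000, 1100), (5000, 5100), (9000, 9100)]
    shifted_intron = [(1000, 1150), (2000, 2150)]           # junction at 150
    print(is_putative_pseudogene(same_structure, known))
    print(is_putative_pseudogene(shifted_intron, known))
```

A single-exon prediction has no junctions and is never flagged by this rule, so intronless copies would need a separate check.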
Using conserved synteny in pseudogene finding
• If a (predicted) gene has a homolog elsewhere in the genome:
  • Map homolog to mouse and human genome
  • If there is a better hit in the human genome than in the mouse conserved syntenic region: possible pseudogene
[Figure labels: human, mouse, ?]