1 / 29

Prokaryotic gene finding

Prokaryotic gene finding. Marie Skovgaard Ph.D. student marie@cbs.dtu.dk. Prokarya. Can you spot the gene?. >AE006641 GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA

Download Presentation

Prokaryotic gene finding

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Prokaryotic gene finding Marie Skovgaard Ph.D. student marie@cbs.dtu.dk

  2. Prokarya

  3. Can you spot the gene? >AE006641 GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG / (ATG|TTG|GTG)((…)*?)(TGA|TAG|TAA)/

  4. Identifying open reading frames / (ATG|TTG|GTG)((…)*?)(TGA|TAG|TAA)/

  5. A. pernix (43% AT)

  6. Why care about over annotated genes? • Genome comparison: • Fraction of known proteins • Average gene length • Amino acid composition • The quality of our databases • To gain biological knowledge

  7. Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/ The regular expression is able to find all posible sequences, but do not distinguish between the consensus sequence and the highly unlikely sequence: ACAC—ATC or TGCT--AGG Weigth matrixes can be used to score the sequence but do not deal with insertions and deletions. ACA---ATG TCAACTATC ACAC--AGC AGA---ATG ACCG--ATC Regular expression

  8. 0.4 A 0.2 C 0.4 G 0.2 T 0.2 0.6 0.6 A 0.8 C G T 0.2 A C 0.8 G 0.2 T A 0.8 C 0.2 G T A 1.0 C G T A C G 0.2 T 0.8 A C 0.8 G 0.2 T 1.0 0.4 1.0 1.0 1.0 Markov model ACA---ATG TCAACTATC ACAC--AGC AGA---ATG ACCG--ATC

  9. Begin End Profile HMM • Profile HMM have a predefined architecture and the parameters are estimated from multiple sequence alignments. • Profile HMM are not usefull for gene finding, since all genes in an organism can not be aligned in a meaningfull way.

  10. ATG GTG TTG A T G C A T G C A T G C TAG TAA TGA S2 S3 S4 S5 S1 Markov Model for gene finding Define a simple architecture: / (ATG|TTG|GTG)((…)*?)(TGA|TAG|TAA)/

  11. Knowledge of the structure of genes is used to define the architecture of the model. Sequences (x) from known genes are used to estimate the parameters of the model – training of the model. The training is done by counting the number of times a nucleotide occur in a given state and dividing this number with the number of sequences used in training giving the frequencies. Markov models

  12. ATG GTG TTG A T G C A T G C A T G C TAG TAA TGA Sequence x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 …..….xn S1 S2 S3 S4 S5 States S2 S3 S4 S5 S1 Training

  13. 0.98 A: 0.25 T: 0.23 G: 0.27 C: 0.25 ATG: 0.77 TTG: 0.11 GTG: 0.12 CTG: 0.00 A: 0.22 T: 0.24 G: 0.27 C: 0.27 A: 0.26 T: 0.24 G: 0.25 C: 0.25 TAG: 0.6 TAA: 0.3 TGA: 0.1 S2 S3 S4 S5 S1 Model after training The trained model can be used to search for genes in DNA sequences.

  14. Sequence ATG A T T T C G C G C G A T ……….T A G S1 S2 S3 S4 S5 0.77 0.00 0.00 0.00 States 0.00 (0.22*0.77) 0.00 0.00 0.00 0.00 (0.23*0.22*0.77) 0.00 0.00 0.00 0.00 (0.24*0.23*0.22*0.77) 0.00 0.00 0.00 =P(x|M) Searching with the HMM

  15. Log-Odds score • The propability of a sequence gets infinitly small as the sequence x becomes longer. • This is solved by defining a background (NULL) model. For example a random distribution: A=T=C=G=0.25 • From this the Log-Odds score can be calculated: -log(P(x|M)/P(x|NULL)) • A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.

  16. ATG GTG TTG A T G C A T G C A T G C TAG TAA TGA S2 S3 S4 S5 S1 Is the model to simple?

  17. Codon usage • Synonymous codons incode the same amino acid. At random synonymous codons would be expected to be used with equal frequencies. In real life synonomous codons have different frequencies. • Different species have consistent and characteristic codon biases. Lateral transferred genes and genes from plasmids and phages will have atypical codon usage. • Variations in codon usage within an organism can be modelled in different coding models in the HMM.

  18. 1stPosition 2nd Position 3rdPosition U C A G U 30,407 Phe22,581 Phe18,943 Leu18,629 Leu 11,523 Ser11,766 Ser9,793 Ser12,195 Ser 22,048 Tyr16,669 Tyr2,706 Stop326 Stop 7,062 Cys8,846 Cys1,260 Stop20,756 Trp UCAG C 15,018 Leu15,104 Leu5,316 Leu71,710 Leu 9,569 Pro7,491 Pro11,496 Pro31,614 Pro 17,631 His13,272 His20,912 Gln39,285 Gln 28,458 Arg29,968 Arg4,860 Arg7,404 Arg UCAG A 41,375 Ile34,261 Ile5,967 Ile37,994 Met 12,223 Thr31,889 Thr9,683 Thr19,682 Thr 24,189 Asn29,529 Asn45,812 Lys14,076 Lys 11,982 Ser21,907 Ser2,899 Arg1,694 Arg UCAG G 24,910 Val20,800 Val14,850 Val35,979 Val 20,808 Ala34,770 Ala27,468 Ala45,862 Ala 43,817 Asp25,996 Asp53,780 Glu24,312 Glu 33,731 Gly40,396 Gly10,902 Gly15,118 Gly UCAG Fields : [number] [amino acid]

  19. AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC ATG GTG TTG TAG TAA TGA S1 S2 S3 Is the model to simple?

  20. AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC ATG GTG TTG TAG TAA TGA S1 S3 S2 S4 HMM for gene finding

  21. AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC AAA ATA AGA ACA TAA TTA TGA TCA AAT ATT AGT ACT TAT TTT TGT TCT AAG ATG AGG ACG TAG TTG TGG TCG AAC ATC AGC ACC TAC TTC TGC TCC GAA GTA GGA GCA CAA CTA CGA CCA GAT GTT GGT GCT CAT CTT CGT CCT GAG GTG GGG GCG CAG CTG CGG CCG GAC GTC GGC GCC CAC CTC CGC CCC ATG GTG TTG TAG TAA TGA S E Multiple coding models

  22. Order of the model • A zero order Markov model (state) has a propability of letter in the state – the propabilities are independent of the previous sequence. The NULL model is a zero order Markov model (A=T=G=C=0.25). • The propability of a letter in a first order Markov model depends on the previous letter (di-nucleotide distributions). • Second order depends on the two previous letters (corresponding to a codon).

  23. Order of the coding model • Inter-codon denpendencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder. • The use of higher order coding models in gene finding is a way to capture these inter-codon denpendencies. • Higher order models requires more training data and more computational time when searching.

  24. The Shine-Dalgarno sequence • The ribosome binds to the messenger RNA through baseparing to the 30S ribosomal subunit. • The binding site is the Shine-Dalgarno sequence (SD). • The SD is a purine-rich sequence (consensussequence: AGGAG) at the 5' end of most prokaryotic mRNAs. • The SDis found 5-10 basepairs upstream from the start codon.

  25. EasyGene

  26. R. prowazekii

  27. GeneMark.hmm http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115. EasyGene http://cbs.dtu.dk/services/EasyGene Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”. BMC Bioinformatics 2003, 4:21

More Related