
A markovian approach for the analysis of the gene structure

C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2). (1) Biométrie et Biologie Evolutive UMR CNRS 5558, Université Claude Bernard Lyon 1, France. (2) Institut Camille Jordan UMR CNRS 5208, Université Claude Bernard Lyon 1, France.


Presentation Transcript


  1. A markovian approach for the analysis of the gene structure. C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2). (1) Biométrie et Biologie Evolutive UMR CNRS 5558, Université Claude Bernard Lyon 1, France. (2) Institut Camille Jordan UMR CNRS 5208, Université Claude Bernard Lyon 1, France. PRABI

  2. Contents • Introduction • HMM for the genomic structure of DNA sequences • Discrimination method based on HMM • Conclusion • Direction of research

  3. Introduction • Intensive sequencing • Genes represent only 3% of the human genome • Markovian models are widely used for the identification of genes • We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs

  4. Introduction Hidden Markov model (HMM) • Advantages: each state represents a different type of region in the sequence; the complexity of the algorithm is linear with respect to the length of the sequence • Drawback: the distribution of the sojourn time in a given state is geometric, whereas the empirical distribution of the length of the exons is not geometric

  5. HMM for the genomic structure of DNA sequences Structure of the HMM • Two states: No CDS and CDS (CDS: coding sequence) • Transition probabilities: t1, 1-t1, t2, 1-t2 • Base probabilities in one state: A pA, C pC, G pG, T pT • Base probabilities in the other state: A qA, C qC, G qG, T qT
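As an illustration of the two-state structure on slide 5 and of the linear-time property mentioned on slide 4, here is a minimal Python sketch of the forward algorithm. The slides give only the symbols t1, t2, pA..pT and qA..qT, so every numerical value below, the assignment of each base table to a state, the uniform start distribution and the name `forward_loglik` are assumptions made for this example.

```python
import math

STATES = ("noCDS", "CDS")

# Hypothetical parameter values; the slides only name the symbols.
TRANS = {
    "noCDS": {"noCDS": 0.99, "CDS": 0.01},
    "CDS": {"noCDS": 0.02, "CDS": 0.98},
}
EMIT = {
    "noCDS": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "CDS": {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22},
}
START = {"noCDS": 0.5, "CDS": 0.5}  # assumed uniform start


def forward_loglik(seq):
    """Return log P(seq) with the scaled forward algorithm (one pass over seq)."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    loglik = 0.0
    for base in seq[1:]:
        # Rescale at each step to avoid numerical underflow on long sequences.
        scale = sum(alpha.values())
        loglik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
        # Propagate through the transitions, then emit the next base.
        alpha = {
            t: EMIT[t][base] * sum(alpha[s] * TRANS[s][t] for s in STATES)
            for t in STATES
        }
    return loglik + math.log(sum(alpha.values()))


print(forward_loglik("ATGGCGCCGTAA"))
```

Each base is visited once and the work per base depends only on the number of states, which is why the cost grows linearly with the sequence length.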

  6. HMM for the genomic structure of DNA sequences Several biological properties of DNA sequences were taken into account • Model of order 5

  7. HMM for the genomic structure of DNA sequences Several biological properties of DNA sequences were taken into account • Model of order 5 [Figure: hidden states St-6, St-5, ..., St-1, St above the observed bases Xt-6, Xt-5, ..., Xt-1, Xt]


  12. HMM for the genomic structure of DNA sequences Several biological properties of DNA sequences were taken into account • Length distributions of exons and introns according to their position in genes: Internal intron Initial intron Initial exon Terminal intron Internal exon Intergenic region Terminal exon Single exon


  16. HMM for the genomic structure of DNA sequences Several biological properties of DNA sequences were taken into account • Length distributions of exons and introns according to their position in genes: Internal intron Initial intron Initial exon Terminal intron Internal exon Intergenic region Terminal exon Single exon • Direct and reverse strands

  17. HMM for the genomic structure of DNA sequences Several biological properties of DNA sequences were taken into account • Codons: [Figure: the Exon state split into three states, frame 0, frame 1 and frame 2, with transitions labelled p and 1-p]

  18. HMM for the genomic structure of DNA sequences The sojourn time in an HMM state must follow a geometric law • T: sojourn time in a given state (length of a hidden state), shown here for the CDS state with self-transition probability p and exit probability 1-p • T follows a geometric law: P(T = 1) = 1-p, P(T = 2) = p (1-p), P(T = 3) = p^2 (1-p), …, P(T = n) = p^(n-1) (1-p)
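The geometric law of slide 18 can be checked numerically. The sketch below uses the slide's convention (p is the probability of staying in the state, 1-p the probability of leaving it); the value p = 0.9 and the function names are assumptions chosen only for illustration.

```python
import random


def sojourn_pmf(n, p):
    """P(T = n) when the state is kept with probability p at each step."""
    return p ** (n - 1) * (1 - p)


def simulate_sojourn(p, rng):
    """Draw one sojourn time by repeating the self-transition until exit."""
    t = 1
    while rng.random() < p:
        t += 1
    return t


p = 0.9  # hypothetical self-transition probability
rng = random.Random(0)
draws = [simulate_sojourn(p, rng) for _ in range(100_000)]

# The empirical mean should be close to the theoretical mean 1 / (1 - p).
print(sum(draws) / len(draws), 1 / (1 - p))
# The probabilities should sum to (almost) 1.
print(sum(sojourn_pmf(n, p) for n in range(1, 500)))
```

The probability P(T = n) decreases with n from its maximum at n = 1, which is the shape that the later slides contrast with the observed exon lengths.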


  21. HMM for the genomic structure of DNA sequences Method: estimation of the length of a region • A geometric law does not fit the empirical distribution of the length of exons • We suggest replacing a single state with several successive states (State → State 1, State 2) • Good fit with sums of 5 geometric random variables [Figures: probability against length of the internal exons]

  22. HMM for the genomic structure of DNA sequences Method: estimation of the length of a region • Data: human genome, extracted from HOVERGEN • Different length distributions: * Sum of m geometric laws of equal parameter, with m = 1..7 * Sum of 2 or 3 geometric laws of different parameters • For each region: * We choose the parameters that minimize the Kolmogorov-Smirnov distance * We do not use the maximum likelihood method
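The fitting procedure of slide 22 can be sketched as a grid search over sums of geometric laws of equal parameter. This is an assumption-laden illustration, not the authors' code: the lengths are simulated rather than taken from HOVERGEN, the parameter q denotes the per-base probability of leaving a state, the grid is arbitrary, and all function names are hypothetical.

```python
import math
import random


def sum_geom_pmf(n, m, q):
    """P(L = n) for L the sum of m geometric variables on {1, 2, ...} with exit probability q."""
    if n < m:
        return 0.0
    return math.comb(n - 1, m - 1) * q**m * (1 - q) ** (n - m)


def ks_distance(lengths, m, q):
    """Largest gap between the empirical CDF and the model CDF over integer lengths."""
    total = len(lengths)
    counts = {}
    for x in lengths:
        counts[x] = counts.get(x, 0) + 1
    emp_cdf = model_cdf = worst = 0.0
    for n in range(1, max(lengths) + 1):
        emp_cdf += counts.get(n, 0) / total
        model_cdf += sum_geom_pmf(n, m, q)
        worst = max(worst, abs(emp_cdf - model_cdf))
    return worst


def fit_by_ks(lengths, m_values, q_grid):
    """Return (distance, m, q) for the grid point with the smallest KS distance."""
    return min((ks_distance(lengths, m, q), m, q) for m in m_values for q in q_grid)


rng = random.Random(1)


def draw_geom(q):
    """One geometric draw on {1, 2, ...}: count steps until the exit event."""
    k = 1
    while rng.random() >= q:
        k += 1
    return k


# Synthetic "exon lengths": sums of 5 geometric variables with q = 1/26.
lengths = [sum(draw_geom(1 / 26) for _ in range(5)) for _ in range(2000)]
q_grid = [1 / d for d in range(10, 61, 2)]
best = fit_by_ks(lengths, range(1, 8), q_grid)
print(best)
```

A sum of m geometric variables with the same parameter corresponds to m identical states visited in series, so the fitted model remains an ordinary HMM.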

  23. HMM for the genomic structure of DNA sequences Results: Estimation of the length of a region [Figure: probability against length of the initial exon, comparing maximum likelihood estimation and Kolmogorov-Smirnov estimation]

  24. HMM for the genomic structure of DNA sequences Results: Estimation of the length distribution of internal exons • Sum of 5 geometric laws, p=1/26 • The model fits the empirical distribution very well [Figure: probability against length of the internal exons]

  25. HMM for the genomic structure of DNA sequences Results: Estimation of the length distribution of intronless genes • Sum of 2 geometric laws, p=1/440 • Many small genes with single exons are pseudogenes
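To get a feel for the fitted values on slides 24 and 25, the mean and variance of a sum of geometric laws can be computed directly. The sketch assumes that p on these two slides is the per-base probability of leaving a state (so one geometric law has mean 1/p), which differs from slide 18 where p is the probability of staying, and that both laws in the intronless case share the same parameter. The function name is hypothetical.

```python
from fractions import Fraction


def sum_geom_mean_var(m, q):
    """Mean and variance of the sum of m geometric laws on {1, 2, ...} with exit probability q."""
    mean = m / q
    var = m * (1 - q) / q**2
    return mean, var


# Internal exons: sum of 5 geometric laws with parameter 1/26.
mean, var = sum_geom_mean_var(5, Fraction(1, 26))
print(mean, var)  # 130 3250

# Intronless genes: sum of 2 geometric laws with parameter 1/440.
print(sum_geom_mean_var(2, Fraction(1, 440))[0])  # 880
```

Under this reading, the fitted model gives internal exons a mean length of 130 bases and intronless genes a mean length of 880 bases.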


  27. Discrimination method based on HMM Method: A model for initial, internal, terminal exons • Emission probabilities for each state are estimated by the frequencies of words with 6 letters (model of order 5)
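The estimation described on slide 27 amounts to counting 6-letter words and normalising by the count of their 5-letter prefix. A minimal sketch follows; the pseudocount (added so that unseen words do not get probability zero) is an assumption that the slides do not mention, and `train_order_k` is a hypothetical name.

```python
from collections import Counter


def train_order_k(seqs, k=5, pseudo=1.0):
    """Estimate P(base | previous k bases) from the counts of (k+1)-letter words."""
    word_counts = Counter()
    for s in seqs:
        for i in range(len(s) - k):
            word_counts[s[i:i + k + 1]] += 1
    # Count of each k-letter context = sum over the words that start with it.
    context_counts = Counter()
    for word, c in word_counts.items():
        context_counts[word[:k]] += c

    def prob(context, base):
        # Pseudocount smoothing over the 4 bases (assumption, not from the slides).
        return (word_counts[context + base] + pseudo) / (
            context_counts[context] + 4 * pseudo
        )

    return prob


# Toy training set; real use would take the exon sequences of one region type.
prob = train_order_k(["ACGTACGTACGT"], k=5)
print(prob("ACGTA", "C"))  # 0.5
```

In the toy sequence the context ACGTA is followed by C twice, so the smoothed estimate is (2 + 1) / (2 + 4) = 0.5.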

  28. Discrimination method based on HMM Method: A model for initial, internal, terminal exons • Emission probabilities for each state are estimated by the frequencies of words with 6 letters (model of order 5) • Discrimination method to test the homogeneity between regions: D = { log P(S | HMM1) - log P(S | HMM2) } / |S| (Eq. 1), where S is the test sequence of length |S|, HMM1: initial exon, HMM2: internal exon • The sequence is characterized by the HMM with the best likelihood
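Eq. 1 can be sketched in a few lines once a likelihood is available for each model. To stay short, the example below replaces each HMM with a single-state Markov chain of order k with pseudocounts, leaves the first k bases of the test sequence unscored, and uses order 1 on toy sequences; all of these are simplifying assumptions, and the names `train`, `loglik` and `discrimination` are hypothetical.

```python
import math
from collections import Counter


def train(seqs, k, pseudo=1.0):
    """Count (k+1)-letter words and their k-letter contexts in the training sequences."""
    counts = Counter(s[i:i + k + 1] for s in seqs for i in range(len(s) - k))
    contexts = Counter()
    for word, c in counts.items():
        contexts[word[:k]] += c
    return counts, contexts, k, pseudo


def loglik(seq, model):
    """Log-likelihood of seq under the model, skipping the first k bases (no context)."""
    counts, contexts, k, pseudo = model
    total = 0.0
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        total += math.log(
            (counts[context + base] + pseudo) / (contexts[context] + 4 * pseudo)
        )
    return total


def discrimination(seq, model1, model2):
    """D of Eq. 1: per-base difference of log-likelihoods; D > 0 favours model1."""
    return (loglik(seq, model1) - loglik(seq, model2)) / len(seq)


# Toy models of order 1; the slides use order 5 on real exon sets.
model1 = train(["GCGCGCGCGCGC"], k=1)
model2 = train(["ATATATATATAT"], k=1)
print(discrimination("GCGCGC", model1, model2))
```

Dividing by |S| makes scores comparable between test sequences of different lengths.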

  29. Discrimination method based on HMM Quality of the decision: we want to know if models are well adapted to their regions (HMMs are compared pairwise) • Decision on N {initial exon sequences}: N1 are recognized as initial exons, N-N1 as internal exons • Each model is characterized by the frequency of sequence recognition
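The frequency of sequence recognition can be computed from the discrimination scores of the N test sequences. The sketch assumes that a sequence is attributed to the first model when its score D is strictly positive (ties go to the second model); the scores shown are made-up values, and the function name is hypothetical.

```python
def recognition_frequency(scores):
    """Return (N1, N - N1, N1 / N) for a list of discrimination scores D."""
    n1 = sum(1 for d in scores if d > 0)  # sequences attributed to the first model
    return n1, len(scores) - n1, n1 / len(scores)


# Made-up scores for four initial exon sequences tested against two models.
print(recognition_frequency([0.2, -0.1, 0.4, 0.05]))  # (3, 1, 0.75)
```

A frequency close to 1 suggests the two models separate the regions well, while a value near 0.5 suggests the regions are hard to tell apart.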

  30. Discrimination method based on HMM • Results: Comparison of different HMMs on different test sequences • Initial exon ≠ Internal exon • Initial exon ≠ Terminal exon • Internal exon ≈ Terminal exon


  34. Discrimination method based on HMM • Results: Break in the homogeneity of the first coding exon To determine the break point in first exon sequences, we consider different HMMs: • The HMM representing the initial exon was split into 2 HMMs around the kth base • A “Start” HMM is trained on the first k bases • An “End” HMM is trained on the remaining bases [Figure: initial exon HMM divided at base k into HMM Start and HMM End]
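Preparing the two training sets of slide 34 is a simple split of each initial exon at base k. In this sketch, exons shorter than k contribute only to the "Start" set, which is an assumption since the slides do not say how short exons were handled; `split_at` is a hypothetical name.

```python
def split_at(seqs, k):
    """Split each sequence at base k into a Start part and an End part."""
    start_parts = [s[:k] for s in seqs]
    end_parts = [s[k:] for s in seqs if len(s) > k]  # nothing remains for short exons
    return start_parts, end_parts


# Toy sequences; the slides explore several values of k on real initial exons.
start_parts, end_parts = split_at(["ACGTACGT", "ACG"], 4)
print(start_parts, end_parts)
```

Repeating the split for several values of k and comparing the resulting pairs of models is one way to look for the position where the composition changes.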

  35. Discrimination method based on HMM • Results: Break in the homogeneity of the first coding exon [Figure: model M_EI80 compared with the other models]


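  39. Discrimination method based on HMM • Results: Initial exons with signal peptide (SignalP) [Figure: shares of sequences recognized by HMM Start and HMM End; values shown: 25% and 75%]

  40. Discrimination method based on HMM • Results: Initial exons with signal peptide (SignalP) • HMM Start characterizes the signal peptide well [Figure: shares of sequences recognized by HMM Start and HMM End; values shown with signal peptide: 25% and 75%; without signal peptide: 10% and 90%]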


  42. Conclusion Modelling of the exon length distribution: • Sums of geometric laws fit the distribution of exon lengths well • The model has relatively few parameters • Sum of 5 geometric laws of the same parameter (internal exons) • Sum of 3 geometric laws of different parameters (terminal exons) Discrimination method based on HMM: • Bad annotation of intronless genes in the database • Homogeneity between internal and terminal exons • Break of homogeneity of the initial exon around the 80th base: signal peptide



  48. Direction of research Markovian models for the analysis of the organization of genomes [Figure: chromosome 9 profiles of GC content, gene density, intron size, repeated elements and gene expression (Versteeg 2003)]

  49. Direction of research Structure superposition in genomes • Chromosome • Isochore level • Gene level • Exon-intron level (intron, exon) • Codon level (acc gcc agt tac ccc aga)

  50. Direction of research Scan the genome • Build 3 HMMs adapted to the organization structure of each of the 3 isochore classes H, M and L: H = [72%, 100%], M = (56%, 72%), L = [0%, 56%] • Human chromosomes are divided into overlapping 100 kb segments. Two successive segments overlap by half of their length. • Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment] • Each segment is characterized by the model with the best probability
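The scanning step of slide 50 can be sketched in two parts: cutting a chromosome into half-overlapping 100 kb windows, and turning the three log-likelihoods of a window into posterior probabilities with Bayes' rule. The uniform prior over H, L and M, the log-likelihood values and the function names are assumptions made for this illustration; in practice the log-likelihoods would come from the three HMMs.

```python
import math


def segments(length, size=100_000):
    """Half-overlapping windows (start, end) covering a sequence of the given length."""
    step = size // 2
    return [
        (start, min(start + size, length))
        for start in range(0, max(length - step, 1), step)
    ]


def posteriors(logliks, priors=None):
    """P[Model | Segment] from log P(Segment | Model), using a stable log-sum-exp."""
    models = list(logliks)
    if priors is None:
        priors = {m: 1 / len(models) for m in models}  # assumed uniform prior
    joint = {m: logliks[m] + math.log(priors[m]) for m in models}
    top = max(joint.values())
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(joint[m] - norm) for m in models}


print(segments(250_000))

# Made-up log-likelihoods of one segment under the three models.
post = posteriors({"H": -1000.0, "L": -1000.0 + math.log(3), "M": -2000.0})
print(max(post, key=post.get), post)
```

Working with log-likelihoods and subtracting the largest one before exponentiating avoids underflow, which matters for windows of 100 kb.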
