Genomic Sequence Analysis using Electron-Ion Interaction Potential
Masumi Kobayashi
Performance Evaluation Laboratory, University of Aizu
Purpose
• To find gene regions by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).
• To judge the similarity of two DNA sequences in a way that shortens the processing time, again by using the Lindley equation and EIIP.
DNA
• A DNA sequence consists of four nucleotide letters: A (adenine), T (thymine), G (guanine), and C (cytosine).
• Base A always pairs with base T, and base C always pairs with base G; the two paired strands form a double helix.
DNA Sequence and Amino Acid Sequence
• A DNA sequence is a string of four nucleotides. Each nucleotide triplet is called a codon, and each codon corresponds to an amino acid.
DNA Sequence        |・・・|ATG|CGA|TAT|AAA|GCT|TTC|・・・|
Amino Acid Sequence |・・・| M | R | Y | K | A | F |・・・|
Codon
• Of the 64 possible codons, 61 are translated into amino acids.
• For example, both TTT and TTC code for phenylalanine (F).
• The remaining 3 codons, TAA, TAG, and TGA, are called stop codons.
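The codon-to-amino-acid mapping can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code; the names `CODON_TABLE` and `translate` are hypothetical and do not come from the talk.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))
```

For example, `translate("ATGCGATATAAAGCTTTC")` returns `"MRYKAF"`, and exactly three entries of `CODON_TABLE` map to `"*"`.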
The waiting time of a customer in queuing theory and a DNA sequence
• To use the Lindley equation, we need to describe the relation between the waiting time of a customer in queuing theory and a DNA sequence.
• A score is assigned for the similarity of the amino acids of two target gene sequences, and the sum of the scores is made to correspond to the waiting time in queuing theory.
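Lindley Equation
$$W_n = \max(W_{n-1} + X_n,\ 0), \qquad W_0 = 0$$
• $X_n$: the score of the n-th letter.
• $W_n$: the sum of the scores up to the n-th letter of the amino acid sequence.
• Whenever the running sum would take a negative value, it is reset to zero.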
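The Lindley recursion is short enough to write out directly. The sketch below assumes the usual form of the recursion, a running sum that starts at zero and is clipped at zero; the function name `lindley` is hypothetical.

```python
def lindley(scores):
    """Return the Lindley path for a sequence of per-letter scores.

    The running sum starts at 0 and is reset to 0 whenever it would go negative.
    """
    path = []
    w = 0.0
    for x in scores:
        w = max(0.0, w + x)
        path.append(w)
    return path
```

With the scores `[1, -2, 3, -1]` the path is `[1.0, 0.0, 3.0, 2.0]`: the second step would reach -1 and is clipped to 0.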
Electron-Ion Interaction Potential (EIIP)
• Prof. Toyoizumi and Tuchiya showed a technique for finding gene coding regions by using the Lindley equation. However, the scores that the Lindley equation requires were determined artificially.
• In this research, we determine the scores theoretically by using the Electron-Ion Interaction Potential. Each amino acid is represented by its EIIP value, which describes the average energy state of all valence electrons in that amino acid.
Gene Finding Experiment
• The target sequence of this experiment is the genome data of Escherichia coli O157:H7 Sakai.
• Escherichia coli O157:H7 Sakai is a major food-borne pathogen that causes diarrhea, colitis, and hemolytic uremic syndrome.
• We calculate using the Lindley equation and EIIP.
Example of Amino Acid Scores and the Stop Codon Score (1)
• Score = EIIP - 0.0885
• Stop Codon Score = -2 × 0.0085
• Amino acids fall into a negative-score group and a positive-score group.
Example of Amino Acid Scores and the Stop Codon Score (2-1)
• Score = EIIP - 0.0445
• Stop Codon Score = -2 × 0.0445
• Amino acids fall into a negative-score group and a positive-score group.
Example of Amino Acid Scores and the Stop Codon Score (2-2)
• Change the stop codon score: -0.089 → -0.178 (-4 × 0.0445)
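The scoring rule in these examples, an EIIP value shifted by a constant plus a separate penalty for stop codons, can be sketched as follows. The EIIP values in the table are those commonly tabulated in the literature and are an assumption here; they should be checked against a primary source. The names `EIIP` and `eiip_score` are hypothetical.

```python
# EIIP values per amino acid (one-letter codes). Assumption: these are the
# values commonly tabulated in the literature, not values taken from the talk.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def eiip_score(residue, offset, stop_score):
    """Score one residue: EIIP minus a constant offset, or stop_score for a stop codon ('*')."""
    if residue == "*":
        return stop_score
    return EIIP[residue] - offset
```

Passing the stop codon score separately makes it easy to compare penalties: `stop_score=-2 * 0.0445` gives the -0.089 of example (2-1), and `stop_score=-4 * 0.0445` gives the -0.178 of example (2-2).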
Threshold of Amino Acid Sequence
• The cumulative score $W_n$ may become high by chance in a region that is meaningless as an amino acid sequence.
• The threshold is used to distinguish meaningful regions from such meaningless regions.
• The scores of an amino acid sequence are assumed to be independent and identically distributed.
• Under this assumption, $W_n$ can be regarded as the waiting time of a GI/GI/1 queuing system.
Threshold and the Probability of Exceeding the Threshold by Chance
• The waiting time of a GI/GI/1 queuing system satisfies inequalities that bound the probability of exceeding any given level.
• This bound is the probability that a meaningless sequence is judged to be a meaningful sequence.
• The threshold is set so that the probability that the cumulative score exceeds it by chance is 0.05.
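One standard bound of this kind for a GI/GI/1 waiting time is the Kingman-type exponential bound $P(W \ge h) \le e^{-\theta^* h}$, where $\theta^* > 0$ solves $E[e^{\theta X}] = 1$ for the per-letter score $X$. Assuming the threshold is chosen from a bound of this form (an assumption; the talk's exact inequalities may differ), the sketch below finds $\theta^*$ by bisection for a discrete score distribution and returns the smallest $h$ with $e^{-\theta^* h} \le 0.05$. The function names are hypothetical.

```python
import math


def adjustment_coefficient(values, probs):
    """Positive root theta of E[exp(theta * X)] = 1 for a discrete score X.

    Requires a negative mean score and at least one positive score.
    """
    mean = sum(v * p for v, p in zip(values, probs))
    has_positive = any(v > 0 and p > 0 for v, p in zip(values, probs))
    if mean >= 0 or not has_positive:
        raise ValueError("need a negative mean score and some positive score")

    def g(theta):
        return sum(p * math.exp(theta * v) for v, p in zip(values, probs)) - 1.0

    # g(0) = 0, g dips below zero (negative mean) and then grows without bound,
    # so double hi until g(hi) > 0 and bisect on (0, hi].
    lo, hi = 0.0, 1.0
    while g(hi) <= 0:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi


def chance_threshold(values, probs, alpha=0.05):
    """Level h such that exp(-theta * h) equals alpha under the assumed bound."""
    theta = adjustment_coefficient(values, probs)
    return math.log(1.0 / alpha) / theta
```

As a toy check, a score of +1 with probability 0.25 and -1 with probability 0.75 gives $\theta^* = \ln 3$, so the 0.05 threshold is $\ln 20 / \ln 3$, roughly 2.73.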
Distinction of gene coding regions and junk regions by Threshold
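One possible reading of this distinction is to split the Lindley path into excursions (maximal runs of positive values) and keep only those whose peak exceeds the threshold. The sketch below follows that reading; the excursion-based definition and the name `significant_excursions` are assumptions, not the procedure confirmed by the talk.

```python
def significant_excursions(path, threshold):
    """Return (start, peak) index pairs for excursions whose maximum exceeds threshold.

    An excursion is a maximal run of consecutive positive values in the path.
    """
    regions = []
    start = peak = None
    for i, w in enumerate(path):
        if w > 0:
            if start is None:
                start = peak = i          # a new excursion begins
            elif w > path[peak]:
                peak = i                  # new maximum within this excursion
        else:
            if start is not None and path[peak] > threshold:
                regions.append((start, peak))
            start = peak = None           # path returned to zero
    if start is not None and path[peak] > threshold:
        regions.append((start, peak))     # excursion still open at the end
    return regions
```

Excursions that never reach the threshold are treated as junk regions; the reported span runs from where the score leaves zero to where it peaks.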
Similarity Comparison Experiment
• The target sequences of this experiment are the genome data of human α- and β-hemoglobins.
• Hemoglobin is contained in erythrocytes and consists of “heme”, which contains iron, and “globin”, which is a protein; it plays the important role of carrying oxygen through the body.
• We calculate using the Lindley equation and EIIP.
Sequences of Human α- and β-Hemoglobins
• The genome data that we use are the gene coding regions of human α- and β-hemoglobins.
• A gene coding region of human β-hemoglobin:
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
• A gene coding region of human α-hemoglobin:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Amino Acid and the Stop Codon Scores
• Amino Acid Score = EIIP - 0.0532
• Stop Codon Score = -2 × 0.0532
Calculation Results of the Cumulative Score $W_n$ in α-Hemoglobin and β-Hemoglobin
The Difference (Absolute Value) of the Calculation Results of the Cumulative Score $W_n$ in α-Hemoglobin and β-Hemoglobin
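The comparison on these slides can be sketched as two linear passes followed by a position-wise subtraction. The sketch assumes that sequences of different length are compared over the shorter length, which the slides do not state; the score table in the usage note is a toy, and the function names are hypothetical.

```python
def lindley_path(seq, score):
    """Lindley path of a sequence under a per-letter score table (a dict)."""
    w, path = 0.0, []
    for letter in seq:
        w = max(0.0, w + score[letter])
        path.append(w)
    return path


def path_difference(seq_a, seq_b, score):
    """Position-wise absolute difference of two Lindley paths.

    Assumption: the comparison stops at the end of the shorter sequence.
    """
    path_a = lindley_path(seq_a, score)
    path_b = lindley_path(seq_b, score)
    return [abs(a - b) for a, b in zip(path_a, path_b)]
```

With the toy table `{"A": 1, "B": -1}`, `path_difference("AAB", "ABA", ...)` gives `[0.0, 2.0, 0.0]`. Each sequence is scanned once, so the cost grows linearly with sequence length.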
Conclusion
• Using the Lindley equation and EIIP, we were able to find gene regions in a DNA sequence.
• Using the Lindley equation and EIIP, we demonstrated a similarity comparison technique that shortens the processing time.