Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient. Hiroshi Dozono, Saga University.
Introduction (1) • The first step of genome analysis is DNA sequencing, which identifies the order of the nucleotides in a DNA molecule. • About 10 years ago, DNA sequencing was expensive and slow. • Recently, Next Generation Sequencing (NGS) has made it possible to read sequences very rapidly and at low cost. • $100 to $1000 in 1 hour. • NGS produces large amounts of sequence data at once. • Gbytes to Tbytes
Introduction (2) • After the sequences are read, further analyses are conducted. • Identify the organisms • Identify the functions of the genome • Map the sequences onto reference sequences • Compare the genomes among organisms • Comparing genomes precisely requires a large amount of computation. • The sequence alignment method is generally used. • Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences. • It requires a large amount of computation when a large number of sequences are compared. • Statistical information about the sequences can serve as an indicator of the similarity among them.
DNA sequencing • DNA sequence • A sequence of 4 types of nucleotides: A, G, T, C • Complementary nucleotides hybridize with each other: A-T, G-C
AGTCTTATCGATTAG
|||||||||||||||
TCAGAATAGCTAATC
• DNA sequencing - Genome analysis • Next generation sequencers can read all DNA sequences of one organism, or of several organisms, at once. • A large amount of sequencing data (from gigabytes to terabytes) is produced. • The result of sequencing is obtained as a collection of short fragments of the nucleotides A, G, T and C. • An effective method for identifying the features of the sequences is required.
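The base pairing on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function names `complement_strand` and `reverse_complement` are hypothetical and are not taken from the slides.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(seq):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement_strand(seq)[::-1]


# The two strands drawn on the slide are complements of each other.
print(complement_strand("AGTCTTATCGATTAG"))  # TCAGAATAGCTAATC
```

Because a sequencer may read either strand, a feature extractor may need to treat a fragment and its reverse complement consistently; whether the slides do so is not stated.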
Conventional DNA analysis • Sequencing • Reconstruction of the sequence • Identification of the coding regions, which code for genes • Identification of the functions of the genes • The steps after sequencing require large computational costs. • Our approach aims to extract global features of the DNA sequences without precise analysis.
Frequency based SOM • A SOM that uses the frequencies of N-tuples in DNA sequences as the input vector was proposed in T. Abe, T. Ikemura, et al., "Informatics for unveiling hidden genome signatures," Genome Res., vol. 13, pp. 693-702. • For N-tuples, the dimension of the input vector is 4^N.
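As a rough illustration of the frequency input vector, the sketch below counts overlapping N-tuples and normalizes the counts. The function name `ntuple_frequencies`, the A, C, G, T ordering, and the choice to skip windows containing other symbols are assumptions made for this example, not details taken from the cited paper.

```python
from itertools import product


def ntuple_frequencies(seq, n):
    """Relative frequencies of all 4**n overlapping n-tuples in seq."""
    tuples = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = dict.fromkeys(tuples, 0)
    total = 0
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:  # assumption: skip windows with N or other symbols
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(tuples)
    return [counts[t] / total for t in tuples]


vec = ntuple_frequencies("ACGCTACTAG", 2)
print(len(vec))  # 16, i.e. 4**2
```

For 4-tuples the vector has 4^4 = 256 elements, so the dimension grows quickly with N.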
SOM based on correlation coefficients of nucleotides • Correlation coefficients (CC) of the DNA sequence ACGCTACTAG, with one binary occurrence sequence per nucleotide:
A 1000010010
C 0101001000
G 0010000001
T 0000100100
• ρAA(n): CC between A and n-shifted A • ρAC(n): CC between A and n-shifted C • ... • ρTT(n): CC between T and n-shifted T • For all combinations of A, G, T, C and for shifts from 1 to n, 4x4xn correlation coefficients are calculated and used as the input vector of the SOM. • Compared with the dimension of the n-tuple frequencies (4^n), the dimension of the CC vector is much smaller.
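A minimal sketch of this construction follows. It assumes that each coefficient is the ordinary Pearson correlation between the occurrence sequence of one nucleotide and the shifted occurrence sequence of another, taken over their overlapping positions; the slide does not spell out how the ends are handled, so that part is an assumption. The names `indicator`, `pearson` and `cc_vector` are hypothetical.

```python
from math import sqrt

BASES = "ACGT"


def indicator(seq, base):
    """Binary occurrence sequence: 1 where seq has `base`, else 0."""
    return [1 if s == base else 0 for s in seq]


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cc_vector(seq, max_shift):
    """4 x 4 x max_shift coefficients, ordered by shift, then first base, then second."""
    ind = {b: indicator(seq, b) for b in BASES}
    vec = []
    for shift in range(1, max_shift + 1):
        for a in BASES:
            for b in BASES:
                # Compare a at position i with b at position i + shift.
                vec.append(pearson(ind[a][:-shift], ind[b][shift:]))
    return vec


print(indicator("ACGCTACTAG", "A"))     # [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
print(len(cc_vector("ACGCTACTAG", 4)))  # 64
```

With shifts 1 to 4 the vector has 4 x 4 x 4 = 64 elements, compared with 256 for 4-tuple frequencies.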
Using these equations, correlation coefficients can be calculated without converting DNA sequences to binary sequences.
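The equations referred to here did not survive in this transcript. As a stand-in, the sketch below uses the standard identity for the Pearson correlation of two binary variables, which needs only counts: with N overlapping positions, n_a occurrences of a among the first N bases, n_b occurrences of b among the last N bases, and n_ab positions where a is followed by b at distance n, the coefficient is (N·n_ab − n_a·n_b) / sqrt(n_a(N − n_a)·n_b(N − n_b)). Whether this matches the slide's own equations term by term is an assumption; the function name `cc_from_counts` is hypothetical.

```python
from math import sqrt


def cc_from_counts(seq, a, b, shift):
    """Correlation between base a at position i and base b at position i + shift,
    computed from counts only (no binary sequences are built)."""
    n = len(seq) - shift  # number of overlapping positions
    if n <= 0:
        raise ValueError("shift must be smaller than the sequence length")
    n_a = n_b = n_ab = 0
    for i in range(n):
        is_a = seq[i] == a
        is_b = seq[i + shift] == b
        n_a += is_a
        n_b += is_b
        n_ab += is_a and is_b
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:  # one of the bases never (or always) occurs
        return 0.0
    return (n * n_ab - n_a * n_b) / sqrt(denom)


print(cc_from_counts("ACGCTACTAG", "A", "C", 1))  # 0.5
```

For the example sequence with shift 1, N = 9, n_a = 3, n_b = 3 and n_ab = 2, which gives (18 − 9) / 18 = 0.5.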
Experimental results of SOM based on correlation coefficients • Settings of the experiments • Set 1: genes from amino acid metabolism of 6 species • Set 2: genes from 7 metabolic pathways of Homo sapiens • The sequences are segmented into fragments of 1000 bases.
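The segmentation step can be sketched as follows. The slide gives only the fragment length; non-overlapping fragments and dropping a shorter final fragment are assumptions of this example, and `segment` is a hypothetical name.

```python
def segment(seq, length=1000, keep_tail=False):
    """Cut seq into consecutive non-overlapping fragments of `length` bases."""
    fragments = [seq[i:i + length] for i in range(0, len(seq), length)]
    # Assumption: a final fragment shorter than `length` is dropped by default.
    if fragments and not keep_tail and len(fragments[-1]) < length:
        fragments.pop()
    return fragments


gene = "ACGT" * 600  # 2400 bases of toy data
print([len(f) for f in segment(gene)])  # [1000, 1000]
```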
Experimental results of Set 1 (1) • Map of frequencies of 4-tuples from 6 species (L=256) • Map of CC of 1-4 shifts from 6 species (L=64) • The resolution and topology of these maps are nearly comparable.
Experimental results of Set 1(2) • For small dimensions, CC shows better separation.
Experimental results of Set 2 • The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.
Experiments on identification of sequences • 70% of the sequence fragments are used for learning, and the remainder are used for testing. • The experiments are conducted using SOM and Supervised Pareto learning SOM, proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.
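A 70/30 split of the fragments might look like the sketch below. The slide does not say how fragments were assigned to the two sets, so the random shuffle, the fixed seed, and the name `split_fragments` are all assumptions.

```python
import random


def split_fragments(fragments, train_ratio=0.7, seed=0):
    """Shuffle the fragments and split them into learning and test sets."""
    items = list(fragments)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    cut = int(round(len(items) * train_ratio))
    return items[:cut], items[cut:]


train, test = split_fragments(["frag%d" % i for i in range(10)])
print(len(train), len(test))  # 7 3
```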
Winner and updated units • Conventional SOM • Pareto learning SOM • Overlapped neighbors are updated more strongly. • This plays an important role in the integration of multi-modal vectors.
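The difference between the two winner rules can be sketched as follows, assuming that each unit has one distance per modality to the input, that the winners of the Pareto learning SOM are the units not dominated in all modalities by another unit, and that every winner updates its own neighborhood. The 1-D map, the Gaussian neighborhood, and the names `pareto_winners` and `update_strength` are simplifications chosen for this example, not the author's implementation.

```python
from math import exp


def pareto_winners(distances):
    """distances[u] is a tuple of per-modality distances from unit u to the input.
    Returns the indices of units that no other unit dominates."""
    def dominates(p, q):
        # p dominates q if it is no worse in every modality and better in one.
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [u for u, du in enumerate(distances)
            if not any(dominates(dv, du)
                       for v, dv in enumerate(distances) if v != u)]


def update_strength(n_units, winners, sigma=1.0):
    """Sum of the Gaussian neighborhoods of all winners on a 1-D map."""
    return [sum(exp(-((u - w) ** 2) / (2 * sigma ** 2)) for w in winners)
            for u in range(n_units)]


# Two modalities: units 0, 1 and 2 are Pareto optimal, units 3 and 4 are dominated.
dists = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.6, 0.6), (0.95, 0.95)]
print(pareto_winners(dists))                         # [0, 1, 2]
# With a single modality the rule reduces to the conventional best-matching unit.
print(pareto_winners([(0.3,), (0.1,), (0.2,)]))      # [1]
```

In `update_strength(5, [0, 2])`, unit 1 lies in the neighborhoods of both winners and so receives a larger total than unit 3, which is close to only one of them. This is the sense in which overlapped neighbors are updated more strongly.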
Supervised Pareto learning SOM (SP-SOM) • In P-SOM, a category vector can be attached to each input vector as an independent vector. • The category vector, in cooperation with the other input vectors, draws the input vectors of the same category close together on the map. • The P-SOM learning algorithm thus becomes supervised. • Category of test vector xt is determined as follows. • where P(xt) is the Pareto optimal set of units for xt
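The decision equation itself is missing from this transcript. One plausible reading, offered only as an assumption, is that the learned category vectors of the units in P(xt) are summed element by element and the category with the largest total is chosen. The sketch below implements that reading; `classify` and the toy category vectors are hypothetical.

```python
def classify(category_vectors, pareto_set):
    """Pick the category whose summed weight over the Pareto optimal units is largest.

    category_vectors[u][k] is the learned weight of category k at unit u.
    pareto_set is the list of unit indices in P(xt).
    NOTE: this summation rule is an assumption, not the slide's confirmed equation.
    """
    n_categories = len(category_vectors[0])
    totals = [sum(category_vectors[u][k] for u in pareto_set)
              for k in range(n_categories)]
    best = max(range(n_categories), key=lambda k: totals[k])
    return best, totals


# Toy map with 4 units and 2 categories.
cats = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.0, 1.0]]
best, totals = classify(cats, [0, 1, 2])
print(best)  # 0
```

Under this reading the test vector's own category vector is not needed: only the input modalities determine P(xt), and the units' category vectors supply the vote.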
Conclusions (1) • We proposed a preprocessing method for DNA sequences that uses correlation coefficients of the occurrence of the nucleotides. • With this method, the clustering results of the sequences were nearly comparable with those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.
Conclusions (2) • The Pareto learning SOM method was applied to the classification of DNA sequences, using correlation coefficients and frequencies as input vectors. • Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.
Future work • Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes. • Reduction of the computational cost of P-SOMs, which is 5 times more than that of conventional SOMs.
Acknowledgements • This work was supported by JSPS KAKENHI Grant Number 24500279.