Hiroshi Dozono Saga University

Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient Hiroshi Dozono Saga University

Introduction (1) • The first step of genome analysis is DNA sequencing, which identifies the order of the nucleotides in DNA. • About 10 years ago, DNA sequencing was expensive and slow. • Recently, Next Generation Sequencing (NGS) can read sequences very rapidly and at low cost. • $100〜$1000 in 1 hour. • NGS produces large amounts of sequence data at once. • Gbytes〜Tbytes

Introduction (2) • After the sequences are read, further analyses are conducted. • Identify the organisms • Identify the functions of the genome • Remap the sequences onto reference sequences • Compare the genomes among organisms • Comparing genomes precisely requires a large amount of computation. • The sequence alignment method is generally used. • Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences. • It requires heavy computation when comparing a large number of sequences. • Statistical information about the sequences can serve as an indicator of the similarity among the sequences.

DNA sequencing • DNA sequence • Sequence of 4 types of nucleotides: A, G, T, C • Complementary nucleotides hybridize with each other: A-T, G-C
AGTCTTATCGATTAG
|||||||||||||||
TCAGAATAGCTAATC
• DNA sequencing - Genome analysis • Next generation sequencers can read all DNA sequences of an organism, or of several organisms, at once. • A large amount of sequencing data (from several Gbytes to Tbytes) is produced. • The result of sequencing is obtained as a collection of short fragments of the nucleotides A, G, T and C. • An effective method for identifying the features of the sequences is required.
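The pairing rule on this slide can be checked with a few lines of Python. This is a minimal sketch; the names `COMPLEMENT` and `complement` are illustrative and do not come from the presentation.

```python
# Watson-Crick pairing as stated on the slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence (same left-to-right order)."""
    return "".join(COMPLEMENT[base] for base in seq)

print(complement("AGTCTTATCGATTAG"))  # TCAGAATAGCTAATC
```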

Conventional DNA analysis • Sequencing • Reconstruction of the sequence • Identification of coding regions, which encode genes • Identification of the functions of genes • Large computational costs are incurred after sequencing • Our approach aims to extract global features of the DNA sequences without precise analysis.

Frequency based SOM • A SOM that uses the frequencies of N-tuples in DNA sequences as the input vector was proposed in T. Abe, T. Ikemura, et al., Informatics for unveiling hidden genome signatures, Genome Res., vol. 13, pp. 693-702 • For N-tuples, the dimension of the input vector is 4^N
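As a rough illustration of such an input vector, the sketch below counts overlapping N-tuples and normalizes the counts to frequencies. The function name `ntuple_frequencies`, the use of overlapping windows, the A/C/G/T ordering, and the skipping of windows with other symbols are all assumptions made for this example, not details taken from the cited paper.

```python
from itertools import product

def ntuple_frequencies(seq: str, n: int) -> list[float]:
    """Frequencies of all 4**n N-tuples over A, C, G, T (overlapping windows)."""
    tuples = ["".join(p) for p in product("ACGT", repeat=n)]
    index = {t: i for i, t in enumerate(tuples)}
    counts = [0] * len(tuples)
    total = 0
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:  # assumption: skip windows containing symbols other than A, C, G, T
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(tuples)

vec = ntuple_frequencies("ACGCTACTAG", 4)
print(len(vec))  # 256
```

For 4-tuples the vector has 4^4 = 256 components, which matches the L=256 quoted for the frequency map later in the slides.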

SOM based on correlation coefficients of nucleotides. Correlation Coefficients (CC) of DNA sequence ACGCTACTAG
A 1000010010   ρAA(n): CC between A and n-shifted A
C 0101001000   ρAC(n): CC between A and n-shifted C
G 0010000001   :
T 0000100100   ρTT(n): CC between T and n-shifted T
For all combinations of A, G, T, C and for shifts from 1 to n, 4x4xn correlation coefficients are calculated and used as the input vector of the SOM. Compared with the dimension of n-tuple frequencies (4^n), the dimension of CC is much smaller.

Using these equations, correlation coefficients can be calculated without converting DNA sequences to binary sequences.
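The equations themselves are not reproduced in this transcript. As a hedged sketch, the code below computes the Pearson correlation between the indicator sequence of one nucleotide and the shifted indicator sequence of another, using only counts taken directly from the DNA string. The function names `cc` and `cc_vector`, the restriction to the overlapping part of the two sequences, and the value 0.0 for a zero-variance case are assumptions; the author's equations may handle boundaries differently.

```python
from math import sqrt

BASES = "ACGT"

def cc(seq: str, x: str, y: str, shift: int) -> float:
    """Pearson correlation between 'x at position i' and 'y at position i + shift'."""
    n = len(seq) - shift          # number of overlapping positions
    if n <= 0:
        return 0.0
    n_x = n_y = n_xy = 0
    for i in range(n):
        is_x = seq[i] == x
        is_y = seq[i + shift] == y
        n_x += is_x               # occurrences of x in the unshifted part
        n_y += is_y               # occurrences of y in the shifted part
        n_xy += is_x and is_y     # co-occurrences at distance `shift`
    denom = sqrt(n_x * (n - n_x) * n_y * (n - n_y))
    # assumption: report 0.0 when one indicator is constant (correlation undefined)
    return (n * n_xy - n_x * n_y) / denom if denom else 0.0

def cc_vector(seq: str, max_shift: int) -> list[float]:
    """All 4 x 4 x max_shift coefficients, ordered by shift, then x, then y."""
    return [cc(seq, x, y, s)
            for s in range(1, max_shift + 1) for x in BASES for y in BASES]

print(cc("ACGCTACTAG", "A", "C", 1))   # 0.5
print(len(cc_vector("ACGCTACTAG", 4)))  # 64
```

With shifts 1 to 4 the vector has 4 x 4 x 4 = 64 components, compared with 256 for 4-tuple frequencies.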

Experimental results of SOM based on correlation coefficients • Settings of the experiments • Set 1: genes from amino acid metabolism of 6 species • Set 2: genes from 7 metabolic pathways of Homo sapiens • The sequences are segmented into fragments of 1000 bases.
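The segmentation step might look like the sketch below. The slides do not say whether fragments overlap or what happens to a trailing piece shorter than 1000 bases; this example assumes non-overlapping fragments and drops the short remainder. The name `segment` is illustrative.

```python
def segment(seq: str, length: int = 1000) -> list[str]:
    """Split a sequence into consecutive, non-overlapping fragments of `length` bases.

    Assumption: a trailing piece shorter than `length` is discarded.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]

fragments = segment("ACGT" * 625)  # a 2500-base toy sequence
print(len(fragments), len(fragments[0]))  # 2 1000
```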

Experimental results of Set 1 (1) • The resolution and topology of these maps are almost comparable. • Map of frequencies of 4-tuples • From 6 species L=256 • Map of CC of 1-4 shifts • from 6 species L=2 • L=64

Experimental results of Set 1(2) • For small dimensions, CC shows better separation.

Experimental results of Set 2 • The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.

Experimental results of virus genome

Experiments on identification of sequences • 70% of the sequence fragments are used for learning, and the remainder are used for testing. • The experiments are conducted using SOM and Supervised Pareto learning SOM, proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.

Winner and updated units • Conventional SOM • Pareto learning SOM • Overlapped neighbors are updated more strongly. • This plays an important role in the integration of multi-modal vectors.

Supervised Pareto learning SOM (SP-SOM) • The category vector can be introduced into P-SOM as an independent vector attached to each input vector. • The category vector attracts input vectors of the same category close together on the map, cooperatively with the other input vectors. • The P-SOM learning algorithm becomes supervised. • Category of test vector xt is determined as follows. • where P(xt) is the Pareto optimal set of units for xt
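The decision rule itself is not reproduced in this transcript, so the sketch below covers only the Pareto optimal set P(xt). It assumes that each unit has one distance to the input per modality (for example, one for frequencies and one for CC) and that a unit belongs to the set when no other unit is at least as close in every modality and strictly closer in at least one. The name `pareto_optimal_units` and the toy distances are hypothetical.

```python
def pareto_optimal_units(distances: list[list[float]]) -> list[int]:
    """Indices of units not dominated by any other unit.

    distances[u][m] is the distance between unit u and the input for modality m.
    """
    def dominates(a, b):
        # a dominates b: no worse in every modality, strictly better in at least one
        return (all(p <= q for p, q in zip(a, b))
                and any(p < q for p, q in zip(a, b)))

    return [u for u, d in enumerate(distances)
            if not any(dominates(other, d) for other in distances)]

# Toy example with 4 units and 2 modalities (values are made up).
toy = [[0.1, 0.9], [0.5, 0.5], [0.6, 0.6], [0.9, 0.1]]
print(pareto_optimal_units(toy))  # [0, 1, 3]
```

Unit 2 is excluded because unit 1 is closer in both modalities; the other three units each win in at least one modality or trade off between them.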

Mapping results using Supervised Pareto-learning SOM

Experimental results of identification

Conclusions (1) • We proposed a preprocessing method for DNA sequences that uses correlation coefficients of the occurrence of the nucleotides. • Using this method, the clustering results of the sequences were nearly comparable with those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.

Conclusions (2) • The Pareto learning SOM method is applied to the classification of DNA sequences by using correlation coefficients and frequencies as input vectors. • Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.

Future works • Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes. • Reduction of the computational costs of P-SOMs, which are 5 times higher than those of conventional SOMs.

Acknowledgements • This work was supported by JSPS KAKENHI Grant Number 24500279.
