CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun

CISD 794 – Knowledge Discovery in DatabasesProf. Junping Sun DENCLUE - Clustering DNA Sequences using FFT (Fast Fourier Transform ) Gilberto R dos Santos

Clustering • Cluster analysis is an unsupervised learning method that constitutes a cornerstone of an intelligent data analysis process. • Some DNA sequences may reach several millions of base pairs. Any search method tried to have a direct hit comparing two DNA sequences may have an algorithm that grows O[n*n], being n the number of bases in the DNA sequence. Several methods are being combined to optimize sequence search and matching. One initial approach is to build clusters of DNA sequences.

Functions in Time or Frequency • A physical process can be described either in the time domain, by the values of some quantity h as a function of time t, e.g., h(t), or else in the frequency domain, where the process is specified by giving its amplitude H (generally a complex number indicating phase also) as function of frequency f, that is H(f), with - ∞ < f < + ∞ • h(t) and H(f) are two different representations of the same function. • H(f) = h(t) e ** ( 2* π * i * f * t ) dt • h(t) = H(f) e ** ( -2 * π * i * f * t ) dt

FFT –Fast Fourier Transform Fourier Equation Where An and Bn are

DNA –Sequence – PDB: 1b05

DNA –Sequence – PDB: 1b25

DNA and FFT (DFT) • DFT /FFT (Discrete Fourier Transformation) can be used to define a wave that is the closest representation of a DNA sequence function h(t). DFT transformation technique can be used to reduce the search time of range queries. This method can be used in conjunction with any other existing method, or can be used as a filtration technique. The challenge of this research is to define how DFT (FFT) can represent a DNA sequence. The proximity search seeks the sequences close enough to a given query sequence either through direct alignment or using other heuristics. The alignment of biological sequences ( pairwise or multiple alignment) is the operation to place nucleotide or amino acid residues in columns inferring the closest common ancestral relationships. The best alignment usually refers to the one demonstrating the most likely evolutionary scenario.

Wave Composition

Data Structures Used • Data matrix • (two modes) • Dissimilarity matrix • (one mode)

Euclidean Distance • The Euclidean Distance for the matrix d(i,j) is • Dx (i,j) = sqrt ( (Xi1 – Xj1 ) ** 2 + (Xi2 – Xj2 ) ** 2 + .. + ( Xip – Xjp )** 2 ) • Dy (i,j) = sqrt ( (Yi1 – Yj1 ) ** 2 + (Yi2 – Yj2 ) ** 2 + .. + ( Yip – Yjp )** 2 ) • Dz (i,j) = sqrt ( (Zi1 – Zj1 ) ** 2 + (Zi2 – Zj2 ) ** 2 + .. + ( Zip – Zjp )** 2 )

Euclidean Distance cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1; cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1; cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1; d_v_dist_a number:=0; d_v_dist_b number:=0; v_dist_a number:=0; v_dist_b number:=0; r2 dna_variance_coef_xy%rowtype; rec3 dna_variance_coef_yz%rowtype; rec4 dna_variance_coef_zx%rowtype; begin for r1 in c1 loop select * into r2 from dna_variance_coef_xy where frequency = r1.frequency ; begin select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */ sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ), sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) ) into v_dist_a,v_dist_b from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id; update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF = round(v_dist_b) where current of c1; exception when others then null; end; end loop; commit;

Euclidean Distance - sine coefficient – Function [X]  Y

Euclidean Distance - cosine coefficient – Function (X)  Y

DENCLUE – Clustering Based on Density Distribution Functions The influence of each data point can be formally modeled using a mathematical function (influence function), that describes the impact of a data point within its neighborhood. The overall density of the data space can be modeled analytically as the sum of the influence functions of all data points Clusters can then be determined mathematically by identifying density atractors, where density atractors are local maxima of the overall density functions. d(X,Y) should be reflexive and symetric  Euclidean distance Function. Density Attractor/Density-Attracted Points -local maximum of the density function -density-attracted points are determined by a gradient-based hill-climbing method Center-Defined Cluster A center-defined cluster with density-attractor x* ( ) is the subset of the database which is density-attracted by x*.

DNA Sequence – 1b5s ALA Residue

Denclue: Technical Essence • Uses grid cells but only keeps information about grid cells that do actually contain data points and manages these cells in a tree-based access structure. • Influence function: describes the impact of a data point within its neighborhood. • Overall density of the data space can be calculated as the sum of the influence function of all data points. • Clusters can be determined mathematically by identifying density attractors. • Density attractors are local maximal of the overall density function.

Gradient: The steepness of a slope • Example

Density Attractor Function A point x* Î is called a density-attractor for a given influence function, iff x* a local maximum of the density-function A point x Î is density-attracted to a density attractor x*, iff $k Î N : d ( , x*) <= x with

Center Defined Clusters A center-defined cluster (wrt to s , x ) for a density attractor x* is a subset C Í D, with x Î C being density-attracted by x* and (x*) >= x. Points x Î D are called outliers if they are density-attracted by a local maximum with ( ) < x

DENCLUE AlgorithmLocal Density Function Two cubes c1, c2 Î Cp are connected if d(mean(c1); mean(c2)) <= 4s near(x) = { x1 Î c1 | d(mean(c1),x) < ks (note: k=4)

Density Attractors

DENCLUE Algorithm (cont.) After determining the density-attractor x* for a point x and the point x is classified and attached to the cluster belonging to x*.

Gaussian Distance cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1; cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1; cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1; d_v_dist_a number:=0; d_v_dist_b number:=0; v_dist_a number:=0; v_dist_b number:=0; r2 dna_variance_coef_xy%rowtype; rec3 dna_variance_coef_yz%rowtype; rec4 dna_variance_coef_zx%rowtype; begin for r1 in c1 loop select * into r2 from dna_variance_coef_xy where frequency = r1.frequency ; begin select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */ sum(exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ), sum( exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) ) into v_dist_a,v_dist_b from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id; update DNA_SEQUENCE_FREQ_XY set GAUSSIAN_ACOEF = round(v_dist_a), GAUSSIAN_BCOEF = round(v_dist_b) where current of c1; exception when others then null; end; end loop; commit; end; /

Gaussian Distance - sine coefficient  xy  ALA Residue

Gaussian Distance - cosine coefficient  xy  ALA Residue

Gradient Distance cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1; cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1; cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1; d_v_dist_a number:=0; d_v_dist_b number:=0; v_dist_a number:=0; v_dist_b number:=0; r2 dna_variance_coef_xy%rowtype; rec3 dna_variance_coef_yz%rowtype; rec4 dna_variance_coef_zx%rowtype; begin for r1 in c1 loop select * into r2 from dna_variance_coef_xy where frequency = r1.frequency ; begin select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */ sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ), sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2)/ r2.var_bcoef )) ) into v_dist_a,v_dist_b from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id; update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF = round(v_dist_b) where current of c1; exception when others then null; end; end loop; commit;

Gradient Gaussian Distance Cos  xy  ALA Residue

References • Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences Gonzalo Navarro and Mathieu Raffinot • Numerical Recipes in C – William H. Press / Saul A. Teukolsky / William T. Vetterling / Brian P. Flannery - http://library.lanl.gov/numerical • Introduction to Algorithms – Thomas H. Cormen / Charles E. Leiserson / Ronald L. Rivest • Introduction to Bioinformatics - Arthur M. Lesk • Handbook of Algorithms and Data Structures, G.H. Gonnet & R. Baeza-Yates. 6. Data Mining: Opportunities and Challenges - John Wang • Who is Fourier? A Mathematical Adventure –Alan Gleason 8. Mathematical Handbook for Scientists and Engineers – Granino A. Korn and Theresa M. Korn 9. An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg, Daniel A. Keim • Data Mining: Concepts and Techniques Jiawei Han and Micheline Kamber

CISD 794 – Knowledge Discovery in Databases Prof. Junping Sun