Investigating the efficacy of the Kohonen algorithm for transcription expression clustering, and of motif detection in promoter regions for promoter structure analysis. Numerical analysis and applications to microarray data on breast cancer progression.
Developing computational methodologies for genome transcript mapping
Daniela Bianchi*°, Raffaele Calogero°, Brunello Tirozzi*
*Department of Physics, University “La Sapienza”
°Bioinformatics and Genomics Unit, Turin
Structure of the thesis
Aim of the thesis: understanding the mechanisms underlying transcription expression modulation.
• Part 1. Transcription expression clustering: I have investigated the efficacy of the Kohonen algorithm.
• Part 2. Promoter structure analysis: I have addressed the problem of motif detection in putative promoter regions.
WHAT DOES GENE CLUSTERING MEAN?
• It means dividing the interval to which the expression levels of the genes belong into an optimal partition. The partition is optimal if the associated global classification error E is minimal.
CLASSIFICATION
• Let I be a partition of the interval (0, A) into N disjoint intervals I1, …, IN.
• Let ωi, i = 1, …, N, be the “centers” of these intervals.
• A given gene with expression level x is classified as ωi if x ∈ Ii.
• The classification error is |x - ωi|.
KOHONEN NETWORK
[Figure: input layer, training data set, winner neuron, weights ω1 … ω6]
• It is an artificial neural network with a single layer of neurons.
• Each neuron is associated with a weight ωi, the ‘center’ of the atom Ii.
• The number of neurons is equal to the number of atoms.
• When an input pattern x is presented to the network, it is ‘mapped’ to the neuron with the closest weight. The weights change during the learning process and tend to the values determined by the distribution of the input data.
LEARNING OF THE KOHONEN NETWORK
• A data point x is presented to the network and all the differences |ωi(0) - x| are computed.
• The winner neuron v is chosen as the neuron with minimal difference |ωv(0) - x|.
• The weight of this neuron is changed; in some cases the weights of the neighboring neurons are changed as well.
• This procedure is repeated with another input.
• At the end of the process the set of input data is partitioned into disjoint sets (the clusters), and the weights associated with the neurons are the centers of the partition’s groups (the fixed values to which the weights converge).
UPDATE RULE
• Each weight is updated by:
ωi(n+1) = ωi(n) + η(n) Γ(i,v) ( ξ(n+1) - ωi(n) )
where 0 ≤ η(n) < 1 and η(n+1) ≤ η(n).
η(n) is called the learning parameter and is crucial for the convergence of the algorithm.
Γ(i,v) is called the neighborhood function of the winner and determines the width of the activation area around the winner (a code sketch follows the neighborhood-function slides).
NEIGHBORHOOD FUNCTION (1/2)
• A convenient choice is the step (“bubble”) function Λ(i,v), equal to 1 when |i - v| ≤ s and 0 otherwise, where s is the width of the neighborhood of the winner v.
NEIGHBORHOOD FUNCTION (2/2)
• Another choice is a Gaussian function h(i,v,n) centered on the winner v, whose width decreases with the iteration number n.
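The learning procedure above can be summarized in a few lines of code. A minimal sketch in Python, assuming a 1/(1+n)-type decay for η(n) and a step neighborhood of width s; both schedules are illustrative choices, not necessarily the exact ones used in the thesis:

```python
import numpy as np

def kohonen_1d(data, n_neurons, eta0=0.5, s=1, rng=None):
    """One-dimensional Kohonen learning with a step neighborhood."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # initial weights: random values in the data range, sorted
    w = np.sort(rng.uniform(data.min(), data.max(), n_neurons))
    for n, x in enumerate(data):
        v = int(np.argmin(np.abs(w - x)))   # winner: neuron with closest weight
        eta = eta0 / (1.0 + eta0 * n)       # decaying learning parameter eta(n)
        for i in range(n_neurons):
            if abs(i - v) <= s:             # step neighborhood Lambda(i, v)
                w[i] += eta * (x - w[i])    # update rule
    return w

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, 20000)
print(kohonen_1d(data, n_neurons=6))        # limit weights after one pass
```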
THE ORDER
The Kohonen network has the ordering property.
• In one dimension a configuration of weights is ordered when ω1 < ω2 < … < ωN (or the reverse ordering holds).
• If the Kohonen learning algorithm applied to a one-dimensional configuration of weights converges, the configuration becomes ordered at a certain step of the process, and the same order is preserved at every subsequent step of the algorithm.
THE PARAMETER η(n)
The convergence of the Kohonen algorithm strongly depends on the rate of decay of the learning parameter η(n). Numerically I have observed three regimes, according to the decay rate:
• no convergence (neither in mean nor almost everywhere);
• convergence in mean, but not almost everywhere;
• convergence almost everywhere.
Parameters of the neighborhood function
• The convergence of the Kohonen algorithm also depends on the values of the parameters of the neighborhood function.
• For Λ(i,v), the optimal choice of the width s depends on the number of weights and on the number of iterations.
• For h(i,v,n), the optimal choices likewise concern the width of the neighborhood and its decrease with n.
Numerical analysis (1/3)
• I ran the algorithm 1000 times.
• I used sets of uniformly distributed data in (0,1) containing 4000, 10000, 20000, 30000, 60000, 120000, 150000 and 250000 elements.
• The procedure was carried out for all the mentioned choices of η(n) and for both Λ and h; at the end of each run I obtained the limit values of the weights, giving 1000 cases.
• Results: the mean of these limit values converged to the centers of the optimal partition of the interval (0,1) for the above choices of η(n) and for both Λ and h (see the sketch below).
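For uniform data on (0,1) the optimal partition into N intervals is the uniform one, with centers at (2i - 1)/(2N). A scaled-down version of the experiment (100 runs instead of 1000), reusing the kohonen_1d sketch above, could look like this:

```python
import numpy as np

N, runs = 6, 100
rng = np.random.default_rng(42)
# limit weights from many independent runs on fresh uniform data
limits = np.array([kohonen_1d(rng.uniform(0.0, 1.0, 20000), N, rng=rng)
                   for _ in range(runs)])
# centers of the optimal (uniform) partition of (0,1)
optimal = (2.0 * np.arange(1, N + 1) - 1.0) / (2.0 * N)
print("mean limit weights:", limits.mean(axis=0))
print("optimal centers:   ", optimal)
```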
Numerical analysis (3/4)
• The average error of the limit weights with respect to the exact values of the centers decreases as the number of iterations increases; using Λ the error decreases more quickly.
• The almost-everywhere convergence of the algorithm is obtained.
APPLICATIONS
• I applied the Kohonen network to microarray data generated using a breast cancer mouse model.
• The data were derived from the paper Quaglino et al., JCI 2004.
• The authors studied the effect of HER2 DNA vaccination in halting breast cancer progression in BALB-neuT mice. By hierarchical agglomerative clustering they identified a small set of genes (34) associated only with the transcriptional profile of the vaccinated mice.
RESULTS
Using this approach I identified the 34 genes described in the paper, and I also identified a subset of 25 further vaccination-specific genes that could not be discovered with the clustering approach described in the paper of Quaglino et al.
Conclusion (FIRST PART)
• The Kohonen network in one dimension converges almost everywhere for appropriate learning parameters; this makes it powerful and more adaptable.
• Its drawback is that the number of clusters must be chosen in advance.
• The Kohonen algorithm in more than one dimension works well, but only a sufficient condition for convergence has been proved.
Promoter structure analysis
The key task is to determine the most likely transcription factor binding locations on the promoter.
Binding site models
• Matrix representation
• Position weight matrix
• Energy matrix
• String-based representation
• Consensus sequence
Position weight matrix (1/2)
Pos:  1   2   3   4   5   6   7   8   9  10  11  12  13
A |   6   6   6   2  70  12  15  13   4   6  49   2   2
C |   2   6  14  72   3  26  19  33   6   7  11  68  72
G |  72  68   4   4   3  29  30  22   7  66  16   5   2
T |   2   2  58   4   6  15  18  14  65   3   6   7   6
Every element of the matrix is the number of times each nucleotide is found at each position of an alignment.
Position weight matrix (2/2)
• From this matrix one can derive:
• the position-specific frequency matrix (PSFM)
• the log-odds matrix
• the information content
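A minimal sketch of these derived quantities, computed from the count matrix shown above; the uniform background (0.25 per nucleotide) and the pseudocount are illustrative assumptions:

```python
import numpy as np

counts = np.array([  # rows A, C, G, T; columns = the 13 motif positions above
    [ 6,  6,  6,  2, 70, 12, 15, 13,  4,  6, 49,  2,  2],   # A
    [ 2,  6, 14, 72,  3, 26, 19, 33,  6,  7, 11, 68, 72],   # C
    [72, 68,  4,  4,  3, 29, 30, 22,  7, 66, 16,  5,  2],   # G
    [ 2,  2, 58,  4,  6, 15, 18, 14, 65,  3,  6,  7,  6],   # T
], dtype=float)

pseudo = 0.5                                              # illustrative pseudocount
psfm = (counts + pseudo) / (counts + pseudo).sum(axis=0)  # position-specific frequency matrix
logodds = np.log2(psfm / 0.25)                            # log-odds vs a uniform background
info = 2.0 + (psfm * np.log2(psfm)).sum(axis=0)           # information content per position (bits)
```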
Searching for binding sites
• De novo methods: novel, enriched motifs are discovered.
• Scanning methods: a genome sequence is scanned with a given motif model to find further motif matches.
Motif detection
To find instances of a given motif I have used a PSFM together with a higher-order background model: the background DNA sequence is assumed to be generated by a Markov model of order m.
I have computed the score of a sequence window b1 … bL as
score = Σ_{l=1..L} log [ f(bl, l) / P(bl | b(l-m) … b(l-1)) ]
where f(b, l) is the PSFM entry giving the probability of finding nucleotide b at position l of the motif, and the denominator is the probability of bl under the order-m Markov background.
Observation: this score is the log-likelihood ratio of observing the data under the motif model versus the background model. Therefore a high score at a specific position suggests the presence of the motif at that particular location of the DNA sequence.
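A minimal sketch of this score, assuming for brevity an order-1 (m = 1) Markov background estimated from dinucleotide counts; psfm is the matrix from the previous sketch, and the function names are hypothetical:

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def markov1_background(seq):
    """Estimate the order-1 background P(b | previous base) from dinucleotide counts."""
    trans = np.ones((4, 4))                  # pseudocounts
    for a, b in zip(seq, seq[1:]):
        trans[IDX[a], IDX[b]] += 1
    return trans / trans.sum(axis=1, keepdims=True)

def motif_score(window, prev_base, psfm, trans):
    """score = sum_l log( PSFM(b_l, l) / P(b_l | b_{l-1}) )."""
    score, prev = 0.0, prev_base
    for l, b in enumerate(window):
        score += np.log(psfm[IDX[b], l] / trans[IDX[prev], IDX[b]])
        prev = b
    return score

def scan(seq, psfm, trans):
    """Score every window of motif length; start at 1 so each window has a preceding base."""
    L = psfm.shape[1]
    return [motif_score(seq[i:i + L], seq[i - 1], psfm, trans)
            for i in range(1, len(seq) - L + 1)]
```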
Statistical significance of scores (1/2)
• Is the score unlikely to have arisen by chance? To answer, we need the p-value of the score x (the score of the match of the motif with a given DNA sequence): the probability of obtaining a score at least as large as x by chance.
• If the p-value is very low, the motif is significantly represented in the DNA sequence.
Statistical significance of scores (2/2)
• P-value of the score of the motif match with a random sequence (independent, identically distributed model, i.i.d.: all positions in the sequence have the same distribution and are independent of each other).
• It is a good approach for reducing the number of false positive matches of the motif.
• It does not, however, give the right information on the over-representation of the motif in a given DNA sequence.
Extreme value theory
• Extreme value theory (EVT) provides a framework for formalizing the study of the tail behaviour of a distribution. EVT allows us to use extreme observations to estimate the density in the tail.
• Peaks over threshold (POT): analysis of the large observations that exceed a high threshold u.
• The problem is to estimate the conditional excess distribution function
Fu(y) = P(X - u ≤ y | X > u) = P(Y ≤ y | X > u),  0 ≤ y ≤ xF - u,
where Y = X - u and xF is the right endpoint of the distribution.
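A minimal sketch of the POT tail estimate using scipy's generalized Pareto distribution; taking the 95% quantile as the threshold u is an illustrative choice:

```python
import numpy as np
from scipy.stats import genpareto

def pot_pvalue(scores, x, q=0.95):
    """Estimate P(S >= x) for a score x above the threshold u (POT tail estimate)."""
    scores = np.asarray(scores)
    u = np.quantile(scores, q)                         # high threshold
    excesses = scores[scores > u] - u                  # peaks over the threshold
    c, loc, scale = genpareto.fit(excesses, floc=0.0)  # fit Fu by a generalized Pareto
    p_exceed = excesses.size / scores.size             # empirical P(S > u)
    return p_exceed * genpareto.sf(x - u, c, loc=0.0, scale=scale)
```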
Procedure
• I have screened the set of DNA sequences against a set of PSFMs, assigning a score to every DNA segment whose length equals that of the motif to be detected.
• I have applied the POT model to compute the p-value of a given score, both on the DNA sequences and on random sequences.
• I have tested the p-values. If the p-value computed from the distribution of scores on random sequences is lower than the one computed from the distribution of scores on DNA sequences, and this latter p-value is lower than a predefined threshold, then the motif is detected at the position where the score was computed (see the sketch below).
Applications
• I have applied this procedure to a set of promoter regions of human DNA (2836 sequences), each containing at least one regulatory estrogen responsive element (ERE).
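The detection test in the procedure above reduces to a chained comparison; a minimal sketch, with an illustrative threshold value:

```python
def motif_detected(p_dna, p_random, threshold=1e-3):
    """Motif called when p_random < p_dna and p_dna is below the preset threshold."""
    return p_random < p_dna < threshold
```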
Preliminary analysis (1/2)
• Histograms, box-plots, tests for normality…
• The data set is not normally distributed, but its distribution tends towards the Gaussian distribution.
• Explanation: the distribution of the sum of N i.i.d. variables tends to the Gaussian distribution with an error of order 1/√N. For N = 10 the distribution of the sum is close to the Gaussian one, but it is not a Gaussian.
Preliminary analysis (2/2)
• Analysis on random sequences (all positions have the same letter distribution and are independent of each other). The scores are sums of i.i.d. variables with values in a bounded interval [0, a], so their distribution has a finite right endpoint; the maximum is therefore expected to fall in the Weibull domain of attraction.
• I have checked (see the sketch below):
• the normality
• the type of extreme value distribution
• the von Mises condition
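A minimal sketch of checking the type of the extreme value distribution on synthetic data: block maxima of bounded sums of i.i.d. variables fitted by a GEV. Note that scipy's shape parameter c is the negative of the usual GEV ξ, so c > 0 indicates a bounded, Weibull-type upper tail:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
sums = rng.uniform(0.0, 1.0, size=(20000, 10)).sum(axis=1)  # bounded sums of 10 i.i.d. variables
maxima = sums.reshape(200, 100).max(axis=1)                 # block maxima, blocks of 100
c, loc, scale = genextreme.fit(maxima)                      # fit a generalized extreme value law
print("fitted GEV shape c:", c)                             # c > 0: Weibull-type (bounded) tail
```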
Results
Focusing on the detection of one ERE at a time:
• the distribution of the maxima of the scores on random sequences is almost always Weibull-like (only 2/100 are not);
• the distribution of the maxima of the scores on DNA sequences is mainly Weibull-like; 413/2836 score sequences are Fréchet-like and 13 are Gumbel-like.
I show the results for NDRG2, CNTFR and DVL1: these genes have been proved by ChIP to contain at least one ERE.
Results
• Detection of two EREs: I have applied the same procedure as before, considering in addition the distance between the two EREs.
• Only for some particular distances have I detected a pair of EREs.
• The position of one of the two EREs of the pair is the same position found biologically.
• A relatively high homology is conserved in the mouse ortholog for all 3 genes; only for the DVL1 gene is the high homology conserved across the three orthologs (mouse, rat, dog).
Conclusion (1/2)
• I have improved the detection of motifs by:
• using a higher-order background model
• computing scores for two or three motifs
• using POT on both DNA and random sequences
• testing the p-value
• The model fits well: it is able to detect the genes validated by ChIP and reported in the literature as containing at least one ERE.
Conclusion (2/2)
• Problems:
• when more than three motifs are analyzed, the model produces files of critical (very large) size.
• Future work:
• detecting different motifs
• incorporating new biological results into the model
• implementing a single routine in R
The End
Thanks to:
Prof. B. Tirozzi
Prof. R. Calogero
The bioinformatics laboratory in Orbassano