Case Study Presentation: Systematic Evaluation of Scaling Methods for Gene Expression Data. Nachiket Gokhale, Graduate Student, Computer Science, University of Minnesota.
Outline • Background Introduction • Problems with Gene Expression Datasets • Types of Datasets • Column Scaling Methods • Row Transformation Methods • Evaluation Techniques • Non-Temporal Data Analysis • Temporal Data Analysis • Conclusion • References
Background Introduction • Each data point produced by a DNA microarray experiment represents the ratio of the expression levels of a particular gene under two different experimental conditions. Thus the result of an experiment on n genes is a series of n expression-level ratios. • Typically, the numerator of each ratio is the expression level of the gene under the condition of interest, whereas the denominator is the expression level of the gene under some reference condition. • [Figure: gene expression matrix, N genes by M conditions]
Background Introduction • The data from a series of m such experiments are represented as a gene expression matrix, in which each of the n rows is an m-element expression vector for a single gene. The data can also be collected at different times during the experiment. • Gene expression datasets are generated using numerous microarray technologies across numerous experiments. • Finally, the expression-level matrices are pre-processed to account for variations among the microarray technologies and to keep the datasets homogeneous. • [Figure labels: genes; diverse conditions; sets of specific conditions]
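The matrix described above can be sketched in a few lines of Python. This is only an illustration: the function name `expression_matrix`, the input layout, and the use of a single reference level per gene are assumptions, not details taken from the slides.

```python
def expression_matrix(sample_levels, reference_levels):
    """Build an n x m matrix of expression-level ratios.

    sample_levels[i][j]  : level of gene i under condition j (hypothetical input)
    reference_levels[i]  : level of gene i under the reference condition
    """
    return [
        [level / ref for level in row]
        for row, ref in zip(sample_levels, reference_levels)
    ]


# Two genes (rows) measured under three conditions (columns).
sample = [[2.0, 4.0, 8.0], [3.0, 3.0, 1.5]]
reference = [2.0, 3.0]
matrix = expression_matrix(sample, reference)
```

Each row of `matrix` is the expression vector of one gene, and each column corresponds to one experiment.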
Problems with Gene Expression Datasets • Gene expression data obtained from different microarray experiments contain a considerable amount of systematic variation, both within and between the arrays used in the experiments. • Normalizing these variations is essential for an accurate comparison of genes' relative expression profiles within and across conditions. • Sources of variation include the following: • Spatial location on the array: the position of DNA samples on the array chip (changes in concentration or amount) causes errors. • Dye biases that vary with spot intensity: scanner settings, saturation effects, background fluorescence, linearity of detection response, and ambient conditions. • Printing/spotting quality: tilt in the arrays and the resulting variation in measured intensity. • Experimenter: manual errors.
Problems with Gene Expression Datasets • Even when the datasets are preprocessed before merging, differences between the scales of measurement in different conditions introduce inconsistencies in the gene expression data. • This happens mainly because gene expression data generated at different laboratories are accumulated into a single dataset. • As a result, different scaling and transformation techniques have been introduced to account for the scaling problems inherent in experimentally produced gene expression datasets. • The aim of the paper was to evaluate these scaling and transformation methods systematically and to give a comprehensive comparison between them on different gene expression datasets.
Types of Datasets • Temporal Data: The experiments in these datasets record the expression behavior of genes that are exposed to a specific condition at different points in time. This gives a well-defined relationship between successive columns of the expression matrix. • Non-temporal Data: These datasets record the expression behavior of genes that are exposed to a set of distinct conditions at the same point in time, so there is no temporal relationship between the columns. • Because the two kinds of datasets differ in their temporal relationships, different scaling and transformation techniques are applied to each kind. • [Figures: Non-temporal Gene Expression Datasets; Temporal Gene Expression Datasets]
Column Scaling Techniques • 1. Unitnorm Scaling: • A common way of bringing datasets to the same scale is to divide each vector (column) by its 2-norm, i.e., to convert it into a unit vector. • Application: Text mining.
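A minimal sketch of Unitnorm scaling on a row-major matrix (a list of rows) follows. The function name `unitnorm_columns` and the handling of all-zero columns are assumptions made for illustration.

```python
import math


def unitnorm_columns(matrix):
    """Divide every column of a row-major matrix by its Euclidean (2-) norm."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    # Assumption: an all-zero column is left unchanged to avoid division by zero.
    return [
        [row[j] / norms[j] if norms[j] else row[j] for j in range(n_cols)]
        for row in matrix
    ]


scaled = unitnorm_columns([[3.0, 1.0], [4.0, 0.0]])
```

After scaling, every non-zero column has length 1, so columns measured on very different scales become directly comparable.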
Column Scaling Techniques • 2. Z-score Scaling: • Another way of making different data values comparable is to subtract from each value in a vector (column) the mean of that vector, and then divide by the vector's standard deviation. • Applications: Evaluation of protein structure alignment, scaling of gene expression data.
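Z-score scaling of columns can be sketched as below. The slides do not say whether the population or the sample standard deviation is used; the population form (`statistics.pstdev`) is assumed here, and constant columns are mapped to zero by assumption.

```python
import statistics


def zscore_columns(matrix):
    """Shift each column by its mean and divide by its standard deviation."""
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation; the source does not specify.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]


# The second column is the first one measured on a scale ten times larger.
z = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Both columns end up with the same scaled values, which is the property that makes data from differently scaled conditions comparable.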
Column Scaling Techniques contd… • 3. Quantile Normalization: • This is a statistical technique that transforms data from different distributions to a common distribution by making the quantiles of the distributions equal. • The “quantilenorm” function in the MATLAB Bioinformatics Toolbox, which implements Bolstad's (2001) formulation of this algorithm, was used. • Application: Scaling method for gene expression data.
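The idea of equalizing quantiles can be sketched as follows. This is a simplified illustration, not a reimplementation of MATLAB's `quantilenorm`: it assumes no missing values and breaks ties by sort order rather than averaging tied ranks.

```python
def quantile_normalize(matrix):
    """Give every column the same distribution: the rank-wise mean across columns."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # The mean of the k-th smallest values across columns is the common k-th quantile.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[k] for col in sorted_cols) / n_cols for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices of this column ordered from smallest to largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


normalized = quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]])
```

Within each column the ordering of the genes is preserved, while the set of values in every column becomes identical.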
Column Scaling Techniques contd… • To design an effective scaling technique, two factors should be taken into consideration: the underlying distribution and the presence of outliers. • 4. Sigmoid Scaling Function: • This method ensures that extreme or outlying values do not distort the data analysis significantly. The function used is: [Equation: Sigmoid scaling function] • The extreme input values (outliers) are bounded by +1 and -1. However, this function has some drawbacks: • The final scaled value is determined without considering the background distribution of x. • An input value of zero (representing neutral gene expression) may be distorted by noise to a small non-zero result. Such values should effectively be treated as zeros in the gene expression matrix, which the Sigmoid function does not do.
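The slide's formula is not reproduced in this transcript, so the sketch below assumes one standard sigmoid with range (-1, 1), namely (1 - e^-x) / (1 + e^-x), which equals tanh(x / 2). The function names are hypothetical, and the exact form used in the paper may differ.

```python
import math


def sigmoid_scale(x):
    """Assumed form: (1 - e^-x) / (1 + e^-x), computed stably as tanh(x / 2)."""
    return math.tanh(x / 2.0)


def sigmoid_columns(matrix):
    """Apply the assumed sigmoid to every entry of a row-major matrix."""
    return [[sigmoid_scale(x) for x in row] for row in matrix]


# 0.01 stands for a neutral value perturbed by noise; +/-50 stand for outliers.
scaled = sigmoid_columns([[0.0, 0.01], [50.0, -50.0]])
```

Under this assumed form the outliers are squashed to the bounds, while the noisy near-zero input stays a small non-zero value, which is the second drawback listed above.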
Column Scaling Techniques contd… • 5. Dsigmoid Scaling Function: • The Sigmoid function is modified and split into two ranges, [-1,0) and (0,1], to give the following Dsigmoid formulation: [Equation: Dsigmoid scaling function] • The distribution of this formulation resembles the normal distribution, which makes it beneficial for scaling gene expression data. The parameters d and s are the centering and steepness factors of the Dsigmoid function. • Each value x along the vector is transformed to a new value using this Dsigmoid function. • The Sigmoid and Dsigmoid column scaling methods were introduced in this paper for transforming gene expression matrices.
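Because the Dsigmoid formula itself is missing from this transcript, the sketch below assumes a commonly used double-sigmoid form, sign(x - d) * (1 - exp(-((x - d) / s)^2)), with d as the center and s as the steepness. Both the form and the default parameter values are assumptions for illustration only.

```python
import math


def dsigmoid_scale(x, d=0.0, s=1.0):
    """Assumed double sigmoid: sign(x - d) * (1 - exp(-((x - d) / s) ** 2))."""
    u = (x - d) / s
    magnitude = 1.0 - math.exp(-u * u)
    # copysign gives the result the sign of (x - d); at x == d the result is 0.
    return math.copysign(magnitude, x - d)


inputs = [-5.0, -0.01, 0.0, 0.01, 5.0]
outputs = [dsigmoid_scale(x) for x in inputs]
```

With this assumed form, values close to the center d are pushed much closer to zero than under a plain sigmoid, while outliers are still bounded by -1 and +1.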
Column Scaling Techniques contd… • The scaling given by Sigmoid and Dsigmoid follows the well-known S-curve. [Figure: S-curve of the Sigmoid and Dsigmoid functions] • The Unitnorm, Z-score and Quantilenorm scaling methods are resistant to the scaling variations that arise mainly when data from different laboratories are combined. • Sigmoid and Dsigmoid are better at eliminating noise and outliers from the datasets.
Row Transformation Techniques • The explicit relationships in temporal datasets need to be factored into the scaling process, which generates a time series that may be better suited to different applications. • 1. Smoothing by Moving Average: • One method of analyzing time series data is to smooth the values in a sliding window of duration k by averaging them, known as the moving average (MA) method. • Application: Analyzing circadian gene expression data.
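A moving average over a window of k points can be sketched as below. The slides do not say how the ends of the series are handled; this version assumes no padding, so the output is shorter than the input by k - 1 points.

```python
def moving_average(series, k):
    """Average each run of k consecutive points; returns len(series) - k + 1 values."""
    if not 1 <= k <= len(series):
        raise ValueError("k must be between 1 and len(series)")
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]


smoothed = moving_average([1.0, 3.0, 5.0, 3.0, 1.0], k=3)
```

A larger k smooths more aggressively but also shortens the series and blurs short-lived expression changes.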
Row Transformation Techniques • 2. Differential Scaling: • This method transforms the original time series vector into a new vector using the difference formula, so that only the trend of change between time points is taken into account and not the absolute values. • This helps to reduce the effect of offsets and errors in the data. • Application: Functional classification of gene expression data. • 3. Z-score Transformation: • Another good way of comparing time series is to consider only the deviations from the average. This is similar to the Z-score function explained under column scaling methods. • Application: Analyzing circadian gene expression data.
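Both row transformations can be sketched on a single time series. The slides do not spell out the difference formula; the first difference x[t+1] - x[t] is assumed here, and the z-score uses the population standard deviation by assumption. The names `diff_transform` and `ztrans` are hypothetical.

```python
import statistics


def diff_transform(series):
    """Assumed difference formula: successive first differences x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def ztrans(series):
    """Deviation of each point from the series mean, in standard deviations."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # assumption: population standard deviation
    return [(x - mean) / std if std else 0.0 for x in series]


base = [1.0, 2.0, 4.0, 7.0]
shifted = [x + 10.0 for x in base]  # same trend with a constant offset added
diff_base, diff_shifted = diff_transform(base), diff_transform(shifted)
z_base, z_shifted = ztrans(base), ztrans(shifted)
```

The offset series yields the same differences and the same z-scores as the original, which illustrates why both transformations reduce the effect of offsets.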
Row Transformation Techniques contd • The transformation methods discussed above are applied separately to each time series, and the final transformed expression profile of each gene is obtained by concatenating the individual transformed time series. • For temporal datasets, the gene expression matrix is scaled in this study by transforming its rows, i.e., the expression profiles of individual genes, using each of the above methods. • The column scaling techniques are then applied to the transformed time series. The row transformation exploits the temporal relationships in the data, and the subsequent column scaling further normalizes the datasets for better clustering results.
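The two-stage procedure for temporal data can be sketched as a small pipeline. The helper names, the `series_lengths` argument (how many columns belong to each time series), and the choice of first differences plus Unitnorm as the example pair are all assumptions for illustration.

```python
import math


def diff_transform(series):
    """Assumed row transform: first differences x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def unitnorm_columns(matrix):
    """Column scaling: divide each column by its 2-norm (zero columns unchanged)."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [
        [row[j] / norms[j] if norms[j] else row[j] for j in range(n_cols)]
        for row in matrix
    ]


def transform_rows(matrix, series_lengths, row_transform):
    """Transform each time series of each gene separately, then concatenate."""
    out = []
    for row in matrix:
        new_row, start = [], 0
        for length in series_lengths:
            new_row.extend(row_transform(row[start:start + length]))
            start += length
        out.append(new_row)
    return out


def scale_temporal(matrix, series_lengths, row_transform, column_scale):
    """Row transformation first, column scaling second."""
    return column_scale(transform_rows(matrix, series_lengths, row_transform))


# Two genes; columns 0-2 form one time series and columns 3-4 another.
data = [[1.0, 2.0, 4.0, 10.0, 13.0], [0.0, 4.0, 4.0, 5.0, 1.0]]
rows_only = transform_rows(data, [3, 2], diff_transform)
scaled = scale_temporal(data, [3, 2], diff_transform, unitnorm_columns)
```

Transforming each series separately avoids computing a difference across the boundary between two unrelated time series.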
Evaluation Techniques • Suppose that a scaling method A, when applied to an expression matrix M, produces a matrix M_A. The pairwise links between genes, ranked by the correlation of their expression profiles in M_A, are then examined to see whether the most highly ranked links tend to connect genes with similar function. • This evaluation process is applied to each of the scaled expression matrices M_A, and the results are compiled for comparison. • 1. Observed Functional Relationships: • Pairwise correlations are calculated among all expression profiles in the given dataset, and the gene pairs are sorted in descending order of expression correlation. • Starting from the most similar gene pair, the number of pairs that are known to be functionally related according to a reference set of known relationships is accumulated. • A plot of the number of true functional relationships recovered versus the number of gene pairs analyzed, in order of decreasing similarity, can then be used for performance evaluation.
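The ranking and cumulative counting steps can be sketched as below. Pearson correlation is assumed as the similarity measure, and the gene names, profiles, and reference pairs are made-up illustration data, not values from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def recovery_curve(profiles, known_pairs):
    """Cumulative count of known related pairs, walking pairs by decreasing correlation.

    profiles    : dict mapping gene name -> expression vector (row of M_A)
    known_pairs : set of frozensets, each holding two functionally related genes
    """
    ranked = sorted(
        combinations(sorted(profiles), 2),
        key=lambda pair: pearson(profiles[pair[0]], profiles[pair[1]]),
        reverse=True,
    )
    curve, hits = [], 0
    for a, b in ranked:
        hits += frozenset((a, b)) in known_pairs
        curve.append(hits)
    return curve


profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.5],
    "g3": [3.0, 1.0, 2.0],
    "g4": [3.0, 2.0, 1.0],
}
known = {frozenset(("g1", "g2")), frozenset(("g1", "g3"))}
curve = recovery_curve(profiles, known)
```

Plotting `curve` against the pair rank gives the evaluation plot; a scaling method whose curve rises earlier places more truly related pairs among its most correlated links.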
Evaluation Techniques contd • Four types of interactions were considered: (i) physical protein-protein interactions, (ii) metabolic pathway co-membership, (iii) regulation by the same promoter, and (iv) co-membership in sequence homology clusters. • [Flowchart: Pairwise correlation between expression profiles → Sorting in descending order → Cumulative addition of the pairs in order of decreasing similarity → Plotting true functional relationships recovered vs. number of gene pairs analyzed]
Evaluation Techniques contd • 2. Similarity of Functional Labels: • 81 classes were used to annotate genes and construct one vector per gene; these vectors were then partitioned using the CLUTO clustering toolkit. • Each cluster is treated as a clique, and two genes are considered functionally related if they are part of the same clique. This set of relationships is then subjected to the same evaluation methodology as in the previous technique. • Another metric, SwissProt keyword recovery, was used for comparison of the relationship matrices. • [Flowchart: Construction of one vector per gene and CLUTO clustering → Applying the Observed Functional Relationships technique]
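Turning clusters into a set of functional relationships can be sketched as follows. The clustering itself is not performed here: the cluster assignments are a hypothetical stand-in for CLUTO output, and `clique_pairs` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations


def clique_pairs(cluster_of):
    """Treat each cluster as a clique: every two genes in it form a related pair.

    cluster_of : dict mapping gene name -> cluster id (hypothetical clustering output)
    """
    members = defaultdict(list)
    for gene, cluster in cluster_of.items():
        members[cluster].append(gene)
    return {
        frozenset(pair)
        for genes in members.values()
        for pair in combinations(genes, 2)
    }


related = clique_pairs({"g1": 0, "g2": 0, "g3": 1, "g4": 0, "g5": 1})
```

The resulting set of pairs plays the role of the known relationships when the ranking-based evaluation of the previous technique is reused.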
Non-temporal Data Analysis • The Unitnorm, Znorm, Sigmoid, Dsigmoid and Quantile column scaling methods were applied to four different non-temporal gene expression datasets, and the results were assessed with the two evaluation techniques, ObservedFuncRels and SimFuncLabels. • [Figure: Evaluation of Gerber Dataset]
Non-temporal Data Analysis contd • [Figure: Evaluation of Hughes Dataset] • [Figure: Evaluation of Iyer Dataset]
Non-temporal Data Analysis contd • [Figure: Evaluation of Saldanha Dataset] • The Quantile, Znorm, Unitnorm, Dsigmoid and Sigmoid scaling methods are observed to produce functionally richer matrices than the raw dataset. • Dsigmoid performs better than all other scaling methods, except on the Saldanha dataset, where Unitnorm performs better.
Temporal Data Analysis • Temporal datasets were first subjected to row transformations (MA, Ztrans and Diff) and then to column scaling techniques (Unitnorm, Znorm, Quantile, Sigmoid and Dsigmoid). • Results are as follows: [Figure: Evaluation of Zhu Dataset]
Temporal Data Analysis contd • [Figure: Evaluation of Shapira Dataset] • Some combinations of transformation and scaling give better performance than the raw dataset. Row transformation before column scaling provides significantly better results than applying the scaling techniques alone. • The Ztrans transformation produces the best results among all the row transformation methods under consideration.
Conclusion • Several commonly used scaling and transformation methods for gene expression data, such as z-score scaling, quantile normalization and diff transformation, were evaluated extensively, together with two new scaling methods, Sigmoid and Dsigmoid (double sigmoid). • The performance of these methods varies significantly across datasets, but Dsigmoid scaling generally performs well on non-temporal data and z-score transformation generally performs well on temporal data. • The properties of the scaling and row transformation methods govern their performance; exploring why the methods behave the way they do is left for future work.
References • G. Pandey, L. N. Ramakrishnan, M. Steinbach, and V. Kumar. Systematic evaluation of scaling methods for gene expression data. Technical Report 07-015, CS Dept., Univ. of Minnesota, 2007. • http://compbio.soe.ucsc.edu/genex/expressdata.html • Wikipedia; Google. • B. M. Bolstad. Probe level quantile normalization of high density oligonucleotide array data. Unpublished, 2001. Available at http://bmbolstad.com/stuff/qnorm.pdf • http://www-users.cselabs.umn.edu/classes/Spring-2009/csci5461/index.php?page=schedule • M. Izumo, T. R. Sato, M. Straume, and C. H. Johnson. Quantitative analyses of circadian gene expression in mammalian cell cultures. PLoS Comp Biol, 2(10):e136, 2006. • http://www-users.cselabs.umn.edu/classes/Spring-2011/csci8980-mbds/index.php?page=slides • G. Karypis. CLUTO - a clustering toolkit. Technical Report 02-017, CS Dept., Univ. of Minnesota, 2002. • http://www.dallasfed.org/data/basics/moving.html
Questions? Thank You!