COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis Müge Erdoğmuş Zeynep Işık
DISEASE: EMPHYSEMA • Emphysema is a lung disease that belongs to a group of diseases called chronic obstructive pulmonary disease (COPD). • The ability of the lungs to expel air is diminished in patients with emphysema. • The lungs lose their elasticity and thus become less contractile. • In emphysema, the lung tissues responsible for supporting the physical shape and function of the lungs are damaged.
EMPHYSEMA • The lung tissue around the smaller airways, the bronchioles and the alveoli, is the target of the destruction. • Normally the lungs are very elastic and spongy, but not in emphysema!
Causes of Emphysema • Alpha-1-antitrypsin deficiency • Cigarette smoking 1) It damages the lung tissue in various ways: the airway cells responsible for clearing mucus and other secretions are impaired by cigarette smoke. 2) Enhanced mucus secretion -> a rich source of food for bacteria, while the immune cells are hindered by cigarette smoke in their fight against infection. Destructive enzymes released by the immune cells -> loss of the proteins associated with elasticity.
The Microarray Experiment • The gene expression dataset is composed of 30 samples retrieved from NCBI’s Gene Expression Omnibus. • The RNA transcripts used for measuring the expression signals were taken from Homo sapiens. • 18 slides -> severely emphysematous tissue removed at LVRS (lung volume reduction surgery); 12 slides -> normal or mildly emphysematous lung tissue taken from smokers with nodules suspicious for lung cancer. • More than 33,000 of the best-characterized human genes are represented in the dataset, which includes 1,000,000 unique oligonucleotide features.
Data Analysis Read XLS Files -> Normalization -> Outlier Removal -> Check Normal Dist. -> Hypothesis Testing -> PCA / Correlation-Based Feature Reduction -> Classification
Data Normalization • Why do we normalize data? • Scale data for reasonable comparisons • Map the expression values of each probe to the [0, 1] range • Do not disturb the underlying distribution of the data
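The original analysis was presumably done in MATLAB/PRTools; as a rough illustration of the per-probe [0, 1] scaling described above, here is a minimal Python sketch (the probes-by-samples matrix name `expr` is an assumption):

```python
import numpy as np

def minmax_normalize(expr):
    """Scale each probe (row) to the [0, 1] range.

    expr: 2-D array, probes x samples. The transform is monotone,
    so the shape of each probe's distribution is preserved.
    """
    lo = expr.min(axis=1, keepdims=True)
    hi = expr.max(axis=1, keepdims=True)
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # guard against constant probes
    return (expr - lo) / span
```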
Outlier Removal • Why do we need to remove outliers? • They can distort the mean of the data -> wrong clusterings, wrong significance values • What is an outlier? • Expression values that are three or more standard deviations away from the mean are outliers • Replace outliers with the mean of the remaining expression values for the probe • Detected outliers in 5331 probes
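A minimal sketch of the three-standard-deviation rule described above, assuming the same probes-by-samples matrix `expr`; outliers are replaced by the mean of the remaining values of that probe:

```python
import numpy as np

def replace_outliers(expr, n_sd=3.0):
    """Replace values that lie n_sd or more standard deviations from a
    probe's mean with the mean of the remaining values of that probe."""
    cleaned = expr.astype(float).copy()
    for i, row in enumerate(cleaned):
        mu, sd = row.mean(), row.std()
        outlier = np.abs(row - mu) >= n_sd * sd
        if outlier.any() and not outlier.all():
            cleaned[i, outlier] = row[~outlier].mean()
    return cleaned
```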
Clustering Before Feature Reduction • Clustering on samples using all existing features to see how bad the situation is • K-means clustering with Euclidean distance • For selection of the initial k centers • KCENTRES algorithm • selects k center objects from the distance matrix such that the distance from the most distant object to its closest center is minimized.
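PRTools' KCENTRES routine optimizes the min-max objective described above with its own iterative procedure; a greedy farthest-first sketch in Python that approximates the same objective (the precomputed distance matrix `dist` is an assumed input) could look like this:

```python
import numpy as np

def farthest_first_centers(dist, k):
    """Greedy approximation of the k-centres objective: choose k objects
    so that the largest object-to-nearest-centre distance stays small.

    dist: precomputed n x n distance matrix between the samples.
    """
    centers = [0]                        # start from an arbitrary object
    d_nearest = dist[0].copy()           # distance of each object to its closest centre
    for _ in range(1, k):
        nxt = int(np.argmax(d_nearest))  # the object that is currently worst served
        centers.append(nxt)
        d_nearest = np.minimum(d_nearest, dist[nxt])
    return centers
```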
Clustering Before Feature Reduction • Really low clustering performance • Data set is noisy • Reduce features so as to keep only the probes that are really significant • Differentially expressed between the classes • ...can be used to differentiate between the two classes of samples
Searching for Normally Distributed Probes • Why do we need to identify which probes are normally distributed and which ones are not? • To apply different tests of significance • Lilliefors test (95% significance) • Tests normality by examining the signal intensity data for a probe • Modified version of the Kolmogorov-Smirnov test • No need to specify the parameters of the underlying distribution of the data • estimates them from the sample • 14787 probes have a normal distribution • 7428 probes do not have a normal distribution
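A sketch of the normality screen using the Lilliefors test as implemented in statsmodels (the 95% significance level corresponds to alpha = 0.05; the matrix name `expr` is an assumption):

```python
from statsmodels.stats.diagnostic import lilliefors

def split_by_normality(expr, alpha=0.05):
    """Run the Lilliefors test on each probe's intensities and split the
    probe indices into normally and non-normally distributed sets."""
    normal, non_normal = [], []
    for i, row in enumerate(expr):
        _, p = lilliefors(row, dist='norm')
        (normal if p > alpha else non_normal).append(i)
    return normal, non_normal
```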
Tests of Significance • For probes that have a normal distribution • T-test or Z-test • 30 samples -> t-test (95% significance) • For probes that do not have a normal distribution • Non-parametric test • Wilcoxon rank-sum test (95% significance) • ... sorts all intensity values for a probe • ... gives each intensity value a rank • ... sums up the ranks of the signal values for both classes • ... compares the sums to decide whether the two samples come from the same distribution.
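A sketch of the per-probe significance testing with SciPy, assuming index arrays `emph_idx` and `normal_idx` for the two sample groups and the normality flags produced in the previous step:

```python
from scipy.stats import ttest_ind, ranksums

def differentially_expressed(expr, emph_idx, normal_idx, is_normal, alpha=0.05):
    """Per-probe test between the two sample groups: a two-sample t-test
    for normally distributed probes, a Wilcoxon rank-sum test otherwise."""
    selected = []
    for i, row in enumerate(expr):
        a, b = row[emph_idx], row[normal_idx]
        _, p = ttest_ind(a, b) if is_normal[i] else ranksums(a, b)
        if p < alpha:
            selected.append(i)
    return selected
```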
Tests of Significance • 2339 of 22215 probes are differentially expressed • eliminated 19876 probes
Clustering After Feature Reduction • Clustering on samples using only the differentially expressed features to see whether feature reduction proved to be useful • K-means clustering with the same procedure
Further Feature Reduction • 2339 features is a high number • Try to reduce number of features while preserving clustering accuracy • Two methods • Correlation based feature reduction • Principal component analysis
Correlation Based Feature Reduction • Extract uncorrelated features • Cluster the uncorrelated features • K-means • SOM • Prior to clustering, find the value of k from hierarchical clustering • Different distance metrics • Different hierarchical clustering methods • From each cluster of each clustering, select certain genes • These genes will be used for classification
Correlation Based Feature Reduction • Extract uncorrelated features • Find the correlation matrix • Keep one of the highly correlated features and remove the others • Highly correlated = correlation > 85% • Keep the feature that is closest to all other features • 1689 probes are left in our feature set
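The slides do not spell out the exact removal procedure, so the following Python sketch is one plausible reading: build the probe-by-probe correlation matrix and, for each group of probes correlated above 0.85, keep only the probe that is on average most correlated with (closest to) all others:

```python
import numpy as np

def drop_correlated(expr, threshold=0.85):
    """Keep one representative per group of highly correlated probes.

    The probe with the highest mean absolute correlation to all others
    (i.e. the one 'closest' to the rest) is kept; its partners above the
    threshold are dropped.
    """
    corr = np.corrcoef(expr)                # probes x probes correlations
    centrality = np.abs(corr).mean(axis=1)
    keep = np.ones(expr.shape[0], dtype=bool)
    for i in np.argsort(-centrality):       # most central probes claim their group first
        if not keep[i]:
            continue
        partners = np.where((np.abs(corr[i]) > threshold) & keep)[0]
        keep[partners[partners != i]] = False
    return np.where(keep)[0]                # indices of the surviving probes
```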
Correlation Based Feature Reduction • Prior to clustering find value of k from hierarchical clustering • Different distance metrics • Manhattan • Euclidean • Mahalanobis • Chebyshev • Correlation coefficients • Different hierarchical clustering methods • Complete linkage (max distance clustering) • Average linkage (average distance clustering)
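A sketch of how the dendrograms shown on the following slides could be generated with SciPy; `uncorrelated_probes.npy` is a hypothetical file holding the 1689-probe matrix, and the Mahalanobis metric is omitted here because it additionally requires the inverse covariance matrix (`VI`):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

expr = np.load('uncorrelated_probes.npy')   # hypothetical file: 1689 probes x 30 samples

# 'cityblock' is SciPy's name for the Manhattan distance
for metric in ('cityblock', 'euclidean', 'chebyshev', 'correlation'):
    for method in ('complete', 'average'):
        Z = linkage(pdist(expr, metric=metric), method=method)
        plt.figure()
        dendrogram(Z, no_labels=True)
        plt.title(f'{method} linkage, {metric} distance')
plt.show()
```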
Correlation Based Feature Reduction Complete Linkage with Manhattan Distance
Correlation Based Feature Reduction Average Linkage with Manhattan Distance
Correlation Based Feature Reduction Complete Linkage with Euclidean Distance
Correlation Based Feature Reduction Average Linkage with Euclidean Distance
Correlation Based Feature Reduction Complete Linkage with Chebyshev Distance
Correlation Based Feature Reduction Average Linkage with Chebyshev Distance
Correlation Based Feature Reduction Complete Linkage with Correlation Coeff.
Correlation Based Feature Reduction Average Linkage with Correlation Coeff.
Correlation Based Feature Reduction • The complete linkage method is more successful in separating the clusters • Probes that are very similar to each other are put into the same cluster, while ones that are very different are put into different clusters • We focused on the trees formed using complete linkage with Euclidean distance and correlation coefficients as the distance metrics • At which level should we cut the trees? • Examine the trees
Correlation Based Feature Reduction • Cutting above the lower bound line -> a small number of clusters • insufficient to capture the closeness of the probes • Cutting below the upper bound line -> a high number of clusters • forces the clustering algorithm to divide clusters that consist of very similar probes • lower bound = 40 clusters • upper bound = 95 clusters
Correlation Based Feature Reduction • To find the optimal k value • run k-means several times for each k value between 40 and 95 • keep the run that produces the clustering of highest quality • High quality = small intra-cluster distance • k is found to be 80 • store the clustering result produced when k = 80 • run SOM clustering with a 9x9 map and store the clustering result
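The slides score clusterings only by "small intra-cluster distance", which on its own always favors the largest k, so the sketch below swaps in the silhouette score (which also rewards inter-cluster separation) as an assumed stand-in for the quality measure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(expr, k_range=range(40, 96), seed=0):
    """Run k-means for every candidate k (with restarts) and keep the
    clustering with the best silhouette score."""
    best_k, best_score, best_labels = None, -np.inf, None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(expr)
        score = silhouette_score(expr, km.labels_)
        if score > best_score:
            best_k, best_score, best_labels = k, score, km.labels_
    return best_k, best_labels
```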
Correlation Based Feature Reduction • How do we select the signature probes from the clusters? • select the ones that are most significant • How many probes do we select from each cluster? • select the “n” most significant probes from each cluster, where “n” is dependent on the quality of the cluster • Quality of a cluster = intra-cluster distance • Quality value is high -> intra-cluster similarity is low -> cluster is loose • Quality value is low -> intra-cluster similarity is high -> cluster is tight • Take more probes from clusters that are loose in order to represent those clusters better
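The exact allocation rule is not given on the slide; a simple proportional scheme, sketched below as an assumption, gives each cluster a share of the probe budget proportional to its intra-cluster distance and, within a cluster, picks the probes with the smallest p-values:

```python
import numpy as np

def pick_signature_probes(labels, quality, pvalues, total=144):
    """Allocate the probe budget across clusters in proportion to their
    intra-cluster distance (looser clusters get more probes) and pick the
    most significant probes (smallest p-values) within each cluster."""
    quality = np.asarray(quality, dtype=float)
    share = quality / quality.sum()
    picked = []
    for c, frac in enumerate(share):
        members = np.where(labels == c)[0]
        n = max(1, int(round(frac * total)))          # at least one probe per cluster
        picked.extend(members[np.argsort(pvalues[members])][:n].tolist())
    return picked
```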
Correlation Based Feature Reduction • From the clusters formed by k-means • 144 probes • From the clusters formed by SOM • 141 probes • We formed another set of 88 probes that appear in both the k-means and the SOM selections (named the common probes)
Principal Component Analysis • The set of statistically significant probes is given directly to PRTools' pca function • 99%, 90% and 85% data preservation • Number of resulting principal components
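An equivalent of the PRTools call in Python, where a fractional `n_components` asks scikit-learn's PCA to keep that fraction of the variance (`significant_probes.npy` is a hypothetical file holding the samples-by-probes matrix of significant probes):

```python
import numpy as np
from sklearn.decomposition import PCA

expr_sig = np.load('significant_probes.npy')   # hypothetical file: 30 samples x 2339 probes

for frac in (0.99, 0.90, 0.85):
    pca = PCA(n_components=frac, svd_solver='full')   # keep `frac` of the total variance
    scores = pca.fit_transform(expr_sig)
    print(f'{frac:.0%} preserved -> {pca.n_components_} principal components')
```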
Classification • Feature sets from correlation based feature reduction that can be used by the classifiers • K-means probe set • SOM probe set • Common probe set • Algorithms • Linear classifier • Support vector machine • 1-nearest neighbor classifier • 3-nearest neighbor classifier
Classification • 30 samples in our data set -> k-fold cross validation • To avoid the bias caused by the random selection of samples for the training and test sets • Repeat the classification 100 times for each classifier • Report the average classification error
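A sketch of the repeated cross-validation scheme; the number of folds and the SVM kernel are not stated on the slides, so 5 folds and a linear kernel are assumptions here:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

def repeated_cv_error(X, y, n_repeats=100, n_folds=5):
    """Average the k-fold cross-validation error over many random splits
    to damp the bias of any single random train/test partition."""
    errors = []
    for rep in range(n_repeats):
        cv = KFold(n_splits=n_folds, shuffle=True, random_state=rep)
        acc = cross_val_score(SVC(kernel='linear'), X, y, cv=cv)
        errors.append(1.0 - acc.mean())
    return float(np.mean(errors))
```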
Classification • Three sets of principal components from feature reduction with principal component analysis can be used by the classifiers • Algorithms • Linear classifier • Support vector machine • 1-nearest neighbor classifier • 3-nearest neighbor classifier
Classification • In all cases, the support vector machine provides the best classification results
Classification • The performance of the support vector machines that use the principal component sets is worse than that of the ones using the probe sets formed by the correlation based feature reduction method • The performance of the support vector classifiers using the k-means, SOM and common probe sets is roughly the same • They classify the samples with 99% accuracy on average • We use the set of common probes as the signature genes • The aim is to reduce the number of features without sacrificing classification performance
Final feature reduction • 88 features is still a high number • We use Fisher’s linear discriminant to further reduce the number of features • The resulting signature gene set consists of 26 probes • Classification performance even improved when the number of probes was reduced • We are able to classify the 30 samples with 99.7-100% accuracy on average
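One common way to use Fisher's criterion for feature reduction is to rank each probe by its between-class versus within-class separation and keep the top-scoring ones; the slides do not detail the procedure, so the sketch below is an assumption along those lines (the matrix `X_common` and labels `y` are assumed names):

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher criterion per feature: squared difference of the class means
    divided by the sum of the class variances (higher = better separation)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # guard against zero variance
    return num / den

# e.g. keep the 26 top-scoring probes out of the 88 common probes:
# top26 = np.argsort(-fisher_scores(X_common, y))[:26]
```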