MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA. Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman, and Roland Somogyi. Information Processing in Cells and Tissues, pp. 203-212, 1998. Presented by Bin He.
Motivations • It is necessary to determine large-scale temporal gene expression patterns • To decipher the logic of gene regulation, we should aim to monitor the expression levels of all genes simultaneously
Gene time series • Assay the expression levels of large numbers of genes in a tissue at different time points • The relative amounts of mRNA produced at these time points provide a gene expression time series for each gene
Gene Expression Matrix • Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press
Previous Approach • Euclidean distance and information-theoretic measures were used to cluster the genes into groups of related expression time series • A significant problem with this approach is the variety of measures that can be used • Each measure produces a different clustering of gene expression patterns
Contributions • determining significant relationships between individual genes, based on: • linear correlation • rank correlation • information theory
Linear correlation ------positive correlation • positive linear correlation
Linear correlation ------negative correlation • negative linear correlation
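The two cases above can be illustrated with a minimal sketch of the Pearson correlation coefficient r. The expression values and the names `gene_a`, `gene_b`, `gene_c` are made up for illustration and do not come from the data set discussed in the slides.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centred values.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression levels at 9 time points (illustrative values only).
gene_a = [0.1, 0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 1.0, 1.0]
gene_b = [v + 0.1 for v in gene_a]   # same shape, shifted up: positive correlation
gene_c = [1.0 - v for v in gene_a]   # mirror image: negative correlation

r_pos = pearson_r(gene_a, gene_b)    # close to +1
r_neg = pearson_r(gene_a, gene_c)    # close to -1
```

A Euclidean distance would place `gene_a` and `gene_c` far apart, whereas |r| flags them as strongly related.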
Linear correlation ------restriction • for 112 different genes, 112 × 111 / 2 = 6216 pairs of expression time series need to be examined • to restrict the number of relationships, we might want to test which correlations are significantly larger than a certain value
Linear correlation ------restriction • For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. ρ² > 0.5, we need |r| > 0.96 to reject, at the 1% significance level, the null hypothesis that |ρ| < 0.7071 (0.7071 = √0.5)
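The slides do not say which test produces the |r| > 0.96 threshold. The sketch below is one plausible reconstruction, not the authors' stated method. It assumes a two-sided test based on Fisher's z-transform and assumes each time series has 9 time points (a value not given in the slides). The function name `critical_r` is hypothetical.

```python
import math
from statistics import NormalDist

def critical_r(rho0, n, alpha=0.01):
    """Smallest |r| that rejects H0: |rho| <= rho0, using Fisher's z-transform.

    Assumption: two-sided test, with z = atanh(r) treated as approximately
    normal with standard error 1 / sqrt(n - 3).
    """
    z0 = math.atanh(rho0)                              # Fisher z of the null value
    se = 1.0 / math.sqrt(n - 3)                        # approximate standard error
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    return math.tanh(z0 + z_crit * se)

n_pairs = math.comb(112, 2)         # number of gene pairs to examine
rho0 = math.sqrt(0.5)               # rho^2 = 0.5, i.e. |rho| = 0.7071
threshold = critical_r(rho0, n=9)   # n = 9 is an assumption, see above
```

Under these assumptions the threshold comes out close to the 0.96 quoted above. With so few time points, only very strong sample correlations are significant.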
Linear correlation ------visualization • distance measure based on residual variance • d = 1 - r² • d = 0 if perfectly correlated (positively or negatively), d = 1 if uncorrelated • multidimensional scaling • map time series into a two-dimensional plane
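As a small illustration of the distance d = 1 - r², the sketch below builds the pairwise distances for three made-up series. The multidimensional scaling step itself is not shown. The series and the names `g1`, `g2`, `g3` are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residual_variance_distance(x, y):
    """d = 1 - r^2: the fraction of variance NOT explained by the correlation."""
    r = pearson_r(x, y)
    return 1.0 - r * r

# Hypothetical series (illustrative values only).
series = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [10, 8, 6, 4, 2],    # perfectly anti-correlated with g1
    "g3": [1, -1, 0, -1, 1],   # uncorrelated with g1
}

# Pairwise distances, keyed by the pair of series names.
dist = {
    (a, b): residual_variance_distance(series[a], series[b])
    for a, b in combinations(series, 2)
}
```

Because r is squared, the anti-correlated pair (`g1`, `g2`) gets distance 0 and would be drawn together in the two-dimensional map.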
Linear correlation ------visualization • Multidimensional scaling of 34 time series with high correlation
Nonlinear correlation ------Model • Spearman rank correlation, r_s • measures monotonic relationships, whether linear or not • can be used for non-Gaussian distributions • 491 pairs of expression time series, involving 98 genes, have a significant r_s, ranging from -0.979 to 0.996
Nonlinear correlation ------Example • High rank correlation but low linear correlation between mGluR1 and GRa2
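The gap between rank and linear correlation can be reproduced with a minimal sketch: Spearman's r_s is the Pearson correlation of the ranks. The data are synthetic (an exponential curve), not the mGluR1 and GRa2 measurements.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Synthetic monotonic but strongly non-linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [math.exp(v) for v in x]

r_linear = pearson_r(x, y)   # clearly below 1
r_rank = spearman_rs(x, y)   # 1 up to rounding, since y increases with x
```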
Information Theory ------mutual information • if H(A) and H(B) are the entropies of sources A and B respectively, and H(A,B) is the joint entropy of the sources, then M(A,B) = H(A) + H(B) - H(A,B) • the discrete form is much easier to use • We need to discretize the time series by partitioning the expression levels into bins
Information Theory ------Bin size • The fewer bins we use to discretize the data, the more information about the original time series we ignore. • On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin
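The discretize-then-measure procedure can be sketched as follows. The slides do not specify a binning scheme, so equal-width bins are an assumption here. The number of bins (3) and the expression values are also illustrative.

```python
import math
from collections import Counter

def discretize(series, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1 (assumed scheme)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(series)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in series]

def entropy(symbols):
    """Shannon entropy in bits, estimated from observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B) for two discretized series."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Hypothetical expression levels (illustrative values only).
gene_a = [0.0, 0.1, 0.5, 0.9, 1.0, 0.8, 0.4, 0.2, 0.0]
gene_b = [1.0, 0.9, 0.5, 0.1, 0.0, 0.2, 0.6, 0.8, 1.0]

a_bins = discretize(gene_a, 3)
b_bins = discretize(gene_b, 3)
m_ab = mutual_information(a_bins, b_bins)
```

Here `gene_b` mirrors `gene_a`, so each bin of one series corresponds to exactly one bin of the other and M(A,B) equals H(A). With only nine points, going much beyond three bins would leave most bins nearly empty, which is the trade-off described above.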
Information Theory ------Mapping • Some time series map to the same discretized series • In total, from 112 unique continuous-valued time series we get 91 discretized time series
Information Theory ------Mapping • eliminate series that are one-to-one mappings of each other, i.e. that differ only by a permutation of the bin numbers • for such pairs, H(A) = H(B) = M(A,B) • row 3 and row 4 • replace such time series by one single series, leaving us with a set of 77 unique, non-equivalent time series
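One way to collapse series that differ only by a permutation of bin numbers is to relabel the bins of each series in order of first appearance and compare the results. This is a sketch of that idea, not necessarily how the authors did it. The discretized series and gene names are hypothetical.

```python
def canonical_form(bins):
    """Relabel bins in order of first appearance.

    Two discretized series get the same tuple exactly when one is a
    one-to-one relabelling (bin permutation) of the other.
    """
    relabel = {}
    out = []
    for b in bins:
        if b not in relabel:
            relabel[b] = len(relabel)   # next unused label
        out.append(relabel[b])
    return tuple(out)

# Hypothetical discretized series (illustrative values only).
discretized = {
    "geneA": [0, 0, 1, 2, 2, 1],
    "geneB": [2, 2, 1, 0, 0, 1],   # geneA with bins 0 and 2 swapped
    "geneC": [0, 1, 1, 2, 0, 2],
}

# Group equivalent series and keep one representative per group.
groups = {}
for name, bins in discretized.items():
    groups.setdefault(canonical_form(bins), []).append(name)

representatives = [names[0] for names in groups.values()]
```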
Information Theory ------Measurement • symmetric measures • M(A,B)/max(H(A),H(B)) • M(A,B)/H(A,B) • asymmetric measures • Relative mutual information R(A,B) = M(A,B)/H(B) • R(A,B) = 1.0 means that all the information about time series B is contained in time series A
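The difference between the symmetric and asymmetric measures shows up when B is a many-to-one function of A. The sketch below uses made-up discretized series in which each value of B is the corresponding value of A with pairs of bins merged.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits, estimated from observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def relative_mutual_information(a, b):
    """R(A,B) = M(A,B) / H(B): the fraction of B's information contained in A."""
    return mutual_information(a, b) / entropy(b)

# Hypothetical discretized series: B merges bins {0,1} and {2,3} of A.
A = [0, 1, 2, 3, 0, 1, 2, 3]
B = [v // 2 for v in A]

m = mutual_information(A, B)
sym_max = m / max(entropy(A), entropy(B))      # symmetric measure
sym_joint = m / entropy(list(zip(A, B)))       # symmetric measure
r_ab = relative_mutual_information(A, B)       # A determines B completely
r_ba = relative_mutual_information(B, A)       # B only partly determines A
```

Both symmetric measures give the same value whichever way the pair is ordered, while R(A,B) and R(B,A) differ and so indicate the direction of the mapping.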
Conclusion • Linear correlation can be used very effectively to detect linear relationships • detect relationships not captured by Euclidean distance, such as high negative correlations • Rank correlation can be used to detect non-linear relationships • much more robust with respect to the distribution of expression levels • Information theory can be used to detect genes whose (binned) expression patterns share information • It will detect any mapping from time series A to B