120 likes | 231 Views
University at Buffalo. The State University of New York. 08.25.03. Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data. Daxin Jiang Jian Pei Aidong Zhang Computer Science and Engineering University at Buffalo. Microarray Technology.
E N D
University at Buffalo The State University of New York 08.25.03 Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data Daxin Jiang Jian Pei Aidong Zhang Computer Science and Engineering University at Buffalo
Microarray Technology http://www.ipam.ucla.edu/programs/fg2000/fgt_speed7.ppt • Microarray technology • Monitor the expression levels of thousands of genes in parallel • Gene Expression Data Matrix • Each row represents a gene Gi ; • Each column represents an experiment condition Sj ; • Each cell Xij is a real value representing the gene expression level of gene Gi under condition Sj; • Xij > 0: over expressed • Xij < 0: under expressed • A time-series gene expression data matrix typically contains O(103) genes and O(10) time points. Gene expression data matrix
Examples of co-expressed genes and coherent patterns in gene expressiond data Coherent Patterns and Co-expressed Genes Parallel Coordinates for a gene expression data • Why coherent patterns and co-expressed genes interesting? • Co-expression may indicates co-function; • Co-expression may also indicates co-regulation • Coherent patterns may correspond to important cellular process
Hierarchies of Co-expressed Genes and Coherent Patterns • Hierarchies of co-expressed genes and coherent patterns are typical • The interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge • Flexible tools are needed to interactively unfold the hierarchies of co-expressed genes and derive coherent patterns
High Connectivity of the Data • Groups of co-expressed genes may be highly connected by a large amount of “intermediate” genes • Two genes with completely different patterns can typically be connected by a “bridge” • It is often hard to find the clear borders among the clusters Two genes with complete different patterns connected by a “bridge”
Distance Measure • We measure the similarity and distance between two genes (objects) as follows • The similarity and distance measure defined above are consistent, i.e., given objects O1, O2 ,O3 ,and O4, similarity(O1,O2) > similarity(O3,O4) if and only if distance(O1,O2) < distance(O3,O4) dP(Oi’,Oj’) Is the Pearson’s Correlation Coefficient between Oi’ and Oj’ dE(Oi’,Oj’) Is the Euclidean distance between Oi’ and Oj’ O’ is the transformation of object O by transforming each attribute d as , And are the mean and the standard deviation of all the attributes of O, respectively.
Definition of Density • We choose the density definition by Denclue[1] • The Gussian influence function • Given a data set D d(Oi,Oj) is the distance between Oi and Oj, and is a parameter • [1] Hinneburg, A. et al. An efficient approach to clustering in large multimedia database with noise. Proc. 4th Int. Con. on Knowledge discovery and data mining, 1998.
Attraction Tree • Genes with high density “attract” other genes with low density • The “attractor” of object O is the object with the largest attraction to O • We can derive an attraction tree based on the attraction between the objects • The weight for each edge e(Oi,Oj) on the attraction tree is defined as the similarity between Oi and Oj.
Coherent Pattern Index Graph • We search the attraction tree based on the weight of edges and order the genes in the “index list” • For each gene gi in the index list g1…gn, the “coherent pattern index” is defined as • The graph plotting the coherent pattern index value w.r.t. the index list is called the “coherent pattern index graph” • A pulse in the coherent pattern index graph indicates a coherent expression pattern where p is a parameter, Sim(gi) is the similarity between gi and its parent gj on the attraction tree. Sim(gi) is set to 0 if i1 or I>n.
An Example The coherent pattern index graph A sample data set • The weight of edges on the attraction tree characterizes the coherence relationship between genes (represented by purple, cyan and brown lines) • The three pulses in the coherent graph index graph indicate the three patterns in the data set • Genes between two neighboring pulses are co-expressed genes and share coherent patterns The attraction tree
(a) (b) Interactive Exploration -- GeneXplorer • The coherent pattern index graph gives indications on how to split the genes into co-expressed groups • Suppose the user accept the 5 pulses suggested in figure (a), and click on the 2nd pulse • The system will “zoom in” the coherent pattern index graph for genes between the 1st pulse and the 2nd pulse (figure (b)) • The user can select clicking on the pulses in figure (b) and further split the genes until no split is necessary Interactive exploration on Iyer’s data[2] • [2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
Comparison With Other Approaches • We compare the patterns discovered from the Iyer’s data[2] by different approaches with the ground truth by Eisen et al.[3] • GeneXplorer identifies more patterns in the ground truth and does not report any false patterns • Pattern 5 in the ground truth is only reported by GeneXplorer • The only pattern in the ground truth (pattern 9) missed by GeneXplorer is missed by any other method Each cell represents the similarity between the pattern reported by different approaches and the corresponding pattern in the ground truth (if any) Conclusions: • The coherent pattern index graph is effective to give users highly confident indication of the existence of coherent patterns • The GeneXplorer provides interactive exploration to integrate user’s domain knowledge • [3] Eisen M.B. et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868, 1998.