Gene Interaction Analysis Using k-way Interaction Loglinear Model: A Case Study on Yeast Data
Xintao Wu, UNC Charlotte
Daniel Barbara, George Mason Univ.
Liying Zhang, Memorial Sloan Kettering Cancer Center
Yong Ye, UNC Charlotte
Machine Learning in Bioinformatics’03, Washington, D.C.

Microarray data
• The raw microarray images are transformed into gene expression matrices, where
  • the rows denote genes
  • the columns denote various samples, conditions, or time points
  • each entry corresponds to the expression value of one gene in one sample
• Comparison with market basket data

Background -- Clustering
• Clustering over genes
  • CAST, Ben-Dor et al 1999
  • MST, Xu et al 2002
  • HCS, Hartuv & Shamir 2000
  • CLICK, Shamir & Shamir 2000
• Drawbacks
  • Each gene is assigned to only one cluster, although a gene can take part in several pathways (e.g., p53 protein)
  • It is impossible to determine the interactions among genes within one cluster

Background -- Interaction analysis
• Association rule, Creighton & Hanash 03
  • Need to discretize data
  • Associations instead of interactions
  • Undirected
• Graphical Gaussian model, Kishino & Waddell 00
  • No need to discretize data
  • Only pairwise interactions
  • Undirected
• Bayesian network, Segal et al 03
  • Pairwise interactions
  • Directed
  • High complexity

Background -- Association Rule
• An association rule X => Y must satisfy a minimum support and a minimum confidence
  • support, s = P(X ∪ Y), the probability that a transaction contains {X ∪ Y}
  • confidence, c = P(Y|X), the conditional probability that a transaction containing X also contains Y
• Efficient algorithms
  • Apriori by Agrawal & Srikant, VLDB94
  • FP-tree by Han, Pei & Yin, SIGMOD 2000
  • etc.
• Example of rules discovered in microarray data
  • When genes A and B are over-expressed within a sample, gene C is often over-expressed too.
• Pros
  • One gene can be assigned to any number of rules (pathways).
• Cons
  • Gene co-expression instead of interaction
(Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both.)

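A minimal sketch of the two measures on this slide, assuming each sample is represented as the set of genes that are over-expressed in it. The gene names and the toy data are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical data: each set lists the genes over-expressed in one sample.
samples = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"C"}, {"A"}]

s = support(samples, {"A", "B", "C"})       # 2 of the 5 samples
c = confidence(samples, {"A", "B"}, {"C"})  # 2 of the 3 samples with A and B
```

Here the rule {A, B} => {C} has support 2/5 and confidence 2/3.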
Criticism of Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)
  • Among 5000 students
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  • play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  • play basketball => not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

Criticism of Support and Confidence
• We need a measure of dependent or correlated events
• P(Y|X)/P(Y) is also called the lift of rule X => Y

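The lift of the two basketball rules can be checked directly from the counts in Example 1. This is a small sketch; the function name and argument names are illustrative only.

```python
def lift(n_total, n_x, n_y, n_xy):
    """Lift of X => Y, i.e. P(Y|X) / P(Y), computed from raw counts."""
    confidence = n_xy / n_x
    return confidence / (n_y / n_total)


# 5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both.
lift_cereal = lift(5000, 3000, 3750, 2000)
# 1250 students do not eat cereal; 1000 of them play basketball.
lift_no_cereal = lift(5000, 3000, 1250, 1000)
```

A lift below 1 (basketball => cereal, 8/9) indicates negative dependence, and a lift above 1 (basketball => not cereal, 4/3) indicates positive dependence, which is why the second rule is the more informative one.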
Criticism of lift
• Suppose a triple ABC is unusually frequent. This can be because
  • Case 1: AB and/or AC and/or BC are unusually frequent
  • Case 2: there is something special about the triple, so that all three occur together frequently
• Example 2: (DuMouchel & Pregibon, KDD 01)
  • Suppose that, in a database of patient adverse drug reactions, A and B are two drugs and C is the occurrence of kidney failure
  • Case 1: A and B may act independently upon the kidney; many occurrences of ABC arise because A and B are sometimes prescribed together
  • Case 2: A and B may have no effect on the kidney if taken alone, but when taken together a drug interaction occurs that often leads to kidney failure
  • Case 3: A and B may each have a small effect on the kidney if taken alone, but a strong effect when taken together

Criticism of lift
• EXCESS2: an estimate of the number of transactions containing the item set over and above those that can be explained by the pairwise associations of the items
• It compares two quantities:
  • the shrinkage estimate of the item set count (or we can use the raw count)
  • the predicted count of the all-two-factor model, based on the two-way distributions

Motivation
• EXCESS2
  • By analyzing residuals, we can pick up the multi-item associations that cannot be explained by all the pairwise associations included in the all-2-way model.
  • It can separate cases 2 and 3 from case 1.
  • It does not include multi-way interactions.
• Our contribution
  • Extend the all-two-factor model to a general k-way loglinear model
  • Apply association rules to identify gene sets for further analysis

Saturated log-linear model
• 1-factor effect: main effect
• 2-factor effect: shows the dependency within the distributions of A, B

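For reference, the textbook form of the saturated loglinear model for two variables A and B is sketched below. The notation is the usual convention and is an assumption here: m_ij is the expected count in cell (i, j), λ is the overall effect, λ^A and λ^B are the 1-factor (main) effects, and λ^AB is the 2-factor effect.

```latex
\log m_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{AB}_{ij}
```

With more variables, the same pattern continues with one term for every subset of the variables, up to the highest-order interaction.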
Computing λ-terms
• Linear constraints on the coefficients: loglinear parameters sum to 0 over all indices
• UpDown method (Sarawagi et al, EDBT98)

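A minimal sketch of how the sum-to-zero constraints determine the λ-terms of a saturated two-way model: each term is obtained by averaging log counts. This is direct averaging with NumPy, not the UpDown method cited above, and it assumes every cell count is positive. The table values are made up.

```python
import numpy as np


def saturated_lambda_terms(counts):
    """Return (overall, A effects, B effects, AB effects) for a 2-way table."""
    log_m = np.log(np.asarray(counts, dtype=float))
    grand = log_m.mean()
    row_mean = log_m.mean(axis=1, keepdims=True)
    col_mean = log_m.mean(axis=0, keepdims=True)
    lam_a = (row_mean - grand).ravel()           # 1-factor effect of A
    lam_b = (col_mean - grand).ravel()           # 1-factor effect of B
    lam_ab = log_m - row_mean - col_mean + grand  # 2-factor effect of A, B
    return grand, lam_a, lam_b, lam_ab


# Hypothetical 2 x 2 table of counts (rows: states of A, columns: states of B).
counts = [[30, 10], [20, 40]]
grand, lam_a, lam_b, lam_ab = saturated_lambda_terms(counts)
```

Each group of terms sums to zero over every index, and adding the four terms back together reproduces the log counts exactly, which is what makes the model saturated.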
k-way loglinear model
• Comparison with lift, EXCESS2
  • Independence model
  • Pairwise model
  • 3-way model

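To make the comparison concrete, the sketch below fits the independence model and the pairwise (all-two-way) model to a three-variable table. It uses iterative proportional fitting, a standard way to fit hierarchical loglinear models; the slides do not say which fitting procedure the authors used, and the table values are made up. Positive margins are assumed.

```python
import numpy as np


def fit_independence(observed):
    """Independence model: product of the one-way margins divided by N^2."""
    obs = np.asarray(observed, dtype=float)
    n = obs.sum()
    a = obs.sum(axis=(1, 2))
    b = obs.sum(axis=(0, 2))
    c = obs.sum(axis=(0, 1))
    return np.einsum("i,j,k->ijk", a, b, c) / n**2


def fit_all_two_way(observed, n_iter=500, tol=1e-10):
    """Pairwise model: match all three two-way margins by IPF."""
    obs = np.asarray(observed, dtype=float)
    fitted = np.full(obs.shape, obs.sum() / obs.size)
    for _ in range(n_iter):
        previous = fitted.copy()
        for axis in range(3):
            # Summing out one axis leaves the two-way margin of the other two.
            target = obs.sum(axis=axis, keepdims=True)
            fitted = fitted * target / fitted.sum(axis=axis, keepdims=True)
        if np.abs(fitted - previous).max() < tol:
            break
    return fitted


# Hypothetical 2 x 2 x 2 table of counts for three genes.
observed = np.array([[[20, 10], [15, 25]], [[12, 18], [30, 20]]])
independent = fit_independence(observed)
pairwise = fit_all_two_way(observed)
# Raw-count analogue of EXCESS2: what the pairwise model leaves unexplained.
excess = observed - pairwise
```

For three variables the 3-way model is saturated, so its fitted values equal the observed counts.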
Our Method
• Step 1: transform the raw gene expression data into a boolean matrix
• Step 2: apply the Apriori method to find all frequent gene sets
• Step 3: for k = 1 to K
  • For each large gene set
    • Fit the k-way interaction model
    • If its standard residual
      • Include s into

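Step 2 can be sketched with a simplified level-wise Apriori. This is an illustrative version without the optimizations of the published algorithm; the function name, the toy samples, and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        level = {}
        for cand in candidates:
            supp = sum(cand <= t for t in transactions) / n
            if supp >= min_support:
                level[cand] = supp
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-sets, then prune any candidate that has an
        # infrequent (k-1)-subset.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical boolean data: genes over-expressed in each of five samples.
samples = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"C"}, {"A"}]
frequent_sets = apriori(samples, min_support=0.5)
```

Each frequent gene set returned here would then be passed to Step 3 for loglinear fitting.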
Preprocessing
• The expression values need to be discretized into categories, e.g., overexpressed, normal, underexpressed.
  • >0.2: overexpressed
  • (-0.2, 0.2): normal
  • <-0.2: underexpressed

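The thresholds above translate into a short discretization routine. This is a sketch; the integer coding (1, 0, -1) is an arbitrary choice, and values exactly equal to ±0.2, which the slide does not assign, are treated as normal here.

```python
import numpy as np


def discretize(expression, threshold=0.2):
    """Map expression values to 1 (over), 0 (normal), -1 (under)."""
    expression = np.asarray(expression, dtype=float)
    labels = np.zeros(expression.shape, dtype=int)
    labels[expression > threshold] = 1
    labels[expression < -threshold] = -1
    return labels


# Hypothetical expression values for one gene across four samples.
labels = discretize([0.5, 0.0, -0.3, 0.2])
```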
Contingency table
• For each frequent itemset s discovered by Apriori, we need to build a contingency table for further k-way interaction analysis
• Note: the application of loglinear modeling is constrained by the sample size
  • Loglinear modeling requires that the number of samples be larger than the number of cells in the contingency table
(Figure: contingency table built from a frequent set.)

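A sketch of building the contingency table for one gene set, assuming the discretized data are stored as a genes-by-samples integer matrix with states coded 0 to n_categories - 1. The matrix, the helper names, and the gene indices are hypothetical.

```python
import numpy as np


def contingency_table(states, gene_idx, n_categories=2):
    """Count samples for each combination of states of the selected genes."""
    sub = np.asarray(states)[list(gene_idx), :]
    table = np.zeros((n_categories,) * len(gene_idx), dtype=int)
    for sample in sub.T:
        table[tuple(sample)] += 1
    return table


def has_enough_samples(table):
    """Check the slide's condition: more samples than cells."""
    return table.sum() > table.size


# Hypothetical boolean matrix: 3 genes (rows) by 6 samples (columns).
states = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1],
])
table = contingency_table(states, [0, 1])
```

The number of cells grows exponentially with the size of the gene set, so the sample-size condition quickly limits how large a gene set can be analyzed.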
Examine residuals
• Analysis of residuals allows cell-by-cell comparison of observed and fitted frequencies.
• The standard residual is asymptotically normal with mean 0.

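The formula on this slide is not reproduced in the text. The sketch below assumes the common standardized (Pearson) residual, (observed - fitted) / sqrt(fitted), computed per cell; the example counts are made up and the fitted values are those of the independence model for that table.

```python
import numpy as np


def standardized_residuals(observed, fitted):
    """Pearson residual per cell: (observed - fitted) / sqrt(fitted)."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    return (observed - fitted) / np.sqrt(fitted)


# Hypothetical 2 x 2 table and its independence-model fit
# (row totals 40 and 60, column totals 50 and 50, N = 100).
observed = [[30, 10], [20, 40]]
fitted = [[20, 20], [30, 30]]
residuals = standardized_residuals(observed, fitted)
```

Cells whose residuals are large in absolute value are the ones the fitted model fails to explain.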
Experimental Results
• Yeast data
  • 6316 genes, 300 samples
  • >0.2 overexpressed, (-0.2, 0.2) normal, <-0.2 underexpressed
• Many frequent gene sets are screened out by the all-k-way interaction model as k increases.
(Table: the size of frequent item sets from Apriori, and the size of item sets that cannot be interpreted by the k-way model.)

Experimental Results
(Table: the frequencies and estimates from all k-way interactions, with ORF naming.)

Experimental results
• Our results agree with some previously known biological interactions
  • Refer to the paper for details
• Our results also reveal some previously unknown interactions that have solid biological explanations
  • Refer to the paper for details

Lattice for λ-terms of the saturated model (2-category case)
• Obs. 1: each λ-term has only one absolute value, because each gene can only have two states: over-expressed or under-expressed
• Obs. 2: we can compare the interactions by the magnitude of their λ-terms derived from the saturated models
• Lattice values, by level:
  • All (4.560)
  • A (0.284), B (1.407), C (1.493), D (-0.144)
  • AB (-0.044), AC (0.681), AD (-0.006), BC (-0.765), BD (-0.296), CD (0.245)
  • ABC (0.233), ABD (-0.185), ACD (-0.118), BCD (-0.093)
  • ABCD (0.038)

Two-category vs. multi-category
• Two-category: we can directly compare the interactions based on the λ-terms derived from loglinear models
  • E.g., we can derive a positive interaction between A and C (0.681), a negative interaction between B and C (-0.765), no significant interaction between A and B (-0.044), and a positive three-factor interaction among A, B, and C (0.233)
  • Not enough for analysis at a finer level, e.g., what is the effect of weak over-expression of genes A and B on gene C?
• Multi-category: we cannot directly compare, as the d.f. (and variance) is different for each interaction.
  • The values do not necessarily imply that the interaction of AC is greater than that of CD.
  • A test statistic needs to be formed.

Framework (ongoing)

Preprocessing
• Preprocessing is used to get a subset of genes for further interaction analysis.
  • Hierarchical clustering
  • Association rule
  • Specified by a domain user based on known pathways
• Preprocessing is necessary because
  • Graphical Gaussian modeling is bounded by the number of samples
  • Loglinear modeling is bounded by the number of cells in the contingency table, i.e., the number of samples should be 5 times larger than the number of cells.

Interaction Modeling
• Graphical Gaussian modeling is used to generate pairwise interactions for a relatively large subset of genes.
  • No information loss
  • Efficient
  • The independence graph may also indicate the interactions among several pathways.
• The independence graph is decomposed to get components.
• Loglinear modeling is used to generate multi-way interactions among genes in each component.

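One simple reading of the decomposition step is to split the independence graph into connected components and hand each component to loglinear modeling. The slide does not say which decomposition is used, so this interpretation is an assumption, and the node numbering and edge list below are hypothetical.

```python
def connected_components(n_nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    neighbours = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components


# Hypothetical independence graph over five genes (nodes 0..4).
components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```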
Snapshot of prototype system

Thank you!

Graphical Gaussian modeling
• GGM assumes a family of normal distributions for the underlying data, constrained to satisfy the pairwise conditional independence restrictions inherent in the independence graph.
  • The microarray expression data, which are log-transformed from raw image data, approximately follow a multivariate normal distribution
• Partial correlation
  • The correlation between two variables after the common effects of the other variables are removed
  • For a set of genes, it is computed from the xy-th element of the inverse of the variance matrix
• No edge is included in the graph if the partial correlation is less than some threshold

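The slide's formula is missing from the text. The sketch below uses the standard relation between partial correlations and the inverse covariance (precision) matrix P: the partial correlation of x and y is -P[x, y] / sqrt(P[x, x] * P[y, y]). The random data, the threshold value, and the use of the absolute value for edge selection are assumptions for illustration.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation matrix from a genes-by-samples data matrix."""
    precision = np.linalg.inv(np.cov(data))  # rows of `data` are variables
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def independence_graph_edges(pcor, threshold):
    """Keep an edge x-y only if |partial correlation| reaches the threshold."""
    n = pcor.shape[0]
    return [(x, y) for x in range(n) for y in range(x + 1, n)
            if abs(pcor[x, y]) >= threshold]


# Hypothetical log-expression data: 4 genes by 50 samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(4, 50))
pcor = partial_correlations(data)
edges = independence_graph_edges(pcor, threshold=0.3)  # threshold is arbitrary
```

The covariance matrix is only invertible when there are more samples than genes, which is the sample-size bound on graphical Gaussian modeling mentioned in the preprocessing slide.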
Loglinear modeling
• The difference from market basket data is that each gene can have multiple categories (e.g., over-expressed, normal, under-expressed), which depend on the discretization strategy.