Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes. Yang Xiang Department of Biomedical Informatics, The Ohio State University Homepage: http://bmi.osu.edu/~yxiang
Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes • Yang Xiang, Department of Biomedical Informatics, The Ohio State University • Homepage: http://bmi.osu.edu/~yxiang • Joint work with Philip R.O. Payne and Kun Huang • To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics
Motivation: Netflix problem • The Netflix Problem: Given the current user ratings, how should movies be recommended to users? [Figure: rating matrix; axes: Movies, Users]
Motivation: Matrix Completion • 0s are unsampled entries; other values are sampled entries. [Figure: rating matrix; axes: Movies, Users] • Can we recover matrices of this kind?
Matrix Completion Theory and Methods • If the number m of sampled entries obeys $m \ge C\,n^{1.2}\,r \log n$ for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered. [Candès and Recht (2009). Exact Matrix Completion via Convex Optimization, Foundations of Computational Mathematics, 9(6), 717-772.] • Matrix completion methods (http://perception.csl.uiuc.edu/matrix-rank/sample_code.html#MC) • Singular Value Thresholding • OptSpace • Accelerated Proximal Gradient • Subspace Evolution and Transfer • GROUSE
Transactional Database • (0,1)-matrix • Bipartite graph [Figure: the same transactional database shown as a (0,1)-matrix and as a bipartite graph; transactions 1 to 5, items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples]
Question: Can a (0,1)-matrix be completed? Consider each transaction to be a customer. What is each customer's attitude towards un-purchased items (i.e., 0 entries)? The sampling model used for matrix completion does not fit well here: it would treat every non-zero as a sampled entry and every zero as an unsampled entry.
Our proposal: (0,1)-matrix transformation • An entry is evaluated by its supporting patterns (independent evidence). • P is a supporting pattern for entry (i,j) if and only if P covers (i,j) and M(x,y)=1 for every entry (x,y) ∈ P\{(i,j)}. • Since the value of (i,j) is not considered, the supporting patterns of an entry are independent of the entry's value.
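The definition above can be checked mechanically. The sketch below is only an illustration: it assumes a pattern P is given as a set of rows and a set of columns (a submatrix), and the function name `is_supporting_pattern` and the small matrix are hypothetical.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]

def is_supporting_pattern(M: Matrix, rows: Set[int], cols: Set[int],
                          entry: Tuple[int, int]) -> bool:
    """Return True if the submatrix rows x cols is a supporting pattern for entry."""
    i, j = entry
    # The pattern must cover the entry itself.
    if i not in rows or j not in cols:
        return False
    # Every other covered cell must be 1; M[i][j] itself is never read.
    return all(M[x][y] == 1 for x in rows for y in cols if (x, y) != (i, j))

# Hypothetical 3x3 example matrix.
M = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
print(is_supporting_pattern(M, {0, 1}, {0, 1}, (1, 1)))  # True
print(is_supporting_pattern(M, {0, 1}, {0, 1}, (0, 0)))  # False, because M[1][1] == 0
```

Because the check skips the cell (i,j), flipping that cell between 0 and 1 cannot change the answer, which is the independence property stated in the last bullet.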
Support Pattern Measurement used in this work • Biomedical Informatics question: How can M be efficiently transformed into F defined above, such that F gives unbiased predictions of the unknown gene-phenotype relationships?
Find supporting patterns and calculate F(i,j) for one entry [Figure: (0,1)-matrix and bipartite graph over rows 1 to 8 and columns a to j] • Find supporting patterns for the magenta entry (4,d) [Figure: subgraph over rows 2, 3, 5, 6, 8 and columns b, c, e, f, g] • Find the maximum edge biclique: F(4,d)=6
Maximal biclique and maximum edge biclique • A biclique is maximal if it cannot be extended. • A maximum edge biclique is a maximal biclique with the maximum number of edges. • Listing all maximal bicliques is an NP-hard problem. Finding one maximum edge biclique is NP-hard too.
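To make the two notions concrete, here is a minimal brute-force sketch on a tiny (0,1)-matrix. It is not the method used in this work: it enumerates every column subset, so it is exponential in the number of columns and is only meant for toy inputs. The function name and the matrix are hypothetical.

```python
from itertools import combinations

def maximum_edge_biclique(M):
    """Exhaustive search over column subsets; returns (edges, rows, cols)."""
    n_rows, n_cols = len(M), len(M[0])
    best = (0, (), ())
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            # All rows holding a 1 in every chosen column: the largest
            # row set that forms a biclique with these columns.
            rows = tuple(r for r in range(n_rows) if all(M[r][c] for c in cols))
            edges = len(rows) * len(cols)
            if edges > best[0]:
                best = (edges, rows, cols)
    return best

# Hypothetical matrix: rows 0-2 share columns 0 and 1.
M = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
print(maximum_edge_biclique(M))  # (6, (0, 1, 2), (0, 1))
```

The biclique with rows {0, 1, 2} and columns {0, 1} is maximal, since adding row 3 or column 2 would introduce a 0, and it has the most edges (3 × 2 = 6) among all bicliques of this matrix.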
Solutions for listing all maximal bicliques • Association Rule Mining • Frequent Itemset: an itemset whose support is no less than a minimum support (minsup) threshold. In the transaction example, with minsup=3, {beer} is a frequent itemset, and so is {beer, coke}. • (Frequent) Closed Itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset. • Maximal Frequent Itemset: an itemset is maximal frequent if none of its immediate supersets is frequent.
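The three definitions can be compared side by side with a small exhaustive sketch. The transactions below are a hypothetical toy database, not the one from the slide figure, and `classify` is an illustrative name; real miners such as MAFIA avoid this brute-force enumeration.

```python
from itertools import combinations

# Hypothetical toy transactions (not the database shown on the slide).
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def classify(minsup):
    """Return (frequent, closed frequent, maximal frequent) itemsets."""
    frequent, closed, maximal = [], [], []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            sup = support(s)
            if sup < minsup:
                continue
            frequent.append(s)
            # Immediate supersets: add exactly one missing item.
            supersets = [s | {x} for x in items if x not in s]
            if all(support(t) < sup for t in supersets):
                closed.append(s)      # no immediate superset keeps the support
            if all(support(t) < minsup for t in supersets):
                maximal.append(s)     # no immediate superset is frequent
    return frequent, closed, maximal

frequent, closed, maximal = classify(minsup=3)
print(len(frequent), len(closed), len(maximal))  # 8 7 4
```

In this toy data {beer} is frequent but not closed, because {beer, diaper} has the same support of 3. Every maximal frequent itemset is also closed, so the three families are nested.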
Solutions for listing all maximal bicliques • A closed itemset together with its supporting transaction set corresponds exactly to a maximal biclique in the corresponding bipartite graph. • Use frequent closed itemsets to approximate the closed itemsets. • MAFIA: mines frequent itemsets, frequent closed itemsets, and maximal frequent itemsets. http://himalaya-tools.sourceforge.net/Mafia/ [Figure: bipartite graph between transactions 1 to 5 and items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples]
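The correspondence in the first bullet can be seen on a three-transaction example. This is a hedged sketch with hypothetical data and a hypothetical function name; it tests closedness through the closure (the intersection of all supporting transactions), which is equivalent to the superset-based definition.

```python
from itertools import combinations

def closed_itemsets_with_tids(transactions):
    """Map each closed itemset to the ids of the transactions supporting it."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            tids = frozenset(t for t, tx in enumerate(transactions) if s <= tx)
            if not tids:
                continue  # itemsets with empty support give no biclique
            # Closure: the items shared by all supporting transactions.
            closure = frozenset.intersection(
                *(frozenset(transactions[t]) for t in tids))
            if closure == s:
                result[s] = tids
    return result

# Hypothetical toy database; transaction ids are list positions 0, 1, 2.
transactions = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"beer"}]
for itemset, tids in closed_itemsets_with_tids(transactions).items():
    print(sorted(tids), sorted(itemset))
```

Each printed pair is a maximal biclique of the bipartite graph: no item can be added because the itemset equals its closure, and no transaction can be added because the transaction set already contains every transaction that supports the itemset.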
Solution Summary for one entry (i,j) • Construct the submatrix corresponding to the entry (i,j). • Use a frequent closed itemset mining tool to build frequent closed itemsets (set the support threshold as low as the computer can handle). • Build the supporting transactions for the frequent closed itemsets, which gives all the candidate maximal bicliques. • Find the maximum edge biclique and obtain the F(i,j) value.
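The steps above can be sketched end to end on a toy matrix. Two assumptions are made here that the slides do not spell out: the submatrix for (i,j) keeps rows x ≠ i with M(x,j)=1 and columns y ≠ j with M(i,y)=1, and F(i,j) is the edge count of the maximum edge biclique in that submatrix (consistent with the F(4,d)=6 example). The sketch also swaps the closed-itemset miner for an exhaustive search, and all names are hypothetical.

```python
from itertools import combinations

def entry_submatrix(M, i, j):
    """Rows (other than i) with a 1 in column j; columns (other than j) with a 1 in row i."""
    rows = [x for x in range(len(M)) if x != i and M[x][j] == 1]
    cols = [y for y in range(len(M[0])) if y != j and M[i][y] == 1]
    return rows, cols

def f_score(M, i, j):
    """Assumed F(i,j): edges of the maximum edge biclique in the submatrix."""
    rows, cols = entry_submatrix(M, i, j)
    best = 0
    # Exhaustive stand-in for frequent closed itemset mining (toy sizes only).
    for k in range(1, len(cols) + 1):
        for chosen in combinations(cols, k):
            supporting = [x for x in rows if all(M[x][y] == 1 for y in chosen)]
            best = max(best, len(supporting) * len(chosen))
    return best

# Hypothetical matrix; entry (2, 2) is a 0 whose evidence we want to score.
M = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
print(f_score(M, 2, 2))  # 4
```

Neither function reads M[i][j], so setting the entry to 1 leaves the score unchanged. Repeating this per entry is what the next slide argues against for a full m×n matrix.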
How about all entries? • The previous solution handles one entry. What about all entries in an m×n matrix? • Simply repeating the previous calculation m×n times is not a wise choice.
IndEvi Algorithm in a Nutshell • Assume the input is the set of maximal bicliques of the original (0,1)-matrix. • Project each maximal biclique horizontally and vertically. Let C be the maximal biclique shown by the shaded area. Can you figure out how to calculate F_C(i,j) for an entry (i,j)? • Each entry remembers the largest F_C(i,j) with respect to all C. Please refer to the paper for the algorithm details.
IndEviRe Algorithm: Independent Evidence Reconstruction • The IndEvi algorithm ensures that each entry (i,j) remembers the largest F_C(i,j) value and the corresponding reference to C in the set of maximal bicliques. • The IndEviRe algorithm reconstructs the supporting pattern from that reference and the value of (i,j).
Application in Prioritizing Human Disease Genes • Transactional data: gene-to-phenotype (G2P) dataset from http://human-phenotype-ontology.org (10/03/2010) • Closed itemset generator: MAFIA http://himalaya-tools.sourceforge.net/Mafia/ • Platform: Linux, C++, STL • Cross-validation platform (10/04/2010): www.geneanswers.com (GACOM)
Measurement: Fold Enrichment • Intuitively, fold enrichment measures how well known disease genes are ranked among all genes.
Results • Among all 34503 (=|E|) known gene-phenotype relations, 4598 (=|E’|) have the gene ranked among the top 0.1107% (=y%) of the 1807 candidate genes for the phenotype, achieving a 120.4-fold enrichment (x/y=13.3264/0.1107). [Figure: axis label Rank Cutoff]
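The reported figure can be reproduced from the numbers on this slide. The sketch assumes x is the percentage |E’|/|E| of known relations that fall within the cutoff, which matches the stated x=13.3264; the function name is hypothetical.

```python
def fold_enrichment(n_known, n_top, y_percent):
    """Return (x%, x/y), where x% is the share of known relations within the cutoff."""
    x_percent = 100.0 * n_top / n_known
    return x_percent, x_percent / y_percent

# Values taken from the results above: |E| = 34503, |E'| = 4598, y = 0.1107.
x_percent, fold = fold_enrichment(34503, 4598, 0.1107)
print(round(x_percent, 4), round(fold, 1))  # 13.3264 120.4
```

A fold enrichment of 1 would mean the known disease genes are ranked no better than chance, so a value near 120 indicates they are strongly concentrated at the top of the ranking.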
Case Study: Osteoarthritis • Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS} • Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, MITRAL VALVE PROLAPSE, OSTEOARTHRITIS}
Conclusion • The supporting patterns for an entry in a (0,1)-matrix are a good resource for knowledge inference. • Frequent closed itemset mining provides a practical platform for solving our problems. • The IndEvi and IndEviRe algorithms can efficiently calculate the F score and reconstruct the evidence for any entry, given the maximal bicliques as input. The result for an entry is independent of its original value (0 or 1). Only one call of frequent closed itemset mining on the original matrix is necessary. • Readers may revise the F function for different applications. • The algorithm is simple to implement, and the result is easy to analyze. Our method has a wide range of applications. • The study on human gene-phenotype data shows that our method is efficient and effective.
Thanks! Questions?