Exploratory Tools for Follow-up Studies to Microarray Experiments
Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska
Ohio State and Kent State Universities
Overall Motivation
• Biological literature is vast
• Need tools to find interesting patterns in the literature
• Specific example
  • Identify genes from DNA microarrays and other gene and protein assays
• Next step
  • What is known about these genes?
  • How are these genes related to each other, or to other genes identified in similar studies?
  • Which other genes are most similar?
Outline
• Hypergraph Mining
• Similarity Measures
• Evaluation and Observations
Hypergraph Mining: Motivating Example
• A microarray experiment suggests that a small set of genes is related to a disease
• Confirm by searching the existing literature: related genes are expected to appear together
• However, suppose genes A and C are related and both are weakly related to another term B
• In the literature, one would expect
  • A and C to appear together, AND/OR
  • A and B to appear together
  • B and C to appear together
• How do we efficiently conclude that A and C are actually related?
Hypergraph Mining
• Basic motivation
  • Find useful “transitive relations” (hyperedges) among genes
• Example (gene-disease relationship)
  • Gene A is related to a term B
  • Term B is related to a gene C
  • Is gene A related to gene C?
• Gene source
  • Microarray experiments
• Information source
  • Online literature abstracts
Formal Problem Definition
• Given
  • A dictionary KT
  • A set KM of user-provided keywords (KT ⊃ KM)
  • A collection of literature abstracts, each represented as a set of words from the dictionary
• Task
  • Find hyperedges that exceed a user-defined threshold; each hyperedge involves a set of keywords from KM that are potentially connected by a set of linking words from KT − KM
Relationship to Work on Frequent Pattern Mining
• Frequent itemset mining
  • Can represent each abstract as a transaction containing several keywords
  • Finds sets of keywords that often appear together
  • Cannot capture cross relationships
• Differences
  • How do we define support?
  • How do we prune the search space?
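To make the limitation concrete, here is a minimal Python sketch of plain itemset support over abstracts treated as transactions. The toy abstracts and the function name `support` are illustrative assumptions, not taken from the slides.

```python
def support(itemset, transactions):
    """Number of transactions that contain every keyword in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy abstracts: A and C never co-occur, but both co-occur with B.
abstracts = [{"A", "B"}, {"B", "C"}, {"A", "B", "D"}]

print(support({"A", "B"}, abstracts))  # 2
print(support({"A", "C"}, abstracts))  # 0
```

Plain support reports 0 for {A, C}, so a frequent itemset miner would discard the pair even though both genes are linked through B. This is the cross relationship that support alone cannot capture.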
Solution Approach
• Define
  • total weight = support + cross support
  • Support: the set of keywords appears together in one document
  • Cross support: the set of keywords can be partitioned so that
    • each part appears in a different document
    • the documents share common linking words
• Issues
  • The downward closure property does not hold for total weight, but a modified downward closure property can be defined
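The slides do not give an exact formula for cross support, so the following Python sketch fixes one possible reading as an assumption: split the keyword set into two non-empty parts, look for two different documents that each contain one part, and count the distinct linking words those documents share. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def support(keywords, docs):
    """Number of documents that contain all keywords together."""
    return sum(1 for d in docs if keywords <= d)


def two_part_partitions(keywords):
    """Yield each split of `keywords` into two non-empty parts exactly once."""
    items = sorted(keywords)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):  # the second part always keeps at least one item
        for extra in combinations(rest, r):
            left = frozenset((first,) + extra)
            yield left, frozenset(items) - left


def cross_support(keywords, docs, linking_words):
    """ASSUMED definition: number of distinct linking words shared by two
    different documents that each contain one part of the keyword set."""
    links = set()
    for left, right in two_part_partitions(keywords):
        for i, d1 in enumerate(docs):
            if not left <= d1:
                continue
            for j, d2 in enumerate(docs):
                if i != j and right <= d2:
                    links |= d1 & d2 & linking_words
    return len(links)


def total_weight(keywords, docs, linking_words):
    return support(keywords, docs) + cross_support(keywords, docs, linking_words)


# Hypothetical toy data: B links A and C across two different abstracts.
docs = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "D"}]
linking = {"B", "D"}
print(total_weight({"A", "C"}, docs, linking))  # 2 (support 1 + cross support 1)
```

Under this reading, cross support can never exceed the number of linking words, which is one way to obtain the boundedness the approach relies on.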
Idea
• Support satisfies the downward closure property
  • Let X be a set and Ω its power set. A function f : Ω → R+ satisfies the downward closure property if, for all A, B ∈ Ω with A ⊃ B, f(B) ≥ f(A)
• Cross support can be designed to stay below a fixed value, i.e., it is bounded
• Form a function h as the sum of two functions, h = f + g
  • f satisfies the downward closure property
  • g is bounded
  • h satisfies a modified downward closure property
    • For any θ ≥ 0, if h(K_n) ≥ θ then f(K_{n-1}) ≥ max{0, θ − sup(g)}
• This property can be used to devise an efficient algorithm
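The modified property follows in three short steps. The derivation below assumes that K_n denotes a set of n keywords and K_{n-1} any subset of it with n − 1 keywords, which the slide does not state explicitly.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\implies f(K_n) \ge \theta - g(K_n) \ge \theta - \sup(g)
     && \text{since } h = f + g \text{ and } g(K_n) \le \sup(g) \\
  &\implies f(K_{n-1}) \ge f(K_n) \ge \theta - \sup(g)
     && \text{downward closure of } f,\ K_{n-1} \subset K_n \\
  &\implies f(K_{n-1}) \ge \max\{0,\ \theta - \sup(g)\}
     && \text{since } f \ge 0 .
\end{align*}
```

Read contrapositively, any (n − 1)-keyword set whose support falls below max{0, θ − sup(g)} cannot be extended to an n-keyword set with total weight at least θ, so it can be pruned.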
Outline
• Hypergraph Mining
• Similarity Measures
• Evaluation and Observations
Similarity Measure among Sets of Genes
• Given two lists of gene names
• Need to find the most similar genes, based on occurrences in literature abstracts
• Standard statistical approach
  • Each file containing gene names can be treated as a discrete random variable (DRV)
  • Each DRV can take several values (gene names)
  • For two such files X, Y and any pair (x, y),
    • the joint probability mass function is p(x, y) = P(X = x, Y = y)
    • it is computed from online abstracts based on co-occurrence
Probability Computation
• Assume
  • File X has n gene names x_i, i ∈ {1, …, n}
  • File Y has m gene names y_j, j ∈ {1, …, m}
  • M(i,j) is the number of times (x_i, y_j) appear together in transactions (article abstracts)
• Then
  • p(x_i, y_j) = M(i,j) / ∑_i ∑_j M(i,j)
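A minimal Python sketch of this computation, assuming each abstract is available as a set of words. The gene names, the toy abstracts, and the function names are hypothetical.

```python
def cooccurrence_matrix(xs, ys, abstracts):
    """M[i][j] = number of abstracts containing both xs[i] and ys[j]."""
    return [[sum(1 for a in abstracts if x in a and y in a) for y in ys]
            for x in xs]


def joint_pmf(M):
    """p[i][j] = M[i][j] / sum of all entries of M."""
    total = sum(sum(row) for row in M)
    if total == 0:
        raise ValueError("no co-occurrences: the pmf is undefined")
    return [[m / total for m in row] for row in M]


# Hypothetical toy data.
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g3"}]

M = cooccurrence_matrix(xs, ys, abstracts)
p = joint_pmf(M)
print(M)  # [[2, 1], [0, 1]]
print(p)  # [[0.5, 0.25], [0.0, 0.25]]
```

Raising an error when no pair co-occurs is a choice made for this sketch; the slides do not say how that case is handled.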
Expectation Computation
• Now define
  • Z = g(X, Y), where g : X × Y → [0, ∞)
  • Clearly, Z is a random variable
• The expectation of Z is
  • E(Z) = E(g(X, Y)) = ∑_i ∑_j g(x_i, y_j) M(i,j) / M_t
  • where M_t = ∑_i ∑_j M(i,j)
• The expected value of Z can be used directly as a similarity measure
• Different choices of g give rise to different similarity measures
Some Choices of the Function g
• First choice
  • Choose g(x_i, y_j) = M(i,j)
  • This leads to the similarity measure se1 = ∑_i ∑_j M(i,j)^2 / M_t
• Second choice
  • Choose g(x_i, y_j) = tot_length(x_i, y_j), the sum of the lengths of the transactions in which x_i and y_j co-occur
  • The idea: the longer the transaction, the higher the chance that it contains related linking keywords
  • This leads to the similarity measure se2 = ∑_i ∑_j tot_length(x_i, y_j) · M(i,j) / M_t
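Both measures can be sketched in a few lines of Python. The sketch assumes abstracts are sets of words, so the length of a transaction is its number of distinct words, and it returns 0.0 when no pair co-occurs; both are assumptions, as are the helper names and the toy data.

```python
def pair_stats(xs, ys, abstracts):
    """Map each pair (x, y) to (M, tot_length): the number of abstracts in
    which x and y co-occur, and the summed length of those abstracts."""
    stats = {}
    for x in xs:
        for y in ys:
            hits = [a for a in abstracts if x in a and y in a]
            stats[(x, y)] = (len(hits), sum(len(a) for a in hits))
    return stats


def se1(stats):
    """se1 = sum of M^2 over all pairs, divided by M_t."""
    m_t = sum(m for m, _ in stats.values())
    return sum(m * m for m, _ in stats.values()) / m_t if m_t else 0.0


def se2(stats):
    """se2 = sum of tot_length * M over all pairs, divided by M_t."""
    m_t = sum(m for m, _ in stats.values())
    return sum(length * m for m, length in stats.values()) / m_t if m_t else 0.0


# Hypothetical toy data.
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g3"}]

stats = pair_stats(xs, ys, abstracts)
print(se1(stats))  # 1.5
print(se2(stats))  # 3.75
```

Here M_t = 4, so se1 = (4 + 1 + 0 + 1) / 4 and se2 = (5·2 + 3·1 + 0 + 2·1) / 4.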
Extending the Notion to Gene Ranking
• Extend the measure to rank the genes in a list Y by how similar they are to the genes in list X
• Instead of treating Y as a random variable, for each y_j ∈ Y consider a random variable U_j that takes only the value y_j
• Compute the similarity measure between X and U_j for all j ∈ {1, …, m}
• Sort the genes in list Y in order of decreasing similarity
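A minimal ranking sketch in Python, assuming the measure se1 is the one applied to each pair (X, U_j). Restricting Y to the single gene y_j means only column j of M contributes, so the score becomes ∑_i M(i,j)^2 / ∑_i M(i,j). Genes with no co-occurrences are given a score of 0.0, which is an assumption; the names and toy data are hypothetical.

```python
def rank_genes(xs, ys, abstracts):
    """Rank each gene y in ys by se1 computed between list X and the
    single-valued variable U_j = y, in decreasing order of score."""
    scores = []
    for y in ys:
        column = [sum(1 for a in abstracts if x in a and y in a) for x in xs]
        total = sum(column)
        score = sum(m * m for m in column) / total if total else 0.0
        scores.append((y, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data.
xs = ["g1", "g2"]
ys = ["h2", "h1", "h3"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g3"}]

print(rank_genes(xs, ys, abstracts))
# [('h1', 2.0), ('h2', 1.0), ('h3', 0.0)]
```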
Datasets
• Used two sets of 21 and 31 genes
  • These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients
  • Dr Gail Frazer’s lab, Kent State University
• A standard dictionary of 300 genes, as reported in the literature, was used
  • These genes were significantly up- or down-regulated in tumor and adjacent normal tissues compared with normal donor tissue
• Each literature abstract was represented in bag-of-words format, where each word comes from a dataset or the dictionary, or is a GO term
Results: Hypergraph Mining
• Hypergraph mining yields the linking GO terms and linking genes from the dictionary for the 21-gene and 31-gene datasets
Results: Similarity Measures
• Four sets of 300 genes each (A, B, C, D) were formed
  • A is the dictionary of 300 genes mentioned before
  • B, C, D were randomly chosen from SuperArray’s DNA microarray experiments
• The task is to identify which of A, B, C, D is most similar to the 21-gene or 31-gene dataset
• As one would expect, A is most similar to the 21-gene dataset
• The results also show that a naïve similarity measure, such as s1, fails to capture this
• Sometimes the tool discovers an interesting result
  • For the 31-gene dataset, the randomly chosen list C was most similar
  • This was justified by checking the functions of the top-ranked genes from list C
Results: Ranking
• Genes from the most similar list were ranked against the 21-gene and 31-gene datasets
• Linking words found by hypergraph mining also appeared among the top 20 ranked genes
Summary
• Biological literature is large and complex
• Data mining tools are needed to summarize interesting patterns
• Proposed hypergraph mining and similarity metrics
• Initial results are promising