Exploratory Tools for Follow-up Studies to Microarray Experiments

Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska. Ohio State and Kent State Universities.

Presentation Transcript


  1. Exploratory Tools for Follow-up Studies to Microarray Experiments • Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska • Ohio State and Kent State Universities

  2. Overall Motivation • Biological literature is vast • Need tools to find interesting patterns in the literature • Specific Example • Identify genes from DNA microarray and other gene and protein assays • Next step • What is known about these genes? • How are these genes related to each other or to other genes identified in similar studies? • Which other genes are most similar to them?

  3. Outline • Hypergraph Mining • Similarity Measures • Evaluation and Observations

  4. Hypergraph Mining: Motivating Example • A microarray experiment suggests that a small set of genes is related to a disease • Confirm by searching the existing literature: related genes are expected to appear together • However, suppose genes A and C are related and both are only weakly related to another term B • In the literature, one would then expect • A and C to appear together, AND/OR • A and B to appear together, and B and C to appear together • How do we efficiently conclude that A and C are actually related?

  5. Hypergraph Mining • Basic Motivation • Find useful “transitive relations” (hyperedges) among genes • Example (Gene-Disease Relationship) • Gene A is related to a term B • Term B is related to a gene C • Is gene A related to gene C? • Gene Source • Microarray experiments • Information Source • Online literature abstracts

  6. Formal Problem Definition • Given • A dictionary $K_T$ • A set $K_M$ of user-provided keywords ($K_T \supseteq K_M$) • A collection of literature abstracts, each represented as a set of words from the dictionary • Task • Find hyperedges whose weight exceeds a user-defined threshold; each hyperedge involves a set of keywords from $K_M$ that are potentially connected by a set of linking words from $K_T \setminus K_M$
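
The input representation described on this slide can be sketched as follows. This is only an illustration under assumptions: the tokenizer, the placeholder names (`GENEA`, `TERMB`, ...) and the two toy abstracts are hypothetical and do not come from the presentation.

```python
def to_transaction(abstract_text, dictionary):
    """Return the set of dictionary words that occur in one abstract."""
    # Naive tokenizer (an assumption): split on whitespace, strip punctuation,
    # and compare case-insensitively against an upper-case dictionary.
    tokens = {tok.strip(".,;:()").upper() for tok in abstract_text.split()}
    return tokens & dictionary

# Hypothetical dictionary K_T and user-provided keyword set K_M.
K_T = {"GENEA", "GENEC", "TERMB", "TERMD"}
K_M = {"GENEA", "GENEC"}
assert K_M <= K_T                # the dictionary must contain the keywords
linking_vocabulary = K_T - K_M   # words that may link keywords together

# Toy abstracts, invented for illustration only.
abstracts = [
    "GeneA is associated with TermB in tumor tissue.",
    "TermB regulates GeneC (and TermD).",
]
transactions = [to_transaction(a, K_T) for a in abstracts]
print(transactions)
```

Each abstract becomes a plain set, so later steps can use subset and intersection tests directly.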

  7. Relationship to Work on Frequent Pattern Mining • Frequent itemset mining • Can represent each document abstract as a transaction with several keywords • Find sets of keywords that frequently appear together • Cannot capture cross relationships • Differences • How do we define support? • How do we prune the search space?

  8. Solution Approach • Define • total weight = support + cross support • Support: the keywords in the set appear together in one document • Cross support: the keyword set can be partitioned so that • each part appears in a different document • and the documents share common linking words • Issues • The downclosure property does not hold for total weight, but a modified downclosure property can be defined
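
The slide does not give exact formulas for the two terms, so the following is a minimal sketch under stated assumptions: support counts the documents that contain every keyword, and cross support counts pairs of documents that split the keyword set into two non-empty parts and share at least one linking word. The two-part restriction, the pair counting, and the `cap` parameter are hypothetical choices, not the authors' definitions.

```python
from itertools import combinations

def support(keywords, transactions):
    """Number of documents that contain every keyword in the set."""
    return sum(1 for t in transactions if keywords <= t)

def cross_support(keywords, transactions, linking_words, cap=5):
    """Assumed definition: number of document pairs that jointly cover the
    keyword set, neither containing all of it, and that share a linking word.
    The result is capped so that the function stays bounded."""
    count = 0
    for d1, d2 in combinations(transactions, 2):
        if keywords <= d1 or keywords <= d2:
            continue  # such documents are already counted by plain support
        if not (d1 & d2 & linking_words):
            continue  # the two documents share no linking word
        part1, part2 = keywords & d1, keywords & d2
        if part1 and part2 and (part1 | part2) == keywords:
            count += 1
    return min(count, cap)

# Toy data: A and C co-occur once, and are also linked through B.
transactions = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A"}]
keywords = frozenset({"A", "C"})
linking_words = {"B"}

s = support(keywords, transactions)                       # 1
cs = cross_support(keywords, transactions, linking_words) # 1
total_weight = s + cs                                     # 2
print(s, cs, total_weight)
```

In this toy case the pair ({A, B}, {B, C}) contributes to cross support, which is the transitive evidence that plain frequent itemset mining would miss.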

  9. Idea • Support satisfies the downclosure property • Let $X$ be a set and $\Omega$ its power set. A function $f : \Omega \to \mathbb{R}^+$ satisfies the downclosure property if, for all $A, B \in \Omega$ with $A \supseteq B$, $f(B) \ge f(A)$ • Cross support can be designed to stay below a fixed value, i.e., it is bounded • Form a function $h = f + g$ as the sum of two functions • $f$ satisfies the downclosure property • $g$ is bounded • Then $h$ satisfies a modified downclosure property • For any $\theta \ge 0$, if $h(K_n) \ge \theta$ then $f(K_{n-1}) \ge \max\{0, \theta - \sup(g)\}$, where $K_{n-1} \subset K_n$ • This property can be used to devise an efficient algorithm
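
One way the modified downclosure property could drive pruning is sketched below. The level-wise candidate check is an assumption in the style of Apriori; the presentation does not spell out its algorithm, and the names `prune_bound`, `may_reach_threshold` and `g_sup` are hypothetical.

```python
from itertools import combinations

def support(keywords, transactions):
    """Number of documents that contain every keyword in the set (this is f)."""
    return sum(1 for t in transactions if keywords <= t)

def prune_bound(theta, g_sup):
    """Lower bound on f(K_{n-1}) implied by h(K_n) >= theta, with g <= g_sup."""
    return max(0, theta - g_sup)

def may_reach_threshold(candidate, transactions, theta, g_sup):
    """Necessary condition for h(candidate) >= theta: every subset with one
    keyword removed must have support of at least the pruning bound."""
    bound = prune_bound(theta, g_sup)
    size = len(candidate) - 1
    return all(
        support(frozenset(sub), transactions) >= bound
        for sub in combinations(candidate, size)
    )

transactions = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A"}]
theta, g_sup = 4, 2   # threshold and assumed upper bound on cross support

keep_ac = may_reach_threshold({"A", "C"}, transactions, theta, g_sup)
keep_abc = may_reach_threshold({"A", "B", "C"}, transactions, theta, g_sup)
print(keep_ac, keep_abc)
```

Here {A, C} survives because both A (support 3) and C (support 2) meet the bound of 2, while {A, B, C} is discarded because the subset {A, B} has support 1. The check is only a necessary condition, so surviving candidates still need their total weight computed.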

  10. Outline • Hypergraph Mining • Similarity Measures • Evaluation and Observations

  11. Similarity Measure among Sets of Genes • Given two lists of gene names • Need to find the most similar genes, based on co-occurrence in literature abstracts • Standard statistical approach • Each file of gene names can be treated as a discrete random variable (DRV) • Each such DRV takes gene names as its values • For two such files $X$, $Y$ and any pair $(x, y)$, • the joint probability mass function is $p(x,y) = P(X = x, Y = y)$ • Computed from online abstracts based on co-occurrence

  12. Probability Computation • Assume • File $X$ has $n$ gene names $x_i$, $i \in \{1,\dots,n\}$ • File $Y$ has $m$ gene names $y_j$, $j \in \{1,\dots,m\}$ • $M(i,j)$ is the number of times $x_i$ and $y_j$ appear together in transactions (article abstracts) • Then • $p(x_i, y_j) = M(i,j) / \sum_i \sum_j M(i,j)$
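
The computation on this slide can be sketched in a few lines. The sketch assumes that $M(i,j)$ counts the abstracts in which both names occur, and the gene names and transactions are toy placeholders.

```python
def cooccurrence_counts(xs, ys, transactions):
    """M[i][j] = number of transactions containing both xs[i] and ys[j]."""
    return [
        [sum(1 for t in transactions if x in t and y in t) for y in ys]
        for x in xs
    ]

def joint_pmf(M):
    """Normalize the co-occurrence counts into p(x_i, y_j)."""
    total = sum(sum(row) for row in M)
    if total == 0:
        raise ValueError("no co-occurrences, so the pmf is undefined")
    return [[m / total for m in row] for row in M]

xs = ["x1", "x2"]
ys = ["y1", "y2"]
transactions = [{"x1", "y1"}, {"x1", "y1", "y2"}, {"x2", "y2"}, {"x2"}]

M = cooccurrence_counts(xs, ys, transactions)   # [[2, 1], [0, 1]]
p = joint_pmf(M)                                # [[0.5, 0.25], [0.0, 0.25]]
print(M, p)
```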

  13. Expectation Computation • Now define • $Z = g(X,Y)$, where $g : X \times Y \to [0,\infty)$ • Clearly, $Z$ is a random variable • The expectation of $Z$ is • $E(Z) = E(g(X,Y)) = \sum_i \sum_j g(x_i,y_j)\, M(i,j) / M_t$ • where $M_t = \sum_i \sum_j M(i,j)$ • The expected value of $Z$ can be used directly as a similarity measure • Different choices of $g$ give rise to different similarity measures

  14. Some Choices of Function g • First choice • Choose $g(x_i,y_j) = M(i,j)$ • This leads to the similarity measure $s_{e1} = \sum_i \sum_j M(i,j)^2 / M_t$ • Second choice • Choose $g(x_i,y_j) = \mathrm{tot\_length}(x_i,y_j)$, the sum of the lengths of the transactions in which $x_i$ and $y_j$ co-occur • The idea is that the longer the transaction, the higher the chance that it contains related linking keywords • This leads to the similarity measure $s_{e2} = \sum_i \sum_j \mathrm{tot\_length}(x_i,y_j)\, M(i,j) / M_t$
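
Both measures are instances of the same expectation with a different weighting function, which the following sketch makes explicit. The helper names and the toy transactions are assumptions for illustration; transaction length is taken to be the number of words in the set.

```python
def cooccurrence(x, y, transactions):
    """M(i, j): number of transactions containing both x and y."""
    return sum(1 for t in transactions if x in t and y in t)

def tot_length(x, y, transactions):
    """Sum of the lengths of the transactions in which x and y co-occur."""
    return sum(len(t) for t in transactions if x in t and y in t)

def expected_similarity(xs, ys, transactions, g):
    """E[g(X, Y)] = sum_ij g(x_i, y_j) * M(i, j) / M_t."""
    pairs = [(x, y) for x in xs for y in ys]
    M_t = sum(cooccurrence(x, y, transactions) for x, y in pairs)
    if M_t == 0:
        return 0.0  # convention chosen here for lists that never co-occur
    weighted = sum(
        g(x, y, transactions) * cooccurrence(x, y, transactions)
        for x, y in pairs
    )
    return weighted / M_t

xs = ["x1", "x2"]
ys = ["y1", "y2"]
transactions = [{"x1", "y1"}, {"x1", "y1", "y2"}, {"x2", "y2"}, {"x2"}]

se1 = expected_similarity(xs, ys, transactions, cooccurrence)  # (4+1+0+1)/4 = 1.5
se2 = expected_similarity(xs, ys, transactions, tot_length)    # (10+3+0+2)/4 = 3.75
print(se1, se2)
```

Passing `g` as a function keeps the expectation code unchanged when other weighting choices are tried.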

  15. Extending the Notion towards Gene Ranking • Extend the measure to rank the genes in a list $Y$ • by their similarity to the genes in list $X$ • Instead of treating $Y$ as a random variable, for each $y_j \in Y$ consider a random variable $U_j$ that takes only the value $y_j$ • Compute the similarity measure between $X$ and $U_j$ for all $j \in \{1,\dots,m\}$ • Sort the genes in list $Y$ by decreasing similarity
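
A possible reading of this ranking step is sketched below, using the first choice of $g$ (the co-occurrence count). It assumes that the measure for each $U_j$ is normalized by that gene's own co-occurrence total, which the slide does not state explicitly; the data are toy placeholders.

```python
def cooccurrence(x, y, transactions):
    """Number of transactions containing both x and y."""
    return sum(1 for t in transactions if x in t and y in t)

def rank_genes(xs, ys, transactions):
    """Score each y against the whole list xs and sort by decreasing score.
    Score for y_j (assumed): sum_i M(i, j)^2 / sum_i M(i, j)."""
    scores = {}
    for y in ys:
        column = [cooccurrence(x, y, transactions) for x in xs]
        total = sum(column)
        scores[y] = sum(m * m for m in column) / total if total else 0.0
    ranked = sorted(ys, key=lambda y: scores[y], reverse=True)
    return ranked, scores

xs = ["x1", "x2"]
ys = ["y1", "y2", "y3"]
transactions = [{"x1", "y1"}, {"x1", "y1", "y2"}, {"x2", "y2"}, {"x2"}]

ranked, scores = rank_genes(xs, ys, transactions)
print(ranked)   # ['y1', 'y2', 'y3']
print(scores)   # {'y1': 2.0, 'y2': 1.0, 'y3': 0.0}
```

Genes that never co-occur with any gene in `xs` receive a score of zero and end up at the bottom of the ranking.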

  16. Datasets • Used two sets of 21 and 31 genes • These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients • Dr Gail Frazer’s lab, Kent State University • A standard dictionary of 300 genes, as reported in the literature, was used • These genes were significantly up- or down-regulated in tumor and adjacent normal tissues compared with normal donor tissue • Each literature abstract was represented in bag-of-words format, • where each word comes from one of the datasets or the dictionary, or is a GO term

  17. Results: Hypergraph Mining • Results show the linking GO terms and linking genes from the dictionary obtained by hypergraph mining for the 21-gene and 31-gene datasets

  18. Results: Similarity Measures • Four sets of 300 genes each (A, B, C, D) were formed • A is the dictionary of 300 genes mentioned before • B, C, D were randomly chosen from SuperArray’s DNA microarray experiments • The task is to identify which of A, B, C, D is most similar to the 21-gene or 31-gene dataset • As one would expect, A is most similar to the 21-gene dataset, as shown below • It also shows that a naïve similarity measure, such as s1, fails to capture this • Sometimes the tool discovers an interesting result: • For the 31-gene dataset, the randomly chosen list C was most similar • This was justified by checking the functions of the top-ranked genes from list C

  19. Results: Ranking • Ranked genes from the list most similar to the 21-gene or 31-gene dataset • Linking words from hypergraph mining were also found among the top 20 genes

  20. Summary • Biological Literature is large and complex • Need data mining tools to summarize interesting patterns • Proposed hypergraph mining and similarity metrics • Initial results are promising
