Graph and Topological Structure Mining on Scientific Articles • Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska • The Ohio State University; The Kent State University • Presenter: Fan Wang, The Ohio State University
Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion
Introduction • Huge number of genes in literature • Associated with targeted disease or functionality • Finding interaction among genes manually • Time consuming • Error Prone
Introduction • Well-known relationship among chemokine ligands • Mining these relations from literature documents • Mining frequent patterns from graph datasets • Convenient representation • Lots of research in subgraph mining
Introduction • Our Goal • Find commonly occurring interactions • Represent them visually • Capture the co-occurrence of scientific terms • Graph representation of scientific document • Mining frequent topological structures
Topological Structure Mining • Disadvantages of subgraph mining • Exact matching • Missing potential patterns • Focusing on the topological relationship • Incorporating approximate matching
Topological Structure Mining • [Figure: example graphs G, X and Y] • G is a subgraph of Y • X is a (0,3) topological structure of Y
Topological Structure Mining • Definition • Given a collection of graphs, two parameters l and h, and a threshold θ, an (l,h)-topological structure whose support is greater than or equal to θ is called a frequent topological structure. • Our KDD05 paper implements TSMiner, an algorithm that finds the frequent topological structures in a set of graphs
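
The support test in the definition above can be sketched as follows. This is only an illustration, not TSMiner: it assumes support is measured as the fraction of graphs containing a structure, and the matching predicate `occurs_in` is left to the caller. The helper names (`edge`, `exact_occurs_in`, `frequent_structures`) are hypothetical, and `exact_occurs_in` is a plain exact-matching stand-in, not the approximate (l,h) matching the slides describe.

```python
from typing import Callable, FrozenSet, Iterable, List, Set

Edge = FrozenSet[str]
Graph = Set[Edge]


def edge(a: str, b: str) -> Edge:
    """Undirected edge between two labelled nodes."""
    return frozenset((a, b))


def support(structure: Graph, graphs: Iterable[Graph],
            occurs_in: Callable[[Graph, Graph], bool]) -> float:
    """Fraction of graphs in which the structure occurs (assumed definition)."""
    graphs = list(graphs)
    if not graphs:
        return 0.0
    hits = sum(1 for g in graphs if occurs_in(structure, g))
    return hits / len(graphs)


def frequent_structures(candidates: Iterable[Graph], graphs: Iterable[Graph],
                        occurs_in: Callable[[Graph, Graph], bool],
                        theta: float) -> List[Graph]:
    """Keep the candidates whose support is at least theta."""
    graphs = list(graphs)
    return [s for s in candidates if support(s, graphs, occurs_in) >= theta]


def exact_occurs_in(structure: Graph, graph: Graph) -> bool:
    """Stand-in predicate: every edge of the structure is present in the graph.
    A real (l,h) matcher would instead allow edges to match paths."""
    return structure <= graph


# Toy data: three document graphs and two candidate structures.
graphs = [{edge("A", "B"), edge("B", "C")}, {edge("A", "B")}, {edge("C", "D")}]
candidates = [{edge("A", "B")}, {edge("A", "B"), edge("B", "C")}]
frequent = frequent_structures(candidates, graphs, exact_occurs_in, theta=0.5)
```

With θ = 0.5, only the single-edge candidate survives, because it occurs in two of the three toy graphs while the two-edge candidate occurs in one.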
Our Work • Using topological structure mining • Challenges • How to create graphs? • What are the keywords? • How to insert edges into graphs?
Data Preprocessing and Graph Representation • One graph for each document • Nodes are keywords of interest • Edges inserted based on occurrence of the keywords • Run topological structure mining algorithm
Data Preprocessing • Four dictionaries of keywords • Short Dictionary • 321 genes expressed between prostate epithelial and stromal cells • Long Dictionary • 2600 human genes found in SuperArray's DNA microarray experiment • Confusion Dictionary • Gene names easily confused with ordinary words • GO Dictionary • GO terms (molecular function, biological process and cellular component)
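
A minimal sketch of dictionary-based keyword extraction is given here. The slides do not say how the dictionaries are applied, so two assumptions are made: matching is case-sensitive on whole tokens, and names in the confusion dictionary are simply dropped. The function `extract_keywords` and the sample dictionaries are hypothetical, and multi-word GO terms would need phrase matching, which this sketch does not attempt.

```python
import re
from typing import AbstractSet, List


def extract_keywords(text: str, dictionary: AbstractSet[str],
                     confusion: AbstractSet[str] = frozenset()) -> List[str]:
    """Return dictionary terms found in the text, in order of first appearance.

    Terms listed in `confusion` are skipped (an assumed use of the
    confusion dictionary, not one confirmed by the slides).
    """
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)
    found: List[str] = []
    seen = set()
    for tok in tokens:
        if tok in dictionary and tok not in confusion and tok not in seen:
            seen.add(tok)
            found.append(tok)
    return found


# Toy example: "WAS" is a gene symbol that is easily confused with a common word.
sample_text = "CCL5 and IGF1 were up-regulated, but WAS expression was unchanged."
sample_dictionary = {"CCL5", "IGF1", "WAS"}
kept = extract_keywords(sample_text, sample_dictionary, confusion={"WAS"})
unfiltered = extract_keywords(sample_text, sample_dictionary)
```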
Graph Representations • Edge Construction Methods • Sentence-based Method • Two keywords in one sentence • Mutual Information Method • The mutual information of two keywords greater than a threshold • Sliding Window Method • Two keywords located within a sliding window with a pre-defined size
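
The three edge construction methods can be sketched as below. This is a simplified reading of the slide, not the authors' implementation. Sentences are split with a naive regular expression, the window size is counted in tokens, and the mutual information variant uses pointwise mutual information over sentence co-occurrence within one document, since the slides give no formula. All function names are hypothetical.

```python
import math
import re
from itertools import combinations
from typing import AbstractSet, FrozenSet, List, Set

Edge = FrozenSet[str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)


def sentence_edges(text: str, keywords: AbstractSet[str]) -> Set[Edge]:
    """Connect two keywords when they appear in the same sentence."""
    edges: Set[Edge] = set()
    for sentence in split_sentences(text):
        present = sorted(set(tokenize(sentence)) & keywords)
        edges.update(frozenset(pair) for pair in combinations(present, 2))
    return edges


def sliding_window_edges(text: str, keywords: AbstractSet[str],
                         window: int) -> Set[Edge]:
    """Connect two keywords when both fall inside a span of `window` tokens."""
    positions = [(i, t) for i, t in enumerate(tokenize(text)) if t in keywords]
    edges: Set[Edge] = set()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i < window:
            edges.add(frozenset((a, b)))
    return edges


def mutual_information_edges(text: str, keywords: AbstractSet[str],
                             threshold: float) -> Set[Edge]:
    """Connect two keywords whose pointwise mutual information, estimated
    from sentence co-occurrence, exceeds the threshold (assumed variant)."""
    sentences = [set(tokenize(s)) & keywords for s in split_sentences(text)]
    n = len(sentences)
    edges: Set[Edge] = set()
    if n == 0:
        return edges
    for a, b in combinations(sorted(keywords), 2):
        p_a = sum(a in s for s in sentences) / n
        p_b = sum(b in s for s in sentences) / n
        p_ab = sum(a in s and b in s for s in sentences) / n
        if p_ab > 0 and math.log2(p_ab / (p_a * p_b)) > threshold:
            edges.add(frozenset((a, b)))
    return edges


# Toy document and keyword set.
doc = ("CCL5 binds CCR5. IGF1 is regulated by many other factors "
       "including IGFBP3. TF was measured.")
kw = {"CCL5", "CCR5", "IGF1", "IGFBP3", "TF"}
by_sentence = sentence_edges(doc, kw)
by_window = sliding_window_edges(doc, kw, window=4)
by_mi = mutual_information_edges(doc, kw, threshold=1.0)
```

On this toy document the sliding window ignores sentence boundaries, so it links IGFBP3 and TF, which sit in adjacent sentences, but it misses IGF1 and IGFBP3, which share a sentence yet lie more than four tokens apart.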
Experiment Results • Focusing on articles containing at least one of the 5 genes • CCL5, TF, IGF1, MYLK, IGFBP3 • Generating graph for each article • Finding frequent topological structures
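
The article selection step can be sketched as a simple filter. The five gene symbols come from the slide. The rest is assumed for illustration: case-sensitive whole-token matching, a hypothetical `select_articles` helper, and articles held in a dictionary keyed by an identifier.

```python
import re
from typing import AbstractSet, Dict

SEED_GENES = frozenset({"CCL5", "TF", "IGF1", "MYLK", "IGFBP3"})


def mentions_seed_gene(text: str,
                       seeds: AbstractSet[str] = SEED_GENES) -> bool:
    """True if at least one seed gene symbol appears as a whole token."""
    tokens = set(re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text))
    return not tokens.isdisjoint(seeds)


def select_articles(articles: Dict[str, str],
                    seeds: AbstractSet[str] = SEED_GENES) -> Dict[str, str]:
    """Keep only the articles that mention at least one seed gene."""
    return {key: text for key, text in articles.items()
            if mentions_seed_gene(text, seeds)}


# Toy corpus with made-up article texts.
articles = {
    "a1": "IGF1 signalling in prostate stroma.",
    "a2": "No relevant gene is named here.",
    "a3": "MYLK and CCL5 were co-expressed.",
}
selected = select_articles(articles)
```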
Results • Sliding window method wins • Largest number of frequent patterns • Best scalability • Topological structure mining gives us more frequent patterns • A large number of patterns does not imply high biological significance
Pattern Analysis • Patterns found ONLY by topological structure mining • Patterns found ONLY by the sliding window method • Restoring nodes reveals interesting patterns
Conclusion • Sliding window method is the best • The largest number of frequent patterns • The highest quality of frequent patterns • The topological structures found correspond well to known relationships • Topological mining is a very valuable tool for biological researchers
Three Edge Construction Methods • Interestingness of Edges • Counting the number of distinct edges • Computing the average interestingness of edges for all patterns found by using each edge construction method
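
The comparison above could be computed along the following lines. The slides do not define interestingness, so the score is passed in as a caller-supplied function. The sketch also assumes the average is taken over distinct edges. The names `distinct_edges` and `average_interestingness` and the toy scores are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set

Edge = FrozenSet[str]


def distinct_edges(patterns: Iterable[Set[Edge]]) -> Set[Edge]:
    """Union of the edges that appear in any pattern."""
    result: Set[Edge] = set()
    for pattern in patterns:
        result |= pattern
    return result


def average_interestingness(patterns: Iterable[Set[Edge]],
                            score: Callable[[Edge], float]) -> float:
    """Mean score over the distinct edges of all patterns (0.0 if none)."""
    edges = distinct_edges(patterns)
    if not edges:
        return 0.0
    return sum(score(e) for e in edges) / len(edges)


# Toy patterns for one edge construction method, with made-up scores.
ab = frozenset(("A", "B"))
bc = frozenset(("B", "C"))
patterns = [{ab, bc}, {ab}]
toy_scores = {ab: 1.0, bc: 0.5}
n_distinct = len(distinct_edges(patterns))
avg_score = average_interestingness(patterns, toy_scores.__getitem__)
```

Running the same two functions on the patterns produced by each edge construction method gives one edge count and one average per method, which can then be compared side by side.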