290 likes | 387 Views
Discovery from Linking Open Data (LOD) Annotated Datasets. Louiqa Raschid University of Maryland PAnG /PSL/ANAPSID/ Manjal. Agenda. Motivation Challenges Solution approaches. Emergence of biological datasets in the cloud of Linked Data.
E N D
Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal
Agenda • Motivation • Challenges • Solution approaches
Emergence of biological datasets in the cloud ofLinked Data. Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, NCI Thesaurus. Links form a graph that captures meaningful knowledge. Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.
Agenda • Motivation • Drug re-purposing • Cross ontology patterns and literature imprint • Cross genome analysis • Challenges • Solution approaches
Signature: Set of mRNAs that increase or decrease in patients and is significant w.r.t the general population.Compute similarity score [-1, +1]
Of 16,000 pairings, 2664 were significant (q < 0.05); half with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.
Sirota et al Findings • Efficacy (literature) for 2 drugs: topiramate and prednisolone. • Evaluated efficacy of cimetidine (over getfinib) for lung adenocarcinoma. • Methodology does not provide avenues for explanation, validation or discovery.
Limitations and Extensions • Sirota et al. • Anomaly in drug cluster but their methodology does not allow further investigation. • Sims et al. • Methodology is limited to co-occurrence analysis. • Cannot exploit heterogeneous evidence from LOD sources. • Cannot exploit knowledge in ontologies. • Finding patterns in graph datasets and visualization and explanation.
Agenda • Challenges • Exploiting LOD to create datasets. • Knowledge captured in ontologies. • Similarity metrics/distances tuned for ontologies. • Discovering and validating patterns in graphs. • Literature imprint. • Heterogeneous evidence. • Reasoning with uncertainty.
Solution Approaches • PAnG • PSL • Manjal • ANAPSID • Thanks to our collaborators / domain experts: • Olivier Bodenreider, NLM, NIH • Sherri de Coronado, NCI, NIH • Andreas Thor, University of Leipzig • Louiqa Raschid ++ at UMD • LiseGetoor ++ at UMD • PadminiSrinivasan++ University of Iowa • Maria Esther Vidal ++ Universidad Simon Bolivar
Solution approaches Manjal – Text Mining for MEDLINE Annotation Visualizer – Visualize and explore annotations and patterns PSL: Annotation computation by knowledge propagation PANG: Pattern identification using dense subgraphs and graph summaries. Patterns in ANnotation Graphs Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints TheArabidopsisInformationResource Gene Ontology Clinical Trials
Motivation: Gene Annotation Graphs Anno-tations • Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms • Prediction of new annotations as hypothesis for experiments • Link prediction is predicting new functional annotations for a gene
Link Prediction Framework Filter Link Prediction GraphSumma-rization Dense Subgraph Link Prediction ScoringFunction Distance Restriction Cost Model Graphsummary Ranked List of pre-dictedLinks DenseSubgraph Tripartite Anno-tation Graph (TAG) • Dense Subgraph (optional) • Focus on highly connected subgraphs • Graph summarization: • Identify basic pattern (structure) of the graph • Link Prediction • Predicted links reinforce underlying graph pattern
Dense Subgraph [1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010 • Motivation: graph area that is rich or dense with annotation is an “interesting region” • Density of a subgraph = number of induced edges / number of vertices • Tripartite graph with node set (A, B, C) is converted into bipartite graph with (A, C) • Weighted edges = number of shared b’s • Apply technique of [1] • Distance restriction for DSG possible • Hierarchically arranged ontology terms • All node pairs of A and C are within a given distance
Graph Summarization HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 = HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 • Minimum description length approach [2] • Loss-free; employs cost model • Graph summary = Signature + Corrections • Signature: graph pattern / structure • Super nodes = complete partitioning of nodes • Super edges = edges between super nodes = all edges between nodes of super nodes • Corrections: edges e between individual nodes • Additions: e G but e signature • Deletions: e G but e signature [2] Navlakha et.al. Graph summarization with bounded error. SIGMOD, 2008
HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 = HY5 HY5 PO_9006 PO_9006 PHOT1 PHOT1 PSL PO_20030 PO_20030 CIB5 CIB5 CRY2 CRY2 PO_37 PO_37 COP1 COP1 PO_20038 PO_20038 DSG+GS CRY1 CRY1
Different retrieved sets of lung cancer related clinical trials Idenitfy 100 clinical trials using the search keyword “lung cancer” in CONDITION. Retrieve CT, CONDITION and INTERVENTION. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output. Retrieve 100 trials using “lung carcinoma” in the CONDITION field. Retrieve 100 trails using “lung carcinoma” in any field.
Retrieve 100 clinical trials using search keyword “lung cancer”. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output.
100 clinical trials using search keyword “lung carcinoma” for CONDITION.
100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.
Questions? PAnG/PSL/ANAPSID/Manjal