Discovery from Linking Open Data (LOD) Annotated Datasets

Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal

Agenda • Motivation • Challenges • Solution approaches

Emergence of biological datasets in the cloud ofLinked Data. Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, NCI Thesaurus. Links form a graph that captures meaningful knowledge. Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.

Agenda • Motivation • Drug re-purposing • Cross ontology patterns and literature imprint • Cross genome analysis • Challenges • Solution approaches

Signature: Set of mRNAs that increase or decrease in patients and is significant w.r.t the general population.Compute similarity score [-1, +1]

Of 16,000 pairings, 2664 were significant (q < 0.05); half with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.

Sirota et al Findings • Efficacy (literature) for 2 drugs: topiramate and prednisolone. • Evaluated efficacy of cimetidine (over getfinib) for lung adenocarcinoma. • Methodology does not provide avenues for explanation, validation or discovery.

Sirota et al:Identified anomaly in this cluster

Limitations and Extensions • Sirota et al. • Anomaly in drug cluster but their methodology does not allow further investigation. • Sims et al. • Methodology is limited to co-occurrence analysis. • Cannot exploit heterogeneous evidence from LOD sources. • Cannot exploit knowledge in ontologies. • Finding patterns in graph datasets and visualization and explanation.

Agenda • Challenges • Exploiting LOD to create datasets. • Knowledge captured in ontologies. • Similarity metrics/distances tuned for ontologies. • Discovering and validating patterns in graphs. • Literature imprint. • Heterogeneous evidence. • Reasoning with uncertainty.

Solution Approaches • PAnG • PSL • Manjal • ANAPSID • Thanks to our collaborators / domain experts: • Olivier Bodenreider, NLM, NIH • Sherri de Coronado, NCI, NIH • Andreas Thor, University of Leipzig • Louiqa Raschid ++ at UMD • LiseGetoor ++ at UMD • PadminiSrinivasan++ University of Iowa • Maria Esther Vidal ++ Universidad Simon Bolivar

Solution approaches Manjal – Text Mining for MEDLINE Annotation Visualizer – Visualize and explore annotations and patterns PSL: Annotation computation by knowledge propagation PANG: Pattern identification using dense subgraphs and graph summaries. Patterns in ANnotation Graphs Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints TheArabidopsisInformationResource Gene Ontology Clinical Trials

Motivation: Gene Annotation Graphs Anno-tations • Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms • Prediction of new annotations as hypothesis for experiments • Link prediction is predicting new functional annotations for a gene

Link Prediction Framework Filter Link Prediction GraphSumma-rization Dense Subgraph Link Prediction ScoringFunction Distance Restriction Cost Model Graphsummary Ranked List of pre-dictedLinks DenseSubgraph Tripartite Anno-tation Graph (TAG) • Dense Subgraph (optional) • Focus on highly connected subgraphs • Graph summarization: • Identify basic pattern (structure) of the graph • Link Prediction • Predicted links reinforce underlying graph pattern

Dense Subgraph [1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010 • Motivation: graph area that is rich or dense with annotation is an “interesting region” • Density of a subgraph = number of induced edges / number of vertices • Tripartite graph with node set (A, B, C) is converted into bipartite graph with (A, C) • Weighted edges = number of shared b’s • Apply technique of [1] • Distance restriction for DSG possible • Hierarchically arranged ontology terms • All node pairs of A and C are within a given distance

Graph Summarization HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 = HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 • Minimum description length approach [2] • Loss-free; employs cost model • Graph summary = Signature + Corrections • Signature: graph pattern / structure • Super nodes = complete partitioning of nodes • Super edges = edges between super nodes = all edges between nodes of super nodes • Corrections: edges e between individual nodes • Additions: e  G but e  signature • Deletions: e  G but e  signature [2] Navlakha et.al. Graph summarization with bounded error. SIGMOD, 2008

HY5 PO_9006 PHOT1 PO_20030 CIB5 CRY2 PO_37 COP1 PO_20038 CRY1 = HY5 HY5 PO_9006 PO_9006 PHOT1 PHOT1 PSL PO_20030 PO_20030 CIB5 CIB5 CRY2 CRY2 PO_37 PO_37 COP1 COP1 PO_20038 PO_20038 DSG+GS CRY1 CRY1

Distance metrics

Different retrieved sets of lung cancer related clinical trials Idenitfy 100 clinical trials using the search keyword “lung cancer” in CONDITION. Retrieve CT, CONDITION and INTERVENTION. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output. Retrieve 100 trials using “lung carcinoma” in the CONDITION field. Retrieve 100 trails using “lung carcinoma” in any field.

Retrieve 100 clinical trials using search keyword “lung cancer”. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output.

100 clinical trials using search keyword “lung carcinoma” for CONDITION.

100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.

Questions? PAnG/PSL/ANAPSID/Manjal

Discovery from Linking Open Data (LOD) Annotated Datasets

Discovery from Linking Open Data (LOD) Annotated Datasets

Presentation Transcript

Linking Open Drug Data (HCLSIG LODD)

National Open Data Portal and LOD

Linking Data from Multiple Sources

Semantic Web and Linked Data

Linking Social, Open, and Enterprise Data

Discussion: Convergence of the Multilingual Web and Linked Open Data 11 June 2012

Getting triples from records: the role of ISBD

PROPOSAL FOR AIMS and strategy of the EUROCRIS Linked open data Task group (LOD-TG)

Linking Data from ScienceDirect Articles

OpenEI and Linked Open Data

Linked Open Library Data @hbz

Linking Data to Open Access Publications

Linking Open Data

LOD 100² : Data Intelligent Families in Mega Project

From Data to Discovery

View-Dependent LOD

Linking Open Data with Location: Gazetteers and the Semantic Web

From Data to Discovery

Linking Open Drug Data (HCLSIG LODD)

Linked Open Data Dataset from Related Documents