Relationship Web: Spinning the Semantic Web from Trailblazing to Complex Hypothesis Evaluation August 2007

Relationship Web: Spinning the Semantic Web from Trailblazing to Complex Hypothesis Evaluation August 2007 Amit Sheth Kno.e.sis Center, Wright State University, Dayton, OH . This talk also represents work of several members of Kno.e.sis team, esp. Cartic Ramakrishnan. http://knoesis.wright.edu

Not data (search), but integration, analysis and insight, leading to decisionsanddiscovery

“An object by itself is intensely uninteresting”. Grady Booch, Object Oriented Design with Applications, 1991 Objects of Interest (Desire?) Entities | Integration Relationships | Analysis, Insight Keywords | Search Changing the paradigm from document centric to relationship centric view of information

Data captured per year = 1 exabyte (1018)(Eric Neumann, Science, 2005) Multiple formats: Structured, unstructured, semi-structured Multimodal: text, image, a/v, sensor, scientific/engineering Thematic, Spatial, Temporal Enterprise to Globally Distributed Death by Data: Size, Heterogeneity & Complexity

Moving from Syntax/Structure to Semantics Is There A Silver Bullet?

Semantics: Meaning & Use of Data Semantic Web: Labeling data on the Web so both humans and machines can use them more effectively i.e., Formal, machine processable description  more automation; emerging standards/technologies (RDF, OWL, Rules, …) Approach & Technologies

How? Ontology: Agreement with Common Vocabulary & Domain Knowledge Semantic Annotation: metadata (manual & automatic metadata extraction) Reasoning: semantics enabled search, integration, analysis, mining, discovery Is There A Silver Bullet?

Time, Space Gene Ontology, Glycomics, Proteomics Pharma Drug, Treatment-Diagnosis Repertoire Management Equity Markets Anti-money Laundering, Financial Risk, Terrorism Biomedicine is one of the most popular domains in which lots of ontologies have been developed and are in use. See: http://obo.sourceforge.net/browse.html Clinical/medical domain is also a popular domain for ontology development and applications: http://www.openclinical.org/ontologies.html Ontology Examples

is a focused ontology for the description of glycomics models the biosynthesis, metabolism, and biological relevance of complex glycans models complex carbohydrates as sets of simpler structures that are connected with rich relationships An ontology for structure and function of Glycopeptides Published through the National Center for Biomedical Ontology (NCBO) More at: http://knoesis.wright.edu/research/bioinformatics/ GlycO

An ontology for capturing process and lifecycle information related to proteomic experiments Two aspects of glycoproteomics: What is it?→ identification How much of it is there? → quantification Heterogeneity in data generation process, instrumental parameters, formats Need data and process provenance→ ontology-mediated provenance Hence, ProPreO models both the glycoproteomics experimental process and attendant data Approx 500 classes, 3million+ instances Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO) More info. On Knowledge Representation in Life Sciences at Kno.e.sis ProPreO ontology

N-glycan_beta_GlcNAc_9 N-glycan_alpha_man_4 GNT-Vattaches GlcNAc at position 6 N-acetyl-glucosaminyl_transferase_V UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-$beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-$R2 UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021 N-Glycosylation metabolic pathway GNT-Iattaches GlcNAc at position 2

Pathway Steps - Reaction Evidence for this reaction from three experiments Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

Abundance of this glycan in three experiments Pathway Steps - Glycan Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

Extracting Semantic Metadata from semistructured and structured sources Semagix Freedom for building ontology-driven information system © Semagix, Inc.

Information Extractionfor Metadata Creation WWW, Enterprise Repositories Nexis UPI AP Digital Videos Data Stores Feeds/ Documents Digital Maps . . . . . . Digital Images Digital Audios . . . Create/extract as much (semantics)metadata automatically as possible EXTRACTORS METADATA

Automatic Semantic Metadata Extraction/Annotation

Semantic Annotation – Elsevier iConsult content

Cell Culture extract Glycoprotein Fraction proteolysis Glycopeptides Fraction 1 Separation technique I n Glycopeptides Fraction PNGase n Peptide Fraction Separation technique II n*m Peptide Fraction Mass spectrometry ms data ms/ms data Data reduction Data reduction ms peaklist ms/ms peaklist binning Peptide identification Glycopeptide identification and quantification N-dimensional array Peptide list Data correlation Signal integration N-Glycosylation Process (NGP)

Semantic Web Process to incorporate provenance Biological Sample Analysis by MS/MS Agent Raw Data to Standard Format Agent Data Pre- process Agent DB Search (Mascot/Sequest) Agent Results Post-process (ProValt) O I O I O I O I O Storage Raw Data Standard Format Data Filtered Data Search Results Final Output Biological Information ISiS – Integrated Semantic Information and Knowledge System Semantic Annotation Applications

ProPreO: Ontology-mediated provenance Semantic Extraction/Annotation of Experimental Data parent ion charge 830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759 38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113 37.4790 1660.7776 476.5043 parent ion m/z parent ionabundance fragment ion m/z fragment ionabundance ms/ms peaklist data Mass Spectrometry (MS) Data

Semantic Annotation Facilitates Complex Queries • Evaluate the specific effects of changing a biological parameter: Retrieve abundance data for a given protein expressed by three different cell types of a specific organism. • Retrieve raw data supporting a structural assignment: Find all the raw ms data files that contain the spectrum of a given peptide sequence having a specific modification and charge state. • Detect errors: Find and compare all peptide lists identified in Mascot output filesobtained using a similar organism, cell-type, sample preparation protocol, and mass spectrometry conditions. A Web ServiceMust Be Invoked ProPreO concepts highlighted in red

Example of Relevant Subgraph Discoverybased on evidence

Anecdotal Example UNDISCOVERED PUBLIC KNOWLEDGE Discovering connections hidden in text 24

Schema-Driven Extraction of Relationships from Biomedical Text Cartic Ramakrishnan, Krys Kochut, Amit P. Sheth: A Framework for Schema-Driven Relationship Discovery from Unstructured Text. International Semantic Web Conference 2006: 583-596 [.pdf]

Method – Parse Sentences in PubMed SS-Tagger (University of Tokyo) SS-Parser (University of Tokyo) • Entities (MeSH terms) in sentences occur in modified forms • “adenomatous” modifies “hyperplasia” • “An excessive endogenous or exogenous stimulation” modifies “estrogen” • Entities can also occur as composites of 2 or more other entities • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium” (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

Method – Identify entities and Relationships in Parse Tree Modifiers TOP Modified entities Composite Entities S VP UMLS ID T147 NP VBZ induces NP PP NP NN estrogen NP IN by JJ excessive PP DT the ADJP NN stimulation MeSHID D004967 IN of JJ adenomatous NN hyperplasia NP JJ endogenous JJ exogenous CC or MeSHID D006965 NN endometrium DT the MeSHID D004717

Resulting Semantic Web Data in RDF Modifiers Modified entities Composite Entities

4733 documents 9284 documents 5 documents Now possible – Extracting relationships between MeSH terms from PubMed UMLS Semantic Network complicates Biologically active substance affects causes causes Disease or Syndrome Lipid affects instance_of instance_of ??????? Fish Oils Raynaud’s Disease MeSH PubMed

Once you have Semantic Web Data

Approaches for Weighted Graphs QUESTION 1: Given an RDF graph without weights can we use domain knowledge to compute the strength of connection between any two entities? QUESTION 2: Can we then compute the most “relevant” connections for a given pair of entities? QUESTION 3: How many such connections can there be? Will this lead to a combinatorial explosion? Can the notion of relevance help?

Overview Problem: Discovering relevant connections between entities All Paths problem is NP-Complete Most informative paths are not necessarily the shortest paths Possible Solution: Heuristics-based Approach* Find a smart, systematic way to weight the edges of the RDF graph so that the most important paths will have highest weight Adopt algorithms for weighted graphs Model graph as an electrical circuit† with weight representing conductance and find paths with highest current flow – i.e. top-k • *Cartic Ramakrishnan, William Milnor, Matthew Perry, Amit Sheth. "Discovering Informative Connection Subgraphs in Multi-relational Graphs", SIGKDD Explorations Special Issue on Link Mining, Volume 7, Issue 2, December 2005 • †Christos Faloutsos, Kevin S. McCurley, Andrew Tomkins: Fast discovery of connection subgraphs. KDD 2004: 118-127 33

Graph Weights What is a good path with respect to knowledge discovery? Uses more specific classes and relationships e.g. Employee vs. Assistant Professor Uses rarer facts Analogous to information gain Involves unexpected connections e.g. connects entities from different domains 34

Class and Property Specificity (CS, PS) • More specific classes and properties convey more information • Specificity of property pi: • d(pi) is the depth of pi • d(piH) is the depth of the property hierarchy • Specificity of class cj: • d(ci) is the depth of cj • d(ciH’) is the depth of the class hierarchy • Node is weighted and this weight is propagated to edges incident to the node 35

Instance Participation Selectivity (ISP) • Rare facts are more informative than frequent facts • Define a type of an statement RDF <s,p,o> • Triple π= <Ci,pj,Ck> • typeOf(s) = Ci • typeOf(o) = Ck • | π | = number of statements of type π in an RDF instance base • ISP for a statement: σπ = 1/|π| 36

π = <Person, lives_in, City> π’ = <Person, council_member_of, City> σπ =1/(k-m) and σπ’ = 1/m, and if k-m>m then σπ’> σπ 37

Span Heuristic (SPAN) RDF allows Multiple classification of entities Possibly classified in different schemas Tie different schemas together Refraction is Indicative of anomalous paths SPAN favors refracting paths Give extra weight to multi-classified nodes and propagate it to the incident edges 38

Going Further What if we are not just interested in knowledge discovery style searches? Can we provide a mechanism to adjust relevance measures with respect to users’ needs? Conventional Search vs. Discovery Search Yes! … SemRank* * Kemafor Anyanwu, Angela Maduko, Amit Sheth. “SemRank: Ranking Complex Relationship Search Results on the Semantic Web”, The 14th International World Wide Web Conference, (WWW2005), Chiba, Japan, May 10-14, 2005 40

Low Information Gain Low Refraction Count High S-Match High Information Gain High Refraction Count High S-Match adjustable search mode 41

Blazing Semantic Trails in Biomedical Literature Cartic Ramakrishnan, Amit P. Sheth: Blazing Semantic Trails in Text: Extracting Complex Relationships from Biomedical Literature. Tech. Report #TR-RS2007 [.pdf]

“The physician, puzzled by her patient's reactions, strikes the trail established in studying an earlier similar case, and runs rapidly through analogous case histories, with side references to the classics for the pertinent anatomy and histology. The chemist, struggling with the synthesis of an organic compound, has all the chemical literature before him in his laboratory, with trails following the analogies of compounds, and side trails to their physical and chemical behavior.” [V. Bush, As We May Think. The Atlantic Monthly, 1945. 176(1): p. 101-108. ] Relationships -- Blazing the Trails

Original documents PMID-15886201 PMID-10037099

Complex relationships connecting documents – Semantic Trails

Semantic Trail

Semantic Trails can be built over a Web of Semantic (Meta)Data extracted (manually, semi-automatically and automatically) and gleaned from Structured data (e.g., NCBI databases) Semi-structured data (e.g., XML based and semantic metadata standards for domain specific data representations and exchanges) Unstructured data (e.g., Pubmed and other biomedical literature) and Various modalities (experimental data, medical images, etc.) Semantic Trails over all types of Data

Relationship Web Semantic Metadata can be extracted from unstructured (eg, biomedical literature), semi-structured (eg, some of the Web content), structured (eg, databases) data and data of various modalities (eg, sensor data, biomedical experimental data). Focusing on the relationships and the web of their interconnections over entities and facts (knowledge) implicit in data leads to a Relationship Web. Relationship Web takes you away from “which document” could have information I need, to “what’s in the resources” that gives me the insight and knowledge I need for decision making. Amit P. Sheth, Cartic Ramakrishnan: Relationship Web: Blazing Semantic Trails between Web Resources. IEEE Internet Computing, July 2007. 48

Prototype Semantic Web application demonstration 2 Demonstration of Semantic Trailblazing using a Semantic Browser This application demonstrating use of ontology-supported relationship extraction (represented in RDF) and their traversal in context (as deemed relevant by the scientists), linking parts of knowledge represented in one biomedical document (currently a sentence in an abstract in Pubmed) to parts of knowledge represented in another document. This is a prototype and lot more work remains to be done to build a robust system that can support Semantic Trailblazing. For more information: Cartic Ramakrishnan, Krys Kochut, Amit P. Sheth: A Framework for Schema-Driven Relationship Discovery from Unstructured Text. International Semantic Web Conference 2006: 583-596 [.pdf] Cartic Ramakrishnan, Amit P. Sheth: Blazing Semantic Trails in Text: Extracting Complex Relationships from Biomedical Literature. Tech. Report #TR-RS2007 [.pdf] 49

Applications “Everything's connected, all along the line. Cause and effect. That's the beauty of it. Our job is to trace the connections and reveal them.” Jack in Terry Gilliam’s 1985 film - “Brazil” Applications

Relationship Web: Spinning the Semantic Web from Trailblazing to Complex Hypothesis Evaluation August 2007

Relationship Web: Spinning the Semantic Web from Trailblazing to Complex Hypothesis Evaluation August 2007

Presentation Transcript

Creating and Exploiting a Web of Semantic Data

Hypothesis Testing

Framing, spinning and padding

Evaluation of Semantic Search

Overview of Semantic Tools

Data Triangulation in a User Evaluation of the Sealife Semantic Web Browsers

The Semantic Web

Alternative representations: Semantic networks

Semantic Analysis (Generating An AST)

Dr. Bhavani Thuraisingham

Friction Spinning

Trailblazing…

Hypothesis Testing:

Social Semantic Desktop Reference Architecture Evaluation

Fiber spinning

Semantic Web and Knowledge Representation

Semantic Agent Programming Language (S-APL): A Middleware Platform for the Semantic Web

Evaluation

Semantic Web

Measuring the Semantic Similarity of Texts

Geospatial Semantic Web

The Research Hypothesis