Semantic Web & Semantic Web Processes. A course at Universidade da Madeira, Funchal, Portugal, June 16-18, 2005. Dr. Amit P. Sheth, Professor, Computer Sc., Univ. of Georgia; Director, LSDIS lab; CTO/Co-founder, Semagix, Inc. Special Thanks: Cartic Ramakrishnan, Karthik Gomadam.
Agenda 1 Part I • What is Semantic Web? • What makes the Semantic web • Ontologies – importance of relationships and knowledge • Representation and Languages • Why XML is not enough • Describing semantic web resources: RDF and RDFS • OWL • Query processing and storage Part II • Metadata, Enabling techniques and technologies • Ontology and knowledge engineering: ontology design, ontology population, maintaining ontology freshness • Automated metadata extraction and annotation • Computation and reasoning with focus on relationships • Example commercial Semantic Web platform
Agenda 2 Part III • Semantic web applications: search, integration, analysis • Pan-Web and consumer-centric • Enterprise Part IV • Semantic Web Services and Processes • What are Web Services ? • What are Web processes ? • Creating Web processes: Annotation, discovery, composition, etc. • Semantic Web Service/Process tools
Part I • What is Semantic Web? • What makes the Semantic web • Ontologies – importance of relationships and knowledge • Types and examples of ontologies • Metadata and Semantic Annotation -- metadata classifications • Representation and Languages • Why XML is not enough • RDF - Describe semantic web resources and RDFS - RDF as a triple, RDF as a graph (show example RDF/S) • OWL • RDF Query processing and storage
Three generations of Information Systems: Where we have come from, where we are going • Generation I (1980s), Data (schema, "semantic data modeling"): heterogeneous databases / federated databases research; Mermaid, DDTS, Intervisio • Generation II (1990s), Metadata (domain model): metadata-based integration, mediator systems, digital libraries; VisualHarness, InfoHarness, AdaptX/Harness • Generation III (2000s), Semantics (ontology, context, relationships, KB): Semantic Web technologies and platforms; MediaAnywhere, InfoQuilt, OBSERVER, Semagix Freedom
Broad Scope of Semantic (Web) Technology [Figure: agreement shown along three axes] • Degree of Agreement: Informal, Semi-Formal, Formal • Scope of Agreement: Task/App, Domain/Industry, Gen. Purpose/Broad Based, Common Sense • Agreement About: Data/Info., Function, Execution, QoS • Regions marked in the figure: Current Semantic Web Focus; Semantic Web Processes; Lots of Useful Semantic Technology (interoperability, integration) • Other dimensions: how agreements are reached, … • Cf: Guarino, Gruber
What is the Semantic Web? • "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001 • Ontologies • RDF/RDFS or OWL Syntax – machine processable • Semantic Metadata – annotation of web resources
“An ontology is a specification of a conceptualization” (T. Gruber) • A conceptualization is the way we think about a domain • A specification provides a formal way of writing it down Building Ontologies from the Ground Up When users set out to model their professional activity – Mark Mussen
Conceptualization and Ontology [Figure: within everything that can be expressed in the language, the ontology constrains the possible interpretations of what can be expressed] http://www.w3c.it/events/minerva20040706/guarino.pdf
Central Role of Ontology • Ontology represents agreement, represents common terminology/nomenclature • Ontology is populated with extensive domain knowledge or known facts/assertions • Key enabler of semantic metadata extraction from all forms of content: • unstructured text (and 150 file formats) • semi-structured (HTML, XML) and • structured data • Ontology is in turn the centerpiece that enables • resolution of semantic heterogeneity • semantic integration • semantically correlating/associating objects and documents
Types of Ontologies (or things close to ontology) • Upper ontologies: modeling of time, space, process, etc. • Broad-based or general purpose ontologies/nomenclatures: Cyc, CIRCA ontology (Applied Semantics), SWETO, WordNet • Domain-specific or industry-specific ontologies • News: politics, sports, business, entertainment • Financial Market • Terrorism • Pharma • GlycO, ProPreO • (GO (a nomenclature), UMLS inspired ontology, …), MGED • Application-specific and task-specific ontologies • Anti-money laundering • Equity Research • Repertoire Management • Financial irregularity Fundamentally different approaches are used in developing ontologies at the two ends of the above spectrum
Building ontology Three broad approaches: (1) Social process/manual: many years, committees; can be based on a metadata standard (2) Automatic taxonomy generation (statistical clustering/NLP): limitations/problems with quality, dependence on corpus, naming (3) Descriptional component (schema) designed by domain experts; description base (assertional component, extension) populated using automated processes from trusted knowledge sources Option 2 is being investigated in several research projects; Option 3 is currently supported by Semagix Freedom
Part of the CYC Upper Ontology http://www.cyc.com/cyc/technology/whatiscyc_dir/whatdoescycknow
SWETO (Semantic Web Testbed Ontology) Current Status • Developed using Semagix technology for free non-commercial usage by the SW community; some initial users • V1.4 population includes over 800,000 entities and over 1,500,000 explicit relationships among them • Continuing to populate the ontology from diverse sources, thereby extending it into multiple domains; new smaller and larger releases due soon; RDF and OWL versions • Significant information for provenance/trust support [UMBC partnership] • 97% of disambiguation performed automatically, 2% manually; not yet high quality as an evaluation test set (e.g., low connectivity) • Working on test harness, quality measures, and benchmarks
Expressiveness Range: Knowledge Representation and Ontologies [Figure: ontology dimensions, after McGuinness and Finin; a spectrum from Simple Taxonomies to Expressive Ontologies] • Levels, in increasing expressiveness: Catalog/ID; Terms/glossary; Thesauri ("narrower term" relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restriction; Disjointness, Inverse, part of…; General Logical constraints • Systems and languages placed along the spectrum: TAMBIS, BioPAX, EcoCyc, KEGG, CYC, DB Schema, UMLS, RDF, RDFS, DAML, Wordnet, OO, OWL, IEEE SUO, GO, SWETO, GlycO, Pharma
Gene Ontology (GO) • Comprises three independent “ontologies” • molecular function of gene products • cellular component of gene products • biological process representing the gene product’s higher order role. • Uses these terms as attributes of gene products in the collaborating databases (gene product associations) • Allows queries across databases using GO terms, providing linkage of biological information across species http://www.geneontology.org/
GO = Three Ontologies • Molecular Function • elemental activity or task • example: DNA binding • Cellular Component • location or complex • example: cell nucleus • Biological Process • goal or objective within cell • example: secretion http://www.geneontology.org/
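The cross-database linkage that GO terms provide can be sketched roughly as follows. This is only an illustration: the two databases, the gene product names, and the helper `index_by_term` are hypothetical, and only the three term names come from the slide.

```python
from collections import defaultdict

# Hypothetical annotation tables: gene product -> set of GO term names.
yeast_db = {"geneA": {"DNA binding", "cell nucleus"}, "geneB": {"secretion"}}
mouse_db = {"geneX": {"DNA binding"}, "geneY": {"cell nucleus", "secretion"}}

def index_by_term(databases):
    """Invert the per-database annotations into term -> {(database, product)}."""
    index = defaultdict(set)
    for db_name, annotations in databases.items():
        for product, terms in annotations.items():
            for term in terms:
                index[term].add((db_name, product))
    return index

index = index_by_term({"yeast_db": yeast_db, "mouse_db": mouse_db})

# One shared term links gene products across both (species-specific) databases.
dna_binders = sorted(index["DNA binding"])
```

Because both databases annotate with the same vocabulary, a single lookup by term returns matches from each of them without any schema mapping.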
GlycO • GlycO is a focused domain ontology for the description of glycomics, embodying knowledge of the structure and metabolism of glycans • Contains 770 classes that describe structural features of glycans • Models the biosynthesis, metabolism, and biological relevance of complex glycans • Models complex carbohydrates as sets of simpler structures that are connected with rich relationships • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
GlycO statistics: Ontology schema can be large and complex • 770 classes • 142 slots • Instances extracted with Semagix Freedom: • 69,516 genes (from PharmGKB and KEGG) • 92,800 proteins (from SwissProt) • 18,343 publications (from CarbBank and MedLine) • 12,308 chemical compounds (from KEGG) • 3,193 enzymes (from KEGG) • 5,872 chemical reactions (from KEGG) • 2,210 N-glycans (from KEGG)
GlycO taxonomy [Figures: the first levels of the GlycO taxonomy; most relationships and attributes in GlycO] GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.
A biosynthetic pathway [Figure] • Labels in the figure: N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4, N-acetyl-glucosaminyl_transferase_V • GNT-I attaches GlcNAc at position 2 • GNT-V attaches GlcNAc at position 6 • Reaction: UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2 • With glycan identifiers: UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
The impact of GlycO • GlycO models classes of glycans with unprecedented accuracy • Implicit knowledge about glycans can be deductively derived • Experimental results can be validated according to the model
N-Glycosylation Process (NGP) By N-glycosylation process, we mean the identification and quantification of glycopeptides. [Figure: workflow] Cell culture → extract → glycoprotein fraction → proteolysis → glycopeptides fraction (1) → separation technique I → glycopeptides fractions (n) → PNGase → peptide fractions (n) → separation technique II → peptide fractions (n*m) → mass spectrometry → ms data and ms/ms data → data reduction → ms peaklist and ms/ms peaklist • Later steps named in the figure: binning, peptide identification, N-dimensional array, peptide list, data correlation, signal integration, leading to glycopeptide identification and quantification
ProPreO - Experimental Proteomics Process Ontology • ProPreO models the phases of proteomics experiment using five fundamental concepts: • Data: (Example: a peaklist file from ms/ms raw data) • Data_processing_applications: (Example: MASCOT* search engine) • Hardware: embodies instrument types used in proteomics (Example: ABI_Voyager_DE_Pro_MALDI_TOF) • Parameter_list: describes the different types of parameter lists associated with experimental phases • Task: (Example: component separation, used in chromatography) *http://www.matrixscience.com/
Semantic Annotation of Scientific Data
ms/ms peaklist data:
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
Annotated ms/ms peaklist data:
<ms/ms_peak_list>
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
<parent_ion_mass>830.9570</parent_ion_mass>
<total_abundance>194.9604</total_abundance>
<z>2</z>
<mass_spec_peak m/z="580.2985" abundance="0.3592"/>
<mass_spec_peak m/z="688.3214" abundance="0.2526"/>
<mass_spec_peak m/z="779.4759" abundance="38.4939"/>
<mass_spec_peak m/z="784.3607" abundance="21.7736"/>
<mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
<mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
<mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
<mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
Semantic Annotation of Scientific Data
Annotated ms/ms peaklist data:
<ms/ms_peak_list>
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
<parent_ion_mass>830.9570</parent_ion_mass>
<total_abundance>194.9604</total_abundance>
<z>2</z>
<mass_spec_peak m/z="580.2985" abundance="0.3592"/>
<mass_spec_peak m/z="688.3214" abundance="0.2526"/>
<mass_spec_peak m/z="779.4759" abundance="38.4939"/>
<mass_spec_peak m/z="784.3607" abundance="21.7736"/>
<mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
<mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
<mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
<mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
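The step from a raw peaklist to its annotated form might be sketched as below. This is a rough sketch under assumptions: the raw layout is read as parent ion mass, total abundance, and charge followed by (m/z, abundance) pairs, as the slide's numbers suggest. The element name `ms_ms_peak_list` and the attribute name `mz` are substitutes chosen here, because `/` is not allowed in XML names. The function `annotate` is hypothetical and not part of any ProPreO tooling.

```python
import xml.etree.ElementTree as ET

# Raw ms/ms peaklist values as shown on the slide.
RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""

def annotate(raw, instrument, mode):
    """Wrap the bare numbers in elements that name what each value means."""
    tokens = raw.split()
    parent_ion_mass, total_abundance, charge = tokens[:3]
    root = ET.Element("ms_ms_peak_list")          # assumed name ("/" is invalid in XML names)
    ET.SubElement(root, "parameter", instrument=instrument, mode=mode)
    ET.SubElement(root, "parent_ion_mass").text = parent_ion_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    # The remaining tokens alternate between m/z and abundance.
    for mz, abundance in zip(tokens[3::2], tokens[4::2]):
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root

annotated = annotate(
    RAW,
    instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer",
    mode="ms/ms",
)
xml_text = ET.tostring(annotated, encoding="unicode")
```

The instrument value is taken verbatim from the slide; in the annotated form it would refer to a Hardware concept in the ontology rather than being a free-text string.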
Syntax for Ontologies and Metadata • Why not use XML? • Why use OWL? • Or for that matter why RDF? • So many questions …
From XML to OWL • XML • surface syntax for structured documents • imposes no semantic constraints on the meaning of these documents. • XML Schema • is a language for restricting the structure of XML documents. • RDF • is a datamodel for objects ("resources") and relations between them, • provides a simple semantics for this datamodel • these datamodels can be represented in an XML syntax. • RDF Schema • is a vocabulary for describing properties and classes of RDF resources • with a semantics for generalization-hierarchies of such properties and classes. • OWL • adds more vocabulary for describing properties and classes: • relations between classes (e.g. disjointness), • cardinality (e.g. "exactly one"), • equality, richer typing of properties, • characteristics of properties (e.g. symmetry), and enumerated classes. [Arrow alongside the list, from NO SEMANTICS (XML) to SEMANTICS (OWL): expressive power increases; relationships as first class objects are key to semantics] http://en.wikipedia.org/wiki/Semantic_web#Components_of_the_Semantic_Web
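Three of the OWL additions listed above (symmetry, disjointness, "exactly one") can be illustrated over plain triples. This is a simplified, closed-world sketch and not how an OWL reasoner works: OWL assumes an open world, so a missing value is not an inconsistency there. All `ex:` names and the three helper functions are hypothetical.

```python
TYPE = "rdf:type"

facts = {
    ("ex:Miami_Heat", "ex:competes_with", "ex:LA_Lakers"),
    ("ex:Kobe_Bryant", TYPE, "ex:Athlete"),
    ("ex:Kobe_Bryant", TYPE, "ex:Team"),          # deliberate modeling error
    ("ex:Kobe_Bryant", "ex:plays_for", "ex:LA_Lakers"),
    ("ex:Shaquille_ONeal", TYPE, "ex:Athlete"),   # no plays_for statement
}

def symmetric_closure(triples, symmetric_props):
    """Symmetry: if (s, p, o) holds and p is symmetric, (o, p, s) also holds."""
    return set(triples) | {(o, p, s) for (s, p, o) in triples if p in symmetric_props}

def disjointness_violations(triples, disjoint_pairs):
    """Disjointness: no individual may be an instance of both classes in a pair."""
    types = {}
    for s, p, o in triples:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    return sorted(
        (s, a, b)
        for s, classes in types.items()
        for (a, b) in disjoint_pairs
        if a in classes and b in classes
    )

def count_mismatches(triples, prop, subjects):
    """Closed-world check for 'exactly one': subjects whose value count is not 1."""
    counts = {s: 0 for s in subjects}
    for s, p, o in triples:
        if p == prop and s in counts:
            counts[s] += 1
    return sorted(s for s, n in counts.items() if n != 1)

closed = symmetric_closure(facts, {"ex:competes_with"})
clashes = disjointness_violations(facts, [("ex:Athlete", "ex:Team")])
mismatches = count_mismatches(facts, "ex:plays_for", ["ex:Kobe_Bryant", "ex:Shaquille_ONeal"])
```

None of these constraints can be stated in XML or XML Schema in a way that a generic processor would interpret as meaning; in OWL they are part of the vocabulary itself.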
From an alphabet to a Language • XML • “XML is only the first step to ensuring that computers can communicate freely. XML is an alphabet for computers and as everyone traveling in Europe knows, knowing the alphabet doesn’t mean you can speak Italian or French.” – Business Week, March 18th 2002 • Example cited by Nicola Guarino in http://www.w3c.it/events/minerva20040706/guarino.pdf • RDF/RDFS and OWL would therefore be akin to the language computers use to communicate • And ontologies represented in these languages would be akin to the exact interpretations of the concepts being communicated
Syntax for Ontologies and Metadata • RDF • A simple W3C standard used to describe Web resources • Relationships in RDF (properties) are binary relationships between two resources, or between a resource and a literal • The resource being described takes the role of Subject; the related resource or literal takes the role of Object • The Subject, Predicate and Object compose an RDF statement http://www.w3.org/RDF/
What is RDF? • Resource Description Framework • Proposed as the base semantic web language • Data model for describing properties of resources • Statements about properties and values of web resources • Machine-understandable metadata
RDF Elements • Resource: • Something that can be described/referenced • Identified by a URI • Property: • Relationship from a resource to a value: • Another resource • An atomic value/literal • Statement: • resource -> property -> value
RDF Model • Formal Data Model • Directed labeled graph • Nodes: resources or literals • Edges: properties (relationships/attributes) • Labels: URIs of nodes and edges • Collection of triples • subject (resource) • predicate (property) • object (resource or literal) • W3C recommendation
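The two views of the model above, a collection of triples and a directed labeled graph, can be sketched over the same data. This is a minimal illustration and not a standard API: the `EX` namespace and the helpers `match` and `outgoing_edges` are assumptions, and resources and literals are both plain strings here for brevity.

```python
# Hypothetical namespace used to build resource URIs.
EX = "http://example.org/sample#"

# A graph is a set of (subject, predicate, object) triples.
graph = {
    (EX + "Kobe_Bryant", EX + "plays_for", EX + "LA_Lakers"),    # object is a resource
    (EX + "Kobe_Bryant", EX + "label", "Kobe Bryant"),            # object is a literal
    (EX + "Miami_Heat", EX + "competes_with", EX + "LA_Lakers"),
}

def match(triples, s=None, p=None, o=None):
    """Triple view: return the triples that agree with every non-None component."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def outgoing_edges(triples, node):
    """Graph view: labeled edges (property, target) that leave `node`."""
    return {(p, o) for (s, p, o) in triples if s == node}

lakers_links = match(graph, o=EX + "LA_Lakers")
kobe_edges = outgoing_edges(graph, EX + "Kobe_Bryant")
```

Both functions read the same set, which is the point of the model: the graph is nothing more than the triples, with subjects and objects as nodes and predicates as edge labels.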
RDF Syntax • Formal syntax • Encoded in XML • Unambiguous property names and values • RDF adds rules for interpretation • W3C recommendation
Example
<sample:Athlete rdf:about="&sample;Kobe_Bryant">
  <rdfs:label xml:lang="en">Kobe Bryant</rdfs:label>
  <sample:plays_for rdf:resource="&sample;LA_Lakers"/>
</sample:Athlete>
<sample:Athlete rdf:about="&sample;Shaquille_ONeal">
  <rdfs:label xml:lang="en">Shaquille O'Neal</rdfs:label>
  <sample:plays_for rdf:resource="&sample;Miami_Heat"/>
</sample:Athlete>
<sample:Team rdf:about="&sample;LA_Lakers">
  <rdfs:label xml:lang="en">LA Lakers</rdfs:label>
</sample:Team>
<sample:Team rdf:about="&sample;Miami_Heat">
  <rdfs:label xml:lang="en">Miami Heat</rdfs:label>
  <sample:competes_with rdf:resource="&sample;LA_Lakers"/>
</sample:Team>
<sample:Coach rdf:about="&sample;sample1_Instance_8">
  <rdfs:label xml:lang="en">sample1_Instance_8</rdfs:label>
  <sample:coaches rdf:resource="&sample;LA_Lakers"/>
</sample:Coach>
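To see the statements this example encodes, the fragment can be flattened into triples. This is a sketch under assumptions: the fragment is wrapped in an `rdf:RDF` root, the `&sample;` entity is replaced by a hypothetical namespace URI (`SAMPLE_NS`), and `to_triples` handles only the small subset of RDF/XML used here (typed nodes with `rdf:about`, and properties with `rdf:resource` or text content). It is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
SAMPLE_NS = "http://example.org/sample#"  # hypothetical stand-in for &sample;

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:rdfs="{RDFS_NS}" xmlns:sample="{SAMPLE_NS}">
  <sample:Athlete rdf:about="{SAMPLE_NS}Kobe_Bryant">
    <rdfs:label xml:lang="en">Kobe Bryant</rdfs:label>
    <sample:plays_for rdf:resource="{SAMPLE_NS}LA_Lakers"/>
  </sample:Athlete>
  <sample:Athlete rdf:about="{SAMPLE_NS}Shaquille_ONeal">
    <rdfs:label xml:lang="en">Shaquille O'Neal</rdfs:label>
    <sample:plays_for rdf:resource="{SAMPLE_NS}Miami_Heat"/>
  </sample:Athlete>
  <sample:Team rdf:about="{SAMPLE_NS}LA_Lakers">
    <rdfs:label xml:lang="en">LA Lakers</rdfs:label>
  </sample:Team>
  <sample:Team rdf:about="{SAMPLE_NS}Miami_Heat">
    <rdfs:label xml:lang="en">Miami Heat</rdfs:label>
    <sample:competes_with rdf:resource="{SAMPLE_NS}LA_Lakers"/>
  </sample:Team>
  <sample:Coach rdf:about="{SAMPLE_NS}sample1_Instance_8">
    <rdfs:label xml:lang="en">sample1_Instance_8</rdfs:label>
    <sample:coaches rdf:resource="{SAMPLE_NS}LA_Lakers"/>
  </sample:Coach>
</rdf:RDF>"""

def uri(tag):
    """ElementTree writes tags as '{namespace}local'; join them into one URI."""
    return tag[1:].replace("}", "", 1)

def to_triples(xml_text):
    """Flatten the typed-node subset of RDF/XML used above into triples."""
    triples = []
    for node in ET.fromstring(xml_text):
        subject = node.get(f"{{{RDF_NS}}}about")
        # The element name of a typed node is the class of the subject.
        triples.append((subject, RDF_NS + "type", uri(node.tag)))
        for prop in node:
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else (prop.text or "")
            triples.append((subject, uri(prop.tag), obj))
    return triples

triples = to_triples(DOC)
```

Each element with `rdf:about` contributes one `rdf:type` statement plus one statement per child element, so the five descriptions above yield fourteen triples.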
What is RDFS? • RDF Vocabulary Description Language • (RDF Schema) • Extension of RDF: same data model • graph or triples • A hierarchy of classes • A hierarchy of properties relating classes • W3C recommendation
RDF Schema and RDF Instances [Figure: a schema graph above an instance graph, linked by typeOf (instance) edges] • Schema classes: Passenger, Ticket, Flight, Payment, Bank Account, Customer, FFlyer, FFNo, CCard, Cash, Client • Schema properties: purchased, for, forflight, paidby, creditedto, holder, ffid, fflierno, fname, lname, number, no, amount • Literal types: String, float • Schema relationships: subClassOf (isA), subPropertyOf • Instances: resources &r1 to &r9 and &r11, connected by purchased, for, paidby, creditedto, holder, ffid, fflierno, fname and lname edges • Instance literals: “Abdulaziz”, “Marwan”, “Alomari”, “Al-Shehhi”, “M’mmed”, “Atta”, “XYZ123”
RDFS Core Classes • rdfs:Class • Class of resources that are RDF classes • Instance of rdfs:Class • rdfs:Resource • All things being described • The class type of everything in RDF(S) • Instance of rdfs:Class • rdf:Property • Class of RDF properties • Instance of rdfs:Class http://www.w3.org/TR/rdf-schema/
RDFS Core Properties • rdf:type • A resource is an instance of a class • Instance of rdf:Property • rdfs:subClassOf • All instances of a class are also instances of another class • Instance of rdf:Property • rdfs:subPropertyOf • All resources related by one property are also related by another property • Instance of rdf:Property
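The meaning of subClassOf and subPropertyOf can be made concrete with a small inference sketch. It is only an illustration of the rules stated above, not a complete RDFS reasoner; the `ex:` names and the function `rdfs_closure` are hypothetical.

```python
TYPE, SUBCLASS, SUBPROP = "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf"

facts = {
    ("ex:Kobe_Bryant", TYPE, "ex:Athlete"),
    ("ex:Athlete", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:Kobe_Bryant", "ex:plays_for", "ex:LA_Lakers"),
    ("ex:plays_for", SUBPROP, "ex:member_of"),
}

def rdfs_closure(triples):
    """Apply the subclass and subproperty rules until nothing new is derived."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                # subClassOf and subPropertyOf are transitive.
                if p == p2 and p in (SUBCLASS, SUBPROP) and o == s2:
                    new.add((s, p, o2))
                # An instance of a class is also an instance of its superclasses.
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # Resources related by a property are also related by its superproperties.
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
        if new <= inferred:
            return inferred
        inferred |= new

closure = rdfs_closure(facts)
```

The nested loop is quadratic in the number of triples on each pass, which is acceptable for a slide-sized example; stores built for large graphs index the schema triples instead.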
RDFS Core Properties (continued) • rdfs:range • All values of a property are instances of one or more classes • The value MUST be an instance of all range classes • Instance of rdf:Property • rdfs:domain • All resources with the given property are instances of one or more classes • The resource MUST be an instance of all domain classes • Instance of rdf:Property
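Domain and range declarations can be read as inference rules: using a property tells us the types of its subject and its value. The sketch below illustrates this reading; the `ex:` names and the function `infer_types` are hypothetical, and this is not a full RDFS reasoner.

```python
TYPE, DOMAIN, RANGE = "rdf:type", "rdfs:domain", "rdfs:range"

facts = {
    ("ex:plays_for", DOMAIN, "ex:Athlete"),
    ("ex:plays_for", RANGE, "ex:Team"),
    ("ex:plays_for", RANGE, "ex:Organization"),   # a second range class
    ("ex:Kobe_Bryant", "ex:plays_for", "ex:LA_Lakers"),
}

def infer_types(triples):
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    domains, ranges = {}, {}
    for s, p, o in triples:
        if p == DOMAIN:
            domains.setdefault(s, set()).add(o)
        elif p == RANGE:
            ranges.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, TYPE, cls))   # the subject belongs to every domain class
        for cls in ranges.get(p, ()):
            inferred.add((o, TYPE, cls))   # the value belongs to every range class
    return inferred

new_types = infer_types(facts)
```

With two range classes declared, the value `ex:LA_Lakers` is inferred to be an instance of both, which matches the "all range classes" wording above.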
OWL, W3C definition • “language for defining structured, Web-based ontology which enables richer integration and interoperability of data across application boundaries” http://www.w3.org/2004/OWL/
OWL Use Cases • Web portals • Multimedia Collections • Corporate web site management • Design documentation • Agents and services • Ubiquitous computing
OWL Design Goals • Shared ontologies • Ontology evolution • Ontology interoperability • Inconsistency detection • Expressivity vs. scalability • Ease of use • Compatibility with other standards • Internationalization
What’s in OWL, but not in RDF • Ability to be distributed across many systems • By means of owl:imports (similar to ‘include’ in C/C++) • Scalable to Web needs (?) • Compatible with Web standards for: • accessibility, and • Internationalization • Open and extensible