Explore the challenges and opportunities of dynamic linked datasets and remote change management in the context of the Semantic Web. Learn about the benefits, methodologies, and best practices for handling evolving data.
Data and Knowledge Evolution • Giorgos Flouris, fgeo@ics.forth.gr • Open Data Tutorials, May 2013 • Slides available at: http://www.ics.forth.gr/~fgeo/Publications/WOD13.ppt
World Wide Web • WWW (and HTML) focus on human readability • Page presentation (fonts, colors, images, …) • Human understanding • Presentation ≠ semantical content • Content is not formally described (for a machine to understand) • WWW contains documents, not data
Problems with the Current Web • Search and access become difficult • Software is ignorant of the semantical content of a web page • Keyword search • High recall, low precision • Terminological issues • Synonyms (heart disease = cardiac disease) • Hyponyms/hypernyms (parliament members are politicians) • Queries on the semantical content cannot be made • Fetch articles that support B. Obama’s foreign policy • Fetch the home pages of all members of the Greek Parliament
Semantic Web • The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation [BLHL01] • The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries (http://www.w3.org/2001/sw/) • [Semantic Web] is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners (http://www.w3.org/2001/sw/)
Semantic Web in Practice • Web of data, rather than documents • HTML for presentation • Semantical languages for semantical content • Readable and understandable by humans and machines • Semantic Web languages, protocols, etc • Web page annotation (metadata descriptions etc) • Publication of data on the Internet • Efficient communication and manipulation of data over the Internet • Different applications • Efficient searching • Sharing of data (e-science, e-government, remote learning, …) • Linked Open Data (more on that later)
Ontologies and Data (Datasets) • An ontology is an explicit specification of a shared conceptualization of a domain [Gru93] • Precise, logical account of the intended meaning of terms • Common (shared) interpretation of terms • Formal vocabulary for information exchange (humans/machines) • Ontologies (vocabularies) allow the description of data • Terminology: • Ontology = vocabulary = schema • Data = instances • Dataset = data and the related ontology (i.e., a dataset may contain schema and/or data)
Dataset Dynamics • Datasets change constantly • World changes (dynamic models) • View on the world changes (new knowledge, measurements, etc) • Perspective and usage changes • Example: • Gene Ontology (information about gene products): daily versions • DBPedia: 1.4 updates/second (http://live.dbpedia.org/LiveStats/) [MLA+12] • Need methodologies to cope with the problems related to dynamicity • Evolution (modify a dataset in response to a change) • Versioning (keep track of versions and their relations) • Debugging, cleaning, repairing, quality (maintain consistency and quality in a dynamic environment) • Change monitoring, detection and propagation (identify changes and use them to synchronize remote datasets) • …
Linked (Open) Data • Datasets can be interlinked • Sharing knowledge • Reusing knowledge • Modular development • Reuse of schemas • Linked Open Data (LOD) movement • Constantly growing • 31 billion triples and 295 datasets as of September 2011
Linked Open Data Cloud Diagram • [Figure: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]
Linked Open Data Challenges • Both a blessing and a curse • Added-value benefits • Discovery of unknown correlations, connections, relationships • Vast amount of interrelated knowledge • No central control, everyone can publish and relate to others • Quality of datasets depends on different providers • A change in one dataset affects all related ones • Several new problems related to dynamics • Propagation of changes among interrelated datasets • Maintaining the quality of local datasets • Co-evolution
Scope: Dynamic Linked Datasets • [Figure: areas labeled “Dynamic Datasets” and “Linked Datasets”, with a “You are here” marker]
Purpose of This Talk • To survey different research areas related to dynamic LOD • Remote Change Management • Repair • Data and Knowledge Evolution • Categorize and classify works in each field • Broad but shallow description • Several references for more in-depth study • No claims of completeness (references are just indicative) • Two relevant surveys: [FMK+08, ZAA+13] • Emphasis on some related work done in FORTH • Will avoid technical discussion • References will be given for further details
Defining Remote Change Management • Managing the effects of remote changes on interlinked datasets • Remote changes have profound effects on local datasets • Good practices are important • Proper versioning, change logging, adaptation to remote changes, … • Attention exploded after the success of the LOD paradigm • Related research questions • How should I version my data? • How can I efficiently monitor changes in my dataset? • How can I detect changes in remote datasets? • How does the evolution of remote datasets affect my data? • How can I efficiently propagate changes from one dataset to another?
Remote Change Management: Visualization • [Figure: datasets RD0, RD1, LD0, LD1; labels: Remote Site, Versioning, Change Monitoring, Change Detection, Local Site, Change Propagation]
Remote Change Management: Structure • Three subfields • Versioning • Change monitoring and detection • Change propagation • Structure • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13]
Defining Repair • Assessing and improving the quality and the semantical or structural integrity of the data • Maintaining consistency, coherency, validity • Restoring consistency, coherency, validity, when violated • Assessing and improving quality • Preserve quality/integrity in the face of remote changes • Related research questions • How can I preserve the integrity and quality of my data in a dynamic and interlinked environment? • How can I guarantee consistency and validity? • How can I restore consistency and validity, if violated?
Repair: Visualization • [Figure: datasets D0, D1; Assessment Module (Diagnosis, Quality Assessment); Repair Process (Cleaning, Debugging, Repairing, Quality Enhancement)]
Repair: Structure • Four subfields • Cleaning • Debugging • Validity repair • Quality enhancement • Structure • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11]
Defining Evolution • Modifying a dataset in response to a change in the domain or its conceptualization • Identify the result of applying new information on the dataset • Determine the result of change propagation from remote datasets • Understand the process of change • Related research questions • What is the semantics of evolution and change? • How can I efficiently compute the ideal evolution result?
Evolution: Visualization • [Figure: Real World; Dataset (D0, D1); Evolution Algorithm; changes such as Delete_Class(…), Pull_Up_Class(…), Rename_Class(…), …]
Evolution: Summary • Evolution topics • Understanding the evolution challenges • Understanding the process of change • Balancing between philosophical and practical considerations • Cross-fertilization with belief change • Structure • Introduction, connection with belief change • Understanding the process of change • Literature review
General Structure of this Talk • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review • The final few slides contain citations for the references in this talk • Part I (2 hours) • Part II (1 hour)
Talk Structure (A) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review
Datasets • Basic structures • Classes (or concepts): collections of objects (e.g., Actor, Politician) • Properties (or roles): binary relationships between objects (e.g., started_on, member_of) • Instances (or individuals): objects (e.g., Giorgos, B. Obama) • Relations between them • Subsumption (Parliament_Member subclass of Politician), instantiation (B. Obama instance of Politician), … • The allowed relations and their semantics depend on the language • Different representation languages for LOD • RDF/S, OWL
Visualization, Triples, Serialization • Triple Representation • Define classes: [Period type Class] • Define properties: [participants type Property], [participants domain Onset], [participants range Actor] • Instantiate/define individuals: [G_Birth type Birth], [Giorgos type Actor], [G_Birth participants Giorgos] • Define hierarchies: [Event subClass Period] • Serialization (RDF/XML) • <rdfs:Class rdf:ID="Period"/> • <rdf:Property rdf:ID="participants"> <rdfs:domain rdf:resource="#Onset"/> <rdfs:range rdf:resource="#Actor"/> </rdf:Property> • <Birth rdf:about="#G_Birth"> <participants> <Actor rdf:about="#Giorgos"/> </participants> </Birth> • <rdfs:Class rdf:ID="Event"> <rdfs:subClassOf rdf:resource="#Period"/> </rdfs:Class> • Visualization • [Figure: graph with classes Period, Event, Actor, Onset, Birth; properties participants, started_on; individuals G_Birth, Giorgos; edge types: subsumption, instantiation; label: Existing Stuff]
RDF and RDFS • An RDF dataset consists of triples • RDFS adds semantics • Subsumption hierarchies (classes and properties) • Transitive • Instantiation • Inheritance, implicit instantiation • Sometimes more than subsumption/instantiation is needed • Combining concepts, roles to form more complex relations • Concept definitions: a mother is a female who has a child • Other knowledge: all items stored in warehouse X are flammable • Constraints on data • Each person must have one mother
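The RDFS semantics listed above (transitive subsumption, inheritance, implicit instantiation) can be illustrated with a small fixpoint computation. This is a minimal sketch, not a full RDFS reasoner: it models only two rules, uses the short predicate names of the slides ("type", "subClass") instead of full URIs, and the class Person and the individual some_mp are hypothetical names added for the example.

```python
# Illustrative sketch: closure of a triple set under two RDFS-style rules.
# Predicate names follow the slide notation, not the rdf:/rdfs: URIs.
Triple = tuple[str, str, str]


def rdfs_closure(triples: set[Triple]) -> set[Triple]:
    """Return the input plus triples implied by (1) transitivity of
    subClass and (2) inheritance of instantiation along subClass."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for (s, p, o) in closure if p == "subClass"]
        inferred: set[Triple] = set()
        for (a, b) in sub:
            # Rule 1: a subClass b, b subClass d  =>  a subClass d
            for (c, d) in sub:
                if b == c:
                    inferred.add((a, "subClass", d))
            # Rule 2: x type a, a subClass b  =>  x type b
            for (s, p, o) in closure:
                if p == "type" and o == a:
                    inferred.add((s, "type", b))
        new = inferred - closure
        if new:
            closure |= new
            changed = True
    return closure


explicit = {
    ("Parliament_Member", "subClass", "Politician"),
    ("Politician", "subClass", "Person"),        # hypothetical superclass
    ("some_mp", "type", "Parliament_Member"),    # hypothetical individual
}
implied = rdfs_closure(explicit) - explicit
```

Here implied holds the triples that are not stated explicitly but follow from the two rules, which is what "implicit instantiation" refers to.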
Extensions of RDF/S: DLs (1/2) • Description Logics (DLs) • http://dl.kr.org/ • Formal underpinning of web representation languages • Family of logical formalisms • Well-defined semantics • Model-theoretic reasoning based on interpretations • Formally studied • Expressiveness, reasoning tools, computational complexity, … • Components • Individuals: specific objects (instances) – Giorgos • Concepts: sets of individuals (classes) – Parent • Roles: sets of pairs of individuals (properties) – has_child • Operators: ⊓, {.}, ⊤, … • Connectives: ⊑, ≡, …
Extensions of RDF/S: DLs (2/2) • Definitions, partial definitions, constraints, subsumptions, … • A mother is a female who has a child • Mother ≡ ∃has_child ⊓ Female • Each person must have one mother • Person ⊑ ∃has_child⁻¹.Mother • A great variety of DLs (trade-off involved) • Different properties • Different expressive power • Different reasoning complexity
Extensions of RDF/S: OWL • OWL (Web Ontology Language) • http://www.w3.org/2004/OWL/ • General-purpose representation language • Compatible with the architecture of the Semantic Web • A family of languages • Flavors: OWL-Lite, OWL-DL, OWL Full • Profiles: OWL 2 EL, OWL 2 QL, OWL 2 RL • Different expressiveness (and complexity) • Each corresponds to a specific DL • Useful from a modeling perspective • Expressive but not too complex • Appealing computationally
Representation Languages in LOD • Mostly RDF • With RDFS semantics • Instantiations • Class subsumption • Property subsumption is rare • Some OWL • Mostly OWL Lite • Extensive use of owl:sameAs • Often abusing it [HHM+10] • OWL 2 profiles are gaining ground
Talk Structure (B1) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review
Motivation for Remote Change Management • [Figure: DL uses DR] • Crucial problem for dynamic linked datasets • Linking: datasets linked to other datasets (e.g., vocabularies) • Dynamics: changes cause problems for linked datasets • No central curation or control • No control over (or knowledge of) other datasets’ evolution process • Curators don’t bother annotating and logging changes • Temporal and versioning information is usually missing [RPH+12] • Remote change management seeks solutions to allow: • Keeping track of versions • Restoring previous versions • Assessing compatibility of versions • Monitoring and detecting changes • Tracing back the evolution history (of datasets, concepts, …) • For visualization and understanding • Propagating changes to synchronize linked datasets
Subfields of Remote Change Management • Remote Change Management • Versioning • Keep track of versions • Change monitoring and detection • Monitoring: record changes as they happen • Detection: identify changes after they happen • Change propagation • Propagate changes across linked datasets for synchronization purposes
Versioning • Versioning • Keep track of versions • Identify different versions of a dataset • Enable transparent access to the “correct” version (smooth interoperation) • Issues involved • Identification • Determine which versions to store and how to identify them • Manually or automatically (syntactical, semantical considerations) • Packaging of changes • Relation between versions • A sequence or a tree • Compatibility information • Backwards/forwards compatibility and how to determine it (often manually) • Dataset-wide compatibility or fine-grained compatibility (e.g., at resource level) • Metadata on the different versions • Transparent access • Relate versions with (compatible) data sources, applications etc
Change Monitoring and Detection • [Figure: DL uses DR] • Change monitoring • Record changes as they happen • Manual (error-prone and often incorrect) • Automatic (not used in practice) • Depends on the goodwill of the dataset owner • Sometimes change logs are inaccessible • Change detection • Identify changes after they happen • Based on the previous and current versions • In both cases, a change language is required • Supported set of changes, along with their semantics • Can be low-level or high-level
Change Propagation • Change propagation • Communicate changes to linked datasets for synchronization • Push-based or pull-based propagation • Push-based: source-initiated, via “registration” or via monitoring and versioning • Pull-based: consumer-initiated • Communication based on deltas (rather than versions) • Reduce communication overhead • Reduce storage requirements • On average, 2-3% of a dataset changes between versions [OK02] • Deltas are based on a language of changes
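A pull-based, delta-based synchronization of the kind described above might look as follows. This is a minimal sketch under stated assumptions: the classes Provider, Consumer and Delta and their methods are hypothetical and do not correspond to any tool cited in this talk, and deltas are plain sets of added and deleted triples.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]


@dataclass
class Delta:
    """Low-level delta between two consecutive versions."""
    added: set[Triple]
    deleted: set[Triple]


@dataclass
class Provider:
    """Remote dataset that keeps one delta per published version.
    deltas[i] leads from version i to version i + 1."""
    deltas: list[Delta] = field(default_factory=list)

    def publish(self, delta: Delta) -> None:
        self.deltas.append(delta)

    def deltas_since(self, version: int) -> list[Delta]:
        return self.deltas[version:]


@dataclass
class Consumer:
    """Local copy that pulls only the deltas it has not yet applied."""
    triples: set[Triple]
    version: int = 0

    def pull(self, provider: Provider) -> None:
        for d in provider.deltas_since(self.version):
            self.triples = (self.triples - d.deleted) | d.added
            self.version += 1


provider = Provider()
provider.publish(Delta(added={("a", "p", "b")}, deleted=set()))
provider.publish(Delta(added={("a", "p", "c")}, deleted={("a", "p", "b")}))

consumer = Consumer(triples=set())
consumer.pull(provider)  # consumer-initiated: fetches and applies both deltas
```

Only the deltas cross the wire, never a full version, which is the communication saving the slide refers to. A push-based variant would invert control: the provider would keep a list of registered consumers and call them on publish.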
Talk Structure (B2) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review
Versioning Approaches (1/3) • Capture different aspects of versioning, such as: • Detecting versions • Storing versions efficiently • Allowing cross-snapshot queries • Find gene products whose functions have not changed in the last 50 versions • Determine price fluctuation for x along different versions of the product catalog • Early versioning approaches inspired by SVN • Good for files, not directly adaptable to semantical languages • SHOE language [HH00] • Machine-readable version information (e.g., compatibility) • Provided by curator as SHOE statements • Memento [SSN+10] • Fine-grained versioning at URI level (resources, web pages) • Machine-readable version information, in the HTTP header • Timestamps, traversal information (prior/current versions) etc
Versioning Approaches (2/3) • Theoretical foundations for versioning [HP04] • Formal definitions to capture notions such as: • Compatibility (between versions) • Commitment (resources committing to a certain ontology) • Ontology perspectives (the part of the web committing to an ontology) • Temporal approaches [HS05, PTC05, KLGE07] • For capturing temporal relations between versions • For allowing cross-snapshot queries • Versioning in multi-editor environments [RSDT08] • Via change monitoring
Versioning Approaches (3/3) • Automatically detecting version relationships [AAM09] • Using heuristics based on URIs • Study of “relatedness” between versions [CQ13] • A model of “relatedness” between vocabularies from various sources • Similar to links in web pages • POI: Partial Order Index [TTA08] • Efficient method for storing versions and their differences • Stores several versions, exploiting their common triples for efficient storage
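The idea of storing several versions while sharing their common triples, and of querying across snapshots, can be sketched with a simple index. This is not the POI structure of [TTA08]; it is a deliberately simpler stand-in that maps each triple to the set of versions containing it. The class VersionStore and the gene data are hypothetical.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


class VersionStore:
    """Stores each distinct triple once, tagged with the versions it occurs in."""

    def __init__(self) -> None:
        self._versions_of: dict[Triple, set[int]] = defaultdict(set)
        self._count = 0

    def commit(self, triples: set[Triple]) -> int:
        """Register a new version and return its identifier."""
        vid = self._count
        for t in triples:
            self._versions_of[t].add(vid)
        self._count += 1
        return vid

    def snapshot(self, vid: int) -> set[Triple]:
        """Reconstruct the full content of one version."""
        return {t for t, vs in self._versions_of.items() if vid in vs}

    def unchanged_in_all(self) -> set[Triple]:
        """Cross-snapshot query: triples present in every committed version."""
        return {t for t, vs in self._versions_of.items() if len(vs) == self._count}


store = VersionStore()
v0 = store.commit({("geneA", "function", "binding"), ("geneB", "function", "transport")})
v1 = store.commit({("geneA", "function", "binding"), ("geneB", "function", "catalysis")})
stable = store.unchanged_in_all()
```

In this toy data, stable contains only the geneA triple, answering a query of the form "which functions have not changed across versions" without materializing each snapshot.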
Change Languages (1/2) • Change languages necessary for monitoring, detection, propagation • Granularity • Low-level (or atomic, or elementary) • Simple add/remove operations • Add(s,p,o), Delete(s,p,o) • Simple to detect and define • Focus on machine-readability: determinism, well-defined semantics • High-level (or complex, or composite) • More coarse-grained, compact, closer to editor’s perception and intuition • Generalize_Domain(P,A), Delete_Class(A) • More interesting; harder to detect and define • Focus on human-understandability: often unclear and/or informal semantics
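The difference between the two granularities can be made concrete with a small sketch. The low-level part is a plain set difference; the high-level part lifts groups of Delete operations into a single Delete_Class. The definition used for Delete_Class here (a class declared in the old version that is no longer mentioned in the new one) is an assumption made for illustration and is not taken from any of the cited languages.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Add:
    triple: Triple


@dataclass(frozen=True)
class Delete:
    triple: Triple


@dataclass(frozen=True)
class DeleteClass:
    """Hypothetical high-level change: the class and all triples mentioning it are gone."""
    name: str


def low_level(old: set[Triple], new: set[Triple]) -> set:
    """Low-level delta: one Add/Delete per triple, computed by set difference."""
    return {Add(t) for t in new - old} | {Delete(t) for t in old - new}


def lift_delete_class(old: set[Triple], new: set[Triple], changes: set) -> set:
    """Replace the Delete operations that mention a vanished class by one DeleteClass."""
    old_classes = {s for (s, p, o) in old if p == "type" and o == "Class"}
    mentioned_in_new = {x for (s, p, o) in new for x in (s, o)}
    lifted = set(changes)
    for c in old_classes - mentioned_in_new:
        covered = {
            ch for ch in changes
            if isinstance(ch, Delete) and c in (ch.triple[0], ch.triple[2])
        }
        lifted = (lifted - covered) | {DeleteClass(c)}
    return lifted


old = {
    ("Period", "type", "Class"),
    ("Event", "type", "Class"),
    ("Event", "subClass", "Period"),
}
new = {("Period", "type", "Class")}

atomic = low_level(old, new)
composite = lift_delete_class(old, new, atomic)
```

The two Delete operations in atomic collapse into the single DeleteClass("Event") in composite, which is more compact and closer to what an editor would say happened, at the price of needing an agreed definition of the high-level change.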
Change Languages (2/2) • Many different high-level languages (no standard) • [HGR12, JAP09, PFF+13, SK03, AH06, DA09, PTC07, …] • Some are domain-specific (e.g., [HGR12]) • Some are dynamic (e.g., [AH06, DA09, PTC07]) • Allow custom, user-defined changes • Some allow terminological changes (e.g., [PFF+13]) • Rename, merge, split • Common, but tough to detect (easily confused with add/delete)
Representation Issues • Deltas are just sets of changes from the change language • Changes usually represented using a change ontology • Ontology represents changes • A specific change is an instance of such an ontology • Deltas associated with sets of such instances • Different proposals [NCLM06, KFKO02, KN03, PT05] • Allows the manipulation and communication of deltas/changes using standard Semantic Web technologies
Change Monitoring Approaches • Using a version log [PT05] • Logging actions on the dataset • Use it for change detection, as well as proper versioning • Good quality, high-level change monitoring • Based on a dynamic language of changes • Using migration specifications [ZZL+03] • Similar to logs, but with a more formal structure • DBPedia change monitoring [MLA+12] • http://live.dbpedia.org/ • Live versions, as opposed to “standard” versions
Low-Level Change Detection (1/2) • SemVersion [VWS+05] • Developed in Karlsruhe (FZI, AIFB) • Low-level change detection tool for RDF • Also provides versioning functionalities • Allows cross-snapshot queries • For RDF [ILK12] • Low-level change detection based on set difference • Aggregating and compressing deltas • Also dealing with versioning issues • For RDF/S [ZTC11] • Takes into account semantics (RDFS inference) • Four different methods to compute deltas (all based on set difference) • Formal analysis of these methods’ properties and semantics • Extension: effect of blank nodes on change detection [TLZ12]
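Why semantics matter for set-difference deltas can be seen by contrasting a delta over the explicit triples with a delta over the inferred closures. The sketch below is an illustration only: it is not claimed to reproduce the four methods of [ZTC11], it models a single inference rule (transitivity of subClass), and the class names A, B, C are hypothetical.

```python
Triple = tuple[str, str, str]


def closure(triples: set[Triple]) -> set[Triple]:
    """Add subClass triples implied by transitivity (the only rule modelled here)."""
    result = set(triples)
    while True:
        sub = {(s, o) for (s, p, o) in result if p == "subClass"}
        implied = {(a, "subClass", d) for (a, b) in sub for (c, d) in sub if b == c}
        if implied <= result:
            return result
        result |= implied


def explicit_delta(old: set[Triple], new: set[Triple]):
    """(added, deleted) computed on the triples as stored."""
    return new - old, old - new


def closure_delta(old: set[Triple], new: set[Triple]):
    """(added, deleted) computed on the inferred closures."""
    c_old, c_new = closure(old), closure(new)
    return c_new - c_old, c_old - c_new


old = {("A", "subClass", "B"), ("B", "subClass", "C"), ("A", "subClass", "C")}
new = {("A", "subClass", "B"), ("B", "subClass", "C")}

syntactic = explicit_delta(old, new)
semantic = closure_delta(old, new)
```

The explicit delta reports that (A, subClass, C) was deleted, while the closure-based delta is empty because that triple is still implied by the other two. Which answer is "right" depends on whether the consumer cares about stored triples or about entailed knowledge.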
Low-Level Change Detection (2/2) • Bubastis (http://www.ebi.ac.uk/fgpt/sw/bubastis/index.html) • Simple diff tool (triple-based comparison) • Basically RDF, but also supports OWL • For DL-Lite [KWZ08] • Formal, semantical approach • For EL [KWW08] • Uses a concept-based description of changes • For propositional knowledge bases [FMV10] • Propositional, but generic; it can be applied to DLs • Formal analysis of the problem • Also dealing with propagation semantics
High-Level Change Detection (1/2) • For OWL: PromptDiff [NKKM04], OntoView [KFKO02] • Employ heuristics and probabilistic methods • Evaluation using precision/recall metrics against a gold standard • Integrated into tools that also provide versioning functionalities • For RDF/S [PFF+13] • Dealing with both machine-readability and human-understandability • Also dealing with propagation (applying changes) • To be discussed in detail later • COnto-Diff [HGR12] • Rule-based approach • Also dealing with propagation
Change Propagation Approaches • Usually part of other tools [SMMS02, MMS+03] • Versioning, monitoring tools (push-based propagation) • Detection tools (pull-based propagation) • Evolution and repair tools (pull-based propagation) • Adapt your data to be “compatible” with the new remote version • SparqlPush [PM10] • Push-based propagation of changes on SPARQL “views” • PRISM, PRISM++ [CMZ08, CMDZ10] • High-level language of schema changes for relational data • Also supports changes on the integrity constraints • Identifies and propagates the changes required in the data to conform to the new schema • Query and update rewriting • For applications that try to access the old schema
Other Change Management Approaches • Complete approach for XML [SP10] • Representing changes inline with the data using a graph (“evograph”) • Supports different change representation languages (both low-level and high-level) • Timestamps changes • Monitoring: evograph can be used to log the changes • Propagation: changes can be accessed and propagated • Versioning: timestamps in changes can be used to generate snapshots (versions) at different times • Allows cross-snapshot queries • Fairly generic, can be adapted for RDF
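The use of timestamped changes to generate snapshots can be sketched by replaying a change log up to a given time. This is not the evograph of [SP10]; it is a flat log used as a stand-in, and LoggedChange, snapshot_at and the product data are hypothetical.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class LoggedChange:
    timestamp: int   # hypothetical logical clock
    op: str          # "add" or "delete"
    triple: Triple


def snapshot_at(log: list[LoggedChange], t: int) -> set[Triple]:
    """Replay all changes with timestamp <= t, in time order, from an empty dataset."""
    state: set[Triple] = set()
    # sorted() is stable, so changes sharing a timestamp keep their log order
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.timestamp > t:
            break
        if change.op == "add":
            state.add(change.triple)
        elif change.op == "delete":
            state.discard(change.triple)
    return state


log = [
    LoggedChange(1, "add", ("x", "price", "10")),
    LoggedChange(3, "delete", ("x", "price", "10")),
    LoggedChange(3, "add", ("x", "price", "12")),
]

before = snapshot_at(log, 2)
after = snapshot_at(log, 3)
```

Comparing before and after gives a simple cross-snapshot query, here the price change of x between two points in time, using nothing but the logged changes.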
Talk Structure (B3) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review