Data and Knowledge Evolution. Giorgos Flouris, fgeo@ics.forth.gr. Open Data Tutorials, May 2013. Slides available at: http://www.ics.forth.gr/~fgeo/Publications/WOD13.ppt
Data and Knowledge Evolution Giorgos Flouris fgeo@ics.forth.gr Open Data Tutorials, May 2013 Slides available at: http://www.ics.forth.gr/~fgeo/Publications/WOD13.ppt
World Wide Web • WWW (and HTML) focus on human readability • Page presentation (fonts, colors, images, …) • Human understanding • Presentation ≠ semantical content • Content is not formally described (for a machine to understand) • WWW contains documents, not data
Problems with the Current Web • Search and access become difficult • Software ignorant of the semantical content of a web page • Keyword search • High recall, low precision • Terminological issues • Synonyms (heart disease = cardiac disease) • Hyponyms/hypernyms (parliament members are politicians) • Queries on the semantical content cannot be made • Fetch articles that support B. Obama’s foreign policy • Fetch the home pages of all members of the Greek Parliament
Semantic Web • The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation [BLHL01] • The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries http://www.w3.org/2001/sw/ • [Semantic Web] is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners http://www.w3.org/2001/sw/
Semantic Web in Practice • Web of data, rather than documents • HTML for presentation • Semantical languages for semantical content • Readable and understandable by humans and machines • Semantic Web languages, protocols, etc • Web page annotation (metadata descriptions etc) • Publication of data on the Internet • Efficient communication and manipulation of data over the Internet • Different applications • Efficient searching • Sharing of data (e-science, e-government, remote learning, …) • Linked Open Data (more on that later)
Ontologies and Data (Datasets) • An ontology is an explicit specification of a shared conceptualization of a domain [Gru93] • Precise, logical account of the intended meaning of terms • Common (shared) interpretation of terms • Formal vocabulary for information exchange (humans/machines) • Ontologies (vocabularies) allow the description of data • Terminology: • Ontology = vocabulary = schema • Data = instances • Dataset = data and the related ontology (i.e., a dataset may contain schema and/or data)
Dataset Dynamics • Datasets change constantly • World changes (dynamic models) • View on the world changes (new knowledge, measurements, etc) • Perspective and usage changes • Example: • Gene Ontology (information about gene products): daily versions • DBPedia: 1.4 updates/second (http://live.dbpedia.org/LiveStats/) [MLA+12] • Need methodologies to cope with the problems related to dynamicity • Evolution (modify a dataset in response to a change) • Versioning (keep track of versions and their relations) • Debugging, cleaning, repairing, quality (maintain consistency and quality in a dynamic environment) • Change monitoring, detection and propagation (identify changes and use them to synchronize remote datasets) • …
Linked (Open) Data • Datasets can be interlinked • Sharing knowledge • Reusing knowledge • Modular development • Reuse of schemas • Linked Open Data (LOD) movement • Constantly growing • 31 billion triples and 295 datasets as of September 2011
Linked Open Data Cloud Diagram [Figure: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/]
Linked Open Data Challenges • Both a blessing and a curse • Added-value benefits • Discovery of unknown correlations, connections, relationships • Vast amount of interrelated knowledge • No central control, everyone can publish and relate to others • Quality of datasets depends on the different providers • A change in one dataset affects all related ones • Several new problems related to dynamics • Propagation of changes among interrelated datasets • Maintaining the quality of local datasets • Co-evolution
Scope: Dynamic Linked Datasets [Figure: diagram with regions labeled Dynamic Datasets and Linked Datasets, and a "You are here" marker]
Purpose of This Talk • To survey different research areas related to dynamic LOD • Remote Change Management • Repair • Data and Knowledge Evolution • Categorize and classify works in each field • Broad but shallow description • Several references for more in-depth study • No claims of completeness (references are just indicative) • Two relevant surveys: [FMK+08, ZAA+13] • Emphasis on some related work done in FORTH • Will avoid technical discussion • References will be given for further details
Defining Remote Change Management • Managing the effects of remote changes on interlinked datasets • Remote changes have profound effects on local datasets • Good practices are important • Proper versioning, change logging, adaptation to remote changes, … • Attention exploded after the success of the LOD paradigm • Related research questions • How should I version my data? • How can I efficiently monitor changes in my dataset? • How can I detect changes in remote datasets? • How does the evolution of remote datasets affect my data? • How can I efficiently propagate changes from one dataset to another?
Remote Change Management: Visualization [Figure: diagram showing a Remote Site (datasets RD0, RD1) and a Local Site (datasets LD0, LD1), with the labels Versioning, Change Monitoring, Change Detection, and Change Propagation]
Remote Change Management: Structure • Three subfields • Versioning • Change monitoring and detection • Change propagation • Structure • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13]
Defining Repair • Assessing and improving the quality and the semantical or structural integrity of the data • Maintaining consistency, coherency, validity • Restoring consistency, coherency, validity, when violated • Assessing and improving quality • Preserve quality/integrity in the face of remote changes • Related research questions • How can I preserve the integrity and quality of my data in a dynamic and interlinked environment? • How can I guarantee consistency and validity? • How can I restore consistency and validity, if violated?
Repair: Visualization [Figure: diagram with datasets D0 and D1, an Assessment Module (Diagnosis, Quality Assessment), and a Repair Process (Cleaning, Debugging, Repairing, Quality Enhancement)]
Repair: Structure • Four subfields • Cleaning • Debugging • Validity repair • Quality enhancement • Structure • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11]
Defining Evolution • Modifying a dataset in response to a change in the domain or its conceptualization • Identify the result of applying new information on the dataset • Determine the result of change propagation from remote datasets • Understand the process of change • Related research questions • What is the semantics of evolution and change? • How can I efficiently compute the ideal evolution result?
Evolution: Visualization [Figure: diagram with the Real World, a Dataset with versions D0 and D1, and an Evolution Algorithm with changes such as Delete_Class(…), Pull_Up_Class(…), Rename_Class(…), …]
Evolution: Summary • Evolution topics • Understanding the evolution challenges • Understanding the process of change • Balancing between philosophical and practical considerations • Cross-fertilization with belief change • Structure • Introduction, connection with belief change • Understanding the process of change • Literature review
General Structure of this Talk • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review. The final few slides contain citations for the references in this talk. Part I (2 hours), Part II (1 hour).
Datasets • Basic structures • Classes (or concepts): collections of objects (e.g., Actor, Politician) • Properties (or roles): binary relationships between objects (e.g., started_on, member_of) • Instances (or individuals): objects (e.g., Giorgos, B. Obama) • Relations between them • Subsumption (Parliament_Member subclass of Politician), instantiation (B. Obama instance of Politician), … • The allowed relations and their semantics depend on the language • Different representation languages for LOD • RDF/S, OWL
Visualization, Triples, Serialization
• Visualization: [Figure: graph over the classes Period, Event, Actor, Onset, and Birth, the properties participants and started_on, and the individuals Giorgos and G_Birth, with instantiation and subsumption edges and an "Existing Stuff" label]
• Triple Representation
• Define classes: [Period type Class]
• Define properties: [participants type Property] [participants domain Onset] [participants range Actor]
• Instantiate/define individuals: [G_Birth type Birth] [Giorgos type Actor] [G_Birth participants Giorgos]
• Define hierarchies: [Event subClass Period]
• Serialization (RDF/XML)
<rdfs:Class rdf:ID="Period"> </rdfs:Class>
<rdfs:Class rdf:ID="Event"> <rdfs:subClassOf rdf:resource="#Period"/> </rdfs:Class>
<rdf:Property rdf:ID="participants"> <rdfs:domain rdf:resource="#Onset"/> <rdfs:range rdf:resource="#Actor"/> </rdf:Property>
<Birth rdf:about="#G_Birth"> <participants> <Actor rdf:about="#Giorgos"/> </participants> </Birth>
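The triples listed on this slide can be mirrored directly in code. The following is a minimal sketch, assuming that plain Python tuples stand in for RDF triples and that short names stand in for full URIs (both are simplifications made for illustration; the helper `objects` is hypothetical and not part of any RDF library).

```python
# Each triple is (subject, predicate, object); short names replace full URIs.
dataset = {
    ("Period", "type", "Class"),
    ("participants", "type", "Property"),
    ("participants", "domain", "Onset"),
    ("participants", "range", "Actor"),
    ("G_Birth", "type", "Birth"),
    ("Giorgos", "type", "Actor"),
    ("G_Birth", "participants", "Giorgos"),
    ("Event", "subClass", "Period"),
}


def objects(triples, subject, predicate):
    """Return every o such that (subject, predicate, o) is in the dataset."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


participants_of_birth = objects(dataset, "G_Birth", "participants")
```

Treating a dataset as a set of triples is what makes the set-based change detection discussed later in the talk natural to express.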
RDF and RDFS • An RDF dataset consists of triples • RDFS adds semantics • Subsumption hierarchies (classes and properties) • Transitive • Instantiation • Inheritance, implicit instantiation • Sometimes more than subsumption/instantiation is needed • Combining concepts, roles to form more complex relations • Concept definitions: a mother is a female who has a child • Other knowledge: all items stored in warehouse X are flammable • Constraints on data • Each person must have one mother
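The effect of transitive subsumption and inherited (implicit) instantiation can be sketched as a small fixpoint computation. This is a naive illustration covering only two RDFS-style rules, not full RDFS entailment; the predicate names `subClass` and `type`, the class `Person`, and the individual `mp1` are assumptions made for the example.

```python
def rdfs_closure(triples):
    """Naive fixpoint over two rules: subClass transitivity and type inheritance."""
    closed = set(triples)
    while True:
        sub = {(s, o) for (s, p, o) in closed if p == "subClass"}
        new = set()
        # Rule 1: subsumption is transitive.
        for (a, b) in sub:
            for (c, d) in sub:
                if b == c:
                    new.add((a, "subClass", d))
        # Rule 2: an instance of a class is implicitly an instance of its superclasses.
        for (x, p, a) in closed:
            if p == "type":
                for (c, d) in sub:
                    if a == c:
                        new.add((x, "type", d))
        if new <= closed:
            return closed
        closed |= new


explicit = {
    ("Parliament_Member", "subClass", "Politician"),
    ("Politician", "subClass", "Person"),   # hypothetical extra axiom
    ("mp1", "type", "Parliament_Member"),   # hypothetical individual
}
closure = rdfs_closure(explicit)
```

Here the closure contains three triples that were never stated explicitly, including that `mp1` is a `Person`.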
Extensions of RDF/S: DLs (1/2) • Description Logics (DLs) • http://dl.kr.org/ • Formal underpinning of web representation languages • Family of logical formalisms • Well-defined semantics • Model-theoretic reasoning based on interpretations • Formally studied • Expressiveness, reasoning tools, computational complexity, … • Components • Individuals: specific objects (instances) – Giorgos • Concepts: sets of individuals (classes) – Parent • Roles: sets of pairs of individuals (properties) – has_child • Operators: ∃, ⊓, {.}, ⊤, … • Connectives: ⊑, ≡, …
Extensions of RDF/S: DLs (2/2) • Definitions, partial definitions, constraints, subsumptions, … • A mother is a female who has a child • Mother ≡ ∃has_child ⊓ Female • Each person must have one mother • Person ⊑ ∃has_child⁻¹.Mother • A great variety of DLs (trade-off involved) • Different properties • Different expressive power • Different reasoning complexity
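The model-theoretic reading of the two axioms can be checked on a tiny finite interpretation. The sketch below reads the first axiom as Mother ≡ ∃has_child ⊓ Female and the second as Person ⊑ ∃has_child⁻¹.Mother; concepts are Python sets, roles are sets of pairs, and the individuals `anna`, `maria`, and `nikos` are invented for the example.

```python
# A finite interpretation: concepts are sets, roles are sets of pairs.
Female = {"anna", "maria"}
Person = {"maria", "nikos"}
has_child = {("anna", "maria"), ("anna", "nikos")}


def inverse(role):
    """Interpretation of the inverse role."""
    return {(y, x) for (x, y) in role}


def exists(role, filler=None):
    """Individuals with at least one role successor (inside filler, if given)."""
    return {x for (x, y) in role if filler is None or y in filler}


# Mother is defined by the first axiom: it has a child and is female.
Mother = exists(has_child) & Female

# The second axiom is a constraint: every person has a has_child-predecessor in Mother.
constraint_holds = Person <= exists(inverse(has_child), Mother)

# Adding anna to Person violates the constraint in this interpretation,
# because no individual has anna as a child.
constraint_with_anna = (Person | {"anna"}) <= exists(inverse(has_child), Mother)
```

Note that ∃ only expresses "at least one"; stating "exactly one mother" would need a number restriction, which this sketch does not model.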
Extensions of RDF/S: OWL • OWL (Web Ontology Language) • http://www.w3.org/2004/OWL/ • General-purpose representation language • Compatible with the architecture of the Semantic Web • A family of languages • Flavors: OWL Lite, OWL DL, OWL Full • Profiles: OWL 2 EL, OWL 2 QL, OWL 2 RL • Different expressiveness (and complexity) • Most correspond to a specific DL (OWL Full does not) • Useful from a modeling perspective • Expressive but not too complex • Appealing computationally
Representation Languages in LOD • Mostly RDF • With RDFS semantics • Instantiations • Class subsumption • Property subsumption is rare • Some OWL • Mostly OWL Lite • Extensive use of owl:sameAs • Often abusing it [HHM+10] • OWL 2 profiles are gaining ground
Motivation for Remote Change Management [Figure: local dataset DL uses remote dataset DR] • Crucial problem for dynamic linked datasets • Linking: datasets linked to other datasets (e.g., vocabularies) • Dynamics: changes cause problems to linked datasets • No central curation or control • No control over (or knowledge of) other datasets’ evolution process • Curators don’t bother annotating and logging changes • Temporal and versioning information is usually missing [RPH+12] • Remote change management seeks solutions to allow: • Keeping track of versions • Restoring previous versions • Assessing compatibility of versions • Monitoring and detecting changes • Tracing back the evolution history (of datasets, concepts, …) • For visualization and understanding • Propagating changes to synchronize linked datasets
Subfields of Remote Change Management • Remote Change Management • Versioning • Keep track of versions • Change monitoring and detection • Monitoring: record changes as they happen • Detection: identify changes after they happen • Change propagation • Propagate changes across linked datasets for synchronization purposes
Versioning • Versioning • Keep track of versions • Identify different versions of a dataset • Enable transparent access to the “correct” version (smooth interoperation) • Issues involved • Identification • Determine which versions to store and how to identify them • Manually or automatically (syntactical, semantical considerations) • Packaging of changes • Relation between versions • A sequence or a tree • Compatibility information • Backwards/forwards compatibility and how to determine it (often manually) • Dataset-wide compatibility or fine-grained compatibility (e.g., at resource level) • Metadata on the different versions • Transparent access • Relate versions with (compatible) data sources, applications etc
Change Monitoring and Detection [Figure: local dataset DL uses remote dataset DR] • Change monitoring • Record changes as they happen • Manual (error-prone and often incorrect) • Automatic (not used in practice) • Depends on the goodwill of the dataset owner • Sometimes change logs are inaccessible • Change detection • Identify changes after they happen • Based on the previous and current versions • In both cases, a change language is required • Supported set of changes, along with their semantics • Can be low-level or high-level
Change Propagation • Change propagation • Communicate changes to linked datasets for synchronization • Push-based or pull-based propagation • Push-based: locally-initiated, via “registration” or via monitoring and versioning • Pull-based: consumer-initiated • Communication based on deltas (rather than versions) • Reduce communication overhead • Reduce storage requirements • On average, 2-3% of a dataset changes between versions [OK02] • Deltas are based on a language of changes
Versioning Approaches (1/3) • Capture different aspects of versioning, such as: • Detecting versions • Storing versions efficiently • Allowing cross-snapshot queries • Find gene products whose functions have not changed in the last 50 versions • Determine price fluctuation for x along different versions of the product catalog • Early versioning approaches inspired by SVN • Good for files, not directly adaptable to semantical languages • SHOE language [HH00] • Machine-readable version information (e.g., compatibility) • Provided by curator as SHOE statements • Memento [SSN+10] • Fine-grained versioning at URI level (resources, web pages) • Machine-readable version information, in the HTTP header • Timestamps, traversal information (prior/current versions) etc
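A cross-snapshot query such as "find gene products whose functions have not changed in the last N versions" can be sketched over a list of stored versions. This is a minimal illustration, assuming every version is fully materialized as a set of triples (real systems store versions far more compactly); the predicate name `function` and the identifiers `g1`, `g2`, `g3` are hypothetical.

```python
def unchanged_subjects(versions, predicate, last_n):
    """Subjects whose set of `predicate` values is identical in each of the last_n versions.

    Assumes `versions` is a non-empty list ordered from oldest to newest.
    """
    def values(triples):
        result = {}
        for s, p, o in triples:
            if p == predicate:
                result.setdefault(s, set()).add(o)
        return result

    maps = [values(v) for v in versions[-last_n:]]
    first = maps[0]
    return {s for s, vals in first.items() if all(m.get(s) == vals for m in maps[1:])}


v1 = {("g1", "function", "f1"), ("g2", "function", "f2")}
v2 = {("g1", "function", "f1"), ("g2", "function", "f3")}
v3 = v2 | {("g3", "function", "f4")}
stable = unchanged_subjects([v1, v2, v3], "function", 3)
```

In this toy history only `g1` keeps the same function across all three versions.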
Versioning Approaches (2/3) • Theoretical foundations for versioning [HP04] • Formal definitions to capture notions such as: • Compatibility (between versions) • Commitment (resources committing to a certain ontology) • Ontology perspectives (the part of the web committing to an ontology) • Temporal approaches [HS05, PTC05, KLGE07] • For capturing temporal relations between versions • For allowing cross-snapshot queries • Versioning in multi-editor environments [RSDT08] • Via change monitoring
Versioning Approaches (3/3) • Automatically detecting version relationships [AAM09] • Using heuristics based on URIs • Study of “relatedness” between versions [CQ13] • A model of “relatedness” between vocabularies from various sources • Similar to links in web pages • POI: Partial Order Index [TTA08] • Efficient method for storing versions and their differences • Stores several versions, exploiting their common triples for efficient storage
Change Languages (1/2) • Change languages necessary for monitoring, detection, propagation • Granularity • Low-level (or atomic, or elementary) • Simple add/remove operations • Add(s,p,o), Delete(s,p,o) • Simple to detect and define • Focus on machine-readability: determinism, well-defined semantics • High-level (or complex, or composite) • More coarse-grained, compact, closer to editor’s perception and intuition • Generalize_Domain(P,A), Delete_Class(A) • More interesting; harder to detect and define • Focus on human-understandability: often unclear and/or informal semantics
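Low-level changes of the form Add(s,p,o) and Delete(s,p,o) can be computed by plain set difference between two versions. The sketch below assumes versions are sets of triples and encodes each change as a tuple tagged with its operation name; the function names are hypothetical.

```python
def low_level_delta(old, new):
    """Return the set of Add/Delete changes that turns `old` into `new`."""
    added = {("Add",) + t for t in new - old}
    deleted = {("Delete",) + t for t in old - new}
    return added | deleted


def apply_delta(version, delta):
    """Apply Add/Delete changes to a version, returning the resulting version."""
    result = set(version)
    for op, s, p, o in delta:
        if op == "Add":
            result.add((s, p, o))
        else:
            result.discard((s, p, o))
    return result


old = {("Giorgos", "type", "Actor"), ("Event", "subClass", "Period")}
new = {("Giorgos", "type", "Person"), ("Event", "subClass", "Period")}
delta = low_level_delta(old, new)
```

Because added and deleted triples are disjoint by construction, the order in which the changes are applied does not matter, which is one reason low-level deltas are deterministic and easy to define.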
Change Languages (2/2) • Many different high-level languages (no standard) • [HGR12, JAP09, PFF+13, SK03, AH06, DA09, PTC07, …] • Some are domain-specific (e.g., [HGR12]) • Some are dynamic (e.g., [AH06, DA09, PTC07]) • Allow custom, user-defined changes • Some allow terminological changes (e.g., [PFF+13]) • Rename, merge, split • Common, but tough to detect (easily confused with add/delete)
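To see why a rename is easily confused with add/delete, consider a naive heuristic that pairs a vanished name with a new name when both occur in otherwise identical triples. This is only an illustrative sketch and not the detection method of [PFF+13] or any other cited work; the function name and the example data are assumptions.

```python
def detect_renames(old, new):
    """Pair names that vanish from `old` with new names that occupy the same positions."""
    def names(triples):
        return {term for t in triples for term in t}

    def signature(triples, name):
        # Triples mentioning `name`, with the name itself replaced by a placeholder.
        return frozenset(
            tuple("?" if term == name else term for term in t)
            for t in triples
            if name in t
        )

    gone = names(old) - names(new)
    fresh = names(new) - names(old)
    return {(a, b) for a in gone for b in fresh if signature(old, a) == signature(new, b)}


old = {("Actor", "type", "Class"), ("Giorgos", "type", "Actor")}
new = {("Person", "type", "Class"), ("Giorgos", "type", "Person")}
renames = detect_renames(old, new)
```

At the low level the same evolution appears as two deletions and two additions; the heuristic breaks down as soon as the renamed resource is also modified, or when several renamed resources refer to each other.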
Representation Issues • Deltas are just sets of changes from the change language • Changes usually represented using a change ontology • Ontology represents changes • A specific change is an instance of such an ontology • Deltas associated with sets of such instances • Different proposals [NCLM06, KFKO02, KN03, PT05] • Allows the manipulation and communication of deltas/changes using standard Semantic Web technologies
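Representing a change as an instance of a change ontology means that the change itself becomes a handful of triples. The sketch below uses an invented vocabulary (`type`, `subject`, `predicate`, `object`, `partOf`) purely for illustration; it does not reproduce any of the cited change ontologies.

```python
def describe_change(change_id, op, triple):
    """Describe one change as triples, i.e., as an instance of the class named by `op`."""
    s, p, o = triple
    return {
        (change_id, "type", op),
        (change_id, "subject", s),
        (change_id, "predicate", p),
        (change_id, "object", o),
    }


def describe_delta(delta_id, changes):
    """A delta is associated with the set of change instances it contains."""
    triples = set()
    for i, (op, triple) in enumerate(changes, start=1):
        change_id = f"{delta_id}_change{i}"
        triples |= describe_change(change_id, op, triple)
        triples.add((change_id, "partOf", delta_id))
    return triples


delta_triples = describe_delta("delta1", [("Add", ("Giorgos", "type", "Actor"))])
```

Since the result is again a set of triples, a delta described this way can be stored, queried, and exchanged like any other dataset.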
Change Monitoring Approaches • Using a version log [PT05] • Logging actions on the dataset • Use it for change detection, as well as proper versioning • Good quality, high-level change monitoring • Based on a dynamic language of changes • Using migration specifications [ZZL+03] • Similar to logs, but with a more formal structure • DBPedia change monitoring [MLA+12] • http://live.dbpedia.org/ • Live versions, as opposed to “standard” versions
Low-Level Change Detection (1/2) • SemVersion [VWS+05] • Developed in Karlsruhe (FZI, AIFB) • Low-level change detection tool for RDF • Also provides versioning functionalities • Allows cross-snapshot queries • For RDF [ILK12] • Low-level change detection based on set difference • Aggregating and compressing deltas • Also dealing with versioning issues • For RDF/S [ZTC11] • Takes into account semantics (RDFS inference) • Four different methods to compute deltas (all based on set difference) • Formal analysis of these methods’ properties and semantics • Extension: effect of blank nodes on change detection [TLZ12]
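Why inference matters for set-difference deltas can be seen on a small example: a triple that is removed explicitly may still be entailed. The sketch below contrasts a delta over explicit triples with a delta over closures; it considers only subClass transitivity, is not meant to reproduce any of the four methods of [ZTC11], and uses a hypothetical class hierarchy.

```python
def subclass_closure(triples):
    """Close a set of triples under transitivity of the subClass predicate only."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                if p1 == p2 == "subClass" and b == c and (a, "subClass", d) not in closed:
                    closed.add((a, "subClass", d))
                    changed = True
    return closed


# Hypothetical hierarchy: version 2 drops a redundant, still-entailed triple.
v1 = {
    ("Birth", "subClass", "Event"),
    ("Event", "subClass", "Period"),
    ("Birth", "subClass", "Period"),
}
v2 = {("Birth", "subClass", "Event"), ("Event", "subClass", "Period")}

explicit_deleted = v1 - v2
semantic_deleted = subclass_closure(v1) - subclass_closure(v2)
```

The explicit delta reports one deletion, while the closure-based delta is empty because the two versions entail the same triples.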
Low-Level Change Detection (2/2) • Bubastis (http://www.ebi.ac.uk/fgpt/sw/bubastis/index.html) • Simple diff tool (triple-based comparison) • Basically RDF, but also supports OWL • For DL-Lite [KWZ08] • Formal, semantical approach • For EL [KWW08] • Uses a concept-based description of changes • For propositional knowledge bases [FMV10] • Propositional, but generic; it can be applied to DLs • Formal analysis of the problem • Also dealing with propagation semantics
High-Level Change Detection (1/2) • For OWL: PromptDiff [NKKM04], OntoView [KFKO02] • Employ heuristics and probabilistic methods • Evaluation using precision/recall metrics against a gold standard • Integrated into tools that also provide versioning functionalities • For RDF/S [PFF+13] • Dealing with both machine-readability and human-understandability • Also dealing with propagation (applying changes) • To be discussed in detail later • COnto-Diff [HGR12] • Rule-based approach • Also dealing with propagation
Change Propagation Approaches • Usually part of other tools [SMMS02, MMS+03] • Versioning, monitoring tools (push-based propagation) • Detection tools (pull-based propagation) • Evolution and repair tools (pull-based propagation) • Adapt your data to be “compatible” with the new remote version • SparqlPush [PM10] • Push-based propagation of changes on SPARQL “views” • PRISM, PRISM++ [CMZ08, CMDZ10] • High-level language of schema changes for relational data • Also supports changes on the integrity constraints • Identifies and propagates the changes required in the data to abide by the new schema • Query and update rewriting • For applications that try to access the old schema
Other Change Management Approaches • Complete approach for XML [SP10] • Representing changes inline with the data using a graph (“evograph”) • Supports different change representation languages (both low-level and high-level) • Timestamps changes • Monitoring: evograph can be used to log the changes • Propagation: changes can be accessed and propagated • Versioning: timestamps in changes can be used to generate snapshots (versions) at different times • Allows cross-snapshot queries • Fairly generic, can be adapted for RDF
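The idea that timestamped changes can be replayed to generate a snapshot at any point in time can be sketched with a simple change log. This is not the evograph structure of [SP10], which targets XML and stores changes inline with the data; the log format (timestamp, operation, triple) is an assumption made for illustration.

```python
def snapshot_at(log, t):
    """Replay all logged changes with timestamp <= t, in timestamp order."""
    state = set()
    for ts, op, triple in sorted(log, key=lambda entry: entry[0]):
        if ts > t:
            break
        if op == "add":
            state.add(triple)
        elif op == "delete":
            state.discard(triple)
    return state


t1 = ("Giorgos", "type", "Actor")
t2 = ("G_Birth", "participants", "Giorgos")
log = [(1, "add", t1), (2, "add", t2), (3, "delete", t1)]
at_two = snapshot_at(log, 2)
at_three = snapshot_at(log, 3)
```

Each snapshot is derived on demand from the same log, so versions never need to be stored separately, and a cross-snapshot query can compare the results for different values of `t`.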