
Data and Knowledge Evolution

Data and Knowledge Evolution. Giorgos Flouris fgeo@ics.forth.gr. Open Data Tutorials, May 2013. Slides available at: http://www.ics.forth.gr/~fgeo/Publications/WOD13.ppt


Presentation Transcript


  1. Data and Knowledge Evolution Giorgos Flouris fgeo@ics.forth.gr Open Data Tutorials, May 2013 Slides available at: http://www.ics.forth.gr/~fgeo/Publications/WOD13.ppt

  2. World Wide Web • WWW (and HTML) focus on human readability • Page presentation (fonts, colors, images, …) • Human understanding • Presentation ≠ Semantical content • Content is not formally described (for a machine to understand) • WWW contains documents, not data

  3. Problems with the Current Web • Search and access become difficult • Software ignorant of the semantical content of a web page • Keyword search • High recall, low precision • Terminological issues • Synonyms (heart disease = cardiac disease) • Hyponyms/hypernyms (parliament members are politicians) • Queries on the semantical content cannot be made • Fetch articles that support B. Obama’s foreign policy • Fetch the home pages of all members of the Greek Parliament

  4. Semantic Web • The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation [BLHL01] • The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries http://www.w3.org/2001/sw/ • [Semantic Web] is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners http://www.w3.org/2001/sw/

  5. Semantic Web in Practice • Web of data, rather than documents • HTML for presentation • Semantical languages for semantical content • Readable and understandable by humans and machines • Semantic Web languages, protocols, etc • Web page annotation (metadata descriptions etc) • Publication of data on the Internet • Efficient communication and manipulation of data over the Internet • Different applications • Efficient searching • Sharing of data (e-science, e-government, remote learning, …) • Linked Open Data (more on that later)

  6. Ontologies and Data (Datasets) • An ontology is an explicit specification of a shared conceptualization of a domain [Gru93] • Precise, logical account of the intended meaning of terms • Common (shared) interpretation of terms • Formal vocabulary for information exchange (humans/machines) • Ontologies (vocabularies) allow the description of data • Terminology: • Ontology = vocabulary = schema • Data = instances • Dataset = data and the related ontology (i.e., a dataset may contain schema and/or data)

  7. Dataset Dynamics • Datasets change constantly • World changes (dynamic models) • View on the world changes (new knowledge, measurements, etc) • Perspective and usage changes • Example: • Gene Ontology (information about gene products): daily versions • DBPedia: 1.4 updates/second (http://live.dbpedia.org/LiveStats/) [MLA+12] • Need methodologies to cope with the problems related to dynamicity • Evolution (modify a dataset in response to a change) • Versioning (keep track of versions and their relations) • Debugging, cleaning, repairing, quality (maintain consistency and quality in a dynamic environment) • Change monitoring, detection and propagation (identify changes and use them to synchronize remote datasets) • …

  8. Linked (Open) Data • Datasets can be interlinked • Sharing knowledge • Reusing knowledge • Modular development • Reuse of schemas • Linked Open Data (LOD) movement • Constantly growing • 31 billion triples and 295 datasets as of September 2011

  9. Linked Open Data Cloud Diagram [Figure: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]

  10. Linked Open Data Challenges • Both a blessing and a curse • Added-value benefits • Discovery of unknown correlations, connections, relationships • Vast amount of interrelated knowledge • No central control, everyone can publish and relate to others • Quality of datasets depends on different providers • A change in one dataset affects all related ones • Several new problems related to dynamics • Propagation of changes among interrelated datasets • Maintaining the quality of local datasets • Co-evolution

  11. Scope: Dynamic Linked Datasets [Figure: "Dynamic Datasets" and "Linked Datasets", with a "You are here" marker indicating the scope of the talk]

  12. Purpose of This Talk • To survey different research areas related to dynamic LOD • Remote Change Management • Repair • Data and Knowledge Evolution • Categorize and classify works in each field • Broad but shallow description • Several references for more in-depth study • No claims of completeness (references are just indicative) • Two relevant surveys: [FMK+08, ZAA+13] • Emphasis on some related work done in FORTH • Will avoid technical discussion • References will be given for further details

  13. Defining Remote Change Management • Managing the effects of remote changes on interlinked datasets • Remote changes have profound effects on local datasets • Good practices are important • Proper versioning, change logging, adaptation to remote changes, … • Attention exploded after the success of the LOD paradigm • Related research questions • How should I version my data? • How can I efficiently monitor changes in my dataset? • How can I detect changes in remote datasets? • How does the evolution of remote datasets affect my data? • How can I efficiently propagate changes from one dataset to another?

  14. Remote Change Management: Visualization [Figure labels: Remote Site (RD0, RD1); Local Site (LD0, LD1); Versioning, Change Monitoring; Change Detection; Change Propagation]

  15. Remote Change Management: Structure • Three subfields • Versioning • Change monitoring and detection • Change propagation • Structure • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13]

  16. Defining Repair • Assessing and improving the quality and the semantical or structural integrity of the data • Maintaining consistency, coherency, validity • Restoring consistency, coherency, validity, when violated • Assessing and improving quality • Preserve quality/integrity in the face of remote changes • Related research questions • How can I preserve the integrity and quality of my data in a dynamic and interlinked environment? • How can I guarantee consistency and validity? • How can I restore consistency and validity, if violated?

  17. Repair: Visualization [Figure labels: D0, D1; Repair Process (Cleaning, Debugging, Repairing, Quality Enhancement); Assessment Module (Diagnosis, Quality Assessment)]

  18. Repair: Structure • Four subfields • Cleaning • Debugging • Validity repair • Quality enhancement • Structure • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11]

  19. Defining Evolution • Modifying a dataset in response to a change in the domain or its conceptualization • Identify the result of applying new information on the dataset • Determine the result of change propagation from remote datasets • Understand the process of change • Related research questions • What is the semantics of evolution and change? • How can I efficiently compute the ideal evolution result?

  20. Evolution: Visualization [Figure labels: Real World; Dataset; D0, D1; Evolution Algorithm; Delete_Class(…), Pull_Up_Class(…), Rename_Class(…), …]

  21. Evolution: Summary • Evolution topics • Understanding the evolution challenges • Understanding the process of change • Balancing between philosophical and practical considerations • Cross-fertilization with belief change • Structure • Introduction, connection with belief change • Understanding the process of change • Literature review

  22. General Structure of this Talk • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review • Note: the final few slides contain citations for the references in this talk • Part I (2 hours), Part II (1 hour)

  23. Talk Structure (A) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review

  24. Datasets • Basic structures • Classes (or concepts): collections of objects (e.g., Actor, Politician) • Properties (or roles): binary relationships between objects (e.g., started_on, member_of) • Instances (or individuals): objects (e.g., Giorgos, B. Obama) • Relations between them • Subsumption (Parliament_Member subclass of Politician), instantiation (B. Obama instance of Politician), … • The allowed relations and their semantics depend on the language • Different representation languages for LOD • RDF/S, OWL

  25. Visualization, Triples, Serialization
      Visualization [Figure: graph with edge types "instantiation" and "subsumption"; labels: Period, Event, Actor, Onset, Birth, Existing Stuff, participants, started_on, Giorgos, G_Birth]
      Triple Representation
      • Define classes: [Period type Class]
      • Define properties: [participants type Property] [participants domain Onset] [participants range Actor]
      • Instantiate/define individuals: [G_Birth type Birth] [Giorgos type Actor] [G_Birth participants Giorgos]
      • Define hierarchies: [Event subClass Period]
      Serialization (RDF/XML)
      <rdfs:Class rdf:ID="Period"/>
      <rdf:Property rdf:ID="participants">
        <rdfs:domain rdf:resource="#Onset"/>
        <rdfs:range rdf:resource="#Actor"/>
      </rdf:Property>
      <Birth rdf:about="#G_Birth">
        <participants>
          <Actor rdf:about="#Giorgos"/>
        </participants>
      </Birth>
      <rdfs:Class rdf:ID="Event">
        <rdfs:subClassOf rdf:resource="#Period"/>
      </rdfs:Class>

  26. RDF and RDFS • An RDF dataset consists of triples • RDFS adds semantics • Subsumption hierarchies (classes and properties) • Transitive • Instantiation • Inheritance, implicit instantiation • Sometimes more than subsumption/instantiation is needed • Combining concepts, roles to form more complex relations • Concept definitions: a mother is a female who has a child • Other knowledge: all items stored in warehouse X are flammable • Constraints on data • Each person must have one mother
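
The RDFS semantics listed on this slide (transitive subsumption, inheritance, implicit instantiation) can be sketched as a small fixpoint computation over a set of triples. This is a minimal illustration, not a full RDFS reasoner: the predicate names `type` and `subClass` follow the informal triple notation of slide 25, while the function name `rdfs_closure` and the triple `Birth subClass Event` are assumptions made for the example.

```python
def rdfs_closure(triples):
    """Return the input triples plus those implied by two RDFS-style rules."""
    closure = set(triples)
    changed = True
    while changed:
        new = set()
        for (s, p, o) in closure:
            if p != "subClass":
                continue
            for (s2, p2, o2) in closure:
                # Transitivity: s subClass o, o subClass o2 => s subClass o2
                if p2 == "subClass" and s2 == o:
                    new.add((s, "subClass", o2))
                # Implicit instantiation: s2 type s, s subClass o => s2 type o
                if p2 == "type" and o2 == s:
                    new.add((s2, "type", o))
        changed = not new <= closure
        closure |= new
    return closure


explicit = {
    ("Event", "subClass", "Period"),
    ("Birth", "subClass", "Event"),   # assumed for illustration only
    ("G_Birth", "type", "Birth"),
}
inferred = rdfs_closure(explicit) - explicit
```

Under these assumptions, `inferred` holds the implicit triples: `Birth` is a subclass of `Period`, and `G_Birth` is also an instance of `Event` and of `Period`.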

  27. Extensions of RDF/S: DLs (1/2) • Description Logics (DLs) • http://dl.kr.org/ • Formal underpinning of web representation languages • Family of logical formalisms • Well-defined semantics • Model-theoretic reasoning based on interpretations • Formally studied • Expressiveness, reasoning tools, computational complexity, … • Components • Individuals: specific objects (instances) – Giorgos • Concepts: sets of individuals (classes) – Parent • Roles: sets of pairs of individuals (properties) – has_child • Operators: ⊓, {.}, ⊤, … • Connectives: ⊑, ≡, …

  28. Extensions of RDF/S: DLs (2/2) • Definitions, partial definitions, constraints, subsumptions, … • A mother is a female who has a child • Mother ≡ ∃has_child ⊓ Female • Each person must have one mother • Person ⊑ ∃has_child⁻¹.Mother • A great variety of DLs (trade-off involved) • Different properties • Different expressive power • Different reasoning complexity
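
The model-theoretic reading of these axioms can be illustrated with plain sets: under an interpretation, a concept is a set of individuals and a role is a set of pairs. The sketch below evaluates "a mother is a female who has a child" and checks "each person must have a mother" against one toy interpretation. The individuals and the helper names (`exists`, `exists_in`, `inverse`) are hypothetical and chosen only for this example.

```python
# A toy interpretation: concepts are sets of individuals,
# roles are sets of (individual, individual) pairs.
female = {"Maria", "Eleni"}                               # hypothetical data
has_child = {("Maria", "Giorgos"), ("Nikos", "Giorgos")}  # hypothetical data
person = {"Giorgos"}


def exists(role):
    """Individuals that have at least one successor via the role."""
    return {x for (x, _) in role}


def exists_in(role, concept):
    """Individuals that have a role successor belonging to the concept."""
    return {x for (x, y) in role if y in concept}


def inverse(role):
    """The inverse role: every pair is flipped."""
    return {(y, x) for (x, y) in role}


# Definition: a mother is a female who has a child (conjunction = intersection).
mother = exists(has_child) & female

# Constraint: every person has a mother (subsumption = subset check).
constraint_holds = person <= exists_in(inverse(has_child), mother)
```

Here `mother` evaluates to the set containing only `Maria`, and the constraint holds because `Giorgos` is a child of an individual in `mother`. A DL reasoner works over all interpretations rather than one, so this only shows what the operators mean, not how reasoning is done.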

  29. Extensions of RDF/S: OWL • OWL (Web Ontology Language) • http://www.w3.org/2004/OWL/ • General-purpose representation language • Compatible with the architecture of the Semantic Web • A family of languages • Flavors: OWL-Lite, OWL-DL, OWL Full • Profiles: OWL 2 EL, OWL 2 QL, OWL 2 RL • Different expressiveness (and complexity) • Each corresponds to a specific DL • Useful from a modeling perspective • Expressive but not too complex • Appealing computationally

  30. Representation Languages in LOD • Mostly RDF • With RDFS semantics • Instantiations • Class subsumption • Property subsumption is rare • Some OWL • Mostly OWL Lite • Extensive use of owl:sameAs • Often abusing it [HHM+10] • OWL 2 profiles are gaining ground

  31. Talk Structure (B1) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review

  32. Motivation for Remote Change Management [Figure: datasets DR and DL connected by a "uses" link] • Crucial problem for dynamic linked datasets • Linking: datasets linked to other datasets (e.g., vocabularies) • Dynamics: changes cause problems to linked datasets • No central curation or control • No control over (or knowledge of) other datasets’ evolution process • Curators don’t bother annotating and logging changes • Temporal and versioning information is usually missing [RPH+12] • Remote change management seeks solutions to allow: • Keeping track of versions • Restoring previous versions • Assessing compatibility of versions • Monitoring and detecting changes • Tracing back the evolution history (of datasets, concepts, …) • For visualization and understanding • Propagating changes to synchronize linked datasets

  33. Subfields of Remote Change Management • Remote Change Management • Versioning • Keep track of versions • Change monitoring and detection • Monitoring: record changes as they happen • Detection: identify changes after they happen • Change propagation • Propagate changes across linked datasets for synchronization purposes

  34. Versioning • Versioning • Keep track of versions • Identify different versions of a dataset • Enable transparent access to the “correct” version (smooth interoperation) • Issues involved • Identification • Determine which versions to store and how to identify them • Manually or automatically (syntactical, semantical considerations) • Packaging of changes • Relation between versions • A sequence or a tree • Compatibility information • Backwards/forwards compatibility and how to determine it (often manually) • Dataset-wide compatibility or fine-grained compatibility (e.g., at resource level) • Metadata on the different versions • Transparent access • Relate versions with (compatible) data sources, applications etc

  35. Change Monitoring and Detection [Figure: datasets DR and DL connected by a "uses" link] • Change monitoring • Record changes as they happen • Manual (error-prone and often incorrect) • Automatic (not used in practice) • Depends on the good will of the dataset owner • Sometimes change logs are inaccessible • Change detection • Identify changes after they happen • Based on the previous and current versions • In both cases, a change language is required • Supported set of changes, along with their semantics • Can be low-level or high-level

  36. Change Propagation • Change propagation • Communicate changes to linked datasets for synchronization • Push-based or pull-based propagation • Push-based: locally-initiated, via “registration” or via monitoring and versioning • Pull-based: consumer-initiated • Communication based on deltas (rather than versions) • Reduce communication overhead • Reduce storage requirements • On average, 2-3% of a dataset changes between versions [OK02] • Deltas are based on a language of changes
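
Delta-based propagation can be sketched as follows: instead of shipping a whole new version, the provider sends only the triples that were added and deleted, and the consumer applies them to its local copy. The `Delta` class, its two fields, and the sample triples are assumptions for illustration; real change languages may carry richer, high-level operations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Delta:
    """A low-level delta: triples to add and triples to delete (assumed shape)."""
    added: frozenset = field(default_factory=frozenset)
    deleted: frozenset = field(default_factory=frozenset)


def apply_delta(dataset, delta):
    """Return the dataset after removing deleted triples and adding new ones."""
    return (set(dataset) - delta.deleted) | delta.added


local_copy = {
    ("Giorgos", "type", "Actor"),
    ("Event", "subClass", "Period"),
}
delta = Delta(
    added=frozenset({("G_Birth", "type", "Birth")}),
    deleted=frozenset({("Giorgos", "type", "Actor")}),
)
synced = apply_delta(local_copy, delta)
```

Whether the delta is pushed by the provider or pulled by the consumer does not change this step; only the party that initiates the transfer differs.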

  37. Talk Structure (B2) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review

  38. Versioning Approaches (1/3) • Capture different aspects of versioning, such as: • Detecting versions • Storing versions efficiently • Allowing cross-snapshot queries • Find gene products whose functions have not changed in the last 50 versions • Determine price fluctuation for x along different versions of the product catalog • Early versioning approaches inspired by SVN • Good for files, not directly adaptable to semantical languages • SHOE language [HH00] • Machine-readable version information (e.g., compatibility) • Provided by curator as SHOE statements • Memento [SSN+10] • Fine-grained versioning at URI level (resources, web pages) • Machine-readable version information, in the HTTP header • Timestamps, traversal information (prior/current versions) etc
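
A cross-snapshot query such as "find gene products whose functions have not changed in the last n versions" can be sketched over an in-memory list of snapshots. This is a naive illustration that assumes every version is fully materialized as a dictionary from product to its set of functions; the function name `unchanged` and the sample data are hypothetical, and the cited systems use far more efficient storage.

```python
def unchanged(snapshots, n):
    """Keys whose value is identical in each of the last n snapshots."""
    if n < 1:
        raise ValueError("n must be at least 1")
    recent = snapshots[-n:]
    if not recent:
        return set()
    first, rest = recent[0], recent[1:]
    return {key for key, value in first.items()
            if all(snap.get(key) == value for snap in rest)}


# Hypothetical versions: gene product -> set of annotated functions.
versions = [
    {"p1": frozenset({"binding"}), "p2": frozenset({"transport"})},
    {"p1": frozenset({"binding"}), "p2": frozenset({"transport", "catalysis"})},
    {"p1": frozenset({"binding"}), "p2": frozenset({"transport", "catalysis"})},
]
stable_last_3 = unchanged(versions, 3)
stable_last_2 = unchanged(versions, 2)
```

In this made-up data, only `p1` is stable across all three versions, while both products are stable across the last two.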

  39. Versioning Approaches (2/3) • Theoretical foundations for versioning [HP04] • Formal definitions to capture notions such as: • Compatibility (between versions) • Commitment (resources committing to a certain ontology) • Ontology perspectives (the part of the web committing to an ontology) • Temporal approaches [HS05, PTC05, KLGE07] • For capturing temporal relations between versions • For allowing cross-snapshot queries • Versioning in multi-editor environments [RSDT08] • Via change monitoring

  40. Versioning Approaches (3/3) • Automatically detecting version relationships [AAM09] • Using heuristics based on URIs • Study of “relatedness” between versions [CQ13] • A model of “relatedness” between vocabularies from various sources • Similar to links in web pages • POI: Partial Order Index [TTA08] • Efficient method for storing versions and their differences • Stores several versions, exploiting their common triples for efficient storage

  41. Change Languages (1/2) • Change languages necessary for monitoring, detection, propagation • Granularity • Low-level (or atomic, or elementary) • Simple add/remove operations • Add(s,p,o), Delete(s,p,o) • Simple to detect and define • Focus on machine-readability: determinism, well-defined semantics • High-level (or complex, or composite) • More coarse-grained, compact, closer to editor’s perception and intuition • Generalize_Domain(P,A), Delete_Class(A) • More interesting; harder to detect and define • Focus on human-understandability: often unclear and/or informal semantics
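
The two granularities can be contrasted in a few lines: low-level changes are single triple additions or deletions, while a high-level change stands for a whole group of them. In the sketch below, the expansion of `Delete_Class(A)` into the deletion of every triple that mentions `A` as subject or object is an assumed semantics chosen for illustration; the cited change languages each define their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Add:
    """Low-level change: add the triple (s, p, o)."""
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class Delete:
    """Low-level change: delete the triple (s, p, o)."""
    s: str
    p: str
    o: str


def delete_class(dataset, cls):
    """Expand the high-level change Delete_Class(cls) into low-level deletions.

    Assumed semantics: remove every triple with cls as subject or object.
    """
    return [Delete(s, p, o) for (s, p, o) in sorted(dataset) if cls in (s, o)]


dataset = {
    ("Event", "subClass", "Period"),
    ("Event", "type", "Class"),
    ("Period", "type", "Class"),
}
low_level = delete_class(dataset, "Event")
```

One high-level operation is reported to the editor, while the two resulting `Delete` records are what a machine would actually apply.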

  42. Change Languages (2/2) • Many different high-level languages (no standard) • [HGR12, JAP09, PFF+13, SK03, AH06, DA09, PTC07, …] • Some are domain-specific (e.g., [HGR12]) • Some are dynamic (e.g., [AH06, DA09, PTC07]) • Allow custom, user-defined changes • Some allow terminological changes (e.g., [PFF+13]) • Rename, merge, split • Common, but tough to detect (easily confused with add/delete)

  43. Representation Issues • Deltas are just sets of changes from the change language • Changes usually represented using a change ontology • Ontology represents changes • A specific change is an instance of such an ontology • Deltas associated with sets of such instances • Different proposals [NCLM06, KFKO02, KN03, PT05] • Allows the manipulation and communication of deltas/changes using standard Semantic Web technologies

  44. Change Monitoring Approaches • Using a version log [PT05] • Logging actions on the dataset • Use it for change detection, as well as proper versioning • Good quality, high-level change monitoring • Based on a dynamic language of changes • Using migration specifications [ZZL+03] • Similar to logs, but with a more formal structure • DBPedia change monitoring [MLA+12] • http://live.dbpedia.org/ • Live versions, as opposed to “standard” versions

  45. Low-Level Change Detection (1/2) • SemVersion [VWS+05] • Developed in Karlsruhe (FZI, AIFB) • Low-level change detection tool for RDF • Provides also versioning functionalities • Allows cross-snapshot queries • For RDF [ILK12] • Low-level change detection based on set difference • Aggregating and compressing deltas • Also dealing with versioning issues • For RDF/S [ZTC11] • Takes into account semantics (RDFS inference) • Four different methods to compute deltas (all based on set difference) • Formal analysis of these methods’ properties and semantics • Extension: effect of blank nodes on change detection [TLZ12]
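
Low-level change detection by set difference amounts to comparing two versions as sets of triples. The sketch below shows only the plain, syntactic variant; it does not take RDFS inference into account and is not one of the specific methods analyzed in the cited works. The function name and the sample versions are assumptions.

```python
def low_level_delta(old, new):
    """Return (added, deleted) triples between two versions of a dataset."""
    added = new - old
    deleted = old - new
    return added, deleted


v1 = {
    ("Actor", "type", "Class"),
    ("Giorgos", "type", "Actor"),
}
v2 = {
    ("Person", "type", "Class"),
    ("Giorgos", "type", "Person"),
}
added, deleted = low_level_delta(v1, v2)
```

In this made-up example the class `Actor` was renamed to `Person`, yet the low-level delta reports only two deletions and two additions, which is why renames are hard to tell apart from unrelated add/delete pairs.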

  46. Low-Level Change Detection (2/2) • Bubastis (http://www.ebi.ac.uk/fgpt/sw/bubastis/index.html) • Simple diff tool (triple-based comparison) • Basically RDF, but also supports OWL • For DL-Lite [KWZ08] • Formal, semantical approach • For EL [KWW08] • Uses a concept-based description of changes • For propositional knowledge bases [FMV10] • Propositional, but generic; it can be applied to DLs • Formal analysis of the problem • Also dealing with propagation semantics

  47. High-Level Change Detection (1/2) • For OWL: PromptDiff [NKKM04], OntoView [KFKO02] • Employ heuristics and probabilistic methods • Evaluation using precision/recall metrics against a gold standard • Integrated into tools that also provide versioning functionalities • For RDF/S [PFF+13] • Dealing with both machine-readability and human-understandability • Also dealing with propagation (applying changes) • To be discussed in detail later • COnto-Diff [HGR12] • Rule-based approach • Also dealing with propagation

  48. Change Propagation Approaches • Usually part of other tools [SMMS02, MMS+03] • Versioning, monitoring tools (push-based propagation) • Detection tools (pull-based propagation) • Evolution and repair tools (pull-based propagation) • Adapt your data to be “compatible” with the new remote version • SparqlPush [PM10] • Push-based propagation of changes on SPARQL “views” • PRISM, PRISM++ [CMZ08, CMDZ10] • High-level language of schema changes for relational data • Also supports changes on the integrity constraints • Identifies and propagates the changes required in the data for abiding to the new schema • Query and update rewriting • For applications that try to access the old schema

  49. Other Change Management Approaches • Complete approach for XML [SP10] • Representing changes inline with the data using a graph (“evograph”) • Supports different change representation languages (both low-level and high-level) • Timestamps changes • Monitoring: evograph can be used to log the changes • Propagation: changes can be accessed and propagated • Versioning: timestamps in changes can be used to generate snapshots (versions) at different times • Allows cross-snapshot queries • Fairly generic, can be adapted for RDF
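
The idea of generating snapshots from timestamped changes can be sketched by replaying a change log up to a given time. The flat list of `(timestamp, operation, triple)` entries used here is an assumption for illustration and is much simpler than the graph-based structure described on this slide.

```python
def snapshot_at(log, t):
    """Rebuild the dataset as it was at time t by replaying logged changes."""
    state = set()
    for timestamp, op, triple in sorted(log, key=lambda entry: entry[0]):
        if timestamp > t:
            break
        if op == "add":
            state.add(triple)
        elif op == "delete":
            state.discard(triple)
    return state


A = ("Giorgos", "type", "Actor")
B = ("G_Birth", "type", "Birth")
log = [
    (1, "add", A),
    (2, "add", B),
    (3, "delete", A),
]
at_time_2 = snapshot_at(log, 2)
at_time_3 = snapshot_at(log, 3)
```

Because every version is derived from the same log, a cross-snapshot query can be answered by rebuilding the relevant snapshots on demand instead of storing each one.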

  50. Talk Structure (B3) • Introduction to RDF/S, DLs, OWL • Remote change management • Introduction, definition of subfields • Literature review • An approach for change detection [PFF+13] • Repair • Introduction, definition of subfields • Literature review • An approach for validity repair [RFC11] • Data and Knowledge Evolution • Introduction, connection with belief change • Understanding the process of change • Literature review
