
Heterogeneous Information Management

Prepared for CERN seminar, June 2000. Heterogeneous Information Management. Gio Wiederhold, Stanford University. Abstract: Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received.


Presentation Transcript


  1. prepared for CERN seminar, June 2000 Heterogeneous Information Management June 2000 Gio Wiederhold Stanford University Gio - CERN

  2. Abstract Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The autonomy of the suppliers causes heterogeneity and inconsistencies. The number of potential suppliers and their autonomy also create information overload. To cope with these issues, novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change. We will present the concepts and status of such services. Collaboration, security, and payment schemes are some of the considerations.

  3. Outline • Background for Mediated Systems • Motivation and Functions needed • Architecture • Current Status • Resolving Semantic Heterogeneity • Research Directions • Background • Maintenance • Research Projects • Integration of Simulation Information Gio - CERN

  4. Evolution of mediation [Figure: a network linking applications A1 to A6, integrators I1 and I2, mediators M1 and M2, wrappers W1 to W3, and data sources D1 to D6; stages labeled a. to e.]

  5. Transforming Data to Information Application Layer Mediation Layer Foundation Layer users at workstations value-added services data and simulation resources Gio - CERN

  6. Data Loop Knowledge Loop Storage Education Recording Selection Abstraction Integration Summarization Experience Decision-making State changes Action Data and Knowledge Information is created at the confluence of data -- the state & knowledge -- the ability to select and project the state into the future Gio - CERN

  7. Definition* A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts. * Wiederhold: IEEE Computer March 1992 Gio - CERN

  8. Information overload Data starvation • More databases • public & corporate • Faster communication • digital • packeting: TCP-IP, ATM • World-wide connectivity • Internet & Intranets • world-wide web • Disintermediation • ubiquitous publishing Gio - CERN

  9. Change in Supply vs Demand What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it. [Herbert Simon]

  10. Function of Mediation Apply Domain-specific Specialist Knowledge to add value • to locate data sources • to convert for consistency • to integrate from diverse sources • to describe data for processing • to abstract for insight / models • to extrapolate to new situations • to summarize for presentation • INFORMATION Gio - CERN
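
The functions listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Mediator` class, the two wrapper functions, and the price records are hypothetical and do not come from the talk. The wrappers convert source data to a consistent form, and the mediator integrates and summarizes it.

```python
from statistics import mean


class Mediator:
    """Toy mediator: applies domain knowledge to turn source data into information."""

    def __init__(self):
        self.wrappers = {}  # source name -> function returning normalized records

    def register(self, name, wrapper):
        self.wrappers[name] = wrapper

    def integrate(self):
        # Integrate records from all registered sources, tagging each with its origin.
        return [dict(rec, source=name)
                for name, wrapper in self.wrappers.items()
                for rec in wrapper()]

    def summarize(self, field):
        # Abstract the integrated data into a small summary for presentation.
        values = [rec[field] for rec in self.integrate()]
        return {"count": len(values), "mean": mean(values)}


def store_wrapper():
    # Hypothetical source that reports prices in cents; convert to dollars.
    raw = [{"sku": "A", "price_cents": 1200}, {"sku": "B", "price_cents": 800}]
    return [{"sku": r["sku"], "price": r["price_cents"] / 100} for r in raw]


def factory_wrapper():
    # Hypothetical source that already reports dollars.
    return [{"sku": "A", "price": 10.0}]


mediator = Mediator()
mediator.register("store", store_wrapper)
mediator.register("factory", factory_wrapper)
summary = mediator.summarize("price")
print(summary)
```

Keeping the unit conversion inside the wrappers leaves the mediator itself small, in line with the definition on slide 7.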

  11. Interfaces Human-computer interaction • User interface • Application-specific code • Service interface • Domain-specific code • MEDIATION • Resource access interface • Source-specific code • Real-world interface

  12. Making data relevant • Data reduction • Data abstraction • Level changing • Summarization • Exception search • Level change to integrate with other data sources • Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm Gio - CERN

  13. Functions inside Mediation Summarize • Articulation • Integration • Transform • Selection • Heterogeneous resources

  14. Status of Mediation Technology Today: Handcrafted • Expert consults with programmer • Programmer codes the knowledge needed • Resource changes require advice, program update Future: Generated from models • Domain expert maintains models • Specification determines functions • Resource changes trigger regeneration

  15. Coverage of Current DARPA I3 Efforts [Figure: each topic rated as good progress / active research / related work / poor coverage] • Abstraction for relevance to customer • Discovery (web, schema searching) • Maintenance (rule technology?) • Caching / History • Facilitation (auto linking) • Mediators for multiple domains • Integration over sources • Security for cooperation • Wrapping (syntactical heterogeneity) • Databases / Web / Text / Simulation

  16. Mediator Design Principle Transform Data into Information Match Customer Model (hierarchical) to Resource Model (general network), and maintain models

  17. Heterogeneity among Domains If interoperation involves distinct domains mismatch ensues • Autonomy conflicts with consistency, • Local Needs have Priority, • Outside uses are a Byproduct Heterogeneity must be addressed • Platform and Operating Systems 4 4 • Representation and Access Conventions 4 • Naming and Ontology :

  18. Unsolved problem in Interoperation Common assumption in assembling and integrating distributed information resources • The language used by the resources is the same • Sublanguages used by the resources are subsets of a globally consistent language This assumption is provably false. Working towards the goal of global consistency is 1. naïve -- the goal cannot be achieved 2. inefficient -- languages are efficient in local contexts Gio - CERN

  19. Ontology: components We represent the contents and structure of a language by its ontology: • a set of well-defined terms, which delimit the domain of discourse • relationships among those terms, chosen from a limited set An ontology is a formalizable subset of expert knowledge

  20. SKC’s grounded definition • Ontology: a set of terms and their relationships • Term: a reference to real-world and abstract objects • Relationship: a named and typed set of links between objects • Reference: a label that names objects • Real-world object: an entity instance with a physical manifestation • Abstract object: a concept which refers to other objects
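
The definitions on this slide map naturally onto a small data structure. The sketch below is one possible encoding, not SKC's actual representation: the class names, the `abstract` flag, and the example terms are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    label: str      # the reference: a label that names objects
    abstract: bool  # True for a concept, False for a real-world object


@dataclass(frozen=True)
class Relationship:
    name: str    # relationship type, e.g. "part-of"
    source: str  # label of the term the link starts from
    target: str  # label of the term the link points to


@dataclass
class Ontology:
    name: str
    terms: dict = field(default_factory=dict)        # label -> Term
    relationships: set = field(default_factory=set)  # set of Relationship

    def add_term(self, label, abstract=False):
        self.terms[label] = Term(label, abstract)

    def relate(self, name, source, target):
        # Only allow links between terms that this ontology defines.
        if source not in self.terms or target not in self.terms:
            raise KeyError("both ends of a relationship must be known terms")
        self.relationships.add(Relationship(name, source, target))

    def related(self, label, name):
        # Targets reachable from `label` through relationships called `name`.
        return sorted(r.target for r in self.relationships
                      if r.source == label and r.name == name)


store = Ontology("ShoeStore")
store.add_term("Shoes")
store.add_term("Customers")
store.add_term("Inventory", abstract=True)
store.relate("part-of", "Shoes", "Inventory")
print(store.related("Shoes", "part-of"))
```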

  21. Where are Ontologies found? Ontologies allow communication among partners in enterprises (rarely in machine-readable form) Relationships determine meaning - parent, school, company Variable and Class names in Software Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas. Knowledge-bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms.

  22. Establishing Ontologies Top-down: • Commonly acceptable UPPER layers Domain-specific • Analysis and Sharing tools • Model and Object-type based Bottom-up • Wordlist creation from task-specific collections • Database models, schemas, and contents Gio - CERN

  23. Large Ontologies: good or bad? • Have all the Knowledge together • simple for customers of KBs • hard for owners of KBs, must synchronize with many others • in the limit -- everybody must be globally consistent • Large KB will cover multiple / all domains • created by a committee -- slow • maintained by a committee-- costly • Differences in level of abstraction -- efficiency • homeowner: nail • carpenter: sinker, brad, boxnail, . . . Gio - CERN

  24. Domain Ontology Domain ontology assumption: • a domain will contain known objects • the object configuration is consistent • within a domain all terms are consistent & • relationships among objects are consistent • context is implicit in use • explicit context is needed for external use No committee is needed to forge compromises* within a domain * Compromises hide valuable details

  25. SKC Objective Provide for Maintainable Ontologies • devolve maintenance onto many domain-specific experts / authorities • provide an algebra to compute composed ontologies that are limited to their articulation terms • enable interpretation within the source contexts

  26. Conservative assumption When dealing with multiple ontologies one can never be sure that identically or similarly spelled words mean the same thing, i.e., refer to exactly the same set of real-world objects under all current and future conditions • Common, optimistic assumption: meaning is identical • Gets worse when terms are stemmed • SKC, conservative or pessimistic assumption: meaning never matches unless there is a match rule • The number of matching rules is reduced by focusing on the articulation
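
The contrast between the two assumptions can be made concrete with a short sketch. The function names and the rule set below are hypothetical; terms are written as (ontology, term) pairs, and the "Nail" example is borrowed from the sample intersections shown later in the deck.

```python
def optimistic_match(a, b):
    # Common assumption: same spelling (ignoring case) means same meaning.
    return a[1].lower() == b[1].lower()


def conservative_match(a, b, rules):
    # Conservative assumption: no match unless an explicit rule links the terms.
    return frozenset((a, b)) in rules


# Hypothetical articulation: one explicit match rule between two domains.
rules = {frozenset((("Store", "Shoes"), ("Factory", "Shoes")))}

anatomy_nail = ("Anatomy", "Nail")
hardware_nail = ("Hardware", "Nail")

print(optimistic_match(anatomy_nail, hardware_nail))           # same spelling
print(conservative_match(anatomy_nail, hardware_nail, rules))  # no rule exists
print(conservative_match(("Store", "Shoes"), ("Factory", "Shoes"), rules))
```

The optimistic test equates a toenail with a fastener because the labels agree, while the conservative test accepts only the pair an expert has explicitly articulated.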

  27. An Ontology Algebra A knowledge-based algebra for ontologies • Intersection: create a subset ontology, keep sharable entries • Union: create a joint ontology, merge entries • Difference: create a distinct ontology, remove shared entries The Articulation Ontology (AO) consists of matching rules that link domain ontologies
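
A minimal sketch of the three operations follows, assuming a deliberately simplified model in which an ontology is just a set of term labels and the articulation ontology is a set of (term in first ontology, term in second ontology) pairs. The function signatures and the store and factory terms are illustrative assumptions, not SKC's implementation.

```python
def intersection(o1, o2, rules):
    # Subset ontology: only the term pairs that a matching rule declares sharable.
    return {(a, b) for (a, b) in rules if a in o1 and b in o2}


def difference(o1, o2, rules):
    # Distinct ontology: terms of o1 that are not shared with o2.
    return o1 - {a for a, _ in intersection(o1, o2, rules)}


def union(o1, o2, rules, names=("o1", "o2")):
    # Joint ontology: matched entries are merged, the rest stay source-qualified.
    shared = intersection(o1, o2, rules)
    merged = {f"{a}|{b}" for a, b in shared}
    only1 = {f"{names[0]}.{t}" for t in difference(o1, o2, rules)}
    only2 = {f"{names[1]}.{t}" for t in o2 - {b for _, b in shared}}
    return merged | only1 | only2


store = {"Shoes", "Customers", "Employees"}
factory = {"Shoes", "Employees", "Machinery", "Processes", "Material inventory"}
# Only "Shoes" is articulated; "Employees" appears in both but has no match rule.
articulation = {("Shoes", "Shoes")}

shared = intersection(store, factory, articulation)
local = difference(store, factory, articulation)
joint = union(store, factory, articulation, names=("Store", "Factory"))
print(shared)
print(sorted(local))
print(sorted(joint))
```

Because "Employees" has no rule, it stays in the store's difference and shows up twice in the union, once per source.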

  28. Sample Operation: INTERSECTION Terms useful for purchasing Result contains shared terms Source Domain 1: Owned and maintained by Store Source Domain 2: Owned and maintained by Factory Gio - CERN

  29. INTERSECTION support Articulation ontology Matching rules that use terms from the 2 source domains Terms useful for purchasing Store Ontology Factory Ontology Gio - CERN

  30. Sample Intersections [Figure] Shoe Factory ontology: • Material inventory {...} • Employees { . . . } • Machinery { . . . } • Processes { . . . } • Shoes { . . . } Shoe Store ontology: • Shoes { . . . } • Customers { . . . } • Employees { . . . } Other ontologies shown: Anatomy, Hardware, Department Store Articulation ontology matching rules: size = size, color = table(colcode), style = style, foot = foot Terms shown without a matching rule: Employees and Employees; Nail (toe, foot) and Nail (fastener)
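
The three shoe matching rules on this slide (size = size, color = table(colcode), style = style) can be read as a small translation table. The sketch below assumes that `color` is the store's attribute and `colcode` the factory's, and that `table` is a lookup from code to color name; the code values, the record, and the function name are hypothetical.

```python
# Hypothetical lookup table from factory color codes to store color names.
COLOR_TABLE = {"01": "black", "02": "brown"}

# Each rule maps a store attribute to (factory attribute, conversion function).
RULES = {
    "size": ("size", lambda v: v),
    "color": ("colcode", lambda code: COLOR_TABLE[code]),
    "style": ("style", lambda v: v),
}


def to_store_view(factory_record, rules=RULES):
    # Only articulated attributes cross the boundary; the rest stays local.
    return {store_attr: convert(factory_record[factory_attr])
            for store_attr, (factory_attr, convert) in rules.items()}


factory_shoe = {"size": 42, "colcode": "02", "style": "oxford", "machine_id": "M7"}
store_shoe = to_store_view(factory_shoe)
print(store_shoe)
```

The factory-only attribute `machine_id` never reaches the store view, which mirrors the idea that the articulation covers only the terms useful for purchasing.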

  31. Other Basic Operations DIFFERENCE: material fully under local control UNION: merging entire ontologies Articulation ontology: typically prior intersections

  32. Features of an algebra Operations can be composed Operations can be rearranged Alternate arrangements can be evaluated Optimization is enabled The record of past operations can be kept and reused Gio - CERN

  33. Knowledge Composition [Figure: knowledge resources A, B, C, D, and E; articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E); composed knowledge for applications using A, B, C, E, formed as (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ union, ∩ intersection]

  34. Sample Processing in HPKB Question: What is the most recent year an OPEC member nation was on the UN security council? • Related to DARPA HPKB Challenge Problem • SKC resolves 3 sources: CIA Factbook ‘96 (nation), OPEC (members, dates), UN (SC members, years) • SKC obtains the correct answer: 1996 (Indonesia) • Other groups obtained more, but factually wrong, answers Problems resolved by SKC • Factbook has out-of-date OPEC & UN SC lists: Indonesia not listed, Gabon (left OPEC 1994) • different country names: Gambia => The Gambia • historical country names: Yugoslavia • UN lists future security council members: Gabon 1999 • intent of original question: temporal variants
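
The temporal problem on this slide can be illustrated with a toy join. The sketch is not SKC's rule base. The membership years below are illustrative values chosen to echo the slide (Gabon leaving OPEC in 1994, a Security Council term for Gabon ending in 1999), not the contents of the actual sources, and the `as_of` year of 1998 is an assumption about when the question was asked.

```python
# Toy records for illustration only; they are not the actual source contents.
# OPEC membership: (country, first_year, last_year or None if still a member).
OPEC = [("Indonesia", 1962, None), ("Gabon", 1975, 1994)]
# Security Council terms: (country, first_year, last_year).
UN_SC = [("Indonesia", 1995, 1996), ("Gabon", 1998, 1999), ("Gambia", 1998, 1999)]
# Match rule for differing country names, as on the slide.
ALIASES = {"Gambia": "The Gambia"}


def canonical(name):
    return ALIASES.get(name, name)


def naive_answer(opec, un_sc):
    # Joins on name only, ignoring when each membership actually held.
    members = {canonical(c) for c, _, _ in opec}
    hits = [(last, canonical(c)) for c, _, last in un_sc if canonical(c) in members]
    return max(hits) if hits else None


def temporal_answer(opec, un_sc, as_of):
    # Requires both memberships to hold in the same year, no later than as_of.
    hits = []
    for sc_country, sc_first, sc_last in un_sc:
        for op_country, op_first, op_last in opec:
            if canonical(sc_country) != canonical(op_country):
                continue
            start = max(sc_first, op_first)
            end = min(sc_last, as_of, op_last if op_last is not None else as_of)
            if start <= end:
                hits.append((end, canonical(sc_country)))
    return max(hits) if hits else None


naive = naive_answer(OPEC, UN_SC)
careful = temporal_answer(OPEC, UN_SC, as_of=1998)
print(naive)
print(careful)
```

On these toy records the name-only join picks Gabon's later term even though Gabon had already left OPEC, while the temporal join returns 1996 for Indonesia, the answer reported on the slide.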

  35. Tools to create articulations Graph matcher for Articulation- creating Expert Transport ontology Vehicle ontology Suggestions for articulations Gio - CERN

  36. Continue from initial point • Also suggest similar terms for further articulation: by spelling similarity, by graph position, by term match repository • Expert response: 1. Okay 2. False 3. Irrelevant to this articulation • All results are recorded • Okays are converted into articulation rules
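
The suggest-and-review loop described here can be sketched as follows. Only the spelling-similarity heuristic is shown, using Python's standard `difflib.SequenceMatcher`; the 0.8 threshold, the term lists, and the stand-in `expert` function are assumptions for illustration and do not describe the SKC tool itself.

```python
from difflib import SequenceMatcher


def suggest(terms_a, terms_b, threshold=0.8):
    # Propose candidate pairs whose spelling similarity reaches the threshold.
    pairs = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])


def review(candidates, expert):
    # Record every verdict; only "okay" verdicts become articulation rules.
    log, rules = [], []
    for a, b, _score in candidates:
        verdict = expert(a, b)  # "okay", "false", or "irrelevant"
        log.append((a, b, verdict))
        if verdict == "okay":
            rules.append((a, b))
    return rules, log


# Hypothetical terms from a transport ontology and a vehicle ontology.
transport = ["Truck", "Cargo", "Car"]
vehicle = ["Trucks", "Car", "Carrier"]


def expert(a, b):
    # Stand-in for the human expert: accepts one pair, dismisses the rest.
    return "okay" if (a, b) == ("Truck", "Trucks") else "irrelevant"


candidates = suggest(transport, vehicle)
rules, log = review(candidates, expert)
print(candidates)
print(rules)
```

Keeping the full log, including rejected suggestions, matches the slide's point that all results are recorded, so the same candidate need not be presented to the expert twice.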

  37. Candidate Match Repository Term linkages automatically extracted from 1912 Webster’s dictionary* Based on processing headwords and definitions using algebra primitives Notice presence of 2 domains: chemistry, transport * free; other sources have been processed

  38. Using the match repository Gio - CERN

  39. Navigating the match repository Gio - CERN

  40. Primitive Operations (Model and Instance) Unary: Summarize - structure up, Glossarize - list terms, Filter - reduce instances, Extract - circumscription Binary: Match - data corroboration, Difference - distance measure, Intersect - schema discovery, Blend - schema extension Constructors: create object, create set Connectors: match object, match set Editors: insert value, edit value, move value, delete value Converters: object - value, object indirection, reference indirection

  41. Future: exploiting the result Avoid the n² problem of interpreter mapping, stated by Swartout as an issue in HPKB year 1 Result has links to source Processing & query evaluation is best performed within Source Domains & by their engines

  42. SKC Synopsis • Research: Reliable query answers from heterogeneous, imperfect data sources • Sources: • General: CIA World Factbook ‘96, UN www, OPEC www, Webster’s Dictionary, Thesaurus, Oxford English Dictionary • Topical: OPEC, BattleSpace Sensors, Logistics Servers • Client: DARPA High Performance Knowledge Base (HPKB) project • Theory: Rule-based algebra • Translation & Composition primitives

  43. Innovation in SKC • No need to harmonize full ontologies • Focus on what is critical for interoperation • Rules specific for articulation • Potentially many sets of articulation rules • Maintenance is distributed • to n sources • to m articulation agents • whether m < n², depending on architecture density, is a research question
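
One way to read the m < n² remark, offered as an assumption rather than the talk's own derivation: if each articulation links exactly two sources, then a fully connected architecture and a chain of sources bound the number of articulations from above and below.

```latex
m_{\text{pairwise}} = \binom{n}{2} = \frac{n(n-1)}{2} < n^2 \quad (n \ge 1),
\qquad
m_{\text{chain}} = n - 1 .
```

For n = 6 sources this gives at most 15 pairwise articulations, well under n² = 36, and as few as 5 when the sources are linked in a chain.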

  44. Domain Specialization Empowerment: autonomously maintainable • Knowledge Acquisition (20% effort) & • Knowledge Maintenance (80% effort*) to be performed by • Domain specialists • Professional organizations • Field teams of modest size * based on experience with software

  45. SKC Summary • Algebra enables interoperation by: dealing explicitly with differences by knowledge; identifying maintenance domains; keeping sources autonomous • Assumes a domain has a common ontology: composing domain ontologies requires the algebra to manage the linkages where articulation occurs; processes are best executed within the domains • Knowledge about articulation is disjoint: allows integration specialists to work independently; supports multiple intersections and views • Maintenance is structured and partitioned

  46. Current SKC Directions • Experience with real world (imperfect) data confirms validity of our approach • Expert sources are better maintained than general sources • Rules applied to multiple sources provide more reliable and accurate query results • Component architecture enables scalable, maintainable knowledge base development • Porting the concepts to the DARPA Markup Language (DAML) setting Gio - CERN

  47. Mediation Research Topics • Mediator management and maintenance • Representation of knowledge and customer models • Balancing dynamic and warehouse solutions • Formalization of semantic heterogeneities • many levels and types • roles for wrappers vs. mediators vs. applications • scalability by partitioning -- make it simple! • Domain Ontologies --- tools, validation, . . . • Effect of object paradigm and method-based access • Service and business models • New types of information systems

  48. Long Range Science Vision Artificial Intelligence knowledge mgmt domain expertise uncertainty Systems Engineering analysis documentation costing Databases access storage algebras Integration Methods GIS Integration Science Spatial is special. Gio - CERN

  49. Background Material: • Technology Sources • Maintenance • Projects • Information about the Future Gio - CERN

  50. Interfaces Human ↔ Computer {x-widgets, HTML} Application ↔ Mediator {OQL, KQML, ...} Mediator ↔ Data sources {SQL, TQL, XML, … } Data ← real world {sensors, clerks, … }
