Prepared for CERN seminar, June 2000. Heterogeneous Information Management. June 2000 Gio Wiederhold Stanford University. Abstract. Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received.
prepared for CERN seminar, June 2000 Heterogeneous Information Management June 2000 Gio Wiederhold Stanford University Gio - CERN
Abstract Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The autonomy of the suppliers causes heterogeneity and inconsistencies. The number of potential suppliers and their autonomy also create information overload. To cope with these issues, novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change. We will present the concepts and status of such services. Collaboration, security, and payment schemes are some of the considerations.
Outline • Background for Mediated Systems • Motivation and Functions needed • Architecture • Current Status • Resolving Semantic Heterogeneity • Research Directions • Background • Maintenance • Research Projects • Integration of Simulation Information Gio - CERN
Evolution of mediation applications [Figure labels: applications A1 to A6; integrators I1, I2; mediators M1, M2 (mediators network); wrappers W1 to W3; datasources D1 to D6; stages a. to e.]
Transforming Data to Information • Application Layer: users at workstations • Mediation Layer: value-added services • Foundation Layer: data and simulation resources
Data and Knowledge Information is created at the confluence of data (the state) and knowledge (the ability to select and project the state into the future). [Figure: a Data Loop and a Knowledge Loop; labels: Storage, Education, Recording, Selection, Abstraction, Integration, Summarization, Experience, Decision-making, State changes, Action]
Definition* A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts. * Wiederhold: IEEE Computer March 1992 Gio - CERN
Information overload Data starvation • More databases • public & corporate • Faster communication • digital • packeting: TCP-IP, ATM • World-wide connectivity • Internet & Intranets • world-wide web • Disintermediation • ubiquitous publishing Gio - CERN
Change in Supply vs Demand What information consumes is rather obvious, it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it. [Herbert Simon] Gio - CERN
Function of Mediation Apply Domain-specific Specialist Knowledge to add value • to locate data sources • to convert for consistency • to integrate from diverse sources • to describe data for processing • to abstract for insight / models • to extrapolate to new situations • to summarize for presentation • INFORMATION Gio - CERN
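The functions listed above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two sources, their field names, the unit conversion, and the `mediate` function are hypothetical and do not come from the talk. The sketch shows a mediator converting for consistency, integrating from diverse sources, abstracting, and summarizing.

```python
from statistics import mean

# Hypothetical source records; field names and units differ per source.
STORE_ROWS = [{"item": "boot", "price_usd": 80.0},
              {"item": "sandal", "price_usd": 30.0}]
FACTORY_ROWS = [{"product": "boot", "cost_cents": 4500},
                {"product": "sandal", "cost_cents": 1200}]


def wrap_store(rows):
    """Wrapper: map store rows to a common form keyed by item name."""
    return {r["item"]: {"price": r["price_usd"]} for r in rows}


def wrap_factory(rows):
    """Wrapper: map factory rows to the same form, converting cents to dollars."""
    return {r["product"]: {"cost": r["cost_cents"] / 100.0} for r in rows}


def mediate(store_rows, factory_rows):
    """Mediator: convert, integrate, abstract, and summarize for the application."""
    store = wrap_store(store_rows)
    factory = wrap_factory(factory_rows)
    # Integrate: keep only items that both sources describe.
    merged = {k: {**store[k], **factory[k]} for k in store.keys() & factory.keys()}
    # Abstract: derive a value that neither source holds on its own.
    for entry in merged.values():
        entry["margin"] = entry["price"] - entry["cost"]
    # Summarize: hand the application one small result instead of raw rows.
    return {"items": len(merged),
            "mean_margin": mean(e["margin"] for e in merged.values())}
```

Calling `mediate(STORE_ROWS, FACTORY_ROWS)` returns `{"items": 2, "mean_margin": 26.5}`. The domain knowledge (units, which fields correspond) lives in the wrappers and the mediator, not in the application.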
Interfaces [Figure: layered stack] Human-computer Interaction • User interface • Application-specific code • Service interface • Domain-specific code (MEDIATION) • Resource access interface • Source-specific code • Real-world interface
Making data relevant • Data reduction • Data abstraction • Level changing • Summarization • Exception search • Level change to integrate with other data sources • Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm Gio - CERN
Functions inside Mediation [Figure labels: Summarize, Articulation, Integration, Transform, Selection; input: heterogeneous resources]
Status of Mediation Technology • Today: Handcrafted • Expert consults with programmer • Programmer codes the knowledge needed • Resource changes require advice, program update • Future: Generated from models • Domain expert maintains models • Specification determines functions • Resource changes trigger regeneration
Coverage of Current DARPA I3 Efforts [Figure: each topic carries emoticon ratings on the scale good progress / active research / related work / poor coverage] • Abstraction for relevance to customer • Discovery (web, schema searching) • Maintenance (rule technology?) • Caching / History • Facilitation (auto linking) • Mediators for multiple domains • Integration over sources • Security for cooperation • Wrapping (syntactical heterogeneity) • Databases / Web / Text / Simulation
Mediator Design Principle Transform Data into Information • Match Customer Model (hierarchical) to Resource Model (general network) • (and maintain models)
Heterogeneity among Domains If interoperation involves distinct domains, mismatch ensues • Autonomy conflicts with consistency • Local needs have priority • Outside uses are a byproduct Heterogeneity must be addressed • Platform and Operating Systems • Representation and Access Conventions • Naming and Ontology
Unsolved problem in Interoperation Common assumption in assembling and integrating distributed information resources • The language used by the resources is the same • Sublanguages used by the resources are subsets of a globally consistent language This assumption is provably false. Working towards the goal of global consistency is 1. naïve -- the goal cannot be achieved 2. inefficient -- languages are efficient in local contexts Gio - CERN
Ontology: components We represent the contents and structure of a language by its ontology: • a set of well-defined terms, which delimit the domain of discourse • relationships among those terms, chosen from a limited set An ontology is a formalizable subset of expert knowledge.
SKC’s grounded definition . • Ontology: a set of terms and their relationships • Term: a reference to real-world and abstract objects • Relationship: a named and typed set of links between objects • Reference: a label that names objects • Real-world object: an entity instance with a physical manifestation • Abstract object: a concept which refers to other objects Gio - CERN
Where are Ontologies found? Ontologies allow communication among partners in enterprises (rarely in machine-readable form). Relationships determine meaning: parent, school, company. Variable and class names in software. Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas. Knowledge bases use term ontologies (often explicitly), and add class definitions (to hold instances), constraints, and operations among the terms.
Establishing Ontologies Top-down: • Commonly acceptable UPPER layers Domain-specific • Analysis and Sharing tools • Model and Object-type based Bottom-up • Wordlist creation from task-specific collections • Database models, schemas, and contents Gio - CERN
Large Ontologies: good or bad? • Have all the Knowledge together • simple for customers of KBs • hard for owners of KBs, must synchronize with many others • in the limit -- everybody must be globally consistent • Large KB will cover multiple / all domains • created by a committee -- slow • maintained by a committee-- costly • Differences in level of abstraction -- efficiency • homeowner: nail • carpenter: sinker, brad, boxnail, . . . Gio - CERN
No committee is needed to forge compromises * within a domain Domain ontology assumption . • a domain will contain known objects • the object configuration is consistent • within a domain all terms are consistent & • relationships among objects are consistent • context is implicit in use • explicit context is needed for external use Domain Ontology • Compromises hide valuable details Gio - CERN
SKC SKC Objective Provide for Maintainable Ontologies • devolve maintenance onto many domain-specific experts / authorities • provide an algebra to compute composed ontologies that are limited to their articulation terms • enable interpretation within the source contexts Gio - CERN
Conservative assumption! When dealing with multiple ontologies one can never be sure that identically or similarly spelled words mean the same thing, i.e., refer to exactly the same set of real-world objects under all current and future conditions • Common, optimistic assumption: Meaning is identical • Gets worse when terms are stemmed • SKC, conservative or pessimistic assumption: Meaning never matches, unless there is a match rule • number of matching rules is reduced by focusing on the articulation
An Ontology Algebra A knowledge-based algebra for ontologies • Intersection: create a subset ontology, keep sharable entries • Union: create a joint ontology, merge entries • Difference: create a distinct ontology, remove shared entries The Articulation Ontology (AO) consists of matching rules that link domain ontologies
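A minimal sketch of the three operations, assuming an ontology is just a named set of terms and an articulation ontology is a set of explicit (term, term) matching rules. The `Ontology` class, the function names, and the sample terms are hypothetical; SKC's actual rule language is not shown in the slides. Following the conservative assumption, identically spelled terms do not match unless a rule says so.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    name: str
    terms: frozenset


def intersection(a, b, rules):
    """Sharable entries: only term pairs linked by an explicit matching rule."""
    return {(ta, tb) for ta, tb in rules if ta in a.terms and tb in b.terms}


def difference(a, b, rules):
    """Terms of `a` that no rule links to `b`: material under local control."""
    matched = {ta for ta, _ in intersection(a, b, rules)}
    return set(a.terms) - matched


def union(a, b, rules):
    """Joint ontology: linked terms are merged, all others stay source-qualified."""
    linked = intersection(a, b, rules)
    matched_a = {ta for ta, _ in linked}
    matched_b = {tb for _, tb in linked}
    merged = {f"{a.name}.{ta}={b.name}.{tb}" for ta, tb in linked}
    rest_a = {f"{a.name}.{t}" for t in a.terms - matched_a}
    rest_b = {f"{b.name}.{t}" for t in b.terms - matched_b}
    return merged | rest_a | rest_b


store = Ontology("Store", frozenset({"Shoes", "Customers", "Employees"}))
factory = Ontology("Factory", frozenset({"Shoes", "Employees", "Machinery"}))
# Assumed articulation: only Shoes is declared to match; Employees has no rule.
rules = {("Shoes", "Shoes")}
```

With these inputs, `intersection(store, factory, rules)` yields `{("Shoes", "Shoes")}`, and `difference(store, factory, rules)` yields `{"Customers", "Employees"}`: the two Employees terms stay distinct even though they are spelled alike, because no rule links them.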
Sample Operation: INTERSECTION Terms useful for purchasing Result contains shared terms Source Domain 1: Owned and maintained by Store Source Domain 2: Owned and maintained by Factory Gio - CERN
INTERSECTION support Articulation ontology Matching rules that use terms from the 2 source domains Terms useful for purchasing Store Ontology Factory Ontology Gio - CERN
Sample Intersections [Figure] Articulation ontology matching rules: size = size, color = table(colcode), style = style • Shoe Factory: Material inventory {...}, Employees {...}, Machinery {...}, Processes {...}, Shoes {...} • Shoe Store: Shoes {...}, Customers {...}, Employees {...}, {...} • Other ontologies shown: Anatomy, Hardware, Department Store • foot = foot • Employees / Employees • Nail (toe, foot) / Nail (fastener)
Other Basic Operations • DIFFERENCE: material fully under local control • UNION: merging entire ontologies • Articulation ontology: typically prior intersections
Features of an algebra Operations can be composed Operations can be rearranged Alternate arrangements can be evaluated Optimization is enabled The record of past operations can be kept and reused Gio - CERN
Knowledge Composition [Figure: knowledge resources A, B, C, D, E • articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), (C ∩ E) • composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E) • Legend: ∪ union, ∩ intersection]
Sample Processing in HPKB Question: What is the most recent year an OPEC member nation was on the UN security council? (Related to DARPA HPKB Challenge Problem) • SKC resolves 3 sources: CIA Factbook ‘96 (nation), OPEC (members, dates), UN (SC members, years) • SKC obtains the correct answer: 1996 (Indonesia) • Other groups obtained more, but factually wrong, answers • Problems resolved by SKC: Factbook has out-of-date OPEC & UN SC lists (Indonesia not listed; Gabon left OPEC 1994) • different country names (Gambia => The Gambia) • historical country names (Yugoslavia) • UN lists future security council members (Gabon 1999) • intent of original question (temporal variants)
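The temporal issues named on this slide can be illustrated with a toy query. The rows below use only the facts the slide mentions (Indonesia 1996, Gabon leaving OPEC in 1994, Gabon listed for 1999); the data layout, the `AS_OF` cutoff, and the function names are assumptions for illustration and are not SKC's implementation.

```python
AS_OF = 1996  # assumed reference year, matching the Factbook edition on the slide

# country -> year it left OPEC (None = still a member as of the slide)
OPEC = {"Indonesia": None, "Gabon": 1994}

# (country, year on the UN Security Council); includes a future listing
UN_SC = [("Indonesia", 1996), ("Gabon", 1999)]


def was_opec_member(country, year):
    """True if the country appears in the OPEC list and had not left by `year`."""
    if country not in OPEC:
        return False
    left = OPEC[country]
    return left is None or year < left


def naive_latest():
    """Optimistic join on names only: ignores membership dates and future years."""
    hits = [(year, c) for c, year in UN_SC if c in OPEC]
    return max(hits) if hits else None


def latest_opec_on_council(as_of):
    """Join with temporal rules: no future listings, membership must hold that year."""
    hits = [(year, c) for c, year in UN_SC
            if year <= as_of and was_opec_member(c, year)]
    return max(hits) if hits else None
```

Here `naive_latest()` returns `(1999, "Gabon")`, the kind of plausible but wrong answer the slide warns about, while `latest_opec_on_council(AS_OF)` returns `(1996, "Indonesia")`.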
Tools to create articulations Graph matcher for Articulation- creating Expert Transport ontology Vehicle ontology Suggestions for articulations Gio - CERN
continue from initial point • Also suggest similar terms • for further articulation: • by spelling similarity, • by graph position • by term match repository • Expert response: • 1. Okay • 2. False • 3. Irrelevant • to this articulation • All results are recorded • Okay’s are converted into articulation rules Gio - CERN
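The suggest-and-review loop above can be sketched as follows. Spelling similarity is computed here with Python's standard `difflib.SequenceMatcher`, which is a stand-in chosen for this sketch; the talk does not say which similarity measure the SKC tool uses. The threshold, the sample terms, and the `expert` callback are likewise hypothetical.

```python
from difflib import SequenceMatcher


def suggest(terms_a, terms_b, threshold=0.8):
    """Propose candidate term pairs whose spelling similarity meets the threshold."""
    candidates = []
    for a in sorted(terms_a):
        for b in sorted(terms_b):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                candidates.append((a, b, round(score, 2)))
    return candidates


def review(candidates, expert):
    """Record every expert verdict; only 'okay' verdicts become articulation rules."""
    rules, log = set(), []
    for a, b, _score in candidates:
        verdict = expert(a, b)  # expected: "okay", "false", or "irrelevant"
        log.append((a, b, verdict))
        if verdict == "okay":
            rules.add((a, b))
    return rules, log


# Hypothetical terms from a transport ontology and a vehicle ontology.
transport_terms = {"Truck", "Car"}
vehicle_terms = {"Trucks", "Cargo"}
candidates = suggest(transport_terms, vehicle_terms)
rules, log = review(candidates, lambda a, b: "okay")
```

With these inputs only the pair Truck / Trucks is suggested; Car / Cargo falls below the assumed 0.8 threshold. Keeping the full log, including "false" and "irrelevant" verdicts, mirrors the slide's point that all results are recorded.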
Candidate Match Repository Term linkages automatically extracted from the 1912 Webster’s dictionary *, based on processing headwords and definitions using algebra primitives. * free; other sources have been processed. Notice presence of 2 domains: chemistry, transport
Using the match repository Gio - CERN
Navigating the match repository Gio - CERN
Primitive Operations: Model and Instance • Unary: Summarize - structure up; Glossarize - list terms; Filter - reduce instances; Extract - circumscription • Binary: Match - data corroboration; Difference - distance measure; Intersect - schema discovery; Blend - schema extension • Constructors: create object, create set • Connectors: match object, match set • Editors: insert value, edit value, move value, delete value • Converters: object - value, object indirection, reference indirection
Future: exploiting the result • Avoid the n² problem of interpreter mapping, stated by Swartout as an issue in HPKB year 1 • Result has links to source • Processing & query evaluation is best performed within Source Domains & by their engines
SKC Synopsis • Research: Reliable query answers from heterogeneous, imperfect data sources • Sources: • General: CIA World Factbook ‘96, UN www, OPEC www, Webster’s Dictionary, Thesaurus, Oxford English Dictionary • Topical: OPEC, BattleSpace Sensors, Logistics Servers • Client: DARPA High Performance Knowledge Base (HPKB) project • Theory: Rule-based algebra • Translation & Composition primitives
Innovation in SKC • No need to harmonize full ontologies • Focus on what is critical for interoperation • Rules specific for articulation • Potentially many sets of articulation rules • Maintenance is distributed • to n sources • to m articulation agents, where m < n², depending on architecture density (a research question)
Domain Specialization Empowerment: autonomously maintainable • Knowledge Acquisition (20% effort) & • Knowledge Maintenance (80% effort *) to be performed by • Domain specialists • Professional organizations • Field teams of modest size * based on experience with software
SKC Summary . • Algebra enables Interoperation by dealing explicitly with differences by knowledge identifying maintenance domains keeping sources autonomous • Assumes domain has a common ontology composing domain ontologies requires the algebra to manage the linkages where articulation occurs processes are best executed within the domains • Knowledge about articulation is disjoint allows integration specialists to work independently supports multiple intersections and views • Maintenance is structured and partitioned Gio - CERN
Current SKC Directions • Experience with real world (imperfect) data confirms validity of our approach • Expert sources are better maintained than general sources • Rules applied to multiple sources provide more reliable and accurate query results • Component architecture enables scalable, maintainable knowledge base development • Porting the concepts to the DARPA Markup Language (DAML) setting Gio - CERN
Mediation Research Topics • Mediator management and maintenance • Representation of knowledge and customer models • Balancing dynamic and warehouse solutions • Formalization of semantic heterogeneities • many levels and types • roles for wrappers vs. mediators vs. applications • scalability by partitioning: make it simple! • Domain Ontologies: tools, validation, . . . • Effect of object paradigm and method-based access • Service and business models • New types of information systems
Long Range Science Vision Artificial Intelligence knowledge mgmt domain expertise uncertainty Systems Engineering analysis documentation costing Databases access storage algebras Integration Methods GIS Integration Science Spatial is special. Gio - CERN
Background Material: • Technology Sources • Maintenance • Projects • Information about the Future Gio - CERN
Interfaces • Human ↔ Computer {x-widgets, HTML} • Application ↔ Mediator {OQL, KQML, ...} • Mediator ↔ Data sources {SQL, TQL, XML, … } • Data ← real world {sensors, clerks, … }