This article discusses the need for precision in web-based interoperation and the challenges posed by the excess of information available. It introduces the SKC (Scalable Knowledge Composition) project at Stanford University, which aims to develop methods and tools for interoperation among different domains using ontologies.
RIACS / NASA AMES Semantic Precision for Web-based Interoperation Gio Wiederhold Stanford University July 2001 www-db.stanford.edu/people/gio.html Thanks to Jan Jannink, Shrish Agarwal, Prasenjit Mitra, & Stefan Decker. Gio ICEIS1
Outline • Setting VG 3 - VG 5 • Precision VG 6 - VG 8 • Lack of precision VG 9 - VG 11 • SKC solution VG 12, VG 20 - VG 29 • Ontologies VG 13 - VG 19 • Early results VG 30 • Interoperation VG 31 - VG 32 • Tool & examples VG 33 - VG 41 • Composition and execution VG 42 - VG 43 • Evolution and maintenance VG 44 - VG 46 • Summary – SKC to general science VG 47 - VG 49 Gio ICEIS2
Trends 1998 : 1999
• Users of the Internet: 40% (1998), 52% (1999) of U.S. population
• Growth of Net Sites (now 2.2M public sites with 288M pages)
• Expected growth in E-commerce by Internet users [BW, 6 Sep. 1999]
segment          1998     1999
books            7.2%     16.0%
music & video    6.3%     16.4%
toys             3.1%     10.3%
travel           2.6%     4.0%
tickets          1.4%     4.2%
Overall          8.0%     33.0% = $9.5 Billion
• An unsustainable trend cannot be sustained [Herbert Stein]
[Figure: E-penetration in % (axis 0 to 90) by year, 98 to 04, with values 0.3, 1, 3, 9, 27, 81, **; labels: new services, Toys; Centroid, in 1999 ~1% of total market] Gio ICEIS3
Growth and Perception • E-commerce: Gartner, 2000 prediction for 2004: 7.3 T$; revision, 2001 prediction for 2004: 5.9 T$ (drastic loss?) • 50 companies, each after 20% of the market • Examples: Artificial Intelligence, Databases, Neural networks, E-commerce • [Figure: perception level over time (0 to 15 ...); curves: extrapolated growth, combinatorial growth, realistic growth, perceived growth; labels: invisible growth, perceived initial growth, disappointment, failures] Gio ICEIS4
Our* Information Environment. B2B, B2C, G2G, G2C, . . . • In the past: Scarcity Customers needed more information to make better decisions • Today: Excess The web provides more information than customers can digest Effect: confusion in decision-making Must I look at all possibly relevant information? What is the penalty for missing something ? What is the cost of looking at everything ? I am confused, best defer making any decision . . . . . . . . . . Gio ICEIS5
Need for precision • Precision: few wrong or irrelevant results • More precision is needed as data volume increases: a small error rate still leads to too many errors • [Figure: data error rate vs. information quantity; labels: Information Wall, human limit (hard to move), human with tools?, acceptable limit; adapted from Warren Powell, Princeton Un.] Gio ICEIS6
Relationships among parameters • recall: r = v.relevant / v.available • precision: p = v.relevant / v.retrieved • Type 1 errors affect recall • Type 2 errors affect precision • [Figure: %tage actually relevant (0%, 50%, 100%) vs. volume retrieved, up to volume available, over the space of methods; labels: perfect recall, perfect precision] Gio ICEIS7
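The two ratios on this slide can be sketched in a few lines. This is a minimal illustration, assuming that v.relevant counts the relevant items actually retrieved and v.available counts the relevant items that exist; the function names and the counts are hypothetical.

```python
def recall(relevant_retrieved: int, relevant_available: int) -> float:
    """r = v.relevant / v.available: share of the relevant material that was found."""
    return relevant_retrieved / relevant_available if relevant_available else 0.0


def precision(relevant_retrieved: int, retrieved: int) -> float:
    """p = v.relevant / v.retrieved: share of the retrieved material that is relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0


# Hypothetical search: 200 results returned, 60 of them relevant,
# while 80 relevant items exist across all sources.
r = recall(relevant_retrieved=60, relevant_available=80)
p = precision(relevant_retrieved=60, retrieved=200)
print(f"recall={r:.2f} precision={p:.2f}")
```

Retrieving more volume can only raise recall, but it usually lowers precision, which is the trade-off the figure describes.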
Cost of Error types differs • 1. Missed Valid Information (False Negatives) causes lost opportunities (cheapest shovel, . . .) and suboptimal decision-making • 2. Excess Information (False Positives) has to be investigated (attractive-looking supplier - makes toys) • Having many cases of excess information costs more than some missing information • [Figure: cost-benefit over the space of results, ordered; x valid suppliers; regions 1 and 2] Gio ICEIS8
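The cost argument can be made concrete with a small calculation. This is only a sketch: the unit costs and the error counts below are hypothetical, chosen to show how many cheap false positives can outweigh a few expensive misses.

```python
def error_cost(false_negatives: int, false_positives: int,
               cost_per_miss: float, cost_per_excess: float) -> float:
    """Total cost = lost opportunities + effort spent investigating excess results."""
    return false_negatives * cost_per_miss + false_positives * cost_per_excess


# Hypothetical unit costs: a missed supplier costs 40, checking one
# irrelevant result costs 2.
few_misses = error_cost(false_negatives=5, false_positives=0,
                        cost_per_miss=40.0, cost_per_excess=2.0)
many_excess = error_cost(false_negatives=0, false_positives=500,
                         cost_per_miss=40.0, cost_per_excess=2.0)
print(few_misses, many_excess)
```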
A Major Cause of Errors Searches extend over many domains • Domains have their own terminologies • Need autonomy to deal with knowledge growth • The usage of terms in a domain is efficient • Appropriate granularity • Mechanic working on a truck vs. logistics manager • Shorthand notations • PSU vs. PSU • Functions differ in scope • Payroll versus Personnel • getting paid vs. available (includes contract staff) Gio ICEIS9
Semantic Mismatches Information comes from many autonomous sources • Differing viewpoints (by source) • differing terms for similar items { lorry, truck } • same terms for dissimilar items trunk (luggage, car) • differing coverage vehicles (DMV, police, AIA) • differing granularity trucks (shipper, manuf.) • different scope student (museum fee, Stanford) • different hierarchical structures supplier vs. usage • Hinders use of information from disjoint sources • missed linkages: loss of information, opportunities • irrelevant linkages: overload on user or application program • Poor precision when merged • Still ok for web browsing, poor for business & science Gio ICEIS10
Structural Heterogeneity Same concept? If incorporated in differing structures? Gio ICEIS11
Approach (SKC project) Scalable Knowledge Composition – Stanford Univ. DB group • Define Terminology in a domain precisely • Schemas, XML DTDs Ontologies • Develop methods to permit interoperation among differing domains (not integration) • Articulation • Ontology Algebra • Develop tools to support the methods • Ontology matching Gio ICEIS12
Functions of Ontologies . • Enable Precision in Understanding People = designers, implementors, users, maintainers Systems = sources, mediators, applications • Share the Cost of Knowledge Acquisition & Maintenance reuse encoded knowledge, remain up-to-date as domains change • Enable Information Interoperation * Define the terms that link domains Gio ICEIS13
Ancestors of Ontologies • Lexicons: collect terms used in information systems • Taxonomies: categorize, abstract, classify terms • Schemas of databases: attributes, ranges, constraints • Data dictionaries: systems with multiple files, owners • Object libraries: grouped attributes, inherit., methods • Symbol tables: terms bound to implemented programs • Domain object models: (XML DTD): interchange terms • . . . More Knowledge formalized Gio ICEIS14
Data and Knowledge Much Data Diverse Knowledge Information is created at the confluence of data -- the state & knowledge -- the ability to select and project the state into the future Knowledge Loop Data Loop Storage Education Selection Recording Integration Abstraction Experience State changes Decision-making Action bound to an application Gio ICEIS15
Two Mismatch Solutions • A Single, Globally consistent Ontology ( Your Hope ) • wonderful for users and their programs • too many interacting sources • long time to achieve: 2 sources ( UAL, LH ), 3 (+ trucks), 4, … all ? • costly maintenance, since all sources evolve • no world-wide authority to dictate conformance • Domain-specific ontologies ( XML DTD assumption ) • Small, focused, cooperating groups • high quality, some examples - arthritis, Shakespeare plays • allows sharable, formal tools • ongoing, local maintenance affecting users - annual updates • poor interoperation, users still face inter-domain mismatches Gio ICEIS16
Global consistency: Hope, but . . • Common assumptions in assembling and integrating distributed information resources • The language used by the resources is the same • Sub languages used by the resources are subsets of a globally consistent language • These assumptions are provably false • Working towards the goal of global consistency is 1. naïve -- the goal cannot be achieved 2. inefficient -- languages are efficient in local contexts 3. unmaintainable -- terminology evolves with progress Gio ICEIS17
Domain-specific Expertise . Knowledge needed is huge • Partition into natural domains • Determine domain responsibility and authority • Empower domain owners • Provide tools Consider interaction Society of specialists Gio ICEIS18
Domains and Consistency • a domain will contain many objects • the object configuration is consistent • within a domain all terms are consistent & • relationships among objects are consistent • context is implicit • No committee is needed to forge compromises * within a domain • * Compromises hide valuable details • [Figure label: Domain Ontology] Gio ICEIS19
SKC grounded definition . • Ontology: a set of terms and their relationships • Term: a reference to real-world and abstract objects • Relationship: a named and typed set of links between objects • Reference: a label that names objects • Abstract object: a concept which refers to other objects • Real-world object: an entity instance with a physical manifestation (or its representation in a factual database) Gio ICEIS20
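These definitions map naturally onto a small data structure. The sketch below is an assumption about how one might encode them, not the SKC implementation; the class names, fields, and the `ungrounded` check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    label: str               # the reference: a label that names objects
    abstract: bool = False   # True for a concept, False for a real-world object


@dataclass(frozen=True)
class Relationship:
    name: str                # e.g. "instance_of"
    kind: str                # the type of the relationship
    links: frozenset         # set of (Term, Term) pairs


@dataclass
class Ontology:
    name: str
    terms: set = field(default_factory=set)
    relationships: set = field(default_factory=set)

    def ungrounded(self) -> set:
        """Abstract terms that take part in no relationship at all (a crude check)."""
        linked = {t for r in self.relationships for pair in r.links for t in pair}
        return {t for t in self.terms if t.abstract and t not in linked}


vehicle = Term("vehicle", abstract=True)
cargo = Term("cargo", abstract=True)
truck_17 = Term("truck #17")  # hypothetical real-world object
is_a = Relationship("instance_of", "taxonomic", frozenset({(truck_17, vehicle)}))
onto = Ontology("transport", {vehicle, cargo, truck_17}, {is_a})
print(onto.ungrounded())  # the abstract term "cargo" has no link yet
```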
Grounding enables implementation • We use many abstract terms in our work • Needed because we are dealing with many objects • Human thinking is limited to short-term memory • Someone must be able to translate them into code reliably • Each abstract term must have a path to reality • You must provide that path for students and coders • Without a clear path that is not possible • Not automatically at all – machines need specs • Not reliably by human programmers – failures occur • Without implementation there is no benefit Advice Gio ICEIS21
An Ontology Algebra • A knowledge-based algebra for ontologies • Intersection: create a subset ontology (keep sharable entries) • Union: create a joint ontology (merge entries) • Difference: create a distinct ontology (remove shared entries) • The Articulation Ontology (AO) consists of matching rules that link domain ontologies Gio ICEIS22
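A toy version of the three operations helps to show how they depend on the articulation rules rather than on plain set equality. This sketch assumes an ontology is just a set of source-qualified term names and a rule is a pair of matching terms; the real algebra works on richer graphs, and all names here are hypothetical.

```python
def intersection(o1: set, o2: set, rules: set) -> set:
    """Keep sharable entries: rule pairs whose terms exist in both sources."""
    return {(a, b) for (a, b) in rules if a in o1 and b in o2}


def difference(o1: set, o2: set, rules: set) -> set:
    """Remove shared entries: what remains is fully under o1's local control."""
    shared = {a for (a, _) in intersection(o1, o2, rules)}
    return o1 - shared


def union(o1: set, o2: set, rules: set) -> set:
    """Merge entries: matched terms become one entry, the rest are carried over."""
    linked = intersection(o1, o2, rules)
    merged = {frozenset(pair) for pair in linked}
    only_1 = {frozenset([t]) for t in difference(o1, o2, rules)}
    only_2 = {frozenset([t]) for t in o2 - {b for (_, b) in linked}}
    return merged | only_1 | only_2


# Hypothetical ontologies A and B with an articulation ontology AO.
A = {"A.item", "A.price", "A.staff"}
B = {"B.product", "B.cost", "B.machine"}
AO = {("A.item", "B.product"), ("A.price", "B.cost")}

print(intersection(A, B, AO))
print(difference(A, B, AO))
print(len(union(A, B, AO)))
```

Terms are prefixed with their source so that the same word in two domains stays distinct until a rule links it.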
Sample Operation: INTERSECTION Terms useful for purchasing Result contains shared terms Source Domain 1: Owned and maintained by Store Source Domain 2: Owned and maintained by Factory Gio ICEIS23
Sample Intersections • Shoe Factory: Material inventory {...}, Employees { . . . }, Machinery { . . . }, Processes { . . . }, Shoes { . . . } • Shoe Store: Shoes { . . . }, Customers { . . . }, Employees { . . . } • Articulation ontology matching rules: size = size, color = table(colcode), style = style • Anatomy {. . . }: foot = foot • Hardware: Nail (toe, foot) vs. Nail (fastener) • Employees vs. Employees • Department Store . . . Gio ICEIS24
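The three matching rules for shoes can be read as small conversion functions. The sketch below is one possible reading, assuming the factory stores a numeric `colcode` that a lookup table turns into the store's color name; the table contents and the sample record are hypothetical.

```python
# Hypothetical lookup table behind the rule "color = table(colcode)".
COLOR_TABLE = {17: "brown", 23: "black"}

# Articulation rules: store term -> how to derive it from a factory record.
RULES = {
    "size": lambda shoe: shoe["size"],                    # size = size
    "color": lambda shoe: COLOR_TABLE[shoe["colcode"]],   # color = table(colcode)
    "style": lambda shoe: shoe["style"],                  # style = style
}


def to_store_terms(factory_shoe: dict) -> dict:
    """Express a factory shoe record using only the shared (articulated) terms."""
    return {term: rule(factory_shoe) for term, rule in RULES.items()}


factory_record = {"size": 42, "colcode": 17, "style": "oxford", "last_id": "L-9"}
print(to_store_terms(factory_record))
```

Attributes without a rule, such as the hypothetical `last_id`, stay in the factory domain and never reach the store's view.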
Other Basic Operations • DIFFERENCE: material fully under local control • UNION: merging entire ontologies, typically prior intersections • [Figure label: Articulation ontology] Gio ICEIS25
Features of an algebra • Operations can be composed • Operations can be rearranged • Alternate arrangements can be evaluated • Optimization is enabled • The record of past operations can be kept and reused (experience: 3 months down to 1 week for Webster's annual update, 2 weeks for OED (6 x size)) [Jannink:01] Gio ICEIS26
Articulation ontology Matching rules that use terms from the 2 source domains Terms useful for purchasing Store Ontology Factory Ontology INTERSECTION support Gio ICEIS27
Primitive Operations: Model and Instance • Unary: Summarize - structure up; Glossarize - list terms; Filter - reduce instances; Extract - circumscription • Binary: Match - data corroboration; Difference - distance measure; Intersect - schema discovery; Blend - schema extension; ... • Constructors: create object, create set • Connectors: match object, match set • Editors: insert value, edit value, move value, delete value • Converters: object - value, object indirection, reference indirection Gio ICEIS28
Matching rules (in process) • A equals U • A equals {u, … } • A equals Union (U, V) • A equals Union (U, {v, ...}) • A equals Intersection (U, W) • A equals Intersection (U, {w, ...}) • A equals Difference (U, X) • A equals Difference (U, {x, ...}) must obey algebraic properties Gio ICEIS29
Sample Processing in HPKB • Question: What is the most recent year an OPEC member nation was on the UN security council (SC)? (A DARPA HPKB Challenge Problem) • SKC resolves 3 Sources: CIA Factbook ‘96 (nations), OPEC (members, dates), UN (SC members, years) • SKC obtains the Correct Answer: 1996 (Indonesia) • Other groups obtained more, but factually wrong, answers; they relied on one global source, the CIA factbook. • Problems resolved by SKC: • Factbook -- a secondary source -- has out of date OPEC & UN SC lists: Indonesia not listed, Gabon (left OPEC 1994) • different country names: Gambia => The Gambia • historical country names: Yugoslavia • UN lists future security council members: Gabon 1999 • needed ancillary data Gio ICEIS30
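The shape of this query can be sketched with only the facts named on the slide. This is an illustration, not the SKC query plan: the data structures, the `as_of_year` cut-off, and the rule that a nation stops counting from the year it left OPEC are assumptions.

```python
# Name articulation taken from the slide: Gambia => The Gambia.
ALIASES = {"Gambia": "The Gambia"}

# Primary-source facts named on the slide. None means no departure is recorded there.
OPEC_LEFT = {"Indonesia": None, "Gabon": 1994}
UN_SC_YEARS = {"Indonesia": [1996], "Gabon": [1999]}


def canonical(name: str) -> str:
    """Map a source-specific country name to the shared name."""
    return ALIASES.get(name, name)


def most_recent_opec_on_sc(as_of_year: int):
    """Latest (year, nation) where an OPEC member of that year sat on the SC."""
    best = None
    for raw_name, years in UN_SC_YEARS.items():
        nation = canonical(raw_name)
        if nation not in OPEC_LEFT:
            continue                      # never an OPEC member in our data
        left = OPEC_LEFT[nation]
        for year in years:
            if year > as_of_year:
                continue                  # the UN also lists future SC members
            if left is not None and year >= left:
                continue                  # no longer an OPEC member (assumed rule)
            if best is None or year > best[0]:
                best = (year, nation)
    return best


print(most_recent_opec_on_sc(as_of_year=1998))
```

Gabon's 1999 seat is rejected twice over: it lies in the future for the assumed cut-off, and Gabon had already left OPEC.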
Interoperation via Articulation At application definition time • Match relevant ontologies where needed • Establish articulation rules among them. • Record the process At execution time • Perform query rewriting to get to sources • Optimize based on the ontology algebra. For maintenance • Regenerate rules using the stored process formulation Gio ICEIS31
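The execution-time step, query rewriting, might look like the sketch below. It assumes articulation rules are stored as a mapping from an application term to the matching term in each source; the source names `transport` and `vehicle` and the rule table are hypothetical, with the term pairs borrowed from examples elsewhere in these slides.

```python
# Hypothetical articulation rules: application term -> term used by each source.
ARTICULATION = {
    "truck": {"transport": "lorry", "vehicle": "truck"},
    "owner": {"transport": "owner", "vehicle": "buyer"},
}


def rewrite(query_terms: list, source: str):
    """Translate query terms into one source's vocabulary.

    Returns (rewritten_terms, unmatched_terms); unmatched terms have no
    articulation rule for that source and cannot be sent to it.
    """
    rewritten, unmatched = [], []
    for term in query_terms:
        target = ARTICULATION.get(term, {}).get(source)
        if target is None:
            unmatched.append(term)
        else:
            rewritten.append(target)
    return rewritten, unmatched


print(rewrite(["truck", "owner", "color"], "transport"))
print(rewrite(["truck", "owner", "color"], "vehicle"))
```

Reporting the unmatched terms, rather than dropping them silently, keeps missed linkages visible to the expert who maintains the rules.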
Generation of the rules Provide library of automatic match heuristics • Lexical Methods -- spelling • Structural Methods -- relative graph position • Reasoning-based Methods • Nexus • Hybrid Methods • Iteratively, with an expert in control GUI tool to • - display matches and • - verify generated matches using the human expert • - expert can also supply matching rules Gio ICEIS32
Articulation Generator (being built by Prasenjit Mitra) • [Figure: components: Driver, Thesaurus, Context-based Word Relator, Phrase Relator, Semantic Network (Nexus), Structural Matcher; ontologies: Ont1, Ont2, OntA; Human Expert] Gio ICEIS33
Lexical Methods • Preprocessing rules. • - Expert-generated seed rules. • e.g., (Match O1.President O2.PrimeMinister) • - Context-based preprocessing directives. • Thesaurus - synonyms, generalizations yellow ochre, canary • Nexus – term relationship graph Owner = buyer • ( Distance of words as measure of relatedness ) Gio ICEIS34
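A spelling-based matcher with expert seed rules can be sketched with Python's standard `difflib`. This stands in for the lexical heuristics only; the thesaurus and nexus lookups are not modelled, and the threshold, the function names, and the sample terms (other than the seed rule quoted on the slide) are assumptions.

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_matches(terms1, terms2, seed_rules=(), threshold=0.8):
    """Return (term1, term2, score) suggestions, best first.

    Expert-generated seed rules are accepted as given with score 1.0;
    other pairs are suggested only if their spelling is similar enough.
    """
    seeded = set(seed_rules)
    suggestions = [(a, b, 1.0) for (a, b) in seed_rules]
    for a in terms1:
        for b in terms2:
            if (a, b) in seeded:
                continue
            score = spelling_similarity(a, b)
            if score >= threshold:
                suggestions.append((a, b, score))
    return sorted(suggestions, key=lambda s: -s[2])


o1_terms = ["President", "Parliament", "Colour"]
o2_terms = ["PrimeMinister", "parliament", "Color"]
seeds = [("President", "PrimeMinister")]  # (Match O1.President O2.PrimeMinister)

for a, b, score in suggest_matches(o1_terms, o2_terms, seed_rules=seeds):
    print(f"{a} ~ {b}  ({score:.2f})")
```

Spelling alone would never pair President with PrimeMinister, which is why seed rules from the expert come first.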
Tools to create articulations Graph matcher for Articulation- creating Expert Transport ontology Vehicle ontology Suggestions for articulations Gio ICEIS35
continue from initial point • Also suggest similar terms • for further articulation: • by spelling similarity, • by graph position • by term match repository • Expert response: • 1. Okay • 2. False • 3. Irrelevant • to this articulation • All results are recorded • Okay’s are converted into articulation rules Gio ICEIS36
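The expert loop on this slide can be sketched as follows. The three responses come from the slide; the rule format, the log structure, and the scripted expert used for the demonstration are hypothetical.

```python
from enum import Enum


class Response(Enum):
    OKAY = "okay"
    FALSE = "false"
    IRRELEVANT = "irrelevant"


def review(suggestions, ask_expert):
    """Show each suggested match to the expert and record every answer.

    Returns (rules, log): Okay answers become articulation rules, and the
    full log is kept so the process can be replayed during maintenance.
    """
    rules, log = [], []
    for suggestion in suggestions:
        response = ask_expert(suggestion)
        log.append((suggestion, response))
        if response is Response.OKAY:
            rules.append(("Match",) + tuple(suggestion))
    return rules, log


# Scripted stand-in for the human expert (hypothetical answers).
answers = {
    ("lorry", "truck"): Response.OKAY,
    ("trunk", "trunk"): Response.FALSE,       # luggage vs. car
    ("nail", "nail"): Response.IRRELEVANT,
}
rules, log = review(list(answers), lambda s: answers[s])
print(rules)
print(len(log))
```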
Candidate Match Nexus • Term linkages automatically extracted from 1912 Webster’s dictionary * • Based on processing headwords and definitions using algebra primitives • Notice presence of 2 domains: chemistry, transport • * free; we also have an OED-based nexus. Gio ICEIS37
Using the nexus Gio ICEIS38
Navigating the match repository Gio ICEIS39
Example: NATO Country Graphs Austria: bundestag .... Gio ICEIS40
To be matched to Great Britain parliament .... 70% of documented matches found automatically Remainder required human interaction with our tools. Gio ICEIS41
Broader Applications • Compose: a composed ontology for applications using A, B, C, E • [Figure: ontologies for resources A, B, C, D, and E; articulation ontologies for the pairs (A, B), (B, C), (C, D), and (C, E); legend: union, intersection] Gio ICEIS42
Exploiting the result Future work Result has links to source Processing & query evaluation is best performed within Source Domains & by their engines Gio ICEIS43
Architectural development • [Figure, over time; layers: applications (A1 to A6), integrators (I1, I2), mediators (M1, M2), network middleware, wrappers (W1 to W3), datasources (D1 to D6)] Gio ICEIS44
decision-makers at workstations value-added services data and simulation resources Transform Data to Information Application Layer Mediation Layer Foundation Layer Gio ICEIS45
Application Interface Changes of user needs Owner / Creator Maintainer Lessor - Seller Advertiser Resource Interfaces Mediation and maintenance Domain ontology changes Software & People Models, programs, rules, caches, . . . Resource functional & ontological changes Tools can help, but changes have to be dealt with rapidly. Automated learning is typically too slow, requires many instances Gio ICEIS46
Empowerment autonomously maintainable Domain Specialization • Knowledge Acquisition (20% effort) & • Knowledge Maintenance (80% effort *) to be performed by • Domain specialists • Professional organizations • Field teams of modest size * based on experience with software Gio ICEIS47
SKC Synopsis • Research Objective: • Precise answers from heterogeneous, imperfect, scalably many data sources • Sources for Ontologies: • General: CIA World Factbook ‘96, UN-www, OPEC-www Webster’s Dictionary, Thesaurus, Oxford English Dictionary • Topical: NATO, BattleSpace Sensors, Logistics Servers • Theory: • Rule-based algebra over ontologies • Translation & Composition primitives • Sponsor and collaboration • AFOSR; DARPA DAML program; W3C; Stanford KSL and SMI; Univ. of Karlsruhe, Germany; others. Gio ICEIS48
Innovation in SKC • No need to harmonize full ontologies • Focus on what is critical for interoperation • Rules specific for articulation • Tools for creation and maintenance • Maintenance is distributed • to n sources • to m articulation agents in mediators • Potentially many sets of articulation rules: m < n^2, depends on semantic architecture density • a research question: density Gio ICEIS49
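The bound on m can be checked with simple counting. This sketch assumes one articulation per unordered pair of sources, which gives at most n(n-1)/2 sets of rules and so stays below n^2; the `density` ratio is a hypothetical way to quantify what the slide leaves as a research question.

```python
def max_pairwise_articulations(n_sources: int) -> int:
    """Upper bound on articulations if every pair of sources is linked once."""
    return n_sources * (n_sources - 1) // 2


def density(m_articulations: int, n_sources: int) -> float:
    """Share of possible source pairs that actually have articulation rules."""
    possible = max_pairwise_articulations(n_sources)
    return m_articulations / possible if possible else 0.0


# Hypothetical setting: 5 sources linked by 4 articulations.
n, m = 5, 4
print(max_pairwise_articulations(n), n ** 2, density(m, n))
```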
Conclusion • High precision is important for enterprise applications • cost of overload versus opportunity loss • Semantic differences cause problems • Today solved by human intermediate experts • Will need automation support • Tools so that expert knowledge is captured and maintainable • Scalability requires a thorough foundation • Algebra provides composition, formal basis, delegation • Formal composition supports maintenance • Delegation of responsibility and authority enhances quality • Many research tasks left Gio ICEIS50