Semantic Interoperability in Infocosm: Beyond Infrastructural and Data Interoperability in Federated Information Systems
Keynote Talk, International Conference on Interoperating Geographic Systems (Interop’97), Santa Barbara, December 3-4, 1997
Amit Sheth, Large Scale Distributed Information Systems Lab, University of Georgia
http://lsdis.cs.uga.edu
Thanks: Vipul Kashyap, Kshitij Shah
Three perspectives • Information Integration Perspective: Distribution, Heterogeneity, Autonomy • Information Brokering Perspective: Data, Metadata, Semantic (Terminological, Contextual) • “Vision” Perspective: Connectivity+Computation, Information, Knowledge
Evolving targets and approaches in integrating data and information: a personal perspective • Timeline figure covering Generation 1, Generation 2 and Generation 3 (Infocosm); labels listed from latest to earliest: InfoQuilt; 1997; Digital Library Projects, ...; VisualHarness; InfoHarness; Infoscopes, HERMES, SIMS, ...; TSIMMIS, Harvest, RUFUS, ...; 1990; Mermaid; DDTS; Multibase, MRDSM, ADDS, IISS, Omnibase, ...; Early 80s
Generation I • Data recognized as corporate resource -- leverage it! • Most data in structured databases (and the rest in files), different data models, transitioning from Network and Hierarchical to Relational DBMSs • Connectivity/access -- a major issue • Heterogeneity (system, modeling and schematic) as well as need to support autonomy posed main challenges • Support for corporate IS applications as the primary objective, update often required, data integrity important
Generation II • Significant improvements in computing and connectivity (standardization of protocol, public network, Internet/Web); remote data access as given • Increasing diversity in data formats, with focus on variety of textual data and semi-structured documents (and lesser focus on structured data) • Many more data sources, diverse domains, but not necessarily better understanding of data • Use of data beyond traditional business applications -- mining + warehousing, marketing, commerce
Generation II • Query only, little attention to updates; extensive use of IR techniques • Focus shift from data to metadata; earlier, distribution applied to data only, now it also applies to metadata • Wrapper part of Mediator Architecture*, Metadata component of Information Brokering Architecture • Early work on ontology support * Gio Wiederhold
Generation III • Increasing information overload • Changes in Web architecture: push,… • Broader variety of content with increasing amount of visual information • Continued standardization related to Web for representational and metadata issues (MCF, RDF, XML) and distributed computing (CORBA, Java) • Not just metadata, logical correlation • Users demand simplicity, but complexities continue to rise
Generation III (contd) • Broader variety of users and applications; well beyond business and scientific uses (e.g., focused marketing-- more than information on the web) • Not just data access, but decision support through “data mining and information discovery, information fusion, information dissemination, knowledge creation and management”, “information management complemented by cooperation between the information system and humans”
Dimensions for interoperability and integration: Perspective used for Federated Databases • Distribution • Heterogeneity • Autonomy
FDBS: Schema Architecture • Schema levels, top to bottom: External Schemas; Federated Schema; Export Schemas; Component Schemas; Local Schemas; Component DBSs • Model Heterogeneity: Common/Canonical Data Model, Schema Translation • Information Sharing while preserving Autonomy: schema integration, schema translation
Heterogeneity in FDBMSs • Database System: Semantic Heterogeneity; Differences in DBMS: data models (abstractions, constraints, query languages), system level support (concurrency control, commit, recovery) • 1980s • Operating System: file system; naming, file types, operation; transaction support; IPC • Communication • 1970s • Hardware/System: instruction set; data representation/coding; configuration
Characterization of Schematic Conflicts in Multidatabase Systems (Sheth & Kashyap, Kim & Seo) • Schematic Conflicts, by category: • Domain Definition Incompatibility: Naming Conflicts, Data Representation Conflicts, Data Scaling Conflicts, Data Precision Conflicts, Default Value Conflicts, Attribute Integrity Constraint Conflicts • Entity Definition Incompatibility: Naming Conflicts, Database Identifier Conflicts, Schema Isomorphism Conflicts, Missing Data Items Conflicts • Data Value Incompatibility: Known Inconsistency, Temporal Inconsistency, Acceptable Inconsistency • Abstraction Level Incompatibility: Generalization Conflicts, Aggregation Conflicts • Schematic Discrepancies: Data Value Attribute Conflict, Entity Attribute Conflict, Data Value Entity Conflict
Observations and Lessons Learnt • “tightly coupled” vs “loosely coupled” debate • “good common data model” debate • “tightly coupled” harder to build, but can give better control over data sharing, provide more transparent access, and can possibly support update; lessons learned in schema integration can be reapplied in newer situations • “loosely coupled” more flexible, but generally require more user involvement
Retracing the path without learning from past expeditions • Steps for transitioning from Data Marts to Warehouses (PC Week, November 24, 1997): • Create consistent dimensions in the data marts • Create a data warehouse data model and convert data marts to it • Go back and build an enterprise data warehouse, then convert data marts to the new common data model and architectures • The above is doomed to repeat past mistakes. Integrating metadata is not easy!
Generation 1 concern: So far (schematically), yet so near (semantically)! Generation 3 concern: So near (schematically), yet so far (semantically)!
Information Brokering: A Three-Level Approach • Ontology: Semantic (Domain, Application specific) • Metadata: Content (content descriptions, intensional) • Data: Representation (heterogeneous types, media) • Relationships between levels: used-by, abstracted-into • Directions: Top Down, Bottom Up • Emphasis from Gen.I to Gen.III
An Architecture for Information Brokering • User Query/Information Request • INFORMATION BROKERING, three levels: • Vocabulary Brokering: Inter-Vocabulary Relationships Manager, Vocabulary Brokers • Metadata Brokering: Metadata Brokers, Metadata Repositories, Metadata Systems • Data Brokering (CORBA, HTTP, IIOP): DATA REPOSITORIES • Information System 1 ... Information System N
Generation 2: Limited Types of Metadata, Extractors, Mappers, Wrappers
Generation 2 • Figure labels: DB, Nexis, UPI, AP; EXTRACTORS; METADATA; Global/Enterprise Web Repositories
Gen.2: Junglee • Data Integration: Internet; Extractor (Extraction Rules); Text Mapper (Mapping Rules); Wrappers (SDL Description) • RDBMS • Data Publishing: IDT; Publisher (Publishing Rule) • Application
Find Marketing Manager positions in a company that is within 15 miles of San Francisco and whose stock price has been growing at a rate of at least 25% per year over the last three years Junglee, SIGMOD Record, Dec. 1997
Extractors • can automatically identify data/media type • can be extended at any time (pre-specified or parameterized routines) • can run at data source, metadata storage site or at IQ server • can run at pre-specified times or events, or on demand • can route metadata to appropriate metadatabase repositories • Extractors use agent & networking computing (NC) technologies and are implemented in PERL/ Java
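The extractor behaviour listed above (identify the data/media type, run on demand, route metadata to an appropriate repository) can be sketched roughly as follows. This is an illustrative Python sketch only, not the project's PERL/Java extractors; the function names `extract_metadata` and `route`, the record layout, and the repository keys are all hypothetical, and the media type is guessed from the file extension alone.

```python
import mimetypes
import os
from collections import defaultdict

def extract_metadata(path):
    """Build a small metadata record for one object (hypothetical record layout)."""
    # Assumption: the media type is guessed from the file extension only.
    media_type, _ = mimetypes.guess_type(path)
    record = {
        "location": path,
        "media_type": media_type or "application/octet-stream",
    }
    # Size is recorded only when the file is actually present.
    if os.path.exists(path):
        record["size"] = os.path.getsize(path)
    return record

def route(record, repositories):
    """Store the record in the repository for its top-level media type."""
    family = record["media_type"].split("/")[0]
    repositories[family].append(record)
    return family

# On-demand run over a few example names (made-up data).
repositories = defaultdict(list)
for name in ["waterfall.gif", "report.html", "notes.txt"]:
    route(extract_metadata(name), repositories)
```

Running the same functions from a scheduler or an event hook would cover the "pre-specified times or events" case; only the trigger changes, not the extractor.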
A Classification of Metadata • Content Independent Metadata, e.g. creation-date, location, ... • Content Dependent Metadata, e.g. size, number of colors in an image • Content-(directly)based Metadata, e.g. inverted lists, doc vectors • Content-descriptive Metadata • Domain Independent (structural) Metadata, e.g. parse tree of a C++ program, HTML/SGML DTDs • Domain Specific Metadata, e.g. scale, coordinate, land-cover, relief (GIS Domain); area, population (Census Domain); concept descriptions from Domain Specific Ontologies • Move in this direction to tackle information overload !!
Query Processing and Information Requests • traditional queries based on keywords • attribute-based queries • content-based queries • 'high-level' information requests involving ontology-based, iconic, mixed-media, and media-independent information requests • user selected ontology, use of profile • Range shown: Generation 2 to Generation 3 • E.g., Kabila’s political activities (in all media)
VisualHarness: Metadata for combined access • Figure labels: User Query; Image Data; Extraction; VIR (Structure, Color, Comp, Texture); VH (Other Attributes); Null Image; Results
WWW • A confusing heterogeneity of media, formats (Tower of Babel) • Information correlation using physical (HREF) links at the extensional data level • Location dependent browsing of information using physical (HREF) links => User has to keep track of information content !!
WWW + Information Brokering • Domain Specific Ontologies as “semantic conceptual views” • Information correlation using concept mappings at the intensional concept level • Browsing of information using terminological relationships across ontologies => Higher level of abstraction, closer to user view of information !!
Ontologies for semantic interchange • Need for “transcending” local subject areas/domains => Design adaptable systems which “adapt/adjust” themselves in the face of vocabularies from different domains • Coordination and interrelation of models across domains. One approach => utilize terminological relationships across concepts in ontologies • Specification languages for ontologies: Description Logics, Rule-based Languages • Support for mechanisms for Coordination and Correlation, viz., representation and reasoning with terminological relationships
The InfoQuilt Project http://lsdis.cs.uga.edu/infoquilt
MREF: Metadata Reference Link -- complementing HREF • Creating “logical web” through Media Independent Metadata based Correlation
Metadata Reference Link (<A MREF …>) • <A HREF=“URL”>Document Description</A> physical link between document (components) • <A MREF KEYWORDS=<list-of-keywords>; THRESH=<real>>Document Description</A> • <A MREF ATTRIBUTES(<list-of-attribute-value-pairs>)>Document Description</A> • <A MREF(<parameterized_routine(….)> Document Description</A>
Correlation based on Content-descriptive Metadata • waterfall.gif (Data) • Content Descriptive Metadata: “Marina wonderland. You are seeing the nature’s beauty of marina wonderland situated in the coastal region of the southern part of India. It consists of huge mountains and water flowing in between the mountains.” • Full Text Indexing: WAIS, LSI, Glimpse, SMART, … • Example: Some interesting <A MREF KEYWORDS=“scenic waterfall mountain”; THRESH = 0.9>information on scenic waterfalls</A> is available here.
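A keyword MREF of this kind could be resolved roughly as sketched below. The slides do not say how THRESH is compared against a document, so this Python sketch assumes a very simple rule: the score is the fraction of query keywords found in the content-descriptive text. The names `keyword_score` and `resolve_mref` and the sample descriptions are hypothetical; a real system would delegate scoring to a full-text engine such as those named on the slide.

```python
import re

def keyword_score(keywords, description):
    """Fraction of query keywords that occur in the description (assumed scoring rule)."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    wanted = keywords.lower().split()
    if not wanted:
        return 0.0
    return sum(1 for k in wanted if k in words) / len(wanted)

def resolve_mref(keywords, thresh, metadata):
    """Return names of objects whose descriptive metadata scores at least thresh."""
    return [name for name, text in metadata.items()
            if keyword_score(keywords, text) >= thresh]

# Made-up content-descriptive metadata, keyed by object name.
metadata = {
    "waterfall.gif": "Scenic waterfall between huge mountain ranges on the coast",
    "harbor.gif": "Fishing boats in a harbor at sunset",
}

# Mirrors <A MREF KEYWORDS="scenic waterfall mountain"; THRESH = 0.9>
matches = resolve_mref("scenic waterfall mountain", 0.9, metadata)
```

Because the link is resolved against metadata at request time, newly registered objects that match the keywords appear without editing the referring document, which is the contrast with a fixed HREF.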
Correlation based on Content-based Metadata • waterflow.gif (Data) • Metadata Storage: waterflow.gif, ……gif, ……ppm • Content Dependent Metadata: height, width and size • Content based Metadata: Major component (RGB): Blue • Example: Some interesting <A MREF KEYWORDS= “scenic waterfalls”; THRESH = 0.9; ATTRIBUTES (major-color = ‘blue’)>information on scenic waterfalls</A> is available here.
Metadata, Domain Specific Ontologies • Example request: Get the titles, authors, documents, maps published by the United States Geological Survey (USGS) about regions having a population greater than 5000, area greater than 1000 acres having a low density urban area land cover • domain specific metadata: terms chosen from domain specific ontologies • What is Metadata? - data/information about data - useful/derived properties of media - properties/relationships between objects • What are Ontologies? - collection of terms, definitions and their interrelationships - specification of a representational vocabulary for a shared domain of discourse
Repositories and the Media Types • Census DB: Population, Area; Regions (SQL) • TIGER/Line DB: Boundaries • Image/Map DB: Land cover, Relief; Image Features (image processing routines)
Domain Specific Correlation • Potential locations for a future shopping mall identified by all regions having a population greater than 500 and area greater than 50 sq ft having an urban land cover and moderate relief <A MREF ATTRIBUTES(population > 500; area > 50; region-type = ‘block’; land-cover = ‘urban’; relief = ‘moderate’)>can be viewed here</A> • => media-independent relationships between domain specific metadata: population, area, land cover, relief • => correlation between image and structured data at a higher domain specific level as opposed to physical “link-chasing” in the WWW
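The attribute-based MREF above can be read as a conjunction of constraints over domain specific metadata. A minimal Python sketch of that evaluation follows, assuming the metadata from the different repositories has already been merged into one record per region; the `satisfies` helper, the record layout, and the sample regions are hypothetical.

```python
import operator

# Comparison operators that appear in the ATTRIBUTES(...) example.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def satisfies(record, constraints):
    """True if the record meets every (attribute, op, value) constraint."""
    return all(
        attr in record and OPS[op](record[attr], value)
        for attr, op, value in constraints
    )

# Mirrors ATTRIBUTES(population > 500; area > 50; region-type = 'block';
#                    land-cover = 'urban'; relief = 'moderate')
constraints = [
    ("population", ">", 500),
    ("area", ">", 50),
    ("region-type", "=", "block"),
    ("land-cover", "=", "urban"),
    ("relief", "=", "moderate"),
]

# Made-up records: population/area as if from census data,
# land-cover/relief as if derived from image analysis.
regions = [
    {"id": "r1", "population": 1200, "area": 80, "region-type": "block",
     "land-cover": "urban", "relief": "moderate"},
    {"id": "r2", "population": 300, "area": 90, "region-type": "block",
     "land-cover": "urban", "relief": "moderate"},
    {"id": "r3", "population": 900, "area": 70, "region-type": "block",
     "land-cover": "forest", "relief": "steep"},
]

candidates = [r["id"] for r in regions if satisfies(r, constraints)]
```

The point of the sketch is that the filter never mentions which repository or media type a value came from, which is what makes the correlation media-independent.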
InfoQuilt Architecture (partial) • Media Independent Information Requests [Browsing Collections, Keyword-based queries, Attribute-based queries] • IQR: Metadata & Domain Knowledge Repository and Registry: Domain Knowledge, KnowledgeBase, Parameterized Routines, Attr. Metadata, Indices (loc, type, author) • Correlation Server • InfoQuilt Server; Other InfoQuilt Servers ... • Media and Domain specific Extractor Agents • Wrappers over Text, Image, Audio, Video media repositories
What next (after comprehensive use of metadata) ? • Context, context, context • Semantic Proximity • domain • context • modeling/abstraction/representation • state • Characterizing Loss of Information incurred due to differences in vocabulary BIG challenge: identifying relationship or similarity between objects of different media, developed and managed by different persons and systems
A Semantic Taxonomy • Semantic Proximity: Semantic Incompatibility; Semantic Resemblance; Semantic Relevance; Semantic Relationship; Semantic Equivalence
Tools to support semantics • profiles • ontologies • context • domain-specific metadata
Figure labels (layered view): Decision; Knowledge; Information; Cooperation; Data; Interoperability; Computing; Communication; Connectivity and Data Access
Interoperability in the ’80s • System level interoperability like TCP/IP. Standard communication channels, data exchange formats, etc. Basic infrastructural work for higher level interoperability. • Technologies shown: HTTP, IIOP, TCP/IP • Figure labels: Decision; Knowledge; Information; Cooperation; Data; Computing; Interoperability; Communication; Connectivity
Interoperability in the ’90s • Information level interoperability. Standards evolve that go beyond connectivity and define information standards. Systems start exchanging metadata (MCF, RDF, ..). • Technologies shown: Business Objects, CORBA, DCOM, EDI • Figure labels: Decision; Knowledge; Information; Cooperation; Data; Computing; Interoperability; Communication; Connectivity
Where we are headed • Semantic interoperability where systems share ontologies and knowledge. • Systems and humans can cooperate in decision making and can generate new knowledge as a collective entity. • Figure labels: Knowledge; Information; Cooperation; Data; Computing; Interoperability; Communication; Connectivity