PhD Programme in Information Engineering, XVI doctoral cycle - II cycle, New Series. From Data to Information: the MOMIS system. Dr. Eng. Francesco Guerra, advisor: Prof. Sonia Bergamaschi.
Outline • Intelligent Integration of Information • Matching • The MOMIS system • MOMIS in the Semantic Web • MOMIS as the basis of a virtual marketplace • MOMIS to manage collaborative processes (the WINK project) • MOMIS as a semantic search engine (the SEWASIE project)
Intelligent Integration of Information • Distinguishing elements: • Kinds of managed sources • The Global-as-View vs. the Local-as-View approach • Data Model • Building the Global View • Querying the Global View • Description Logics techniques • Updating the Global View
Matching comparison • Distinguishing elements: • Different kinds of mapping representation (granularity, cardinality) • Mapping extraction (structure and instance analysis, lexical analysis, exploitation of external tools)
Matching comparison • Extended from: E. Rahm and P. A. Bernstein, A survey of approaches to automatic schema matching, VLDB Journal, 10(4):334-350, 2001
The MOMIS System • MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework for information extraction and integration from both structured and semistructured data sources. • For information extraction, MOMIS introduces ODL-I3, an object-oriented language with an underlying Description Logic, derived from the ODMG standard. Information integration is then performed semi-automatically, by exploiting the knowledge in a Common Thesaurus and the ODL-I3 descriptions of the source schemas with a combination of clustering techniques and Description Logics. This integration process yields a virtual integrated view of the underlying sources (the Global Virtual View), for which mapping rules and integrity constraints are specified to handle heterogeneity. • The MOMIS system, based on a conventional wrapper/mediator architecture, provides methods and open tools for data management in Internet-based information systems through a CORBA-2 interface. MOMIS was developed as a joint collaboration between the University of Modena and Reggio Emilia and the Universities of Milano and Brescia.
The MOMIS System • Distributed information is stored in multiple, heterogeneous sources • Source integration provides a Global Schema (a virtual view) • The Global Schema lets the user send a query and transparently get a unified answer from all the involved sources • All information is available at http://www.dbgroup.unimo.it • Projects: INTERDATA (1999-2000); D2I (from Data to Information) (2001-2002), “Programmi di ricerca scientifica di rilevante interesse nazionale” (scientific research programmes of significant national interest); WINK (Web-linked Integration of Network-based Knowledge) (2002-2003); SEWASIE (Semantic Webs and AgentS in Integrated Economies) (2002-2005)
Local sources annotation • The integration designer has to manually choose the appropriate WordNet (www.cogsci.princeton.edu/~wn/) meaning for each element of the conceptual schema provided by the wrappers. • Motivations for the annotation: • Exploiting the semantics associated with the names of the schemas/structures of the information sources • Having a well-known meaning for each term of the sources • The annotation phase is composed of two steps: • Word form choice. The WordNet morphological processor aids the designer by suggesting a word form corresponding to the given term. • Meaning choice. The designer can map an element to zero, one or more senses. The designer can choose a sense among those existing in WordNet and can also add new senses to the DB.
Global Virtual View annotation • The GVV has to be annotated to become “exportable knowledge”. • Annotating a GVV means providing Global Classes with a name and with meanings. • Starting from the annotations of the local sources and the mappings between the GVV and the local ontologies, we have developed a semi-automatic methodology to generate the annotations of the GVV.
GVV annotation • A global class: GlobalClass1, grouping the local classes CS.Essay, CS.Publication and UNI.Article • Annotated local classes: CS.Essay = <essay, {essay#1}>; CS.Publication = <publication, {publication#2}>; UNI.Article = <article, {article#1}> • WordNet meanings: essay#1 = an analytic or interpretive literary composition; publication#2 = a copy of a printed work offered for distribution; article#1 = nonfictional prose forming an independent part of a publication • The annotated global class: GlobalClass1 = <publication, {essay#1, publication#2, article#1}> (name: the broadest meaning, selected through the CT relationships; meanings: those of the local classes) • Broadest local classes: BLC_GC = {LC ∈ GC | ∄ y ∈ GC, (LC NT y) ∨ (y BT LC)}
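The annotation step on this slide can be illustrated with a minimal sketch. It assumes that the global class name is taken from the local classes that have no broader class inside the same global class, and that the meanings are the union of the local meanings. The two NT relationships in `ct_relationships` are hypothetical values chosen for illustration; they are not stated on the slide, and the function names are not part of MOMIS.

```python
# Annotated local classes from the slide: class -> (word form, WordNet senses).
local_annotations = {
    "CS.Essay": ("essay", {"essay#1"}),
    "CS.Publication": ("publication", {"publication#2"}),
    "UNI.Article": ("article", {"article#1"}),
}

# ASSUMPTION: these Common Thesaurus relationships are illustrative only.
ct_relationships = [
    ("CS.Essay", "NT", "CS.Publication"),
    ("UNI.Article", "NT", "CS.Publication"),
]


def broadest_local_classes(global_class, relationships):
    """Local classes of the global class with no broader class inside it."""
    members = set(global_class)
    narrower = set()
    for left, rel, right in relationships:
        if left in members and right in members:
            if rel == "NT":      # left is narrower than right
                narrower.add(left)
            elif rel == "BT":    # left is broader than right
                narrower.add(right)
    return members - narrower


def annotate_global_class(global_class, annotations, relationships):
    """Return (candidate names, meanings) for the global class."""
    broadest = sorted(broadest_local_classes(global_class, relationships))
    names = [annotations[lc][0] for lc in broadest]
    meanings = set().union(*(annotations[lc][1] for lc in global_class))
    return names, meanings


names, meanings = annotate_global_class(
    ["CS.Essay", "CS.Publication", "UNI.Article"],
    local_annotations,
    ct_relationships,
)
```

When several broadest classes remain, the sketch returns all their word forms as candidate names and leaves the final choice to the designer.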
Updating the GVV • An existing GVV can change: • By adding a new source to the system • By updating the schema of an existing data source • By deleting a previously integrated source • Adding a new source: two possible scenarios • Integration from scratch: the integration process is applied again; in this case only the Common Thesaurus of the previous GVV can be exploited. • Integration with the GVV: the process exploits the “automatically annotated” GVV and the Common Thesaurus.
Adding a new source (figure) • Inputs: the schemas of the new sources (OODB, RDB, XML) described in ODLI3, and the annotated GVV • Common Thesaurus: • intra/inter-schema relationships (only new sources) • lexicon relationships (annotated GVV and new sources) • relationships inserted by the user • inferred relationships • Cluster generation produces the global classes (GlobalClass1, GlobalClass2, GlobalClass3) of the new GVV, with the global schema/local schema mapping
Adding a new source • Three scenarios: • A new global class is composed of only one old global class and one or more new local classes • A global class of the new integrated schema is composed of only new local classes • A global class of the new integrated schema is composed of more than one global class of the old GVV and at least one local class of the new source
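The three scenarios can be told apart by counting how many old global classes and how many new local classes fall into a cluster. The sketch below is only an illustration of that case analysis; the function name, the returned labels and the class names in the usage lines are hypothetical.

```python
def classify_new_global_class(members, old_global_classes, new_local_classes):
    """Say which of the three scenarios a cluster of the new GVV falls into."""
    old = [m for m in members if m in old_global_classes]
    new = [m for m in members if m in new_local_classes]
    if len(old) == 1 and new:
        return "one old global class plus new local classes"
    if not old and new:
        return "only new local classes"
    if len(old) > 1 and new:
        return "several old global classes plus new local classes"
    return "not covered by the three scenarios"


# Hypothetical names, used only to exercise the function.
old_gvv = {"GlobalClass1", "GlobalClass2"}
new_source = {"NEW.Paper", "NEW.Author"}

case_a = classify_new_global_class({"GlobalClass1", "NEW.Paper"}, old_gvv, new_source)
case_b = classify_new_global_class({"NEW.Author"}, old_gvv, new_source)
case_c = classify_new_global_class(
    {"GlobalClass1", "GlobalClass2", "NEW.Paper"}, old_gvv, new_source
)
```

A cluster containing a single old global class and no new local class matches none of the three scenarios, so the sketch reports it separately.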
GVV- integrated ontology • A GVV may be thought of as a domain ontology for the integrated sources; the usual approach in the Semantic Web is instead based on the “a priori” existence of an ontology connected to the sources by means of semantic markup • Figure labels: MOMIS, Ontology Builder, Ontology; Semantic Web, Ontology
GVV- integrated ontology • The MOMIS ontology is composed of the following components: • Global Virtual View • Mapping Rules • Integrity constraint rules • Intensional and extensional inter and intra-schema relationships (Common Thesaurus) • We express the ontology by using the ODLI3 language or an OWL file.
Using the MOMIS system • The MOMIS system was exploited: • To create a virtual marketplace • To support collaborative processes within the European WINK project • To build an advanced semantic search engine within the European SEWASIE project (under development)
SEWASIE • SEWASIE (SEmantic Webs and AgentS in Integrated Economies) is a research project funded by the EU under the Semantic Web action line (May 2002 - April 2005) • The consortium: • Università degli Studi di Modena e Reggio Emilia (ITALY) • CNA SERVIZI Modena s.c.a.r.l. (ITALY) • Università degli Studi di Roma “La Sapienza” (ITALY) • Rheinisch Westfaelische Technische Hochschule Aachen (GERMANY) • Libera Università di Bolzano (ITALY) • Thinking Networks AG (GERMANY) • IBM Italia SPA (ITALY) • Fraunhofer-Gesellschaft Institut Angewandte Informationstechnik (GERMANY)
SEWASIE Objectives • The SEWASIE project aims to develop an advanced search engine enabling intelligent access to heterogeneous data sources on the web, via semantic enrichment, to provide the basis for structured web-based communication. The project pursues the following aims: • To develop a secure, scalable and distributed agent-based system architecture for semantic search (based on ontologies) and for structured web-based communication. • To develop a general framework for query management and information reconciliation based on semantically enriched data and a trusted agent structure. • To develop an information brokering component which includes methods for collecting, contextualizing and visualizing data. • To provide the end user with an efficient interface for formulating queries through a graphical representation and for intelligent navigation through the semantically enriched information space.
The SEWASIE architecture • The SEWASIE system realizes a virtual network, the SEWASIE Virtual Network (SVN), whose nodes are SEWASIE Information Nodes (SINodes): multi-database, mediator-based systems, each including a Virtual Data Store, an Ontology Builder and a Query Manager. • Brokering Agents maintain the knowledge related to the SEWASIE Virtual Network and the user profiles. • In the query-solving phase, starting from a specified SINode, a Query Agent accesses other SINodes and thus collects partial answers. • To select the SINodes useful to solve a query, a Query Agent interacts with one or more Brokering Agents.
The SEWASIE architecture (figure) • Layers: the user interface layer, the intermediaries layer and the information layer, connected by the SEWASIE interconnection infrastructure • Components shown: User Interface (Query Interface, Metadata Interface, Monitoring Interface, Communication Interface), Visualisation, OLAP Tool and OLAP Reports, User Profile, Monitor Profiles, Communication Tool, Communication Agent, Monitoring Agent (MA), Query Agents, Brokering Agents (BA) with ontology maps, metadata repositories and ontologies, SINodes (Virtual Data Store, Query Manager, Ontology Builder), wrappers with semantic enrichment (including HTML→XML wrapping) • Sources: structured databases (RDBs), semi-structured data (XML), unstructured data (HTML, text documents)
Future Work • Ontology evolution within an SINode • Update of existing sources • Deletion of previously integrated sources • Extending WordNet • If a source description element has no corresponding concept in WordNet, the designer may add a new meaning and the proper relationships connecting it to existing meanings. • Multilingual functionalities • SEWASIE multilingual technologies will allow users to share information and resources available all over the world while preserving their original local qualities. • Enrichment of the multilingual lexicon ontology with the aid of statistical analysis techniques for multilingual text corpora (for example, techniques for the generation of multilingual dictionaries).
Global Instance Computation • To define a Global Class we have to define the following elements: • Mapping Table: defines the mapping between the global class attributes and the local class attributes • Join condition: we assume that there is a Join Condition between each pair of overlapping relations, to identify the tuples corresponding to the same object and fuse them • Full disjunction: the GC contains a unique tuple resulting from the merge of all the different tuples representing the same real-world object.
Global Instance Computation • S(l1) = (firstn, lastn, year, e_mail) • S(l2) = (name, e_mail, dept_code, s_code) • Two functions: • Global function: renames the attributes of the local classes into attributes of the global class • Local function: converts a tuple of elements of a local class by suitable functions, such as string concatenations ...
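As a rough illustration of the two functions, the sketch below applies a local function to a tuple of l1 (concatenating firstn and lastn into a single name) and then a global function that renames local attributes into global ones. The global attribute names (`Name`, `Year`, `E_mail`, `Dept_code`, `Section_code`) and the sample tuple are assumptions made for the example; the slide does not list them.

```python
def local_function_l1(t):
    """Local function for l1: build one name by string concatenation."""
    return {
        "name": f"{t['firstn']} {t['lastn']}",
        "year": t["year"],
        "e_mail": t["e_mail"],
    }


# ASSUMPTION: hypothetical global attribute names for each local class.
GLOBAL_RENAMING = {
    "l1": {"name": "Name", "year": "Year", "e_mail": "E_mail"},
    "l2": {
        "name": "Name",
        "e_mail": "E_mail",
        "dept_code": "Dept_code",
        "s_code": "Section_code",
    },
}


def global_function(source, t):
    """Global function: rename local attributes into global attributes."""
    renaming = GLOBAL_RENAMING[source]
    return {renaming[attr]: value for attr, value in t.items()}


# Made-up tuple of l1 = (firstn, lastn, year, e_mail).
l1_tuple = {"firstn": "Ada", "lastn": "Rossi", "year": 2, "e_mail": "ada@example.org"}
global_tuple = global_function("l1", local_function_l1(l1_tuple))
```

Tuples of l2 already carry a single name attribute, so in this sketch they only need the renaming step.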
Global Instance Computation • Semantic Homogeneity property condition • Figure: two relations, each with its Join Attribute, combined by Full Disjunction
Global Instance Computation • Semantic Homogeneity property condition not verified: • Resolution functions: • Random • Priority • User defined function
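The following sketch shows how resolution functions might plug into the computation of global tuples. It covers only the simple case in which all local relations share one join attribute, so that the full disjunction reduces to grouping tuples by that attribute, and it assumes at most one tuple per source for each join value. The function names, the global attribute names and the sample data are hypothetical.

```python
import random


def resolve_priority(values, priority):
    """Priority resolution: take the value of the first source in `priority`."""
    for source in priority:
        if source in values:
            return values[source]
    raise ValueError("no value from a prioritized source")


def resolve_random(values, rng=random):
    """Random resolution: pick any of the conflicting values."""
    return rng.choice(list(values.values()))


def full_disjunction(relations, join_attr, resolve):
    """Merge tuples with the same join value; missing attributes become None."""
    all_attrs = []
    candidates = {}  # join value -> attribute -> {source: value}
    for source, tuples in relations.items():
        for t in tuples:
            per_object = candidates.setdefault(t[join_attr], {})
            for attr, value in t.items():
                if attr not in all_attrs:
                    all_attrs.append(attr)
                per_object.setdefault(attr, {})[source] = value
    result = []
    for per_object in candidates.values():
        merged = {}
        for attr in all_attrs:
            values = per_object.get(attr, {})
            distinct = set(values.values())
            if not distinct:
                merged[attr] = None
            elif len(distinct) == 1:
                merged[attr] = distinct.pop()
            else:  # conflicting values: call the resolution function
                merged[attr] = resolve(attr, values)
        result.append(merged)
    return result


# Made-up tuples, already expressed with global attribute names.
relations = {
    "l1": [{"Name": "Ada Rossi", "Year": 2, "E_mail": "ada@example.org"}],
    "l2": [
        {"Name": "A. Rossi", "E_mail": "ada@example.org", "Dept_code": "D1"},
        {"Name": "Bob Bianchi", "E_mail": "bob@example.org", "Dept_code": "D2"},
    ],
}
global_instance = full_disjunction(
    relations,
    "E_mail",
    lambda attr, values: resolve_priority(values, ["l2", "l1"]),
)
```

A user-defined resolution function fits the same `resolve(attr, values)` signature, for example one that keeps the longest string for `Name`.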
Example • University source (relational): Department(dept_code, dept_name, budget); Research_Staff(name, e_mail, dept_code, s_code), FK dept_code REF Department, s_code REF Section; School_Member(name, school, year, e_mail); Section(s_code, section_name, length, room_code), FK room_code REF Department, s_code REF Room; Room(room_code, seats_number, notes) • Tax_Position source (XML): <!ELEMENT ListOfStudent (Student*)> <!ELEMENT Student (name,s_code,school_name,e_mail,tax_fee)> <!ELEMENT name (#PCDATA)>
Example • Computer_Science source (object): CS_Person(first_name, last_name); Professor: CS_Person(belongs_to: Division, rank); Student: CS_Person(year, takes: set<Course>, rank, e_mail); Division(description, address: Location); Location(city, street, number, country); Course(course_name, taught_by: Professor)
Common Thesaurus (Domain Ontology) • A set of terminological relationships between class and attribute names (terms), expressing both intra-schema and inter-schema knowledge • Relationships added to the Common Thesaurus: (1) schema-derived (2) lexicon-derived (3) designer-supplied (4) inferred, exploiting ODB-Tools capabilities
Schema-derived relationships • Terminological and extensional intra-schema relationships • RT relationships derived from foreign keys in a relational schema: UNI.Section RT UNI.Department • BT/NT relationships derived from inheritance relationships in an object-oriented schema or from integrity constraints in a relational schema: CS.Student NT CS.CS_Person; CS.Professor NT CS.CS_Person
Lexicon-derived relationships • Extracted from the WordNet lexical database (Princeton University) • 129625 lemmas organized in 99759 synonym sets (synsets) • Synonymy, polysemy • Examples: Tax_position_xml.Student.name SYN University.School_member.name; CS.Professor NT CS.CS_Person
Inferred relationships • By exploiting Description Logics techniques (ODB-Tools system), a new set of terminological relationships is inferred • Example: University.Research_Staff RT CS.Course
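A minimal sketch of how schema-derived relationships could be generated, assuming the schema has already been parsed into plain dictionaries. The foreign keys are a subset of those listed for the University example and the inheritance links come from the Computer_Science example; the function name and the data layout are illustrative, not the MOMIS implementation.

```python
# Referencing class -> referenced classes (subset of the University example).
foreign_keys = {
    "UNI.Research_Staff": ["UNI.Department", "UNI.Section"],
    "UNI.Section": ["UNI.Department"],
}

# Subclass -> superclass (Computer_Science example).
inheritance = {
    "CS.Student": "CS.CS_Person",
    "CS.Professor": "CS.CS_Person",
}


def schema_derived_relationships(foreign_keys, inheritance):
    """Return (term, relationship, term) triples for the Common Thesaurus."""
    relationships = []
    for referencing, referenced_classes in foreign_keys.items():
        for referenced in referenced_classes:
            relationships.append((referencing, "RT", referenced))
    for subclass, superclass in inheritance.items():
        relationships.append((subclass, "NT", superclass))
        relationships.append((superclass, "BT", subclass))  # inverse of NT
    return relationships


relationships = schema_derived_relationships(foreign_keys, inheritance)
```

Storing the BT inverse next to each NT triple is a design choice of the sketch; a real thesaurus could just as well derive it on demand.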
Mediator global schema • Global schema generation (interaction with the ARTEMIS module): • Affinity calculation • Cluster generation • Global attributes and mapping table generation • A global class gc_i is generated for each cluster Cl_i • SI-Designer builds the set of attributes to be associated with the cluster: • Union of the attributes of all the classes belonging to the cluster • Fusion of “similar attributes”
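The attribute-building step can be sketched as a union of all the attributes in a cluster followed by a fusion of the similar ones. The sketch assumes that two attributes are similar when they have the same name or when a pair is explicitly supplied (for instance from SYN relationships in the Common Thesaurus). The cluster contents, the supplied pair and the function name are assumptions for illustration; this is not the SI-Designer algorithm.

```python
def build_global_attributes(cluster, similar_pairs):
    """Return a mapping table: global attribute -> {local class: local attribute}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_with_name = {}
    for cls, attrs in cluster.items():
        for attr in attrs:
            node = (cls, attr)
            find(node)
            # ASSUMPTION: attributes with identical names are fused.
            union(first_with_name.setdefault(attr, node), node)
    for a, b in similar_pairs:
        union(a, b)

    groups = {}
    for cls, attrs in cluster.items():
        for attr in attrs:
            groups.setdefault(find((cls, attr)), []).append((cls, attr))
    # The first local name met in each group is used as the global name.
    return {members[0][1]: dict(members) for members in groups.values()}


# Hypothetical cluster built from classes of the example sources.
cluster = {
    "University.School_Member": ["name", "school", "year", "e_mail"],
    "Tax_Position.Student": ["name", "s_code", "school_name", "e_mail", "tax_fee"],
    "CS.Student": ["year", "takes", "rank", "e_mail"],
}
similar_pairs = [
    (("University.School_Member", "school"), ("Tax_Position.Student", "school_name")),
]
mapping_table = build_global_attributes(cluster, similar_pairs)
```

In the resulting table, `school` maps to `school` in University.School_Member and to `school_name` in Tax_Position.Student, while `e_mail` maps to all three classes.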