Integration of Information from Multiple Sources of Textual Data

November 22, 1999 93419-052 전 승 원

1. Introduction • The number of information sources on the Internet is increasing exponentially. • Information is highly heterogeneous both in its structure and in its origin. • Data types: Textual data, Images, Sounds, etc.

1. Introduction Integration from Textual Data • To organize data (often a huge amount) coming from multiple heterogeneous sources in easily accessible structures. • Research topic for different research communities: Database, Artificial Intelligence, and Information Retrieval • Two scenarios: Known sources, Unknown sources

1. Introduction The First Scenario (Known Sources) • The sources of heterogeneous textual data are known. • Widely investigated in the database area: decision support systems, integration of heterogeneous databases, and data warehouses • DARPA Intelligent Integration of Information (I3) research program

1. Introduction The Second Scenario (Unknown Sources) • Information Discovery problem • Arising mainly due to the Internet explosion • In this scenario, • First: to identify, among a huge number of sources of heterogeneous textual data, a possibly small number of relevant sources. • Second: to face the problem of scenario 1. • Out of scope • We will discuss problems and solutions of the extraction and integration of information from highly heterogeneous multiple sources of textual data in order to provide true information.

1. Introduction Heterogeneity • Heterogeneous sources • Databases, File Systems, Knowledge Bases, Digital Libraries, Information Retrieval Systems, and Electronic Mail Systems. • Structural and implementation heterogeneity • Differences in hardware platforms, DBMS, data models, and data languages. • Semantic heterogeneity • Different names are employed to represent the same information. • Different modeling constructs are used to represent the same piece of information in different sources. • How to cope with the heterogeneity • Two fundamental approaches: Structural and Semantic

1. Introduction Structural Approach • Characterized as follows • Self-describing model • Semantic information is effectively encoded in rules • Arguments in favor of the Structural Approach • Flexibility, generality and conciseness of a self-describing model • A form of first-order logic languages is provided. • Useful when a client doesn’t know in advance the structure of the objects of a source. • Schema-less property • The TSIMMIS Project

1. Introduction Semantic Approach • Characterized as follows • For each source, meta-data (conceptual schema) must be available. • Semantic information is encoded in the schema. • Partial or total schema unification is performed. • It adopts conventional OO data models. • Arguments in favor of the Semantic Approach • It allows us to organize extensional knowledge and to give a high-level abstract view of information; • To check consistency of instances with respect to their descriptions, and thus to preserve the quality of data; and • To efficiently extract information through query optimization. • Considerable effort has been devoted to developing OO standards: CORBA and ODMG93. • Schema Nature • The MOMIS Project

1. Introduction Virtual Approach • First proposed in multidatabase models in the early 1980s • More recently developed using description logics. • Conjunctive queries (select, project, join) • Open World Assumption • Top-down approach for the schema

2. The TSIMMIS Project • The Stanford-IBM Manager of Multiple Information Sources • To develop tools that facilitate the rapid integration of heterogeneous textual sources

2. The TSIMMIS Project TSIMMIS Architecture [Figure: TSIMMIS architecture. Labels: Application, Mediator (two), Mediator Generator, Wrapper (three), Wrapper Generator, Info (three information sources).]

2. The TSIMMIS Project TSIMMIS Architecture (continued) • Wrappers (Translators) and Mediators • Common Model • OEM (Object Exchange Model) • Query Languages • OEM-QL • MSL (Mediator Specification Language) • Possible bottlenecks • An ad-hoc translator must be developed for any information source. • Implementing a mediator can be complicated and time-consuming. • Important goals • to provide a translator generator • to automatically or semi-automatically generate mediators

2. The TSIMMIS Project The OEM Model and the MSL Language • OEM model: self-describing model
  <ob1: person, set, {sub1, sub2, sub3, sub4, sub5}>
  <sub1: last_name, str, ‘Smith’>
  <sub2: first_name, str, ‘John’>
  <sub3: role, str, ‘faculty’>
  <sub4: department, str, ‘cs’>
  <sub5: telephone, str, ‘32435465’>
• MSL language • First-order logic language that allows the declarative specification of mediators. • Rule: head :- tail • head: the pattern of the top-level integrated object supported by the mediator • tail: the pattern of the object to be fetched from the source
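The self-describing nature of OEM can be illustrated with a small Python sketch. This is only an illustration under assumptions: the `OemObject` class, `build_store`, and `to_nested` are hypothetical names introduced here and are not part of TSIMMIS; only the object ids, labels, types, and values come from the slide.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class OemObject:
    """Hypothetical container for one OEM object: <oid: label, type, value>."""
    oid: str                             # object identifier, e.g. "ob1"
    label: str                           # self-describing label, e.g. "person"
    type: str                            # "set" or an atomic type such as "str"
    value: Union[str, Tuple[str, ...]]   # atomic value, or ids of sub-objects


def build_store(objects):
    """Index objects by id so that set values can be dereferenced."""
    return {obj.oid: obj for obj in objects}


def to_nested(store, oid):
    """Return (label, value); set objects become dicts keyed by sub-object label.

    Assumption: labels are unique within one set, as in the slide's example.
    """
    obj = store[oid]
    if obj.type == "set":
        return obj.label, dict(to_nested(store, sub) for sub in obj.value)
    return obj.label, obj.value


store = build_store([
    OemObject("ob1", "person", "set", ("sub1", "sub2", "sub3", "sub4", "sub5")),
    OemObject("sub1", "last_name", "str", "Smith"),
    OemObject("sub2", "first_name", "str", "John"),
    OemObject("sub3", "role", "str", "faculty"),
    OemObject("sub4", "department", "str", "cs"),
    OemObject("sub5", "telephone", "str", "32435465"),
])

label, fields = to_nested(store, "ob1")
```

No schema is consulted anywhere: the structure of the person object is discovered from the labels carried by the objects themselves.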

2. The TSIMMIS Project The TSIMMIS Wrapper Generator • OEM Support Libraries • to quickly implement wrappers, mediators and end-user interfaces • The architecture of wrappers • Client Support Library • either a mediator or an application • Server Support Library • either a translator or a mediator • Converter • MSL → Native Query • QDTL (Query Description and Translation Language) • Extractor • Packager • Filter Processor

2. The TSIMMIS Project TSIMMIS Wrapper [Figure: TSIMMIS wrapper architecture. Labels: Client, Client Support Library, Server Support Library, Filter Processor, Converter (QDTL), Driver, Packager, Extractor (DEX), Information Source.]

2. The TSIMMIS Project Converter and QDTL • Example: WHOIS information source
  > lookup -ln ‘ss’
  > lookup -ln ‘ss’ -fn ‘ff’
• QDTL (Query Description and Translation Language)
  D1:
  (QT1.1) Query ::= *0 :- <0 person {<last_name $LN>}>
  (AC1.1) {printf (lookup_query, ‘lookup -ln %s’, $LN);}
  (QT2.2) Query ::= *0 :- <0 person { <last_name $LN> <first_name $FN>}>
  (AC2.2) {printf (lookup_query, ‘lookup -ln %s -fn %s ’ , $LN, $FN);}
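The idea of pairing a query template with an action that emits a native command can be sketched in Python. This is a simplified illustration, not QDTL itself: the `TEMPLATES` table and the `convert` function are hypothetical, queries are reduced to a dict of label/constant conditions on a person object, and only exact template matches are handled.

```python
# Each entry mirrors one (query template, action) pair from the slide:
# the labels that must be bound, and the native command the action prints.
TEMPLATES = [
    (("last_name", "first_name"), "lookup -ln {last_name} -fn {first_name}"),
    (("last_name",), "lookup -ln {last_name}"),
]


def convert(query):
    """Translate a query into a native WHOIS command, or return None.

    `query` maps labels to constants, e.g. {"last_name": "ss"}.
    Only queries whose labels exactly match a template are translated here.
    """
    for labels, command_format in TEMPLATES:
        if set(query) == set(labels):
            return command_format.format(**query)
    return None


by_last_name = convert({"last_name": "ss"})
by_full_name = convert({"last_name": "ss", "first_name": "ff"})
unsupported = convert({"role": "student"})
```

A query on `role` alone has no matching template, so `convert` returns `None` rather than guessing a command.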

2. The TSIMMIS Project Converter and QDTL (continued) • Converter exploits each template to describe many more queries • Directly supported queries • queries with a syntax analogous to the template • Logically supported queries • A query q is logically supported by a template t if q is logically equivalent to, or subsumed by, a query q’ directly supported by t. • Indirectly supported queries • A query q is indirectly supported by a template t if q can be decomposed into a query q’ directly supported by t and a filter that is applied to the results of q’. • Example • (Q6) *Q :- <Q person {<last_name ‘Smith’> <role ‘student’>}>
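Indirect support, as in (Q6), amounts to splitting a query into a part the source can answer natively and a residual filter. The following Python sketch shows that decomposition under assumptions: `NATIVE_TEMPLATES`, `plan`, and `apply_filter` are hypothetical names, queries are simplified to label/constant dicts, and the sample WHOIS results are invented for illustration.

```python
# Label sets the source can answer directly, most specific first (assumed order).
NATIVE_TEMPLATES = [("last_name", "first_name"), ("last_name",)]


def plan(query):
    """Split a query into (native conditions, residual filter conditions).

    Returns None when no template covers any part of the query.
    """
    for labels in NATIVE_TEMPLATES:
        if all(label in query for label in labels):
            native = {label: query[label] for label in labels}
            residual = {k: v for k, v in query.items() if k not in labels}
            return native, residual
    return None


def apply_filter(objects, residual):
    """Keep only the objects that satisfy every residual condition."""
    return [
        obj for obj in objects
        if all(obj.get(label) == value for label, value in residual.items())
    ]


# (Q6): persons with last_name 'Smith' and role 'student'.
native, residual = plan({"last_name": "Smith", "role": "student"})

# Invented results of the native query "lookup -ln Smith".
native_results = [
    {"last_name": "Smith", "first_name": "John", "role": "faculty"},
    {"last_name": "Smith", "first_name": "Ann", "role": "student"},
]
answer = apply_filter(native_results, residual)
```

The source only sees the `last_name` condition; the `role` condition is applied afterwards, which is the job the slides assign to the Filter Processor.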

2. The TSIMMIS Project Extractor, DEX Templates, and Filter Processor • Extractor • Input • A query result expressed in an unstructured format • DEX templates • Packager • Output • a set of OEM objects • Filter Processor • The filter is an MSL query built by the Converter.

2. The TSIMMIS Project The TSIMMIS Mediator Generator • The MedMaker system • the TSIMMIS component developed for declaratively specifying mediators • Example • CS objects in OEM
  <&e1, employee, set, {&f1, &l1, &t1, &rep1}>
  <&f1, first_name, string, ‘Joe’>
  <&l1, last_name, string, ‘Chung’>
  <&t1, title, string, ‘professor’>
  <&rep1, reports_to, string, ‘John Hennessy’>
• WHOIS objects in OEM
  <&p1, person, set, {&n1, &d1, &rel1, &elem1}>
  <&n1, name, string, ‘Joe Chung’>
  <&d1, dept, string, ‘cs’>
  <&rel1, relation, string, ‘employee’>
  <&elem1, e_mail, string, ‘chung@cs’>

2. The TSIMMIS Project The TSIMMIS Mediator Generator (continued) • An object exported by ‘MED’
  <&cp1, cs_person, set, {&mn1, &mrel1, &t1, &rep1, &elem1}>
  <&mn1, name, string, ‘Joe Chung’>
  <&mrel1, relation, string, ‘employee’>
  <&t1, title, string, ‘professor’>
  <&rep1, reports_to, string, ‘John Hennessy’>
  <&elem1, e_mail, string, ‘chung@cs’>
• Rules of ‘MED’
  (MS1) Rules:
    <cs_person {<name N> <rel R> Rest1 Rest2}> :-
      <person {<name N> <dept ‘cs’> <relation R> | Rest1}>@whois
      AND decomp(N, LN, FN)
      AND <R {<first_name FN> <last_name LN> | Rest2}>@cs
  External:
    decomp(string, string, string) (bound, free, free) impl by name_to_lnfn
    decomp(string, string, string) (free, bound, bound) impl by lnfn_to_name
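What rule (MS1) computes can be approximated procedurally in Python. This is a hedged sketch, not how MedMaker evaluates MSL: the `med` function is hypothetical, OEM objects are flattened to plain dicts, and `name_to_lnfn` is an assumed stand-in that splits a name on its last space.

```python
def name_to_lnfn(name):
    """Assumed behaviour of decomp with N bound: 'Joe Chung' -> ('Chung', 'Joe').

    Assumes the name contains at least one space.
    """
    first, last = name.rsplit(" ", 1)
    return last, first


def med(whois, cs):
    """Join WHOIS persons of dept 'cs' with CS objects labelled by their relation.

    whois: list of dicts (person objects from the whois source)
    cs:    dict mapping a label such as 'employee' to a list of dicts
    """
    results = []
    for person in whois:
        if person.get("dept") != "cs":
            continue
        name, rel = person["name"], person["relation"]
        last_name, first_name = name_to_lnfn(name)
        # Rest1: every other WHOIS sub-object of the person.
        rest1 = {k: v for k, v in person.items()
                 if k not in ("name", "dept", "relation")}
        for obj in cs.get(rel, []):
            if obj.get("first_name") == first_name and obj.get("last_name") == last_name:
                # Rest2: every other CS sub-object of the matched object.
                rest2 = {k: v for k, v in obj.items()
                         if k not in ("first_name", "last_name")}
                results.append({"name": name, "rel": rel, **rest1, **rest2})
    return results


whois = [{"name": "Joe Chung", "dept": "cs", "relation": "employee", "e_mail": "chung@cs"}]
cs = {"employee": [{"first_name": "Joe", "last_name": "Chung",
                    "title": "professor", "reports_to": "John Hennessy"}]}
cs_persons = med(whois, cs)
```

The single result combines the name and relation from WHOIS with the title and reports_to from CS and the e_mail from WHOIS, matching the sub-objects of the exported cs_person object on the slide.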

2. The TSIMMIS Project Architecture and Implementation of MSI • Mediator Specification Interpreter • processes a query on the basis of the rules expressed in MSL. • Three modules • VE&AO (View Expander and Algebraic Optimizer) • builds the logical datamerge program • cost-based optimizer • builds the physical datamerge program • execution plan • datamerge engine

3. The MOMIS Project The MOMIS Project • Mediator envirOnment for Multiple Information Sources • to allow a user to pose a single query and to receive a single unified answer • “Read-only views” • Common data model and language • ODMI3, ODLI3 - a subset of the corresponding ODMG93 ODM and ODL • olcd • Object Language with Complements and Descriptive cycles • ODB-Tools • GARLIC, SIMS

3. The MOMIS Project Information Integration in MOMIS • Extraction and analysis process • derives a Common Thesaurus of terminological relationships • constructs clusters • olcd description logics • to set up the Thesaurus by inferring relationships • to optimize the queries against the global schema • Unification process • builds an integrated global schema • Hierarchical clustering • allows automated identification of ODLI3 classes

3. The MOMIS Project Architecture • Wrappers • lie above each source. • responsible for translating the structure of the data source into the common ODLI3 and translating OQLI3 queries into local requests. • Mediators • lie above the wrappers. • software modules that combine, integrate, and refine ODLI3 schemata received from the wrappers. • generate the OQLI3 queries for the wrappers • ODB-Tools • ARTEMIS

3. The MOMIS Project ODLI3 • Data description language used for communication between wrappers and mediator engines. • Based on ODL, with added features for intelligent information integration. • if then rules • mapping rules • A source-independent language

3. The MOMIS Project Phases of Intelligent Schema Integration • Generation of a Common Thesaurus • Terminological relationships are derived in a semi-automatic way by analyzing the structure and context of classes in the schema • SYN (Synonym), BT (Broader Term), NT (Narrower Term), RT (Related Term) • ODB-Tools and olcd • Affinity analysis of ODLI3 classes • to evaluate the level of affinity between classes within and across sources. • Clustering ODLI3 classes • Generation of the mediator global schema • A class is defined for each cluster.

3. The MOMIS Project Example • University source (S1) • Research_Staff(first_name, last_name, relation, email, dept_code, section_code) • School_Member(first_name, last_name, faculty, year) • Department(dept_name, dept_code, budget, dept_area) • Section(section_name, section_code, length, room_code) • Room(room_code, seats_number, notes) • Computer_Science source (S2) • CS_Person(name) • Professor:CS_Person(title, belongs_to:Division, rank) • Student:CS_Person(year, takes:set<Course>, rank) • Division(description, address:Location, fund, sector, employee_nr) • Location(city, street, number, county) • Course(course_name, taught_by:Professor) • Tax_Position source (S3) • University_Student(name, student_code, faculty_name, tax_fee)

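3. The MOMIS Project Example (continued) [Figure: affinity tree produced by hierarchical clustering of the classes of S1, S2, and S3. Leaves: Location, Room, Division, Department, Research_Staff, Section, Course, CS_Person, Professor, University_Student, School_Member, Student. Node affinity values shown: 0.25, 0.35, 0.35, 0.39, 0.73, 0.66, 0.54, 0.6, 0.6, 0.6, 0.65.]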

3. The MOMIS Project Global Class Specification in ODLI3
  interface University_Person
    (extent Research_Staffers, School_Members, CS_Persons, Professors, Students, University_Students
     key name)
  { attribute string name
      mapping_rule (University.Research_Staff.first_name and University.Research_Staff.last_name),
                   (University.School_Member.first_name and University.School_Member.last_name),
                   ……
                   Tax_Position.University_Student.name;
    attribute string rank
      mapping_rule University.Research_Staff = ‘Professor’,
                   University.School_Member = ‘Student’,
                   … }
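How the mapping rules relate local attributes to the global attributes of University_Person can be sketched in Python. This is an illustration under assumptions: `global_name` and `global_rank` are hypothetical helpers, the "and" rule is read as joining the two local values with a space, and only the rules actually shown on the slide are covered (the elided ones are left out).

```python
def global_name(source, cls, record):
    """Compute the global `name` attribute for one local record.

    Assumption: 'first_name and last_name' means joining both values with a space.
    """
    if (source, cls) in {("University", "Research_Staff"),
                         ("University", "School_Member")}:
        return f"{record['first_name']} {record['last_name']}"
    if (source, cls) == ("Tax_Position", "University_Student"):
        return record["name"]
    raise KeyError(f"no mapping rule shown for {source}.{cls}")


def global_rank(source, cls):
    """Default-value rules for `rank`: the value depends only on the local class."""
    defaults = {
        ("University", "Research_Staff"): "Professor",
        ("University", "School_Member"): "Student",
    }
    return defaults[(source, cls)]


staff_name = global_name("University", "Research_Staff",
                         {"first_name": "Joe", "last_name": "Chung"})
student_name = global_name("Tax_Position", "University_Student",
                           {"name": "Ann Smith"})
staff_rank = global_rank("University", "Research_Staff")
```

The sample records are invented; the point is that one global attribute can be fed either by a combination of local attributes, by a single local attribute, or by a constant tied to the local class.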

3. The MOMIS Project The Global Schema Builder • SIM1 (Schemata Integrator Module, first version) • reads the local schemata descriptions to derive the Common Thesaurus. • interacting with the user • using description logics (supported by ODB-Tools) • Artemis • Affinity Coefficients are computed between all the pairs of local classes to be integrated. • Similar classes are grouped together using clustering techniques.
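Grouping classes by affinity can be sketched as a small agglomerative clustering routine in Python. This is not the ARTEMIS algorithm: the `cluster` function, the max-affinity linkage, the threshold, and all affinity values below are assumptions made for illustration only.

```python
def cluster(names, affinity, threshold):
    """Merge the two most similar clusters until no pair reaches the threshold.

    affinity: dict mapping frozenset({a, b}) to a coefficient in [0, 1];
              missing pairs count as 0.0.
    Linkage (assumed): the affinity of two clusters is the maximum pairwise value.
    """
    clusters = [frozenset([name]) for name in names]

    def cluster_affinity(a, b):
        return max(affinity.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > 1:
        score, i, j = max(
            (cluster_affinity(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if score < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Invented affinity coefficients between a few local classes.
affinity = {
    frozenset(("Student", "School_Member")): 0.8,
    frozenset(("Student", "University_Student")): 0.7,
    frozenset(("Room", "Section")): 0.3,
}
names = ["Student", "School_Member", "University_Student", "Room", "Section"]
groups = cluster(names, affinity, threshold=0.5)
```

With these invented values the three student-like classes end up in one cluster, while Room and Section stay separate because their affinity is below the threshold; each resulting cluster would then become a candidate global class.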

3. The MOMIS Project Description Logics and ODB-Tools • DL (Description Logics language) • also known as Concept Language or Terminological Logics • reasoning techniques • CODM (Complex Object Data Model) • The expressiveness gave rise to new problems • odl (Object Description Logics) • olcd • included by ODB-Tools. • ODB-Tools • Two modules: ODB-Designer and ODB-QOptimizer

Discussion • The TSIMMIS system • structural approach • Drawbacks • inefficient retrieval of the data to be integrated • inability to answer queries that are not predefined • The MOMIS system • semantic approach • generation of the global schema for the mediator is a semi-automated process • Further points (not covered) • query decomposition and optimization • object fusion in mediator systems • integration of semi-structured data
