250 likes | 412 Views
Section 20.1 Modes of Information Integration. Anilkumar Panicker CS257: Database Systems ID: 118. Agenda. Information Integration Problems of Information Integration 1. Data type differences 2. Value differences 3. Semantic Differences 4. Missing Values
E N D
Section 20.1 Modes of Information Integration Anilkumar Panicker CS257: Database SystemsID: 118
Agenda • Information Integration • Problems of Information Integration 1. Data type differences 2. Value differences 3. Semantic Differences 4. Missing Values • Modes of Information Integration 1. Federated Databases 2. Warehousing 3. Mediation
Information Integration • In simple terms “Information Integration” can be defined as the process of taking data that is stored in two or more databases (Information Sources) and building one large database from them. • The sources may be conventional databases or collection of web pages.
Motivation • The main motivation behind Information Integration is to bring Information from Disparate sources, possible from sources with varied schemas so that data can be queried as a single unit. • End user is oblivious to the contextual and typographical differences in the sources of data.
Example • Consider the example of an automobile company with 1000 dealers. • Each dealer maintains a database of their cars in stock. • The company wants to create an integrated database which contains information from all the 1000 dealers. • The integrated database can be useful in performing market analysis and determining market trends to adjust production.
Example cont.. • However, the 1000 dealers do not all use the same database schema. • Dealer 1 may store information in a single relation. • While dealer 2 might use a schema where options are separated into a second relation.
Example Cont.. • Dealer 1 Cars(sNo, model, color, aTran, cdPlr, ….) • Dealer 2 Autos(serial, model, color) Options(serial, option) • Not only is the schema different, but equivalent names have also changed
Problems of Information Integration • Data Type differences • Serial number might be represented by character string at one source, while other source might use integers. • Even if two sources use character string to store serial number, one may use fixed length while other may use variable length string.
Problems of Information Integration • Value Differences • The same concept might be represented by different constants at different sources • For e.g. The color black might be represented by code BL at one source, while BL may stand for color blue at other source.
Problems of Information Integration • Semantic Differences • Terms may be given different interpretations at different source. • One dealer might include trucks in Cars relation while other dealer includes only car information in Cars relation.
Problems of Information Integration • Missing Values • A source might not record information of a type that all or most of the other sources provide. • For e.g. a dealer might not record colors at all.
Federated Database Systems • Simplest method of Information Integration • One to One connection between all pairs of databases that need to talk to one another. • These connections allow one database system D1 to query another D2 in terms that D2 can understand.
Considering the schemas from slide 7,this is how Dealer 1 queries Dealer 2 for needed cars
Problem with federated database Systems • Large number of pieces of code must be written to provide communication between databases that need to talk. • If n databases want to talk with each other, then the total number of connections required is n (n-1)
Data Warehouses • Data from several sources is extracted and combined into a global schema. • The Integrated data is then stored at warehouse, which looks like an ordinary database to the user. • User can issue queries to warehouse exactly in the same manner in which he would to an ordinary database. • Updates to the warehouse are generally forbidden.
Data warehouse • Data warehouse stores integrated information in a separate database.
Three approaches to constructing the data in the Warehouse • Periodic reconstruction • Incremental Updates • Warehouse changed immediately in response to changes at the source.
Data warehouse • Periodic reconstruction • Most common approach. • Warehouse reconstructed periodically from the current data at source. • Disadvantages • Warehouse must be shut down for reconstruction • Data in the warehouse can become seriously outdated.
Data warehouse • Incremental Updates • Warehouse updated periodically based on the changes that have been made to source. • Involves smaller amounts of data. • Useful in situations where warehouse needs to be updated in a short period of time and the warehouse is large. • Disadvantage • Complex compared to periodic reconstruction.
Data warehouse • Warehouse changed immediately in response to changes at the source. • Involves too much communication and processing. • Difficult to implement for large warehouses. • Practical only in situations where the underlying source changes very slowly.
Extractor for Data warehouse • Considering the schemas on slide 7 this is how we can extract data from dealer 1
Mediators • Mediator supports a virtual view or collection of views. • Integrates several sources in the same way as a Warehouse does. • The major difference is that Mediator does not store any data.
Mediators • A mediator translates queries into the terms of the source and combines the answers
Mediators • No separate combiner is required as Mediator performs the function of combining the results. • Mediator is also responsible for determining where the queries should be directed.