An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment
Michalis Sfakakis (1) and Sarantos Kapidakis (2)
(1) National Documentation Centre / National Hellenic Research Foundation, msfaka@ekt.gr
(2) Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University, sarantos@ionio.gr
7th European Conference on Digital Libraries, 17-22 August 2003, Trondheim, Norway
Presentation Summary • Main Contributions • Resource Access in a Network Environment (models, characteristics, issues, implementations) • Proposed Architecture (goal, critical points, characteristics, benefits) • Technical Details of the Proposed Architecture • Conclusions • Future Research
Main Contributions • Analysis of problems (in a networked environment) for: • Concurrent resource access via parallel search • Information integration • Proposal of architecture for these problems: • Able to improve online information integration • Taking into account the restrictions imposed by the: • Network environment • Z39.50 information retrieval protocol
Resource Access in Union Catalogues • Provide access to library content from one central point • Functional requirements • Consistent searching & indexing • Consolidation of Records (information integration) • Performance & Management • … conformance to current implementation models • Centralized (the vast majority of the current implementations): conforms well to all functional requirements • Distributed (current approaches, i.e. virtual union catalogues): conformance to the functional requirements varies
Why Virtual Union Catalogues (VUC) Why move from Centralized to Distributed: • Local autonomy and control of the participating systems • Retention of the specific resource characteristics • User ability to dynamically define their own collections of resources • Vast and increasing number of available resources
Pre-requirements for VUC • Ensure systems interoperability, derived from the implementation of international metadata standards and information retrieval protocols • Provide information integration (indicated by user studies) • Achieve acceptable performance from the systems which emulate the union catalogue • Support parallel searching • Have adequate network performance
Is it possible to implement VUC now? Depends on: • Current technology and network improvements • Existence and wide acceptance of metadata standards (e.g. DC, MARC, MODS, etc) • Wide acceptance of the Z39.50 information retrieval protocol and its associated profiles
Requirements for Information Integration • Information Integration (Consolidation of Records) is a two-step process: • Identification of the duplicate records • Presentation: Creation of a union record, or, according to the Z39.50 duplicate detection model, the clustering of records in ‘equivalence classes’ and the selection of a representative record • Its effectiveness & quality are affected by the: • Differences in semantic models and formats of the metadata • Metadata Quality (i.e. specificity, completeness of fields, syntactic correctness and consistency as implemented by authority files)
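The two steps can be sketched as follows. This is only an illustration: the match key (normalized title plus year) and the rule that the most complete record becomes the representative are assumptions, not something the slides or the Z39.50 duplicate detection model prescribe, and the sample records are invented.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical match key: lower-cased title without punctuation, plus year."""
    title = "".join(ch for ch in record.get("title", "").lower()
                    if ch.isalnum() or ch.isspace())
    return (" ".join(title.split()), record.get("year"))

def cluster(records):
    """Step 1: group records sharing a match key into equivalence classes."""
    classes = defaultdict(list)
    for rec in records:
        classes[match_key(rec)].append(rec)
    return list(classes.values())

def representative(equivalence_class):
    """Step 2: pick the record with the most non-empty fields (assumed rule)."""
    return max(equivalence_class, key=lambda r: sum(1 for v in r.values() if v))

# Invented sample records, as they might arrive from two resources.
records = [
    {"title": "Digital Libraries", "year": 2003, "publisher": ""},
    {"title": "Digital  libraries.", "year": 2003, "publisher": "Example Press"},
    {"title": "Union Catalogues", "year": 2001, "publisher": "Sample House"},
]
classes = cluster(records)
reps = [representative(c) for c in classes]
```

Differences in metadata quality show up directly here: a missing year or a variant title spelling changes the key and splits what should be one equivalence class.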
Methods for Information Integration • Depending on the challenge: • High quality duplicate detection and merging on large amounts of data, offline, without hard time restrictions • Development of centralized union catalogues, or creation of collections by harvesting techniques • Good de-duplication quality on medium to small amounts of data, online, with results presented to the user within an acceptable response time • Development of virtual union catalogues
Z39.50 Information Retrieval Protocol • A complicated, stateful, client/server protocol, widely used in the area of libraries • For every session (Z-association) a server: • Holds a search history (at least the last query) • During the session the client can request data from any result set included in the search history • The search history stays alive during the session • The session can be abruptly terminated by the server (timeout), on ‘lack of activity’ • The timeout period is server dependent • Depending on the implementation level, a server could implement, in a number of variations, the: • Sort service • Duplicate detection service
Summary of VUC Implementation Issues • Network dependent: • Network links performance & availability • Protocol dependent: • Interoperability level (e.g. supported services and their implementation variations) • Timeout period and session reactivation • Participating systems dependent: • Performance, availability, extensibility, metadata encoding and semantics • De-duplication complexity & expensiveness: • Highly affected by the different semantic models & formats, quality, completeness, consistency and the amount of the metadata • Overall system performance
Current VUC Implementations • Server side: • The majority support the basic services (e.g. Init, Search, Present, Scan) • A small number support the sort service • A minority support the duplicate detection service • Client side: • Has to deal with heterogeneity in the received result data • Must overcome timeout issues, avoiding session reactivation • Has to de-duplicate incoming results, even if no individual server reply contains duplicates • The majority of the implementations do not perform any integration, due to performance issues • Primitive duplicate detection approaches, based on some coded data (e.g. ISBN, ISSN, LC number, etc.)
User – VUC System Interactions • Defines the desired collection of resources • Sends a search request, specifying a desired number of records (Presentation Set) to display each time • After receiving the Presentation Set, the user may or may not request subsequent Presentation Sets
Goal of the Proposed Architecture To improve information integration in online access of a distributed system, which: • Accesses resources concurrently via the network • Applies good-quality duplicate detection procedures online (so that a record held by several resources is presented only once)
Critical Points of the Proposed Architecture We have to deal with: • Performance of the network links and the availability of the resources • Complexity and expensiveness of the duplicate detection algorithms, especially on large numbers of records • Extraction of the Presentation Set in reasonable response time
Characteristics of the Proposed Architecture What we do: • We do not apply the duplicate detection algorithms in one shot: the duplicate detection process is applied to each received set of data, comparing it against the previously processed results • Incremental comparison and elimination of the duplicates in every Presentation Set: the processed results are sorted and do not contain duplicates • Usage of the sort or duplicate detection service, when supported • While the user is reading the results, the system prepares the next few sets of unique records
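The incremental idea can be sketched as below: each received batch is compared only against the already processed results, which are kept sorted and duplicate-free. The class name, the exact-match comparison on a caller-supplied key, and the use of binary search are all assumptions for illustration; the slides leave the actual duplicate detection algorithm open.

```python
import bisect

class IncrementalDeduplicator:
    """Keeps processed results sorted by match key and free of duplicates."""

    def __init__(self, key):
        self.key = key          # hypothetical match-key function
        self._keys = []         # sorted match keys of the unique records
        self._records = []      # unique records, parallel to _keys

    def add_batch(self, batch):
        """Merge one received batch; return the records that were new."""
        added = []
        for rec in batch:
            k = self.key(rec)
            i = bisect.bisect_left(self._keys, k)
            if i < len(self._keys) and self._keys[i] == k:
                continue        # already present in the processed results
            self._keys.insert(i, k)
            self._records.insert(i, rec)
            added.append(rec)
        return added

    @property
    def unique(self):
        return list(self._records)
```

Because every batch is small and the processed set is sorted, each comparison is a lookup rather than a pass over all results retrieved so far.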
Benefits of the Proposed Architecture • Avoid downloading large amounts of data over the network and unnecessarily loading the servers • Apply the duplicate detection algorithm to a small number of records, especially in the first steps • Every record is compared against an already processed set during de-duplication • We exploit the time the user spends reading the presented data, without exhausting the system resources
Overview of the Proposed Architecture • Modules: Request Interface, Data Integrator, Resource Communicator • Components: Data Provider, Local Result Set Manager, De-duplicator, Data Presenter • Interaction is accomplished by messages or synchronous data transmissions
Modules of the Proposed Architecture • The Request Interface: Receives every user request (search or present), dispatches it to the appropriate modules, and waits for the Presentation Set • The Resource Communicator: Accesses the resources and supplies the data for the integration • The Data Integrator: Receives the data sets, performs the information integration and manages the unique records so that they are ready for presentation
Components of the Proposed Architecture • The Local Result Set Manager: Holds and arranges (e.g. sorts) the de-duplicated records and prepares the Presentation Set • The Data Provider: Receives data from the Resource Communicator module and sends the records one at a time for further processing • The De-duplicator(s): Receives a record from the Local Result Set Manager and compares it with all the unique records in the Local Result Set • The Data Presenter: Dispatches the received request for data, from the Request Interface, to the Local Result Set Manager and returns the next unique records for presentation
[Figure: architecture overview. Labels: User Interaction; Request Interface; Data Integrator; Resource Communicator; Z39.50 Servers serving Resources 1…j, j+1…k, l+1…r]
Accomplishing a search request – Module Interactions • 1. The Request Interface requests p records from the Data Integrator and waits for (at most p) records • 2. The Request Interface also forwards the search request, including the number p, to the Resource Communicator and continues monitoring for user requests • 3. The Resource Communicator waits for messages from the Request Interface and, when it receives a new search request, concurrently starts the following sequence of steps for every server: • 3.1 Translates the search request into the appropriate message format for the server, sends it and waits for the reply • 3.2 Adds up the number of hits from all the replies and sends the total to the Request Interface • 3.3 If the server supports either the duplicate detection or the sort service, invokes it after the server's initial response to the search request • 3.4 Requests a number of records (e.g. p) from every server that replied to its last request • 3.5 Sends the arrived data to the Data Integrator • 3.6 Waits for further commands, but if there is no communication with the server for a period close to its timeout, the procedure jumps to step 3.4 • 4. The Data Integrator de-duplicates part of the received data, prepares the set of unique records and, when p records are found, sends them to the Request Interface
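The Resource Communicator's concurrent fan-out can be sketched roughly as follows. `FakeServer` is a hypothetical in-memory stand-in, not a Z39.50 client: it only mimics a search that returns a hit count and a present that returns a slice of the result set. Thread-based concurrency is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeServer:
    """Hypothetical in-memory stand-in for one Z39.50 server."""

    def __init__(self, name, records):
        self.name = name
        self._records = records
        self._result_set = []

    def search(self, term):
        self._result_set = [r for r in self._records if term in r.lower()]
        return len(self._result_set)            # number of hits

    def present(self, start, count):
        return self._result_set[start:start + count]

def query_server(server, term, p):
    """Per-server sequence: search, then request up to p records."""
    hits = server.search(term)
    return hits, server.present(0, p)

def concurrent_search(servers, term, p):
    """Run the per-server sequence for all servers in parallel."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(lambda s: query_server(s, term, p), servers))
    total_hits = sum(hits for hits, _ in results)   # still counts duplicates
    received = [rec for _, recs in results for rec in recs]
    return total_hits, received
```

The summed hit count is reported before any de-duplication, so it can only overestimate the number of distinct records; the received records are what would be handed to the Data Integrator.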
Module Interactions: Comments & Clarifications • All modules work in parallel • The number of records requested from every server could vary, depending upon its performance, its timeout, the network links and the Result Set size • For the overall system performance, the Resource Communicator detects whether a server is down, using the Profiles of the Z39.50 servers, and continues the interaction with the other modules • The calculated number of hits is only an approximation of the actual one, since it is summed over the servers' replies before any de-duplication • To avoid session reactivation, imposed by the server timeout, the Resource Communicator could request data from any server at any time • A threshold value activates the Data Integrator to ‘request data’ from the Resource Communicator
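One way to decide when a server needs a request to keep its session alive is sketched below. The session bookkeeping (a dict with `last_activity` and `timeout` in seconds) and the 80% safety margin are assumptions; the slides only say that the timeout is server dependent and that data may be requested at any time.

```python
def sessions_needing_request(sessions, now, margin=0.8):
    """Return the names of servers idle for at least margin * timeout seconds.

    `sessions` maps a server name to a dict holding `last_activity` (a
    timestamp) and `timeout` (the server's timeout period), both hypothetical
    fields that would come from the server profiles.
    """
    return [
        name
        for name, s in sessions.items()
        if now - s["last_activity"] >= margin * s["timeout"]
    ]
```

For each returned server, the communicator would issue its next record request early, which both refreshes the session and feeds the Data Integrator.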
[Figure: component diagram. Labels: Request Interface; Data Integrator; Data Presenter; De-duplicator; Data Provider; Local Result Set Manager; Presentation Set; Local Result Set; Input Queue; Output Queue; Profiles of the Z39.50 Servers; Resource Communicator]
Accomplishing a search request – Component Interactions • The Data Provider starts to transfer data, possibly rearranging them. If the number of records it holds is less than a threshold (e.g. 5p), the Data Provider sends a ‘request data’ message to the Resource Communicator • While the Local Result Set Manager has fewer than a threshold number (e.g. 3p) of unique records, it tries to read from the Data Provider and, for every record found, it calls the De-duplicator to compare the record: • The De-duplicator compares the record with the records in the Local Result Set and then sends the results back to the Local Result Set Manager • The Local Result Set Manager receives the results from the duplicate detection process and arranges the record into the Local Result Set • If the number of new unique records in the Local Result Set becomes p, it copies the p new unique records into the Presentation Set and activates the Data Presenter • When the Presentation Set is filled with the p records, the Data Presenter component dispatches the records to the Request Interface module and waits to receive the next ‘request data’ message from it. If the component does not receive any request during its predefined timeout period, it terminates the system
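A simplified, single-threaded sketch of this loop is given below. It models only the Data Provider threshold (e.g. 5p) and the "p new unique records" condition; the read-ahead threshold of the Local Result Set Manager (e.g. 3p), the messaging between components and the timeout are left out. The function and parameter names, including `fetch_more` as a stand-in for the ‘request data’ message, are hypothetical.

```python
from collections import deque

def fill_presentation_set(provider, unique_keys, fetch_more, p, key=lambda r: r):
    """Pull records from the provider until p new unique records are found.

    provider    -- deque of records waiting in the Data Provider
    unique_keys -- set of match keys already in the Local Result Set
    fetch_more  -- callable standing in for a 'request data' message;
                   returns a (possibly empty) list of records
    """
    presentation_set = []
    while len(presentation_set) < p:
        if len(provider) < 5 * p:       # Data Provider threshold (e.g. 5p)
            provider.extend(fetch_more())
            if not provider:
                break                   # resources exhausted
        record = provider.popleft()
        k = key(record)
        if k not in unique_keys:        # De-duplicator: exact-key comparison
            unique_keys.add(k)
            presentation_set.append(record)
    return presentation_set
```

Calling the function again with the same `provider` and `unique_keys` yields the next Presentation Set, so duplicates already shown to the user are never repeated.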
Component Interactions: Comments & Clarifications • The combination of the threshold values in the Data Provider & the Local Result Set Manager controls the ‘request data’ activity from the Resource Communicator • The Local Result Set Manager keeps two orderings for the unique records in order to: • Improve the performance of the De-duplicator • Present the stored records and facilitate easy access to them
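One possible reading of the "two orderings" is sketched here: an index by match key serving the De-duplicator, and a list kept in presentation order serving the Data Presenter. The class, its method names and the choice of a hash index for the first ordering are assumptions; the slides do not specify the data structures.

```python
import bisect
from itertools import count

class LocalResultSet:
    """Unique records, reachable by match key and in presentation order."""

    def __init__(self, match_key, sort_key):
        self._match_key = match_key
        self._sort_key = sort_key
        self._by_match_key = {}   # access path 1: fast duplicate lookup
        self._sorted = []         # access path 2: presentation order
        self._seq = count()       # tie-breaker so records are never compared

    def add(self, record):
        """Insert the record unless its match key is already present."""
        k = self._match_key(record)
        if k in self._by_match_key:
            return False
        self._by_match_key[k] = record
        bisect.insort(self._sorted,
                      (self._sort_key(record), next(self._seq), record))
        return True

    def in_order(self):
        return [rec for _, _, rec in self._sorted]
```

Keeping both structures costs extra memory, but the duplicate check no longer depends on the order in which records are shown to the user.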
Conclusions • The online de-duplication process from resources accessed concurrently in a network environment: • Is a requirement identified by user studies • Is challenged by a number of issues relevant to: • Performance of the participating servers • Their network links • The complexity and the expensiveness of the duplicate detection algorithms • These issues make any approach to information integration inefficient: • In online environments • Especially when large amounts of data must be processed • In our proposed system: • We do not try to integrate all the results from all the resources at once • We attack this problem by: • Retrieving a small number of records, regardless of whether the servers provide de-duplicated or sorted results • Applying the de-duplication process to small amounts of sorted records • Creating a presentation set of unique records to display to the user • Exploiting the time the user spends reading the presented data, without wasting the system resources
Future Research • To better approximate the number of records satisfying the search request • To derive priorities for the servers and their resources • To select or adapt a good de-duplication algorithm for different record completeness and different provision of records by the servers • To optimize the number of requested records from a server • To implement the system and evaluate its performance