Approaches to the Integration of Distributed and Heterogeneous Data Resources Ahmet Sayar Indiana University Computer Science Department
Motivation • Integrating data from multiple data sources • Distributed query and transactions of data. • Definitions and adoptions of data, metadata and their storages. • Accessing the data seamlessly. • Transparency, support for heterogeneity, extensibility and scalability.
Outline • Data Integration Approaches • Application Specific Solutions • Application-Integration Framework • ASIS (Application Specific Information System) • Database Federation • Ogsa-DAI (Ogsa-Data Access and Integration) • Compare ASIS with Ogsa-DAI • Digital Libraries • SRB (Storage Resource Broker) • Sompel’s Digital Library Approach • Compare ASIS with SRB and Sompel’s DL
Application Specific Solutions • The most common means of data integration • Expensive -in terms of time and skills • Developing and using requires deep system knowledge • Better results for special-purpose applications • Fragile • Changes to the underlying sources may easily break the application • Hard to extend • A new data source requires new code to be written
Application-Integration Framework • Can also be called a component-based framework • Such as CORBA or Filters with common interfaces • Does not necessarily address data integration issues • Based on a common data model (such as CML and GML) • With adaptors, if the source changes the adaptor may have to change, but the application may never see it. • Adding a new source is easy • A new adaptor may need to be written. • The adaptor may already exist online. • No need for detailed system knowledge • Ex. ASIS - OGC GIS Application Integration Framework
ASIS (1) • Enables inter-service communication through well-defined service interfaces, message formats and capabilities metadata. • Data model is ASL (Application Specific Language) • Metadata model is the capability document • Data and metadata have a common predefined schema • Components are Filter Services • Web Services with common service interfaces defined in WSDL • Information/data services enabling distributed access, querying and transformation through their predictable input/output interfaces. • Chainable, locatable, and capable of updating their metadata manually or dynamically
ASIS (2) • Data and data storage model • Any data can be integrated into the system after transforming to ASL. • Heterogeneity is handled at the end-Filters with adaptors. • ASL is community-accepted application specific language • GML (Geographic Markup Lang.) in GIS applications • CML (Chemistry Markup Lang.) in Chemistry applications • Filter’s common service interfaces • getCapabilities, getData, getFeatureInfo. • Requests to Filter’s interfaces • getCapabilitiesReq, getDataReq, getFeatureInfoReq • Expected return types are defined in Filters’ capability metadata
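The three common Filter interfaces listed above can be sketched as an abstract class with one concrete End-Filter acting as an adaptor. This is only an illustration under stated assumptions: the Python method names, the `FaultEndFilter` class, the in-memory rows and the GML-like output are all hypothetical and are not taken from the ASIS implementation, which exposes these interfaces as Web Services.

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Hypothetical Python rendering of the three common Filter interfaces."""

    @abstractmethod
    def get_capabilities(self) -> dict:
        """Return capability metadata describing what this Filter serves."""

    @abstractmethod
    def get_data(self, layer: str) -> str:
        """Return the requested data encoded in the ASL (a GML-like string here)."""

    @abstractmethod
    def get_feature_info(self, layer: str, feature_id: str) -> dict:
        """Return the attributes of a single feature."""


class FaultEndFilter(Filter):
    """End-Filter: an adaptor that converts a native source into the ASL."""

    def __init__(self):
        # Made-up rows standing in for a native, non-ASL data source.
        self._rows = {"f1": {"name": "Example Fault", "length_km": 10}}

    def get_capabilities(self):
        return {"service": "FaultEndFilter",
                "layers": ["Faults"],
                "formats": ["text/xml"]}

    def get_data(self, layer):
        if layer != "Faults":
            raise KeyError(layer)
        features = "".join(
            f'<Fault id="{fid}"><name>{row["name"]}</name></Fault>'
            for fid, row in self._rows.items())
        return f"<FeatureCollection>{features}</FeatureCollection>"

    def get_feature_info(self, layer, feature_id):
        if layer != "Faults":
            raise KeyError(layer)
        return dict(self._rows[feature_id])


end_filter = FaultEndFilter()
print(end_filter.get_capabilities()["layers"])
print(end_filter.get_data("Faults"))
```

Heterogeneity stays inside the End-Filter: callers only ever see the three interfaces and ASL-encoded output.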
ASIS (3) • Metadata and Metadata storage model: • Data integration is done through Filters’ capability metadata • Metadata is stored in the local Filter’s file system as a flat file. • Capability: • Inspired by the OGC WMS capability specification. • Looks like the Dublin Core format. • A capability-like structure is also used in Gannon’s approach (XPOLA), for Grid services’ security issues. • Describes dynamic Web/Grid resources. • Updated manually or dynamically. • Consists of descriptor, service and provider metadata • Inter-service communication is achieved without a third party. Enables chains of Filters.
ASIS (4) Data Access and Filter Chaining • Each Filter is capable of acting as both a server and a client • Capability integration is done through the “getCapabilities” service interface • Requests for common service interfaces are created in accordance with a predefined XML schema • [Figure: chain of Filters F1, F2, F3 and F4 serving the “State Boundary” and “Earth Fault” layers]
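Filter chaining can be sketched as Filters that aggregate the capabilities of the Filters they are clients of and forward data requests they cannot serve locally. The topology below (F1 in front of F2 and F3, with F3 in front of F4) is an assumption, since the slide figure itself is not recoverable; the class and the layer payloads are hypothetical.

```python
class Filter:
    """Hypothetical Filter serving its own layers plus those of upstream Filters."""

    def __init__(self, name, own_layers=None, upstream=None):
        self.name = name
        self.own_layers = dict(own_layers or {})  # layer name -> data
        self.upstream = list(upstream or [])      # Filters this one is a client of

    def get_capabilities(self):
        """Merge own layers with the capabilities reported by upstream Filters."""
        layers = {layer: self.name for layer in self.own_layers}
        for other in self.upstream:
            for layer, provider in other.get_capabilities().items():
                layers.setdefault(layer, provider)
        return layers

    def get_data(self, layer):
        """Serve locally when possible, otherwise forward along the chain."""
        if layer in self.own_layers:
            return self.own_layers[layer]
        for other in self.upstream:
            if layer in other.get_capabilities():
                return other.get_data(layer)
        raise KeyError(layer)


# Assumed topology: F1 -> (F2, F3), F3 -> F4.
f4 = Filter("F4", {"Fault": "<fault-data/>"})
f3 = Filter("F3", upstream=[f4])
f2 = Filter("F2", {"State Boundary": "<boundary-data/>"})
f1 = Filter("F1", upstream=[f2, f3])

print(f1.get_capabilities())
print(f1.get_data("Fault"))
```

A client talking only to F1 sees both layers, although neither is stored there; no central catalog is consulted at any point.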
Database Federation • Middleware consisting of a database management system • Uniform access to a number of heterogeneous data sources • Provides a query language used to combine, contrast, analyze and manipulate the data • Data integration is done through database integration. • Combines data from multiple sources in a single SQL statement – query recreation. • Ex. Ogsa-DAI (Open Grid Service Architecture – Data Access and Integration)
Ogsa-DAI (1) • Provides a common Java API for accessing and integrating data resources (such as relational and XML databases, and files) in a Grid environment • Specifically designed for the OGSA architecture • SQL queries on relational resources and XPath statements on XML collections • Provides data pipelining (similar to Filter chaining) via an XML document called the “perform” document. • Allows developers to easily add or extend functionality within Ogsa-DAI through the “activity” document.
Ogsa-DAI (2) • Data and storage model: • Any data stored in XML or relational databases, or files • No common data model • Data is provided through GDS (Grid Data Services) • Uses Ogsa-DQP (Distributed Query Processor) to coordinate access to multiple data services • The enactment engine is the core of Ogsa-DAI. It orchestrates the running of the perform document • Information in the perform document includes: • The list of activities and their XML schemas and implementation classes. • The list of role mappers and details • The info about the data resource
Ogsa-DAI (3) • Metadata storage model: • Metadata is kept in Catalog Service (MCS) • MCS enables attribute-based querying • Metadata is for the datasets, data can be anything (binary, text ..) • Data integration is done through XML based activity file mixing activities (in SQL queries) and metadata • Simple data access scenario • A client contacts a DAISGR first to locate the GDSFs. • Accesses suitable GDSFs directly to find out more about their properties and the data resources they represent. • Asks GDSF to instantiate a GDS • Accesses resource by sending the GDS the GDS-Perform doc.
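The four steps of the simple data access scenario can be mimicked with small stand-in classes. This is a sketch only: `Registry`, `GridDataServiceFactory` and `GridDataService` are hypothetical stand-ins for DAISGR, GDSF and GDS, the `<perform>`, `<sqlQuery>` and `<expression>` element names are invented for illustration and do not follow the real perform document schema, and an in-memory SQLite table with made-up rows plays the role of the data resource.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Stand-in for a GDS: runs the activities listed in a perform document."""

    def __init__(self, connection):
        self._conn = connection

    def perform(self, perform_doc):
        root = ET.fromstring(perform_doc)
        results = {}
        # Element names are invented for this sketch, not the real schema.
        for activity in root.findall("sqlQuery"):
            query = activity.findtext("expression")
            results[activity.get("name")] = self._conn.execute(query).fetchall()
        return results


class GridDataServiceFactory:
    """Stand-in for a GDSF: describes one data resource and creates GDS instances."""

    def __init__(self, description, connection):
        self.description = description
        self._conn = connection

    def create_service(self):
        return GridDataService(self._conn)


class Registry:
    """Stand-in for the DAISGR: lets clients locate registered factories."""

    def __init__(self):
        self._factories = []

    def register(self, factory):
        self._factories.append(factory)

    def find(self, keyword):
        return [f for f in self._factories if keyword in f.description]


# Toy relational resource with made-up content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faults (name TEXT, length_km REAL)")
conn.execute("INSERT INTO faults VALUES ('Example Fault', 10.0)")

registry = Registry()
registry.register(GridDataServiceFactory("relational: faults", conn))

factory = registry.find("faults")[0]   # steps 1-2: locate and inspect a GDSF
gds = factory.create_service()         # step 3: ask the GDSF for a GDS
result = gds.perform(                  # step 4: send the perform document
    "<perform><sqlQuery name='q1'>"
    "<expression>SELECT name FROM faults</expression>"
    "</sqlQuery></perform>")
print(result)
```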
Ogsa-DAI (4) • Metadata model: • No common schema for metadata like capability • Defines Metadata for the datasets • No schema in XML • Stored in Database tables as attributes • Defines Metadata for the Database system to enable querying and defining activities • Schema in XML (mcsActivity.xsd schema file) • Kept as XML file in the file system (mcsActivity.xml)
ASIS vs. Ogsa-DAI • Ogsa-DAI does not define metadata and data in an XML schema. Metadata is mixed with the database schema. ASIS has predefined data and metadata models. • Ogsa-DAI accepts any data, and uses a predefined database schema to enable querying and accessing data. • ASIS’s data integration is on demand and based on capability federation. In contrast, Ogsa-DAI’s data integration is coded in XML-structured perform and activity documents. • Ogsa-DAI has a central metadata approach (MCS), ASIS has a distributed one. • Both systems are based on Web Services. • Ogsa-DAI uses GridFTP, and ASIS uses NaradaBrokering, to address performance issues in data transfers.
Digital Libraries • Main focus is publishing and discovering digital objects. • Digital Objects: file, URL, SQL command string and any string of bits. • Collects data from multiple different data sources. • Slightly different from the other data integration approaches • Data curation services, such as publishing and removing data from the data sources. • Ex. SRB (Storage Resource Broker) and Sompel’s Digital Library Approach
SRB (1) • A federated client server system • Each server managing/brokering a set of resources • An implementation architecture for • Data grids • Digital Libraries. • Storage resources include digital libraries, MSS, UniTree and file systems • SRB consists of three components • MCAT services, • SRB servers to access to storage repositories and • SRB clients • Mediates access to distributed heterogeneous resources • Uses MCAT (Metadata Catalog Service) to facilitate brokering and attribute based querying. • Integrates data and metadata
SRB (2) • Data and storage model: • Uniform storage interface • Resource-specific drivers to map from defined storage to interface • Storage resources are registered within SRB as physical resources • Logical resources (LSR) enable replication. • LSR = one or more than one physical resource • Client API refers to LSR. Collections are created by LSR • Metadata storage model (MCAT): • Serves both a core-metadata and domain-dependent metadata • Core-metadata is a standardized schema like Dublin Core • Stores metadata about data, collections, users, resources, methods • Attribute based access and querying, updating metadata catalog • Implemented as a relational database. Oracle, DB2 or Sybase • Abstraction and Replica information for data • “Global user” name space and authentication • Authorization through ACL and tickets
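The relationship between logical and physical resources can be illustrated with a toy model. It assumes that a write addressed to a logical resource is copied to every physical resource behind it, which is how replication is pictured here; the class, the dictionary-backed stores and the path are hypothetical and say nothing about how SRB drivers are actually implemented.

```python
class LogicalResource:
    """Hypothetical logical storage resource (LSR) over one or more physical stores."""

    def __init__(self, name, physical_resources):
        self.name = name
        # Each physical resource is modelled as a plain dict: path -> bytes.
        self.physical_resources = physical_resources

    def put(self, path, data):
        # Assumed behaviour: writing to the LSR replicates to every physical store.
        for store in self.physical_resources:
            store[path] = data

    def get(self, path):
        # Any replica can satisfy a read; try the stores in order.
        for store in self.physical_resources:
            if path in store:
                return store[path]
        raise FileNotFoundError(path)


disk, tape = {}, {}
lsr = LogicalResource("demo-lsr", [disk, tape])
lsr.put("/home/demo/myfile", b"payload")

del disk["/home/demo/myfile"]          # simulate losing one replica
print(lsr.get("/home/demo/myfile"))    # still served from the other store
```

The client only names the logical resource, so replicas can be added or lost without changing client code.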
SRB (3) • Metadata and Metadata Exchange Model: • MAPS (Metadata Attribute Presentation Structure) • Independent of the internal representation of the attributes inside the catalog. • Provides a uniform interface specification that can be used between user applications and the MCAT catalog and vice versa. • Structures which form the MAPS: • MAPS_Query_Struct, • MAPS_Result_Struct, • MAPS_Update_Struct and • MAPS_Definition_Struct • Mapping from MAPS to other models and exchange formats. Dublin Core format is under implementation.
SRB (4) • Simple data access scenario: • SRB server spawns an SRB agent to authenticate the user/application by comparing it with information stored in MCAT. • Find the location in MCAT. • Check the user request against permissions stored in MCAT. • SRB agent contacts the user with the result of the request. • SRB agent communicates with the user through a port specific to this client session. • SRB server chaining scenario (integrated SRBs): • First 3 steps from the simple data access case. • SRB agent contacts a remote SRB agent via the remote SRB server. • The second SRB agent returns the pointer to the data item to the first SRB agent, which passes it on to the user. • The SRB client interacts with the data item directly. In the federated SRB scheme, an SRB server acts as a client to another.
ASIS vs. SRB • SRB doesn’t define metadata in XML structure (as ASIS does) • SRB uses any data but ASIS uses ASL • SRB keeps the metadata in Catalogue Services (MCAT). ASIS uses XML structured capability metadata • SRB has central metadata handling approach, ASIS has distributed metadata handling approach • ASIS’s data integration is based on metadata federation, SRB’s data integration is based on SRB server federation. • Instead of Filters, SRB uses SRB server and agents for accessing data resources.
Sompel’s DL (1) • Scholarly communication as a network-based workflow • Instead of Filters and ASL in ASIS, Sompel defines “repositories” and “digital objects”, respectively. • Repository is a networked system that provides services pertaining to a collection of Digital Objects • Repositories have common service interfaces. • “Obtain”, “Harvest” and “Put”. • Two classes of participants. • Data providers (DP) and Service providers (SP) • SP collect metadata from DPs (via 3 service interface); normalize and cluster it to deal with duplicates. • DP offer some type of search mechanism for their own repositories.
Sompel’s DL (2) • Data and storage model: • Data is the abstraction of the Digital Objects • Digital Objects = Digital data + key metadata. • Serialization of Digital Objects = Surrogates • Surrogates • Information for the value chains and service • information used at repository service interfaces. • In the XML/RDF format • Composed of “dataStream” and/or “Entity” tag elements. • Chained object is defined by keymetadataID or “providerInfo”. • Different storage types: book repositories, teaching object repositories, dataset repositories etc. • Repositories are active nodes. Repositories enable the use and re-use of materials in many contexts.
Sompel’s DL (3) • Metadata model: • Surrogates are essentially metadata records for objects • Based on the Dublin Core format with domain specific extensions. • Dublin Core has 15 standard elements to define resources. • For more details see http://dublincore.org • Chaining for integrating data: • Application/User doesn’t need to use a workflow engine or script to create or run the chain. (As in ASIS) • The chain (called a “value chain”) is hidden in the surrogates. • Surrogates are updated through the common interfaces (“put”, “obtain” and “harvest”) of the resources. • The chain is defined in the “Entity” element of the surrogate document with the “Lineage” sub-element. • Sample chaining scenario: • A paper might have references to some papers, and these papers might have references to other papers… • The value chain does not stop. • Papers gain different (value-added) metadata through the value chain
ASIS vs. Sompel’s Approach • Instead of Filters and ASL in ASIS, Sompel defines “repositories” and “digital objects” respectively • DPs correspond to End-Filters, and SPs correspond to Filters in ASIS • ASIS does not have publishing or putting service interfaces • “Obtain” corresponds to “getData” in ASIS • “Harvest” corresponds to “getCapabilities” in ASIS • Both have distributed metadata approaches for data integration • ASIS – direct communication between Filters by using the “getCapabilities” interface • Sompel’s DL – direct communication between repositories and services by using the “Harvest” interface • Sompel’s DL uses Dublin Core for the representation of the resources – ASIS uses its own schema. • ASIS uses ASL for the representation of the data – Sompel’s approach doesn’t have a common data model.
Summary • Application-Integration Framework (ASIS) • Easy to add new sources • Using online Filters providing required adaptors • peer-to-peer chain of Filters • no central metadata catalog server – Distributed capability exchange and aggregation • SOA • Re-usable components (Filters) for different applications in predefined domain • Implications of Filter services • Scalable and Fault-tolerant • Load-balancing and caching • Dynamically updating capability metadata
Capability in Grid Services Security • XPOLA • The infrastructure is built on a peer-to-peer chain-of-trust model. No central admins • WS-Security compliant • Extensible – PKI and SAML based • Dynamic and reusable (manually or automatically generated) • Composed of two sections. • Policy document (SAML, lifetime info, binding info etc.) • Provider’s signature • Existing grid security solutions for fine-grained authorization did not address general Web/Grid services in compliance with Web Services security specs. • Because they rely on central admins, other approaches don’t address dynamic services
Sample Capabilities File (too simplified) – GIS Domain
<?xml version='1.0' encoding="UTF-8" standalone="no" ?>
<!DOCTYPE WMT_MS_Capabilities SYSTEM "http://toro.ucs.indiana.edu:8086/xml/capabilities.dtd">
<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
    <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
    <ContactInformation> ….. </ContactInformation>
  </Service>
  <Capability>
    <Request>
      <GetCapabilities>
        <Format>WMS_XML</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetCapabilities>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <Title>California:Faults</Title>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>
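A client that receives such a capabilities file typically reads the offered formats and layers before issuing a data request. The sketch below parses a trimmed copy of the sample with Python's standard `xml.etree.ElementTree`; the `summarize` helper and the shape of the dictionary it returns are illustrative choices, not part of ASIS or of the OGC specification.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample capabilities file (DOCTYPE and contact info omitted).
CAPABILITIES = """<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
  </Service>
  <Capability>
    <Request>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
      </GetMap>
    </Request>
    <Layer>
      <Name>California:Faults</Name>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>"""


def summarize(capabilities_xml):
    """Extract the service name, map formats and layers from a capabilities file."""
    root = ET.fromstring(capabilities_xml)
    formats = [f.text for f in root.findall("./Capability/Request/GetMap/Format")]
    layers = []
    for layer in root.findall("./Capability/Layer"):
        box = layer.find("LatLonBoundingBox")
        layers.append({
            "name": layer.findtext("Name"),
            "srs": layer.findtext("SRS"),
            "bbox": tuple(float(box.get(k))
                          for k in ("minx", "miny", "maxx", "maxy")),
        })
    return {"service": root.findtext("./Service/Name"),
            "formats": formats,
            "layers": layers}


summary = summarize(CAPABILITIES)
print(summary)
```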
Dublin Core • Challenge of resource description and discovery • Language for making a particular class of statements about resources • There are 2 namespaces – Dublin Core element set (dc) and Dublin Core qualifiers (dcq, ex. dcq:iso8601). • Some of the Dublin Core metadata element set • Title (dc:title), subject, description, creator, publisher, type, format, source, language, rights • Using DC in RDF, specifications for DC in RDF (work in progress) • Resource has (verb) property (dc:creator) X (dc:Ahmet)
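A Dublin Core description is a set of element/value statements about one resource. The sketch below builds such a record with the standard `dc` element namespace (`http://purl.org/dc/elements/1.1/`); the enclosing `<record>` element and the `dublin_core_record` helper are hypothetical conveniences, and the field values are taken from this presentation's title slide.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC)


def dublin_core_record(**fields):
    """Build a record whose children are dc:<element> statements."""
    record = ET.Element("record")  # wrapper element name is an arbitrary choice
    for element, value in fields.items():
        child = ET.SubElement(record, f"{{{DC}}}{element}")
        child.text = value
    return record


record = dublin_core_record(
    title="Approaches to the Integration of Distributed and "
          "Heterogeneous Data Resources",
    creator="Ahmet Sayar",
    publisher="Indiana University Computer Science Department",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each child element corresponds to one statement of the form "resource has property value", for example `dc:creator` with the value "Ahmet Sayar".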
Sample Dublin Core http://www.ils.unc.edu/mrc/jcdl2006/slides/kunze.pdf
OAI • Deals with the e-print server world • Need to develop services that permit searching across papers housed at multiple repositories • Repositories also need capabilities to automatically identify and copy papers that have been deposited in them. • Definition of an interface that permits e-print servers to expose the metadata for the papers they hold. • Service providers with similar metadata standards need to harvest this metadata • Service providers act as a federation of repositories, by indexing documents, so that multiple collections can be searched as though they form a single collection
OAI-PMH • For the variety of communities engaged in publishing content on the Web • Any networked server can employ the protocol to enable service providers to collect its metadata • HTTP-based request-response transaction • Service Providers • Harvest metadata from Data Providers using the OAI protocol and use the returned metadata as a basis for building value-added services. • Data Providers (repositories) • Adopt the OAI technical framework as a means of exposing metadata about their content.
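Because OAI-PMH is an HTTP request-response protocol, a harvest request is simply a URL with a `verb` and a few arguments. The sketch below builds a `ListRecords` request for unqualified Dublin Core (`oai_dc`); the base URL is a placeholder and the helper function is hypothetical, while `verb`, `metadataPrefix`, `set` and `from` are argument names defined by the protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # restrict the harvest to one set
    if from_date:
        params["from"] = from_date    # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL; a real service provider would use a repository's own.
url = list_records_url("http://repository.example.org/oai",
                       from_date="2006-01-01")
print(url)
```

A service provider would issue this request over HTTP, parse the returned XML, and repeat the request with the resumption token whenever the response is split across several pages.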
Comments on OAI • OAI-PMH is ultimately only as useful as the metadata it transports. • The tendency of implementers to almost exclusively apply the lowest common denominator of unqualified dublin core makes it difficult to implement more advanced search interface features. • Content providers should prefer more expressive metadata schema like MARC or qualified DC and find ways to augment human-generated descriptive metadata.
Sompel’s Approach: Hierarchy steps http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
Sompel’s DL: Data Model msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
Ogsa-DAI Figure http://www.globus.org/grid_software/data/dai.php
Perform Document http://www.ogsadai.org.uk/documentation/ogsadai-wsi-2.2/doc/interaction/Perform.html
MCS • MCS presents a design of a Metadata Catalog Service that provides a mechanism for storing and accessing descriptive metadata attributes • Requirements: store domain-independent attributes and user-defined attributes, query with a set of attributes, query with a logical name, authentication, authorization and auditing • Allows users to discover data sets based on the value of descriptive attributes, rather than requiring them to know specific names or physical locations of data items
MCAT vs. MCS • MCAT can be used only with SRB • MCS can be used only in the OGSA architecture • MCAT stores both physical and logical addresses • MCS stores logical metadata attributes and handles that can be resolved by a data location or data access service. • Both can be extended to serve application-specific metadata, but they don’t have a generalized way of doing that.
CLIENT • Example interaction with SRB using Scommands:
• Sinit : Start interaction with SRB
• Spwd : Display current position within the SRB repository
• Smeta -i -I "UDSMD0='author'" -I "UDSMD1='bob'" myfile : Add metadata describing the author of the file
• Smeta -i -I "UDSMD0='author'" -I "UDSMD1='arthur'" : Search for files with author metadata set to arthur
• Sget myFile : Copy myFile from SRB to local storage
• Sreplicate -S anotherResource myFile : Create a replica of myFile on anotherResource
• Srm myFile : Remove myFile (and all replicas) from SRB
• Sexit : End interaction with SRB