MEDIATORS
Mediation • Typical file-sharing systems have a single global schema for describing their data • P2P networks have to consider heterogeneous schemas in the network and have to rely on local transformation mechanisms and rules. • Mediation in TRIPLE. As has been pointed out in [8], defining views appears to be the right means for mediation, especially in the case of schemas or ontologies modeled with the help of description logics. Since current Semantic Web schema/ontology languages build on description logics, e.g., DAML+OIL and its W3C successor OWL [11, 34], a powerful rule language with the capability to define views is a promising candidate for mediation. Especially relevant in our Edutella TRIPLE peer is its capability to define parameterized views, which add the flexibility to define multi-step mappings (by nesting/sequencing such views). This peer (currently being developed as part of the ELENA project) allows advanced querying, inferencing and mediation, and also provides reasoning services used in ELENA for personalization in the context of a smart learning space. It can also express query correspondence assertions and model correspondences as flexible mechanisms for mapping between heterogeneous schemas, as discussed in [21]. • The expansion of these abstract query plans can be based on different strategies, related to the quality of clustering in the P2P network. If the data are clustered well with respect to the queries, it is most efficient to push joins in the query as near as possible to the data sources, and then take the union of the results for these joins. If the clustering does not reflect the partitions needed by the …
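The multi-step mapping idea can be made concrete with a small sketch. The Python fragment below is only an illustration of parameterized, composable views; it is not TRIPLE syntax, and the record fields (LOM-style and Dublin Core names) are invented for the example.

```python
# A sketch of parameterized, composable views; not actual TRIPLE syntax.

def lom_to_dc(record):
    """First mapping step: a LOM-style record onto Dublin Core fields (invented names)."""
    return {"dc:title": record["general.title"],
            "dc:creator": record["lifecycle.contributor"]}

def dc_to_local(record, language="en"):
    """Second, parameterized step: Dublin Core onto a local peer schema."""
    return {"title": record["dc:title"],
            "author": record["dc:creator"],
            "lang": language}

def compose(*views):
    """Nesting/sequencing views yields a multi-step mapping."""
    def mapping(record, **params):
        for view in views[:-1]:
            record = view(record)
        return views[-1](record, **params)
    return mapping

lom_to_local = compose(lom_to_dc, dc_to_local)
print(lom_to_local({"general.title": "P2P Networks",
                    "lifecycle.contributor": "Nejdl"}, language="de"))
```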
Information Integration (Ullman, 1997) • Aerosol-relevant information arises from a multiplicity of sources, each having a specific evolution history, driving forces, formats, etc. • Data analysis, i.e. the transformation of raw data into 'actionable' knowledge, requires data from numerous sources • Hence, there is value in combining information from various sources, but there are problems: • Legacy data systems cannot be altered to support integration • Data systems use different terms, or different meanings of similar terms • Distributed sources, such as the web, may not have a schema at all.
Integration Architecture (Ullman, 1997) • Heterogeneous sources are wrapped by software that translates between each source's local language, model and concepts and the shared global concepts • Mediators obtain information from one or more components (wrappers or other mediators) and pass it on to other mediators or to external users. • In a sense, a mediator is a view of the data found in one or more sources; it does not hold the data, but it acts as if it did. The job of the mediator is to go to the sources and provide an answer to the query, as the sketch below illustrates.
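A minimal sketch of this wrapper/mediator pattern; the source formats and the shared global concept names (site, date, concentration) are illustrative assumptions, not taken from a specific system.

```python
# Wrappers translate local models into shared global concepts;
# the mediator holds no data and answers queries by visiting its components.

class CsvWrapper:
    """Wraps a source whose local model is 'site,day,value' text rows."""
    def __init__(self, rows):
        self.rows = rows
    def query(self, site):
        for row in self.rows:
            s, day, value = row.split(",")
            if s == site:
                # translate local terms into the shared global concepts
                yield {"site": s, "date": day, "concentration": float(value)}

class DictWrapper:
    """Wraps a source whose local model uses different field names."""
    def __init__(self, records):
        self.records = records
    def query(self, site):
        for r in self.records:
            if r["station"] == site:
                yield {"site": r["station"], "date": r["when"],
                       "concentration": r["conc"]}

class Mediator:
    """A view over its components: wrappers or other mediators."""
    def __init__(self, *components):
        self.components = components
    def query(self, site):
        for component in self.components:
            yield from component.query(site)

m = Mediator(CsvWrapper(["STL,2003-07-01,12.4"]),
             DictWrapper([{"station": "STL", "when": "2003-07-02", "conc": 9.8}]))
print(list(m.query("STL")))
```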
Mediation (Wiederhold): Slide show on Mediation • We have worked for many years on information architectures which are intended to provide extra value to the customers, beyond the value provided by the information sources. The modules, called Mediators, are interposed between databases and other information sources, and client applications. They often carry out the roles that used to be performed by human intermediaries: reviewers, abstracters, critics, writers of surveys and anthologies, staff experts, advice givers such as consumer organizations and colleagues, librarians, and the person sitting next to you on a bus. We have less access to such human resources now. The resulting disintermediation leads to problems of access, information overload, and maintenance of sharable information resources. We observed that there were many ad-hoc tools and approaches being developed to deal with this issue, including work in our own DARPA-funded KBMS project. An early paper, mapping earlier work into a simple concept and architecture, was written in 1991: • Mediators. Information pops up on the web spontaneously … no master plan, but it is awkward to obtain due to the lack of a catalog. We need services, composed of software and people, that select, filter, digest, integrate, and abstract data for specific topics of interest [Resnick:97]. There will be meta-services as well, helping to locate those services and reporting on their quality. We refer to the combination of experts and software that performs these functions as mediators. • Functions. The role of mediators is to translate data to information for multiple customers by intelligent processing, statistics or other means [W:91]. Traditional middleware [Kleinrock:94] connects and transports data, but a mediator also transforms the content.
Functional Service Layers (diagram): the Client layer holds the human-computer interaction, user interface and application-specific code; the Mediation Services layer holds the service interface and domain-specific code; the Available Sources layer holds the resource access interface, source-specific code and the real-world interface.
Architecture instances (diagram): Applications sit on top of Mediators, which sit on top of Resources; resources include computational resources.
Architecture of Dvoy Federated Information System (after Busse et al., 1999) • The main software components of Dvoy are wrappers, which encapsulate sources and remove technical heterogeneity, and mediators, which resolve the logical heterogeneity. • Wrapper classes are available for geo-spatial (incl. satellite) images, SQL servers, text files, etc. The mediator classes are implemented as web services for uniform data access to n-dimensional data (see the sketch below).
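As a sketch, the uniform access that such mediator web services provide can be thought of as one abstract n-dimensional interface that every wrapper class implements; the method name and signature below are assumptions, not the actual Dvoy API.

```python
# Each wrapper hides its source's technical details behind one interface.
from abc import ABC, abstractmethod

class Wrapper(ABC):
    """Encapsulates one source and removes its technical heterogeneity."""
    @abstractmethod
    def subcube(self, x_range, y_range, z_range, t_range):
        """Return the data falling inside the given (X, Y, Z, T) ranges."""

class SqlWrapper(Wrapper):
    def subcube(self, x_range, y_range, z_range, t_range):
        ...  # translate the ranges into a WHERE clause for the SQL server

class ImageWrapper(Wrapper):
    def subcube(self, x_range, y_range, z_range, t_range):
        ...  # crop the georeferenced image to the requested lon/lat window
```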
NSF Middleware • Middleware: The Basics • Middleware is software that connects two or more otherwise separate applications across the Internet or local area networks. More specifically, the term refers to an evolving layer of services that resides between the network and more traditional applications for managing security, access and information exchange, in order to: • Let scientists, engineers, and educators transparently use and share distributed resources, such as computers, data, networks, and instruments • Develop effective collaboration and communications tools, such as Grid technologies, desktop video, and other advanced services, to expedite research and education • Develop a working architecture and approach that can be extended to the larger set of Internet and network users • Middleware makes resource sharing seem transparent to the end user, providing consistency, security, privacy, and capabilities.
4-D Geo-Environmental Data Cube (X, Y, Z, T) Environmental data represent measurements in the physical world, which has space (X, Y, Z) and time (T) as its dimensions. The specific inherent dimensions for geo-environmental data are: Longitude X, Latitude Y, Elevation Z and DateTime T. Finding, sharing and integrating geo-environmental data requires that the data are 'coded' in this 4-D data space, at the minimum. Additional …
Hierarchy of Data Objects: DataGranule, DataSeries, DataCube • Measure: a measure (in OLAP terminology) represents numerical values for a specific entity to be analyzed (e.g. temperature, wind speed, pollutant). A collection of measures forms a special dimension 'Measures' (??Can Measures be Dimensions??) • Data Granules: data granules are discrete, atomic data entities that cannot be further broken down. • DataSeries: a data series is an ordered collection of data granules; a DataSeries is a collection of DataGranules having common attributes. • DataCube: all data points in a measure represent the same measured parameter, e.g. temperature; hence they share the same units and dimensionality. The data points of a measure are enclosed in a conceptual multidimensional data cube; each data point occupies a volume (slice or point) in the data cube. Data points in a measure share the same dimensions; conversely, each data point has its dimensional coordinates in the data cube of the measure it belongs to.
(Diagram: a 1-D DataGranule series, a 2-D Data Array, and a 3-D DataCube spanning Dimensions X, Y and Z.)
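A minimal sketch of this hierarchy in Python dataclasses, assuming the 4-D (X, Y, Z, T) coding above; the concrete representation (ISO date strings, lists) is an assumption for illustration.

```python
# DataGranule -> DataSeries -> DataCube, keyed by the 4-D (X, Y, Z, T) space.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class DataGranule:
    """Atomic data entity: one value of one measure at one 4-D coordinate."""
    x: float   # Longitude
    y: float   # Latitude
    z: float   # Elevation
    t: str     # DateTime (ISO 8601 string here for simplicity)
    value: float

@dataclass
class DataSeries:
    """Ordered collection of granules sharing common attributes."""
    granules: List[DataGranule] = field(default_factory=list)

@dataclass
class DataCube:
    """All granules of one measure share the same units and dimensionality."""
    measure: str   # e.g. 'temperature'
    unit: str      # e.g. 'K'
    series: List[DataSeries] = field(default_factory=list)
```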
Multi-Dimensional Data Model (diagram: a Data Space with Views 1 and 2 along dimensions i, j, k) • Data can be distributed over 1, 2, … n dimensions: 1-dimensional, e.g. Time; 2-dimensional, e.g. Location & Time; 3-dimensional, e.g. Location, Time & Parameter • Views are orthogonal slices through multidimensional data cubes • Spatial and temporal slices through the data are the most common, as sketched below.
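A short numpy sketch of views as orthogonal slices; the axis order (T, Y, X) and the cube contents are assumptions for the example.

```python
# Orthogonal slices through a 3-D data cube.
import numpy as np

cube = np.random.rand(24, 180, 360)   # hypothetical cube: T x Y x X

xy_map = cube[6, :, :]                # spatial slice: XY map at a fixed time
time_series = cube[:, 90, 180]        # temporal slice at one location
```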
Image registration parameters: src_lon_min, src_lon_max, src_lat_min, src_lat_max, src_img_width, src_img_height, src_margin_left, src_margin_right, src_margin_top, src_margin_bottom.
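One plausible use of these registration parameters is a linear mapping from (lon, lat) to pixel coordinates, with the margins framing the plot area inside the image. The exact Dvoy formula is not given in the slides, so the sketch below is an assumption.

```python
# Linear georeferencing: lon/lat -> pixel, margins frame the plot area.
def lonlat_to_pixel(lon, lat, p):
    plot_w = p["src_img_width"] - p["src_margin_left"] - p["src_margin_right"]
    plot_h = p["src_img_height"] - p["src_margin_top"] - p["src_margin_bottom"]
    fx = (lon - p["src_lon_min"]) / (p["src_lon_max"] - p["src_lon_min"])
    fy = (p["src_lat_max"] - lat) / (p["src_lat_max"] - p["src_lat_min"])
    return (p["src_margin_left"] + fx * plot_w,
            p["src_margin_top"] + fy * plot_h)

params = {"src_lon_min": -130.0, "src_lon_max": -60.0,
          "src_lat_min": 20.0, "src_lat_max": 55.0,
          "src_img_width": 800, "src_img_height": 450,
          "src_margin_left": 40, "src_margin_right": 10,
          "src_margin_top": 10, "src_margin_bottom": 30}
print(lonlat_to_pixel(-90.2, 38.6, params))
```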
Dvoy: Components and Distributed Image Data Flow • Dvoy Components: Distributed Data, Image Registration, Data Catalog, Data Query, Image Delivery, Image Viewer • The Image Data Browser selects data by Measure, X, Y, Z, T and displays an XY map with Z and T fixed • Distributed measures (e.g. Elevation, TOMS, SeaWiFS) are each delivered through their own Web Service.
The 'Minimal' Star Schema For integrative, cross-Supersite analysis, with data queries by time, location and parameter, the database has to have time, location and parameter as dimensions • The minimal Site table includes Site_ID, Name and Lat/Lon. • The minimal Parameter table consists of Parameter_ID, Description and Unit • The time dimension table is usually skipped, since time is self-describing • The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes: Obs_DateTime, Site_ID and Parameter_ID The above minimal (multidimensional) schema has been used in the CAPITA data exploration software, Voyager, for the past 22 years, encoding 1000+ datasets. Most Supersite data require a more elaborate schema to fully capture the content.
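Rendered as SQL DDL, the minimal star schema might look like the following sqlite3 sketch; the column types are assumptions, since the original CAPITA/Voyager types are not specified here.

```python
# The minimal star schema: two dimension tables plus one fact table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Site      (Site_ID INTEGER PRIMARY KEY, Name TEXT,
                        Lat REAL, Lon REAL);
CREATE TABLE Parameter (Parameter_ID INTEGER PRIMARY KEY,
                        Description TEXT, Unit TEXT);
CREATE TABLE Fact      (Obs_DateTime TEXT,  -- time dimension is self-describing
                        Site_ID INTEGER REFERENCES Site,
                        Parameter_ID INTEGER REFERENCES Parameter,
                        Obs_Value REAL);
""")
```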
Database Schema Design • Fact Table: a fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method. • Star Schema consists of a central fact table surrounded by de-normalized dimensional tables (blue) describing the sites, parameters, methods, etc. • Snowflake Schema is an extension of the star schema where each point of the star 'explodes' into further fully normalized tables, expanding the description of each dimension. • A snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.
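Continuing the sqlite3 sketch above, the characteristic star-schema query joins the fact table to its dimension tables and filters along each dimension; the parameter name and date range are illustrative.

```python
# A typical star join: fact table filtered via its dimension tables.
rows = db.execute("""
    SELECT s.Name, p.Description, f.Obs_DateTime, f.Obs_Value
    FROM   Fact f
    JOIN   Site s      ON s.Site_ID = f.Site_ID
    JOIN   Parameter p ON p.Parameter_ID = f.Parameter_ID
    WHERE  p.Description = 'PM2.5'
      AND  f.Obs_DateTime BETWEEN '2001-07-01' AND '2001-07-31'
""").fetchall()
print(rows)
```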
Extended Star Schema for SRDS The Supersite program employs a variety of instruments, sampling and analysis procedures. Hence, at least one additional dimension table is needed for Methods. An example extended star schema encodes the IMPROVE relational database (B. Schichtel).
Snowflake Example: Central Calif. AQ Study, CCAQS The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis. The fully relational CCAQS schema permits the enforcement of integrity constraints, and it has been demonstrated to be useful for data entry/verification. However, no two snowflakes are identical. Similarly, the rich snowflake schemata for one sampling/analysis environment cannot be easily transplanted elsewhere. More importantly, many of the recorded parameters 'on the fringes' are not particularly useful for integrative, cross-Supersite, regional analyses. Hence the shared (exposed) subset of the entire data set may consist of a small subset of the 'snowflake'.
From Heterogeneous to Homogeneous Schema • Individual Supersite SQL databases can be queried along spatial, temporal and parameter dimensions. However, the query needed to retrieve the same information depends on the schema of the particular database. • A way to homogenize the distributed data is to access all the data through a Data Adapter, using only a subset of the tables/fields from any particular database (red) • The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by consensus. (Diagram: a Data Adapter extracts homogeneous data, the Uniform Schema fact table, from heterogeneous sources; see the sketch below.)
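A sketch of such a Data Adapter: it runs a database-specific query over the exposed subset of tables/fields and yields rows re-labeled into the uniform minimal-star layout. The local table and column names are invented for illustration.

```python
# One adapter per member database; only the uniform fields leave the adapter.
import sqlite3

UNIFORM_FIELDS = ("Obs_DateTime", "Site_ID", "Parameter_ID", "Obs_Value")

class SupersiteAdapter:
    """Wraps one member database; local_sql is that database's own query."""
    def __init__(self, connection, local_sql):
        self.connection = connection
        self.local_sql = local_sql

    def fetch_uniform(self):
        # Re-label each row with the uniform minimal-star fields
        for row in self.connection.execute(self.local_sql):
            yield dict(zip(UNIFORM_FIELDS, row))

# Hypothetical member database with its own local schema
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE Meas (SampleTime TEXT, StationCode INTEGER, "
              "Pollutant INTEGER, Reading REAL)")
local.execute("INSERT INTO Meas VALUES ('2001-07-04 12:00', 17, 88501, 23.5)")

adapter = SupersiteAdapter(local,
    "SELECT SampleTime, StationCode, Pollutant, Reading FROM Meas")
print(list(adapter.fetch_uniform()))
```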
Service Chaining in Spatio-Temporal Data Browser (diagram) • Data Sources (SQL tables, GIS data, satellite images, n-dim data cubes) are wrapped and homogenized into XDim data • A Mediator and a Catalog maintain and find/bind the data • OGC-compliant GIS services provide Spatial Portrayal and Spatial Overlay of vector and spatial slices • Time-Series services provide Time Portrayal and Time Overlay of time slices • A Cursor/Controller coordinates the OLAP client browser through the chain: maintain data, find/bind data, portray, overlay, render.
Overlay of Multiple Datasets (diagram: a 3-D DataCube and a 2-D DataCube feeding DataViews 1-3 as Layer 1, Layer 2 or a Null Layer) • Each DataCube may have 0-n dimensions • Each dimension is assigned a view • In a view, the number of layers is the number of datasets • If a DataCube does not have data for a view, a Null Layer is assigned
(Companion diagram: the same overlay of multiple datasets, additionally distinguishing Data Access Connections from Data Render Connections.)
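The layer-assignment rule can be sketched in a few lines; the view keys and the cube representation below are invented for illustration.

```python
# One layer per dataset in a view; a Null Layer when a cube lacks that view.
NULL_LAYER = None

def layers_for_view(view, cubes):
    return [cube.get(view, NULL_LAYER) for cube in cubes]

cubes = [{"XY": "ozone map", "T": "ozone series"},   # 2-D + time cube
         {"XY": "elevation map"}]                    # static 2-D cube
print(layers_for_view("T", cubes))   # -> ['ozone series', None]
```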
Dvoy Data Flow and Processes (pipeline: Physical Data → Wrapper → Abstract Data Access → Abstract Data → View Portrayal → View Data → Device Portrayal → Device Driver → Render/Transmission) • Physical Data reside in servers; data are accessed by view-specific wrappers yielding homogeneous abstract data 'slices' • Abstract Data are virtual; abstract data are requested by viewers, and homogeneous real data are delivered through the abstract interface • View Data are data enriched for portrayal; view data from the abstract interface are enriched with parameters useful for portrayal/processing
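The three data states can be sketched as a short pipeline; the stage names follow the slide, while the functions and fields are placeholders.

```python
# Physical data -> abstract slice -> view data enriched for portrayal.

def wrap(physical_rows):
    """View-specific wrapper: physical data -> homogeneous abstract slice."""
    return [{"x": r[0], "y": r[1], "value": r[2]} for r in physical_rows]

def enrich(abstract_slice, palette="viridis"):
    """Abstract data -> view data, adding portrayal parameters."""
    return {"data": abstract_slice, "palette": palette, "legend": True}

view_data = enrich(wrap([(-90.2, 38.6, 41.0)]))
print(view_data)
```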
Three-Tier Federated Data Warehouse Architecture (Note: in this context, 'Federated' differs from 'Federal' in the direction of the driving force: 'Federated' indicates a driving force for sharing from the bottom up, i.e. from the members, not dictated from above by the Feds) • Provider Tier: back-end servers containing heterogeneous data, maintained by the federation members • Proxy Tier: retrieves designated Provider data and homogenizes it into common, uniform Datasets • User Tier: accesses the Proxy Server and uses the uniform data for presentation, integration or processing
Federated Data Warehouse Interactions (diagram: the User Tier consumes data for presentation, processing and integration; the Proxy Tier holds the Proxy Server; the Provider Tier holds the member servers, SQLDataAdapter1/SQLServer1, SQLDataAdapter2/SQLServer2 and CustomDataAdapter/LegacyServer, behind a firewall and the Federation Contract; a Web Service carries the uniform query and data) • The Provider servers interact only with the Proxy Server, in accordance with the Federation Contract • The contract sets the rules of interaction (accessible data subsets, types of queries) • Strong server security measures are enforced, e.g. through the Secure Sockets Layer • The data User interacts only with the generic Proxy Server, using a flexible Web Services interface • Generic data queries are applicable to all data in the Warehouse (e.g. a data sub-cube by space, time and parameter) • The data query is addressed to the Web Service provided by the Proxy Server • Uniform, self-describing data packages are passed to the user for presentation or further processing, as sketched below
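A sketch of the Proxy Tier's generic query service: one uniform query shape fans out to the member adapters and comes back as a single self-describing package. The function and field names are assumptions; per the slides, the real service exchanged self-describing XML packages over a Web Service interface.

```python
# Proxy Tier sketch: generic space/time/parameter query over all members.

def proxy_query(adapters, bbox, time_range, parameter):
    """Fan one generic query out to all member adapters; return one package."""
    query = {"bbox": bbox, "time": time_range, "parameter": parameter}
    rows = [row for a in adapters for row in a.fetch(query)]
    return {"query": query,   # self-describing: the package echoes its query
            "fields": ["Obs_DateTime", "Site_ID", "Parameter_ID", "Obs_Value"],
            "rows": rows}

class StubAdapter:
    """Stand-in for a member-server adapter behind the firewall."""
    def fetch(self, query):
        return [("2001-07-04 12:00", 17, 88501, 23.5)]

package = proxy_query([StubAdapter()],
                      bbox=(-130, 20, -60, 55),
                      time_range=("2001-07-01", "2001-07-31"),
                      parameter="PM2.5")
print(package)
```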
Federated Data Warehouse Architecture • Three-tier architecture consisting of: • Provider Tier: back-end servers containing heterogeneous data, maintained by the federation members • Proxy Tier: retrieves designated Provider data and homogenizes it into common, uniform Datasets • User Tier: accesses the Proxy Server and uses the uniform data for presentation, integration or further processing • The Provider servers interact only with the Proxy Server, in accordance with the Federation Contract • The contract sets the rules of interaction (accessible data subsets; types of queries submitted by the Proxy) • The Proxy layer allows strong security measures, e.g. through the Secure Sockets Layer • The data User interacts only with the generic Proxy Server, using a flexible Web Services interface • Generic data queries are applicable to all data in the Warehouse (e.g. a space/time/parameter data sub-cube) • The data query is addressed to a Web Service provided by the Proxy Server of the Federation • Uniformly formatted, self-describing XML data packages are handed to the user for presentation or further machine processing (Diagram: as on the previous slide, with an ImageDataAdapter2/ImageServer2 in place of the second SQL adapter.)