Database Schema Design for the Federated Data Warehouse
Nov 27, 2001, RBH
The ‘Minimal’ Star Schema
For integrative, cross-Supersite analysis, where data are queried by time, location and parameter, the database must have time, location and parameter as dimensions:
• The minimal Site table includes Site_ID, Name and Lat/Lon.
• The minimal Parameter table consists of Parameter_ID, Description and Unit.
• The time dimension table is usually skipped, since time is self-describing.
• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes: Obs_DateTime, Site_ID and Parameter_ID.
This minimal (multidimensional) schema has been used in the CAPITA data exploration software, Voyager, for the past 22 years, encoding 1000+ datasets. Most Supersite data require a more elaborate schema to fully capture their content. A sketch of the minimal schema follows below.
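As a concrete illustration only, here is a minimal sketch of this schema using Python's built-in sqlite3 module; the sample date range in the query is an assumption for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Site (
        Site_ID      INTEGER PRIMARY KEY,
        Name         TEXT,
        Lat          REAL,
        Lon          REAL
    );
    CREATE TABLE Parameter (
        Parameter_ID INTEGER PRIMARY KEY,
        Description  TEXT,
        Unit         TEXT
    );
    -- Time is self-describing, so Obs_DateTime is stored directly in the
    -- fact table rather than in a separate time dimension table.
    CREATE TABLE Fact (
        Obs_DateTime TEXT,
        Site_ID      INTEGER REFERENCES Site(Site_ID),
        Parameter_ID INTEGER REFERENCES Parameter(Parameter_ID),
        Obs_Value    REAL
    );
""")

# Query the fact table along the three dimensions: time, location, parameter.
rows = con.execute("""
    SELECT f.Obs_DateTime, s.Name, p.Description, f.Obs_Value, p.Unit
    FROM Fact f
    JOIN Site s      ON s.Site_ID      = f.Site_ID
    JOIN Parameter p ON p.Parameter_ID = f.Parameter_ID
    WHERE f.Obs_DateTime BETWEEN '2001-07-01' AND '2001-07-31'
""").fetchall()
```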
Database Schema Design
• Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.
• Star Schema: A star schema consists of a central fact table surrounded by de-normalized dimension tables (blue) describing the sites, parameters, methods, etc.
• Snowflake Schema: A snowflake schema is an extension of the star schema in which each point of the star ‘explodes’ into further, fully normalized tables, expanding the description of each dimension (see the sketch below).
• A snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.
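As a hypothetical sketch of ‘snowflaking’, the Parameter dimension below is normalized into separate Unit and ParameterGroup tables; the table and column names are assumptions for illustration, not taken from any specific Supersite database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Unit (
        Unit_ID   INTEGER PRIMARY KEY,
        Unit_Name TEXT          -- e.g. 'ug/m3'
    );
    CREATE TABLE ParameterGroup (
        Group_ID   INTEGER PRIMARY KEY,
        Group_Name TEXT         -- e.g. 'PM2.5 speciation'
    );
    -- The Parameter dimension now points to further normalized tables,
    -- one 'arm' of the snowflake.
    CREATE TABLE Parameter (
        Parameter_ID INTEGER PRIMARY KEY,
        Description  TEXT,
        Unit_ID      INTEGER REFERENCES Unit(Unit_ID),
        Group_ID     INTEGER REFERENCES ParameterGroup(Group_ID)
    );
""")
```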
Extended Star Schema for SRDS
The Supersite program employs a variety of instruments and sampling/analysis procedures. Hence, at least one additional dimension table is needed for Methods. An example extended star schema encodes the IMPROVE relational database (B. Schichtel).
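A minimal sketch of such an extended star schema is given below, building on the minimal schema above; the Method table columns are illustrative assumptions and do not reproduce the actual SRDS/IMPROVE design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Method (
        Method_ID   INTEGER PRIMARY KEY,
        Sampling    TEXT,   -- e.g. filter, denuder, continuous monitor
        Analysis    TEXT,   -- e.g. ion chromatography, XRF
        Description TEXT
    );
    -- The fact table gains a fourth dimensional key, Method_ID.
    CREATE TABLE Fact (
        Obs_DateTime TEXT,
        Site_ID      INTEGER,
        Parameter_ID INTEGER,
        Method_ID    INTEGER REFERENCES Method(Method_ID),
        Obs_Value    REAL
    );
""")
```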
Snowflake Example: Central Calif. AQ Study, CCAQS
The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis. The fully relational CCAQS schema permits the enforcement of integrity constraints and has been shown to be useful for data entry/verification. However, no two snowflakes are identical: a rich snowflake schema built for one sampling/analysis environment cannot easily be transplanted elsewhere. More importantly, many of the recorded parameters ‘on the fringes’ are not particularly useful for integrative, cross-Supersite, regional analyses. Hence the shared (exposed) subset of the entire data set may consist of only a small part of the ‘snowflake’.
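The sketch below shows one way such an exposed subset might be defined, as a view over a richer schema; the SampleEvent and LabResult tables are hypothetical placeholders, not the actual CCAQS schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SampleEvent (
        Event_ID INTEGER PRIMARY KEY, Site_ID INTEGER, Start_Time TEXT,
        Cartridge_ID TEXT, Chain_Of_Custody TEXT   -- QA/QC detail, not shared
    );
    CREATE TABLE LabResult (
        Event_ID INTEGER, Parameter_ID INTEGER, Value REAL, Flag TEXT
    );
    -- Only the columns needed for cross-Supersite analysis are exposed.
    CREATE VIEW Shared_Fact AS
        SELECT e.Start_Time   AS Obs_DateTime,
               e.Site_ID,
               r.Parameter_ID,
               r.Value        AS Obs_Value
        FROM SampleEvent e JOIN LabResult r ON r.Event_ID = e.Event_ID;
""")
```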
From Heterogeneous to Homogeneous Schema
• Individual Supersite SQL databases can be queried along the spatial, temporal and parameter dimensions. However, the query needed to retrieve the same information depends on the schema of the particular database.
• A way to homogenize the distributed data is to access all the data through a Data Adapter, using only a subset of the tables/fields from any particular database (red); a sketch of such an adapter follows below.
• The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by consensus.
[Figure: extraction of homogeneous data from heterogeneous sources; the subset used (red) from each source is mapped through a Data Adapter into the Uniform Schema fact table.]
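A possible shape for such a Data Adapter is sketched here; the native table and column names (obs_table, site_code, ...) are hypothetical, and each Supersite database would need its own mapping, but every adapter returns the same uniform fields.

```python
import sqlite3

UNIFORM_FIELDS = ("Obs_DateTime", "Site_ID", "Parameter_ID", "Obs_Value")

def extract_uniform(con, start, end, parameter_id):
    """Map one native database onto the uniform (minimal star) fact records."""
    native_query = """
        SELECT o.sample_time, o.site_code, o.param_code, o.measured_value
        FROM obs_table o
        WHERE o.sample_time BETWEEN ? AND ? AND o.param_code = ?
    """
    return [dict(zip(UNIFORM_FIELDS, row))
            for row in con.execute(native_query, (start, end, parameter_id))]

# Illustrative native source with one record.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs_table (sample_time TEXT, site_code INTEGER, "
            "param_code INTEGER, measured_value REAL)")
con.execute("INSERT INTO obs_table VALUES ('2001-07-15T12:00', 101, 88101, 23.4)")
print(extract_uniform(con, '2001-07-01', '2001-07-31', 88101))
```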
Data Warehouse Features
• As much as possible, data should reside in their respective home environments. ‘Uprooted’ data in decoupled databases tend to decay, i.e. they cannot easily be updated, maintained or enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison along the key dimensions (space, time, parameter, method); a sketch of such a query is given below.
• The open-architecture data warehouse (based on Web Services) promotes the building of further value chains: data viewers, data integration programs, automatic report generators, etc.
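For illustration only, the sketch below shows what an abstract query over the key dimensions might look like when fanned out to per-database adapters; the registry and function signature are assumptions, since the actual warehouse exposes this through Web Services rather than in-process calls.

```python
from typing import Callable, Dict, List

# Each federated database registers an adapter with the same signature:
# (start, end, parameter_id) -> list of uniform fact records.
Adapter = Callable[[str, str, int], List[dict]]
ADAPTERS: Dict[str, Adapter] = {}   # e.g. {"Pittsburgh": ..., "Fresno": ...}

def query_warehouse(start: str, end: str, parameter_id: int) -> List[dict]:
    """Run the same dimensional query against every federated database."""
    results: List[dict] = []
    for source_name, adapter in ADAPTERS.items():
        for record in adapter(start, end, parameter_id):
            record["Source"] = source_name   # keep track of the home database
            results.append(record)
    return results
```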