1.39k likes | 1.66k Views
NIC Exposure Level Training. Vijayendra Gururao Business Intelligence Consultant. Agenda. Data warehousing Concepts - Day 1 Govt Case Study - Day 2 Defense HR Case study - Day 3 Manufacturing Case Study - Day 4 Data mining - Day 5. What to do about it . 2001. Active. Act.
E N D
NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant
Agenda • Data warehousing Concepts - Day 1 • Govt Case Study - Day 2 • Defense HR Case study - Day 3 • Manufacturing Case Study - Day 4 • Data mining - Day 5
What to do about it 2001 Active Act (Intelligent Agents) What it means Recommend 1996 (Data Mining) Action What’s important Analyze (OLAP) 1991 Focus (EIS) 1986 Summarize (MIS) Report Passive Technology Human Type of Analysis The Evolution of Business Intelligence
Introduction: • Definitions • Legacy Systems • Dimensions • Data Dependencies Model • Dimensional Model
ShipType Shipper Ship To Product DistrictCredit OrderItem ContactLocat. ProductLine SalesOrder Cust.Locat. ProductGroup Contract ContractType Customer Contact SalesRep SalesDistrict SalesRegion SalesDivision An ER Model
Why Data Warehouses? • To meet the long sought after goal of providing the user with more flexible data bases containing data that can be accessed “every which way.”
OLTP vs. OLAP • OLTP (Online transaction processing) has been the standard reason for IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access. • OLAP (Online analytical processing) is a set of procedures for defining and using a dimension framework for decision support
The Goals for and Characteristics of a DW • Make organizational data accessible • Facilitate consistency • Adaptable and yet resilient to change • Secure and reliable • Designed with a focus on supporting decision making
The Goals for and Characteristics of a DW • Generate an environment in which data can be sliced and diced in multiple ways • It is more than data, it is a set of tools to query, analyze, and present information • The DW is the place where operational data is published (cleaned up, assembled, etc.)
Data Warehousing is Changing! Application requirements--not just data requirements--are now driving need. Customer Relationship Mgmt. Campaign Management ERP Call Center Knowledge Management ERP Target Marketing Supply Chain E-commerce
Organization of data in the presentation area of the data warehouse • Data in the warehouse are dimensional, not normalized relations • However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs • Data should be atomic someplace in the warehouse; even if the presentation is aggregate • Uses the bus architecture to support a decentralized set of data marts
Updates to a data warehouse • For many years, the dogma stated that data warehouses are never updated. • This is unrealistic since labels, titles, etc. change. • Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)
Basic elements of the data warehouse Operational Source Systems DataStaging Area DataPresentation Area DataAccessTools • Services: • Clean, combine, and standardizeConform DimensionsNo user query services • Data Store: • Flat files and relational tables • Processing: • Sorting and sequential processing • Data Mart #1 • DimensionalAtomic and summary dataBased on a single business process Ad hoc query tools Report Writers Analytical Applications Modeling: Forecasting Scoring Data Mining Extract Load Access DW Bus:Conformed facts and dimensions Extract • Data Mart #2 • Similar design Extract Load Access
Data Staging Area • Extract-Transformation-Load • Extract: Reading the source data and copying the data to the staging area • Transformation: • Cleaning • Combining • Duplicating • Assigning keys • Load: present data to the bulk loading facilities of the data mart
Dimensional Modeling Terms and Concepts • Fact table • Dimension tables
Fact Tables • Fact table: a table in the data warehouse that contains • Numerical performance measures • Foreign keys that tie the fact table to the dimension tables
Fact Tables • Each row records a measurement describing a transaction • Where? • When? • Who? • How much? • How many? • The level of detail represented by this data is referred to as the grain of the data warehouse • Questions can only be asked down to a level corresponding with the grain of the data warehouse
Dimension tables • Tables containing textual descriptors of the business • Dimension tables are usually wide (e.g., 100 columns) • Dimension tables are usually shallow (100s of thousand or a few million rows) • Values in the dimensions usually provide • Constraints on queries (e.g., view customer by region) • Report headings
Dimension tables • The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes • Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations • Product part Product Brand • Dimensions are usually highly denormalized
Dimension tables • The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
Bringing together facts and dimensions – Building the dimensional Model • Start with the normalized ER Model • Group the ER diagram components into segments based on common business processes and model each as a unit • Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table • Denormalize the other tables as needed and designate one field as a primary key
Sales Fact Time Dimension Product Dimension time_key product_key store_key dollars_sold units_sold dollars_cost time_key day_of_Week month quarter year holiday_flag product_key description brand category Store Dimension store_key store_name address floor_plan_type A Dimensional Model
Kimball Methodology Conformed Dimensions
Review: A Private Data Mart • A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build. • What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
Conformed Dimensions Understand Conformed Dimensions • A conformed dimension is a dimension that is standardized across all data marts. Time Location Sales Fact Conformed Dimensions Customer Distribution Fact Distributor Promotion Product Order Fact
Advantages of Conformed Dimensions • Deliver incremental data marts in a short period of time. • Independent data marts become part of a fully integrated data warehouse. • Deliver a consistent view across your business.
Dimensions Promotion Distributor Customer Location Product Time Facts X X X X X Sales Fact Distribution Fact X X X X X X X Order Fact Conformed Dimensions Within Bus Architecture • Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.
Design of Conformed Dimensions • A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate. • Lay out a broad dimensional map for the enterprise. • Define conformed dimensions at the most granular (atomic) level possible. • Conformed dimensions should always use surrogate keys. • Define standard definitions for dimension and fact attributes.
Granularity in Conformed Dimensions • Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table. Order FactDay IdProduct IdCustomer IdCostNumberOrdered D_TimeDayDay IdDayMonth Year D_ProductProduct IdDescriptionProduct TypeType DescriptionProduct LineLine Description D_CustomerCustomer IdLast NameFirst NameAddress
Flexibility of Conformed Dimensions • Conformed dimensions are usually designed within star schema data marts. For multiple granularity fact tables, higher level views of dimensions can be used (or a snowflake table). View or Snowflake table Time Dimension Order FactDay IdProduct IdCustomer IdCostNumberOrdered CustomerCustomer IdLast NameFirst NameAddress Time(Day)Day IdDayMonth Id Period Sales FactMonth IdProduct IdCustomer IdAmountSoldRevenue ProductProduct IdDescriptionProduct TypeProduct Line Time(Month) ViewMonth IdMonthPeriod
So, What is a DW? • A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisionsW.H. Inmon (the father of DW)
Subject Oriented • Data in a data warehouse are organized around the major subjects of the organization
Integrated • Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
Non-Volatile • Once added to the DW, data are not changed (barring the existence of major errors)
Time Variant • The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!) • The snap-shot is called a production data extract
comparison of operational and informational systems Need for Data Warehousing • Integrated, company-wide view of high-quality information (from disparate databases) • Separation of operational and informational systems and data (for improved performance)
Data Warehouse Architectures • Generic Two-Level Architecture • Independent Data Mart • Dependent Data Mart and Operational Data Store • Logical Data Mart and @ctive Warehouse • Three-Layer architecture All involve some form of extraction, transformation and loading (ETL)
Generic two-level architecture L One, company-wide warehouse T E Periodic extraction data is not completely current in warehouse
Data marts: Mini-warehouses, limited in scope L T E Separate ETL for each independent data mart Data access complexity due to multiple data marts Independent Data Mart
ODS provides option for obtaining current data L T E Simpler data access Single ETL for enterprise data warehouse (EDW) Dependent data marts loaded from EDW Dependent data mart with operational data store
ODSand data warehouse are one and the same L T E Near real-time ETL for @active Data Warehouse Data marts are NOT separate databases, but logical views of the data warehouse Easier to create new data marts Logical data mart and @ctive data warehouse
DW Design Mainly consists of • Logical Design • Physical Design
Logical Design of DW Identification of • Entities • Relationships • Attributes • Uniqe identifiers Conceptual and abstract. Results in Fact and dimension tables Created using Pen and Paper OR Modeling tools also
Physical Design of DW • Conversion of data gathered in Logical design to Physical database structure • Mainly driven for query performance
Status Status Data CharacteristicsStatus vs. Event Data Example of DBMS log entry Event = a database action (create/update/delete) that results from a transaction
Data CharacteristicsTransient vs. Periodic Data Figure 11-8: Transient operational data Changes to existing records are written over previous records, thus destroying the previous data content