590 likes | 688 Views
Discover the definition and concept of Data Warehousing, its key characteristics, types like Data Marts and Operational Data Stores, the role of metadata, and the process overview including data extraction, transformation, loading, and database summarization. Learn about different Data Warehousing architectures like client/server and n-tier, and the importance of real-time data in decision-making support systems.
E N D
Business Intelligence:A Managerial Approach(2nd Edition) Chapter 2: Data Warehousing
DW definition and Concept • Using real-time data warehousing (RDW) in conjunction with decision support system • (DSS) and BI tools is an important way to conduct business processes
What is a Data Warehouse? • a data warehouse (DW) is a pool of data produced to support decision • making; it is also a repository of current and historical data of potential interest to managers throughout the organization • Data are usually structured to be available in form ready for analytical processing activities
Characteristics of DW • Subject-oriented. Data are organized by detailed subject, such as sales, products, or customers. • Integrated. Data are from different sources and must be in a consistent form. DWs must deal with naming conflicts and discrepancies among units of measure.
Characteristics of DW • Time variant (time series). A DW usually contains historical data (e.g. daily, weekly, monthly), except in real-time systems. They detect trends, deviations, long-term relationships for forecasting and comparisons, leading to decision making. • Nonvolatile. Users cannot change data in DWs. Obsolete data are discarded, and changes are recorded as new data
Additional DW Characteristics • Web-based. DWs are typically web-based application. • Relational/Multidimensional. DWs use either a relational structure or multidimensional structure. • Client/server. DWs use client/server architecture to provide easy access to end users. • Real-time. Newer DWs provide real-time or active, data access and analysis capabilities. • Include metadata. DWs contain metadata (data about data) about how data organized and how to effectively use them.
Parts of DWs. • Data Marts. A data mart is a subset of a DW, typically consisting of a single subject area (e.g. marketing, operations). • Dependent data mart A subset that is created directly from a data warehouse • Independent data mart A small data warehouse designed for a strategic business unit or a department its not from DW
Parts of DWs. 2. Operational Data Stores (ODS). An ODS provides a fairly recent form of customer information file (CIF) and is used for short-term decision involving mission-critical applications. • Enterprise Data Warehouses (EDW). is a large scale DW that is used across the enterprise for decision support. EDW are used to provide data for many types of DSS, • customer relation management (CRM), • supply chain management (SCM), • business performance management (BPM), • business activity monitoring (BAM), • product lifecycle management (PLM), • revenue management, • knowledge management systems (KMS), etc.
MetaData • Data about data. • In a data warehouse, metadata describe the contents of a data warehouse and the manner of its acquisition and use. • Syntactic metadata, structure metadata and semantic metadata.
DW Process Overview • Components of the data warehousing process: • Data sources. Data are sourced from multiple independent operational "legacy“ systems and possibly from external data providers (such as the U.S . Census). Data may also come (OLTP) or ERP system,Web data. • Data extraction and transormation. Data are extracted and properly transformed using custom-written or commercial software called ETL.
DW Process Overview • Components of the data warehousing process: • Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse and/or data marts. • Compreheusive database. Essentially, this is the EDW to support all decision analysis by providing relevant summarized and detailed information originating from many different sources.
DW Process Overview • Components of the data warehousing process: • Metadata. Metadata are maintained so that they can be assessed by IT personnel and users. Metadata include soft-vvare programs about data and rules for organizing data summaries that are easy to index and search, especially with Web tools. • Middleware tools. Middleware tools enable access to the data warehouse. Power users such as analysts may write their own SQL queries.
2.3 DATA WAREHOUSING ARCHITECTURES • architectures are commonly called client/ server or n-tier architectures (2-tier and 2-tier OR 1-tire) • DW Arch. Devide into three parts: 1. The data warehouse itself 2. Data acquisition (back-end) software 3. Client (front-end) software,
3-tire DW Architectures • The advantage of the three-tier architecture is its separation of the functions of the data warehouse, which eliminates resource constraints and makes it possible to easily create data marts
2-tire Dw • the DSS engine physically runs on the same hardware platform as the data warehouse • Therefore, it is more economical than the three-tier structure.
Alternative Data Warehousing Architectures • data warehouse architecture design viewpoints can be categorized into enterprise-wide data warehouse (EDW) design and data mart (DM) design. • Independent Data Marts Architecture • the simplest and the least costly architecture • The data marts are developed to operate independently of each other to serve for the needs of individual organizational units
Alternative Data Warehousing Architectures • Data mart bus architecture. • This architecture is a viable alternative to the independent • data marts where the individual marts are linked to each other via some kind of middleware.
Alternative Data Warehousing Architectures (c) Hub-and-Spoke Architecture (Corpo rate Information Factory) • the most famous data warehousing architecture today. • focused on building a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts
Alternative Data Warehousing Architectures (d) Centralized Data Warehouse Architecture • no dependent data marts • provides users with access to all data in the data warehouse • it reduces the amount of data the technical team has to transfer or change, therefore simplifying data management and administration
Alternative Data Warehousing Architectures (e) Federated Architecture • It uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions
Alternative Data Warehousing Architectures Ten factors that potentially affect the architecture selection decision: 6. Strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues 10. Social/political factors 1. Information interdependence between organizational units 2. Upper management’s information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. Constraints on resources
2.4 DATA INTEGRATION AND THE EXTRACTION, TRANSFORMATION, AND LOAD (ETL) PROCESSES • A decision maker typically needs access to multiple sources of data that must be integrated. • As data warehouses grow in size, the issues of integrating data grow as well
Data Integration • Data integration comprises three major processes: • Data access (i.e., the ability to access and extract data from any data source), • Data federation (i.e., the integration of business views across multiple data stores), and • Change capture (i.e., based on the indentification, cpature, and delivery of the changes made to enterprise data sources).
Data Integration Techniques • Enterprise application integration (EAI) • Service-oriented architecture (SOA) • Enterprise information integration (Ell) • Extraction, transformation, and load (ETL)
Extraction, Transformation, and Load (ETL) • ETL is the heart of DW. • ITmanagers are often faced with challenges because the ETL process typically consumes 70 percent of the time • ETL is composed of • Extraction: reading data from one or more databases, • Transformation: converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database, and • Load: putting the data into datawarehouse
ETL • The purpose of the ETL process is to load the warehouse with integrated and cleansed data. • Several issues affect whether an organization will purchase data transformation tools or build the transformation process itself: • Data transformation tools are expensive. • Data transformation tools may have a long learning curve. • It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools.
ETL The following are some of the important criteria in selecting an ETL tool: • Ability to read from and write to an unlimited number of data source architectures • Automatic capturing and delivery of metadata • A histoty of conforming to open standards • An easy-to-use interface for the developer and the functional user
2.5 DATA WAREHOUSE DEVELOPMENT • A data warehouse provides several benefits that can be classified as direct and indirect. Direct benefits include the following: • • Encl users can perform extensive analysis in numerous ways. • • A consolidated view of corporate data (i.e., a single version of the truth) is possible. • • Better and more-timely information is possible • • Enhanced system performance can result • • Data access is simplified.
Indirect benefits • enhance business knowledge, present competitive advantage, improve customer service and satisfaction, facilitate decision making, and help in reforming • business processes
Data Warehouse Development Approaches • THE INMON MODEL: THE EDW APPROACH • Bill Inmon, who is often called "the father of data warehousing. • Inmon's approach emphasizes top-clown development, • The EDW approach does not preclude the creation of data marts.
Data Warehouse Development Approaches • THE KIMBALL MODEL: THE DATA MART APPROACH • Is "plan big, build small" approach. • A data mart is a subject-oriented or department-oriented data warehouse. • It is a scaled-down version
Representation of Data in Data Warehouse • Dimensional Modeling – a retrieval-based system that supports high-volume query access • Star schema – the most commonly used and the simplest style of dimensional modeling • Contain a fact table surrounded by and connected to several dimension tables • Fact table contains the descriptive attributes (numerical values) needed to perform decision analysis and query reporting • Dimension tables contain classification and aggregation information about the values in the fact table • Snowflakes schema – an extension of star schema where the diagram resembles a snowflake in shape
Analysis of Data in Data Warehouse • OLAP versus OLTP • OLTP (online transaction processing) • is a term used for transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions • The main focus is on efficiency of routine tasks • OLAP (online analytic processing) • A system is designed to address the need of information extraction by providing effectively and efficiently ad hoc analysis of organizational data • The main focus is on effectiveness
OLAP Operations • The main operational structure in OLAP is based on a concept called cube. • A cube in OLAP is a multidimensional data structure (actual or virtual) that allows fast analysis of data.
OLAP Operations • Slice – a subset of a multidimensional array • Dice – a slice on more than two dimensions • Drill Down/Up – navigating among levels of data ranging from the most summarized (up) to the most detailed (down) • Roll Up – computing all of the data relationships for one or more dimensions • Pivot – used to change the dimensional orientation of a report or an ad hoc query page display
2.6 DW Implementation Issues The following is a list of major tasks that could be performed: • Establishment of service-level agreements and data-refresh requirements • Identification of data sources and their governance policies • Data quality planning • Data model design • ETL tool selection • Relational database software and platform selection • Data transport • Data conversion • Reconciliation process • Purge and archive planning • End-user support
DW Implementation Guidelines • Project must fit with corporate strategy & business objectives • There must be complete buy-in to the project by executives, managers, and users • It is important to manage user expectations about the completed project • The data warehouse must be built incrementally • Build in adaptability, flexibility and scalability • The project must be managed by both IT and business professionals • Only load data that have been cleansed and are of a quality understood by the organization • Do not overlook training requirements • Be politically aware
Successful DW ImplementationThings to Avoid • Starting with the wrong sponsorship chain • Setting expectations that you cannot meet • Engaging in politically naive behavior • Loading the data warehouse with information just because it is available • Believing that data warehousing database design is the same as transactional database design • Choosing a data warehouse manager who is technology oriented rather than user oriented
Successful DW ImplementationThings to Avoid - Cont. • Focusing on traditional internal record oriented data and ignoring the value of external data and of text, images, etc. • Delivering data with confusing definitions • Believing promises of performance, capacity, and scalability • Believing that your problems are over when the data warehouse is up and running • Focusing on ad hoc data mining and periodic reporting instead of alerts
2.7 Real-Time Data Warehousing • Traditionally, DWs work mainly on historical data to support strategic and tactical decision making. • For many business, making fast and consistent decisions across the enterprise requires real-time data warehousing. Decision support has become operational. • Real-time data warehouse (RDW), also known as active data warehouse (ADW), is the process of loading and providing data via the data warehouse as they become available.
Traditional Vs Real-Time • Strategic, tactic and Operational decisions • Results measured with operations • Only comprehensive detailed data available within minutes is acceptable • High number (1000 or more) of users accessing and querying the system simultaneously • Flexible ad hoc reporting, as well as machine-assisted modeling (e.g., data mining) to discover new hypotheses and relationships • Operational staffs, call centers, external users • Strategic decision only • Results sometimes hard to measure • Daily, weekly, monthly data currency acceptable; summaries often appropriate • Moderate user concurrency • Highly restrictive reporting used to confirm or check existing processes and patterns; often uses predeveloped summary tables or data marts • Power users, knowledge workers, internal users
Critical • Concerns about real-time BI • Not all data should be updated continuously.