430 likes | 535 Views
Data Warehousing. High quality information is vital to good decision making Data is different from information Organizations have a lot of data, may not have a lot of information information gap. Data Warehousing.
E N D
Data Warehousing • High quality information is vital to good decision making • Data is different from information • Organizations have a lot of data, may not have a lot of information • information gap
Data Warehousing Databases, in an organization, may have been developed in a fragmented manner over time need to reorganize, filter, .. The data before mining it Most systems are designed for operational processing, not informational processing
Data Warehousing Operational Processing: the processing of daily data Information processing: analysis of summarized data in order to support decision making
Data Warehousing Ex 1: WalMart: what to stock, when, .. Ex 2: Airlines: how much to overbook a flight by (forecast no-shows, ..) Ex 3: Amazon: people who bought this book also bought these books
Data Warehousing A lot of transactions are done via the web today fully automated, lots of data, real time opportunities to mine the data
In this chapter How data warehouses relate to operational systems 3-tier architecture for data warehousing Extracting data from operational systems and loading them into a data warehouse Big data / NoSql Data mining
Data Warehousing: Basic Concepts A data warehouse is: subject oriented (organized around key topics) Integrated (formats, names, ..) time variant (time dimension trends, … non-updateable (loaded by operation systems, not by end users)
Data Warehousing: Brief History Improvements in database technology Advances in computer hardware (cpu, memory, storage, ..) End user computing Advances in middleware products (networks, internet, ..)
Need for Data Warehousing Business require an integrated, company-wide view of high-quality information IS department must separate informational from operational systems to improve performance in managing company data
Need for a company-wide view Data scattered around company Different formats Different names/synonyms Free data vs structured data Inconsistent data (?) Missing data
Operational vs informational systems Operational system: designed to manage day to day operations; reservation systems, sales transactions, .. Informational systems: designed to support decision making; trends, data mining, forecasting, planning, ..
Informational systems A data warehouse is centralized A data warehouse adds value to data by improving its quality and consistency A data warehouse eliminates the contention/bottleneck for resource use
Data Warehouse Architectures Independent data marts Dependent data marts Logical data marts
Independent data marts Figure 9-2: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated Data warehouse = collection of data marts; contains both summary and detailed data Tools for users
Data Mart A data warehouse that is limited in scope, whose data is generated from either a data warehouse or other source data systems Independent data mart = data mart filled with data extracted from the operational environment, not from a data warehouse
Dependent data marts Independent data marts have some limitations: Need separate process for each data mart Possible consistency issues Limited drill down capabilities Scaling costs can be high
Dependent data mart A data mart filled exclusively from an enterprise data warehouse and its reconciled data EDW = Enterprise Data Warehouse = a centralized, integrated data warehouse that is the entry point and single source of data for decision support applications
Dependent data marts Figure 9-3: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated EDW and data marts Tools for users, pulling data from EDW and/or data marts
Logical data marts and Real time Data Warehousing Figure 9-4: 3 steps from left to right Data is extracted from various sources Data is transformed and integrated and fed real-time to data marts Tools for users, pulling data from EDW and/or data marts
Real time Data Warehouse An Enterprise Data Warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near real-time relays business rules to the data warehouse
Real time Data Warehouse Enable real time data analysis and response Capture customer data at time of event Analyze customer behavior and possibly predict customer response Develop rules for optimizing customer interaction Take immediate action with customer
Real time Data Warehouse E-commerce, abandoned shopping cart can trigger instant promotional email message Fraud detection in credit card transaction (unusual pattern detected) triggers call to credit card owner
Three-Layer Data Architecture Figure 9-5: Operational data (Operational systems) Reconciled data (EDW) Derived data (Data marts) Note: for each data, there is metadata associated with it (describes the data, ..)
Data warehouse data Status vs event data Figure 9-6 Status data = data before and after event Event data = data of the event
Data warehouse data Transient vs periodic data Figure 9-7 Transient data = can be overwritten less data needed, but less information Periodic data = never altered more data needed, but more information
Transient vs periodic data Likes of a post Do we just store the number of likes (transient)? Can we prevent somebody from voting twice? Probably not Do we store every like (periodic)? aggregate to get total
The Derived Data Layer Data layer associated with physical and logical data marts (from figure 9.5) Users interact with this layer for their decision support applications
Characteristics of Derived Data The source of derived data is the reconciled data Typically, derived data is aggregated (per topic, user group, ..) it looks more like information than data
The Star Schema A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for the starts schema.
The Star Schema Figure 9.9: a start schema includes: Dimension Tables (descriptive data) Fact table (actual data) ER model looks like a star
The Star Schema Figure 9.10: example of a star schema: Dimension Tables: product, period, store Fact table: sales Primary key in fact table (sales) is a combination of all primary keys ion the dimension tables
The Star Schema A row of the product table contains information for a particular product A row of the period table contains information about a unit of time A row of the store table contains information about a store
The Star Schema A row of the sales table contains information about the sales of a particular product in a particular product during a particular unit of time: unit sold, dollars sold, and dollars cost Example with actual data in figure 9.11
Fact table Can be big: example Sears Assumption: 1000 stores, 10000 products, 5 years of daily data 1000 X 100000 * ( 5 * 365) rows
Grain of the Fact table = level of detail of the fact table = intersection of primary keys of the dimension tables The finer the grain, the bigger the table Tradeoff between size and information quality/detail
Fact table Variations If we are interested in storing events, we can have a “factless” fact table it contains foreign key values only; no extra data Example figure 9.14: an event took place at that time, between these people, at that facility, on that topic (course)
BIG DATA A database whose size strains the ability of RDBMS to capture, manage, and process data efficiently scalability issues Large scale data gathering and analytics (particularly in web applications) big data challenging for traditional relational database management systems
BIG DATA NewSQL Make SQL based, relational databases more scalable NoSQL alternative databases to SQL
NoSQL databases NoSQL Stands for Not Only SQL Less functionality than SQL Higher performance (fast), more scalable than SQL
NoSQL Non relational No schema Scalable Hash tables and similar data structures
NoSQL No joins No constraints No complex transactions (do this but if that, rollback the transaction)
NoSQL Data Structures Hash tables: Key, Value pairs Key, Document pairs (can retrieve the document via its key, then query the document) Graph
Hash tables Key, Value pairs MD Maryland, CA California, NY New York, .. Jan January, Feb February, .. originalPostId the whole thread for this original post (this value contains a lot of data)