580 likes | 788 Views
BI & DM for CRM. Lecture 2 Data Warehouse and OLAP. Steps in Knowledge Discovery. Steps in KD cont. Data Cleaning : To remove noise and inconsistent data / check data validity & correctness – PLAY WİTH DATA ( good step to know your data)
E N D
BI & DM for CRM Lecture 2 Data Warehouse and OLAP
Steps in KD cont... • Data Cleaning: Toremovenoiseandinconsistent data / check data validity & correctness – PLAY WİTH DATA (good step toknowyour data) • Data İntegration: Tocombinemultiplesources of data / integratevarious data sources on variousplatforms (DBMS/flatfiles/spreedsheetsetc..) • Data Selection: Retrieverelevent data from data repositoryforanalysis • Data Transformation: Totransformorconsolidate data intodifferentformstodeploy in DM operations. (If Data Warehouseswereused, theprocesswould be beforethe data selection.) • Data Mining: Toapplyintelligentmodelsforextractingpatternsfrom data. • Pattern Evaluation: Toidentifyinterestingpatterns, result of DM process, on somemeasures. • Knowledge presentation: Tovisualizetherepresentation of knowledge, drivenby DM processandpatternevaluationz
So, what is Data Warehouse? Looselyspeaking; • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. • W.H. Inmon- ‘ A data warehouse is a subject-oriented, integrated, time-varientandnonvolatilecollection of data in support of management’sdecisionmakingproces.’
DW—Subject-Oriented • Organized around major subjects, such as customer, product, sales. • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 6
DW—Integrated • Constructed by integrating multiple, heterogeneous data sources • relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied. • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. • When data is moved to the warehouse, it is converted. 7
DW—Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational database: current value data. • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse • Contains an element of time, explicitly or implicitly • But the key of operational data may or may not contain “time element”. 8
DW—Non-Volatile • A physically separate store of data transformed from the operational environment. • Operational update of data does not occur in the data warehouse environment. • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • initial loading of data and access of data. 9
On-line transaction processing vs On-line Analytical Processing 10
Why seperate Database? • Main reason is high performance for both systems – processing OLAP queries in transactional DBs would substantially degrade the performance of operational tasks • Unlike OLTP, OLAP usually needs read only access from the database. So, for OLAP operations DBs does not require concurrent control and recovery. If applied for OLAP this may jeopardize the execution of OLTP operations • Systems are mainly used for different operations...
Data Model- Multidimentional DW & OLAP are based on multidementional data models... Actually... It is all about multidimentional data models...
2D view of data Facebook Data...
LATTICE OF CUBOIDS ALL 0 D – Apex cuboid Posting 1-D Cuboids Gender Education Time Time, Posting Time, Gender Time, Education Posting, Gender 2-D Cuboids Time, Education Gender, Education Time, Posting, Gender Time, Posting, Education Time, Gender, Education Posting, Gender, Education 3-D Cuboids Time, Posting, Gender, Education 4-D Cuboids
Latice & Cuboids • Any n-D data as a series of (n-1)-D “cubes” • In data warehousing literature, • A data cube is referred to as a cuboid • The lattice of cuboids forms a data cube. • The cuboid holding the lowest level of summarization is called a • the 4-D cuboid is the base cuboid for the given four dimensions • base cuboid. • The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. • Here this is the total posting • typically denoted by all
Conceptual Modelling of DW • Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set of dimension tables • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Snow Flake Schema FACT TABLE
Categorization & Computation • How are measures computed? • Data Cube measure is a numarical function that can be evaluated at each point in the data cube space. • Distributive- sum(), min(), max()- • How many posts are from USA? • Sum (Washington)+Sum(Nevada).... • Algebraic: sum(),count(), standard_deviation()... • Holisatic: median(), mode()
Concept Hierarchies • What is it? Flow of a concept (could be dimension) from lower levels to higher levels. More specific More General
Example for concept hierarchyLocation all all Europe ... North_America region Germany ... Spain Canada ... Mexico country Vancouver ... city Frankfurt ... Toronto L. Chan ... M. Wind office 23
HierarchicalvslatticeStructures(ConceptHierarchy) year country quarter Province or state week month city Street day We also have user defined concept hierarcies like Fiscal year or Academic Year
How can we use Concept Hierarchies useful in OLAP? • In multi dimensional model, data are organized in multi dimensions and each dimension contains multi level of abstraction defined by concept hierarchies • This type of organization provides user to view the data from various perspectives • Basically, OLAP provides user friendly environment for interactive data analysis. OLAP Operations in Multi Dimensional Data
Roll-up (drill-up) • Drill-down • Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. • Pivot: visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. OLAP ACTIONS
Roll-up (drill-up):Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or dimension reduction. Hierarchy Roll Up Dimension Roll Up Roll up action exp...
Drill Down:Drill down is reverse of roll-up. It navigates from general hierarchy to more specific hierarchy. Adding new dimension to data Drill Down Drill down action exp...
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Slice Dice Slice & Dice Action exp...
drill across: involving (across) more than one fact table • drill through: through the bottom level of the cube to its back-end relational tables (using SQL) • ranking the top N or bottom N items in lists • moving averages • growth rates • interests Other OLAP Actions
Radial lines from a central point • each line represents a concept hierarchy for a dimension • each abstraction level is called a footprint • granularities available for use by OLAP • four radial lines for concept hierarchies • location,customer,item,time • time line has 4 footprints: • day,month,quarter,year Star-Net Query Model
A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION DIVISION Each circle is called a footprint Location Promotion Organization
Design of a Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse • Top-down view • allows selection of the relevant information necessary for the data warehouse • Data source view • exposes the information being captured, stored, and managed by operational systems • Data warehouse view • consists of fact tables and dimension tables • Business query view • sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process • Top-down, bottom-upapproachesor a combination of both • Top-down: Starts withoveralldesignandplanning (mature) • Bottom-up: Starts withexperimentsandprototypes (rapid) • From software engineeringpoint of view • Waterfall: structuredandsystematicanalysis at each step beforeproceedingtothenext • Spiral: rapidgeneration of increasinglyfunctionalsystems, shortturnaround time, quickturnaround • Typical data warehousedesignprocess • Choose a businessprocessto model, e.g., orders, invoices, etc. • Choosethegrain (atomiclevel of data) of thebusinessprocess • Choosethedimensionsthatwillapplytoeachfacttablerecord • Choosethemeasurethatwillpopulateeachfacttablerecord
Enterprise Warehouse • Collects all information about subject spanning of the entire organization • Data Mart • A subset of corporate-wide data that is valuable to specific groups of users. Such as marketing • Virtual Warehouse • A set of views over operational databases • Only some of the possible summary views may be materialized Three Data Warehouse Models
Cuboids are referred as aggregations • One factor affecting storage requirements • Sparsity: the amount of empty cells in a cube • The base cuboid is likely to contain many empty cells • it is a spares cube or array • the 0 or lower dimensional cuboids are less spares than the higher dimensional ones • it is not likely that they contain empty cells • Moving along higher levels for the dimension hierarchy • the cuboids becomes less spares or more dense Storage of a cube
Two dimensional sparse cuboid One dimensional dense cuboid
ROLAP – Relational OLAP • MOLAP – Multidimensional OLAP • HOLAP – Hybrid OLAP OLAP Server Architectures
Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces • query response is generally slower • low storage requirement • Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services • greater scalability • appropriate for large data sets that are infrequently queried • historical data from less recent previous years ROLAP
Array-based multidimensional storage engine (sparse matrix techniques) • fast indexing to pre-computed summarized data • a two-level storage representation • dense subcubes are stored as array structures • spars subcubes are stored by compression techniques • appropriate for cubes with frequent use and rapid query response MOLAP
combines ROLAP and MOLAP benefiting from • greater scalability of ROLAP • faster computation of MOLAP • Large volumes of data base cuboid is stored in a relational database • aggregations are stored as arrays • appropriate for for cubes that requre • rapid query response for summaries based on a large amount of base data HOLAP
Data cubes can be viewed as lattice of cuboids • The bottom cuboid is the base cuboid • The top is the apex cuboid • What is the number of cuboids for N dimensional data cube? 2N • OLAP computes at least some of the cuboids • For fast response • For avoiding redundant calculation Efficient Data Cube Computation
all 0-D(apex) cuboid country product date 1-D cuboids product,date product,country date, country 2-D cuboids 3-D(base) cuboid product, date, country Example
Three Types • Materialize every cuboids • Huge amounts of memory space • Non materialization – Zero cube calculation • Show processing of queries • Some (partial) materialization • Trade off between storage space and response time • Selection of which cuboids to materialize • Based on size, sharing, access frequency and etc Materialization of data cube
Complete load of the cube • all dimension and fact table data is read and • all specified aggregations cuboids are calculated • process a cube when • its structure is new or • its dimensions or measures have been edited • Incrementally updating a cube • new data is added but existing data not changed and cube structure si the same • Refreshing • data cleared and reloaded • its aggregations recalculated • faster then processing:no design of aggregation tables Processing the cubes
Dimension member or measure whose value is computed at run time using an expression • Only the definitions are stored but values exists only in memory upon a query • do not increase in cube size • Ex: if sales and cost are included in the base fact table • a profit measure can be a calculated member • profit = sales – cost • Average_sales = sales/#_items_sold Calculated Members
Combination of multiple cubes in one logical cube • can be based on a single cube to expose only selected subsets of measures and dimensions • Require no physical space • store only the dimensionsinformation not actualdata • provide a valuable security function • limiting the access of some users Virtual Cubes
Attribute of a dimension member • provides additional information about the member • a column in the same dimension table as the associated members • used in queries • provide users more options when analysing cube data Member Cubes
A typical time table: (time_id,day,month,quarter,year,businessday,leap,day of the week) dimension levels day<month<quarter<year • member properties for day: weekend or business day:0 or 1 • day of the week:1,2,3,...,7 • a member property for year is whether it is leap year or not:0 or 1 Exp Member Property – Time Table
Logical dimension based on a member property of a level in a physical dimension • enables users to analyse cube data based on the member properties of dimension levels • add a virtual dimension to a cube only if • the dimension that supplies its member property is also included in the cube • adding a virtual dimension does not increase cube size • not affect cube processing time • calculated in memory when needed • query processing time is slower Virtual Dimension