540 likes | 673 Views
Design of DW. (). (city). (item). (year). (city, item). (city, year). (item, year). (city, item, year). Remember. item, city, year , and sales_in_Euro. 2 leveles of hierarchies for each dimension Item(part,color) i 1 ,i 2 City(downtown,suburb) c 1 ,c 2
E N D
() (city) (item) (year) (city, item) (city, year) (item, year) (city, item, year) Remember item, city, year, and sales_in_Euro
2 leveles of hierarchies for each dimension • Item(part,color) i1,i2 • City(downtown,suburb) c1,c2 • Year(good_year,bad_year) y1,y2 • For a 3-dimensional data cube, where Li is the number of all levels (L1,2,3=2), the total number of cuboids that can be generated is
{(), • (c1),(c2),(i1),(i2),(y1),(y2), • (c1,i1),(c1,i2),(c2,i1),(c2,i2), • (c1,y1),(c1,y2),(c2,y1),(c2,y2), • (i1,y1),(i1,y2),(i2,y1),(i2,y2), • (c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), • (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}
DMQL Data Mining Query Language • Relational database schema • Translation into SQL query • Example, star schema, and relational data base • MDX • Multifeature cubes • Design of a Data Warehouse • Lifecycle models • Data Warehouse models
DMQL • DMQL: A Data Mining Query Language for Relational Databases (Han et al, Simon Fraser University) • Data warehouses and data marts can be defined by cube definition and dimension definition
DMQL • Create and manipulate data mining models through a SQL-based interface (“Command-driven” data mining) • Abstract away the data mining particulars • Data mining should be performed on data in the database (should not need to export to a special-purpose environment) • Approaches differ on what kinds of models should be created, and what operations we should be able to perform
Cube Definition Syntax in DMQL • Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> • Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>)
item time item_key item_name brand type supplier_type time_key day day_of_the_week month quarter year location branch location_key street city state_or_province country branch_key branch_name branch_type Example of Star Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
Defining Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country)
The star schema contains two measures • dollars_sold and units_sold • How are the DMQL commands interpreted to generate a specified data cube?
Relational database schema time(time_key,day_of_week,month,quater,year) item(item_key,item_name,brand,type,supplier_type) branch(branch_key,branch_name,branch_type) location(location_key,street,city,province_or_state,country) sales(time_key,item_key,branch_key,location_key,number_of_units_sold,price)
The DMQL specification is translated into the following SQL query which generates the base cuboid SELECT s.time_key,s.item_key,s.branch_key,s.location_key, SUM(s.number_of_units_sold*s.price), SUM(s.number_of_units_sold) FROM time t, item i, branch b, location l, sales s, WHERE s.time_key=t.time_key AND s.item_key=i.item_key AND s.branch_key=b.branch_key AND s.location_key=l.location_key GROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key);
The granularity (resolution) of each dimension is at the join key level • A join key is the key that links a fact table and the dimension table • The fact table associated with a base cuboid is sometimes referred as base fact table
By changing GROUP BY we may generate other cuboids • The apex cuboid representing the total sum of dollars_sold and total count of units_sold is generated by GROUP BY (); • Other cuboids may be generated by applying selection and projection operations on the base cuboid • To generate a data cube we may as well use GROUP BY CUBE(s.time_key,s.item_key,s.branch_key,s.location_key);
Defining Snowflake Schemain DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining Fact Constellationin DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales
Example (Exercício 7) • Suponha um datawarehouse que contém as seguintes quatro dimensões: Data, Espectador, Localizaçaõ e Jogo e o facto “preço” que consiste no valor, em euros • Valor pago por um espectador quando assiste a um determinado jogo numa data • Espectadores podem ser estudantes, adultos ou séniores, e que cada uma destas categorias tem o seu preço de bilhete • Dimensão data contenha o dia, mês e ano; que a dimensão Localização contenha o nome do estádio, e que a dimensão Jogo contenha o nome das duas equipas defrontadas
Diagrama em estrelapara este DW • jogo(jogoId, equipa1, equipa2) • data (dataId, dia, mes, ano) • espectador (espId, nome, categoria) • localizacao(localId, estadio) • factos(jogoId, dataId, espId, localId, preco)
Escreva em SQL a interrogaçã o que devolve o preço total pago por espectadores estudantes para assistir ao jogo que se realizou no Estádio da Luz no dia 1 de Março de 2005
The corresponding SQL querry SELECT SUM(preco) FROM Factos F, Data D, Localizacao L, Espectador E WHERE F.dataId = D.dataId AND F.espId = E.espId AND F.localId = L.localId AND D.dia = 1 AND D.mes = 3 AND D.ano = 2005 AND L.estadio = ‘Estadio da Luz’ AND E.categoria= ‘Estudante’;
Esquema relacional que modele a mesma informação e uma interrogaçã o SQL que devolva a mesma informação: jogo(jogoId, equipa1, equipa2, localId, data) localizacao(localId, estadio) espectador(espId, nome, categoriaId) categoria (categoriaId, nomeC, preco) jogoEspectador(jogoId, espId)
The corresponding SQL querry SELECT SUM (C.preco) FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J, Localizacao L WHERE C.nomeC = ‘Estudante’ AND J.data = 1/3/2005 AND L.estadio = ‘Estadio da Luz’ AND C.categoriaId = E.categoriaId AND E.espId = JE.espId AND JE.jogoId = J.jogoId AND J.localId = L.localId;
Difference between both approaches • Com o modelo em estrela, existe três joins, da tabela Factos com cada uma das 3 dimensões relevantes • Com o esquema relacional, existem 4 joins • In the star schema has less joins, corresponding to the relevant dimensions • In multidimensional model the base cuboid is already precomputed
MDX • Multidimensional Expressions (MDX) as a Language • MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications
Provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products) • We are asked, moreover, to provide the information in a two-dimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns SELECT {[Time].[1997],[Time].[1998]} ON COLUMNS, {[Measures].[Warehouse Sales],[Measures].[Warehouse Cost]} ON ROWS FROM Warehouse WHERE ([Store].[All Stores].[USA])
The cube that is targeted by the query (the query scope) appears in the FROM clause of the query • The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query • The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE. • Important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar
A query has one or more dimensions. The query above has two. (The first three dimensions (=axes) that are found in MDX queries are known as rows, columns and pages) SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON ROWSFROM WarehouseWHERE ([Measures].[Warehouse Sales]) • Curled brackets "{}" are used in MDX to represent a set of members of a dimension or group of dimensions
Complex Aggregation at Multiple Granularities • Multifeature cubes compute complex queries involving multiple aggregates at multiple granularities (resolution) • Example • item is purchased in a sales region on a business day (year,month,day) • The shelf life in months of a given item is stored in shelf • The item price and sales is stored in price and sales
Find the total sales in 2000, broken down by item, region, and month with subtotal for each dimension • A data cube is constructed • {(item,region,month),(item,region),(item,month),(month,region),(item),(month),(region,()} • Simple data cube, since it does not involve any dependent aggregates • What are dependent aggregates?
Example • Grouping by all subsets (cuboids) {item,region,month} (=data cube) • Find maximum price for each group (cuboid) in 2000 • Among the maximum price tuples find the minimum and maximum shelf lives
Multifeature cube graph for the example query R2 cube {=MIN(R1.shelf)} R3 cube {=MAX(R1.shelf)} R1 cube {=MAX(price} R0 cube
The multifeature graph illustrates the aggregate dependencies • R0,R1,R2,R3 are the grouping variables • The grouping variables R2,R3 are dependent on R1 • In extended SQL • R2 in R1 • R3 in R1
Query in extended SQL SELECT item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf) FROM Purchases WHERE year=2000 CUBE BY item,region,month:R1,R2,R3 SUCH THAT R1.price=MAX(price) AND R2 IN R1 and R2.shelf=MIN(R1.shelf) AND R3 IN R1 and R3.shelf=MAX(R1.shelf);
Design of Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse • Top-down view • allows selection of the relevant information necessary for the data warehouse • Data source view • exposes the information being captured, stored, and managed by operational systems • Data warehouse view • consists of fact tables and dimension tables • Business query view • sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process • Top-down, bottom-up approaches or a combination of both • Top-down: Starts with overall design and planning (mature) • Bottom-up: Starts with experiments and prototypes (rapid) • From software engineering point of view • Waterfall: structured and systematic analysis at each step before proceeding to the next • Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around
Lifecycle planning • Translation from user requirements into software requirements • Transformation of the software requirements into software design • Implementation of the design into programming code • The sequence of this steps is defined by the lifecyle model
A software lifecycle model must be defined for every project! • The lifecycle model you choose has as much influence over your project’s success as any other planning decision you make!
Pure Waterfall model Software Concept Requirements Analysis Architecural Design Detailed Design Coding and Debugging System Testing
Pure Waterfal model • Document driven model which means that the main work products that are carried from phase to phase are documents • The disadvantage of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design works has been done and before any code has been written
Salmon model Software Concept Requirements Analysis Architecural Design Detailed Design Coding and Debugging System Testing
Code-and-Fix I Code-And-Fix System Specification (maybe) Release
Code-and-Fix II Advantages • No overhead, you don’t spend any time on planning, documentation, quality assurance, enforcement, or other activities than pure coding • Since you jump right into coding, you can show signs of progress immediately • It requires little expertise
Code-and-Fix III • Maintainability and reliability decrease with the complexity and the time • For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress, you just code until you’re done
Spiral I Determine objectives, Alternatives, and constraints Risk analysis Evaluate IV Prototypes III II Review Start I Simulation models Reqirements Development Plan code test Realse
Spiral II • The basic idea behind the diagram is that you start on a small scale in the middle of the spine, explore the risks, make a plan to handle the risks, and then commit to an approach of the next iteration. Each iteration moves your project to a larger scale
Spiral III • The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed
Spiral IV • Determine objectives, and constraints • Identify and resolve risks • Evaluate alternatives • Develop the deliverables for that iteration, and verify that they are correct • Plan next iteration • Commit to an approach for the next iteration One of the most important advantages of the spiral model is that as costs increase, risk decrease. The more time and money you spend, the less risk your’re taking
Other sources Extract Transform Load Refresh Operational DBs Data Warehouse: A Multi-Tiered Architecture Monitor & Integrator OLAP Server Metadata Analysis Query Reports Data mining Serve Data Warehouse Data Marts Data Sources Data Storage OLAP Engine Front-End Tools
Three Data Warehouse Models • Enterprise warehouse • collects all of the information about subjects spanning the entire organization • Data Mart • a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart • Independent vs. dependent (directly from warehouse) data mart • Virtual warehouse • A set of views over operational databases • Only some of the possible summary views may be materialized