200 likes | 348 Views
Data Analysis. OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining. Data Processing. Data processing types OLTP ( On Line Transaction Processing ) OLAP ( On-line Analytical Processing ) Database types Transactional Numerous users Dynamic ( flickering )
E N D
Data Analysis OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining Bogdan Shishedjiev Data Analysis
Data Processing • Data processing types • OLTP (On Line Transaction Processing) • OLAP(On-line Analytical Processing ) • Database types • Transactional • Numerous users • Dynamic (flickering) • Always maintaining current state • Critical (very loaded) • Warehouses • Aa few users (analyzers) • Relatively stable • Maintaining the data history (all states in the time) • Not loaded Bogdan Shishedjiev Data Analysis
Architecture Bogdan Shishedjiev Data Analysis
Architecture • Source compnents • Filter – separate and assure the coherency of data to be exported • Export – do the transfer of data portions in precise moments of time. • Warehouse components • Loader – Initial loading and preparing the warehouse. • Refresh – loads the portions • достъп • data mining • Export – to other warehouses. This creates an hierarchy of warehouses Bogdan Shishedjiev Data Analysis
Star Relational Schemes Bogdan Shishedjiev Data Analysis
Snowflake Relational Schemes Bogdan Shishedjiev Data Analysis
Data Warehouse Design • Stages • Choose the activity processes to model • Choose the granularity of the activity procesus • Choose the dimensions that can be applied to every record of the fact table. • Choose the facts that must be recorded in the fact table Bogdan Shishedjiev Data Analysis
Data Warehouse Design • Fact types – the most valuable are numerical continuous values • Additive – they can be added along all dimensions (Money amounts) • Semi-additive – they can be added along some of dimensions (Precipitations, Product quantities) • Non-additive – they cannot be added along any dimension (Wind speed, wind direction) Bogdan Shishedjiev Data Analysis
Data Warehouse Design • Recommendations • Use continuous additive numerical values • The fact table is highly normalized • Don’t normalize the dimensions. The gain is < 1% • Design thoroughly the dimension attributes. Most often they are textual and discrete. They are used as headings and constraint sources in the answers to users Bogdan Shishedjiev Data Analysis
Dimension time Promotion Dimension shop Facts - Sales time_key day_of_ week day_no_in_month day_no_overall week_no_ln_year week_no_overall month month_no_overall quarter fiscal_period holiday_flag weekday_flag last_day_in_month_flag season event promotion_key promotion_name price__reduction_ type ad_type dlsplay_ type coupon_type ad_media_name display_provider promo_cost promo_begin_date promo_end-date ..and others stoie_key store_ name store_number store_street_address store_city store_county store_state store_zip store__manager store_phone store_FАX floor_plan_ type photo_processing_type finance_services_type first_opened_date last_remodel_date store_surface grocery_surface frozen-surface ..and others Time_key product_key Store_key promotion key dollar_sales units_sales dollar_cost customer-count Example – Hypermarket Chain Dimension product product_key SKU(stock keeping units )_description SKU_number package_size brand subcategory category departement package_ type diet_type weight weight_unit_of_mesure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width shelf_height shelf_depth ..and others Bogdan Shishedjiev Data Analysis
Hypermarket Chain • Fact table • Granularity – each sell of a product (SKU – Stock Keeping Unit) • Values – Total cost (additive), SKU quantity (semi-additive), price (non-additive), customer count (non additive) • Dimensions • Time • Product • Store • Promotion Bogdan Shishedjiev Data Analysis
Hypermarket Chain • Calculation of disk space needed • Dimension time : 2 years x 365 days = 730 days • Dimension shop : 300 shops, everyday records • Dimension product : 30.000 products in each shop; 3000 are sold every day in each shop. • Dimension promotion : An article can participate in only one promotion in a shop during one day. • Elementary fact records 300 x 730 x 3000 x 1 = 657 .106 records • Key field number 4; Value field number 4 ; Total number osf fields =8 • Fact table size -657 .106 x 8 fields x 4B = 21 GB Bogdan Shishedjiev Data Analysis
Data Operations for Data Analysis • General form of a SQL statement select D1.C1, ... Dn.Cn, Aggr1(F,Cl),…, Aggrn(F,Cn)from Fact as F, Dimension1 as D1,... DimensionNas Dn where join-condition (F, D1) and... and join-condition (F, Dn) and selection-condition group by D1.C1, ... Dn.Cn order by D1.C1, ... Dn.C Bogdan Shishedjiev Data Analysis
Data Operations for Data Analysis • Example select Time.Month, Product.Name, sum(Qty) from Sale, Time, Product, Promotion where Sale.TimeCode = Time.TimeCode and Sale.ProductCode = Product.ProductCode and Sale.PromoCode = Promotion.PromoCode and (Product. Name = ' Pasta' or Product.Name = 'Oil') and Time.Month between 'Feb' and 'Apr' and Promotion.Name = 'SuperSaver' group by Time.Month, Product.Name order by Time.Month, Product.Name pivot Time.Month Feb Mar Apr Oil 5K 5K 7K Pasta 45K 50K 51K Bogdan Shishedjiev Data Analysis
Data Cube The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest. Bogdan Shishedjiev Data Analysis
Data Cube • Data cube representation Bogdan Shishedjiev Data Analysis
Totals - the value ANY or ALL or NULL Data Cube Bogdan Shishedjiev Data Analysis
Data Cube • Drill down – adding a dimension for more detailed results Bogdan Shishedjiev Data Analysis
Data Cube • Roll-up - removing dimension Bogdan Shishedjiev Data Analysis
Data Cube • The whole data cube Bogdan Shishedjiev Data Analysis