Decision support systems for E-commerce
Decision support systems for EC • DSS: help the knowledge worker (executive, manager, analyst) make faster and better decisions • What were the sales volumes by region and product category for the last year? • How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years? • Will a 10% discount increase sales volume sufficiently? • Data Warehousing: enables on-line analytical processing (OLAP) • OLAP is a component of a decision support system • Data mining • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases. • Data mining is a powerful, high-performance data analysis tool for decision support.
Potential Applications of Data Warehousing and Mining in EC • Analysis of user access patterns and buying patterns • Customer segmentation and target marketing • Cross selling and improved Web advertisement • Personalization • Association (link) analysis • Customer classification and prediction • Time-series analysis • Typical event sequence and user behavior pattern analysis • Transition and trend analysis
Data Warehousing • The phrase data warehouse was coined by William Inmon in 1990 • A data warehouse is a decision-support database that is maintained separately from the organization’s operational databases • Definition: A DW is a repository of integrated information from distributed, autonomous, and possibly heterogeneous information sources for query, analysis, decision support, and data mining purposes
Characteristics (cont’d) • Integrated • There is no consistency in encoding, naming conventions, etc. among the application-oriented data coming from different legacy systems and heterogeneous data sources • When data is moved to the warehouse, it is consolidated, converted, and encoded consistently • Non-volatile • New data is always appended to the database, rather than replacing existing data • The database continually absorbs new data, integrating it with the previous data • In contrast, operational data is regularly accessed and manipulated one record at a time, and updates are applied in place in the operational environment.
Characteristics (cont’d) • Time-variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational databases contain current-value data. Data warehouse data is nothing more than a sophisticated series of snapshots, each taken at some moment in time. Operational data is valid only at the moment of access, capturing a moment in time; within seconds, that data may no longer be valid as a description of current operations. • Operational data may or may not contain an element of time. Informational data always has a time dimension: each data point is associated with a point in time, and data points can be compared along that axis.
Reasons to Separate DW from Operational Systems • Performance: • special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP • Complex OLAP queries would degrade performance for operational transactions; keeping the DW separate thus avoids interrupting operational processing at the underlying information sources • The concurrency control and recovery modes of OLTP are not compatible with OLAP analysis • Provide fast access to integrated information
Reasons to Separate DW from Operational Systems • Decision support requires • historical data which operational databases do not typically maintain • consolidation of data from heterogeneous sources: operational databases, external sources • different sources typically use inconsistent data representations, codes and formats which have to be reconciled. • aggregation, summarization, annotation of raw data
System Architecture (figure): end users issue analysis, queries, reports, and data mining requests against the warehouse; change detectors connect the warehouse to the underlying information sources (legacy systems, flat files, RDBMSs, OODBMSs).
DW Components • Underlying information sources • often the operational systems, providing the lowest level of data. • designed for operational use, not for decision support, and the data reflect this fact. • Multiple data sources often come from different systems that run on a wide range of hardware, and much of the software is built in-house or highly customized. • Multiple data sources introduce a large number of issues, such as semantic conflicts. • Distributed, autonomous, and possibly heterogeneous
DW Components (cont’d) • Integrator • Receives updates • makes the data conform to the conceptual schema used by the warehouse • integrates the changes into the warehouse • merges the data with existing data already present • resolves possible update anomalies • Modifies warehouse views accordingly • User interface • Tools to query and perform data analysis and data mining
DW Components (cont’d) • Change detectors/propagators • Refresh the warehouse by detecting changes to an information source that are of interest to the warehouse and propagating updates on source data to the data stored in the warehouse • when to refresh • determined by usage, types of data source, etc. • how to refresh • data shipping: using triggers to update a snapshot log table and propagate the updated data to the warehouse (define triggers in a full-functionality DBMS); a sketch appears below • transaction shipping: shipping the updates in the transaction log (examine the updates in the log file) • write programs for legacy systems
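The following is a minimal sketch of the data-shipping approach, assuming a hypothetical source table orders and a snapshot-log table orders_log, written in PostgreSQL-style trigger syntax; none of these names come from the original slides.

    -- Snapshot log that records changes made to the source table "orders".
    CREATE TABLE orders_log (
        order_id    INTEGER,
        amount      NUMERIC(10,2),
        changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );

    -- Trigger function that appends the new state of each changed row to the log.
    CREATE FUNCTION log_order_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO orders_log (order_id, amount) VALUES (NEW.order_id, NEW.amount);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Fire the function after every insert or update on the source table.
    CREATE TRIGGER orders_change_detector
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION log_order_change();

    -- A periodic refresh job reads orders_log, propagates the changes into the
    -- warehouse, and then clears the log.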
Multidimensional Data • Sales volume as a function of product, time, and geography
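As a small illustration of data organized along these dimensions, the query below summarizes a hypothetical denormalized table sales(product, month, region, amount); the table and column names are assumptions made for this example only.

    -- One output row per (product, month, region) cell of the cube.
    SELECT product, month, region, SUM(amount) AS sales_volume
    FROM   sales
    GROUP  BY product, month, region;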
OLAP Servers • Relational OLAP (ROLAP) • Extended relational DBMS that maps operations on multidimensional data to standard relational operations • Multidimensional OLAP (MOLAP) • Special-purpose server that directly implements multidimensional data and operations • Hybrid OLAP (HOLAP) • gives users/system administrators the freedom to select different partitions.
Warehouse Design: Conceptual Modeling • Star schema • A single object (fact table) in the middle connected to a number of objects (dimension tables) • Snowflake schema • A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables. • Fact constellation schema • Multiple fact tables share dimension tables.
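A minimal star-schema sketch follows, using hypothetical table and column names (the slides give no concrete schema): a central fact table holds the measures and foreign keys into the dimension tables. The later OLAP examples reuse these names.

    -- Dimension tables: one row per product, date, and store.
    CREATE TABLE product_dim (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category     VARCHAR(50)
    );
    CREATE TABLE date_dim (
        date_id  INTEGER PRIMARY KEY,
        day      DATE,
        month    INTEGER,
        quarter  INTEGER,
        year     INTEGER
    );
    CREATE TABLE store_dim (
        store_id INTEGER PRIMARY KEY,
        city     VARCHAR(50),
        region   VARCHAR(50),
        country  VARCHAR(50)
    );

    -- Central fact table: one row per (product, date, store) sales event.
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim,
        date_id    INTEGER REFERENCES date_dim,
        store_id   INTEGER REFERENCES store_dim,
        amount     NUMERIC(12,2),
        quantity   INTEGER
    );

    -- A snowflake schema would normalize a dimension further, e.g. moving
    -- category out of product_dim into its own category_dim table; a fact
    -- constellation would add a second fact table (e.g. shipments_fact)
    -- sharing these same dimension tables.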
A Sample Data Cube (figure): a three-dimensional cube with dimensions Date (1Qtr to 4Qtr, plus sum), Product (TV, PC, VCR, plus sum), and Country (China, India, Japan, plus sum); the apex cell (All, All, All) holds the grand total, and the cell (sum over Date, TV, China) gives the total annual sales of TVs in China.
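That highlighted cell corresponds to a query like the following over the hypothetical star schema sketched earlier; the year value is illustrative.

    -- Total annual sales of TVs in China: sum over all dates in the year.
    SELECT SUM(f.amount) AS total_tv_sales_china
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    JOIN   store_dim   s ON s.store_id   = f.store_id
    JOIN   date_dim    d ON d.date_id    = f.date_id
    WHERE  p.product_name = 'TV' AND s.country = 'China' AND d.year = 1998;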
OLAP Operations • roll-up • aggregating along a dimension, i.e., summarizing data • total sales volume last year by product category by region • drill-down • also called roll-down or drill-through • inverse of roll-up: go from a higher-level summary to a lower-level summary or detailed data • For a particular product category, find the detailed sales data for each salesperson by date
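On the hypothetical star schema sketched above, roll-up and drill-down correspond to grouping at coarser or finer granularities; the category and year values are illustrative, and since that sketch has no salesperson dimension, the drill-down goes to store and day instead.

    -- Roll-up: total sales volume last year by product category and region.
    SELECT p.category, s.region, SUM(f.amount) AS sales_volume
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    JOIN   store_dim   s ON s.store_id   = f.store_id
    JOIN   date_dim    d ON d.date_id    = f.date_id
    WHERE  d.year = 1998
    GROUP  BY p.category, s.region;

    -- Drill-down: for one category, detailed sales per store and day.
    SELECT s.city, d.day, SUM(f.amount) AS sales_volume
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    JOIN   store_dim   s ON s.store_id   = f.store_id
    JOIN   date_dim    d ON d.date_id    = f.date_id
    WHERE  p.category = 'Beverages'
    GROUP  BY s.city, d.day;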
OLAP Operations (cont’d) • slicing • projecting data along a subset of dimensions with an equality selection of other dimensions • Sales of beverages in the West for Jan 98 • dicing • similar to slicing except that instead of equality selection of other dimensions, a range selection is used • Sales of beverages in the West over the last 6 months • Pivot • reorient cube
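Continuing with the same hypothetical schema, a slice fixes the remaining dimensions with equality predicates, while a dice restricts them with range predicates.

    -- Slice: sales of beverages in the West for January 1998.
    SELECT SUM(f.amount) AS sales_volume
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    JOIN   store_dim   s ON s.store_id   = f.store_id
    JOIN   date_dim    d ON d.date_id    = f.date_id
    WHERE  p.category = 'Beverages' AND s.region = 'West'
      AND  d.year = 1998 AND d.month = 1;

    -- Dice: the same subcube over a range of months (second half of 1998).
    SELECT d.month, SUM(f.amount) AS sales_volume
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    JOIN   store_dim   s ON s.store_id   = f.store_id
    JOIN   date_dim    d ON d.date_id    = f.date_id
    WHERE  p.category = 'Beverages' AND s.region = 'West'
      AND  d.year = 1998 AND d.month BETWEEN 7 AND 12
    GROUP  BY d.month;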
Cube Operation SELECT date, product, customer, SUM(amount) FROM SALES CUBE BY date, product, customer • Needs to compute every group-by over the three dimensions: (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and ( ), i.e., 2^3 = 8 group-bys in all.
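In standard SQL (SQL:1999 onward) the same computation is expressed with GROUP BY CUBE; this sketch assumes a SALES table with exactly those three dimension columns plus amount.

    SELECT date, product, customer, SUM(amount) AS total
    FROM   SALES
    GROUP  BY CUBE (date, product, customer);
    -- Equivalent to the union of all 8 group-bys; NULLs in the result mark the
    -- "all" value of a dimension (e.g. a row with product and customer NULL
    -- holds the total per date across all products and customers).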
Cube Computation -- Array-Based Algorithm • A MOLAP approach: the base cuboid is stored as a multidimensional array. • Read in a chunk of cells at a time to compute the partial cuboids. • (Figure: the cuboid lattice over dimensions A, B, C, from the base cuboid {ABC} through {AB}, {AC}, {BC} and {A}, {B}, {C} down to the apex cuboid { }.)
ROLAP versus MOLAP • ROLAP • exploits the services of the relational engine effectively • provides additional OLAP services • design tools for DSS schema • performance analysis tools to pick which aggregates to materialize • SQL gets in the way of sequential processing and column aggregation • Some queries are hard to formulate and can often be time-consuming to execute
ROLAP versus MOLAP • MOLAP • the storage model is an n-dimensional array • Front-end multidimensional queries map to server capabilities in a straightforward way • Direct addressing abilities • Handling sparse data in array representation is expensive • Poor storage utilization when the data is sparse
Example • Designed, developed, and implemented QDMS (Quality Database Management System), a working system that • Provides fast access to integrated production and inspection data • Provides complex data analysis for decision support • Isolates data analysis processing from operational systems • Encourages manufacturers to evaluate and improve their performance
Example (Cont’d) Information Sources • Heterogeneous information sources • Flat files • RDBMSs - Oracle, Sybase, Paradox, MS Access, FoxPro • Non-relational DBMSs - IBM IMS • Others - Lotus Notes • Data • Uniform in some cases, e.g., Lot_no; Product id: NSN • Non-uniform in other cases, e.g., Defect id • Temporal ordering for production records
Actual Application Com.1 • Query: • “overall & detail production performance” • manufacturer: Com1 • products: all products • date interval: 01-Jan-94 until 01-Jan-1999 • source: USDA
(Figure: the relationships between manufacturer Com.1's contracts (Contract Number 1, 2, and 3) and its production lots (Lot#1, Lot#2, Lot#3).)
Data Mining • Characterization and Comparison • Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions. • Association • finding rules like: buys(x, diapers) → buys(x, milk) • Classification and Prediction • Classify data based on the values in a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage. • Predict some unknown or missing attribute values based on other information.
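As a hedged sketch, the support and confidence of a rule such as buys(x, diapers) → buys(x, milk) can be computed directly in SQL over a hypothetical purchases(transaction_id, item) table; these names are assumptions for the example, not part of the original slides.

    -- support    = fraction of all transactions containing both diapers and milk
    -- confidence = of the transactions containing diapers, the fraction also containing milk
    WITH diaper_tx AS (SELECT DISTINCT transaction_id FROM purchases WHERE item = 'diapers'),
         milk_tx   AS (SELECT DISTINCT transaction_id FROM purchases WHERE item = 'milk'),
         both_tx   AS (SELECT transaction_id FROM diaper_tx
                       INTERSECT
                       SELECT transaction_id FROM milk_tx)
    SELECT COUNT(*) * 1.0 / (SELECT COUNT(DISTINCT transaction_id) FROM purchases) AS support,
           COUNT(*) * 1.0 / (SELECT COUNT(*) FROM diaper_tx)                        AS confidence
    FROM   both_tx;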
Data Mining (Cont’d) • Clustering: • Group data to form new classes, e.g., cluster houses to find distribution patterns. • Time-series analysis: • Trend and deviation analysis: Find and characterize evolution trend, sequential patterns, similar sequences, and deviation data, e.g., stock analysis. • Similarity-based pattern-directed analysis: Find and characterize user-specified patterns in large databases. • Cyclicity/periodicity analysis: Find segment-wise or total cycles or periodic behavior in time-related data.
Classification • Data categorization based on a set of training objects. • Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc. • Example: classify a set of diseases and provide the symptoms which describe each class or subclass. • The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for • classification of future test data, • better understanding of each class, and • prediction of certain properties and behaviors.