Data Mining and OLAP

Data Mining and OLAP
Stages of Data Mining and OLAP (with thanks to Janet Francis)
CREATE THE DIFFERENCE

Aims
• This lecture aims to cover:
  • The nature of data mining
  • Stages of data mining
  • OLAP

What is Data Mining?
• The term Data Mining is used because mining for valuable data in a large database is similar to mining for a valuable ore in a huge mountain.
• In a mining operation, large amounts of low-grade material are sifted through in order to find something of value.
• In its computing counterpart, large volumes of data are searched in an attempt to find something worthwhile.

A Useful Scenario
• The following scenario is used throughout this lecture to make the processes more concrete.
• Beaconside JAMS PLC (BJP) supplies jams, cake fillings and sauces to confectioners and bakers. Customers include large multinationals and small specialist outlets.
• BJP has centres in England, Scotland and Spain.
• BJP is not a manufacturing organisation; it is a retailer, which means that it buys from a distributor and sells on to customers.

Human vs Data Mining
• Human
  • Usually takes the form of hypothesis verification
  • The analyst has a theory, for example: we sell more high-margin goods pro rata to specialist outlets than to multinationals, and specialist outlets are more profitable
  • The analyst gathers the necessary data and proves or disproves the hypothesis, or amends and retests it
• Data Mining
  • Can perform hypothesis verification on vast quantities of data
  • Allows the user to discover patterns that the user did not know existed

Types of Data Mining
• Directed
• Undirected

Directed Data Mining
• A top-down approach, used when there is some idea of what is being looked for (some direction for the search) or an idea of what might be predicted
• The goal is to create a predictive model or set of models from the existing data, which can then be used to predict future trends
• For example: which customers are most likely to be interested in a new type of cake filling?

Undirected Data Mining
• A bottom-up approach: the data itself determines the relationships, for example using clustering. If patterns are found, it is for the user to determine whether the patterns are useful or not.
• The goal is to find patterns in the existing data.
• Human interaction is necessary because only people can determine what significance, if any, the patterns have.
• This type of data mining is one of the key steps in Knowledge Discovery in Databases (KDD).
• It is necessary to know how the model works and how it arrives at its answer in order to decide whether the patterns are valid.
• Example: people who are over 5ft tall with brown hair like Blackcurrant Jam

Approaches to Data Mining
• Descriptive
  • Describes the current data in terms of rules or patterns
• Predictive
  • Identifies a set of rules or a model which can be used to predict currently unknown values

Descriptive Data Mining Uses
• Market Basket Analysis
• Clustering
• Classification

Descriptive Data Mining Uses: Market Basket Analysis
• Identifies relationships (associations) between data items, for example patterns in transaction purchases
• One or more rules can be developed. A rule's support is the frequency with which it occurs in the data, and its confidence can be calculated and expressed as a ratio.
• Finding such rules is also known as market basket analysis
• For example: people who buy Blackcurrant Jam also buy Redcurrant Jelly
• Beer and nappies?
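The support and confidence of a rule can be worked through on a handful of orders. The sketch below is illustrative only: the transactions are invented, and `support` and `confidence` are hypothetical helper names, not part of any BJP system. It takes support as the share of all transactions containing the item set, and confidence as the share of transactions containing the first item that also contain the second.

```python
# Hypothetical transactions: each set is one customer order (invented data).
transactions = [
    {"blackcurrant jam", "redcurrant jelly"},
    {"blackcurrant jam", "redcurrant jelly", "strawberry jam"},
    {"blackcurrant jam"},
    {"strawberry jam"},
    {"redcurrant jelly", "strawberry jam"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule under test: "people who buy Blackcurrant Jam also buy Redcurrant Jelly".
rule_support = support({"blackcurrant jam", "redcurrant jelly"}, transactions)
rule_confidence = confidence({"blackcurrant jam"}, {"redcurrant jelly"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here the rule appears in 2 of 5 orders, and 2 of the 3 orders containing Blackcurrant Jam also contain Redcurrant Jelly.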

Example
• BJP analysts discovered that sales of Strawberry Jam increased:
  • When the customer was offered a small pot of Blackcurrant Jam free with the purchase
  • With the height of the person buying the product
• How commercially useful is this information?
• Just because there is a correlation does not mean it is useful

Descriptive Data Mining Uses: Clustering
• Identifies the natural groupings within data, e.g. customers may be placed into groups. This is known as customer segmentation and is useful in Customer Relationship Management (CRM).
• Data items within a group should be as similar as possible to each other and as different as possible from items in other groups
• Need to determine parameters which will result in realistic clusters
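The slides do not name a particular clustering algorithm. As one possible illustration, the sketch below uses k-means, a widely used technique, on invented data: each point is a hypothetical customer described by (jam orders, cake filling orders) per year. The number of clusters and the starting centroids are the kind of parameters that have to be chosen; here they are simply assumed.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Invented customers: (jam orders per year, cake filling orders per year).
customers = [
    (10, 0), (12, 1), (11, 0),    # mostly jam
    (0, 9), (1, 11), (0, 10),     # mostly cake fillings
    (10, 10), (11, 9), (12, 11),  # both
]
# Assumed starting centroids, one per expected segment.
centroids, clusters = kmeans(customers, [(10, 0), (0, 9), (10, 10)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

With data this well separated the three segments fall out immediately; real customer data usually needs several trials with different parameters before the clusters look realistic.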

Example
• BJP has identified clusters of customers who buy only jam, customers who buy only cake fillings, and customers who buy both
• How would this be commercially useful?

Descriptive Data Mining Uses: Classification
• Data of interest is sorted into predefined classes
• BJP classifies customers as:
  • Multinational
  • UK based
  • Independent chain
  • Single outlet

Predictive Data Mining Use
• Customers in the single outlet category typically order jams and sauces but not cake fillings
• When a new client is placed in the single outlet category, it is possible to predict its likely ordering patterns
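A very simple way to turn a category into a prediction is to look at what existing customers in that category have ordered. The sketch below is an assumption-laden illustration: the order history is invented and `ordering_profile` is a hypothetical helper that reports, for each product line, the share of customers in the category who ordered it.

```python
from collections import defaultdict

# Invented history: (customer category, product lines that customer has ordered).
history = [
    ("single outlet", {"jam", "sauce"}),
    ("single outlet", {"jam", "sauce"}),
    ("single outlet", {"jam"}),
    ("multinational", {"jam", "cake filling", "sauce"}),
    ("multinational", {"cake filling"}),
]


def ordering_profile(history, category):
    """Share of customers in the category that ordered each product line."""
    orders = [products for cat, products in history if cat == category]
    counts = defaultdict(int)
    for products in orders:
        for product in products:
            counts[product] += 1
    return {product: n / len(orders) for product, n in counts.items()}


# A new client classified as "single outlet" is assumed to follow this profile.
profile = ordering_profile(history, "single outlet")
for product, share in sorted(profile.items()):
    print(f"{product}: {share:.0%}")
```

Products missing from the profile (cake fillings here) were never ordered by that category in the history, which matches the pattern described on the slide.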

Stages in Data Mining
1. Preparation of data
  • This stage involves selection and preparation of input data from a variety of sources
  • Data integration
  • Data cleansing
  • Data warehousing (this usually includes the above)
2. Mining stage
  • This stage involves producing useful predictive models (OLAP)
3. Interpretation and Evaluation – Knowledge Discovery
  • The final stage involves deploying the models and applying them to new data in order to generate predictions or new knowledge

1. Preparation of Data
• Input data must be in, or converted to, electronic form. It could come from a variety of different sources such as:
  • Operational databases (sales, finance etc.)
  • Commercial databases (demographics)
  • Internet documents
  • Spreadsheets or other "office" documents
• The input data must be integrated and cleansed.
• Note: much of the preparation is already complete in a data warehouse

Data Integration
• Data from different sources must be integrated to provide homogeneity
• Involves de-normalisation of databases
• Dates and times must be in the same format
• Records must be of the same type
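Bringing dates into one format is a typical integration step. The sketch below assumes three source formats (ISO, day/month/year with slashes, day.month.year with dots) and converts them all to ISO 8601. The list of formats and the day-first reading of ambiguous dates are assumptions; real sources would need to be checked individually.

```python
from datetime import datetime

# Assumed source formats; all are read day-first where there is any ambiguity.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date(text):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


# The same order date as it might arrive from three different source systems.
for raw in ("2021-02-03", "03/02/2021", "03.02.2021"):
    print(raw, "->", normalise_date(raw))
```

Raising an error for unknown formats, rather than guessing, keeps badly formatted records visible so they can be dealt with during cleansing.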

Data Cleansing
• Once integrated, the data must be cleansed to resolve the following issues:
• Duplicate data
  • Need to delete
• Missing values (unrecorded or really missing?)
  • Unrecorded: the value might not have been required in one or more of the contributing data sets. It could be added if it can be derived from other values, e.g. post code.
  • Really missing: the absence could itself be meaningful, e.g. an unpaid bill.
  • Need to decide how missing values will be represented.
• Irrelevant values
  • Need an expert to identify such sets and delete them
• Inaccurate data
  • Anomalies can be identified by using graphs and clusters. Values outside the normal expected range can be investigated.
• Old data
  • Need to delete
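Three of the cleansing steps above (removing duplicates, representing missing values, and flagging values outside the expected range) can be sketched in a few lines. Everything here is hypothetical: the records, the post codes, the `UNKNOWN` marker and the 0 to 1000 range are invented for illustration.

```python
# Invented customer records after integration.
records = [
    {"customer": "A1", "postcode": "ST16 1AA", "order_value": 120.0},
    {"customer": "A1", "postcode": "ST16 1AA", "order_value": 120.0},    # duplicate
    {"customer": "B2", "postcode": None, "order_value": 95.0},           # missing value
    {"customer": "C3", "postcode": "EH1 1AA", "order_value": 110.0},
    {"customer": "D4", "postcode": "ST17 2BB", "order_value": 9500.0},   # suspicious
]

MISSING = "UNKNOWN"  # assumed marker for values that were not recorded


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, field, marker=MISSING):
    """Represent missing values of one field with an explicit marker."""
    return [{**row, field: marker if row[field] is None else row[field]} for row in rows]


def flag_out_of_range(rows, field, low, high):
    """Return records whose value lies outside the expected range, for investigation."""
    return [row for row in rows if not (low <= row[field] <= high)]


clean = fill_missing(drop_duplicates(records), "postcode")
suspicious = flag_out_of_range(clean, "order_value", 0, 1000)
print(len(clean), "records kept;", [r["customer"] for r in suspicious], "to investigate")
```

Out-of-range records are flagged rather than deleted, since the slide says such values should be investigated.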

What are demographic overlays?
• Most customer databases include post codes.
• Various data is collected via the census and is based on post codes, e.g.
  • Gender distribution
  • Age distribution
• Other data is known about areas, e.g.
  • Proximity to the coast
  • Major employers
  • Proximity to National Parks
• This data could be used in conjunction with customer data to predict trends, e.g.
  • If a product sells well in one area close to the coast with a higher than average percentage of old ladies, then it might be worth marketing that product in other such areas.

2. Mining Stage
A Typical Data Set
[Table: customer names in a certain post code area]
• It is known that in this area 75% of the population is considered rich and 75% is male

Histograms
[Figure: a one-dimensional histogram and a two-dimensional histogram]

Into the 3rd Dimension
• Even with just two attributes, each with two values, the table is more difficult to understand.
• What if there were 16 attributes, each with multiple values?
• The number of 2D histograms which could potentially be useful would be over 100.
• This structure is known as an OLAP cube.
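The difference between one- and two-dimensional histograms, and the "over 100" figure, can be checked with a short sketch. The eight customers below are invented so that 75% are rich and 75% are male, as in the data set described earlier; the count of 2D histograms is the number of ways of choosing 2 attributes from 16.

```python
from collections import Counter
from math import comb

# Invented customers as (wealth, gender); 6 of 8 are rich and 6 of 8 are male.
customers = [
    ("rich", "M"), ("rich", "M"), ("rich", "M"), ("rich", "M"), ("rich", "M"),
    ("rich", "F"), ("poor", "M"), ("poor", "F"),
]

# One-dimensional histograms: one count per value of a single attribute.
wealth_hist = Counter(wealth for wealth, _ in customers)
gender_hist = Counter(gender for _, gender in customers)

# Two-dimensional histogram: one count per combination of the two attributes.
joint_hist = Counter(customers)

# Each pair of attributes gives one possible 2D histogram.
pairs_for_16_attributes = comb(16, 2)

print(wealth_hist, gender_hist)
print(joint_hist)
print(pairs_for_16_attributes)
```

The 1D histograms alone cannot say how many customers are both rich and male; only the joint counts do, and with 16 attributes there are 120 such pairings to consider.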

On-line Analytical Processing (OLAP)
• OLAP functionality is characterised by dynamic multi-dimensional analysis of consolidated enterprise data:
  • Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset.
  • Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two consecutive slices).
  • Drill Down/Up: Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).
  • Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined.
  • Pivot: To change the dimensional orientation of a report or page display.
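These operations can be tried out on a toy cube held in a dictionary. The sketch below is a minimal illustration under several assumptions: the sales figures are invented, the three dimensions (centre, product, quarter) are chosen to fit the BJP scenario, roll-up is taken to mean summing over a dimension, and dice is read as restricting several dimensions at once. The function names are hypothetical and do not correspond to any OLAP product.

```python
from collections import defaultdict

DIMS = ("centre", "product", "quarter")

# Invented sales cube: (centre, product, quarter) -> units sold.
cube = {
    ("England", "jam", "Q1"): 100, ("England", "jam", "Q2"): 120,
    ("England", "sauce", "Q1"): 40, ("England", "sauce", "Q2"): 50,
    ("Scotland", "jam", "Q1"): 60, ("Scotland", "jam", "Q2"): 70,
    ("Spain", "sauce", "Q1"): 30, ("Spain", "sauce", "Q2"): 35,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, dim):
    """Roll-up: aggregate (here, sum) over one dimension."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "quarter", "Q1")              # (centre, product) view of Q1
uk_jam = dice_cube(cube, centre={"England", "Scotland"}, product={"jam"})
yearly = roll_up(cube, "quarter")                   # totals per (centre, product)
by_product = pivot(q1)                              # (product, centre) view of Q1
print(q1, uk_jam, yearly, by_product, sep="\n")
```

Drilling down is the reverse of the roll-up shown here: moving from the `yearly` totals back to the quarterly cells in `cube`.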

OLAP
• Uses various algorithms; examples are:
  a. Decision trees
  b. K-nearest neighbour

Decision Trees
• The decision tree is one of the most popular classification algorithms in current use in data mining
• A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision
• The algorithm is recursive partitioning (divide and conquer)
• Internal nodes denote a test
• A branch represents the outcome of a test
• The leaf nodes represent the class
• The algorithm is simple but extremely powerful

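Decision Tree Example
[Figure: decision tree for the candidate class label Rabbit. Each internal node is a test with Y/N branches; a failed test leads to "Not in class!" and the final leaf node gives the class.]
Rules and the corresponding tests:
• Rabbit does not have wings (test: Wings?)
• Rabbit does not swim (test: Swims?)
• Rabbit has legs (test: Legs?)
• Rabbit has whiskers (test: Whiskers?)
• Rabbit does not eat meat (test: Eats meat?)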
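The rabbit tree can be written down as nested tests and walked from the root to a leaf. This is a sketch only: the dictionary layout, the attribute names and the two sample animals are assumptions made for illustration, and the order of the tests follows the rules listed for the example.

```python
NOT_IN_CLASS = "not in class"

# Internal nodes are dicts holding a test and two branches; leaves are plain strings.
rabbit_tree = {
    "test": "wings", "yes": NOT_IN_CLASS, "no": {
        "test": "swims", "yes": NOT_IN_CLASS, "no": {
            "test": "legs", "no": NOT_IN_CLASS, "yes": {
                "test": "whiskers", "no": NOT_IN_CLASS, "yes": {
                    "test": "eats_meat", "yes": NOT_IN_CLASS, "no": "rabbit",
                },
            },
        },
    },
}


def classify(node, sample):
    """Follow the branch chosen by each test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if sample[node["test"]] else node["no"]
    return node


# Hypothetical samples described by the five properties used in the tree.
rabbit = {"wings": False, "swims": False, "legs": True, "whiskers": True, "eats_meat": False}
cat = {"wings": False, "swims": False, "legs": True, "whiskers": True, "eats_meat": True}

print(classify(rabbit_tree, rabbit))
print(classify(rabbit_tree, cat))
```

The cat passes the first four tests and is only rejected at the last one, which shows why the order of tests affects how quickly a sample is classified.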

Need to decide
• Which attributes to select in order to identify the class of the sample as quickly as possible?
• When to stop?
  • When there are no remaining attributes to test, or when the class is determined
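The slides do not say how the attribute is chosen. One common criterion is information gain (used, for example, by the ID3 algorithm): pick the attribute whose test reduces the entropy of the class labels the most. The sketch below applies it to four invented customers; the attribute names and data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting the rows on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Invented training data: does the customer order cake fillings?
rows = [
    {"customer_type": "single outlet", "centre": "England", "orders_fillings": "no"},
    {"customer_type": "single outlet", "centre": "Scotland", "orders_fillings": "no"},
    {"customer_type": "multinational", "centre": "England", "orders_fillings": "yes"},
    {"customer_type": "multinational", "centre": "Scotland", "orders_fillings": "yes"},
]

gains = {
    attribute: information_gain(rows, attribute, "orders_fillings")
    for attribute in ("customer_type", "centre")
}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Here splitting on customer type determines the class completely, so no further tests are needed, whereas splitting on centre tells us nothing.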

K-Nearest Neighbour
• The k-nearest neighbour algorithm (k-NN) is a popular method for classification
• The feature space is a multidimensional space where each pattern sample is represented as a point whose dimension is determined by the number of features used to describe the patterns.
• Firstly, the training samples and their class labels are plotted in the multidimensional feature space. The space is then partitioned into regions by the class labels of the training samples. The training phase of the algorithm consists simply of plotting the points in the feature space.
• In the actual classification phase, the same features as before are computed for a test sample. Distances from the new point to all stored points are computed and the k closest samples are selected. The test sample is assigned to the class whose label is the most frequent among the k nearest training samples.
• The algorithm is easy to implement, but it is computationally intensive, especially as the size of the training set grows.
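The classification phase described above is short enough to write out in full. In this sketch the training points are invented, there are only two features so that the points are easy to picture, and Euclidean distance is assumed as the distance measure (the slides do not specify one).

```python
from collections import Counter
from math import dist


def knn_classify(training, sample, k):
    """Assign sample to the most frequent class among its k nearest training points."""
    # "Training" is just storing the points; all the work happens at classification time.
    nearest = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training samples: (feature vector, class label).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"),
]

print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (6, 5), k=3))
```

Every classification computes the distance to every stored point, which is why the cost grows with the size of the training set.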

K-Nearest Neighbour Example
[Figure: training and test samples plotted in a two-dimensional feature space]
• This is simplistic; usually 16 or more attributes are used
• The small coloured dots are the training samples
• Each colour represents a different class label
• The large black dots are test samples
• When k > 5, boundaries are less distinct in most cases

Need to decide
• A value for k
  • The best choice of k depends upon the data
  • Generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct
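The effect of noise can be seen by planting one mislabelled point inside a cluster and classifying a nearby sample with different values of k. The data below is invented for this purpose, and Euclidean distance is again an assumption.

```python
from collections import Counter
from math import dist


def knn_classify(training, sample, k):
    """Majority vote among the k training points closest to sample."""
    nearest = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Invented training data: a cluster of class A containing one noisy point labelled B.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((1.5, 1.5), "B"),                       # noise: sits in the middle of class A
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"),
]

sample = (1.6, 1.6)
results = {k: knn_classify(training, sample, k) for k in (1, 3, 5)}
print(results)
```

With k = 1 the sample takes the label of the noisy point; with k = 3 or k = 5 the surrounding class A points outvote it.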

3. Interpretation and Evaluation
• Uses of data mining in business
  • Market segmentation
    • Identify the common characteristics of customers who buy certain products from a company.
  • Customer churn
    • Predict which customers are likely to leave your company and go to a competitor.
  • Fraud detection
    • Identify which transactions are most likely to be fraudulent.
  • Direct marketing
    • Identify which prospects should be included in a mailing list to obtain the highest response rate.
  • Supermarket basket analysis
    • Understand what products or services are commonly purchased together.
  • Trend analysis
    • Reveal the difference between a typical customer this month and last. Allows organisations to map trends and

Further Reading
• The OLAP Report
• A view from QUB
• Date, chapter 22
