220 likes | 236 Views
Explore the many meanings and applications of data mining, including accessing, reporting, clustering, predictive models, and discovering interesting patterns in large databases.
E N D
Advanced Database SystemsF24DS2 / F29AT2 About Data Mining David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining has Many meanings • There are lots of things you can do with a database: • Access data via straightforward queries, to answer • straightforward questions about instances. E.g.: • ``What is Ellen McArthur’s home phone number?” • “What is the ISBN number of Eats, shoots and leaves, • by Lynne Truss?”, • “What grade did Larry Page get for the Internet module??” • “Give me a list of all pages on the www that contain the phrase “fried egg sandwich”. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining has Many meanings (cont…) • Generate simple reports about data via straightforward queries, to answer questions about sets of instances. E.g.: • ``How many of our customers are called “Trevor?”” • “Which of our books has been borrowed more times in the last month than Eats, shoots and leaves, • by Lynne Truss?”, • “Which student has the highest average marks?” • “ What percentage of house owners also own a car?”. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining has Many meanings (cont…) • Generate complex/and/or comprehensive statistical reports about the database as a whole, to summarise and understand the data – this is what tends to be done in the Analysis stage of Data Cleaning.. E.g.: • For each field, generate a histogram of the values • Run one or more clustering algorithms to find the clusters in the data. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining has Many meanings (cont…) • Build Predictive modelsthat may then be useful for business or research. For example: • Based on stock market data, we can construct a model that attempts to predict tomorrow’s Dow Index closing price, given the previous few days’ prices. • Based on blood test data from past patients, we can construct a model that attempts to predict whether or not a patient is developing hepatitis. • Based on historic data on vibrations, we can build a model that tries to predict beforehand if an aircraft wing is likely to fail. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining has Many meanings (cont…) • Discover INTERESTING and USEFUL rules that are `hidden’ in the data. For example: • An analysis of supermarket basket data will show a surprising amount of baskets that contain both beer and nappies.. • Analysis of crime records data may find that the violent crimes rate in newcastle seems to reduce significantly whenever the violent crimes rate in Sunderland increases significantly. . David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
All Together • Accessing • Reporting • Clustering/Histograms • Predictive models • Discovery of interesting/surprising things David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Data Mining • Accessing • Reporting • Clustering/Histograms • Predictive models • Discovery of interesting/surprising things When you hear the term `data mining’, it can mean any of 2, 3, 4 and 5. In business/industry, `2’ and `3’ are called data mining. In academia we usually take data mining to mean mainly `4’ and `5’ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Notes on 3/4 • Accessing • Reporting • Clustering/Histograms • Predictive models • Discovery of interesting/surprising things These are the things that you would look at more closely in a machine learningcourse. The predictive models are things like neural networks, decision trees and rulesets David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
What DM means for us • Accessing • Reporting • Clustering/Histograms • Predictive models • Discovery of interesting/surprising things So, 3 and 4 are dealt with in another course. 5 could be an entire MSc course on its own, but that is what DM means for us. . In particular, we take a small bite of it that is relevant to practical discovery of interesting things in very large DBs. We look at a fast algorithm that can discover interesting rules in transaction databases, and that is a component in several advanced commercial systems.. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
First, some important motivational/explanatory notes Why do we need something like `type 5’ data mining at all? Couldn’t the `beer and nappies’ thing have been found by types 2 or 3 DM? The next slide shows a tiny `supermarket basket’ database. E.g. Record 11 is a customer who bought eggs and glue only; record 12 Records a transaction where the basket contained only apples. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Numbers Our example DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000 David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Rules A `rule’ is something like this: If a basket contains apples and cheese, then it also contains beer Any such rule has two associated measures: • confidence – when the `if’ part is true, how often is the `then’ bit true? • coverage or support – how much of the database contains the `if’ part? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Is that interesting ? Is that useful ? What makes a rule interesting or useful? Example: What is the confidence and coverage of: If the basket contains beer and cheese, then it also contains honey 2/20 of the records contain both beer and cheese, so coverage is 10% Of these 2, 1 contains honey, so confidence is 50% David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Interesting/Useful rules Statistically, anything that is interesting is something that happens significantly more than you would expect by chance. E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e: • There is a probability 0.1 that a basket contains bread. • There is a probability 0.04 that a basket contains washing-up powder. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Bread and washing up powder What is the probability of a basket containing both bread and washing-up powder? The laws of probability say: If these two things are independent, chance is 0.1 * 0.04 = 0.004 That is, we would expect 0.4% of baskets to contain both bread and washing up powder David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Interesting means surprising We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing up powder. If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that: • Something is going on in shoppers’ minds: bread and washing-up powder are connected in some way. • There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Finding surprising rules Suppose we ask `what is the most surprising rule in this database? This would be, presumably, a rule whose accuracy is more different from its expected accuracy than any others. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable. Looking only at rules of the form: if basket contains X and Y, then it also contains Z … our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records. … c 1016 operations … Yes, it’s easy to use `type 2’ DM, say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Here are some interesting onesin our mini basket DB: • If a basket contains glue, then it also contains either beer or eggs confidence: 100% ; coverage 25% • If a basket contains apples and dates, then it also contains honey confidence 100% ; coverage 20% David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
What this lecture was about • The many different meanings of data mining • Warming up for the next lecture, via gentle discussion on transaction databases, rules, confidence, coverage, and what it takes for a rule to be interesting. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me
Next A classic fast algorithm for finding useful rules in large databases, David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me