490 likes | 693 Views
Data Mining (and machine learning). DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Some of these slides are derivative of Nick Taylor’s slides used for this module in previous years. Overview of My Lectures.
E N D
Data Mining(and machine learning) DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Some of these slides are derivative of Nick Taylor’s slides used for this module in previous years David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html • Lecture 1: about data and data mining; • Lectures 2 and 3: Basic and useful ways to process and understand data • Lectures 4, 5, 6, 7, 8 Details of useful algorithms for finding knowledge from data; • Lecture 9: overview of what else there is. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Module assessment 100% by coursework Two main items of coursework, 50% each Four small items of coursework, worth nothing, but if you don’t do them adequately you fail the module. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
This Semester PDW lectures on Mondays (machine learning) DWC lectures on Thursdays (data mining) Friday slot usually unused – we may use it, and will let you know in advance All coursework set by DWC David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Coursework submission ALL coursework must be submitted as follows • as PDF • by email to dwcorne@gmail.com • the c/w is an attachment • Subject line: DMML Coursework A • (… or B, C, D, 1, 2) • Body of the email includes your Name and your Course (e.g. Joe Smith, BSc CS – Jill Brown, MSc AI) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
DWC lectures and c/w, key dates David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
At last, the lecture David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
What some people think can be done with data Answer simple questions like: • How many female clients do we have? • How much paint did we sell in 2007? • Which is the most profitable branch of our supermarket? • Which postcodes suffered the most dropped calls in July? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
that is so David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
that is so Boring David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
More interesting things that can be done with data Answer difficult and valuable questions like: • How can we predict Ovarian cancer early enough to treat it successfully? • How can I make significant profit on the stock market next month? • Two different authors claim to have written this story – how can we resolve the dispute? • How can we get our customers to spend more money in the store? • Is this loan applicant a good credit risk? • Is this sonar image a mine, or a rock? • What other websites will this browser be interested in? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining - Definition & Goal Definition • – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules Goal • – To permit some other goal to be achieved or performance to be improved through a better understanding of the data David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Some examples of large databases Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month Tesco has >500 stores. Per year, 100,000,000 baskets ? The Internet ~ >15,000,000,000 pages Lots of datasets: UCI Machine Learning repository How can we begin to understand and exploit such datasets? Especially the big ones? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Like this … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
and this … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
and this … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
or this … • see http://websom.hut.fi/websom/milliondemo/html/root.html David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining - Basics • Data Mining is the process of discovering patterns and inferring associations in raw data • Data Mining is a collection of techniques intended to analyse small or large amounts of data • There is no single Data Mining approach • Data Mining can employ a range of techniques, either individually or in combination with each other David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining – Why is it important? • Data are being generated in enormous quantities • Data are being collected over long periods of time • Data are being kept for long periods of time • Computing power is formidable and cheap • A variety of Data Mining software is available • All of these data contain `hidden knowledge’ – facts, rules, patterns, that can be usefully exploited if we can find them. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining – History • The approach has its roots over 40 years ago • In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS • By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Some basic terminology David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
This is called a datainstance or a record or just a line of data David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
This is called a field or an attribute; the value of the Age field in the 4th record is 274 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Usually we are interested in predicting the value of a particular field, given the values of the other fields. What we want to predict is called the class field, or the target class David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Some data-mining related projects that I am currently working on (either myself, or with a PhD student or RA) Predicting whether or not two textures will be considered similar by humans. Predicting which of two or more writers is the author of a given piece of text (you will do some work on this) Discovering which subsets of many thousands of genes play a role in specific diseases (cancer, diabetes, etc) (you will do a little work on this too) Discovering technical trading rules for stock market trading (you will do a little work on this too) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Which pair of textures is most similar? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Which pair of textures is most similar? A line of data … 0.23 1.88 9.64 3.22 … 7.1 1086.9 2.23 … 0.76 %age of people who think they are similar 5,000 features for texture1 5,000 features for texture2 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Who wrote text chunk 4? 0.4 0.2 0.001 0.002 0.6 … AuthorA 0.3 0.15 0 0.1 0.5 … AuthorA 0.2 0.2 0.001 0.002 0.5 … AuthorB 0.2 0.15 0 0.002 0.6 … ? Word usage `Fingerprint’ of a 1,000 word chunk of text David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Did the Dow Jones go up or down in the following week? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Down David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Will the Dow Jones go up or down tomorrow? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining – Two Major Types • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc. • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes • Compare with Supervised and Unsupervised systems in machine learning David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Mining – Tasks Classification - Example: high risk for cancer or not Estimation - Example: household income Prediction - Example: credit card balance transfer average amount Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z Clustering - similar to classification but no predefined classes Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women” David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Warehousing • Note that Data Mining is very generic and can be used for detecting patterns in almost any data – Retail data – Genomes – Climate data – Etc. • Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
What you should do this week Browse the UCI Machine Learning repository datasets and associated information; get acquainted with data Browse the statlib datasets archive, get acquainted with that too. And then … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Coursework A (0 marks, but you fail if you don’t submit an adequate attempt) Find three other dataset repositories as follows: • One that specialises in financial data • One that specialises in time series data • One that specialises in anything else. For each of these three, tell me the URL, and write one paragraph, ~100 words, in your own words, describing the contents of this repository, Submit on or before Friday October 30th David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Au revoir David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
If time available … Some slides about data warehousing; I don’t consider this an essential part of this module, but in case you want to know what data warehousing is … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Warehousing - Definitions “A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process” W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition. “A copy of transaction data, specifically structured for query and analysis” Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit” David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Warehouse – why? For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way Data Warehousing allows an organisation to remember its data and what it has learned about its data Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Warehouse - Contents • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting • The data will normally have been transformed when it was copied into the Data Warehouse • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Data Marts • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s acitivities David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
From Data Warhousing to Machine Learning, via Data Marts David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
The Big Challenge for Data Mining • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse • It is very important, then, that summary data also be available to get the analysis started • The sheer volume of data may mask the important relationships in which the Data Miner is interested • Being able to overcome the volume and interpret the data is essential to successful Data Mining David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
What happens in practice … Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html