Data Mining and Machine Learning Lecture 1: Why data is useful, and overview of DMML: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures
http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• There might be changes – watch your email
• C/Ws and deadlines will change slightly – give me a day or two
• Lecture material will change a little; one lecture to add
Module assessment
100% by coursework
Three main items of coursework:
• CW 1: 30%
• CW 2: 40%
• CW 3: 30%
Two small items of coursework (A and B), worth 0%, but if you don’t do them adequately you fail the module.
Extra bit added to each c/w for MSc students.
Coursework submission
ALL coursework to be submitted as follows:
• as PDF
• by email to dwcorne@gmail.com
• the c/w is an attachment
• Subject line: DMML Coursework A (… or B, 1, 2, 3)
• Body of the email includes your Name and your Course (e.g. Joe Smith, BSc CS – Jill Brown, MSc AI)
At last, the lecture
What some people think can be done with data
Answer simple questions like:
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our supermarket?
• Which postcodes suffered the most dropped calls in July?
that is so Boring
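To see why the simple questions above count as boring, note that each one is a single lookup or aggregation. The following is a minimal sketch using Python's built-in sqlite3 module; the `sales` table, its columns and all of its rows are invented for illustration and do not come from the lecture.

```python
import sqlite3

# Hypothetical sales table; every row below is made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, product TEXT, year INTEGER, quantity REAL, profit REAL)"
)
rows = [
    ("Leith", "paint", 2007, 120.0, 300.0),
    ("Leith", "paint", 2008, 90.0, 210.0),
    ("Gyle", "paint", 2007, 80.0, 260.0),
    ("Gyle", "brushes", 2007, 40.0, 500.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# "How much paint did we sell in 2007?" is a filtered sum.
paint_2007 = conn.execute(
    "SELECT SUM(quantity) FROM sales WHERE product = 'paint' AND year = 2007"
).fetchone()[0]

# "Which is the most profitable branch?" is a grouped sum plus a sort.
best_branch = conn.execute(
    "SELECT branch FROM sales GROUP BY branch ORDER BY SUM(profit) DESC LIMIT 1"
).fetchone()[0]

print(paint_2007, best_branch)
```

Nothing here is learned from the data: the answer is fully determined by the query, which is the contrast with the prediction questions on the next slide.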
More interesting things that can be done with data
Answer difficult and valuable questions like:
• How can we predict ovarian cancer early enough to treat it successfully?
• How can I make significant profit on the stock market next month?
• Two different authors claim to have written this story – how can we resolve the dispute?
• How can we get our customers to spend more money in the store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine, or a rock?
• What other websites will this browser be interested in?
Data Mining - Definition & Goal
Definition
• Data Mining is the exploration and analysis of (often) large quantities of data in order to discover meaningful patterns and rules
Goal
• To permit some other goal to be achieved or performance to be improved through a better understanding of the data
Some examples of large databases
• Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month. Tesco has >500 stores. Per year, 100,000,000 baskets?
• The Internet: >20,000,000,000 pages
• Lots of datasets: UCI Machine Learning repository
How can we begin to understand and exploit such datasets? Especially the big ones?
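The basket estimate above can be checked with a quick calculation. This sketch assumes that every store behaves like the single example store and uses exactly 500 stores, although the slide says "more than 500"; both are simplifying assumptions.

```python
# Figures taken from the slide; assumes every store resembles the one example store.
baskets_per_store_per_month = 18_000
stores = 500            # the slide says "more than 500", so this is a lower bound
months_per_year = 12

baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,}")  # 108,000,000
```

So the slide's round figure of 100,000,000 baskets per year is of the right order of magnitude.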
Like this …
and this …
or this … • see http://websom.hut.fi/websom/milliondemo/html/root.html
Or this …
What on Earth is ‘big data’ anyway?
Data Mining & Machine Learning - Basics
• Data Mining is the process of discovering patterns and inferring associations in raw data
• … a collection of techniques intended to analyse small or large amounts of data
• … can employ a range of techniques, either individually or in combination with each other
• Machine Learning is the same, but the term ML emphasises a range of more sophisticated algorithms that try to learn accurate predictive models of data
Data Mining – Why is it important?
• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available
• All of these data contain ‘hidden knowledge’ – facts, rules, patterns, that can be usefully exploited if we can find them.
Some basic terminology
This is called a data instance or a record or just a line of data
This is called a field or an attribute; the value of the Age field in the 4th record is 274
Usually we are interested in predicting the value of a particular field, given the values of the other fields. What we want to predict is called the class field, or the target class
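The terminology above can be illustrated with a tiny table in code. This is a minimal sketch: the field names and all values are invented for illustration and are not the table shown in the lecture.

```python
# Hypothetical toy dataset; field names and values are invented for illustration.
fields = ["Age", "Income", "Owns_home", "Good_credit_risk"]
records = [                      # each inner list is one record (data instance)
    [34, 28000, "yes", "yes"],
    [51, 15000, "no", "no"],
    [27, 41000, "no", "yes"],
    [45, 22000, "yes", "no"],
]

class_field = "Good_credit_risk"  # the field we want to predict
class_index = fields.index(class_field)

# Split each record into the input fields and the target class value.
inputs = [[v for i, v in enumerate(r) if i != class_index] for r in records]
targets = [r[class_index] for r in records]

# Value of the Age field in the 4th record (list index 3).
age_of_fourth = records[3][fields.index("Age")]
print(age_of_fourth, targets)
```

A learning algorithm would be given `inputs` and `targets` for known records and asked to predict the class field for new records.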
Some data-mining related projects that I am currently working on (either myself, or with a PhD student or RA)
• Analysing sonar images to detect underwater mines
• Predicting which of two or more writers is the author of a given piece of text
• Discovering which subsets of many thousands of genes play a role in specific diseases (cancer, diabetes, etc.)
• Analysing the current Twitter timeline to detect immediate evidence of an earthquake
Who wrote text chunk 4?
Word usage ‘fingerprint’ of a 1,000 word chunk of text:
0.4   0.2    0.001   0.002   0.6   …   AuthorA
0.3   0.15   0       0.1     0.5   …   AuthorA
0.2   0.2    0.001   0.002   0.5   …   AuthorB
0.2   0.15   0       0.002   0.6   …   ?
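One simple way to attack a question like this is nearest-neighbour classification: give the unknown chunk the label of the labelled chunk whose fingerprint is closest. The sketch below is an illustrative assumption, not necessarily the method used in the lecture, and it uses only the five columns visible on the slide (the remaining columns are elided there).

```python
import math

# The five word-usage frequencies visible on the slide for each 1,000-word chunk.
# The slide elides further columns ("…"), so these vectors are incomplete.
labelled = [
    ([0.4, 0.2, 0.001, 0.002, 0.6], "AuthorA"),
    ([0.3, 0.15, 0.0, 0.1, 0.5], "AuthorA"),
    ([0.2, 0.2, 0.001, 0.002, 0.5], "AuthorB"),
]
unknown = [0.2, 0.15, 0.0, 0.002, 0.6]

def nearest_author(query, examples):
    """Return the label of the example closest to query (Euclidean distance)."""
    return min(examples, key=lambda ex: math.dist(query, ex[0]))[1]

print(nearest_author(unknown, labelled))
```

On these five columns alone the closest labelled chunk is the AuthorB one. That reflects the truncated toy data and should not be read as the lecture's answer; a real fingerprint would have many more word frequencies and many more labelled chunks.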
Did the Dow Jones go up or down in the following week?
Down
Will the Dow Jones go up or down tomorrow?
Data Warehousing
• Note that Data Mining is very generic and can be used for detecting patterns in almost any data
  – Retail data
  – Genomes
  – Climate data
  – Etc.
• Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector
What you should do this week
• Browse the UCI Machine Learning repository datasets and associated information; get acquainted with data
• Browse the statlib datasets archive; get acquainted with that too
• Browse the http://www.kaggle.com/ website - to give you some idea of how hot data mining is
And then …
Coursework A (0 marks, but you fail if you don’t submit an adequate attempt)
Find three other dataset repositories as follows:
• One that specialises in sports data
• One that specialises in time series data
• One that specialises in anything else that is interesting.
For each of these three, tell me the URL, and write one paragraph, ~100 words, in your own words, describing the contents of this repository.
Submit on or before 23:59 Friday October 9th
dataset “repository”?
• A collection of datasets, probably with an overall theme
• Not a single dataset
• Not a big deal
If interested… Some slides about data warehousing; I don’t consider this an essential part of this module, but in case you want to know what data warehousing is …
Data Warehousing - Definitions
“A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process”
W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 (a very influential definition)
“A copy of transaction data, specifically structured for query and analysis”
Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
Data Warehouse – why?
• For organisational learning to take place, data from many sources must be gathered together over time and organised in a consistent and useful way
• Data Warehousing allows an organisation to remember its data and what it has learned about its data
• Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it
Data Warehouse - Contents
• A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting
• The data will normally have been transformed when it was copied into the Data Warehouse
• The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
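The three properties above (a copy of transaction data, transformed on the way in, fixed once loaded but open to additions) can be sketched as an append-only store. Everything in this sketch is hypothetical: the class names, the transaction layout and the particular transformations are assumptions chosen for illustration, not a real warehouse design.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class WarehouseRow:
    """One transformed, read-only copy of a transaction."""
    day: str
    branch: str
    amount_pence: int

class Warehouse:
    """Append-only store: rows can be added but never changed."""

    def __init__(self) -> None:
        self._rows: List[WarehouseRow] = []

    def load(self, transaction: dict) -> None:
        # Transform on the way in: keep only the date, tidy the branch name,
        # and store money as integer pence.
        self._rows.append(WarehouseRow(
            day=transaction["timestamp"][:10],
            branch=transaction["branch"].strip().title(),
            amount_pence=round(transaction["amount"] * 100),
        ))

    def rows(self) -> Tuple[WarehouseRow, ...]:
        # Return an immutable view so callers cannot edit the stored contents.
        return tuple(self._rows)

wh = Warehouse()
wh.load({"timestamp": "2015-10-09T14:03:00", "branch": " leith ", "amount": 12.5})
print(wh.rows())
```

Because `WarehouseRow` is a frozen dataclass, trying to assign to a field of a stored row raises an error, which mirrors the "cannot be updated" bullet, while `load` can still be called to add more rows.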
Data Marts
• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse
• A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s activities
From Data Warehousing to Machine Learning, via Data Marts
The Big Challenge for Data Mining
• The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse
• It is very important, then, that summary data also be available to get the analysis started
• The sheer volume of data may mask the important relationships in which the Data Miner is interested
• Being able to overcome the volume and interpret the data is essential to successful Data Mining
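As a small illustration of what "summary data" might look like, the sketch below reduces a list of basket records to one summary row per store. The store labels and basket totals are invented for illustration; a real warehouse would hold vastly more rows, which is exactly why the summary is useful as a starting point.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical basket records: (store, basket_total_in_pounds). Invented values.
baskets = [("A", 12.0), ("A", 30.0), ("B", 8.0), ("B", 10.0), ("B", 12.0)]

totals_by_store = defaultdict(list)
for store, total in baskets:
    totals_by_store[store].append(total)

# One summary row per store: basket count and mean basket value.
summary = {
    store: {"baskets": len(vals), "mean_total": mean(vals)}
    for store, vals in totals_by_store.items()
}
print(summary)
```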
What happens in practice …
• Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions
• The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data
• There are limitations though: a Data Warehouse cannot correct problems with its data, although it may help to identify them more clearly