Data Mining (and machine learning)
ROC curves
Rule Induction
Basics of Text Mining
David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Two classes is a common and special case
• Medical applications: cancer, or not?
• Computer Vision applications: landmine, or not?
• Security applications: terrorist, or not?
• Biotech applications: gene, or not?
• …

True Positive: these are ideal. E.g. we correctly detect cancer.
False Positive: to be minimised. These cause false alarms; it can be better to be safe than sorry, but false alarms can be very costly.
False Negative: also to be minimised. Missing a landmine or a cancer is very bad in many applications.
True Negative?:
Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task
Sensitivity = TP/(TP+FN): how many of the real ‘Yes’ cases are detected? How sensitive is the classifier to ‘Yes’ cases?
Specificity = TN/(FP+TN): how many of the real ‘No’ cases are detected?
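The two formulas can be checked with a few lines of Python. This is only a minimal sketch: the function names and the eight example labels are made up for illustration and are not from the course materials.

```python
def sensitivity(tp, fn):
    """Fraction of the real 'Yes' cases that the classifier detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of the real 'No' cases that the classifier detects."""
    return tn / (tn + fp)


def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for boolean labels (True means 'Yes')."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    tn = sum((not a) and (not p) for a, p in pairs)
    return tp, fp, fn, tn


# Hypothetical data: 4 real 'Yes' cases followed by 4 real 'No' cases.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```

Here one 'Yes' case is missed and one 'No' case raises a false alarm, so both measures come out at 3/4.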
[Figure sequence: YES and NO examples, with the dividing line moved to four different positions]
• Sensitivity: 100%, Specificity: 25%
• Sensitivity: 93.8%, Specificity: 50%
• Sensitivity: 81.3%, Specificity: 83.3%
• Sensitivity: 56.3%, Specificity: 100%
Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task
Sensitivity = TP/(TP+FN): how many of the real TRUE cases are detected? How sensitive is the classifier to TRUE cases?
A highly sensitive test for cancer: if “NO” then you can be sure it’s “NO”.
Specificity = TN/(TN+FP): how sensitive is the classifier to the negative cases?
A highly specific test for cancer: if “Y” then you can be sure it’s “Y”.
With many trained classifiers, you can ‘move the line’ in this way. E.g. with NB (Naive Bayes), we could use a threshold indicating how much higher the log likelihood for Y should be than for N.
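The thresholding idea can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the log likelihoods are invented numbers, and `predict_yes` and `sens_spec` are hypothetical helper names.

```python
def predict_yes(loglik_yes, loglik_no, threshold=0.0):
    """Predict 'Y' only if log L(Y) exceeds log L(N) by more than threshold."""
    return (loglik_yes - loglik_no) > threshold


def sens_spec(cases, threshold):
    """cases: list of (loglik_yes, loglik_no, is_really_yes) tuples."""
    tp = fp = fn = tn = 0
    for ll_yes, ll_no, truth in cases:
        pred = predict_yes(ll_yes, ll_no, threshold)
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical log likelihoods, made up for illustration only.
cases = [
    (-1.0, -4.0, True),   # margin  3.0
    (-2.0, -3.0, True),   # margin  1.0
    (-3.0, -2.5, True),   # margin -0.5
    (-2.0, -2.5, False),  # margin  0.5
    (-4.0, -2.0, False),  # margin -2.0
    (-5.0, -1.0, False),  # margin -4.0
]

# Raising the threshold trades sensitivity for specificity.
for threshold in (-1.0, 0.0, 2.0):
    sens, spec = sens_spec(cases, threshold)
    print(f"threshold={threshold:+.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

A low threshold says "Y" readily (high sensitivity, lower specificity); a high threshold says "Y" only when the evidence is strong (the reverse).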
ROC curves
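An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the decision threshold moves. As a rough sketch, the four sensitivity/specificity pairs quoted earlier in these slides can be turned into ROC points; the helper names and the trapezoid-rule area estimate are illustrative additions, and four operating points give only a coarse curve.

```python
# (sensitivity, specificity) pairs quoted earlier in the slides,
# one pair per position of the dividing line.
pairs = [(1.000, 0.250), (0.938, 0.500), (0.813, 0.833), (0.563, 1.000)]


def roc_points(pairs):
    """Convert (sensitivity, specificity) pairs into sorted (FPR, TPR) points.

    The trivial end points (0, 0) and (1, 1) are added so the curve
    spans the whole false-positive-rate axis.
    """
    points = [(1.0 - spec, sens) for sens, spec in pairs]
    points += [(0.0, 0.0), (1.0, 1.0)]
    return sorted(points)


def area_under_curve(points):
    """Area under the piecewise-linear curve, using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


points = roc_points(pairs)
for fpr, tpr in points:
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
print("approximate area under the curve:", round(area_under_curve(points), 3))
```

A curve that hugs the top-left corner (high TPR at low FPR) indicates a better classifier; the diagonal from (0, 0) to (1, 1) corresponds to random guessing.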
Rule Induction
• Rules are useful when you want to learn a clear / interpretable classifier, and are less worried about squeezing out as much accuracy as possible
• There are a number of different ways to ‘learn’ rules or rulesets.
• Before we go there, what is a rule / ruleset?
Rules
IF Condition … Then Class Value is …
Rules are Rectangular
IF (X>0)&(X<5)&(Y>0.5)&(Y<5) THEN YES
IF (X>5)&(X<11)&(Y>4.5)&(Y<5.1) THEN NO
[Figures: YES and NO points; X axis 0 to 12, Y axis 0 to 5]
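Written as code, each of the two rules above is a test of whether a point lies inside an axis-aligned rectangle. A minimal sketch follows; the function names are hypothetical, and returning `None` for "the rule does not fire" is a choice made here for illustration.

```python
def rule_yes(x, y):
    """IF (X>0)&(X<5)&(Y>0.5)&(Y<5) THEN YES; None if the rule does not fire."""
    if 0 < x < 5 and 0.5 < y < 5:
        return "YES"
    return None


def rule_no(x, y):
    """IF (X>5)&(X<11)&(Y>4.5)&(Y<5.1) THEN NO; None if the rule does not fire."""
    if 5 < x < 11 and 4.5 < y < 5.1:
        return "NO"
    return None


print(rule_yes(2, 3))  # inside the first rectangle
print(rule_yes(6, 3))  # outside it, so the rule says nothing
print(rule_no(7, 5))   # inside the second rectangle
```

Because every condition is a bound on a single attribute, the region a rule covers is always a rectangle (a hyper-rectangle with more attributes).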
A Ruleset
IF Condition1 … Then Class = A
IF Condition2 … Then Class = A
IF Condition3 … Then Class = B
IF Condition4 … Then Class = C
…
What’s wrong with this ruleset? (two things)
[Figure: a ruleset drawn over YES and NO points; X axis 0 to 12, Y axis 0 to 5]
What about this ruleset?
[Figure: another ruleset drawn over YES and NO points; X axis 0 to 12, Y axis 0 to 5]
Two ways to interpret a ruleset:

As a Decision List
IF Condition1 … Then Class = A
ELSE IF Condition2 … Then Class = A
ELSE IF Condition3 … Then Class = B
ELSE IF Condition4 … Then Class = C
…
ELSE … predict Majority Class

As an unordered set
IF Condition1 … Then Class = A
IF Condition2 … Then Class = A
IF Condition3 … Then Class = B
IF Condition4 … Then Class = C
Check each rule and gather votes for each class
If no winner, predict majority class
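The two interpretations can give different answers for the same ruleset. The sketch below assumes a made-up ruleset over a single numeric attribute, a hypothetical majority class "B", and treats a tied vote as "no winner"; none of these details come from the slides.

```python
from collections import Counter

# A rule is a (condition, class_label) pair; the conditions are hypothetical.
rules = [
    (lambda x: x < 2, "A"),
    (lambda x: x < 4, "A"),
    (lambda x: 3 < x < 7, "B"),
    (lambda x: x > 8, "C"),
]


def predict_decision_list(rules, instance, majority_class):
    """Return the class of the first rule that fires, else the majority class."""
    for condition, label in rules:
        if condition(instance):
            return label
    return majority_class


def predict_by_votes(rules, instance, majority_class):
    """Let every firing rule vote; fall back to the majority class on no votes or a tie."""
    votes = Counter(label for condition, label in rules if condition(instance))
    ranked = votes.most_common()
    if not ranked:
        return majority_class
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return majority_class  # tie between classes: no winner
    return ranked[0][0]


for x in (1, 3.5, 5, 7.5, 9):
    print(x, predict_decision_list(rules, x, "B"), predict_by_votes(rules, x, "B"))
```

At x = 3.5 the second and third rules both fire: the decision list stops at the first match and says A, while the vote is tied and falls back to the majority class.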
Three broad ways to learn rulesets
1. Just build a decision tree with ID3 (or something else) and you can translate the tree into rules!
2. Use any good search/optimisation algorithm. Evolutionary (genetic) algorithms are the most common. You will do this in coursework 3. This means simply guessing a ruleset at random, and then trying mutations and variants, gradually improving them over time.
3. A number of ‘old’ AI algorithms exist that still work well, and/or can be engineered to work with an evolutionary algorithm. The basic idea is: iterated coverage
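The second approach (guess at random, then mutate and keep improvements) can be sketched in its simplest form as below. This is a single-rule hill climber, not a full evolutionary algorithm with a population, and it is not the coursework specification; the data set, the rule encoding and all function names are assumptions made for illustration.

```python
import random

# Hypothetical training set of (x, y, label) points.
DATA = [(1, 1, "YES"), (2, 3, "YES"), (3, 2, "YES"), (4, 4, "YES"),
        (7, 1, "NO"), (8, 3, "NO"), (9, 2, "NO"), (10, 4, "NO")]


def accuracy(rule, data):
    """rule = (x_lo, x_hi, y_lo, y_hi): predict YES inside the rectangle, NO outside."""
    x_lo, x_hi, y_lo, y_hi = rule
    correct = 0
    for x, y, label in data:
        inside = x_lo < x < x_hi and y_lo < y < y_hi
        correct += (label == "YES") == inside
    return correct / len(data)


def mutate(rule, rng, step=1.0):
    """Nudge one randomly chosen bound by a small random amount."""
    bounds = list(rule)
    i = rng.randrange(4)
    bounds[i] += rng.uniform(-step, step)
    return tuple(bounds)


def evolve(data, generations=500, seed=0):
    """Start from a random rectangle and keep any mutation that is no worse."""
    rng = random.Random(seed)
    xs = sorted(rng.uniform(0, 12) for _ in range(2))
    ys = sorted(rng.uniform(0, 5) for _ in range(2))
    best = (xs[0], xs[1], ys[0], ys[1])
    start_score = best_score = accuracy(best, data)
    for _ in range(generations):
        candidate = mutate(best, rng)
        score = accuracy(candidate, data)
        if score >= best_score:
            best, best_score = candidate, score
    return best, start_score, best_score


best_rule, start_score, final_score = evolve(DATA)
print("start accuracy:", start_score, "final accuracy:", final_score)
```

Accepting mutations that are equally good (not only strictly better) lets the search drift across plateaus, which is a common design choice in this kind of search.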
[Figure sequence: YES and NO points; X axis 0 to 12, Y axis 0 to 5]
• Take each class in turn ..
• Pick a random member of that class in the training set
• Extend it as much as possible without including another class
• Next class
• And so on…
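The steps above can be sketched as follows. The slides do not say exactly how a rule is "extended", so this sketch makes an assumption: it grows a bounding box around the starting point by absorbing same-class points, nearest first, and rejects any point whose inclusion would pull a point of another class into the box. The data and all names are hypothetical.

```python
import random


def bounding_box(points):
    """Smallest axis-aligned box (x_lo, x_hi, y_lo, y_hi) containing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), max(xs), min(ys), max(ys)


def inside(box, point):
    x_lo, x_hi, y_lo, y_hi = box
    return x_lo <= point[0] <= x_hi and y_lo <= point[1] <= y_hi


def grow_rule(start, same_class, other_class):
    """Extend a box from `start` without including any point of another class."""
    members = [start]
    by_distance = sorted(
        same_class,
        key=lambda p: (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2,
    )
    for p in by_distance:
        box = bounding_box(members + [p])
        if not any(inside(box, q) for q in other_class):
            members.append(p)
    return bounding_box(members)


def iterated_coverage(data, seed=0):
    """data maps class label -> list of (x, y). Returns a list of (box, label) rules."""
    rng = random.Random(seed)
    rules = []
    for label, points in data.items():            # take each class in turn
        others = [q for other, pts in data.items() if other != label for q in pts]
        uncovered = list(points)
        while uncovered:
            start = rng.choice(uncovered)          # pick a random member
            box = grow_rule(start, uncovered, others)
            rules.append((box, label))
            uncovered = [p for p in uncovered if not inside(box, p)]
    return rules


# Hypothetical training data, assuming no two classes share a point.
data = {
    "YES": [(1, 1), (2, 3), (3, 2), (6, 4)],
    "NO": [(4, 3), (8, 1), (9, 4)],
}
rules = iterated_coverage(data)
for box, label in rules:
    print(label, box)
```

Each pass covers at least the starting point, so the loop always terminates; the result is a set of rectangles, each containing training points of one class only.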
Text as Data: the basics
So, the most frequent words in a document carry the most useful information ... ?
Desktops, laptops, LED-TVs: which is which?
Some motivations for text mining
• Start your own company http://www.text-analytics.com/comp.html
• Recommendations
  • “if you like that, you might also like these …”
  • On amazon, or any general product sales site, this can be based on distances between (e.g.) 200 word summaries or ToC of a book, or text that describes a product in a catalogue
• Document classification
• Coping with information overload
• Sentiment analysis ... Hotel reviews / product reviews
But a document is a “bag of words” – we need to convert to numbers
A one-slide text-mining tutorial
Documents: an article about politics; an essay about sport; another article about politics
Vectors based on word frequencies: (0.1, 0.2, 0, 0.02 ...) (0.4, 0, 0.1, 0 ...) (0.11, 0.3, 0, 0.01 ..)
NOW you can do Clustering, Retrieving similar Documents, Supervised Classification, etc.
One key issue is to choose the right set of words (or other features)
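A minimal sketch of this pipeline: turn each document into a vector of relative word frequencies over a chosen vocabulary, then compare vectors. The three texts, the four-word vocabulary and the use of cosine similarity are all assumptions made for illustration; the slide does not specify a particular word set or distance measure.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocabulary]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and documents.
vocabulary = ["election", "vote", "goal", "match"]
docs = {
    "politics_1": "The election is close and every vote counts in this election",
    "sport": "A late goal decided the match and the goal was superb",
    "politics_2": "Turnout was high as people queued to vote in the election",
}
vectors = {name: frequency_vector(text, vocabulary) for name, text in docs.items()}

print(cosine_similarity(vectors["politics_1"], vectors["politics_2"]))
print(cosine_similarity(vectors["politics_1"], vectors["sport"]))
```

With these made-up texts the two politics documents end up close together and the sport essay shares no vocabulary words with them, which is what clustering or similar-document retrieval would rely on.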
How did I get these vectors from these two ‘documents’?

<h1> Compilers</h1>
<p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p>

<h1> Compilers: lecture 1 </h1>
<p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p>

26, 2, 2
35, 2, 0