A Brief Tour of Machine Learning – David Lindsay
What is Machine Learning? • A very multidisciplinary field – statistics, mathematics, artificial intelligence, psychology, philosophy, cognitive science… • In a nutshell – developing algorithms that learn from data • Historically – flourished with advances in computing in the early 1960s, with a resurgence in the late 1990s
Main areas in Machine Learning • #1 Supervised learning – assumes a teacher exists to label/annotate the data • #2 Unsupervised learning – no need for a teacher; try to learn relationships automatically • #3 Reinforcement learning – biologically plausible; try to learn from reward/punishment feedback
Supervised Learning – learning with a teacher
More about Supervised Learning Perhaps the most well-studied area of machine learning – lots of nice theory adapted from statistics/mathematics. Assumes the existence of a training set and a test set, and the i.i.d. assumption is commonly made. The main sub-areas of research are: • Pattern recognition (discrete labels) • Regression (continuous labels) • Time series analysis (temporal dependence in the data)
The formalisation of data • How do we formally describe our data? Each example pairs an object with a label. The object is commonly represented as a feature vector; the individual features can be real, discrete, symbolic… e.g. patient symptoms: temperature, sex, eye colour… The label is the property of the object that we want to predict in the future using our training data – e.g. in cancer screening the labels could be Y = {normal, benign, malignant}.
The formalisation of data (continued) • What is training and test data? [Figure: a training set of labelled images (y values such as 7, 6, 1, 2) and new test images marked "?" – their labels are either not known or withheld from the learner.] We learn from the training data and try to predict new, unseen test data. More formally, we have a set of n training and test examples (information pairs – object + label) drawn from some unknown probability distribution P(X, Y).
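To make the formalism concrete, here is a minimal sketch of a training set of feature vectors with labels, and a test set whose labels are withheld (all the numbers are invented for illustration):

```python
import numpy as np

# Training examples: each row is a feature vector x_i describing an object,
# paired with a label y_i supplied by the "teacher".
X_train = np.array([[0.1, 0.7, 0.3],
                    [0.9, 0.2, 0.5],
                    [0.4, 0.4, 0.8]])   # 3 objects, 3 features each
y_train = np.array([7, 6, 1])           # labels for the training objects

# Test objects drawn from the same (unknown) distribution P(X, Y);
# their labels are withheld and must be predicted by the learner.
X_test = np.array([[0.2, 0.6, 0.4]])
```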
More about Pattern Recognition Lots of algorithms/techniques – the main contenders: • Support Vector Machines (SVM) • Nearest Neighbours • Decision Trees • Neural Networks • Multivariate Statistics • Bayesian algorithms • Logic programming
The mighty SVM algorithm [Figure: two classes of points (■ and ☺) separated by a linear boundary.] • Very popular technique – lots of followers, relatively new • Very simple technique – related to the Perceptron; it is a linear classifier (separates the data into half-spaces) • Concept – keep the classifier simple and don't over-fit the data, so that the classifier generalises well on new test data (Occam's razor) • Concept – if the data are not linearly separable, use a kernel Φ to map into a higher-dimensional feature space where the data may be separable
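To make the kernel idea concrete, here is a minimal sketch using scikit-learn's SVC (an illustrative choice of library, not one the slides name): an RBF kernel separates a toy data set that no hyperplane in the original space can.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data that is NOT linearly separable: one class sits inside a ring.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # label by distance from origin

# The RBF kernel implicitly maps the data into a higher-dimensional feature
# space where a separating hyperplane exists.
clf = SVC(kernel="rbf", C=1.0)  # C trades margin simplicity against training error
clf.fit(X, y)
print(clf.score(X, y))          # training accuracy; near 1.0 on this toy problem
```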
Hot topics in SVMs • Kernel design – central to the application to data; e.g. when the objects are text documents and the features are words, incorporate domain knowledge about grammar • Applying the kernel technique to other learning algorithms, e.g. Neural Networks
The trusty old Nearest Neighbour algorithm • Born in the 1960s – probably the simplest of all algorithms to understand • Decision rule – classify a new test example by finding the closest neighbouring example in the training set and predicting the same label • Lots of theory justifying its convergence properties • A very lazy technique, and not very fast – it has to search the whole training set for each test example
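The decision rule fits in a few lines of Python; a minimal sketch (the toy data is invented):

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_test):
    """Classify x_test with the label of its closest training example."""
    # Euclidean distance from x_test to every training example
    # (this linear scan is exactly why prediction is slow).
    distances = np.linalg.norm(X_train - x_test, axis=1)
    return y_train[np.argmin(distances)]

# Toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["A", "A", "B"])
print(nearest_neighbour_predict(X_train, y_train, np.array([4.5, 4.8])))  # -> "B"
```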
Problems with Nearest Neighbours • Examples are viewed in Euclidean space, so the method can be very sensitive to feature scaling • Finding computationally efficient ways to search for the nearest neighbouring example
Decision Trees • Many different varieties: C4.5, CART, ID3… • Algorithms build classification rules using a tree of if-then statements • The tree is constructed using Minimum Description Length (MDL) principles (try to make the tree as simple as possible) [Figure: example tree – IF temperature > 65 → patient has fever; then IF dehydrated = yes → patient has flu, ELSE → patient has pneumonia.]
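Such a tree is literally a nest of if-then tests; a hand-written sketch of the slide's example (the branch structure is a guess at the diagram, which algorithms like C4.5 would learn automatically from data):

```python
def diagnose(temperature, dehydrated):
    """Hand-built decision tree mirroring the slide's example diagram."""
    if temperature > 65:      # first split: does the patient have a fever?
        if dehydrated:        # second split distinguishes flu from pneumonia
            return "flu"
        return "pneumonia"
    return "no fever"

print(diagnose(temperature=70, dehydrated=True))  # -> "flu"
```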
Benefits/Issues with Decision Trees • Instability – minor changes to the training data can make huge changes to the decision tree • The user can visualise/interpret the hypothesis directly, and can find interesting classification rules • Problems with continuous real attributes – they must be discretised • Large AI following, and widely used in industry
Mystical Neural Networks • Very flexible; learning is a gradient descent process (back-propagation) • Training neural networks involves a lot of design choices: • what network structure, how many hidden layers… • how to encode the data (values must be scaled to [0, 1]) • use momentum to speed up convergence • use weight decay to keep the hypothesis simple
Training a neural network [Figure: a network with an input layer (menopausal status, ultrasound score, CA125), a hidden layer and an output layer; sigmoid activation functions; weight vectors w1, w2; error surface E(w).] The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set – a gradient descent problem. The learnt hypothesis is represented by the weights that interconnect the neurons.
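A minimal sketch of that gradient descent loop for a one-hidden-layer sigmoid network (the data and network size are invented, and biases are omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy inputs scaled to [0, 1]: menopausal status, ultrasound score, CA125
X = np.array([[0.0, 0.2, 0.1], [1.0, 0.9, 0.8]])
y = np.array([[0.0], [1.0]])

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))  # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))  # hidden -> output weights
lr = 0.5                                 # gradient descent step size

for _ in range(1000):
    # Forward pass
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: gradients of the squared error E(w) via the chain rule
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h
```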
Interesting applications • Bioinformatics: • genetic/protein code analysis • microarray analysis • gene regulatory pathways • WWW: • classifying text/html documents • filtering images • filtering emails
Bayesian Algorithms • Try to model the interrelationships between variables probabilistically • Can build expert/domain knowledge directly into the classifier as prior belief in certain events • Use the basic axioms of probability theory to extract probabilistic estimates
Bayesian algorithms in practice • Lots of different algorithms – Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)… • Has a large following – especially at Microsoft Research [Figure: a belief network over Weather = sunny, Temperature < 65 and Humidity > 100, leading to the decision to play tennis or play Monopoly – causal links between features can be modelled.]
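As a concrete example of the family, here is a minimal Naïve Bayes sketch – the simplest Bayesian classifier, which, unlike a belief network, assumes the features are independent given the label (the tennis/Monopoly data is invented, and no smoothing is applied):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, label). Returns count tables."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)   # name -> (value, label) -> count
    for features, label in examples:
        for name, value in features.items():
            feature_counts[name][(value, label)] += 1
    return label_counts, feature_counts

def predict(features, label_counts, feature_counts):
    """Pick the label maximising P(label) * prod_i P(feature_i | label)."""
    best, best_score = None, -1.0
    total = sum(label_counts.values())
    for label, count in label_counts.items():
        score = count / total                # prior P(label)
        for name, value in features.items():
            score *= feature_counts[name][(value, label)] / count  # likelihood
        if score > best_score:
            best, best_score = label, score
    return best

data = [({"weather": "sunny", "humid": False}, "tennis"),
        ({"weather": "sunny", "humid": True}, "monopoly"),
        ({"weather": "rain", "humid": True}, "monopoly")]
lc, fc = train_naive_bayes(data)
print(predict({"weather": "sunny", "humid": False}, lc, fc))  # -> "tennis"
```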
Issues with Bayesian algorithms • Tractability – finding solutions needs numerical approximations or computational shortcuts • Can model causal relationships between variables • Need lots of data to estimate probabilities from observed training-data frequencies
Very important side problems • Feature Selection/Extraction – using Principal Component Analysis, Wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis (a PCA sketch is given below) • Imputation – what to do with missing features? • Visualisation – make the hypothesis human readable/interpretable • Meta-learning – how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines)
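For instance, a minimal PCA sketch via the singular value decomposition (one standard way to implement it):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centred = X - X.mean(axis=0)           # PCA assumes zero-mean features
    # Right singular vectors of the centred data are the principal directions.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T   # coordinates in the reduced space
```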
Very important side problems (continued) • How to incorporate domain knowledge into a learner • Trade-off between complexity (accuracy on training data) vs. generalisation (accuracy on test data) • Pre-processing of data – normalising, standardising, discretising • How to test – leave-one-out, cross-validation, stratification, online, offline… (see the sketch below)
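A minimal k-fold cross-validation sketch (any model object with fit/predict methods would do; with k = n this becomes leave-one-out):

```python
import numpy as np

def k_fold_accuracy(model, X, y, k=5):
    """Estimate generalisation accuracy by k-fold cross-validation."""
    indices = np.random.permutation(len(y))  # shuffle before splitting
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return np.mean(scores)   # average held-out accuracy over the k folds
```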
Unsupervised Learning – learning without a teacher
An introduction to Unsupervised Learning • No need for a teacher/supervisor • Mainly clustering – trying to group objects into sensible clusters • Novelty detection – finding strange examples in the data [Figure: clustering examples and novelty detection.]
Algorithms available • For clustering: the EM algorithm, K-Means (sketched below), Self-Organising Maps (SOM) • For novelty detection: one-class SVM, support vector regression, Neural Networks
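A minimal K-Means sketch in NumPy (random initialisation only; practical implementations add restarts and a convergence test; X is assumed to be a float array):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups by alternating assign/update steps."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres
```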
Issues and Applications • Very useful for extracting information from data • Used in medicine to identify disease subtypes • Used to cluster web documents automatically • Used to identify customer target groups in business • Not much publicly available data to test algorithms with
Reinforcement Learning – learning inspired by nature
An introduction • The most biologically plausible area – feedback is given through reward/punishment stimuli • A field with a lot of theory but needing real-life applications (other than playing Backgammon) • But it also encompasses the large field of Evolutionary Computing • Applications are more open ended • Getting closer to what the public consider AI
Traditional Reinforcement Learning • Techniques use dynamic programming to search for an optimal strategy • Algorithms search to maximise their reward • Q-Learning (Chris Watkins, next door) is the most well-known technique • The only successful applications are to games and toy problems • A lack of real-life applications • Very few researchers in this field
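The heart of Q-Learning is a single update rule, Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]; a minimal tabular sketch (the state/action sizes are invented):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning step: nudge Q(s, a) towards the observed
    reward plus the discounted value of the best next action."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# Toy usage: 5 states, 2 actions, all values initially zero.
Q = np.zeros((5, 2))
q_update(Q, state=0, action=1, reward=1.0, next_state=2)
print(Q[0, 1])  # 0.1 after one step
```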
Evolutionary Computing • Inspired by the process of biological evolution • Essentially an optimisation technique – the problem is encoded as a chromosome • We find new/better solutions to the problem by sexual reproduction (crossover) and mutation, which encourage exploration of the search space (see the sketch below)
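A minimal genetic-algorithm sketch (bit-string chromosomes, fitness-proportional selection, one-point crossover and per-bit mutation; the OneMax fitness function is a stand-in for a real encoded problem):

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=50,
           crossover_rate=0.7, mutation_rate=0.01):
    """Evolve bit-string chromosomes towards higher fitness."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: fitter chromosomes are more likely to reproduce.
        weights = [fitness(c) + 1e-9 for c in pop]   # avoid all-zero weights
        parents = random.choices(pop, weights=weights, k=pop_size)
        next_pop = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < crossover_rate:     # sexual reproduction
                point = random.randrange(1, n_bits)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            # Mutation: flip each bit with a small probability.
            next_pop += [[bit ^ (random.random() < mutation_rate) for bit in c]
                         for c in (a, b)]
        pop = next_pop
    return max(pop, key=fitness)

# Toy usage: maximise the number of 1-bits ("OneMax").
print(evolve(fitness=sum))
```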
Techniques available in Evolutionary Computing • Lower-level optimisers: • Evolutionary Programming, Evolutionary Algorithms • Genetic Programming, Genetic Algorithms • Evolutionary Strategy • Simulated Annealing • Higher-level optimisers: • Tabu search • Multi-objective optimisation [Figure: a Pareto front of optimal solutions plotted against two objectives – which one should we pick?]
Issues in Evolutionary Computing • How to encode the problem is very important • Setting mutation/crossover rates is very ad hoc • Very computationally/memory intensive • Not much theory can be developed – frowned upon by machine learning theorists