Pattern recognition

Pattern recognition – basic concepts

Sample • input attribute, attribute, feature, input variable, • independent variable • (atribut, rys, příznak, vstupní proměnná, nezávisle • proměnná) • class, output variable, dependendent variable • (třída, výstupní proměnná, závislá proměnná) • sample • (vzorek)

Handwritten digits

each digit 28 x 28 pixels • so each digit can be represented by a vector x comprising 784 real numbers • goal: • build a machine that will take x as input and will produce the identity of the digit 0 … 9 as the output • non-trivial problem due to the wide variability of handwriting • could be tackled by using rules for distinguishing the digits based on the shapes of the strokes • in practice such an approach leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results

better way – adopt machine learning algorithm (i.e. use some adaptive model) input model internal parameters influencing the behavior of the model must be adjusted

tune the parameters of the model using the training set • training set is a data set of N digits {x1, …, xN} • the categories of the digits in the training set are known in advance (inspected manually), they form a target vectort for each digit (one target vector for each digit) 0 1 2 3 4 5 6 7 8 9 0 0 0 1 0 0 0 0 0 0

The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. ... x1 x2 x3 x4 x5 x6 x784 y(x) vector y

The precise form of the function y(x) is determined during the training phase (learning phase). • Model adapts its parameters (i.e. learns) on the basis of the training data {x1, …, xN}. • Trained model can then determine the identity of new, previously unseen, digit images which are said to comprise a test set. • The ability to categorize correctly new examples that differ from those used for training is known as generalization.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. x1 x2 x3 x4 x5 x6 x784 ... Preprocessing y(x) vector y

Translate and scale the images of the digits so that each digit is contained within a box of a fixed size. • This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same. • This pre-processing stage is sometimes also called feature extraction. • Test data must be pre-processed using the same steps as the training data.

Feature selection and feature extraction x1 x2 x3 x4 x5 x6 x784 x1 x2 x3 x4 x5 x6 x784 ... ... selection extraction x1 x5 x103 x456 x*1 x*2 x*3 x*4 x*5 x*6 x*784 ... x*18 x*152 x*309 x*666

Dimensionality reduction • We want to reduce number of dimensions because: • Efficiency • measurement costs • storage costs • computation costs • Problem may be solved more easily in the new space • Improved classification performance • Ease of interpretation

Curse of dimensionality Bishop, Pattern Recognition and Machine Learning

Supervised learning • training data comprises examples of the input vectors along with their corresponding target vectors (e.g. digit recognition) • classification – aim: assign an input vector to one of a finite number of discrete categories • regression (data/curve fitting) - desired output consists of one or more continuous variables

Unsupervised learning • training data consists of a set of input vectors x without any corresponding target value • goals: • discover groups of similar examples within the data – clustering • project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization

Polynomial curve fitting • regression problem, supervised • we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t • artificial example - sin(2πxn) + random noise • training set: N observations of x written as x = (x1, … , xN)T + corresponding observations of the values of t: t = (t1, … , tN)T

sin(2πxn) + random noise N = 10 x1 -> t1 x2 -> t2 etc. training data set {x, t} adapted from Bishop, Pattern Recognition and Machine Learning

goal: exploit the training set in order to make prediction of the target variable t’ for new value x’ of the input variable • this is generally difficult, as • we have to generalize from the finite data set • data are corrupted by the noise, so for the given x’ there is uncertainty in the value of t’ t’ x’ adapted from Bishop, Pattern Recognition and Machine Learning

decision: which method to use? • From the plethora of possibilities (you do not know about yet) I chose a simple one – data will be fitted using this polynomial function • polynomial coefficients w0, …, wM form a vector w • they represent parameters of the model that must be set in the training phase • the polynomial model is still linear regression!!!

The values of coefficients will be determined by minimizing the error function • It measures the misfit between the function y(x, w) and the training set data points • one simple choice: the sum of squared errors -SSE between the predictions y(xn, w) for each data point xn and the correspoding values tn

displacement of the data point tn from the function y(xn, w) fitted function Bishop, Pattern Recognition and Machine Learning

solving curve fitting problem means choosing the value of w for which E(w) is as small as possible … w*→y(x, w*) • as small as possible means to find a minimum of E(w) (i.e. its derivatives)

Pattern recognition – basic concepts