This article focuses on the decision surfaces of linear machines for classification, including their limitations and training methods. It also discusses non-linear SVMs and the use of kernels.
Decision surfaces
We focus on the decision surfaces. Linear machines = linear decision surface: a non-optimal but tractable model.
Linear discriminant function
Two-category classifier: choose ω1 if g(x) > 0, else choose ω2 if g(x) < 0. If g(x) = 0 the decision is undefined. g(x) = 0 defines the decision surface. Linear machine = linear discriminant function: g(x) = wᵀx + w0, where w is the weight vector and w0 is a constant bias.
More than 2 categories
c linear discriminant functions: ωi is predicted if gi(x) > gj(x) for all j ≠ i; i.e. the pairwise decision surfaces define the decision regions. A small sketch of both decision rules follows.
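To make the two decision rules concrete, here is a minimal NumPy sketch (the weight values and the sample x are made-up illustrations, not taken from the slides):

import numpy as np

# Two-category case: g(x) = w^T x + w0; choose omega_1 if g(x) > 0, else omega_2
w, w0 = np.array([1.0, -2.0]), 0.5                      # illustrative weights and bias
x = np.array([3.0, 1.0])
g = w @ x + w0
label = 1 if g > 0 else 2                                # g == 0 would be undefined

# c-category case: one discriminant g_i per class, predict the arg max
W = np.array([[1.0, -2.0], [0.0, 1.0], [-1.0, 0.5]])     # rows: weight vectors of c = 3 classes
W0 = np.array([0.5, 0.0, -0.2])                          # biases of the 3 classes
scores = W @ x + W0                                       # g_i(x) for every class
predicted = int(np.argmax(scores)) + 1                   # omega_i with the largest g_i
print(label, predicted)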
Expression power of linear machines
It can be proved that linear machines can only define convex decision regions, i.e. concave regions cannot be learnt. Moreover, the true decision boundaries may be higher-order surfaces (like ellipsoids), which a linear machine cannot represent…
Training linear machines
Searching for the values of w that separate the classes. Usually a goodness function is utilised as the objective function (e.g. the perceptron criterion Jp(a) introduced below).
Two categories – normalisation
If yi belongs to ω2, replace yi by −yi, then search for an a for which aᵀyi > 0 for every i (normalised version). There isn't any unique solution.
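A minimal sketch of this normalisation step (the augmentation with a leading 1 for the bias term is assumed, as is the convention that the labels are given as 1 and 2):

import numpy as np

def normalise(X, labels):
    # y_i = (1, x_i): augment with a constant 1 so the bias is part of a
    Y = np.hstack([np.ones((len(X), 1)), X])
    # replace y_i by -y_i for the omega_2 samples, so a^T y_i > 0 means "correct" for every i
    Y[labels == 2] *= -1
    return Y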
Iterative optimisation
The solution minimises J(a). Iterative improvement of J(a): a(k+1) = a(k) + η(k) · (step direction), where η(k) is the learning rate.
Gradient descent
The step direction is the negative gradient: a(k+1) = a(k) − η(k) ∇J(a(k)). The learning rate is a function of k, i.e. it describes a cooling strategy.
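A generic gradient-descent sketch; the stopping rule and the cooling schedule η(k) = η0 / (1 + k) are illustrative assumptions, not the ones used in the lecture:

import numpy as np

def gradient_descent(grad_J, a0, eta0=1.0, max_iter=1000, tol=1e-6):
    a = np.asarray(a0, dtype=float)
    for k in range(max_iter):
        eta = eta0 / (1.0 + k)              # learning rate as a function of k ("cooling")
        step = eta * grad_J(a)
        a = a - step                         # a(k+1) = a(k) - eta(k) * grad J(a(k))
        if np.linalg.norm(step) < tol:       # stop when the update becomes negligible
            break
    return a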
Learning rate?
Perceptron rule
Y(a): the set of training samples misclassified by a. The perceptron criterion is Jp(a) = Σ_{y ∈ Y(a)} (−aᵀy). If Y(a) is empty, Jp(a) = 0; else Jp(a) > 0.
Perceptron rule
Using Jp(a) in the gradient descent:
∇Jp(a) = Σ_{y ∈ Y(a)} (−y), so the batch update is a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y, where Y(a(k)) is the set of training samples misclassified by a(k).
Perceptron convergence theorem: if the training dataset is linearly separable, the batch perceptron algorithm finds a solution in finitely many steps.
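A sketch of the batch perceptron update that follows from Jp(a); it assumes the samples are already normalised as above (class-ω2 rows negated) and keeps η constant for simplicity:

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]                   # the set Y(a)
        if len(misclassified) == 0:                      # Jp(a) = 0: every sample correct
            break
        a = a + eta * misclassified.sum(axis=0)          # a <- a + eta * sum of misclassified y
    return a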
With η(k) = 1 and updating on one example at a time we get online learning. Stochastic gradient descent: estimate the gradient based on a few training examples.
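The corresponding single-sample (online) variant with η(k) = 1, again assuming normalised samples; a sketch:

import numpy as np

def online_perceptron(Y, max_epochs=100):
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for y in Y:                      # the training set can be streamed
            if a @ y <= 0:               # misclassified by the current a
                a = a + y                # a <- a + y   (eta = 1)
                mistakes += 1
        if mistakes == 0:                # a full pass without errors: done
            break
    return a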
Online vs offline learning
• Online learning algorithms: the model is updated by each training instance (or by a small batch).
• Offline learning algorithms: the training dataset is processed as a whole.
• Advantages of online learning:
 - Update is straightforward
 - The training dataset can be streamed
 - Implicit adaptation
• Disadvantages of online learning:
 - Its accuracy might be lower
Margin: the gap around the decision surface. It is defined by the training instances closest to the decision surface (the support vectors).
Support Vector Machine (SVM)
SVM is a linear machine whose objective function incorporates the maximisation of the margin! This provides generalisation ability.
SVM: linearly separable case
Linear SVM: linearly separable case
Training database: {(xt, yt)}, yt ∈ {−1, +1}. Searching for w, b s.t. wᵀxt + b ≥ +1 when yt = +1 and wᵀxt + b ≤ −1 when yt = −1, or equivalently yt(wᵀxt + b) ≥ 1 for every t.
Linear SVM: linearly separable case
Denote the size of the margin by ρ. For a separating hyperplane with yt(wᵀxt + b) ≥ 1, the margin is ρ = 2/||w||. We prefer a unique solution: argmax ρ = argmin ½||w||².
Linear SVM: linearly separable case
Convex quadratic optimisation problem…
Linear SVM: linearly separable case
The form of the solution: w = Σt αt yt xt, a weighted average of the training instances; xt is a support vector iff αt > 0, so only the support vectors count (and b can be computed from any such t).
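A sketch of the linearly separable case using scikit-learn (assumed to be available); a very large C approximates the hard margin, and the fitted model exposes the support vectors and the dual weights αt·yt, from which w can be reassembled:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])   # toy, linearly separable data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # huge C ~ (almost) hard margin
sv = clf.support_vectors_                       # the x_t with alpha_t > 0
w = clf.dual_coef_ @ sv                         # w = sum_t alpha_t y_t x_t
b = clf.intercept_
print(sv, w, b)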
SVM: not linearly separable case
Linear SVM: not linearly separable case
The slack variable ξ enables incorrect classifications ("soft margin"): ξt = 0 if the classification is correct, else it is the distance from the margin. C is a metaparameter for the trade-off between the margin size and the incorrect classifications.
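A soft-margin sketch (scikit-learn again assumed): the slack of each training point can be recovered as ξt = max(0, 1 − yt·g(xt)); the toy data and the value of C are illustrative:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])          # (1, 1) labelled +1 lies between two -1 points: not separable

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # moderate C tolerates some violations
g = clf.decision_function(X)                    # w^T x + b for every training point
xi = np.maximum(0.0, 1.0 - y * g)               # slack: 0 for points classified correctly outside the margin
print(xi)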
SVM: non-linear case
Generalised linear discriminant functions
E.g. a quadratic decision surface. Generalised linear discriminant functions: g(x) = Σi ai yi(x), where the yi: Rd → R are arbitrary functions. g(x) is not linear in x, but it is linear in the yi (it is a hyperplane in the y-space).
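A 1-D sketch of the idea: mapping x to y(x) = (1, x, x²) turns a quadratic discriminant in x into a linear one (a hyperplane) in y-space; the coefficient vector a is an illustrative choice:

import numpy as np

def quad_features(x):
    return np.array([1.0, x, x * x])       # y(x) = (1, x, x^2)

a = np.array([-1.0, 0.0, 1.0])              # g(x) = x^2 - 1: quadratic in x ...
for x in (-2.0, 0.0, 2.0):
    g = a @ quad_features(x)                 # ... but linear in y-space
    print(x, g, 1 if g > 0 else 2)           # classifies |x| > 1 as omega_1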
Non-linear SVM
Φ is a mapping into a higher-dimensional (k-dimensional) space. For any dataset there exists a mapping into a higher-dimensional space such that the dataset becomes linearly separable in the new space.
The kernel trick
g(x) = Σt αt yt Φ(xt)ᵀΦ(x) + b = Σt αt yt K(xt, x) + b. The calculation of the mapping into the high-dimensional space can be omitted if the kernel K(xt, x) of xt to x can be computed.
Example: polynomial kernel
K(x, y) = (xᵀy)^p. With d = 256 (original dimensions) and p = 4, the high-dimensional space has h = 183,181,376 dimensions. On the other hand, K(x, y) is known and feasible to calculate, while the inner product in the high-dimensional space is not.
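A quick check of the numbers on this slide, assuming the count refers to the degree-p monomials of d variables, i.e. C(d + p − 1, p) dimensions, while K(x, y) = (xᵀy)^p only needs a d-dimensional inner product:

import numpy as np
from math import comb

d, p = 256, 4
h = comb(d + p - 1, p)        # number of degree-4 monomials of 256 variables
print(h)                       # 183181376, matching the figure above

x, y = np.random.rand(d), np.random.rand(d)
K = (x @ y) ** p               # polynomial kernel: O(d) work, no explicit mapping needed
print(K)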
Kernels in practice
• No rule of thumb for selecting the appropriate kernel
The XOR example
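The classic XOR configuration makes the point of this example concrete: the four XOR points cannot all be classified correctly by any linear machine, but a degree-2 polynomial kernel separates them. A sketch using scikit-learn (assumed available; the hyperparameters are illustrative):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])                     # XOR labelling

linear = SVC(kernel="linear", C=1e6).fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=1, C=1e6).fit(X, y)

print(linear.predict(X))   # no linear machine can get all four right
print(poly.predict(X))     # expected [ 1 -1 -1  1 ]: separable after the implicit quadratic mapping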