Explore the mathematical classification techniques used in data mining, focusing on neural networks and multi-layer perceptron. Learn about binary classification problems, network topology, variable encodings, and training processes. Discover how neural networks offer high prediction accuracy but may have long training times and challenges in interpretation. Gain insights into handling continuous and ordinal variables and the significance of time series variables in stock market prediction.
Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 2016/2017 Summer
Classification, Mathematically • Classification: • predicts categorical class labels • Typical applications • {credit history, salary} -> credit approval (Yes/No) • {Temp, Humidity} -> Rain (Yes/No)
Linear Classification • Binary classification problem • The data above the red line belongs to class 'x' • The data below the red line belongs to class 'o' • Examples: SVM, Perceptron, Probabilistic Classifiers • [Figure: scatter plot with 'x' points above and 'o' points below the separating red line]
Neural Networks • Analogy to biological systems (indeed a great example of a good learning system) • Massive parallelism allowing for computational efficiency • The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
Neural Networks • Advantages • prediction accuracy is generally high • robust, works when training examples contain errors • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes • fast evaluation of the learned target function • Criticism • long training time • difficult to understand the learned function (weights) • not easy to incorporate domain knowledge
Network Topology • number of input variables • number of hidden layers • # of nodes in each hidden layer • # of output nodes • can handle discrete or continuous variables • continuous variables are normalised to the 0..1 interval • for a discrete variable with k levels • use k inputs, one per level • likewise use k output nodes if the output has k > 2 levels • Ex: A has three distinct values a1, a2, a3 • three input variables I1, I2, I3; when A = a1, I1 = 1 and I2 = I3 = 0 • feed-forward: no cycles back to the input units • fully connected: each unit is connected to every unit in the next layer
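A minimal sketch of such a topology in code, assuming NumPy; the layer sizes (3 one-hot inputs for A, 4 hidden nodes, 1 output) and names like n_hidden are illustrative choices, not taken from the slides.

```python
import numpy as np

# Illustrative fully connected, feed-forward topology: 3-4-1.
n_input, n_hidden, n_output = 3, 4, 1

rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.5, 0.5, size=(n_input, n_hidden))   # input -> hidden weights
b_hidden = rng.uniform(-0.5, 0.5, size=n_hidden)
W_output = rng.uniform(-0.5, 0.5, size=(n_hidden, n_output))  # hidden -> output weights
b_output = rng.uniform(-0.5, 0.5, size=n_output)

# Discrete variable A with levels a1, a2, a3 encoded as three inputs:
# A = a1  ->  I1 = 1, I2 = I3 = 0
x = np.array([1.0, 0.0, 0.0])

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

hidden = sigmoid(x @ W_hidden + b_hidden)     # forward pass, layer by layer
output = sigmoid(hidden @ W_output + b_output)
print(output)
```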
Multi-Layer Perceptron • [Figure: the input vector xi feeds the input nodes, which connect through weights wij to the hidden nodes; the hidden nodes feed the output nodes, which produce the output vector]
Example: Sample iterations • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99) • learning rate is 1 for simplicity • I1 = I2 = 1 • T = 0 is the true output • sample presented: I1 = 1.0, I2 = 1.0, T = 0, predicted output P = 0.63
[Figure 4.7: a 2-2-1 network for XOR; inputs I1 = 1 and I2 = 1 feed hidden nodes 3 and 4, whose outputs are O3 = 0.65 and O4 = 0.48; the hidden nodes feed output node 5, whose output is O5 = 0.63; the connection weights shown are 0.1, 0.5, 0.3, -0.2, -0.4 and 0.4]
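A minimal sketch of how each node's output in such a network is computed: a weighted sum of its inputs passed through the logistic activation. The weight-to-edge assignment below is only one plausible reading of the figure, so treat it as illustrative.

```python
import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

def node_output(inputs, weights, bias=0.0):
    # Weighted sum of the inputs, then the logistic activation.
    net = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(net)

I1, I2 = 1.0, 1.0
# If hidden node 3 receives the two inputs through weights 0.1 and 0.5
# (an assumed reading of the figure), its output matches O3 = 0.65:
O3 = node_output([I1, I2], [0.1, 0.5])
print(round(O3, 2))   # 0.65
```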
Variable Encodings • Continuous variables • Ex: • dollar amounts • averages: average sales, volume • ratios: income to debt, payment to loan • physical measures: area, temperature... • Transfer to • 0 to 1, or 0.1 to 0.9 • -1.0 to +1.0, or -0.9 to 0.9 • z scores: z = (x - mean(x)) / stddev(x)
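A minimal sketch of these rescalings, assuming NumPy; the column of dollar amounts is invented purely for illustration.

```python
import numpy as np

x = np.array([120.0, 450.0, 800.0, 230.0, 615.0])   # illustrative dollar amounts

# Min-max scaling to the 0..1 interval
x_01 = (x - x.min()) / (x.max() - x.min())

# The same idea squeezed into 0.1..0.9, leaving headroom for new values
x_09 = 0.1 + 0.8 * x_01

# z-score standardisation: z = (x - mean(x)) / stddev(x)
z = (x - x.mean()) / x.std()

print(x_01.round(2), x_09.round(2), z.round(2), sep="\n")
```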
Continuous variables • When a new observation arrives • it may be out of range • What to do: • plan for a larger range • reject out-of-range values • peg values lower than the minimum to the bottom of the range • and values higher than the maximum to the top of the range
Ordinal variables • Discrete integers • Ex: • age ranges: young, mid, old • income: low, mid, high • number of children • Transfer to the 0-1 interval • Ex: 5 categories of age • 1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old • mapped onto 0 to 1
Thermometer coding • 0 -> 0 0 0 0 -> 0/16 = 0 • 1 -> 1 0 0 0 -> 8/16 = 0.5 • 2 -> 1 1 0 0 -> 12/16 = 0.75 • 3 -> 1 1 1 0 -> 14/16 = 0.875 • Useful for academic grades or bond ratings • when a difference on one side of the scale is more important than the same difference on the other side
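A minimal sketch of this coding with four bits, as in the table above; the function name is made up for illustration.

```python
def thermometer_code(level, n_bits=4):
    # Set one bit per step up the scale: level 2 with 4 bits -> [1, 1, 0, 0].
    bits = [1 if i < level else 0 for i in range(n_bits)]
    # Read the bits as a binary fraction, e.g. [1, 1, 0, 0] -> 12/16 = 0.75.
    value = sum(b * 2 ** (n_bits - 1 - i) for i, b in enumerate(bits)) / 2 ** n_bits
    return bits, value

for level in range(4):
    print(level, *thermometer_code(level))
```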
Nominal Variables • Ex: • gender, marital status, occupation • 1- treat them like ordinary numeric variables • Ex: marital status, 5 codes: • single, divorced, married, widowed, unknown • mapped to -1, -0.5, 0, 0.5, 1 • the network then treats them as ordinal • even though the order does not make sense
2- break into flags • one variable for each category • 1-of-N coding • Ex: gender has three values: male, female, unknown • male -> 1 -1 -1 • female -> -1 1 -1 • unknown -> -1 -1 1
1-of-(N-1) coding • male -> 1 -1 • female -> -1 1 • unknown -> -1 -1 • 3- replace the variable with a numerical one
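A minimal sketch of the two flag codings, using the -1/+1 convention from the tables above; the helper names are made up for illustration.

```python
def one_of_n(value, categories):
    # One flag per category: +1 for the matching category, -1 elsewhere.
    return [1 if value == c else -1 for c in categories]

def one_of_n_minus_1(value, categories):
    # Drop the last category's flag; that category is represented by "all -1".
    return one_of_n(value, categories)[:-1]

genders = ["male", "female", "unknown"]
print(one_of_n("female", genders))            # [-1, 1, -1]
print(one_of_n_minus_1("unknown", genders))   # [-1, -1]
```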
Time Series variables • Stock market prediction • Output: IMKB100 at t • Inputs: • IMKB100 at t-1, t-2, t-3... • dollar rate at t-1, t-2, t-3... • interest rate at t-1, t-2, t-3... • day-of-week variables • as nominal flags: Monday = 1 0 0 0 0, ..., Friday = 0 0 0 0 1 • or as an ordinal scale: Monday to Friday mapped onto -1 to 1 or 0 to 1
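A minimal sketch of turning a price series into lagged inputs plus a target, assuming pandas; the series values and the choice of three lags are invented for illustration.

```python
import pandas as pd

# Illustrative index levels; a real model would add dollar-rate, interest-rate
# and day-of-week columns built in the same way.
prices = pd.Series([100.0, 102.0, 101.5, 103.0, 104.2, 103.8, 105.1], name="imkb100")

n_lags = 3
frame = pd.DataFrame({"target_t": prices})
for k in range(1, n_lags + 1):
    frame[f"imkb100_t-{k}"] = prices.shift(k)   # the value k steps in the past

frame = frame.dropna()   # the first n_lags rows have no full history
print(frame)
```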
A Neuron • [Figure: inputs x0 ... xn with weights w0 ... wn feed a weighted sum, a bias μk is subtracted, and the result passes through the activation function f to give the output y] • The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σi wi xi - μk)
Network Training • The ultimate objective of training: • obtain a set of weights that makes almost all the tuples in the training data classified correctly • Steps • initialize the weights with random values • repeat until the classification error is lower than a threshold (each pass is an epoch): • feed the input tuples into the network one by one • for each unit • compute the net input to the unit as a linear combination of all the inputs to the unit • compute the output value using the activation function • compute the error • update the weights and the bias
Example: Stock market prediction • input variables: • individual stock prices at t-1, t-2, t-3, ... • stock index at t-1, t-2, t-3, ... • inflation rate, interest rate, exchange rates ($) • output variable: • predicted stock price at the next time step • train the network with known cases • adjust the weights • experiment with different topologies • test the network • use the tested network for predicting unknown stock prices
Other business Applications (1) • Marketing and sales • Prediction • Sales forecasting • Price elasticity forecasting • Customer response • Classification • Target marketing • Customer satisfaction • Loyalty and retention • Clustering • Segmentation
Other business Applications (2) • Risk Management • Prediction • Credit scoring • Financial health • Classification • Bankruptcy classification • Fraud detection • Credit scoring • Clustering • Credit scoring • Risk assessment
Other business Applications (3) • Finance • Prediction • Hedging • Future prediction • Forex and stock prediction • Classification • Stock trend classification • Bond rating • Clustering • Economic rating • Mutual fund selection
Perceptrons • WK p. 91, sec 4.2 • N inputs Ii, i: 1..N • single output O • two classes C0 and C1 denoted by 0 and 1 • one node • output: • O = 1 if w1I1 + w2I2 + ... + wNIN + w0 > 0 • O = 0 if w1I1 + w2I2 + ... + wNIN + w0 < 0 • sometimes θ is used for the constant term w0 • called the bias or threshold in ANN
Artificial Neural Nets: Perceptron • [Figure: inputs x0 = +1, x1, ..., xd with weights w0, w1, ..., wd feed a unit g that produces the output y]
Perceptron training procedure (rule) (1) • Find weights w that separate each training sample correctly • initial weights are randomly chosen • weight updating • samples are presented in sequence • after presenting each case the weights are updated: • wi(t+1) = wi(t) + Δwi(t) • θ(t+1) = θ(t) + Δθ(t) • Δwi(t) = η(T - O)Ii • Δθ(t) = η(T - O) • O: output of the perceptron, T: true output for the case, η: learning rate, 0 < η < 1, usually around 0.1
Perceptron training procedure (rule) (2) • each case is presented and • the weights are updated • after presenting all cases, if • the error is not zero • then present all cases once more • each such cycle is called an epoch • continue until the error is zero, which is reached for perfectly separable samples
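A minimal sketch of this rule, assuming NumPy; the tiny linearly separable data set and the starting weights are invented for illustration.

```python
import numpy as np

def train_perceptron(X, T, eta=0.1, max_epochs=100):
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, X.shape[1])    # random initial weights
    theta = rng.uniform(-0.5, 0.5)            # bias / threshold term
    for epoch in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):                # present the cases in sequence
            o = 1 if w @ x + theta > 0 else 0
            if o != t:
                w += eta * (t - o) * x        # delta w_i = eta (T - O) I_i
                theta += eta * (t - o)        # delta theta = eta (T - O)
                errors += 1
        if errors == 0:                       # an epoch with no errors: converged
            break
    return w, theta

# Invented, linearly separable toy sample (class 1 where I1 + I2 is large).
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.8, 0.9], [0.7, 0.8]])
T = np.array([0, 0, 1, 1])
print(train_perceptron(X, T))
```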
Perceptron convergence theorem: • if the sample is linearly separable, the perceptron will eventually converge: it separates all of the sample correctly • error = 0 • the learning rate can even be one • to increase stability it is gradually decreased • this slows down the convergence • linearly separable: a line or hyperplane can separate all of the sample correctly
If classes are not perfectly linearly separable • if a plane or line can not separate classes completely • The procedure will not converge and will keep on cycling through the data forever
[Figure: two scatter plots of 'x' and 'o' points: on the left a linearly separable sample, on the right a sample that is not linearly separable]
Example calculations • Two inputs, w1 = 0.25, w2 = 0.5, w0 or θ = -0.5 • Suppose I1 = 1.5, I2 = 0.5 • learning rate η = 0.1 • and T = 0 is the true output • the perceptron separates this as: • 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1 • w1(t+1) = 0.25 + 0.1(0-1)1.5 = 0.1 • w2(t+1) = 0.5 + 0.1(0-1)0.5 = 0.45 • θ(t+1) = -0.5 + 0.1(0-1) = -0.6 • with the new weights: • O = 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0 • no error
[Figure: decision boundaries in the (I1, I2) plane; before the update the line 0.25*I1 + 0.5*I2 - 0.5 = 0 leaves the point on the class-1 side (true class is 0 but it is classified as class 1); after the update the line 0.1*I1 + 0.45*I2 - 0.6 = 0 puts it on the class-0 side (true class is 0 and it is classified as class 0)]
XOR: the exclusive OR problem • Two inputs I1, I2 • when both agree • I1 = 0 and I2 = 0, or I1 = 1 and I2 = 1 • class 0, O = 0 • when both disagree • I1 = 0 and I2 = 1, or I1 = 1 and I2 = 0 • class 1, O = 1 • one line cannot solve XOR • but two lines can
[Figure: the four XOR points in the (I1, I2) plane; a single line cannot separate the two classes]
Multi-layer networks • Study section 4.3 in WK • one-layer networks can separate classes with a hyperplane • two-layer networks can separate any convex region • and three-layer networks can separate any non-convex boundary • examples: see notes
[Figure: ANN for classification; inputs x0 = +1, x1, ..., xd connect through weights (e.g. wKd) to the output units o1, o2, ..., oK]
[Figure: points in the (I1, I2) plane, class O inside the triangle ABC and class + outside; hidden nodes a, b, c and output node d form a two-layer network] • inside the triangle ABC is class O, outside the triangle is class + • class = O if I1 + I2 >= 10, I1 <= I2 and I2 <= 10 • output of hidden node a: 1 if class O, i.e. w11I1 + w12I2 + w10 >= 0; 0 if class +, i.e. w11I1 + w12I2 + w10 < 0 • so the w1i's are w11 = 1, w12 = 1 and w10 = -10
output of hidden node b: 1 if class O, i.e. w21I1 + w22I2 + w20 >= 0; 0 if class +, i.e. w21I1 + w22I2 + w20 < 0 • so the w2i's are w21 = -1, w22 = 1 and w20 = 0 • output of hidden node c: 1 if w31I1 + w32I2 + w30 >= 0; 0 if w31I1 + w32I2 + w30 < 0 • so the w3i's are w31 = 0, w32 = -1 and w30 = 10
an object is class O if all hidden units predict it as class O • the output is 1 if w'aHa + w'bHb + w'cHc + wd >= 0 • the output is 0 if w'aHa + w'bHb + w'cHc + wd < 0 • weights of output node d: w'a = 1, w'b = 1, w'c = 1 and wd = -3 + x, where x is a small positive number
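A minimal sketch of this hand-built network with threshold units, using the weights just derived; the small constant x is set to 0.5 here purely for illustration.

```python
def step(net):
    # Threshold activation: fires (1) when the weighted sum is non-negative.
    return 1 if net >= 0 else 0

def classify(i1, i2, x=0.5):
    ha = step(1 * i1 + 1 * i2 - 10)    # node a: I1 + I2 >= 10
    hb = step(-1 * i1 + 1 * i2 + 0)    # node b: I1 <= I2
    hc = step(0 * i1 - 1 * i2 + 10)    # node c: I2 <= 10
    # Output node d fires only when all three hidden units fire (an AND).
    return "O" if step(1 * ha + 1 * hb + 1 * hc - 3 + x) else "+"

print(classify(4, 8))    # inside the triangle  -> O
print(classify(2, 3))    # outside (I1 + I2 < 10) -> +
```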
[Figure: the non-convex region ADBC in the (I1, I2) plane; a first hidden layer separates the two triangles and a second hidden layer combines them] • ADBC is the union of two convex regions, in this case triangles • each triangular region can be separated by a two-layer network • two hidden layers can separate any non-convex region • node d separates ABC, node e separates ADB • ADBC is the union of ABC and ADB • the output is class O if w''f0 + w''f1He + w''f2Hf >= 0 • with w''f0 = -0.99, w''f1 = 1, w''f2 = 1
In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region • if it is perfectly separable • by adding a second hidden layer and OR-ing the convex regions, any non-convex boundary can be separated • if it is perfectly separable • the weights are not known in advance but are found by training the network
For prediction problems • any function can be approximated with a one-hidden-layer network • [Figure: a smooth curve of Y against X approximated by the network]
Back propagation algorithm • LMS uses a linear activation function • not so useful • the threshold activation function is very good at separating but is not differentiable • back propagation uses the logistic function • O = 1/(1 + exp(-N)) = (1 + exp(-N))^-1 • N = w1I1 + w2I2 + ... + wnIn + θ • the derivative of the logistic function • dO/dN = O*(1 - O), expressed as a function of the output, where O = 1/(1 + exp(-N)), 0 <= O <= 1
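A minimal check of that derivative identity, comparing O*(1 - O) with a finite-difference estimate at an arbitrary net input.

```python
import math

def logistic(n):
    return 1.0 / (1.0 + math.exp(-n))

N = 0.6                        # arbitrary net input
O = logistic(N)
analytic = O * (1 - O)         # dO/dN written in terms of the output
h = 1e-6
numeric = (logistic(N + h) - logistic(N - h)) / (2 * h)
print(round(analytic, 6), round(numeric, 6))   # the two values agree
```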
Minimize total error again • E = (1/2) Σd Σk (Tk,d - Ok,d)^2, summing d over the N cases and k over the M output units • where N is the number of cases • M is the number of output units • Tk,d: true value of sample d at output unit k • Ok,d: predicted value of sample d at output unit k • the algorithm updates the weights by a method similar to the delta rule • for each output unit • Δwij = η Σd Od(1-Od)(Td-Od)Ii,d, or • Δwij(t) = η O(1-O)(T-O)Ii | when objects are • Δθj(t) = η O(1-O)(T-O) | presented sequentially • here O(1-O)(T-O) = errorj is the error term
so Δwij(t) = η*errorj*Ii, or Δθj(t) = η*errorj • for all training samples • the new weights are • wij(t+1) = wij(t) + Δwij(t) • θj(t+1) = θj(t) + Δθj(t) • but for the hidden-layer weights no target value is available • Δwih(t) = η Oh(1-Oh) (Σk=1..M errork*wkh) Ii • Δθh(t) = η Oh(1-Oh) (Σk=1..M errork*wkh) • the error term of each output unit is weighted by the corresponding weight and summed up to find the error term of hidden unit h • the weight from hidden unit h to output unit k is responsible for part of the error in output unit k
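A minimal sketch of these update rules for a single hidden layer, assuming NumPy. The XOR training set is the natural example here, but the hidden-layer size, learning rate, epoch count and random seed are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([0.0, 1.0, 1.0, 0.0])                            # XOR targets

n_hidden, eta = 3, 0.5
W1 = rng.uniform(-1, 1, (2, n_hidden)); b1 = rng.uniform(-1, 1, n_hidden)
W2 = rng.uniform(-1, 1, n_hidden);      b2 = rng.uniform(-1, 1)

for epoch in range(10000):
    for x, t in zip(X, T):                      # present the cases one by one
        h = sigmoid(x @ W1 + b1)                # hidden outputs
        o = sigmoid(h @ W2 + b2)                # network output
        err_out = o * (1 - o) * (t - o)         # O(1-O)(T-O): output error term
        err_hid = h * (1 - h) * (err_out * W2)  # output error fed back through w_kh
        W2 += eta * err_out * h;  b2 += eta * err_out
        W1 += eta * np.outer(x, err_hid);  b1 += eta * err_hid

outputs = [sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2) for x in X]
print([round(float(o), 2) for o in outputs])    # ideally close to [0, 1, 1, 0]
```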
Example: Sample iterations • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99) • learning rate is 1 for simplicity • I1 = I2 = 1 • T = 0 is the true output
[Figure 4.7: a 2-2-1 network for XOR; inputs I1 = 1 and I2 = 1 feed hidden nodes 3 and 4, whose outputs are O3 = 0.65 and O4 = 0.48; the hidden nodes feed output node 5, whose output is O5 = 0.63; the connection weights shown are 0.1, 0.5, 0.3, -0.2, -0.4 and 0.4]