Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 2013/2014 Summer
Classification • Classification: • predicts categorical class labels • Typical Applications • {credit history, salary} -> credit approval (Yes/No) • {Temp, Humidity} -> Rain (Yes/No)
Linear Classification • Binary Classification problem • The data above the red line belongs to class 'x' • The data below the red line belongs to class 'o' • Examples: SVM, Perceptron, Probabilistic Classifiers • [Figure: a scatter of 'x' points above and 'o' points below a red separating line]
Neural Networks • Analogy to Biological Systems (Indeed a great example of a good learning system) • Massive Parallelism allowing for computational efficiency • The first learning algorithm came in 1959 (Rosenblatt) who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change weights to learn to produce these outputs using the perceptron learning rule
Neural Networks • Advantages • prediction accuracy is generally high • robust, works when training examples contain errors • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes • fast evaluation of the learned target function • Criticism • long training time • difficult to understand the learned function (weights) • not easy to incorporate domain knowledge
Network Topology • Input variables: number of input nodes • number of hidden layers • number of nodes in each hidden layer • number of output nodes • can handle discrete or continuous variables • normalisation of continuous variables to the 0..1 interval • for discrete variables • use k inputs for each level • use k outputs for each level if k > 2 • Ex: A has three distinct values a1, a2, a3 • three input variables I1, I2, I3; when A = a1: I1 = 1, I2 = I3 = 0 • feed-forward: no cycles back to input units • fully connected: each unit connects to each unit in the next forward layer
Multi-Layer Perceptron • [Figure: feed-forward network; the input vector xi feeds the input nodes, which connect through weights wij to hidden nodes and then to output nodes producing the output vector]
Variable Encodings • Continuous variables • Ex: • Dollar amounts • Averages: average sales, volume • Ratios: income to debt, payment to loan • Physical measures: area, temperature, ... • Transfer into • 0 to 1 or 0.1 to 0.9 • -1.0 to +1.0 or -0.9 to +0.9 • z-scores: z = (x - mean_x) / standard_dev_x
Continuous variables • When a new observation comes • it may be out of range • What to do • Plan for a larger range • Reject out-of-range values • Peg values lower than the minimum to the range minimum • and values higher than the maximum to the range maximum
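A minimal sketch (helper names and ranges are my own) of the two transfers above, pegging out-of-range values as described:

```python
def min_max_scale(x, lo, hi, out_lo=0.1, out_hi=0.9):
    """Map x from [lo, hi] onto [out_lo, out_hi], pegging out-of-range values."""
    x = max(lo, min(hi, x))                 # peg values below/above the planned range
    return out_lo + (x - lo) * (out_hi - out_lo) / (hi - lo)

def z_score(x, mean, std):
    """z = (x - mean) / std"""
    return (x - mean) / std

print(min_max_scale(25_000, lo=0, hi=100_000))   # income inside the range -> 0.3
print(min_max_scale(150_000, lo=0, hi=100_000))  # out of range -> pegged to 0.9
print(z_score(25_000, mean=40_000, std=15_000))  # -> -1.0
```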
Ordinal variables • Discrete integers • Ex: • Age ranges: young, mid, old • Income: low, mid, high • Number of children • Transfer to the 0-1 interval • Ex: 5 categories of age • 1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old • map onto 0 to 1
Thermometer coding • 0 → 0 0 0 0 → 0/16 = 0 • 1 → 1 0 0 0 → 8/16 = 0.5 • 2 → 1 1 0 0 → 12/16 = 0.75 • 3 → 1 1 1 0 → 14/16 = 0.875 • Useful for academic grades or bond ratings • when a difference on one side of the scale is more important than on the other side of the scale
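An illustrative sketch (mine) that reproduces the codes above; each successive level switches on one more bit, and the bits are worth 8/16, 4/16, 2/16 and 1/16:

```python
def thermometer_code(level, n_bits=4):
    """Encode an ordinal level 0..n_bits as a thermometer bit pattern and its value."""
    bits = [1 if i < level else 0 for i in range(n_bits)]
    # bit i is worth 2**(n_bits - 1 - i) sixteenths: 8/16, 4/16, 2/16, 1/16
    value = sum(b * 2 ** (n_bits - 1 - i) for i, b in enumerate(bits)) / 2 ** n_bits
    return bits, value

for level in range(4):
    print(level, *thermometer_code(level))
# 0 [0, 0, 0, 0] 0.0
# 1 [1, 0, 0, 0] 0.5
# 2 [1, 1, 0, 0] 0.75
# 3 [1, 1, 1, 0] 0.875
```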
Nominal Variables • Ex: • Gender, marital status, occupation • 1- treat like ordinary variables • Ex: marital status, 5 codes: • single, divorced, married, widowed, unknown • mapped to -1, -0.5, 0, 0.5, 1 • the network treats them as ordinal • even though the order does not make sense
2- break into flags • One variable for each category • 1-of-N coding • Gender has three values: • male, female, unknown • Male: 1 -1 -1 • Female: -1 1 -1 • Unknown: -1 -1 1
1-of-(N-1) coding • Male: 1 -1 • Female: -1 1 • Unknown: -1 -1 • 3- replace the variable with a numerical one
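A short sketch (helper names are mine) of the 1-of-N and 1-of-(N-1) flag encodings with the -1/+1 coding shown above:

```python
def one_of_n(value, categories):
    """1-of-N coding: one +1 flag per category, -1 elsewhere."""
    return [1 if value == c else -1 for c in categories]

def one_of_n_minus_1(value, categories):
    """1-of-(N-1) coding: drop the last category; it is encoded as all -1."""
    return [1 if value == c else -1 for c in categories[:-1]]

categories = ["male", "female", "unknown"]
print(one_of_n("female", categories))          # [-1, 1, -1]
print(one_of_n_minus_1("unknown", categories)) # [-1, -1]
```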
Time Series variables • Stock market prediction • Output: IMKB100 at t • Inputs: • IMKB100 at t-1, t-2, t-3, ... • Dollar at t-1, t-2, t-3, ... • Interest rate at t-1, t-2, t-3, ... • Day-of-week variables • as nominal flags: Monday 1 0 0 0 0, ..., Friday 0 0 0 0 1 • as ordinal: map Monday to Friday • onto -1 to 1 or 0 to 1
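A minimal sketch (function name and window size are my own choices) of turning a single price series into the lagged input/target pairs listed above:

```python
def make_lagged_examples(series, n_lags=3):
    """Build (inputs, target) pairs: predict series[t] from series[t-1..t-n_lags]."""
    examples = []
    for t in range(n_lags, len(series)):
        inputs = [series[t - k] for k in range(1, n_lags + 1)]  # t-1, t-2, t-3
        examples.append((inputs, series[t]))
    return examples

imkb100 = [100.0, 102.5, 101.0, 104.0, 103.5, 106.0]
for x, y in make_lagged_examples(imkb100):
    print(x, "->", y)
# [101.0, 102.5, 100.0] -> 104.0   (and so on)
```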
A Neuron • [Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum Σ and bias -μk, followed by an activation function f that produces the output y] • The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σi wi xi - μk)
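A tiny sketch (illustrative numbers, mine) of that mapping: scalar product, minus the bias μk, through a logistic activation f:

```python
import math

def neuron_output(x, w, mu, f=lambda net: 1 / (1 + math.exp(-net))):
    """y = f(sum_i w_i * x_i - mu): scalar product, bias, then activation f."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - mu
    return f(net)

print(neuron_output(x=[1.5, 0.5], w=[0.25, 0.5], mu=0.5))  # logistic(0.125) ≈ 0.53
```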
Network Training • The ultimate objective of training • obtain a set of weights that makes almost all the tuples in the training data classified correctly • Steps • Initialize weights with random values • repeat until the classification error is lower than a threshold (each pass over the data is an epoch) • Feed the input tuples into the network one by one • For each unit • Compute the net input to the unit as a linear combination of all the inputs to the unit • Compute the output value using the activation function • Compute the error • Update the weights and the bias
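A runnable sketch of these steps (my own toy data; a single logistic unit stands in for the network so the loop fits in a few lines; the multi-layer update rule is the back-propagation rule given later in this chapter):

```python
import math, random

# toy training tuples: ([inputs], target)
data = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.0, 0.9], 1), ([0.9, 0.0], 0)]

random.seed(1)
w = [random.uniform(-0.5, 0.5) for _ in range(2)]       # 1. initialize weights with random values
theta = random.uniform(-0.5, 0.5)
lr, threshold = 1.0, 0.1

for epoch in range(5000):                                # 2. repeat (each pass is one epoch)
    sse = 0.0
    for x, t in data:                                    # feed the tuples one by one
        net = sum(wi * xi for wi, xi in zip(w, x)) + theta   # net input: linear combination
        o = 1 / (1 + math.exp(-net))                     # output value via the activation function
        err = t - o                                      # compute the error
        sse += err ** 2
        w = [wi + lr * err * o * (1 - o) * xi for wi, xi in zip(w, x)]  # update the weights
        theta += lr * err * o * (1 - o)                  # update the bias
    if sse < threshold:                                  # stop once the error is small enough
        break
print(epoch, round(sse, 3), w, theta)
```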
Example: Stock market prediction • input variables: • individual stock prices at t-1, t-2,t-3,... • stock index at t-1, t-2, t-3, • inflation rate, interest rate, exchange rates $ • output variable: • predicted stock price next time • train the network with known cases • adjust weights • experiment with different topologies • test the network • use the tested network for predicting unknown stock prices
Other Business Applications (1) • Marketing and sales • Prediction • Sales forecasting • Price elasticity forecasting • Customer response • Classification • Target marketing • Customer satisfaction • Loyalty and retention • Clustering • Segmentation
Other Business Applications (2) • Risk Management • Prediction • Credit scoring • Financial health • Classification • Bankruptcy classification • Fraud detection • Credit scoring • Clustering • Credit scoring • Risk assessment
Other Business Applications (3) • Finance • Prediction • Hedging • Future prediction • Forex and stock prediction • Classification • Stock trend classification • Bond rating • Clustering • Economic rating • Mutual fund selection
Perceptrons • WK page 91, section 4.2 • N inputs Ii, i: 1..N • single output O • two classes C0 and C1 denoted by 0 and 1 • one node • output: • O = 1 if w1I1 + w2I2 + ... + wnIn + w0 > 0 • O = 0 if w1I1 + w2I2 + ... + wnIn + w0 < 0 • sometimes θ is used for the constant term w0 • called bias or threshold in ANN
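The decision rule as a one-function sketch (mine), with w0 as the bias term:

```python
def perceptron_output(inputs, weights, w0):
    """O = 1 if w1*I1 + ... + wn*In + w0 > 0, else O = 0."""
    net = sum(w * i for w, i in zip(weights, inputs)) + w0
    return 1 if net > 0 else 0

print(perceptron_output([1.5, 0.5], [0.25, 0.5], -0.5))  # net = 0.125 > 0 -> 1
```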
Artificial Neural Nets: Perceptron • [Figure: a single perceptron with inputs x0 = +1, x1, ..., xd, weights w0, w1, ..., wd, a summing unit and activation g producing the output y]
Perceptron training procedure (rule) (1) • Find weights w that separate each training sample correctly • Initial weights randomly chosen • weight updating • samples are presented in sequence • after presenting each case weights are updated: • wi(t+1) = wi(t) + Δwi(t) • θ(t+1) = θ(t) + Δθ(t) • Δwi(t) = η(T - O)Ii • Δθ(t) = η(T - O) • O: output of perceptron, T: true output for each case, η: learning rate, 0 < η < 1, usually around 0.1
Perceptron training procedure (rule) (2) • each case is presented and • weights are updated • after presenting each case, if • the error is not zero • then present all cases once more • each such cycle is called an epoch • until the error is zero, for perfectly separable samples
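A runnable sketch of the rule above (my own toy data; learning rate η = 0.1 as suggested):

```python
def train_perceptron(samples, n_inputs, eta=0.1, max_epochs=100):
    """Cycle through the samples in epochs, applying w_i += eta*(T - O)*I_i until error is zero."""
    w = [0.0] * n_inputs
    theta = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            net = sum(wi * ii for wi, ii in zip(w, inputs)) + theta
            output = 1 if net > 0 else 0
            if output != target:
                errors += 1
                w = [wi + eta * (target - output) * ii for wi, ii in zip(w, inputs)]
                theta += eta * (target - output)
        if errors == 0:                     # converged: every sample classified correctly
            return w, theta, epoch
    return w, theta, max_epochs             # not separable (or more epochs needed)

# the AND function is linearly separable, so the procedure converges
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(samples, n_inputs=2))
```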
Perceptron convergence theorem: • if the sample is linearly separable the perceptron will eventually converge: it separates all the samples correctly • error = 0 • the learning rate can even be one • to increase stability it is gradually decreased, which slows down convergence • linearly separable: a line or hyperplane can separate all the samples correctly
If classes are not perfectly linearly separable • if no plane or line can separate the classes completely • the procedure will not converge and will keep on cycling through the data forever
[Figure: left, a scatter of 'x' and 'o' points that a single line separates (linearly separable); right, a scatter where the classes overlap and no line separates them (not linearly separable)]
Example calculations • Two inputs, w1 = 0.25, w2 = 0.5, w0 or θ = -0.5 • Suppose I1 = 1.5, I2 = 0.5 • learning rate η = 0.1 • and T = 0 true output • the perceptron separates this as: • 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1 • w1(t+1) = 0.25 + 0.1(0-1)1.5 = 0.1 • w2(t+1) = 0.5 + 0.1(0-1)0.5 = 0.45 • θ(t+1) = -0.5 + 0.1(0-1) = -0.6 • with the new weights: • 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0 • no error
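The same arithmetic as a short script (just repeating the numbers above):

```python
eta, target = 0.1, 0
w1, w2, theta = 0.25, 0.5, -0.5
i1, i2 = 1.5, 0.5

net = w1 * i1 + w2 * i2 + theta          # 0.125 > 0 -> O = 1, wrong class
o = 1 if net > 0 else 0
w1 += eta * (target - o) * i1            # 0.25 - 0.15 = 0.10
w2 += eta * (target - o) * i2            # 0.50 - 0.05 = 0.45
theta += eta * (target - o)              # -0.5 - 0.1 = -0.6

net = w1 * i1 + w2 * i2 + theta          # -0.225 < 0 -> O = 0, no error
print(w1, w2, theta, net)
```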
[Figure: before the update the boundary 0.25*I1 + 0.5*I2 - 0.5 = 0 (intercepts I1 = 2, I2 = 1) places the point in class 1 although its true class is 0; after the update the boundary 0.1*I1 + 0.45*I2 - 0.6 = 0 (intercepts I1 = 6, I2 = 1.33) classifies it correctly as class 0]
XOR: exclusive OR problem • Two inputs I1, I2 • when both agree • I1 = 0 and I2 = 0, or I1 = 1 and I2 = 1 • class 0, O = 0 • when they disagree • I1 = 0 and I2 = 1, or I1 = 1 and I2 = 0 • class 1, O = 1 • one line cannot solve XOR • but two lines can
[Figure: XOR in the (I1, I2) plane: (0,0) and (1,1) belong to class 0, (0,1) and (1,0) to class 1; a single line cannot separate these classes]
Multi-layer networks • Study section 4.3 in WK • one-layer networks can separate classes with a hyperplane • two-layer networks can separate any convex region • and three-layer networks can separate any non-convex boundary • Examples: see notes
[Figure: ANN for classification: inputs x0 = +1, x1, x2, ..., xd are fully connected through weights (e.g. wKd) to the output units o1, o2, ..., oK]
Inside the triangle ABC is class O, outside the triangle is class + • class = O if I1 + I2 >= 10, I1 <= I2 and I2 <= 10 • [Figure: the triangle ABC in the (I1, I2) plane with 'o' points inside and '+' points outside; hidden nodes a, b, c feed output node d] • output of hidden node a: 1 (class O side) if w11*I1 + w12*I2 + w10 >= 0, 0 (class + side) if w11*I1 + w12*I2 + w10 < 0 • so the w1i are w11 = 1, w12 = 1 and w10 = -10
output of hidden node b: 1 if w21*I1 + w22*I2 + w20 >= 0, 0 if w21*I1 + w22*I2 + w20 < 0 • so the w2i are w21 = -1, w22 = 1 and w20 = 0 (the I1 <= I2 boundary) • output of hidden node c: 1 if w31*I1 + w32*I2 + w30 >= 0, 0 if w31*I1 + w32*I2 + w30 < 0 • so the w3i are w31 = 0, w32 = -1 and w30 = 10 (the I2 <= 10 boundary) • [Figure: the same triangle ABC with the boundary lines of hidden nodes b and c]
an object is class O if all hidden units predict it as class O • output is 1 if w'aHa + w'bHb + w'cHc + w'd >= 0 • output is 0 if w'aHa + w'bHb + w'cHc + w'd < 0 • weights of output node d: w'a = 1, w'b = 1, w'c = 1, w'd = -3 + x where x is a small positive number • [Figure: hidden nodes a, b, c feeding output node d over the triangle ABC]
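A quick check (test points are my own) that these weights implement the triangle: the output unit fires only when all three hidden units do.

```python
def step(net):
    return 1 if net >= 0 else 0

def classify(i1, i2, x=0.5):
    h_a = step(1 * i1 + 1 * i2 - 10)   # I1 + I2 >= 10
    h_b = step(-1 * i1 + 1 * i2 + 0)   # I1 <= I2
    h_c = step(0 * i1 - 1 * i2 + 10)   # I2 <= 10
    out = step(1 * h_a + 1 * h_b + 1 * h_c - 3 + x)   # fires only when all three are 1
    return "O" if out == 1 else "+"

print(classify(4, 8))   # inside the triangle  -> O
print(classify(2, 3))   # outside the triangle -> +
```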
ADBC is the union of two convex regions, in this case triangles • each triangular region can be separated by a two-layer network • two hidden layers can separate any nonconvex region • d separates ABC, e separates ADB; ADBC is the union of ABC and ADB • output node f is class O if w''f0 + w''f1*Hd + w''f2*He >= 0 • w''f0 = -0.99, w''f1 = 1, w''f2 = 1 • [Figure: the nonconvex region ADBC in the (I1, I2) plane; the first hidden layer feeds nodes d and e in the second hidden layer, which feed the output node f]
In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region • if it is perfectly separable • adding a second hidden layer and OR-ing the convex regions, any nonconvex boundary can be separated • if it is perfectly separable • the weights are unknown but are found by training the network
For prediction problems • Any function can be approximated with a one-hidden-layer network • [Figure: a curve Y versus X approximated by such a network]
Back propagation algorithm • LMS uses a linear activation function • not so useful • the threshold activation function is very good at separating but is not differentiable • back propagation uses the logistic function • O = 1/(1 + exp(-N)) = (1 + exp(-N))^-1 • N = w1I1 + w2I2 + ... + wnIn + θ • the derivative of the logistic function • dO/dN = O*(1 - O), expressed as a function of the output, where O = 1/(1 + exp(-N)), 0 <= O <= 1
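A small numeric check (mine) that dO/dN = O*(1 - O):

```python
import math

def logistic(n):
    return 1 / (1 + math.exp(-n))

n = 0.6
o = logistic(n)
analytic = o * (1 - o)                                        # dO/dN = O(1 - O)
numeric = (logistic(n + 1e-6) - logistic(n - 1e-6)) / 2e-6    # central difference
print(o, analytic, numeric)   # ≈ 0.6457, 0.2288, 0.2288
```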
Minimize total error again • E = (1/2) Σd=1..N Σk=1..M (Tk,d - Ok,d)² • where N is the number of cases • M is the number of output units • Tk,d: true value of sample d at output unit k • Ok,d: predicted value of sample d at output unit k • the algorithm updates weights by a method similar to the delta rule • for each output unit • Δwij = η Σd=1..N Od(1 - Od)(Td - Od)Ii,d, or • Δwij(t) = η O(1 - O)(T - O)Ii and Δθj(t) = η O(1 - O)(T - O) when objects are presented sequentially • here O(1 - O)(T - O) = errorj is the error term
so Δwij(t) = η*errorj*Ii or Δθj(t) = η*errorj • for all training samples • new weights are • wi(t+1) = wi(t) + Δwi(t) • θ(t+1) = θ(t) + Δθ(t) • but for hidden-layer weights no target value is available • Δwij(t) = η Od(1 - Od)(Σk=1..M errork*wkh)Ii • Δθj(t) = η Od(1 - Od)(Σk=1..M errork*wkh) • the error term of each output unit is weighted by its weight and summed up to find the error derivative of the hidden unit • the weight from hidden unit h to output unit k determines how much hidden unit h is responsible for the error at output unit k
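A compact sketch (my own small network and numbers, not the WK figure) of one such update for a 2-2-1 network with logistic units: the output error term is O(1-O)(T-O), and each hidden error term is Oh(1-Oh) times the output error weighted by the hidden-to-output weight.

```python
import math

def logistic(n):
    return 1 / (1 + math.exp(-n))

eta = 1.0
x = [1.0, 1.0]                       # inputs I1, I2
w_ih = [[0.1, 0.3], [-0.2, 0.4]]     # w_ih[a][b]: input a -> hidden b (illustrative values)
w_ho = [0.5, -0.4]                   # hidden b -> output
target = 0.0

# forward pass
h = [logistic(sum(x[a] * w_ih[a][b] for a in range(2))) for b in range(2)]
o = logistic(sum(h[b] * w_ho[b] for b in range(2)))

# backward pass: error terms
delta_o = o * (1 - o) * (target - o)                                  # output error term
delta_h = [h[b] * (1 - h[b]) * delta_o * w_ho[b] for b in range(2)]   # hidden error terms

# weight updates: eta * error term * input of the connection
w_ho = [w_ho[b] + eta * delta_o * h[b] for b in range(2)]
w_ih = [[w_ih[a][b] + eta * delta_h[b] * x[a] for b in range(2)] for a in range(2)]
print(o, delta_o, delta_h)
```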
Example: Sample iterations • A network suggested to solve the XOR problem, figure 4.7 of WK, pages 96-99 • learning rate is 1 for simplicity • I1 = I2 = 1 • T = 0 true output • I1 I2 T P • 1.0 1.0 0 0.63
[Figure 4.7 of WK: inputs I1 = 1 and I2 = 1 feed hidden nodes 3 and 4, which feed output node 5; the connection weights are 0.1, 0.5, 0.3, -0.2, -0.4 and 0.4; the resulting activations are O3 = 0.65, O4 = 0.48 and O5 = 0.63]