STAT 6601 Classification: Neural Networks V&R 12.2. By Gary Gongwer, Madhu Iyer, Mable Kong.
Classification
• Classification is a multivariate technique concerned with assigning data cases (i.e. observations) to one of a fixed number of possible classes (represented by nominal output variables).
• The goal of classification is to sort observations into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.
• In short, the aim of classification is to assign input cases to one of a number of classes.
Simple Pattern Classification Example
• Let us consider a simple problem of distinguishing handwritten versions of the characters ‘a’ and ‘b’.
• We seek an algorithm which can distinguish as reliably as possible between the two characters.
• Therefore, the goal in this classification problem is to develop an algorithm which will assign any image, represented by a vector x, to one of two classes, which we shall denote by Ck, where k = 1, 2, so that class C1 corresponds to the character ‘a’ and class C2 corresponds to ‘b’.
Example
• A large number of input variables can present severe problems for pattern recognition systems. One technique to alleviate such problems is to combine input variables together to make a smaller number of new variables called features.
• In the present example we could evaluate the ratio of the height of the character to its width (x1), and we might expect that characters from class C2 (corresponding to ‘b’) will typically have larger values of x1 than the characters from class C1 (corresponding to ‘a’).
How can we make the best use of x1 to classify a new image so as to minimize the number of misclassifications?
• One approach would be to build a classifier system which uses a threshold for the value of x1 and which classifies as C2 any image for which x1 exceeds the threshold, and which classifies all other images as C1.
• The number of misclassifications will be minimized if we choose the threshold to be at the point where the histograms of x1 for the two classes cross. This classification procedure is based on the evaluation of x1 followed by its comparison with a threshold.
• Problem with this classification procedure: there is still significant overlap of the histograms, so many of the new characters we test will be misclassified.
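The threshold rule above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the x1 values are simulated from two normal distributions (the slides give no real height-to-width data), and the names `classify`, `thresholds`, and `best` are hypothetical.

```r
# Simulated stand-in for the height/width ratio x1 (assumed values, not real data).
set.seed(1)
x1_a <- rnorm(200, mean = 1.0, sd = 0.3)   # class C1 ('a')
x1_b <- rnorm(200, mean = 1.8, sd = 0.3)   # class C2 ('b')
x1   <- c(x1_a, x1_b)
cls  <- rep(c("C1", "C2"), each = 200)

# Classify as C2 when x1 exceeds the threshold, otherwise as C1.
classify <- function(x, threshold) ifelse(x > threshold, "C2", "C1")

# Count training misclassifications over a grid of candidate thresholds
# and keep the threshold with the fewest errors.
thresholds <- seq(min(x1), max(x1), length.out = 200)
n_errors   <- sapply(thresholds, function(th) sum(classify(x1, th) != cls))
best       <- thresholds[which.min(n_errors)]

cat("best threshold:", round(best, 3), " training errors:", min(n_errors), "\n")
```

Because the two simulated distributions overlap, even the best threshold leaves some errors, which is the limitation the slide points out.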
Now consider another feature x2. We try to classify new images on the basis of the values of x1 and x2.
• When examples of patterns from the two classes are plotted in the (x1, x2) space, it is possible to draw a line in this space, known as the decision boundary, which gives good separation of the two classes.
• New patterns which lie above the decision boundary are classified as belonging to C1, while patterns falling below the decision boundary are classified as C2.
We could continue to consider a larger number of independent features in the hope of improving the performance.
• Instead we could aim to build a classifier which has the smallest probability of making a mistake.
Classification Theory
• In the terminology of pattern recognition, the given examples together with their classifications are known as the training set, and future cases form the test set.
• Our primary measure of success is the error (or misclassification) rate.
• The confusion matrix gives the number of cases with true class i classified as class j.
• Assign costs $L_{ij}$ to allocating a case of class i to class j. We are then interested in the average error cost rather than the error rate.
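As a small illustration of these quantities, the sketch below builds a confusion matrix from made-up labels and computes the error rate and the average error cost. The labels and the cost matrix `L` are assumptions chosen for the example, not values from the slides.

```r
# Made-up true and predicted classes for ten cases (illustrative only).
truth <- factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c"))
pred  <- factor(c("a", "b", "a", "b", "b", "c", "b", "c", "c", "a"),
                levels = levels(truth))

# Confusion matrix: rows are true classes i, columns are assigned classes j.
confusion <- table(true = truth, predicted = pred)

# Error rate: proportion of cases off the diagonal.
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)

# Assumed costs L[i, j]: 1 for most errors, 5 for calling a true 'c' an 'a'.
L <- matrix(1, 3, 3, dimnames = list(levels(truth), levels(truth)))
diag(L) <- 0
L["c", "a"] <- 5

# Average error cost: cost-weighted count of errors per case.
avg_cost <- sum(confusion * L) / sum(confusion)

print(confusion)
cat("error rate:", error_rate, " average cost:", avg_cost, "\n")
```

With unequal costs the two measures can rank classifiers differently, which is why the average cost is the quantity of interest.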
Average Error Cost
• The average error cost is minimized by the Bayes rule, which is to allocate to the class c minimizing $\sum_i L_{ic}\, p(i \mid x)$, where $p(i \mid x)$ is the posterior distribution of the classes after observing x.
• If the costs of all errors are the same, this rule amounts to choosing the class c with the largest posterior probability $p(c \mid x)$.
• The minimum average cost is known as the Bayes risk.
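A minimal sketch of the Bayes rule for a single case follows, assuming the posterior probabilities are already known. The posterior values, the cost matrices, and the helper name `bayes_rule` are all hypothetical; rows of `L` index the true class i and columns the allocated class c.

```r
# Allocate to the class c that minimizes sum_i L[i, c] * p(i | x).
bayes_rule <- function(posterior, L) {
  expected_cost <- as.vector(posterior %*% L)   # one expected cost per candidate class
  names(expected_cost) <- colnames(L)
  list(class = colnames(L)[which.min(expected_cost)],
       expected_cost = expected_cost)
}

classes   <- c("a", "b", "c")
posterior <- c(a = 0.5, b = 0.3, c = 0.2)       # assumed p(i | x)

# Equal costs for all errors: the rule picks the largest posterior.
L_equal <- matrix(1, 3, 3, dimnames = list(classes, classes))
diag(L_equal) <- 0

# Assumed unequal costs: missing a true 'c' is ten times as costly.
L_unequal <- L_equal
L_unequal["c", c("a", "b")] <- 10

equal_choice   <- bayes_rule(posterior, L_equal)
unequal_choice <- bayes_rule(posterior, L_unequal)
cat("equal costs:", equal_choice$class,
    " unequal costs:", unequal_choice$class, "\n")
```

Under equal costs the choice is 'a', the class with the largest posterior; with the heavier penalty on missing 'c', the expected costs are 2.3, 2.5, and 0.8, so the rule switches to 'c'.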
Classification and Regression
• We can represent the outcome of the classification in terms of a variable y which takes the value 1 if the image is classified as C1, and the value 0 if it is classified as C2.
• In general, each output is a function of the input: yk = yk(x; w), where w denotes the vector of parameters, often called weights.
• The importance of neural networks in this context is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables, where the form of the mapping is governed by a number of adjustable parameters.
Objective: Simulate the Behavior of a Human Nerve
• Inputs are accumulated by a weighted sum.
• This sum is the input to the output function φ.
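A single unit of this kind can be sketched as below. The logistic choice for φ, the bias term, and the input and weight values are assumptions made for illustration; the slide does not fix any of them.

```r
# Logistic output function, one common choice for phi (an assumption here).
logistic <- function(z) 1 / (1 + exp(-z))

# A single neuron: weighted sum of the inputs plus a bias, passed through phi.
neuron <- function(x, w, bias = 0, phi = logistic) {
  phi(sum(w * x) + bias)
}

x <- c(1.0, -2.0)     # illustrative inputs
w <- c(0.5, 0.25)     # illustrative weights
out <- neuron(x, w)   # weighted sum is 0.5 - 0.5 = 0, so the output is 0.5
print(out)
```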
A single neuron is not very flexible
• The input layer contains the value of each variable.
• The hidden layer allows approximations by combining multiple logistic functions.
• The output neuron with the highest probability determines the class.
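The layered structure can be sketched as a forward pass through one hidden layer of logistic units followed by a softmax output. This is an illustration only: the weights are random placeholders rather than fitted values, skip-layer connections are omitted, and the names `forward`, `W_hidden`, and `W_out` are hypothetical.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Softmax turns output-layer sums into probabilities that add up to 1.
softmax <- function(z) {
  e <- exp(z - max(z))   # subtract the maximum for numerical stability
  e / sum(e)
}

# Forward pass: inputs -> logistic hidden units -> softmax class probabilities.
forward <- function(x, W_hidden, bias_hidden, W_out, bias_out) {
  hidden <- logistic(as.vector(W_hidden %*% x) + bias_hidden)
  softmax(as.vector(W_out %*% hidden) + bias_out)
}

set.seed(1)
classes  <- c("a", "b", "c")
n_in     <- 2
n_hidden <- 2
# Random placeholder weights (not fitted to any data).
W_hidden    <- matrix(runif(n_hidden * n_in, -0.5, 0.5), n_hidden, n_in)
bias_hidden <- runif(n_hidden, -0.5, 0.5)
W_out       <- matrix(runif(length(classes) * n_hidden, -0.5, 0.5),
                      length(classes), n_hidden)
bias_out    <- runif(length(classes), -0.5, 0.5)

x <- c(1.13, 2.46)   # an illustrative two-variable input
probs <- forward(x, W_hidden, bias_hidden, W_out, bias_out)
names(probs) <- classes
predicted <- classes[which.max(probs)]   # output with the highest probability
print(probs)
```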
Regression = Learning
• The weights are adjusted iteratively (batch or on-line).
• Initially, they are random and small.
• Weight decay (λ) keeps weights from becoming too large.
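To make the role of weight decay concrete, the sketch below fits a single logistic unit by batch gradient descent on a tiny, perfectly separable data set, with and without a penalty λ Σ w² added to the negative log-likelihood. The data, learning rate, iteration count, and the function name `fit_neuron` are all assumptions; this is plain gradient descent, not the optimizer that nnet uses.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Batch gradient descent on: negative log-likelihood + lambda * sum(w^2).
fit_neuron <- function(X, y, w_init, lambda = 0, rate = 0.1, n_iter = 2000) {
  w <- w_init
  for (iter in seq_len(n_iter)) {
    p    <- logistic(as.vector(X %*% w))
    grad <- as.vector(t(X) %*% (p - y)) + 2 * lambda * w   # decay adds 2 * lambda * w
    w    <- w - rate * grad
  }
  w
}

# Tiny separable data set: an intercept column and one input (illustrative).
X <- cbind(1, c(-2, -1, 1, 2))
y <- c(0, 0, 1, 1)

set.seed(1)
w_init <- runif(ncol(X), -0.1, 0.1)   # small random starting weights

w_free  <- fit_neuron(X, y, w_init, lambda = 0)
w_decay <- fit_neuron(X, y, w_init, lambda = 0.1)

cat("sum of squared weights without decay:", sum(w_free^2), "\n")
cat("sum of squared weights with decay:   ", sum(w_decay^2), "\n")
```

On separable data the unpenalized weights keep growing as iterations continue, while the penalized weights settle at a finite value.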
Backpropagation
• Adjusts weights “back to front”.
• Uses partial derivatives and the chain rule.
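The chain-rule computation can be sketched for a very small network: two inputs, two logistic hidden units, one logistic output, squared-error loss, and no bias terms. These simplifications and all names (`loss`, `backprop`, `numeric_grad_W1`) are assumptions for illustration. The analytic gradient is compared with a finite-difference estimate as a sanity check.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Squared-error loss of a 2-2-1 network (no bias terms, to keep the sketch short).
loss <- function(W1, w2, x, y) {
  h   <- logistic(as.vector(W1 %*% x))
  out <- logistic(sum(w2 * h))
  0.5 * (out - y)^2
}

# Backpropagation: run forward, then apply the chain rule from output to input.
backprop <- function(W1, w2, x, y) {
  h   <- logistic(as.vector(W1 %*% x))
  out <- logistic(sum(w2 * h))
  delta_out <- (out - y) * out * (1 - out)    # dE / d(output-unit sum)
  grad_w2   <- delta_out * h                  # gradient for hidden-to-output weights
  delta_h   <- delta_out * w2 * h * (1 - h)   # dE / d(hidden-unit sums)
  grad_W1   <- outer(delta_h, x)              # gradient for input-to-hidden weights
  list(W1 = grad_W1, w2 = grad_w2)
}

# Central finite differences for the input-to-hidden weights.
numeric_grad_W1 <- function(W1, w2, x, y, eps = 1e-6) {
  g <- W1 * 0
  for (k in seq_along(W1)) {
    Wp <- W1; Wm <- W1
    Wp[k] <- Wp[k] + eps
    Wm[k] <- Wm[k] - eps
    g[k] <- (loss(Wp, w2, x, y) - loss(Wm, w2, x, y)) / (2 * eps)
  }
  g
}

set.seed(1)
W1 <- matrix(runif(4, -0.5, 0.5), 2, 2)   # small random weights
w2 <- runif(2, -0.5, 0.5)
x  <- c(0.7, -1.2)                        # illustrative input
y  <- 1                                   # illustrative target

analytic <- backprop(W1, w2, x, y)
max_diff <- max(abs(analytic$W1 - numeric_grad_W1(W1, w2, x, y)))
cat("largest difference between analytic and numeric gradient:", max_diff, "\n")
```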
Avoiding Local Maxima
• Make weights initially random.
• Use multiple runs and average the predictions.
An Example: Cushing’s Syndrome
Cushing’s syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland.
Three recognized types of the syndrome, plus cases of unknown type:
• a: adenoma
• b: bilateral hyperplasia
• c: carcinoma
• u: unknown type
The observations are urinary excretion rates (mg/24hr) of the steroid metabolites tetrahydrocortisone (T) and pregnanetriol (P), and are considered on a log scale.
Cushing’s Syndrome Data
     Tetrahydrocortisone  Pregnanetriol  Type
a1                   3.1          11.70  a
a2                   3.0           1.30  a
a3                   1.9           0.10  a
a4                   3.8           0.04  a
a5                   4.1           1.10  a
a6                   1.9           0.40  a
b1                   8.3           1.00  b
b2                   3.8           0.20  b
b3                   3.9           0.60  b
b4                   7.8           1.20  b
b5                   9.1           0.60  b
b6                  15.4           3.60  b
b7                   7.7           1.60  b
b8                   6.5           0.40  b
b9                   5.7           0.40  b
b10                 13.6           1.60  b
c1                  10.2           6.40  c
c2                   9.2           7.90  c
c3                   9.6           3.10  c
c4                  53.8           2.50  c
c5                  15.8           7.60  c
u1                   5.1           0.40  u
u2                  12.9           5.00  u
u3                  13.0           0.80  u
u4                   2.6           0.10  u
u5                  30.0           0.10  u
u6                  20.5           0.80  u
R Code
library(MASS); library(class); library(nnet)
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
tpi <- class.ind(Cushings$Type[1:21, drop = TRUE])
xp <- seq(0.6, 4.0, length = 100); np <- length(xp)
yp <- seq(-3.25, 2.45, length = 100)
cushT <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)
# pltnn plots T and P against each other by type (a, b, c, u)
pltnn <- function(main, ...) {
  plot(Cushings[, 1], Cushings[, 2], log = "xy", type = "n",
       xlab = "Tetrahydrocortisone", ylab = "Pregnanetriol", main = main, ...)
  for (il in 1:4) {
    set <- Cushings$Type == levels(Cushings$Type)[il]
    text(Cushings[set, 1], Cushings[set, 2],
         as.character(Cushings$Type[set]), col = 2 + il)
  }
}
> cush <- log(as.matrix(Cushings[, -3]))[1:21,]
> cush
    Tetrahydrocortisone Pregnanetriol
a1            1.1314021    2.45958884
a2            1.0986123    0.26236426
a3            0.6418539   -2.30258509
a4            1.3350011   -3.21887582
a5            1.4109870    0.09531018
a6            0.6418539   -0.91629073
b1            2.1162555    0.00000000
b2            1.3350011   -1.60943791
b3            1.3609766   -0.51082562
b4            2.0541237    0.18232156
b5            2.2082744   -0.51082562
b6            2.7343675    1.28093385
b7            2.0412203    0.47000363
b8            1.8718022   -0.91629073
b9            1.7404662   -0.91629073
b10           2.6100698    0.47000363
c1            2.3223877    1.85629799
c2            2.2192035    2.06686276
c3            2.2617631    1.13140211
c4            3.9852735    0.91629073
c5            2.7600099    2.02814825
> tpi <- class.ind(Cushings$Type[1:21, drop = T])
> tpi
      a b c
 [1,] 1 0 0
 [2,] 1 0 0
 [3,] 1 0 0
 [4,] 1 0 0
 [5,] 1 0 0
 [6,] 1 0 0
 [7,] 0 1 0
 [8,] 0 1 0
 [9,] 0 1 0
[10,] 0 1 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
[15,] 0 1 0
[16,] 0 1 0
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 0 1
[20,] 0 0 1
[21,] 0 0 1
plt.bndry <- function(size = 0, decay = 0, ...) {
  cush.nn <- nnet(cush, tpi, skip = TRUE, softmax = TRUE, size = size,
                  decay = decay, maxit = 1000)
  invisible(b1(predict(cush.nn, cushT), ...))
}
• cush: matrix or data frame of x values of examples.
• tpi: matrix or data frame of target values of examples.
• skip: switch to add skip-layer connections from input to output.
• softmax: switch for softmax (log-linear model) and maximum conditional likelihood fitting.
• size: number of units in the hidden layer.
• decay: parameter for weight decay.
• maxit: maximum number of iterations.
• invisible: return a (temporarily) invisible copy of an object.
• predict: generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the 'class' of the first argument. Here, cush.nn is used to predict over the grid cushT.
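# b1 draws the class boundaries from a matrix Z of predicted probabilities (columns a, b, c)
b1 <- function(Z, ...) {
  # zero contour of p(c) minus the larger of p(b), p(a): boundary of the region assigned to c
  zp <- Z[,3] - pmax(Z[,2], Z[,1])
  contour(exp(xp), exp(yp), matrix(zp, np), add = TRUE, levels = 0, labex = 0, ...)
  # zero contour of p(a) minus the larger of p(c), p(b): boundary of the region assigned to a
  zp <- Z[,1] - pmax(Z[,3], Z[,2])
  contour(exp(xp), exp(yp), matrix(zp, np), add = TRUE, levels = 0, labex = 0, ...)
}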
par(mfrow = c(2, 2))
pltnn("Size = 2")
set.seed(1); plt.bndry(size = 2, col = 2)
set.seed(3); plt.bndry(size = 2, col = 3)
plt.bndry(size = 2, col = 4)
pltnn("Size = 2, lambda = 0.001")
set.seed(1); plt.bndry(size = 2, decay = 0.001, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.001, col = 4)
pltnn("Size = 2, lambda = 0.01")
set.seed(1); plt.bndry(size = 2, decay = 0.01, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.01, col = 4)
pltnn("Size = 5, 20 lambda = 0.01")
set.seed(2); plt.bndry(size = 5, decay = 0.01, col = 1)
set.seed(2); plt.bndry(size = 20, decay = 0.01, col = 2)
# functions pltnn and b1 are in the scripts
pltnn("Many local maxima")
Z <- matrix(0, nrow(cushT), ncol(tpi))
for (iter in 1:20) {
  set.seed(iter)
  cush.nn <- nnet(cush, tpi, skip = TRUE, softmax = TRUE, size = 3,
                  decay = 0.01, maxit = 1000, trace = FALSE)
  Z <- Z + predict(cush.nn, cushT)
  cat("final value", format(round(cush.nn$value, 3)), "\n")
  b1(predict(cush.nn, cushT), col = 2, lwd = 0.5)
}
pltnn("Averaged")
b1(Z, lwd = 3)
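The same averaging idea could be applied to the six cases of unknown type (u1 to u6), which the code above plots but does not classify. The sketch below is one possible way to do it, not part of the original slides: it assumes the MASS and nnet packages are installed, reuses the settings size = 3 and decay = 0.01 from the loop above, and uses only 5 fits (an arbitrary choice) to keep the run short. No results are shown because they depend on the fits.

```r
library(MASS)   # Cushings data
library(nnet)   # nnet(), class.ind()

cush_all <- log(as.matrix(Cushings[, -3]))
train    <- cush_all[1:21, ]    # cases of known type a, b, c
unknown  <- cush_all[22:27, ]   # cases u1 to u6
tpi      <- class.ind(Cushings$Type[1:21, drop = TRUE])

n_fits <- 5                     # assumed number of random restarts
avg <- matrix(0, nrow(unknown), ncol(tpi),
              dimnames = list(rownames(unknown), colnames(tpi)))
for (iter in seq_len(n_fits)) {
  set.seed(iter)                # different random starting weights each time
  fit <- nnet(train, tpi, skip = TRUE, softmax = TRUE, size = 3,
              decay = 0.01, maxit = 1000, trace = FALSE)
  avg <- avg + predict(fit, unknown) / n_fits
}

# Assign each unknown case to the class with the largest averaged probability.
pred_class <- colnames(tpi)[apply(avg, 1, which.max)]
print(round(avg, 3))
print(pred_class)
```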
References
Bishop, C.M. (1995) Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.