Classification 10/03/07
Diagnose disease by gene expression pattern Golub et al. 1999
Two types of statistical learning • Supervised • The classes are predefined and the membership of a set of objects is known. The goal is to develop a rule that predicts the membership of a new object. • Unsupervised • Discover clusters of patterns from observed data. Both the clusters and the membership need to be identified. • Classification is a kind of supervised learning.
How good is good enough? • Suppose a test is used to screen for a certain disease. The test has 99% sensitivity and 99% specificity. • The disease is rare: 1 case out of 1 million people. • Question: Is this test useful?
How good is good enough? • Misclassification rate = 0.999999 * 0.01 + 0.000001 * 0.01 = 0.01 • If we predict that no one has the disease, the misclassification rate = 0.000001 * 1 = 0.000001 • Does that mean the test is no good?
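The arithmetic on this slide can be checked with a short sketch. The function name `screening_metrics` is hypothetical, and the positive predictive value (the fraction of positive results that are true cases) is an extra quantity that is not on the slides; it is included only because it bears on whether the test is useful.

```python
def screening_metrics(sensitivity, specificity, prevalence):
    """Return (misclassification rate, positive predictive value)."""
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy, test positive
    false_neg = prevalence * (1 - sensitivity)        # sick, test negative
    true_pos = prevalence * sensitivity               # sick, test positive
    error_rate = false_pos + false_neg
    ppv = true_pos / (true_pos + false_pos)
    return error_rate, ppv

# Values from the slide: 99% sensitivity, 99% specificity, 1 case per million.
error_rate, ppv = screening_metrics(0.99, 0.99, 1e-6)

# Predicting "no disease" for everyone is wrong only for the true cases.
always_healthy_error = 1e-6
```

Under these numbers the test has a far higher misclassification rate than the trivial rule, and only about 1 in 10,000 positive results is a true case, yet the trivial rule never finds a single patient. The misclassification rate alone does not settle the question.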
Loss function • Often our goal is to minimize the misclassification error rate. • Sometimes an error in one direction outweighs an error in the other direction. For example, it is more costly to classify a sick patient as healthy than to classify a healthy patient as sick. • In general, we want to minimize a loss function L(Ctrue, Cpredict).
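A minimal sketch of how an asymmetric loss changes the prediction, assuming the class probabilities for one patient are already known. The function name `best_prediction`, the probabilities, and the cost of 10 for a missed case are illustrative assumptions, not values from the slides.

```python
def best_prediction(posterior, loss):
    """Pick the prediction with the smallest expected loss.

    posterior[c] is P(true class = c); loss[(true, predicted)] is the cost.
    """
    classes = list(posterior)

    def expected_loss(pred):
        return sum(posterior[true] * loss[(true, pred)] for true in classes)

    return min(classes, key=expected_loss)

posterior = {"sick": 0.2, "healthy": 0.8}  # assumed for illustration

# 0-1 loss: every error costs 1, so this minimizes the misclassification rate.
zero_one = {("sick", "sick"): 0, ("sick", "healthy"): 1,
            ("healthy", "sick"): 1, ("healthy", "healthy"): 0}

# Asymmetric loss: missing a sick patient is assumed to cost 10 times more.
asymmetric = dict(zero_one)
asymmetric[("sick", "healthy")] = 10

under_zero_one = best_prediction(posterior, zero_one)
under_asymmetric = best_prediction(posterior, asymmetric)
```

With the 0-1 loss the patient is called healthy (expected loss 0.2 versus 0.8). With the asymmetric loss the same patient is called sick (expected loss 0.8 versus 2.0).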
Procedure for developing a classifier • Collect data with known class association. • Take out a subset and don’t touch it. This will be the testing subset. • Build a model using information from the rest of the data, i.e., the training set. • Apply the trained model to the testing data and evaluate model performance. • If you use all of the data to train your model, the model is evaluated on data it has already seen, so overfitting goes undetected and the performance will be exaggerated.
k-nearest-neighbor classifier • Find the k nearest neighbors. • Classify the unknown case by majority vote. • Despite its simplicity, kNN can be effective. [Figure: neighboring points labeled 1 to 5]
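The two steps (find the k nearest neighbors, then take a majority vote) can be sketched as follows. Euclidean distance and the toy data are assumptions for illustration; the slides do not specify a distance measure.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Distance from x to every training point (this is the costly step).
    dists = [(math.dist(p, x), label) for p, label in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups in two dimensions.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

pred_near_a = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_near_b = knn_predict(train_X, train_y, (4.0, 4.0), k=3)
```

Every prediction scans the whole training set, which is one reason the method is computationally intensive.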
Issues with k-nearest-neighbor classifier • Computationally intensive • How to choose k • Nearest-neighbors may not be close (especially when X is high dimensional). • Most genes are probably irrelevant to the prediction anyhow. • Pre-select features using dimension reduction methods (discussed by Prof. Cai last time). • Dimension reduction is important for other classifiers as well.
Feature selection • The dimension of the model = number of genes is very high. • It is hard to find close neighbors in high dimensional space • Many genes are irrelevant • Pre-select genes using dimension reduction methods • Dimension reduction is required for other models as well.
Feature selection methods • Stepwise regression • PCA • PLS • Ridge regression • LASSO • etc. (Cai)
Classification Methods • Linear discriminant analysis (LDA) • Logistic regression • Classification trees • Support vector machine (SVM) • Neural network • Many other methods!
Linear methods [Figure: Class 1 and Class 2, with an unclassified case marked ???]
Linear Discriminant Analysis (LDA) Approximate the probability distribution within each class by a Gaussian distribution. [Figure: Class 1 and Class 2]
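Bayes Rule The posterior distribution: $P(G = k \mid X = x) = \frac{\pi_k p_k(x)}{\sum_l \pi_l p_l(x)}$, where $\pi_k$ is the prior probability of class k and $p_k(x)$ is the density of X within class k. Select the k with the largest posterior probability; this minimizes the average misclassification rate. The maximum likelihood rule is equivalent to the Bayes rule with a uniform prior. The decision boundary between classes k and l is $\{x : P(G = k \mid X = x) = P(G = l \mid X = x)\}$.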
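Linear Discriminant Analysis Assume the density of X within class k is Gaussian with mean $\mu_k$ and covariance matrix $\Sigma_k$: $p_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$, where p is the dimension of x.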
LDA • The boundary is linear if the covariance matrices of the two classes are the same. • Otherwise, the boundary is quadratic and the method is called QDA. [Figure: Class 1 and Class 2]
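A minimal sketch of LDA for a single predictor, assuming Gaussian classes with a shared (pooled) variance. With one predictor the linear boundary reduces to a single cut point. The function names and the toy data are hypothetical.

```python
import math

def fit_lda_1d(xs, labels):
    """Estimate class priors, class means, and the pooled variance."""
    classes = sorted(set(labels))
    n = len(xs)
    priors, means = {}, {}
    ss = 0.0
    for c in classes:
        xc = [x for x, g in zip(xs, labels) if g == c]
        priors[c] = len(xc) / n
        means[c] = sum(xc) / len(xc)
        ss += sum((x - means[c]) ** 2 for x in xc)
    pooled_var = ss / (n - len(classes))
    return priors, means, pooled_var

def lda_predict(x, priors, means, pooled_var):
    """Pick the class with the largest linear discriminant score."""
    def score(c):
        # Log posterior up to terms shared by all classes; linear in x.
        return (x * means[c] / pooled_var
                - means[c] ** 2 / (2 * pooled_var)
                + math.log(priors[c]))
    return max(priors, key=score)

xs = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
priors, means, pooled_var = fit_lda_1d(xs, labels)

pred_low = lda_predict(3.0, priors, means, pooled_var)
pred_high = lda_predict(4.5, priors, means, pooled_var)
```

With equal priors the cut point sits halfway between the two class means, here at 3.75. Estimating a separate variance per class would add a quadratic term in x to the score, which is the QDA case.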
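Logistic regression • Model the log-odds of the k-th class versus a reference class, e.g. the 1st class: $\log \frac{P(G = k \mid X = x)}{P(G = 1 \mid X = x)} = \beta_{k0} + \beta_k^T x$, for $k = 2, \dots, K$. Select the k with the largest P(G = k | X = x). Question: How to estimate the β’s?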
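Fitting logistic regression model • Let $p_k(x_i; \beta) = P(G = k \mid X = x_i; \beta)$. Maximize the conditional log-likelihood $\ell(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)$, where $g_i$ is the class of the i-th observation. In the special case of two classes, let yi = 0 when gi = 1, and yi = 1 when gi = 2, and let $x_i$ include a constant term 1 for the intercept. Then $\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right) \right]$. The maximum is achieved when $\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - p(x_i; \beta) \right) = 0$, with $p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$.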
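Fitting logistic regression model (ctd) • Since this is a non-linear equation, it can only be solved numerically. This is achieved by the Newton-Raphson method: $\beta^{new} = \beta^{old} - \left( \frac{\partial^2 \ell}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell}{\partial \beta}$, where $\frac{\partial^2 \ell}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta) \left( 1 - p(x_i; \beta) \right)$ and both derivatives are evaluated at $\beta^{old}$. Note: global convergence is not guaranteed. • For multiple classes, β can be solved for similarly.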
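A minimal sketch of the Newton-Raphson iteration for two classes with one predictor plus an intercept, so the 2x2 system can be solved by hand. The function name, the fixed iteration count, and the toy data are assumptions; there is no step-size control, in line with the note that convergence is not guaranteed.

```python
import math

def fit_logistic_newton(xs, ys, n_iter=25):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # g = gradient of the log-likelihood, h = negative Hessian (2x2).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta_old + h^{-1} g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Toy data with overlapping classes, so the maximum is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)

# At the maximum the gradient of the log-likelihood should be close to zero.
residuals = [y - 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x, y in zip(xs, ys)]
score = (sum(residuals), sum(r * x for r, x in zip(residuals, xs)))
```

If the two classes were perfectly separable, the coefficients would grow without bound and this loop would not settle, which is one practical face of the convergence caveat.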
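Naïve Bayes method From Bayes’ rule, $P(G = k \mid X) = \frac{\pi_k p_k(X)}{\sum_l \pi_l p_l(X)}$, where $\pi_k$ is the prior probability of class k and $p_k(X)$ is the density of X within class k. If X is high-dimensional (the dimension is the number of genes considered), pk(X) is difficult to estimate. However, if we assume the Xj’s are independent of each other within each class, i.e., $p_k(X) = \prod_j p_{kj}(X_j)$, then each pkj(Xj) can be easily estimated.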
Naïve Bayes method Therefore $P(G = k \mid X) \propto \pi_k \prod_j p_{kj}(X_j)$. Note: Surprisingly, even though the assumption that the Xj’s are independent is almost never met, the naïve Bayes classifier often performs well, even beating more sophisticated methods.
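A minimal sketch of a naïve Bayes classifier. Modeling each per-feature density as a one-dimensional Gaussian is an assumption made here for illustration; the slides leave the form of the per-feature densities open. The function names and data are hypothetical.

```python
import math

def fit_gaussian_nb(X, y):
    """Per class: the prior and a (mean, variance) pair for each feature."""
    model = {}
    for c in set(y):
        rows = [x for x, g in zip(X, y) if g == c]
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def predict_nb(model, x):
    """Return the class maximizing log(prior) + sum of per-feature log densities."""
    def log_post(c):
        prior, stats = model[c]
        total = math.log(prior)
        for v, (mu, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_post)

# Toy expression values for two genes in six samples.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.4), (2.8, 0.6)]
y = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
model = fit_gaussian_nb(X, y)

pred_1 = predict_nb(model, (1.1, 2.0))
pred_2 = predict_nb(model, (3.1, 0.5))
```

Working with sums of logs rather than products of densities avoids numerical underflow when the number of features is large.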
Classification tree Goal: Predict whether a person owns a house by asking a few questions with yes or no answers. Predictors: Age, Car Type, etc. [Figure: decision tree with nodes Age (>=30 / <30) and Car Type (sports car / minivan) and leaves YES / NO]
[Figure: the same decision tree, shown with a second panel labeled Age >= 30, Sports car, and minivan]
Regression tree: Algorithm The response is continuous. Goal: select a partition of regions (nodes): R1, …, RM, so that the response can be modeled as a constant cm in each region. Step 1: For a splitting variable Xj and a splitting point s, define $R_1(j, s) = \{X \mid X_j \le s\}$ and $R_2(j, s) = \{X \mid X_j > s\}$. Seek j and s so that $\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2$ is minimized. Step 2: For each Rm, refine the partition by repeating step 1; stop when the number of nodes reaches a predefined cutoff.
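Step 1 can be sketched as an exhaustive search over splitting variables and splitting points. Using midpoints between consecutive observed values as candidate split points is an assumption of this sketch, as are the function names and the toy data.

```python
def sse(ys):
    """Sum of squared errors around the mean, the best constant for a region."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X, y):
    """Search all variables j and split points s; return (j, s, total SSE)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate split points: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best

# The response jumps when the first variable exceeds 3; the second is noise.
X = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0), (5.0, 6.0), (6.0, 2.0)]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
j, s, total = best_split(X, y)
```

Step 2 would call `best_split` again on the rows that fall in each of the two resulting regions.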
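Classification tree: Pruning Define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning the full tree $T_0$. Let $N_m = \#\{x_i \in R_m\}$ and $\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$. The quality of a tree is measured at each terminal node m by $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$. Define a cost-complexity criterion for a pre-selected level α: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where |T| is the number of terminal nodes. Seek the subtree $T_\alpha$ that minimizes $C_\alpha(T)$.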
Classification tree: Pruning Find the weak link, that is, the internal node whose collapse leads to the minimum increase of $\sum_m N_m Q_m(T)$, and collapse it. Repeat the above procedure until a single-node tree is reached. Theorem (Breiman et al. 1984): The optimal subtree is contained in the above sequence of subtrees. The level of α can be determined through cross-validation. (We will talk about cross-validation later.)
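Classification tree • A classification tree differs from a regression tree in the quality term. • For a regression tree, minimize $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$. • For a classification tree, let $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ be the proportion of class k in node m and $k(m) = \arg\max_k \hat{p}_{mk}$, then minimize one of: • Misclassification error: $1 - \hat{p}_{m k(m)}$ • Gini index: $\sum_k \hat{p}_{mk} (1 - \hat{p}_{mk})$ • Cross-entropy or deviance: $-\sum_k \hat{p}_{mk} \log \hat{p}_{mk}$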
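The three node-impurity measures named on this slide can be computed from the class labels that fall in a node. This is a small sketch; the function name `impurity` and the example nodes are hypothetical.

```python
import math

def impurity(labels):
    """Return (misclassification error, Gini index, cross-entropy) of a node."""
    n = len(labels)
    # Proportion of each class present in the node.
    props = [labels.count(k) / n for k in set(labels)]
    misclass = 1 - max(props)
    gini = sum(p * (1 - p) for p in props)
    entropy = -sum(p * math.log(p) for p in props)
    return misclass, gini, entropy

pure = impurity(["a", "a", "a", "a"])    # a node holding a single class
mixed = impurity(["a", "a", "b", "b"])   # an evenly mixed two-class node
```

All three measures are zero for a pure node and largest for an evenly mixed one.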
Classification tree • Advantage • Visually intuitive • Mathematically “simple” • Drawback • Unstable: tree structures are sensitive to data • Theoretical properties are not well understood
Performance of a classifier • Cross-validation • Bootstrap
Cross-validation • The data is divided into a training subset and a testing subset. • Model building must be independent of testing subset, including variable selection, tree structure, and so on. Example: n-fold cross-validation • A dataset is randomly divided into n subsets of equal size. • Each subset is selected in turn as the testing set, whereas the rest are used as the training set.
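A minimal sketch of n-fold cross-validation. The nearest-class-mean classifier is only a stand-in for whatever model is being evaluated, and all function names and data are hypothetical. Folds are of equal size when the sample size is a multiple of the number of folds.

```python
import random

def n_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices and deal them into n_folds test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def cross_validate(X, y, n_folds, fit, predict):
    """Misclassification rate when each fold is held out in turn."""
    errors = 0
    for test_idx in n_fold_indices(len(X), n_folds):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [g for i, g in enumerate(y) if i not in held_out]
        # All data-driven choices must happen inside fit, on training data only.
        model = fit(train_X, train_y)
        errors += sum(predict(model, X[i]) != y[i] for i in test_idx)
    return errors / len(X)

def fit_nearest_mean(train_X, train_y):
    """Stand-in classifier: store the mean of each class."""
    means = {}
    for c in set(train_y):
        vals = [x for x, g in zip(train_X, train_y) if g == c]
        means[c] = sum(vals) / len(vals)
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

X = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
folds = n_fold_indices(len(X), 5)
cv_error = cross_validate(X, y, 5, fit_nearest_mean, predict_nearest_mean)
```

Each observation is used for testing exactly once and never by a model that was trained on it.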
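Bootstrap methods Idea: Draw B random samples with replacement from the training data, each sample the same size N as the original training set. Fit the model to each resampled dataset, then treat the original training data as testing data. Estimate: $\widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\left(C_i, \hat{C}^{*b}(x_i)\right)$, where $\hat{C}^{*b}(x_i)$ is the class predicted for $x_i$ by the model fitted to the b-th bootstrap sample. Improved version: evaluate each bootstrap model only on the observations that were not drawn into its bootstrap sample, so that no observation is tested by a model that was trained on it.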
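A minimal sketch of the basic bootstrap estimate: resample with replacement, fit, and test on the original training data. The nearest-class-mean classifier is a stand-in, and the function names, the number of resamples, and the data are assumptions for illustration.

```python
import random

def bootstrap_error(X, y, fit, predict, n_boot=50, seed=0):
    """Average error on the original data of models fit to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    total = 0.0
    for _ in range(n_boot):
        # Draw n indices with replacement: same size as the original set.
        draw = [rng.randrange(n) for _ in range(n)]
        model = fit([X[i] for i in draw], [y[i] for i in draw])
        total += sum(predict(model, X[i]) != y[i] for i in range(n)) / n
    return total / n_boot

def fit_nearest_mean(train_X, train_y):
    """Stand-in classifier: store the mean of each class."""
    means = {}
    for c in set(train_y):
        vals = [x for x, g in zip(train_X, train_y) if g == c]
        means[c] = sum(vals) / len(vals)
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

X = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
boot_err = bootstrap_error(X, y, fit_nearest_mean, predict_nearest_mean)
```

Because every resample shares many observations with the data it is tested on, this basic estimate tends to be optimistic.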
Use cross-validation to select parameters • A classifier may have several tunable parameters, for example the number of nearest neighbors k, or α for a classification tree. • These parameters can be selected by CV. In these cases, the full dataset is divided into three parts: training set, testing set 1, and testing set 2. • Testing set 1 is used to tune the parameters, so it cannot be used to objectively estimate model performance. Therefore, testing set 2 is needed.
Acknowledgement • Sources of slides: • Cheng Li • http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf • www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt