
Classification for Lifelog Management

Classification for Lifelog Management. September 30, 2008 Sung-Bae Cho. Agenda. Introduction to Classification Discriminant Analysis Decision Tree Density Estimation Artificial Neural Networks Comparison of Classification Methods An Example of Classification for Life Log Analysis.



  1. Classification for Lifelog Management September 30, 2008 Sung-Bae Cho

  2. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  3. Classification • Definition • A procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled items. • Formally, the problem can be stated as follows: • Given training data $\{(x_1, y_1), \dots, (x_n, y_n)\}$, produce a classifier $h: X \to Y$ which maps an object $x \in X$ to its classification label $y \in Y$. • For example, if the problem is filtering spam, then $x_i$ is some representation of an email and $y_i$ is either "Spam" or "Non-Spam". • Classification algorithms are typically used in pattern recognition systems.

  4. Other Types of Classification Problem • Classification can be considered an estimation problem, where the goal is to estimate a function of the form $P(\text{class} \mid \vec{x}) = f(\vec{x}; \vec{\theta})$ • where the feature vector input is $\vec{x}$, and the function $f$ is typically parameterized by some parameters $\vec{\theta}$. • In the Bayesian approach to this problem, instead of choosing a single parameter vector $\vec{\theta}$, the result is integrated over all possible thetas • with the thetas weighted by how likely they are given the training data $D$: $P(\text{class} \mid \vec{x}) = \int f(\vec{x}; \vec{\theta}) \, P(\vec{\theta} \mid D) \, d\vec{\theta}$

  5. Features of Classification • Structure of a classification task • Prior probability: relative frequency with which the classes occur in the population • Criterion for separating the classes: underlying relation that uses observed attributes to distinguish an individual from each class • Misclassification cost: cost associated with making a wrong classification • Benefits • Mechanical classification is faster • Human operators may have biases • Cheaper, diagnosis made on external symptoms, avoiding surgery • Understanding structural relationships between the response and the measured variables • Issues • Accuracy, Speed, Comprehensibility • Time to learn

  6. Examples of Classification Algorithms • Discriminant analysis • Linear / Quadratic discriminant • Logistic regression • Support vector machine • Decision tree • Density estimation • k-nearest neighbor • Artificial neural networks • Perceptron • Multi layer perceptron • Naive Bayes classifier • Probabilistic classification • Bayesian networks* • Temporal classification • Hidden Markov models* • Dynamic Bayesian networks* * at the next lectures

  7. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  8. Discriminant Analysis • Definition • Divide the sample space by a series of lines in two dimensions, planes in 3-D and, more generally, hyperplanes in many dimensions • The line dividing two classes is drawn to bisect the line joining the centers of those classes • The direction of the line is determined by the shape of the clusters of points
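The two geometric statements on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used in the lecture: it assumes two classes, equal priors, and a pooled covariance estimate, and the function names `fit_lda` and `predict` and the toy points are hypothetical.

```python
import numpy as np

def fit_lda(X0, X1):
    """Two-class linear discriminant (assumes equal priors, pooled covariance)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance: this is what encodes the cluster shape.
    S = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    S = S / (len(X0) + len(X1) - 2)
    w = np.linalg.solve(S, m1 - m0)   # direction of the discriminant
    b = -w @ (m0 + m1) / 2.0          # boundary passes through the midpoint of the centers
    return w, b

def predict(w, b, X):
    """Return 1 for points on the class-1 side of the boundary, else 0."""
    return (X @ w + b > 0).astype(int)

# Hypothetical toy data: two well-separated clusters.
X0 = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]])
X1 = np.array([[6.0, 6.0], [7.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
w, b = fit_lda(X0, X1)
midpoint = (X0.mean(axis=0) + X1.mean(axis=0)) / 2.0
```

With equal priors the boundary `w @ x + b = 0` contains the midpoint between the class centers, while the covariance term tilts the boundary according to the cluster shape.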

  9. Discriminant Analysis: Example

  10. Discriminant Analysis: Issues and Properties • Requires numerical attributes, no missing values • Dummy variables for categories • Maximizes separation between the classes in a least-squares sense • Attributes must be linearly independent • No attribute can be constant within each class (add random noise) • Canonical variates for multiclass • Especially useful when the class means are ordered, or lie along a simple curve in attribute space • Logistic regression maximizes the conditional P(class|x) • Assumes that the form of the underlying density functions (or their ratio) is known

  11. Support Vector Machines • Definition: • A set of related supervised learning methods used for classification • Viewing input data as two sets of vectors in an n-dimensional space • Constructing a separating hyperplane in the space • which maximizes the margin between the two data sets
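To make "maximizes the margin" concrete, the sketch below only measures the margin of a given hyperplane; it does not train an SVM. The function name `geometric_margin`, the label convention {-1, +1}, and the four toy points are assumptions made for illustration.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    y holds labels in {-1, +1}; a negative result means a point is misclassified.
    """
    w = np.asarray(w, dtype=float)
    return float(np.min(y * (X @ w + b)) / np.linalg.norm(w))

# Hypothetical data: class -1 at x = 0, class +1 at x = 4.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
y = np.array([-1, -1, 1, 1])

margin_centered = geometric_margin([1.0, 0.0], -2.0, X, y)  # separating line x = 2
margin_shifted = geometric_margin([1.0, 0.0], -1.0, X, y)   # separating line x = 1
```

Both lines separate the data, but the centered one leaves a larger gap to the closest points; those closest points are the support vectors.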

  12. Linear Support Vector Machines Support Vectors

  13. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  14. Decision Tree • Definition • Space is divided into boxes, and at each stage in the procedure, each box is examined to see if it may be split into more boxes • The split usually being parallel to the coordinate axes • Structure

  15. Decision Tree: Example

  16. Classification Process of a Tree • Generate_tests: • Attr = val, Attr < val, Attr In Subset, … • Test: • Gini, Entropy, KS, Twoing, … • Stop_criterion: • Num cases, significance (e.g., χ²-test), overgrow & prune • Info: • Most frequent class ("mode"), … • Prune: • Reduced-error, cost complexity, … • [Figure labels: Training Data, Learning Algorithm, Classification Rules, Testing Data]
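The "Generate_tests" and "Test" steps can be illustrated with a small sketch that scores every candidate `Attr < val` split by its weighted impurity. It covers only Gini and entropy, and the helper names and sample values are hypothetical rather than taken from the lecture.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best 'Attr < val' test."""
    best = (None, float("inf"))
    # Skip the smallest value: splitting there would leave the left box empty.
    for t in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(values)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical single attribute with two classes.
split = best_threshold([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"])
```

A full tree learner would apply this search recursively to each resulting box until the stop criterion fires, then prune.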

  17. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  18. Density Estimation • Definition • Classification rule that assigns x to class $A_d$ if $p(A_d \mid x) = \max_i p(A_i \mid x)$, i.e., the class that maximizes the a posteriori probability • For two classes $A_i$ and $A_j$ • Kernel methods • Naive Bayes: assumes variable independence • Projection methods • Example: Nearest Neighbors • 1-nearest neighbor rule, in which the training set is searched for the ‘nearest’ (in a defined sense) previous example, whose class is then assumed for the new case • K-nearest neighbor, the natural extension, taking the most frequent class among the k nearest neighbors • Parameter-free model (apart from the choice of distance!)
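The nearest-neighbor rules above fit in a few lines. The sketch assumes Euclidean distance as the "defined sense" of nearness, which the slide leaves open; `knn_predict` and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label). Euclidean distance is an assumption."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the k nearest stored examples.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training set: every sample is stored and searched at query time.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
label_3nn = knn_predict(train, (2, 2), k=3)
label_1nn = knn_predict(train, (6, 5), k=1)
```

Setting `k=1` gives the 1-nearest neighbor rule; larger `k` gives the majority-vote extension.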

  19. Nearest Neighbours: Example

  20. Density Estimation: Issues • The presence of irrelevant variables affects the model • Requires numerical variables, or a special distance function for categorical variables • The scale of each variable affects the model • The whole sample must be stored

  21. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  22. Artificial Neural Networks • Black Box Modeling • An artificial neural network (ANN) is a mathematical model or computational model based on biological neural networks • It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation • In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase

  23. The Neuron Model • Inputs: I1 … I5 • Weights: w1 … w5 • $\text{sum} = \sum_n w_n I_n$ • f: activation function • Output: $O_i = f_i\left(\sum_n w_n I_n\right)$
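The neuron model on this slide is a weighted sum of the inputs passed through an activation function. A minimal sketch follows; the default `math.tanh` activation and the example numbers are assumptions, since the slide does not name a specific f.

```python
import math

def neuron(inputs, weights, f=math.tanh):
    """Output O = f(sum_n w_n * I_n) for a single artificial neuron."""
    total = sum(w * i for w, i in zip(weights, inputs))
    return f(total)

# Hypothetical inputs and weights; the weighted sum here is 1.0.
linear_out = neuron([1, 0, 1], [0.5, -1.0, 0.5], f=lambda s: s)
step_out = neuron([1, 0, 1], [0.5, -1.0, 0.5], f=lambda s: 1 if s > 0 else 0)
squashed_out = neuron([1, 0, 1], [0.5, -1.0, 0.5])
```

Swapping `f` changes only the squashing step; the weighted sum is the same in all three calls.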

  24. Multi Layer Perceptron Model • [Figure: a pattern on inputs I1 … I5 is fed forward through weights w31 … w34 to output O1; errors e31 … e34 are fed back] • e = d1 - o1 • e: error • d: desired output • o: network output • w: weights

  25. Artificial Neural Networks: Principles • Universal approximator • Weights are modified to minimize classification error • The outputs represent the P ( Class | example )

  26. Back-propagation Learning • The error E is a function of the weights and the training data • Minimize the error E using steepest descent • Downhill direction: $-dE/dW$ • $W_{new} = W_{old} - \eta \, dE/dW_{old}$ • $\eta$: learning rate • [Figure: error E plotted against weight W, descending to a minimum]
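The steepest-descent update on this slide (new weight = old weight minus learning rate times dE/dW) can be tried on a one-dimensional error surface. This is only a sketch of the update rule, not of back-propagation through a network; the quadratic error E(w) = (w - 3)^2 and the function name are hypothetical.

```python
def steepest_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly apply W_new = W_old - learning_rate * dE/dW_old."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Hypothetical error surface E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w_min = steepest_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate that is too large would make the iterates overshoot instead.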

  27. Artificial Neural Networks: Issues • To make a generalizable network one has to determine a suitable number of hidden nodes • Too few → network does not learn enough… • Too many → over-learning

  28. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  29. Comparison of Classification Methods • MV: Missing Values • Cost matrix: learning(L),testing(T) or not(N) • Interpretability: easily understood classifier • Method: comprehensibility of the method • Parameters: good guidelines or automatic selection of parameters • User-fr: user friendliness • Data: numerical, categorical

  30. Agenda • Introduction to Classification • Discriminant Analysis • Decision Tree • Density Estimation • Artificial Neural Networks • Comparison of Classification Methods • An Example of Classification for Life Log Analysis

  31. VTT Technical Research Center of Finland • Representing complex entities: Any single methodology alone is often insufficient • For example, knowledge representation needs a hierarchy utilizing heterogeneous methodologies (Minsky) • Van Laerhoven and Cakmakci (2000) • 4 layers (statistical cues, Kohonen clustering, kNN classification, and a Markov chain supervisory layer) • Advantage of using high-level features • Control and ‘accurate touch’ to the data: Maintained up to the classification layer • If the feature itself is discriminative enough: classification layer is not even necessary • [Figure labels: Sensing, Representation, Fuzzy Logic, Clustering, Modeling & Classifying, Naïve Bayes, Service, Fuzzy Control]

  32. Overview • Disadvantage of using high-level features • The development of proper signal processing algorithms is laborious (compared to the use of very low-level features)

  33. Context Representation • A central problem in utilizing context information in application development • Difficulty to add meaning of concepts from real world to extracted features • “When environment loudness is SILENT, set ringing volume LOW” > “When environment loudness is -20dB, set ringing tone volume to 2.5” • Context extraction: Abstract raw sensor signals and compress information • How well features describe some parts of the real world context • Extracted features: Context atoms (the smallest amount of context information)

  34. Context Information Processing (1)

  35. Context Information Processing (2) • Context atom vector (k = maximum number of context atoms; the values of context atoms are between 0 and 1) • Advantages • Composing the ontology for sensor-based context information • Direct control of various types of applications • Multidimensional context atom vector C: Utilization of explorative data analysis methods • Extendable (straightforward to add new context atoms to the vector C)

  36. Some of the Quantized Features

  37. Naive Bayes Classifier • CPT (conditional probability table) learning • Structure • Conditional probability distribution: counting • Joint probability distribution
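Learning the conditional probability tables by counting can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the feature names, context labels, Laplace smoothing (`alpha`), and the assumption that every feature takes `n_values` distinct values are all hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (features_dict, label). Returns class counts and CPT counts."""
    class_counts = Counter(label for _, label in samples)
    cpt = defaultdict(Counter)  # (feature name, label) -> counts of observed values
    for features, label in samples:
        for name, value in features.items():
            cpt[(name, label)][value] += 1
    return class_counts, cpt

def classify(class_counts, cpt, features, alpha=1.0, n_values=2):
    """Pick the label maximizing P(label) * prod P(value | label), with smoothing."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        p = count / total
        for name, value in features.items():
            p *= (cpt[(name, label)][value] + alpha) / (count + alpha * n_values)
        scores[label] = p
    return max(scores, key=scores.get)

# Hypothetical quantized context features and labels.
samples = [
    ({"loudness": "silent", "motion": "still"}, "resting"),
    ({"loudness": "silent", "motion": "still"}, "resting"),
    ({"loudness": "loud", "motion": "still"}, "resting"),
    ({"loudness": "loud", "motion": "moving"}, "walking"),
    ({"loudness": "loud", "motion": "moving"}, "walking"),
    ({"loudness": "silent", "motion": "moving"}, "walking"),
]
class_counts, cpt = train_naive_bayes(samples)
guess = classify(class_counts, cpt, {"loudness": "silent", "motion": "still"})
```

Training is a single counting pass over the data and classification is a product of table lookups, which is why the approach is computationally cheap.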

  38. Experiment • Two kinds of data • Measured in controlled conditions (1) • Real world scenario (2) • Experiment (1) • Nine individual context class data sets were recorded (each five times separately in controlled conditions) • Length of each nine-class data set: 30 seconds (30 sets of 1-second atoms) • Nine sets, each totaling 150 vectors • To create variability in the data • Five samples of different songs from two albums • Speech and acceleration-based contexts by four different persons • Training data set = test data set

  39. Experiment (2) • Home environment • Multiple simultaneous contexts • Only those contexts that can be defined in each phase are counted • Five test subjects (each performed the scenario twice) → ten test data sets with a length of 6-8 minutes • The last test data set → ignored (box wires had been broken) • Final number of scenario data sets → 9 (60 minutes) • Different songs, three different cars • Manual on-the-fly segmentation is not accurate, a lot of disturbances • 1) training data set = test data set 2) cross validation

  40. Recognition Results as an Intensity Graph Resolution = 1 second (5 second moving average)
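The 5-second moving average applied to the 1-second recognition results can be sketched as below. Whether the original used a trailing or a centered window is not stated, so the trailing window here is an assumption, as are the function name and sample values.

```python
def moving_average(values, window=5):
    """Trailing moving average; early outputs average over the samples seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical per-second class probabilities with a single spike at the end.
smoothed = moving_average([0, 0, 0, 0, 5], window=5)
```

Smoothing spreads an isolated spike across the window, which damps one-second misclassifications at the cost of a slower response to real context changes.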

  41. Classification Accuracy (9 contexts, 2 BN)

  42. Recognition Results • Recognition Results for 13 Individual Contexts (9 scenarios) BN Direct

  43. Context Data • Data gathering → Laptop • 1-second resolution (4-second sliding window) • 3 minutes • Experiments → Offline

  44. Experiments with User Scenario • Fluctuation of control signals • Oscillation of various information representation levels • Moving averaging methods or smoothing in changing font sizes • Extension (adding new types of contexts) • Complicate the development of a control system

  45. User Reactions • Ten cell phone users were interviewed • 5 of them → normal users (use basic communication applications) • 5 of them → active cell phone users (actively use most of the cell phone applications and features) • Interview process • Explaining the scenario presented in the table to a user • Explaining to a user how adaptive applications operate in general and particularly during the user scenario • Control signals and screenshots of adapted service contents • Users were interviewed • Application adaptation is an acceptable feature, but mistakes in adaptation are not accepted • Users want to have control over the device’s operations

  46. Summary • Classification for context awareness • Classification method application • Applying naïve Bayesian classifier in learning and classifying the contexts of a mobile device user • Mobile device-oriented → sensors are embedded (instead of being distributed into the environment) • To recognize static short-term events (descriptive and generic at the intermediate levels of the context representation layer structure), sequences of these events could be utilized at the upper level • Naïve Bayesian learning and inference: Straightforward and computationally very efficient

  47. Bibliography • Book: Machine Learning, Neural and Statistical Classification. Editors: D. Michie, D.J. Spiegelhalter, C.C. Taylor, Statlog Project • Book: Weiss, S. M. and Kulikowski, C. A. (1991). Computer systems that learn: classification and prediction methods from statistics, neural networks, machine learning and expert systems. Morgan Kaufmann, San Mateo, CA. • Book: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition • Paper: C.M. van der Walt and E. Barnard, “Data characteristics that determine classifier performance”, in Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp.160-165, 2006 • Classifier showdown: A practical comparison of classification algorithms - Link • Statistical Pattern Recognition Toolbox for Matlab - Link
