Document Analysis: Fundamentals of Pattern Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
Outline
- Introduction
- Feature extraction and decision
- Role of training
- Feature selection
- Example: font recognition
- Bayesian decision theory
- Evaluation
Goals of Pattern Recognition
- Pattern recognition aims at discovering and identifying patterns in raw data: it consists of assigning symbols to data (patterns), based on a priori knowledge, often statistical information
- Pattern recognition is used for computer perception (image/sound analysis): in a preliminary step a sensor captures raw information, which is then interpreted to take decisions
- Pattern recognition can be thought of as a methodical way of reducing information in order to keep only the relevant meaning
Pattern Recognition Applications
Pattern recognition is involved in many applications:
- seismological survey
- speech recognition
- scientific imagery (biology, health care, physics, ...)
- satellite-based observation (military and civil applications, ...)
- document analysis, with several components: optical character recognition (OCR), font identification, handwriting recognition (off-line), graphics recognition
- computer vision (3D scene analysis)
- biometrics: person identification and authentication
- ...
Pattern recognition methodologies rely on other scientific domains: statistics, operations research, graph theory, artificial intelligence, ...
Origin of Difficulties
Pattern recognition is mainly an information overload problem. The difficulty arises from:
- variability of objects belonging to the same class
- distortion of the captured data (noise, degradations, ...)
Steps Involved in Pattern Recognition
Pattern recognition is basically a two-stage process:
- feature extraction, aiming at removing redundancy while keeping significant information
- classification, consisting in making a decision by associating a class label
(pipeline: observation → feature vector → class; see the sketch below)
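As a rough illustration (not part of the original slides), the sketch below shows the two stages on a toy problem: a hypothetical extract_features function turns a raw binary image into a feature vector, and a simple nearest-mean classifier assigns a class label. All names, classes and data are invented.

```python
import numpy as np

def extract_features(image):
    """Toy feature extraction: reduce a binary image to a small feature vector
    (ink density and relative center of gravity)."""
    ys, xs = np.nonzero(image)
    h, w = image.shape
    density = image.sum() / image.size
    cy, cx = ys.mean() / h, xs.mean() / w
    return np.array([density, cy, cx])

def classify(x, class_means):
    """Toy classifier: assign the label of the closest class model (mean vector)."""
    labels = list(class_means)
    dists = [np.linalg.norm(x - class_means[c]) for c in labels]
    return labels[int(np.argmin(dists))]

# Tiny example: two hypothetical classes described by their mean feature vectors
class_means = {"thin": np.array([0.1, 0.5, 0.5]), "bold": np.array([0.4, 0.5, 0.5])}
observation = (np.random.rand(32, 32) > 0.7).astype(int)   # fake binary image
print(classify(extract_features(observation), class_means))
```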
Role of Training
- Classifiers (tools that perform classification tasks) are generally designed to be trained
- Each class is characterized by a model
- Models are built from representative training data
(diagram: features → extraction → decision → classes; the models used by the decision stage are built by training)
Supervised vs. Unsupervised Training
Two different situations may occur regarding the training material:
- Supervised training is performed when the training samples are labeled with the class they belong to: each class is associated with a set of training samples Ti = {xi1, xi2, ..., xiNi}, supposed to be statistically representative of the class
- Unsupervised training is performed when the training samples are statistically representative but mixed over all classes: T = {x1, x2, ..., xn}
A small sketch of the two data layouts follows.
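Purely to illustrate the two situations (not taken from the slides), the snippet below contrasts a labeled training set, one list of samples per class, with an unlabeled one where all samples are pooled; the feature values and class names are invented.

```python
import numpy as np

# Supervised setting: one set of samples T_i per class label (values are made up)
supervised = {
    "italic": [np.array([2.1, 0.3]), np.array([2.4, 0.2])],
    "roman":  [np.array([0.9, 0.5]), np.array([1.1, 0.6])],
}

# Unsupervised setting: the same samples, but pooled without labels
unsupervised = [x for samples in supervised.values() for x in samples]

print(len(supervised["italic"]), len(unsupervised))  # 2 samples per class, 4 in total
```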
Feature Selection
Features are selected according to the application. They should be chosen carefully by considering:
- discrimination power between classes
- robustness to intra-class distortions and noise
- global statistical independence (spread over the entire feature space)
- "fast computation"
- reasonable dimension (number of features)
Features for Character Recognition
Given a binary image of a character, many features can be used for character recognition:
- size, i.e., width and height of the bounding box
- position of the baseline (if available)
- weight (number of black pixels)
- perimeter (length of the contours)
- center of gravity
- moments (second and third order in both directions)
- distributions of horizontal and vertical runs
- number of intersections with a (possibly random) set of lines
- length and structure (singular points, holes) of the skeleton
- local features computed on sub-images
- ...
A sketch of how some of these features can be computed is given below.
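As a hedged sketch (my own reconstruction, not the course code), the function below computes a few of the listed features from a NumPy binary image: bounding-box size, weight, center of gravity, and the mean horizontal run length. Skeleton-, contour- and moment-based features are omitted.

```python
import numpy as np

def run_lengths(row):
    """Lengths of the maximal runs of 1s in a 1-D binary array."""
    padded = np.concatenate(([0], row, [0]))
    changes = np.flatnonzero(np.diff(padded))
    return changes[1::2] - changes[::2]

def character_features(img):
    """img: 2-D array of 0/1, where 1 = black (ink). Returns a dict of simple features."""
    ys, xs = np.nonzero(img)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    weight = int(img.sum())                       # number of black pixels
    cy, cx = float(ys.mean()), float(xs.mean())   # center of gravity
    h_runs = np.concatenate([run_lengths(r) for r in img])  # horizontal runs
    return {"width": int(width), "height": int(height), "weight": weight,
            "center": (cy, cx), "h_run_mean": float(h_runs.mean())}

# Tiny usage example on a fake 5x5 character
img = np.array([[0,1,1,1,0],
                [0,1,0,1,0],
                [0,1,1,1,0],
                [0,1,0,1,0],
                [0,1,0,1,0]])
print(character_features(img))
```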
Font Recognition: Goal
Goal: recognize the fonts of synthetically generated isolated words, given as binary (black & white) or grey-level images at 300 dpi. 12 standard font classes are considered:
- 3 families: Arial, Courier New, Times New Roman
- 4 styles: plain, italic, bold, bold italic
- single size: 12 pt
Font Recognition: Extracted Features
Words are segmented with a surrounding white border of 1 pixel. Some preprocessing steps are used:
- horizontal projection profile (hp)
- derivative of the horizontal projection profile (hpd)
The following features are calculated:
- hp-mean (or density): mean of hp
- hpd-stdev (or slanting): standard deviation of hpd
- hr-mean: mean of horizontal runs (up to length 12)
- hr-stdev: standard deviation of horizontal runs (up to length 12)
- vr-mean: mean of vertical runs (up to length 12)
- vr-stdev: standard deviation of vertical runs (up to length 12)
A sketch of how these features can be computed appears below.
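The following sketch (my own reconstruction, not the course code) computes the six features from a binary word image, reusing the run-length helper from the character-feature sketch above. Truncating runs at length 12 is one possible reading of "up to length 12" on the slide.

```python
import numpy as np

def run_lengths(row):
    """Lengths of the maximal runs of 1s in a 1-D binary array."""
    padded = np.concatenate(([0], row, [0]))
    changes = np.flatnonzero(np.diff(padded))
    return changes[1::2] - changes[::2]

def font_features(img, max_run=12):
    """img: 2-D 0/1 array of a word (1 = black). Returns the six slide features."""
    hp = img.sum(axis=1)                      # horizontal projection profile
    hpd = np.diff(hp)                         # derivative of the profile
    h_runs = np.concatenate([run_lengths(r) for r in img])     # horizontal runs
    v_runs = np.concatenate([run_lengths(c) for c in img.T])   # vertical runs
    h_runs = np.minimum(h_runs, max_run)      # runs capped at length 12 (assumption)
    v_runs = np.minimum(v_runs, max_run)
    return {"hp-mean": float(hp.mean()),
            "hpd-stdev": float(hpd.std()),
            "hr-mean": float(h_runs.mean()), "hr-stdev": float(h_runs.std()),
            "vr-mean": float(v_runs.mean()), "vr-stdev": float(v_runs.std())}

# Usage on a fake word image
img = (np.random.rand(20, 80) > 0.6).astype(int)
print(font_features(img))
```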
Font Recognition: Illustration of Features
Basic image-processing features used are:
- horizontal projection profile
- distribution of horizontal runs (from 1 to 11)
- distribution of vertical runs (from 1 to 11)
Font Recognition: Decision Boundaries on a Single Feature (1)
Some single features are highly discriminant for some font sets:
- hpd-stdev discriminates ■ roman and ■ italic fonts
- hr-mean discriminates ■ normal and ■ bold fonts
Font Recognition: Decision Boundaries on a Single Feature (2)
Other features may partly discriminate font sets:
- hr-mean can partly discriminate ■ Arial, ■ Courier and ■ Times
Font Recognition: Decision Boundaries on Multiple Features (1)
By combining two features, font discrimination is improved:
- the pair (hpd-stdev, vr-stdev) discriminates ■ roman and ■ italic fonts
(scatter plot: hpd-stdev on one axis, vr-stdev on the other)
Font Recognition: Decision Boundaries on Multiple Features (2)
Font family discrimination (■ Arial, ■ Courier and ■ Times) becomes possible by combining several pairs of features.
Bayesian Decision Theory
Bayesian decision making assumes that all information contributing to the decision can be stated in the form of probabilities:
- P(ωi): the a priori probability (or prior) of each class
- p(x|ωi): the class-conditional density function of the feature vector x, also called the likelihood of class ωi with respect to x
The goal is to determine the class ωi for which the a posteriori probability (or posterior) P(ωi|x) is the highest.
Bayesian Rule
The Bayes rule allows the a posteriori probability of each class to be calculated as a function of priors and likelihoods:
P(ωi|x) = p(x|ωi) P(ωi) / p(x)
where p(x) is called the evidence and can be considered as a normalization factor, i.e.,
p(x) = Σj p(x|ωj) P(ωj)
See the numerical sketch below.
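To make the rule concrete, here is a small numerical sketch (my own, with invented Gaussian class-conditional densities and priors) that evaluates the posteriors for a one-dimensional feature value and picks the most probable class.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as an invented class-conditional model."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Invented example: two classes with Gaussian likelihoods p(x|wi) and given priors
priors = np.array([0.1, 0.9])          # P(w1), P(w2)
params = [(0.0, 1.0), (2.0, 1.0)]      # (mean, std) of p(x|w1) and p(x|w2)

def posteriors(x):
    """Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x), with p(x) as normalization."""
    joint = np.array([gauss_pdf(x, m, s) * p for (m, s), p in zip(params, priors)])
    return joint / joint.sum()         # divide by the evidence p(x)

post = posteriors(1.0)
print(post, "-> decide class", int(np.argmax(post)) + 1)
```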
Influence of Posterior Probabilities
Example with a single feature: posterior probabilities P(ω1|x) and P(ω2|x) in two different cases regarding the a priori probabilities:
- P(ω1) = 0.5, P(ω2) = 0.5
- P(ω1) = 0.1, P(ω2) = 0.9
Probability of Error
Given a feature x of a given sample, the probability of error for a decision δ(x) = ωi is equal to
P(error|x) = 1 − P(ωi|x)
The overall probability of error is given by
P(error) = ∫ P(error|x) p(x) dx
Optimal Decision Boundaries
The minimal error is obtained by the decision δ(x) = ωi with
P(ωi|x) ≥ P(ωj|x) for all j
Decision Theory
- In the simplest case a decision consists in assigning to an observation x a class label: ωi = δ(x)
- A natural extension consists in adding a "rejection class" ωR, so that the decision δ(x) = ωR is also allowed
- In the most general case, the decision results in an action: αi = δ(x)
Optimal Decision Theory
Let us consider a loss function λij = λ(αi|ωj) defining the loss incurred by taking action αi when the true state of nature is ωj; usually λii ≤ λij, i.e., a correct decision costs no more than an error.
The risk of taking an action αi for a particular sample x is
R(αi|x) = Σj λij P(ωj|x)
The optimal decision consists in choosing the αi that minimizes this risk.
Optimal Decision
When λii = 0 and λij = 1 for j ≠ i, the optimal decision consists of minimizing the probability of error. The minimal error is obtained by the decision δ(x) = ωi with
P(ωi|x) ≥ P(ωj|x) for all j
or equivalently
p(x|ωi) P(ωi) ≥ p(x|ωj) P(ωj) for all j
In the case where all a priori probabilities are equal, this reduces to
p(x|ωi) ≥ p(x|ωj) for all j
Minimum Risk for Two Classes
Let λij = λ(αi|ωj) be the loss of action αi when the true state is ωj. The conditional risks of the two decisions are expressed as
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
Then the optimal decision rule becomes: decide ω1 if
(λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)
or equivalently
(λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2)
and, in the case where λ11 < λ21 and λ22 < λ12 (an error costs more than a correct decision), decide ω1 if the likelihood ratio satisfies
p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1)
See the numerical sketch below.
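Below is a small numerical sketch (invented loss values and posteriors, not from the slides) of the minimum-risk rule for two classes: the conditional risk R(αi|x) = Σj λij P(ωj|x) is computed for each action and the action with the smaller risk is chosen.

```python
import numpy as np

# Invented 2x2 loss matrix: lam[i, j] = loss of deciding class i+1 when the true class is j+1
lam = np.array([[0.0, 5.0],    # deciding w1: no loss if true w1, loss 5 if true w2
                [1.0, 0.0]])   # deciding w2: loss 1 if true w1, no loss if true w2

def min_risk_decision(posterior):
    """posterior: array [P(w1|x), P(w2|x)]. Returns the chosen class (1-based) and the risks."""
    risks = lam @ posterior            # R(a_i|x) = sum_j lam_ij P(w_j|x)
    return int(np.argmin(risks)) + 1, risks

# Example: even though w1 is more probable, the asymmetric losses favor deciding w2 here
decision, risks = min_risk_decision(np.array([0.7, 0.3]))
print("risks:", risks, "-> decide class", decision)
```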
Discriminant Functions
In the case of multiple classes, a pattern classifier can be specified by a set of discriminant functions gi(x), such that the decision ωi corresponds to
gi(x) ≥ gj(x) for all j
Thus, a Bayesian classifier is naturally represented by gi(x) = −R(αi|x). The choice of discriminant functions is not unique: gi(x) can be replaced by f(gi(x)) for any monotonically increasing function f. A minimum error-rate classifier can be obtained with
gi(x) = P(ωi|x), or equivalently gi(x) = p(x|ωi) P(ωi), or gi(x) = ln p(x|ωi) + ln P(ωi)
A small sketch follows.
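The sketch below (invented numbers) illustrates that the discriminant function is not unique: taking the logarithm of p(x|ωi)P(ωi) changes the values of gi(x) but not which class maximizes them, so the decision is unchanged.

```python
import numpy as np

# Invented likelihoods p(x|wi) and priors P(wi) for three classes at a fixed x
likelihoods = np.array([0.02, 0.10, 0.05])
priors      = np.array([0.50, 0.20, 0.30])

g_product = likelihoods * priors                    # g_i(x) = p(x|w_i) P(w_i)
g_log     = np.log(likelihoods) + np.log(priors)    # g_i(x) = ln p(x|w_i) + ln P(w_i)

# Both discriminant functions pick the same class (log is monotonically increasing)
print(np.argmax(g_product), np.argmax(g_log))       # same index
```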
Bayesian Rule in Higher Dimensions
The Bayesian rule can easily be generalized to the multidimensional case, where features are represented by a vector x:
P(ωi|x) = p(x|ωi) P(ωi) / p(x)
where
p(x) = Σj p(x|ωj) P(ωj)
Conclusion about Bayesian Decision
Bayesian decision theory provides a theoretical framework for statistical pattern recognition. This theory supposes the following probabilistic information to be known:
- the number of classes
- the a priori probabilities of each class
- the class-conditional feature distributions for each class
The remaining problem is how to estimate all these quantities:
- feature distributions are hard to estimate
- priors are seldom known
- even the number of classes is not always given
Performance Evaluation
Performance evaluation is a very important issue in pattern recognition:
- it gives an objective measure of the performance
- it allows different methods to be compared
Performance evaluation requires correctly labeled test data:
- test data should be different from the training data
- one strategy consists in cyclically using 80% of the data for training and the remaining 20% for evaluation (5-fold cross-validation)
A sketch of this strategy is given below.
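The following sketch (my own, with fake data and a trivial nearest-mean classifier standing in for the real one) shows the cyclic 80%/20% strategy: the data are split into 5 folds, each fold serves once as the test set while the other four are used for training, and the recognition rates are averaged.

```python
import numpy as np

def evaluate_split(train_x, train_y, test_x, test_y):
    """Placeholder: train a classifier and return its recognition rate on the test set.
    A trivial nearest-mean classifier is used just to make the sketch runnable."""
    means = {c: train_x[train_y == c].mean(axis=0) for c in np.unique(train_y)}
    labels = list(means)
    pred = [labels[int(np.argmin([np.linalg.norm(x - means[c]) for c in labels]))]
            for x in test_x]
    return float(np.mean(np.array(pred) == test_y))

# Fake data: 100 samples, 2 features, 2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [2, 2]], 50, axis=0)
y = np.repeat([0, 1], 50)

# Cyclic 80/20 evaluation (5 folds)
idx = rng.permutation(len(X))
folds = np.array_split(idx, 5)
rates = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    rates.append(evaluate_split(X[train], y[train], X[test], y[test]))
print("mean recognition rate:", np.mean(rates))
```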
Performance Measures: Recognition / Error Rates
Performance evaluation uses several measures:
- the recognition rate is the ratio (number of correct answers) / (total number of answers)
- the error rate is the ratio (number of incorrect answers) / (total number of answers)
- the rejection rate is the ratio (number of rejections) / (total number of answers)
These satisfy: recognition rate = 1 − (rejection rate + error rate)
Performance Measures: Recall & Precision
For binary decisions (a sample belongs to the class or not), two other measures are frequently used:
- recall is the ratio of correctly assigned samples to the size of the class
- precision is the ratio of correctly assigned samples to the number of assigned samples
Recall and precision typically move in opposite directions; the equal error rate is sometimes considered the best trade-off. Additionally, the harmonic mean of precision and recall, called the F-measure, is frequently used.
A small sketch of these measures follows.
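As a final sketch (invented counts), the snippet below computes recall, precision and the F-measure (the harmonic mean of the two) from the number of correctly assigned samples, the class size, and the number of samples assigned to the class.

```python
# Invented counts for one class
true_positives = 80      # samples of the class that were correctly assigned to it
class_size = 100         # actual number of samples belonging to the class
assigned = 120           # number of samples the classifier assigned to the class

recall = true_positives / class_size          # 0.80
precision = true_positives / assigned         # about 0.67
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
```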