Midterm Review CS479/679 Pattern Recognition, Spring 2019 – Dr. George Bebis
Reminders
• Graduate students need to select a paper for presentation (about a 15-minute presentation)
• Send me your top three choices by Thursday, March 28th
• Presentations will be scheduled on April 30th and May 2nd
• Check the posted guidelines when preparing your presentation
• Guest Lectures
  • Dr. Tin Nguyen, March 28th (Bioinformatics)
  • Dr. Emily Hand, April 4th (Face Recognition)
• Colloquium
  • Dr. George Vasmatzis, Mayo Clinic, April 19th
Midterm Material
• Intro to Pattern Recognition
• Math review (probabilities, linear algebra)
• Bayesian Decision Theory
• Bayesian Networks
• Parameter Estimation (ML and Bayesian)
• Dimensionality Reduction
• Feature Selection
Case studies are also included in the midterm.
Intro to Pattern Recognition (PR)
• Definitions
  • Pattern, Class, Class model
  • Classification vs Clustering
• PR applications
• What are the main classification approaches?
  • Generative: model p(x, ω); estimate P(ω/x)
  • Discriminative: estimate P(ω/x) directly
(x: features, ω: class)
Some Important Issues • Feature Extraction • Model Selection (i.e., simple vs complex) • Generalization
Feature Extraction
• Which features?
  • Discriminative features
• How many?
  • Curse of dimensionality
  • Dimensionality reduction
  • Feature selection
• Missing features
  • Marginalization (i.e., compute P(ωi/xg) using only the observed features xg)
Simple vs Complex Models • Complex models are tuned to the particular training samples rather than to the characteristics of the true model (overfitting or memorization).
Generalization
• The ability of a classifier to produce correct results on novel patterns.
• How can we improve generalization performance?
  • More training examples (i.e., they lead to better model estimates).
  • Simpler models usually yield better performance.
Probabilities • Prior and conditional probabilities • Law of total probability • Bayes rule • Random variables • pdf/pmf and PDF • Independence • Marginalization • Multivariate Gaussian • Covariance matrix decomposition • Whitening transformation
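As a quick refresher on the covariance decomposition and whitening items above, here is a minimal numpy sketch (the data and variable names are mine, not from the slides): decompose the covariance as Σ = ΦΛΦᵀ and whiten with Aw = ΦΛ^(-1/2), so the transformed data has (approximately) identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample correlated 2-D Gaussian data (illustrative numbers).
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)   # n x d

# Covariance decomposition: Sigma_hat = Phi @ diag(lam) @ Phi.T
Sigma_hat = np.cov(X, rowvar=False)
lam, Phi = np.linalg.eigh(Sigma_hat)                # eigenvalues, eigenvectors

# Whitening transformation: Aw = Phi @ diag(lam^(-1/2))
Aw = Phi @ np.diag(1.0 / np.sqrt(lam))
Y = (X - X.mean(axis=0)) @ Aw                       # whitened data

print(np.round(np.cov(Y, rowvar=False), 2))         # ~ identity matrix
```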
Linear Algebra • Vector dot product • Orthogonal/Orthonormal vectors • Linear dependence/independence • Space spanning • Vector basis • Matrices (diagonal, symmetric, transpose, inverse, trace, rank) • Eigenvalues/Eigenvectors • Matrix diagonalization/decomposition
Decision Rule Using Bayes Rule
Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
• The Bayes rule is optimum (i.e., it minimizes the average probability of error P(error) = ∫ P(error/x) p(x) dx), where P(error/x) = min[P(ω1/x), P(ω2/x)]
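A tiny numeric illustration of the rule above (all numbers are made up for illustration): compute the posteriors with Bayes rule and decide the class with the larger one.

```python
# Hypothetical two-class example: class-conditional densities evaluated at an
# observed x, plus priors.  The numbers are illustrative, not from the slides.
p_x_given_w = {"w1": 0.30, "w2": 0.05}     # p(x/w1), p(x/w2)
prior       = {"w1": 0.40, "w2": 0.60}     # P(w1), P(w2)

# Bayes rule: P(wi/x) = p(x/wi) P(wi) / p(x)
p_x = sum(p_x_given_w[w] * prior[w] for w in prior)
posterior = {w: p_x_given_w[w] * prior[w] / p_x for w in prior}

decision = max(posterior, key=posterior.get)   # decide the class with larger posterior
p_error = min(posterior.values())              # P(error/x)
print(posterior, decision, p_error)
```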
Decision Rule Using Conditional Risk
• Suppose λ(αi/ωj) is the loss (or cost) incurred for taking action αi when the classification category is ωj
• The expected loss (or conditional risk) of taking action αi: R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
Decision Rule Using Conditional Risk • Bayes decision rule minimizes overall risk R by: • Computing R(αi /x) for every αi given an x • Choosing the action αi with the minimum R(αi /x)
Zero-One Loss Function
• Assign the same loss to all errors: λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j
• The conditional risk corresponding to this loss function: R(αi/x) = Σj≠i P(ωj/x) = 1 − P(ωi/x)
Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
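A minimal sketch of the minimum-risk rule from the slides above, using a made-up loss matrix λ(αi/ωj); with the zero-one loss it reduces to picking the class with the maximum posterior.

```python
import numpy as np

# Illustrative numbers: posteriors P(wj/x) and a loss matrix lambda(ai/wj).
posterior = np.array([0.7, 0.3])          # P(w1/x), P(w2/x)
loss = np.array([[0.0, 2.0],              # lambda(a1/w1), lambda(a1/w2)
                 [1.0, 0.0]])             # lambda(a2/w1), lambda(a2/w2)

# Conditional risk: R(ai/x) = sum_j lambda(ai/wj) P(wj/x)
R = loss @ posterior
best_action = np.argmin(R)                # Bayes rule: take the minimum-risk action
print(R, "take action a%d" % (best_action + 1))
```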
Discriminant Functions
• Assign a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
• Examples: gi(x) = P(ωi/x), gi(x) = p(x/ωi)P(ωi), gi(x) = ln p(x/ωi) + ln P(ωi)
Discriminant Function for Multivariate Gaussian
• Σi = σ²I: linear discriminant
• Need to know how to derive the decision boundary
• Special case: equal priors → minimum distance classifier
Discriminant Function for Multivariate Gaussian (cont’d)
• Σi = Σ: linear discriminant
• Need to know how to derive the decision boundary
• Special case: equal priors → Mahalanobis distance classifier
Discriminant Function for Multivariate Gaussian (cont’d)
• Σi = arbitrary: quadratic discriminant
• Need to know how to derive the decision boundary
• Decision boundaries are hyperquadrics
(see the summary of all three cases below)
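The equations on these three slides did not survive the export; the forms below are the standard textbook discriminants for the Gaussian case (e.g., Duda, Hart & Stork), which is presumably what the missing figures showed. In every case the decision boundary between ωi and ωj is gi(x) = gj(x).

```latex
% Case 1: \Sigma_i = \sigma^2 I (linear; equal priors -> minimum-distance classifier)
g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)

% Case 2: \Sigma_i = \Sigma (linear; equal priors -> Mahalanobis-distance classifier)
g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)

% Case 3: \Sigma_i arbitrary (quadratic discriminant; hyperquadric boundaries)
g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)
                  - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)
```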
Bayesian Nets • How is it defined? • Directed acyclic graph (DAG) • Each node represents one of the system variables. • Each variable can assume certain values (i.e., states) and each state is associated with a probability (discrete or continuous). • A link joining two nodes is directional and represents a causal influence (e.g., X depends on A or A influences X).
Bayesian Nets (cont’d)
• Why are they useful?
  • They allow us to decompose a high-dimensional probability density function into lower-dimensional ones, e.g., P(a3, b1, x2, c3, d2) = P(a3)P(b1)P(x2 /a3,b1)P(c3 /x2)P(d2 /x2)
Bayesian Nets (cont’d) • What is the Markov property? • “Each node is conditionally independent of its ancestors given its parents” • Why is the Markov property important?
Computing Joint Probabilities • We can compute the probability of any configuration of variables in the joint density distribution: e.g., P(a3, b1, x2, c3, d2)=P(a3)P(b1)P(x2 /a3,b1)P(c3 /x2)P(d2 /x2)= 0.25 x 0.6 x 0.4 x 0.5 x 0.4 = 0.012
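A short sketch that reproduces the slide's 0.012 result by multiplying the relevant conditional-probability entries; the factorization follows the network structure implied by the expression above.

```python
# CPT entries used in the slide's example.
P_a3          = 0.25
P_b1          = 0.60
P_x2_given_ab = 0.40   # P(x2 / a3, b1)
P_c3_given_x2 = 0.50   # P(c3 / x2)
P_d2_given_x2 = 0.40   # P(d2 / x2)

# The joint probability factorizes along the network structure.
joint = P_a3 * P_b1 * P_x2_given_ab * P_c3_given_x2 * P_d2_given_x2
print(joint)   # 0.012
```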
Inference Example • Compute probabilities when some of the variables are unobserved (missing information)
Bayesian Nets (cont’d) • Given a problem, you should know how to: • Design the structure of the Bayesian Network (i.e., identify variables and their dependences) • Compute various probabilities (i.e., inference)
Parameter Estimation • What is the goal of parameter estimation? • Estimate the parameters of the class probability models. • What are the main parameter estimation methods we discussed in class? • Maximum Likelihood (ML) • Bayesian Estimation (BE)
Parameter Estimation (cont’d)
• Compare ML with BE:
  • ML assumes that the values of the parameters are fixed but unknown.
    • The best estimate is obtained by maximizing p(D/θ)
  • BE assumes that the parameters θ are random variables that have some known a-priori distribution p(θ).
    • Estimates a distribution rather than making a point estimate like ML
    • Note: the estimated distribution might not be of the assumed form
ML Estimation
• Using the independence assumption: p(D/θ) = Πk p(xk/θ)
• Using the log-likelihood: ln p(D/θ) = Σk ln p(xk/θ)
• Find θ̂ by maximizing ln p(D/θ) (i.e., set ∇θ ln p(D/θ) = 0)
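A minimal numpy sketch of ML estimation for the Gaussian case (the closed-form solution obtained by setting the gradient of the log-likelihood to zero): the sample mean and the 1/n sample covariance. The data here is synthetic, just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], size=2000)
n = X.shape[0]

# ML estimates for a multivariate Gaussian:
mu_ml = X.mean(axis=0)                      # sample mean
Xc = X - mu_ml
Sigma_ml = (Xc.T @ Xc) / n                  # note 1/n, not 1/(n-1)

print(np.round(mu_ml, 2))
print(np.round(Sigma_ml, 2))
```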
Maximum A-Posteriori Estimator (MAP)
• Assuming a known p(θ), MAP maximizes p(D/θ)p(θ) (i.e., the posterior p(θ/D) up to a normalization constant)
• Find θ̂ by maximizing ln[p(D/θ)p(θ)]
• When is MAP equivalent to ML? When p(θ) is uniform.
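Written out, the two point estimators from the last two slides are:

```latex
\hat{\theta}_{ML}  = \arg\max_{\theta}\; \ln p(D/\theta)
\qquad
\hat{\theta}_{MAP} = \arg\max_{\theta}\; \big[\ln p(D/\theta) + \ln p(\theta)\big]
```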
Multivariate Gaussian Density, θ = μ
• ML estimate and MAP estimate (summarized below)
Multivariate Gaussian Density, θ = (μ, Σ)
• ML estimates (summarized below)
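The estimates shown as images on these two slides are the standard ones; for the MAP line I assume a Gaussian prior p(μ) = N(μ0, Σ0), which is not stated on the slide.

```latex
% theta = mu and theta = (mu, Sigma), ML estimates:
\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k
\qquad
\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T

% theta = mu, MAP estimate under a Gaussian prior N(mu_0, Sigma_0):
\hat{\boldsymbol{\mu}}_{MAP} = \big(n\Sigma^{-1}+\Sigma_0^{-1}\big)^{-1}
      \Big(\Sigma^{-1}\sum_{k=1}^{n}\mathbf{x}_k + \Sigma_0^{-1}\boldsymbol{\mu}_0\Big)
```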
BE Estimation
• Step 1: Compute p(θ/D)
• Step 2: Compute p(x/D)
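The two steps, written out in their general form (the standard Bayesian estimation equations):

```latex
% Step 1: posterior over the parameters
p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{\int p(D/\theta)\,p(\theta)\,d\theta}

% Step 2: predictive density
p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta
```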
Interpretation of BE Solution
• The BE solution says that when we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ (the integral above).
• The samples D exert their influence on p(x/D) through p(θ/D).
Incremental Learning • p(θ/Dn) can be computed recursively for n = 1, 2, … (see the recursion below)
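Writing Dⁿ = {x1, …, xn}, the recursion is:

```latex
p(\theta/D^{n}) = \frac{p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})}
                       {\int p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})\,d\theta},
\qquad p(\theta/D^{0}) = p(\theta)
```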
Relation to ML Solution
• If p(D/θ) peaks sharply at θ̂ (i.e., the ML solution), then p(θ/D) will, in general, also peak sharply at θ̂ (assuming p(θ) is broad and smooth).
• Therefore, ML is a special case of BE!
Univariate Gaussian, θ = μ
• The Bayesian estimate μn converges to the ML estimate (the sample mean) as n → ∞ (see the formulas below)
Multivariate Gaussian, θ = μ
• The BE solution converges to the ML solution as n → ∞
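For the univariate case with known σ² and prior p(μ) = N(μ0, σ0²), the standard results (presumably what the missing equations showed) are below, where μ̂n is the sample mean:

```latex
p(\mu/D) = N(\mu_n, \sigma_n^2), \qquad
\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\hat{\mu}_n
      + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}

p(x/D) = N(\mu_n,\; \sigma^2+\sigma_n^2),
\qquad \mu_n \to \hat{\mu}_n \text{ and } \sigma_n^2 \to 0 \text{ as } n \to \infty
```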
Computational Complexity (d: dimensionality, n: # training samples, c: # classes)
• BE has higher learning complexity than ML
• Both have the same classification complexity
Main Sources of Error in Classifier Design • Bayes error • The error due to overlapping densities p(x/ωi) • Model error • The error due to choosing an incorrect model. • Estimation error • The error due to incorrectly estimated parameters.
Dimensionality Reduction
• What is the goal of dimensionality reduction and why is it useful?
  • Reduce the dimensionality of the data by eliminating redundant and irrelevant features
  • Fewer training samples needed, faster classification
• How is dimensionality reduction performed?
  • Map the data to a lower-dimensional sub-space through a linear (or non-linear) transformation y = UTx, where x ϵ RN, U is NxK, and y ϵ RK
  • Alternatively, select a subset of the original features.
PCA and LDA • What is the main difference between PCA and LDA? • PCA seeks a projection that preserves as much information in the data as possible. • LDA seeks a projection that best separates the data.
PCA • What is the PCA solution? • “Largest” eigenvectors (i.e., corresponding to the largest eigenvalues - principal components) of the covariance matrix of the training data. • You need to know the steps of PCA, its geometric interpretation, and how to choose the number of principal components.
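A compact numpy sketch of the PCA steps referenced above (center the data, eigendecompose the covariance, keep the K "largest" eigenvectors, project). The 95%-of-variance rule used here for choosing K is one common heuristic, not necessarily the one used in class.

```python
import numpy as np

def pca(X, var_to_keep=0.95):
    """X: n x N data matrix. Returns the mean, projection matrix U (N x K), projections Y."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # step 1: center the data
    Sigma = np.cov(Xc, rowvar=False)              # step 2: covariance matrix
    lam, Phi = np.linalg.eigh(Sigma)              # step 3: eigenvalues/eigenvectors
    order = np.argsort(lam)[::-1]                 # sort by decreasing eigenvalue
    lam, Phi = lam[order], Phi[:, order]
    # step 4: choose K so the kept eigenvalues explain enough of the variance
    K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_to_keep)) + 1
    U = Phi[:, :K]                                # N x K projection matrix
    Y = Xc @ U                                    # step 5: y = U^T (x - mu)
    return mu, U, Y

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0], [[5, 2, 0], [2, 3, 0], [0, 0, 0.1]], size=500)
mu, U, Y = pca(X)
print(U.shape, Y.shape)
```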
Face Recognition Using PCA
• You need to know how to apply PCA for face recognition and face detection.
• What practical issue arises when applying PCA for face recognition? How do we deal with it?
  • The covariance matrix AAT is typically very large (i.e., N²xN² for NxN images)
  • Consider the alternative matrix ATA, which is only MxM (M is the number of training face images)
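A sketch of the trick mentioned above, using random vectors as stand-ins for face images (the data and variable names are mine): solve the small MxM eigenproblem for ATA and map the eigenvectors back, since ATA v = λv implies AAT (Av) = λ(Av).

```python
import numpy as np

rng = np.random.default_rng(3)
N2, M = 1024, 20                        # N2 = N*N pixels per image, M training images
faces = rng.random((M, N2))             # stand-ins for vectorized face images

mean_face = faces.mean(axis=0)
A = (faces - mean_face).T               # N2 x M matrix of centered images

# Solve the small M x M eigenproblem for A^T A instead of the huge N2 x N2 one for AA^T.
lam, V = np.linalg.eigh(A.T @ A)
keep = lam > 1e-10                      # drop near-zero eigenvalues (rank is at most M-1)
lam, V = lam[keep], V[:, keep]

U = A @ V                               # columns are eigenvectors of AA^T (up to scale)
U /= np.linalg.norm(U, axis=0)          # normalized "eigenfaces"

# Sanity check: AA^T u = lam * u for the top eigenvector, without ever forming AA^T.
u = U[:, -1]
print(np.allclose(A @ (A.T @ u), lam[-1] * u))
```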