Ensemble Learning Method for Hidden Markov Models
Anis Hamdi
May 19th, 2010
Advisor: Dr. Hichem Frigui
Outline
• Introduction
• Hidden Markov Models
• Ensemble HMM classifier
  • Motivations
  • Ensemble HMM Architecture
  • Similarity matrix computation
  • Hierarchical clustering
  • Model training
  • Decision level fusion
• Application to Landmine Detection
• Proposed Future Work
Introduction
• Classification is one of the key tasks in data mining.
• Statistical learning problems in many fields involve sequential data: speech signals, stock market prices, protein sequences, etc.
• The scope of this work is the classification of sequential data.
• The standard approach in model-based classification is to learn one model for each class. The main challenge for complex classification problems is how to account for intra-class variability.
• For static data, Gaussian mixture models have been widely used.
• For sequential data, we intend to use a mixture of Hidden Markov Models to model the potential intra-class variability.
Introduction (cont.)
[Figure: mixture of HMMs λ_1, λ_2, λ_3. S: sequences; the p_i are the HMM probabilities; the θ_i are the HMM parameters.]
Related work: Discrete HMMs
• Given a set of N states {s_1, s_2, …, s_N} and a set of M observation symbols {v_1, v_2, …, v_M}:
• The process moves from one state to another, generating a sequence of states q_1, q_2, …, q_t, … such that P(q_t = s_j | q_{t-1} = s_i) = a_ij, 1 ≤ i, j ≤ N: state transition probabilities.
• States are not visible, but at each state the model randomly generates one observation o_t according to P(o_t = v_k | q_t = s_i) = b_ik, 1 ≤ i ≤ N, 1 ≤ k ≤ M: state emission probabilities.
• The probability that the system starts in state i is P(q_1 = s_i) = π_i, 1 ≤ i ≤ N: initial state probabilities.
• The compact representation of the HMM is λ = (A, B, π).
[Figure: HMM graphical model with initial distribution π, hidden states q_1, q_2, q_3 linked by transitions A, and observations o_1, o_2, o_3 emitted through B.]
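The generative process defined on this slide can be illustrated with a minimal Python sketch. The function name `sample_hmm` and the list-of-lists layout for A and B are assumptions made for illustration; they do not come from the slides.

```python
import random

def sample_hmm(A, B, pi, T, rng):
    """Draw T hidden states and T observation symbols from a discrete HMM.

    A[i][j] = P(q_t = s_j | q_{t-1} = s_i), B[i][k] = P(o_t = v_k | q_t = s_i),
    pi[i] = P(q_1 = s_i). States and symbols are represented by their indices.
    """
    states, observations = [], []
    # Draw the first state from the initial state distribution.
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(state)
        # Emit one symbol according to the emission row of the current state.
        observations.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        # Move to the next state according to the transition row.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations
```

Only the observation list would be visible to a classifier; the state list is what the decoding problem tries to recover.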
Related work: Discrete HMMs (cont.)
• Evaluation problem. Given the model λ = (A, B, π) and the observation sequence O = o_1 o_2 … o_T, calculate the probability that model λ has generated sequence O.
  • Forward-backward procedure
• Decoding problem. Given the HMM λ = (A, B, π) and the observation sequence O = o_1 o_2 … o_T, find the most likely sequence of hidden states that produced this observation sequence O.
  • Viterbi algorithm
• Learning problem. Given K training observation sequences O = [O^(1) O^(2) … O^(K)] and the general structure of the HMM (number of hidden states and number of codewords), determine the HMM parameters λ = (A, B, π) that best fit the training data.
  • Maximum Likelihood (ML), Minimum Classification Error (MCE), and Variational Bayesian (VB) training
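The evaluation and decoding problems can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: it uses only the forward pass for evaluation, works with raw probabilities (adequate for short sequences; a practical version would use scaling or log-probabilities), and the function names are assumptions.

```python
def forward_likelihood(A, B, pi, obs):
    """Evaluation problem: Pr(O | lambda) computed with the forward recursion."""
    n = len(pi)
    # alpha[i] = Pr(o_1 .. o_t, q_t = s_i | lambda)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(A, B, pi, obs):
    """Decoding problem: most likely hidden state path for the sequence obs."""
    n = len(pi)
    # delta[i] = probability of the best path that ends in state i
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        # Best predecessor of each state j at this time step.
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        backpointers.append(prev)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(backpointers):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path
```

Both routines cost O(N²T) operations for N states and a sequence of length T.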
Ensemble HMM: Motivations
[Figure: sequences belonging to class 0 and sequences belonging to class 1.]
• Using all sequences to train a single model for class 1 may lead to:
  • Too much averaging of the sequences
  • Loss of discriminative characteristics within class 1
• One model needs to be learned for each group of similar sequences.
• How should the sequences be grouped? The ground truth labels alone are not sufficient.
Ensemble HMM: Overview
• We assume that the data is generated by K HMMs.
• These different models reflect the “natural” partitions within the data, regardless of the ground truth labels.
• Partitioning and model identification are achieved through clustering in the log-likelihood space.
• The resulting clusters can vary:
  • Different sizes
  • Homogeneous or heterogeneous
• Adapt the learning method to each type of cluster.
• Fuse the multiple HMM outputs.
eHMM: Block Diagram
[Figure: eHMM block diagram. Training data → similarity matrix computation → hierarchical clustering → per-cluster training → decision level fusion → confidence. Homogeneous clusters: BW training. Mixed clusters: MCE training of the class models λ_{J+1,1} … λ_{J+1,C} through λ_{E,1} … λ_{E,C}, each group followed by a Max block. Small clusters: VB training.]
eHMM: Similarity Matrix Computation
Fitting individual models to sequences
• Initial HMM for each sequence:
  • Fix the number of states, N. Cluster the sequence elements into N clusters. Each cluster center is a state representative.
  • Define the codebook symbols as the sequence vectors.
• Training:
  • The Baum-Welch algorithm is used to learn the HMM parameters that best fit a particular sequence.
  • Overfitting is intentional: each model should fit its own sequence as closely as possible, since the trained model is not used for generalization.
[Figure: one model per training sequence, λ_1, λ_2, …, λ_R.]
eHMM: Similarity Matrix Computation
Computing the similarity matrix
• Test each training sequence with each learned model.
• Construct a pair-wise penalized log-likelihood matrix L, using:
  • Pr(O_i | λ_j): the probability of sequence O_i being generated by model λ_j
  • s_q^(i): the representative of state q in model λ_i
  • q^(ij) = q_1^(ij) … q_T^(ij): the most likely hidden state sequence that generated the sequence O_i from the model λ_j
  • α: mixing factor
• L is not symmetric, so it is transformed into a symmetric similarity matrix.
eHMM: Similarity Matrix Computation
Penalized log-likelihood
• Log-likelihood of sequence O_i being generated from model λ_j: two similar sequences should have high likelihood values under each other's HMMs.
• Viterbi path mismatch term: two similar sequences should have similar Viterbi paths.
• Mixing factor α: trade-off parameter between the likelihood-based similarity and the Viterbi-path-mismatch-based similarity.
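The two ingredients above can be combined as in the following Python sketch. The exact formulas are not recoverable from the slides, so everything here is an assumption made for illustration: the mismatch term is taken as the mean Euclidean distance between the state representatives visited by the two Viterbi paths, the penalized score as a convex combination weighted by α, and the symmetrization as the average of L and its transpose.

```python
import math

def path_mismatch(reps_i, path_ii, reps_j, path_ij):
    """Mean distance between the state representatives visited by two paths.

    reps_i[q] is the representative of state q in model i; path_ii is the
    Viterbi path of sequence i in its own model, path_ij its path in model j.
    (Assumed form of the mismatch term.)
    """
    total = 0.0
    for q_own, q_other in zip(path_ii, path_ij):
        total += math.dist(reps_i[q_own], reps_j[q_other])
    return total / len(path_ii)

def penalized_loglik(loglik, mismatch, alpha):
    """Trade off the log-likelihood against the Viterbi path mismatch."""
    return alpha * loglik - (1.0 - alpha) * mismatch

def symmetrize(L):
    """Turn the non-symmetric matrix L into a symmetric one by averaging."""
    n = len(L)
    return [[0.5 * (L[i][j] + L[j][i]) for j in range(n)] for i in range(n)]
```

With α = 1 the score reduces to the plain log-likelihood; with α = 0 only the path mismatch is used.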
eHMM: Similarity-based Clustering
• The previous step resulted in a similarity matrix based on the penalized log-likelihood.
• Since the data is available in relational form, we use a standard hierarchical clustering algorithm with the complete-link inter-cluster distance.
• Agglomerative hierarchical clustering is a bottom-up approach that starts with each data point as its own cluster. It then repeatedly merges the two most similar clusters according to an inter-cluster distance.
• In the complete-link algorithm, the distance between two clusters is the maximum of all pair-wise distances between sequences in the two clusters. It produces compact clusters.
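A minimal Python sketch of complete-link agglomerative clustering on a relational (pair-wise) matrix is given below. It assumes the similarity matrix has already been converted to a distance matrix (how that conversion is done is not stated in the slides), and it uses a simple cubic-time search that is adequate for illustration only. The function name is hypothetical.

```python
def complete_link_clustering(dist, num_clusters):
    """Merge clusters bottom-up until num_clusters remain.

    dist[i][j] is the pair-wise distance between sequences i and j.
    Returns a list of clusters, each a list of sequence indices.
    """
    # Start with every sequence in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: the largest pair-wise distance across the two clusters.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Because the merge criterion is the worst-case pair-wise distance, no cluster can contain two sequences that are farther apart than the distance at which it was formed.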
eHMM: Models Construction
Models initialization
• For each model λ_k:
  • Initial values for the initial state and state transition probabilities (π_k and A_k) of model λ_k are obtained by averaging the initial state and state transition probabilities of the individual models of the sequences belonging to cluster k.
  • The state representatives, s^(k), of model λ_k are obtained by clustering the observation vectors of the sequences belonging to cluster k into N clusters.
  • The codebook symbols, V^(k), of model λ_k are obtained by clustering the observation vectors of the sequences belonging to cluster k into M clusters.
  • For each symbol v_m^(k), the membership in each state s_n^(k) is then computed.
[Figure: one model per cluster, λ_1, λ_2, …, λ_K.]
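The averaging step for π_k and A_k can be sketched as follows. This Python fragment is an illustration under one loud assumption: the states of the individual models are already aligned, so that state n means the same thing in every model of the cluster. The function name and the (pi, A) tuple layout are hypothetical.

```python
def average_models(models):
    """Average the initial and transition probabilities of several HMMs.

    models is a list of (pi, A) pairs for the sequences of one cluster.
    Assumes the state indices are aligned across the individual models.
    """
    count = len(models)
    num_states = len(models[0][0])
    pi = [sum(m[0][s] for m in models) / count for s in range(num_states)]
    A = [[sum(m[1][r][c] for m in models) / count for c in range(num_states)]
         for r in range(num_states)]
    return pi, A
```

Averaging preserves normalization: each averaged row still sums to one because every input row does.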
eHMM: Models Construction
Models training
• In clusters dominated by a single class, the sequences are presumably similar and mainly belong to the same ground truth class. In this case, the class-conditional posterior probability is expected to be unimodal and peaked around the MLE of the parameters.
  • Maximum likelihood estimation then yields HMM parameters that best fit this particular class.
• For clusters with a mixture of sequences belonging to different classes, the posterior distribution is expected to be multimodal. We initialize a model for each class within the cluster and then focus on finding the class boundaries within the posterior probability.
  • The model parameters are jointly optimized such that the overall misclassification error is minimized.
• MLE and MCE approaches need a large number of data points to give good estimates of the model parameters.
  • For small clusters, a Bayesian approach is used to approximate the class-conditional posterior distribution.
eHMM: Models Construction
Models training
• For clusters that are dominated by sequences from only one class, we use the standard Baum-Welch re-estimation procedure.
  • Models λ_j^BW, j = 1..J
• For clusters with a mixture of observations belonging to different classes, we use discriminative training based on minimizing the misclassification error to learn a model for each class.
  • Models λ_{i,c}^MCE, i = J+1..E, c = 1..C
• For clusters containing a small number of sequences, we use a variational Bayesian method to update the model parameters given the observed data.
  • Models λ_k^VB, k = E+1..K
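The choice of training method per cluster can be expressed as a small decision rule. The Python sketch below is illustrative only: the slides do not state how "small" or "dominated by one class" are quantified, so the `min_size` and `purity_threshold` parameters and their default values are hypothetical.

```python
from collections import Counter

def choose_training_method(labels, min_size=10, purity_threshold=0.9):
    """Pick BW, MCE, or VB training from the class labels found in one cluster.

    min_size and purity_threshold are hypothetical tuning parameters.
    """
    if len(labels) < min_size:
        return "VB"    # small cluster: variational Bayesian training
    majority_count = Counter(labels).most_common(1)[0][1]
    purity = majority_count / len(labels)
    if purity >= purity_threshold:
        return "BW"    # homogeneous cluster: Baum-Welch (maximum likelihood)
    return "MCE"       # mixed cluster: one discriminatively trained model per class
```

For example, a cluster of 20 sequences with 19 mines and 1 clutter signature would be routed to Baum-Welch training under these default thresholds.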
eHMM: Decision Level Fusion
• Let Γ = {λ_j^BW, λ_{i,c}^MCE, λ_k^VB}, where j = 1..J, i = J+1..E, c = 1..C, and k = E+1..K, be the resulting mixture model after the eHMM training.
• To test a new sequence O, we
eHMM: Decision Level Fusion
• Let F(r,k) = log Pr(O_r | γ_k), 1 ≤ r ≤ R, 1 ≤ k ≤ K, be the R-by-K log-likelihood matrix.
• Each row F_i, i = 1..R, of F represents the feature vector of sequence i in the decision space.
• Thus, the set of sequences is mapped to a Euclidean confidence space.
eHMM: Decision Level Fusion
ANN combination
• Simple combination methods could be used, such as the mean, the maximum, or majority voting. However, these fixed rules are not trainable and require the proper identification of cluster-to-class associations.
• Thus, we use a simple neural network to model the potentially nonlinear mapping between the individual confidence values and the predicted output confidence/class.
• The network combines the K confidence values, and the final output is passed through a sigmoid function.
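As a rough illustration of trainable fusion, the Python sketch below applies a weighted sum followed by a sigmoid to one row of F. The actual network architecture and combination function are not recoverable from the slides; a single output unit with hypothetical `weights` and `bias` parameters is assumed here, and training of those parameters is not shown.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fuse_confidences(confidences, weights, bias):
    """Map K per-model confidence values to a single confidence in [0, 1].

    weights and bias are assumed to have been learned beforehand.
    """
    z = sum(w * f for w, f in zip(weights, confidences)) + bias
    return sigmoid(z)
```

The stable form of the sigmoid matters here because log-likelihood inputs can be large negative numbers.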
eHMM: Decision Level Fusion
HME combination
• The input to the HME network is the K-dimensional vector F.
• The network comprises expert networks and gating networks.
• Each expert network computes its output from F through a weight vector U and a link function f; f is the identity function for regression problems and the logistic function for binary classification.
eHMM: Decision Level Fusion
HME combination (cont.)
• Each gating network has a weight vector v_i.
• The weight vectors U and v_i are the HME parameters and can be learned using a gradient descent method or an EM-like approach.
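The forward pass of a one-level mixture of experts can be sketched in Python as follows. This follows the standard formulation (logistic experts, softmax gating, gate-weighted sum of expert outputs), which is assumed here because the slide equations are not recoverable; the function names are hypothetical and parameter learning is not shown.

```python
import math

def softmax(scores):
    """Normalize gating scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(x):
    """Numerically stable logistic link for binary classification."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hme_output(F, expert_weights, gate_weights):
    """One-level HME: gate-weighted combination of expert outputs.

    expert_weights[i] plays the role of U for expert i and gate_weights[i]
    the role of v_i; both are lists with the same length as F.
    """
    def dot(w):
        return sum(wi * fi for wi, fi in zip(w, F))
    expert_out = [logistic(dot(U)) for U in expert_weights]
    gates = softmax([dot(v) for v in gate_weights])
    return sum(g * mu for g, mu in zip(gates, expert_out))
```

Because the gates depend on the input F, different experts can dominate in different regions of the confidence space.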
Outline
• Introduction
• Hidden Markov Models
• Ensemble HMM classifier
• Application to Landmine Detection
  • GPR data
  • EHD feature extraction
  • Baseline HMM classifier
  • Ensemble HMM classifier
  • Experimental results
• Proposed Future Work
Application to Landmine Detection: GPR data
• Ground Penetrating Radar (GPR) offers the promise of detecting landmines with little or no metal content, at the expense of a higher false alarm rate.
• A GPR signature is a 3-dimensional matrix of sample values S(z,x,y), where z, x, and y represent depth, cross-track position, and down-track position, respectively.
• The down-track position is treated as the time variable in our HMM modeling.
[Figures: NIITEK vehicle-mounted GPR system; GPR scans; GPR signature.]
Application to Landmine Detection: EHD feature extraction
• Simple edge detector operators are used to identify edges and group them into five categories: horizontal, vertical, diagonal, anti-diagonal, and isotropic (non-edge).
[Figure: illustration of the EHD feature extraction process.]
Application to Landmine Detection: Baseline HMM classifier
• The baseline HMM classifier has two HMMs, one for mines and one for background. Each model has four states.
• The mine model assumes that mine signatures have a hyperbolic shape.
• Each model produces a likelihood value, computed with the forward-backward procedure, and a most likely state path, obtained with the Viterbi algorithm by backtracking through the model states.
• A confidence value is assigned to each observation sequence O.
[Figures: illustration of the baseline HMM mine model; illustration of the baseline HMM architecture.]
Application to Landmine Detection: eHMM landmine detector
• (1) Feature extraction results in a set of R sequences of length T = 15 each.
• (2) Similarity matrix computation:
  • Fit a model to each sequence.
  • Compute the likelihood and Viterbi path of each sequence in each model.
  • Deduce the pair-wise similarity matrix.
• (3) Pair-wise similarity-based clustering, using a standard hierarchical algorithm with the complete-link distance, K = 20.
Application to Landmine Detection: eHMM landmine detector (cont.)
• (4) Models initialization and training:
  • For each cluster k, λ_k = (A, B, π) is initialized using the sequences belonging to the cluster (and their corresponding models λ_r). For clusters that will be trained using MCE, one model is initialized for each class: λ_k^mine and λ_k^background.
  • Training is done according to the procedure described earlier:
    • Large clusters dominated by mine signatures or by clutter signatures are trained using maximum likelihood estimation.
    • Large clusters containing a mixture of signatures from both classes are trained using MCE-based discriminative training.
    • Small clusters are trained using the variational Bayesian method.
• (5) Decision level fusion:
  • Done using the ANN and HME fusion methods.
  • Details are as described for the general ensemble HMM classifier.
Application to Landmine Detection: The dataset
• The eHMM was trained and tested on GPR data collected by a NIITEK system.
• Data was collected from 3 different locations.
  • A total of 12 lanes
  • A total of 1616 signatures:
    • 605 mine signatures
    • 1011 clutter signatures
• The EHD features are used. Each signature is represented by a sequence of 15 five-dimensional vectors.
Application to Landmine Detection: eHMM clustering results
[Figure: (a) similarity matrix after clustering, (b) dendrogram.]
Application to Landmine Detection: eHMM clustering results
[Figure: distribution of the alarms in each cluster: (a) per class, (b) per type, (c) per depth.]
Application to Landmine Detection: Individual models performances
[Figure: (a) a sample signature from cluster 1, (b) models' responses to the signature in (a).]
[Figure: (a) a sample signature from cluster 2, (b) models' responses to the signature in (a).]
Application to Landmine Detection: Individual models performances
[Figure: scatter plot of the log-likelihoods of the training data in model 5 (strong mines) versus model 1 (weak mines). Clutter, low metal (LM), and high metal (HM) signatures at different depths are shown with different symbols and colors.]
Application to Landmine Detection: Individual models performances
[Figure: individual ROCs of some models. Solid lines: clusters dominated by mines. Dashed lines: clusters dominated by clutter.]
Application to Landmine Detection: eHMM performance
• For the remainder of the experiments, we use 4-fold cross-validation to evaluate the eHMM on unseen data and average the results over the folds.
[Figure: comparison of the eHMM with the best 3 cluster models (1, 2, and 12).]
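The 4-fold protocol can be sketched in Python as below. How the folds were actually formed (for example by lane or by site) is not stated in the slides; the interleaved assignment used here is an assumption, and the function name is hypothetical.

```python
def k_fold_indices(num_items, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds in an interleaved fashion (assumed scheme).
    """
    folds = [list(range(i, num_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

Each signature is tested exactly once, by a model trained on the other three folds; with 1616 signatures and k = 4, every test fold holds 404 signatures.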
Application to Landmine Detection: eHMM performance
[Figure: comparison of the eHMM with the baseline HMM.]
Application to Landmine Detection: eHMM performance
[Figure: scatter plot of the confidence values of the test data in the eHMM vs. the baseline HMM classifier.]
Conclusions
• An ensemble HMM classifier is proposed:
  • Learn one model per training sequence.
  • Cluster the sequences in the log-likelihood space.
  • Learn an HMM for each cluster using a training technique suited to that cluster.
• The multiple models are expected to capture the intra-class variations.
• The outputs of the multiple models are fused using ANN or HME.
• In an application to the landmine detection problem, the eHMM steps are individually analyzed, and the overall performance is significantly better than that of the baseline DHMM.
Proposed Future Work
• eHMM implementation improvements:
  • Joint clustering, training, and fusion optimization
  • Use variational Bayesian learning for small clusters
  • Use BIC to optimize the structure of the HMMs
  • Use BIC to optimize the number of clusters
• Applications:
  • Identify potential cross-domain applications to evaluate the eHMM
  • Compare the eHMM performance to other ensemble methods, such as the AdaBoost algorithm with HMMs as weak classifiers
Thank you! Questions?