Constrained Optimization for Validation-Guided Conditional Random Field Learning
Minmin Chen, Yixin Chen, Michael Brent, Aaron Tenney
Washington University in St. Louis
KDD 2009
Presented by: Qiang Yang, HKUST
Conditional Random Fields
• Conditional Random Fields (Lafferty, McCallum & Pereira 2001)
• A probabilistic model for segmenting and labeling sequential data
[Figure: a labeling sequence (1 1 1 2 2) aligned above an observation sequence (A T G C G), with feature functions connecting the two]
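As a reference for the formulas on the following slides, the standard linear-chain CRF of Lafferty, McCallum & Pereira (2001) defines the conditional distribution over labeling sequences y given an observation sequence x as

  p(y \mid x; W) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \Big),
  \qquad
  Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

where the f_k are feature functions and W = (w_1, w_2, \dots) are the model parameters.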
Applications
• Natural Language Processing
  • Lafferty J., McCallum A. & Pereira F. 2001
  • Sha F. & Pereira F. 2003
  • Sarawagi S. & Cohen W. W. 2005
• Computer Vision
  • Sminchisescu C., Kanaujia A. & Metaxas D. 2006
  • Vishwanathan S. V. N. et al. 2006
• Bioinformatics
  • Culotta A., Kulp D. & McCallum A. 2005
  • Gross S. S. et al. 2007
Challenges and Related Work
• Overfitting
  • Likelihood training → overfitting
  • Model flexibility + large feature set → overfitting
• Related Work
  • Regularization (Sha 2003, Vail 2007)
  • Smoothing methods (Newman 1977, Rosenfeld 1996)
  • Regularization with a Gaussian prior works as well as or better than smoothing methods, but the regularization parameter is hard to tune (the regularized objective is shown below)
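For concreteness, regularization with a Gaussian prior adds an L2 penalty to the conditional log-likelihood; the variance σ² is the regularization parameter that is hard to tune:

  L(W) = \sum_{i} \log p\big(y^{(i)} \mid x^{(i)}; W\big) \;-\; \frac{\lVert W \rVert^2}{2\sigma^2}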
Motivation of Proposed Work
• Cross-validation is often used to estimate the accuracy of a classifier and to select models. The performance on the validation set is generally strongly correlated with the performance of the trained model on unseen data.
• Constraints prescribing the performance of the trained model on the validation set are used to guide the learning process and prevent the model from fitting freely and tightly to the training data.
Single Training Multiple Validation (STMV) Framework
• STMV Framework
  • Small data set
  • Large data set
[Figure: training/validation/testing splits for the small-data and large-data settings, each showing a training portion, validation sets V1, V2, V3, and a testing portion; a minimal splitting sketch follows]
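A minimal sketch of an STMV-style data split, assuming a pool of labeled sequences; the function name, split fraction, and shuffling policy are illustrative assumptions, not taken from the paper:

```python
import random

def stmv_split(sequences, n_val_sets=3, val_frac=0.1, seed=0):
    """Carve one training set and several disjoint validation sets
    (V1, V2, V3, ...) out of a pool of labeled sequences.
    Split sizes and shuffling are illustrative assumptions."""
    rng = random.Random(seed)
    pool = list(sequences)
    rng.shuffle(pool)
    n_val = max(1, int(val_frac * len(pool)))
    # disjoint validation sets V1 .. Vn, remainder is the training set
    val_sets = [pool[k * n_val:(k + 1) * n_val] for k in range(n_val_sets)]
    train = pool[n_val_sets * n_val:]
    return train, val_sets
```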
Constrained Formulation
• Original objective: maximize the log-likelihood of the labeling sequences given the observation sequences of the training data
• Constraints prescribing the difference of scores ensure that the model takes its performance on the validation sets into consideration. The quantities involved are:
  • the score of a specific labeling sequence y for an observation sequence x
  • the score of the most likely sequence found by Viterbi under the current model for validation sequence v(j)
  • the score of the true labeling sequence for validation sequence v(j)
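Putting these pieces together, a plausible reconstruction of the formulation (the bound τ and the exact form of the score S are assumptions inferred from the definitions on this slide, not copied from the paper):

  \min_{W} \; f(W) = -\sum_{i} \log p\big(y^{(i)} \mid x^{(i)}; W\big)
  \qquad \text{s.t.} \quad
  g_j(W) = S\big(v^{(j)}, \hat{y}^{(j)}; W\big) - S\big(v^{(j)}, y^{(j)}; W\big) \le \tau, \quad j = 1, \dots, m

where S(x, y; W) = \sum_t \sum_k w_k f_k(y_{t-1}, y_t, x, t) is the unnormalized score of labeling y for observation sequence x, \hat{y}^{(j)} = \arg\max_{y} S(v^{(j)}, y; W) is the Viterbi sequence for validation sequence v^{(j)} under the current model, and y^{(j)} is its true labeling.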
Extended Saddle Point (ESP) Theory
• Extended Saddle Point Theory (Wah & Chen 2005)
  • Introduces a necessary and sufficient condition on the Constrained Local Minimum (CLM) of a constrained nonlinear programming problem in a continuous, discrete, or mixed space
  • Offers several salient advantages over previous constraint-handling theories:
    • does not require the constraints to be differentiable or in closed form
    • is satisfied over an extended region of penalty values
    • is both necessary and sufficient
Extended Saddle Point (ESP) Search Algorithm
• Transform the constrained formulation into a penalty form L(W, α), where α holds the extended penalty values
  • The outer loop updates the extended penalty values α
  • The inner loop minimizes L(W, α) with respect to W (a loop sketch follows this slide)
• Challenges
  • Efficient calculation of the gradient of the penalty function
  • The first term of each constraint is determined by the most likely sequence found by the Viterbi algorithm; a small change of the parameters W can yield a very different sequence, so the constraint is non-differentiable
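A sketch of the outer/inner loop structure described above, assuming callables for the objective gradient and the constraints; the penalty form L(w, α) = f(w) + Σ_j α_j max(0, g_j(w)), the step sizes, and the subgradient treatment of max(0, ·) are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def esp_search(f_grad, g_list, g_grads, w0,
               lr=0.01, alpha_step=1.0, inner_iters=50, outer_iters=20):
    """Inner loop: gradient descent on w for the penalty function
    L(w, alpha) = f(w) + sum_j alpha_j * max(0, g_j(w)).
    Outer loop: raise the extended penalty values alpha_j on
    violated constraints. All hyperparameters are illustrative."""
    w = np.asarray(w0, dtype=float)
    alpha = np.zeros(len(g_list))
    for _ in range(outer_iters):
        for _ in range(inner_iters):              # inner loop: minimize over w
            grad = f_grad(w)
            for j, (g, g_grad) in enumerate(zip(g_list, g_grads)):
                if g(w) > 0:                      # only violated constraints contribute
                    grad = grad + alpha[j] * g_grad(w)
            w = w - lr * grad
        violated = [j for j, g in enumerate(g_list) if g(w) > 0]
        if not violated:                          # feasible local minimum: stop
            break
        for j in violated:                        # outer loop: raise penalties
            alpha[j] += alpha_step
    return w, alpha
```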
Approximation of the Discontinuous Gradient
[Figure: a trellis over states s1, s2, s3 and time steps 0 .. T; the quantity of interest is the highest probability of reaching a given state at time t, used around the transition point t1 to approximate the gradient of a feature f(s1, s2, X). The standard Viterbi recursion for this quantity is sketched below.]
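The "highest probability of reaching this state at time t" in the trellis figure is the standard Viterbi forward quantity δ; a minimal sketch, where the log-space representation and matrix shapes are illustrative assumptions:

```python
import numpy as np

def viterbi_arrival_scores(log_init, log_trans, log_emit):
    """delta[t, s] = highest log-probability of any state path that
    reaches state s at time t (the quantity in the trellis figure).

    log_init:  (S,)    log p(s_0 = s)
    log_trans: (S, S)  log p(s_t = s' | s_{t-1} = s)
    log_emit:  (T, S)  log p(x_t | s_t = s)
    """
    T, S = log_emit.shape
    delta = np.empty((T, S))
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        # best predecessor score for each state, plus this step's emission
        delta[t] = np.max(delta[t - 1][:, None] + log_trans, axis=0) + log_emit[t]
    return delta
```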
Experimental Results: Gene Prediction
• Task: identify protein-coding regions and their associated components
• Apply the STMV framework and ESP search algorithm to CONTRAST, a state-of-the-art gene predictor
• Data set:
  • Fruit fly genome, 27,463 genes, evenly divided into 4 sets
  • Feature set of around 33,000 features
• Compare performance to the original CONTRAST and to CONTRAST with regularization
• Metrics: gene-, exon-, and nucleotide-level sensitivity and specificity
Experimental Results: Gene Prediction (cont.)
[Table: performance of the original CRF, the regularized CRF (CRFr), and the constrained CRF (CRFc) on CONTRAST]
Experimental Results: Stock Price Prediction
• Task: predict whether tomorrow's stock price will rise or fall relative to today's, based on historical data
• Preprocessing techniques smooth out the noise in the raw data
• Data set:
  • 1,741 stocks from NASDAQ and NYSE
  • Each contains stock prices from 2002-02-01 to 2007-09-12
  • Feature set of about 2,000 features
• Prediction for day T+1 (a windowing sketch follows):
  • Training sequence: day 1 ~ T
  • Validation sequence: day T−V+1 ~ T (V = 100)
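A minimal sketch of the day-(T+1) windowing described above; the only assumption is the mapping from 1-based day numbers to 0-based Python indices:

```python
def rolling_split(prices, T, V=100):
    """Train on days 1..T and validate on the trailing V days,
    i.e. days T-V+1..T, to predict day T+1. Day d lives at
    index d-1 (0-based); that mapping is the only assumption."""
    train = prices[:T]        # days 1 .. T
    val = prices[T - V:T]     # days T-V+1 .. T
    return train, val
```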
Conclusion
• A Single Training Multiple Validation (STMV) framework
  • Integrates validation into the training process by modeling validation quality as constraints in the problem formulation
  • Effectively avoids overfitting of CRF models
• An approximation scheme
  • Efficiently approximates the discontinuous gradient of the constrained formulation
• An Extended Saddle Point (ESP) search algorithm
  • Robustly finds a constrained local minimum of the constrained formulation
Questions?
mc15@cec.wustl.edu