Learning from Infinite Training Examples
3.18.2009, 3.19.2009 Prepared for NKU and NUTN seminars
Presenter: Chun-Nan Hsu (許鈞南)
Institute of Information Science, Academia Sinica, Taipei, Taiwan
The Ever-Growing Web (Zhuangzi, ca. 400 BC) • Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail. • 人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。 • Zhuangzi (莊子), ca. 400 BC
Analogously… • Computing power is finite, • but the Web is infinite. • Mining the infinite Web with finite computing power… • is doomed to fail?
Other “holy grails” in Artificial Intelligence • Learning to understand natural languages • Learning to recognize millions of objects in computer vision • Speech recognition in noisy environments, such as in a car
On-Line Learning vs. Off-Line Learning • Nothing to do with humans learning by browsing the web • Definition: Given a set of new training data, • an online learner can update its model without reading the old data while improving its performance. • By contrast, an off-line learner must combine the old and new data and start learning all over again; otherwise its performance will suffer.
Off-Line Learning • Nearly all popular ML algorithms are off-line today • They iterate over the training examples for many passes until an objective function is minimized • For example: • SMO algorithm for SVM • L-BFGS algorithm for CRF • EM algorithm for HMM and GMM • Etc.
Single-pass on-line learning • The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples in a single pass only
Previous work on on-line learning • Perceptron • Rosenblatt 1957 • Stochastic Gradient Descent • Widrow & Hoff 1960 • Bregman Divergence • Azoury & Warmuth 2001 • MIRA (Large Margin) • Crammer & Singer 2003 • LaRank • Bordes & Bottou 2005, 2007 • EG • Collins & Bartlett et al. 2008
Stochastic Gradient Descent (SGD) • Learning is framed as minimizing a loss function over the training examples; SGD updates the model after each example using a noisy, single-example estimate of the gradient (see the update rule below)
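As a reference, plain SGD can be written in the following standard form (this is the textbook update, not necessarily the exact notation of the original slide):

```latex
% Plain (first-order) SGD: one update per training example (x_t, y_t)
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \,\nabla_\theta\, \ell(\theta_t;\, x_t, y_t)
```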
Optimal Step Size (Benveniste et al. 1990, Murata et al. 1998) • Solving gradient = 0 by Newton’s method • The step size is asymptotically optimal if it approaches the inverse Hessian scaled by 1/t (see the formula below)
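A common statement of this asymptotic result, written here in standard form rather than the slide's own notation, is:

```latex
% Asymptotically optimal (matrix-valued) step size
\eta_t \;\longrightarrow\; \tfrac{1}{t}\, H^{-1},
\qquad
H \;=\; \nabla^2_\theta\, \mathbb{E}\big[\ell(\theta;\, x, y)\big]\Big|_{\theta=\theta^*}
```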
Single-Pass Result (Bottou & LeCun 2004) • Optimum for n+1 examples is a Newton step away from the optimum for n examples
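Bottou & LeCun (2004) state this roughly as follows (a standard paraphrase of their result, not the slide's exact formula):

```latex
% One Newton step from the optimum on n examples to the optimum on n+1 examples
\theta^*_{n+1} \;\approx\; \theta^*_n \;-\; \tfrac{1}{n+1}\, H^{-1}\,
\nabla_\theta\, \ell\big(\theta^*_n;\, x_{n+1}, y_{n+1}\big)
```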
2nd Order SGD • 2nd-order SGD (2SGD): adjusting the step size to track the (scaled) inverse Hessian • Good news: from previous work, given sufficiently many training examples, 2SGD achieves the empirical optimum in a single pass! • Bad news: it is prohibitively expensive to compute H⁻¹ • e.g., with 10K features, H is a 10K-by-10K matrix, i.e., a 100M-entry floating-point array (see the estimate below) • How about 1M features?
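A rough back-of-the-envelope estimate of the storage cost for a dense Hessian (assuming 4 bytes per entry; the exact byte figures are my arithmetic, not the slide's):

```latex
10^4 \text{ features} \;\Rightarrow\; H \in \mathbb{R}^{10^4 \times 10^4}
  \;\Rightarrow\; 10^{8} \text{ entries} \approx 400\ \text{MB},
\qquad
10^6 \text{ features} \;\Rightarrow\; 10^{12} \text{ entries} \approx 4\ \text{TB}
```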
Approximating Jacobian (Aitken 1925, Schafer 1997) • Learning algorithms can be viewed as fixed-point iterations θ = M(θ) • A Taylor expansion around the fixed point relates successive parameter differences through the Jacobian J = M′ • The eigenvalues of J can be approximated from ratios of successive differences (see below)
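The Aitken-style relations behind this, written in standard form (an approximation, taken componentwise for vector-valued iterates), are:

```latex
\theta_{t+1} - \theta_t \;\approx\; J\,\big(\theta_t - \theta_{t-1}\big),
\qquad
\gamma \;\approx\; \frac{\theta_{t+1} - \theta_t}{\theta_t - \theta_{t-1}}
\;\approx\; \operatorname{eig}(J)
```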
Approximating Hessian • Consider the SGD mapping as a fixed-point iteration, too. • Since J = M′ = I − H, we have eig(J) = eig(M′) = eig(I − H), • therefore (since H is symmetric) eig(J) = 1 − eig(H), and so eig(H⁻¹) = 1 / (1 − eig(J)) = 1 / (1 − γ)
Estimating Eigenvalues Periodically • Since the mapping of SGD is stochastic, estimating the eigenvalues at every iteration may yield inaccurate estimates. • To make the mapping more stationary, we use the b-fold composition M^b = M(M(…M(θ)…)) • By the law of large numbers, b consecutive mappings, M^b, will be less “stochastic” • Using the Aitken relation above, eig(J^b) can then be estimated from ratios of successive b-step parameter differences (a sketch of the resulting procedure is given below)
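The pieces above combine into the PSA-style procedure. The following is a minimal, illustrative sketch only: the function name, defaults, and the clipping scheme are my assumptions, not the authors' implementation.

```python
import numpy as np

def psa_sgd(grad, theta0, data, eta0=0.1, b=16, eps=1e-8, gamma_max=0.9999):
    """Single-pass SGD with periodically adapted per-component step sizes.

    grad(theta, example) -> stochastic gradient for one example
    data                 -> iterable of training examples, scanned once
    """
    theta = np.asarray(theta0, dtype=float).copy()
    eta = np.full_like(theta, eta0)        # per-component step sizes
    block_start = theta.copy()             # parameters at the start of the current block
    prev_delta = None                      # parameter change over the previous block
    t = 0
    for example in data:
        theta -= eta * grad(theta, example)    # fixed-rate SGD within a block
        t += 1
        if t % b == 0:                         # end of a b-step block
            delta = theta - block_start        # change over the last b updates
            if prev_delta is not None:
                # Aitken-style componentwise estimate of eig(J^b):
                # gamma ~ (change over this block) / (change over previous block)
                denom = np.where(np.abs(prev_delta) > eps, prev_delta, eps)
                gamma = np.clip(delta / denom, -gamma_max, gamma_max)
                # eig(H^{-1}) ~ 1 / (1 - gamma): rescale the step sizes accordingly
                eta = eta0 / (1.0 - gamma)
            prev_delta = delta
            block_start = theta.copy()
    return theta
```

The block length b trades off the variance of the eigenvalue estimates (larger b is less noisy) against how quickly the step sizes react; the clipping of γ simply guards against division blow-ups in this sketch.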
Experimental Results • Conditional Random Fields (CRF) (Lafferty et al. 2001) • Sequence labeling problems – gene mention tagging
In effect, a CRF encodes a probabilistic rule-based system with rules of the form: If fj1(X,Y) & fj2(X,Y) & … & fjn(X,Y) are non-zero, then the labels of the sequence are Y with score P(Y|X) • If we have d features and consider a context window of width w, an order-1 CRF implicitly encodes an enormous number of such rules (see the estimate for BioCreative 2 below)
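For reference, the score P(Y|X) is the standard linear-chain CRF conditional probability from Lafferty et al. (2001), quoted here from the literature:

```latex
P(Y \mid X) \;=\; \frac{1}{Z(X)}
  \exp\!\Big( \sum_{t} \sum_{j} \lambda_j\, f_j\big(y_{t-1},\, y_t,\, X,\, t\big) \Big),
\qquad
Z(X) \;=\; \sum_{Y'} \exp\!\Big( \sum_{t} \sum_{j} \lambda_j\, f_j\big(y'_{t-1},\, y'_t,\, X,\, t\big) \Big)
```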
Tasks and Setups • CoNLL 2000 base NP: tag noun phrases; 8936 training, 2012 test; 3 tags, 1,015,662 features • CoNLL 2000 chunking: tag 11 chunk types; 8936 training, 2012 test; 23 tags, 7,448,606 features • BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell line); 18,546 training, 3856 test; 5,977,675 features • BioCreative 2: tag gene names; 15,000 training, 5000 test; 10,242,972 features • Performance measure: F-score
Feature types for BioCreative 2 • On the order of 2^(2M) rules are encoded in our CRF model!!!
Execution Time (time to complete the first pass) • CoNLL 2000 base NP: 23.74 sec • CoNLL 2000 chunking: 196.44 sec • BioNLP/NLPBA 2004: 287.48 sec • BioCreative 2: 394.04 sec
Experimental results for linear SVM and convolutional neural net • Data sets
Linear SVM • Convolutional Neural Net (5 layers) • Layer trick: step sizes in the lower layers should be larger than in the higher layers
Mini-conclusion: Single-Pass • By approximating the Jacobian, we can approximate the Hessian, too • By approximating the Hessian, we can achieve near-optimal single-pass performance in practice • With a single-pass on-line learner, virtually infinitely many training examples can be used
Analysis of PSA • PSA is a member of the family of “discretized Newton methods” • Other well-known members include • the secant method (aka Quickprop) • Steffensen’s method (aka Triple Jump) • The general form of these methods is given below, where A is a matrix designed to approximate the inverse Hessian without actually computing second derivatives
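The general form referred to above, written in a standard way (my notation, consistent with the secant/Steffensen family rather than copied from the slide):

```latex
\theta_{t+1} \;=\; \theta_t \;-\; A_t\, g(\theta_t),
\qquad
A_t \;\approx\; H^{-1} \ \text{built from differences of iterates or gradients}
```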
PSA • PSA is neither the secant method nor Steffensen’s method • PSA iterates a 2b-step “parallel chord” method (i.e., fixed-rate SGD) followed by an approximated Newton step • The off-line 2-step parallel chord method is known to have convergence of order 4
Are we there yet? • With single-pass on-line learning, we can now learn from infinite training examples, at least in theory • Still needed: a cheaper, quicker method to annotate labels for training examples • Plus a lot of computers…
Human life is finite, but knowledge is infinite. Learning from infinite examples by applying PSA to 2nd-order SGD is a good idea!
Thank you for your attention! http://aiia.iis.sinica.edu.tw http://chunnan.iis.sinica.edu.tw/~chunnan This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.