Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2 Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center
Contents VC Theory: Reminders from the last lecture. ERM Consistency (asymptotic results). ERM generalization bounds (non-asymptotic results). Structural Risk Minimization (SRM). Other theoretical frameworks for machine learning.
Empirical Risk Minimization (ERM) Loss function:
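The formula on this slide did not survive the export; a standard formulation in Vapnik's notation (with z = (x, y) and a parametric family indexed by α ∈ Λ) is:

```latex
% Loss of hypothesis f(x,\alpha) on example z = (x, y)
Q(z,\alpha) = L\bigl(y, f(x,\alpha)\bigr), \qquad z = (x, y),
\qquad \text{e.g.}\quad Q(z,\alpha) = \bigl(y - f(x,\alpha)\bigr)^{2}.
```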
Empirical Risk Minimization (ERM) Risk Function:
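The risk functional itself is missing from the export; the standard definition is the expected loss under the unknown distribution F(z):

```latex
R(\alpha) = \int Q(z,\alpha)\, dF(z), \qquad \alpha \in \Lambda .
```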
Empirical Risk Minimization (ERM) Risk Minimization Principle:
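The principle on this slide was an image; in standard form it says: choose from the admissible set Λ the function minimizing the (unknown) risk:

```latex
\alpha_{0} = \arg\min_{\alpha \in \Lambda} R(\alpha).
```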
Empirical Risk Minimization (ERM) Empirical Risk Minimization Principle:
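The ERM principle replaces the unknown risk by its empirical estimate on the sample z1,…,zl and minimizes that instead; the standard formulas (missing from the export) are:

```latex
R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i,\alpha),
\qquad
\alpha_{l} = \arg\min_{\alpha \in \Lambda} R_{emp}(\alpha).
```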
Regularization Regularized Empirical Risk:
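The regularized empirical risk, in a standard form (the penalty functional Ω and parameter λ > 0 are the usual notation, assumed here, e.g. Ω(α) a squared norm of the weights):

```latex
R_{reg}(\alpha) = R_{emp}(\alpha) + \lambda\, \Omega(\alpha), \qquad \lambda > 0 .
```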
Empirical Risk Minimization (ERM) Questions for ERM (Statistical Learning Theory): Is ERM consistent? (consistency: does the ERM solution converge weakly to the true one?) How fast is the convergence? How can generalization be controlled?
ERM Consistency The ERM solution is an estimator of the true solution based on a sample of size n, and Remp[f] is an estimator of R[f]. So ERM combines two estimators. Is the combination consistent?
ERM Consistency Consistency Definition for ERM:
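The definition on these slides was an image; in Vapnik's formulation, ERM is consistent if both sequences converge in probability to the optimal risk:

```latex
R(\alpha_{l}) \xrightarrow[l\to\infty]{P} \inf_{\alpha\in\Lambda} R(\alpha),
\qquad
R_{emp}(\alpha_{l}) \xrightarrow[l\to\infty]{P} \inf_{\alpha\in\Lambda} R(\alpha).
```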
ERM Consistency Do we need both limits to hold true? Counterexample: the Q(z,α) are indicator functions. Each function of this set equals 1 for all z except on a finite number of intervals of total measure ε, where it equals 0. The parameters α define the intervals on which the function is zero. The set of functions Q(z,α) is such that, for any finite number of points z1,...,zl, one can find a function that takes the value zero at all of these points. Let F(z) be the uniform distribution function on the interval [0,1].
ERM Consistency Do we need both limits hold true? We have:
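Completing the computation under the construction above (each function equals 1 except on intervals of total measure ε, F uniform on [0,1]):

```latex
R(\alpha) = \int_{0}^{1} Q(z,\alpha)\, dF(z) = 1 - \varepsilon \quad \forall\, \alpha \in \Lambda,
\qquad
R_{emp}(\alpha_{l}) = 0 \ \text{for every sample } z_{1},\dots,z_{l}.
```

So the first limit, R(α_l) → inf R = 1 − ε, holds trivially, while R_emp(α_l) → 0 ≠ 1 − ε: the second limit fails. Requiring only one of the two limits is therefore not enough.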
ERM Consistency Strict Consistency: needed to rule out trivial consistency, e.g. when the function set contains a function that minorizes all the others, consistency is satisfied trivially.
ERM Consistency Strict Consistency: (note: only the second limit is needed)
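One common statement of strict consistency (Vapnik): for every c with Λ(c) = {α : R(α) ≥ c} nonempty,

```latex
\inf_{\alpha \in \Lambda(c)} R_{emp}(\alpha)
\xrightarrow[l\to\infty]{P}
\inf_{\alpha \in \Lambda(c)} R(\alpha).
```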
ERM Consistency Two sided Empirical Processes: Uniform convergence:
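The condition itself is missing from the export; two-sided uniform convergence of empirical means to expectations reads:

```latex
\lim_{l\to\infty} P\Bigl\{\, \sup_{\alpha\in\Lambda}
\bigl| R(\alpha) - R_{emp}(\alpha) \bigr| > \varepsilon \Bigr\} = 0
\qquad \forall\, \varepsilon > 0 .
```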
ERM Consistency One-sided Empirical Processes:
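The one-sided version drops the absolute value and controls only over-optimism of the empirical risk:

```latex
\lim_{l\to\infty} P\Bigl\{\, \sup_{\alpha\in\Lambda}
\bigl( R(\alpha) - R_{emp}(\alpha) \bigr) > \varepsilon \Bigr\} = 0
\qquad \forall\, \varepsilon > 0 .
```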
ERM Consistency Concentration Inequality: Hoeffding’s Inequality
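The inequality itself did not survive the export; for a fixed α with a ≤ Q(z,α) ≤ b, Hoeffding's inequality reads:

```latex
P\Bigl\{\, \bigl| R_{emp}(\alpha) - R(\alpha) \bigr| \ge \varepsilon \Bigr\}
\;\le\; 2 \exp\!\Bigl( -\frac{2\varepsilon^{2} l}{(b-a)^{2}} \Bigr),
\qquad \varepsilon > 0 .
```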
ERM Consistency Concentration Inequality: Hoeffding’s Inequality Hoeffding’s inequality is distribution independent. It describes the rate of convergence of frequencies to their probabilities. When a = 0 and b = 1 (indicator functions), it reduces to Chernoff’s inequality. It and its generalizations have been used extensively in the analysis of randomized algorithms and in learning theory.
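As a quick illustration, a small simulation (a sketch; the Bernoulli(0.3) distribution and all parameters are arbitrary choices, not from the slides) checks that the observed deviation frequency stays below the distribution-independent bound:

```python
import math
import random

def hoeffding_bound(eps, l, a=0.0, b=1.0):
    """Two-sided Hoeffding bound for the empirical mean of l i.i.d.
    random variables taking values in [a, b]."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * l / (b - a) ** 2)

def deviation_frequency(p, l, eps, trials=2000, seed=0):
    """Fraction of trials in which the empirical frequency of a
    Bernoulli(p) event deviates from p by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        emp = sum(rng.random() < p for _ in range(l)) / l
        if abs(emp - p) > eps:
            bad += 1
    return bad / trials

# Observed deviation frequency vs. the Hoeffding bound for this (p, l, eps)
freq = deviation_frequency(p=0.3, l=200, eps=0.1)
bound = hoeffding_bound(eps=0.1, l=200)
```

Whatever distribution is substituted for Bernoulli(0.3), the observed frequency must stay below the same bound: that is what "distribution independent" means here.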
ERM Consistency Key Theorem of Learning:
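The statement was an image; informally, the Vapnik–Chervonenkis key theorem of learning theory says that for bounded loss functions, (nontrivial) consistency of ERM is equivalent to one-sided uniform convergence:

```latex
\text{ERM is strictly consistent}
\iff
\lim_{l\to\infty} P\Bigl\{ \sup_{\alpha\in\Lambda}
\bigl( R(\alpha) - R_{emp}(\alpha) \bigr) > \varepsilon \Bigr\} = 0
\quad \forall\, \varepsilon > 0 .
```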
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions Let our set of events contain a finite number N of events Ak = {z : Q(z,αk) > 0}, k = 1, 2, …, N. For this set, uniform convergence does hold. For uniform convergence:
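For the finite case the argument is a union bound over the N events combined with Hoeffding's inequality (indicator functions, so b − a = 1):

```latex
P\Bigl\{ \max_{1\le k\le N} \bigl| R(\alpha_k) - R_{emp}(\alpha_k) \bigr| > \varepsilon \Bigr\}
\;\le\; \sum_{k=1}^{N} P\bigl\{ \bigl| R(\alpha_k) - R_{emp}(\alpha_k) \bigr| > \varepsilon \bigr\}
\;\le\; 2N e^{-2\varepsilon^{2} l} \;\xrightarrow[l\to\infty]{}\; 0 .
```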
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: Given a sample {z1, z2, …, zl}, for each α we have a binary vector: q(α) = (Q(z1,α), …, Q(zl,α)) Each q(α) is a vertex of the l-dimensional hypercube.
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: N(z1,…,zl) is the number of distinct vertices; we have N(z1,…,zl) ≤ 2^l. Define the random entropy H(z1,…,zl) = ln N(z1,…,zl). The entropy of the function set is then defined as:
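Completing the definition (the expectation is over i.i.d. samples drawn from F(z)):

```latex
H(l) = E\, H(z_1,\dots,z_l) = E \ln N(z_1,\dots,z_l).
```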
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Consider a set of functions Q such that |Q(z,α)| ≤ C. Similar to the indicator case, given a sample Z = z1,…,zl, for each α the vector q(α) = (Q(z1,α), …, Q(zl,α)) is a point in a hypercube.
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Define N(ε; z1,…,zl) to be the number of vectors in the minimal ε-net of the set of vectors q(α) (as α varies). The random ε-entropy of Q(z,α) is defined as: H(ε; z1,…,zl) = ln N(ε; z1,…,zl) The ε-entropy is defined as:
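Completing the definition, with the expectation again over samples, and the corresponding uniform-convergence condition:

```latex
H_{\Lambda}(\varepsilon; l) = E\, H(\varepsilon; z_1,\dots,z_l)
= E \ln N(\varepsilon; z_1,\dots,z_l),
\qquad
\lim_{l\to\infty} \frac{H_{\Lambda}(\varepsilon; l)}{l} = 0 \quad \forall\, \varepsilon > 0 .
```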
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations
ERM Consistency Conditions of One-Sided Convergence:
ERM Consistency Three milestones in learning theory: For pattern recognition (sets of indicator functions), we have: Entropy: H(l) = E ln N(z1,…,zl) Annealed Entropy: Hann(l) = ln E N(z1,…,zl) Growth Function: G(l) = ln sup over z1,…,zl of N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l)
ERM Consistency Three milestones in learning theory: First milestone: - sufficient condition for consistency: Second milestone: - sufficient condition for fast convergence rate: Third milestone: sufficient and necessary condition for consistency for any measure and fast convergence rate:
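The three conditions, stated as limits (following Vapnik):

```latex
\lim_{l\to\infty} \frac{H(l)}{l} = 0 \ \text{(consistency)}; \qquad
\lim_{l\to\infty} \frac{H_{ann}(l)}{l} = 0 \ \text{(fast convergence rate)}; \qquad
\lim_{l\to\infty} \frac{G(l)}{l} = 0 \ \text{(consistency for any measure, with fast rate)}.
```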
ERM Generalization Bounds Non-asymptotic results: Consistency is an asymptotic result; it tells us neither the speed of convergence nor the confidence in the results of ERM.
ERM Generalization Bounds Non-asymptotic results: Consider the finite case, when Q(z,α) contains only N (indicator) functions. In this case (using Chernoff’s inequality), ERM is consistent and converges fast.
ERM Generalization Bounds Non-asymptotic results: with probability 1-η: With probability 1-2η:
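The bounds themselves were images; a standard reconstruction of the finite-case bound (N indicator functions, via Hoeffding plus a union bound; exact constants vary across formulations) is: with probability at least 1-η, simultaneously for all N functions,

```latex
R(\alpha) \;\le\; R_{emp}(\alpha) + \sqrt{\frac{\ln N - \ln \eta}{2l}} .
```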
ERM Generalization Bounds Indicator Functions: - Distribution Dependent
ERM Generalization Bounds Indicator Functions: - Distribution Dependent with probability 1-η: With probability 1-2η: Where:
ERM Generalization Bounds Indicator Functions: Distribution Independent Reminder: Growth Function: G(l) = ln sup over z1,…,zl of N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l) G(l) does not depend on the distribution, so if we substitute G(l) for H(l), we get distribution-free bounds on the generalization error.
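A hedged reconstruction of the key growth-function result (Vapnik–Chervonenkis, closely related to Sauer's lemma): either every sample size is shattered, or beyond the VC dimension h the growth function becomes logarithmic:

```latex
G(l) = l \ln 2 \quad (l \le h),
\qquad
G(l) \le h \Bigl( \ln \frac{l}{h} + 1 \Bigr) \quad (l > h).
```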
ERM Generalization Bounds Indicator Functions: VC dimension
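The definition on this slide was an image; the VC dimension h of a set of indicator functions Q(z,α) is the largest number of points that can be shattered:

```latex
h = \max \Bigl\{\, l \;:\; \sup_{z_1,\dots,z_l} N(z_1,\dots,z_l) = 2^{l} \Bigr\},
```

with h = ∞ if samples of every size can be shattered.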
ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions
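The linear example can be made concrete: halfplane indicator functions in R^n have VC dimension n + 1, so lines in the plane shatter 3 points but never 4. A small brute-force sketch (the candidate family of lines is an assumption that happens to suffice for these grid points; this is not a general separability test):

```python
from itertools import product

def separates(w, b, points, labels):
    """True if sign(w . x + b) strictly matches the labels on every point."""
    for (x, y), lab in zip(points, labels):
        s = w[0] * x + w[1] * y + b
        if s == 0 or (s > 0) != lab:
            return False
    return True

def count_realizable(points):
    """Count labelings of `points` realizable by some line from a small
    candidate family (chosen to be sufficient for these grid points)."""
    ws = [(1, 0), (0, 1), (1, 1), (1, -1),
          (-1, 0), (0, -1), (-1, -1), (-1, 1)]
    bs = [k / 2 for k in range(-4, 5)]  # offsets -2.0 .. 2.0 in steps of 0.5
    n = 0
    for labels in product([False, True], repeat=len(points)):
        if any(separates(w, b, points, labels) for w, b in product(ws, bs)):
            n += 1
    return n

three = [(0, 0), (0, 1), (1, 0)]            # 3 points in general position
four = [(0, 0), (0, 1), (1, 0), (1, 1)]     # 4 corners of the unit square
```

All 8 labelings of the 3 points are realizable (they are shattered), but of the 16 labelings of the square corners only 14 are: the two XOR labelings cannot be separated by any line, so no 4 points are shattered and h = 3 for lines in the plane.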
ERM Generalization Bounds Indicator Functions: VC dimension Example: