Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2 Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center
Contents VC Theory: Reminders from the last lecture. ERM Consistency (asymptotic results). ERM generalization bounds (non-asymptotic results). Structural Risk Minimization (SRM). Other theoretical frameworks for machine learning.
Empirical Risk Minimization (ERM) Loss function:
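The formula on this slide did not survive the export; a standard formulation in Vapnik's notation (with z = (x, y) and a parametric family indexed by α ∈ Λ) is:

```latex
% Loss of hypothesis f(x,\alpha) on example z = (x, y)
Q(z,\alpha) = L\bigl(y, f(x,\alpha)\bigr), \qquad z = (x, y),
\qquad \text{e.g.}\quad Q(z,\alpha) = \bigl(y - f(x,\alpha)\bigr)^{2}.
```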
Empirical Risk Minimization (ERM) Risk Function:
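The risk functional itself is missing from the export; the standard definition is the expected loss under the unknown distribution F(z):

```latex
R(\alpha) = \int Q(z,\alpha)\, dF(z), \qquad \alpha \in \Lambda .
```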
Empirical Risk Minimization (ERM) Risk Minimization Principle:
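The principle on this slide was an image; in standard form it says: choose from the admissible set Λ the function minimizing the (unknown) risk:

```latex
\alpha_{0} = \arg\min_{\alpha \in \Lambda} R(\alpha).
```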
Empirical Risk Minimization (ERM) Empirical Risk Minimization Principle:
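The ERM principle replaces the unknown risk by its empirical estimate on the sample z1,…,zl and minimizes that instead; the standard formulas (missing from the export) are:

```latex
R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i,\alpha),
\qquad
\alpha_{l} = \arg\min_{\alpha \in \Lambda} R_{emp}(\alpha).
```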
Regularization Regularized Empirical Risk:
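The regularized empirical risk, in a standard form (the penalty functional Ω and parameter λ > 0 are the usual notation, assumed here, e.g. Ω(α) a squared norm of the weights):

```latex
R_{reg}(\alpha) = R_{emp}(\alpha) + \lambda\, \Omega(\alpha), \qquad \lambda > 0 .
```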
Empirical Risk Minimization (ERM) Questions for ERM (Statistical Learning Theory): Is ERM consistent? (consistency: does the ERM solution converge weakly to the true one?) How fast is the convergence? How can generalization be controlled?
ERM Consistency The ERM solution is an estimator of the true solution based on a sample of size n, and Remp[f] is an estimator of R[f]. So ERM combines two estimators. Is the combination consistent?
ERM Consistency Consistency Definition for ERM:
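The definition on these slides was an image; in Vapnik's formulation, ERM is consistent if both sequences converge in probability to the optimal risk:

```latex
R(\alpha_{l}) \xrightarrow[l\to\infty]{P} \inf_{\alpha\in\Lambda} R(\alpha),
\qquad
R_{emp}(\alpha_{l}) \xrightarrow[l\to\infty]{P} \inf_{\alpha\in\Lambda} R(\alpha).
```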
ERM Consistency Do we need both limits to hold true? Counterexample: the Q(z,α) are indicator functions. Each function of this set equals 1 for all z except on a finite number of intervals of total measure ε, where it equals 0. The parameters α define the intervals on which the function is zero. The set of functions Q(z,α) is such that, for any finite number of points z1,...,zl, one can find a function that takes the value zero at all of these points. Let F(z) be the uniform distribution function on the interval [0,1].
ERM Consistency Do we need both limits hold true? We have:
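Completing the computation under the construction above (each function equals 1 except on intervals of total measure ε, F uniform on [0,1]):

```latex
R(\alpha) = \int_{0}^{1} Q(z,\alpha)\, dF(z) = 1 - \varepsilon \quad \forall\, \alpha \in \Lambda,
\qquad
R_{emp}(\alpha_{l}) = 0 \ \text{for every sample } z_{1},\dots,z_{l}.
```

So the first limit, R(α_l) → inf R = 1 − ε, holds trivially, while R_emp(α_l) → 0 ≠ 1 − ε: the second limit fails. Requiring only one of the two limits is therefore not enough.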
ERM Consistency Strict Consistency: needed to rule out trivial consistency, e.g. when the function set contains a function that minorizes all the others, consistency is satisfied trivially.
ERM Consistency Strict Consistency: (note: only the second limit is needed)
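One common statement of strict consistency (Vapnik): for every c with Λ(c) = {α : R(α) ≥ c} nonempty,

```latex
\inf_{\alpha \in \Lambda(c)} R_{emp}(\alpha)
\xrightarrow[l\to\infty]{P}
\inf_{\alpha \in \Lambda(c)} R(\alpha).
```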
ERM Consistency Two sided Empirical Processes: Uniform convergence:
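The condition itself is missing from the export; two-sided uniform convergence of empirical means to expectations reads:

```latex
\lim_{l\to\infty} P\Bigl\{\, \sup_{\alpha\in\Lambda}
\bigl| R(\alpha) - R_{emp}(\alpha) \bigr| > \varepsilon \Bigr\} = 0
\qquad \forall\, \varepsilon > 0 .
```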
ERM Consistency One-sided Empirical Processes:
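The one-sided version drops the absolute value and controls only over-optimism of the empirical risk:

```latex
\lim_{l\to\infty} P\Bigl\{\, \sup_{\alpha\in\Lambda}
\bigl( R(\alpha) - R_{emp}(\alpha) \bigr) > \varepsilon \Bigr\} = 0
\qquad \forall\, \varepsilon > 0 .
```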
ERM Consistency Concentration Inequality: Hoeffding’s Inequality
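The inequality itself did not survive the export; for a fixed α with a ≤ Q(z,α) ≤ b, Hoeffding's inequality reads:

```latex
P\Bigl\{\, \bigl| R_{emp}(\alpha) - R(\alpha) \bigr| \ge \varepsilon \Bigr\}
\;\le\; 2 \exp\!\Bigl( -\frac{2\varepsilon^{2} l}{(b-a)^{2}} \Bigr),
\qquad \varepsilon > 0 .
```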
ERM Consistency Concentration Inequality: Hoeffding’s Inequality Hoeffding’s inequality is distribution independent. It describes the rate of convergence of frequencies to their probabilities. When a = 0 and b = 1 (indicator functions), it reduces to Chernoff’s inequality. It and its generalizations have been used extensively in the analysis of randomized algorithms and in learning theory.
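As a quick illustration, a small simulation (a sketch; the Bernoulli(0.3) distribution and all parameters are arbitrary choices, not from the slides) checks that the observed deviation frequency stays below the distribution-independent bound:

```python
import math
import random

def hoeffding_bound(eps, l, a=0.0, b=1.0):
    """Two-sided Hoeffding bound for the empirical mean of l i.i.d.
    random variables taking values in [a, b]."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * l / (b - a) ** 2)

def deviation_frequency(p, l, eps, trials=2000, seed=0):
    """Fraction of trials in which the empirical frequency of a
    Bernoulli(p) event deviates from p by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        emp = sum(rng.random() < p for _ in range(l)) / l
        if abs(emp - p) > eps:
            bad += 1
    return bad / trials

# Observed deviation frequency vs. the Hoeffding bound for this (p, l, eps)
freq = deviation_frequency(p=0.3, l=200, eps=0.1)
bound = hoeffding_bound(eps=0.1, l=200)
```

Whatever distribution is substituted for Bernoulli(0.3), the observed frequency must stay below the same bound: that is what "distribution independent" means here.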
ERM Consistency Key Theorem of Learning:
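The statement was an image; informally, the Vapnik–Chervonenkis key theorem of learning theory says that for bounded loss functions, (nontrivial) consistency of ERM is equivalent to one-sided uniform convergence:

```latex
\text{ERM is strictly consistent}
\iff
\lim_{l\to\infty} P\Bigl\{ \sup_{\alpha\in\Lambda}
\bigl( R(\alpha) - R_{emp}(\alpha) \bigr) > \varepsilon \Bigr\} = 0
\quad \forall\, \varepsilon > 0 .
```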
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions Let our set of events contain a finite number N of events Ak = {z : Q(z,αk) > 0}, k = 1, 2, …, N. For this set, uniform convergence does hold. For uniform convergence:
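For the finite case the argument is a union bound over the N events combined with Hoeffding's inequality (indicator functions, so b − a = 1):

```latex
P\Bigl\{ \max_{1\le k\le N} \bigl| R(\alpha_k) - R_{emp}(\alpha_k) \bigr| > \varepsilon \Bigr\}
\;\le\; \sum_{k=1}^{N} P\bigl\{ \bigl| R(\alpha_k) - R_{emp}(\alpha_k) \bigr| > \varepsilon \bigr\}
\;\le\; 2N e^{-2\varepsilon^{2} l} \;\xrightarrow[l\to\infty]{}\; 0 .
```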
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: Given a sample {z1, z2, …, zl}, for each α we have a binary vector: q(α) = (Q(z1,α), …, Q(zl,α)) Each q(α) is a vertex of the l-dimensional hypercube.
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: N(z1,…,zl) is the number of distinct vertices; we have N(z1,…,zl) ≤ 2^l. Define the random entropy H(z1,…,zl) = ln N(z1,…,zl). The entropy of the function set is then defined as:
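Completing the definition (the expectation is over i.i.d. samples drawn from F(z)):

```latex
H(l) = E\, H(z_1,\dots,z_l) = E \ln N(z_1,\dots,z_l).
```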
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Consider a set of functions Q such that |Q(z,α)| ≤ C. Similar to the indicator case, given a sample Z = z1,…,zl, for each α the vector q(α) = (Q(z1,α), …, Q(zl,α)) is a point in a hypercube.
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Define N(ε; z1,…,zl) to be the number of vectors in the minimal ε-net of the set of vectors q(α) (as α varies). The random ε-entropy of Q(z,α) is defined as: H(ε; z1,…,zl) = ln N(ε; z1,…,zl) The ε-entropy is defined as:
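Completing the definition, with the expectation again over samples, and the corresponding uniform-convergence condition:

```latex
H_{\Lambda}(\varepsilon; l) = E\, H(\varepsilon; z_1,\dots,z_l)
= E \ln N(\varepsilon; z_1,\dots,z_l),
\qquad
\lim_{l\to\infty} \frac{H_{\Lambda}(\varepsilon; l)}{l} = 0 \quad \forall\, \varepsilon > 0 .
```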
ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations
ERM Consistency Conditions of One-Sided Convergence:
ERM Consistency Three milestones in learning theory: For pattern recognition (sets of indicator functions), we have: Entropy: H(l) = E ln N(z1,…,zl) Annealed Entropy: Hann(l) = ln E N(z1,…,zl) Growth Function: G(l) = ln sup over z1,…,zl of N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l)
ERM Consistency Three milestones in learning theory: First milestone: - sufficient condition for consistency: Second milestone: - sufficient condition for fast convergence rate: Third milestone: sufficient and necessary condition for consistency for any measure and fast convergence rate:
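The three conditions, stated as limits (following Vapnik):

```latex
\lim_{l\to\infty} \frac{H(l)}{l} = 0 \ \text{(consistency)}; \qquad
\lim_{l\to\infty} \frac{H_{ann}(l)}{l} = 0 \ \text{(fast convergence rate)}; \qquad
\lim_{l\to\infty} \frac{G(l)}{l} = 0 \ \text{(consistency for any measure, with fast rate)}.
```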
ERM Generalization Bounds Non-asymptotic results: Consistency is an asymptotic result; it tells us neither the speed of convergence nor the confidence in the results of ERM.
ERM Generalization Bounds Non-asymptotic results: Consider the finite case, when Q(z,α) contains only N (indicator) functions. In this case (using Chernoff’s inequality), ERM is consistent and converges fast.
ERM Generalization Bounds Non-asymptotic results: with probability 1-η: With probability 1-2η:
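The bounds themselves were images; a standard reconstruction of the finite-case bound (N indicator functions, via Hoeffding plus a union bound; exact constants vary across formulations) is: with probability at least 1-η, simultaneously for all N functions,

```latex
R(\alpha) \;\le\; R_{emp}(\alpha) + \sqrt{\frac{\ln N - \ln \eta}{2l}} .
```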
ERM Generalization Bounds Indicator Functions: - Distribution Dependent
ERM Generalization Bounds Indicator Functions: - Distribution Dependent with probability 1-η: With probability 1-2η: Where:
ERM Generalization Bounds Indicator Functions: Distribution Independent Reminder: Growth Function: G(l) = ln sup over z1,…,zl of N(z1,…,zl) We have: H(l) ≤ Hann(l) ≤ G(l) G(l) does not depend on the distribution, so if we substitute G(l) for H(l), we get distribution-free bounds on the generalization error.
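A hedged reconstruction of the key growth-function result (Vapnik–Chervonenkis, closely related to Sauer's lemma): either every sample size is shattered, or beyond the VC dimension h the growth function becomes logarithmic:

```latex
G(l) = l \ln 2 \quad (l \le h),
\qquad
G(l) \le h \Bigl( \ln \frac{l}{h} + 1 \Bigr) \quad (l > h).
```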
ERM Generalization Bounds Indicator Functions: VC dimension
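The definition on this slide was an image; the VC dimension h of a set of indicator functions Q(z,α) is the largest number of points that can be shattered:

```latex
h = \max \Bigl\{\, l \;:\; \sup_{z_1,\dots,z_l} N(z_1,\dots,z_l) = 2^{l} \Bigr\},
```

with h = ∞ if samples of every size can be shattered.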
ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions
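The linear example can be made concrete: halfplane indicator functions in R^n have VC dimension n + 1, so lines in the plane shatter 3 points but never 4. A small brute-force sketch (the candidate family of lines is an assumption that happens to suffice for these grid points; this is not a general separability test):

```python
from itertools import product

def separates(w, b, points, labels):
    """True if sign(w . x + b) strictly matches the labels on every point."""
    for (x, y), lab in zip(points, labels):
        s = w[0] * x + w[1] * y + b
        if s == 0 or (s > 0) != lab:
            return False
    return True

def count_realizable(points):
    """Count labelings of `points` realizable by some line from a small
    candidate family (chosen to be sufficient for these grid points)."""
    ws = [(1, 0), (0, 1), (1, 1), (1, -1),
          (-1, 0), (0, -1), (-1, -1), (-1, 1)]
    bs = [k / 2 for k in range(-4, 5)]  # offsets -2.0 .. 2.0 in steps of 0.5
    n = 0
    for labels in product([False, True], repeat=len(points)):
        if any(separates(w, b, points, labels) for w, b in product(ws, bs)):
            n += 1
    return n

three = [(0, 0), (0, 1), (1, 0)]            # 3 points in general position
four = [(0, 0), (0, 1), (1, 0), (1, 1)]     # 4 corners of the unit square
```

All 8 labelings of the 3 points are realizable (they are shattered), but of the 16 labelings of the square corners only 14 are: the two XOR labelings cannot be separated by any line, so no 4 points are shattered and h = 3 for lines in the plane.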
ERM Generalization Bounds Indicator Functions: VC dimension Example: