In the Name of God
Statistical Learning Theory
Bounds on the Rate of Convergence of Learning Processes (Chapter 3)
Supervisor: Dr. Shiry
By: L. Pour Mohammad Bagher
Author: Vladimir N. Vapnik
Introduction
In this chapter:
• We consider upper bounds on the rate of uniform convergence. (Lower bounds are less important than upper bounds for controlling learning processes.)
Introduction
Bounds on the rate of convergence:
• Distribution-dependent bounds (based on the annealed entropy function)
• Distribution-independent bounds (based on the growth function)
Both kinds of bounds are nonconstructive. To make them constructive we use:
• The VC dimension of a set of functions (a scalar value that can be evaluated for any set of functions)
• Constructive distribution-independent bounds
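THE BASIC INEQUALITIES
• $Q(z,\alpha),\ \alpha \in \Lambda$: a set of indicator functions
• $H^{\Lambda}(\ell)$: the corresponding VC entropy
• $H^{\Lambda}_{\mathrm{ann}}(\ell)$: the annealed entropy
• $G^{\Lambda}(\ell)$: the growth function
• Theorems 3.1 and 3.2 (the basic inequalities in the theory of bounds):
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \le 4\exp\left\{\left(\frac{H^{\Lambda}_{\mathrm{ann}}(2\ell)}{\ell} - \varepsilon^{2}\right)\ell\right\} \quad (3.1)$$
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}} > \varepsilon\right\} \le 4\exp\left\{\left(\frac{H^{\Lambda}_{\mathrm{ann}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4}\right)\ell\right\} \quad (3.2)$$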
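THE BASIC INEQUALITIES
• The bounds are nontrivial (for any $\varepsilon > 0$ the right-hand side tends to zero as $\ell \to \infty$) if
$$\lim_{\ell\to\infty}\frac{H^{\Lambda}_{\mathrm{ann}}(\ell)}{\ell} = 0.$$
(In Chapter 2 we called this condition the second milestone of learning theory.)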
THE BASIC INEQUALITIES
• Theorem 3.1 estimates the rate of uniform convergence with respect to the norm of the deviation between probability and frequency.
• The maximal difference occurs for the events with maximal variance.
• In this (Bernoulli) case the variance of the event $A_\alpha = \{z : Q(z,\alpha) = 1\}$ is $\sigma^{2} = P(A_\alpha)\,(1 - P(A_\alpha))$, so the maximum of the variance is achieved for events with probability $P(A_\alpha) = 1/2$.
• In other words, the largest deviations are associated with functions that possess large risk.
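The claim about the Bernoulli variance can be checked numerically. The following is a minimal sketch; the function name and the grid of probabilities are illustrative choices, not part of the original slides.

```python
def bernoulli_variance(p: float) -> float:
    """Variance p * (1 - p) of an indicator event that occurs with probability p."""
    return p * (1.0 - p)

# Scan a grid of probabilities and locate the one with the largest variance.
grid = [k / 100 for k in range(101)]
p_star = max(grid, key=bernoulli_variance)
print(p_star, bernoulli_variance(p_star))
```

On this grid the maximum sits at p = 1/2, which is why the absolute deviation in Theorem 3.1 is dominated by events of probability close to one half.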
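THE BASIC INEQUALITIES
• Theorem 3.2 considers relative uniform convergence. (We will obtain a bound on the risk where the confidence interval is determined by the rate of uniform convergence.)
• The uniform relative convergence concerns the quantity
$$\sup_{\alpha\in\Lambda}\frac{P(A_\alpha) - \nu(A_\alpha)}{\sqrt{P(A_\alpha)}},$$
where $P(A_\alpha)$ is the probability of the event $A_\alpha = \{z : Q(z,\alpha) = 1\}$ and $\nu(A_\alpha)$ is its frequency in the sample.
• The upper bound on the risk obtained using Theorem 3.2 is much better than the upper bound on the risk obtained on the basis of Theorem 3.1.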
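THE BASIC INEQUALITIES
• The bounds obtained in Theorems 3.1 and 3.2 are distribution-dependent.
• To construct distribution-independent bounds it is sufficient to note that for any distribution function $F(z)$ the growth function is not less than the annealed entropy: $H^{\Lambda}_{\mathrm{ann}}(\ell) \le G^{\Lambda}(\ell)$.
• Therefore, for any distribution function $F(z)$:
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \le 4\exp\left\{\left(\frac{G^{\Lambda}(2\ell)}{\ell} - \varepsilon^{2}\right)\ell\right\} \quad (3.3)$$
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}} > \varepsilon\right\} \le 4\exp\left\{\left(\frac{G^{\Lambda}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4}\right)\ell\right\} \quad (3.4)$$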
THE BASIC INEQUALITIES
• These inequalities are nontrivial if
$$\lim_{\ell\to\infty}\frac{G^{\Lambda}(\ell)}{\ell} = 0 \quad (3.5)$$
(the necessary and sufficient condition for distribution-free uniform convergence).
• If condition (3.5) is violated, then there exist probability measures $F(z)$ on $Z$ for which uniform convergence does not take place.
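Generalization for the set of real functions
• Let $A \le Q(z,\alpha) \le B$, $\alpha\in\Lambda$, be a set of real functions, with $A = \inf_{\alpha,z} Q(z,\alpha)$ and $B = \sup_{\alpha,z} Q(z,\alpha)$.
• Let us construct a set of indicator functions by $I(z,\alpha,\beta) = \theta\{Q(z,\alpha) - \beta\}$, $\alpha\in\Lambda$, $\beta\in(A,B)$, where $\theta(u)$ is the step function.
• When the $Q(z,\alpha)$ are themselves indicator functions, the set of indicators $I(z,\alpha,\beta)$ coincides with this set of functions.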
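Generalization
In the generalization we distinguish three cases:
• Totally bounded functions
• Totally bounded nonnegative functions
• Nonnegative (not necessarily bounded) functions
The bounds for these cases are nontrivial if
$$\lim_{\ell\to\infty}\frac{H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(\ell)}{\ell} = 0,$$
where $H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(\ell)$ is the annealed entropy of the set of indicators $I(z,\alpha,\beta)$.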
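Generalization
• Totally bounded functions ($A \le Q(z,\alpha) \le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \le 4\exp\left\{\left(\frac{H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{(B-A)^{2}}\right)\ell\right\} \quad (3.6)$$
• Totally bounded nonnegative functions ($0 \le Q(z,\alpha) \le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}} > \varepsilon\right\} \le 4\exp\left\{\left(\frac{H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4B}\right)\ell\right\} \quad (3.7)$$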
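Generalization
• Nonnegative functions
Let $0 \le Q(z,\alpha)$, $\alpha\in\Lambda$, be a set of functions such that for some $p > 2$ the $p$th normalized moments of the random variables $\xi_\alpha = Q(z,\alpha)$ exist:
$$m_p(\alpha) = \left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}.$$
Therefore:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}} > a(p)\,\varepsilon\right\} \le 4\exp\left\{\left(\frac{H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4}\right)\ell\right\} \quad (3.8)$$
where
$$a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}.$$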
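Distribution-independent bounds
• The above bounds were distribution-dependent.
• To obtain distribution-independent bounds one replaces the annealed entropy $H^{\Lambda,\mathcal{B}}_{\mathrm{ann}}(\ell)$ with the growth function $G^{\Lambda,\mathcal{B}}(\ell)$.
• The resulting inequalities are nontrivial if
$$\lim_{\ell\to\infty}\frac{G^{\Lambda,\mathcal{B}}(\ell)}{\ell} = 0.$$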
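Distribution-independent bounds
• For the set of totally bounded functions ($A \le Q(z,\alpha) \le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)\right| > \varepsilon\right\} \le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{(B-A)^{2}}\right)\ell\right\}$$
• For the set of nonnegative totally bounded functions ($0 \le Q(z,\alpha) \le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}} > \varepsilon\right\} \le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4B}\right)\ell\right\}$$
• For the set of nonnegative real functions whose $p$th normalized moment exists for some $p > 2$:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z) - \frac{1}{\ell}\sum_{i=1}^{\ell} Q(z_i,\alpha)}{\left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}} > a(p)\,\varepsilon\right\} \le 4\exp\left\{\left(\frac{G^{\Lambda,\mathcal{B}}(2\ell)}{\ell} - \frac{\varepsilon^{2}}{4}\right)\ell\right\}$$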
Bounds on the generalization ability of learning machines
• What actual risk $R(\alpha_\ell)$ is provided by the function $Q(z,\alpha_\ell)$ that achieves minimal empirical risk $R_{\mathrm{emp}}(\alpha_\ell)$?
• How close is this risk to the minimal possible one, $\inf_{\alpha\in\Lambda} R(\alpha)$, for the given set of functions?
• We use the notation
$$\mathcal{E} = 4\,\frac{G^{\Lambda,\mathcal{B}}(2\ell) - \ln(\eta/4)}{\ell};$$
the bounds are nontrivial when $\mathcal{E} < 1$.
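Describe distribution-independent bounds (another form)
• For the set of totally bounded functions ($A \le Q(z,\alpha) \le B$), with $\mathcal{E}$ as in the notation introduced above:
• With probability at least $1 - \eta$, simultaneously for all functions of $Q(z,\alpha)$, $\alpha\in\Lambda$:
$$R_{\mathrm{emp}}(\alpha) - \frac{B-A}{2}\sqrt{\mathcal{E}} \le R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \frac{B-A}{2}\sqrt{\mathcal{E}}$$
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha) \le (B-A)\sqrt{\frac{-\ln\eta}{2\ell}} + \frac{B-A}{2}\sqrt{\mathcal{E}}$$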
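Describe distribution-independent bounds (another form)
• For the set of totally bounded nonnegative functions ($0 \le Q(z,\alpha) \le B$), with $\mathcal{E}$ as in the notation introduced above:
• With probability at least $1 - \eta$, simultaneously for all functions $Q(z,\alpha) \le B$, $\alpha\in\Lambda$:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4R_{\mathrm{emp}}(\alpha)}{B\mathcal{E}}}\right)$$
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha) \le B\sqrt{\frac{-\ln\eta}{2\ell}} + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4}{\mathcal{E}}}\right)$$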
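Describe distribution-independent bounds (another form)
• For the set of unbounded nonnegative functions ($0 \le Q(z,\alpha)$), we are given a pair $(p, \tau)$ such that
$$\sup_{\alpha\in\Lambda}\frac{\left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)} \le \tau < \infty, \qquad p > 2.$$
• With probability at least $1 - \eta$, simultaneously for all functions $Q(z,\alpha)$, $\alpha\in\Lambda$:
$$R(\alpha) \le \frac{R_{\mathrm{emp}}(\alpha)}{\left(1 - a(p)\,\tau\sqrt{\mathcal{E}}\right)_{+}}, \qquad a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}},$$
where $(u)_{+} = \max(u, 0)$.
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$\frac{R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha)}{\inf_{\alpha\in\Lambda} R(\alpha)} \le \frac{\tau\,a(p)\sqrt{\mathcal{E}}}{\left(1 - \tau\,a(p)\sqrt{\mathcal{E}}\right)_{+}} + O\!\left(\frac{1}{\ell}\right)$$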
The structure of the growth function
• To make the above bounds constructive one has to find a way to evaluate the annealed entropy and/or the growth function for the given set of functions.
• We will find constructive bounds by using the concept of the VC dimension of the set of functions.
• There is a remarkable connection between the concept of VC dimension and the growth function.
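The structure of the growth function
• Theorem: Any growth function either satisfies the equality
$$G^{\Lambda}(\ell) = \ell\ln 2$$
or is bounded, for $\ell > h$, by the inequality
$$G^{\Lambda}(\ell) \le h\left(\ln\frac{\ell}{h} + 1\right),$$
where $h$ is an integer such that $G^{\Lambda}(h) = h\ln 2$ and $G^{\Lambda}(h+1) < (h+1)\ln 2$.
• Definition: We will say that the VC dimension of the set of indicator functions $Q(z,\alpha)$, $\alpha\in\Lambda$, is infinite if the growth function for this set of functions is linear. The VC dimension is finite and equals $h$ if the corresponding growth function is bounded by a logarithmic function with coefficient $h$.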
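The two regimes of the growth function can be illustrated numerically. This is a minimal sketch that assumes the standard form of the bound from the cited book, $G(\ell) = \ell\ln 2$ for $\ell \le h$ and $G(\ell) \le h(\ln(\ell/h) + 1)$ for $\ell > h$. The function name `growth_upper_bound` and the sample values are illustrative only.

```python
import math

def growth_upper_bound(l: int, h: int) -> float:
    """Upper bound on the growth function G(l) for a set with VC dimension h."""
    linear = l * math.log(2)          # value when all l points can be shattered
    if l <= h:
        return linear
    # Beyond h the logarithmic bound applies; l * ln 2 always remains valid too.
    return min(linear, h * (math.log(l / h) + 1.0))

h = 10
for l in (5, 10, 100, 1000, 10000):
    g = growth_upper_bound(l, h)
    print(l, round(g, 2), round(g / l, 4))
```

The last column, the ratio of the bound to the sample size, shrinks toward zero once the sample size exceeds h, which is the condition that makes the distribution-independent bounds nontrivial.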
VC dimension
• The finiteness of the VC dimension of the set of indicator functions is a sufficient condition for consistency of the ERM method independent of the probability measure, and it implies a fast rate of convergence.
• It is also a necessary and sufficient condition for distribution-independent consistency of ERM learning machines.
• The VC dimension of a set of indicator functions is the maximum number $h$ of vectors $z_1, \ldots, z_h$ that can be separated into two classes in all $2^{h}$ possible ways using functions of the set (that is, the maximum number of vectors that can be shattered by the set).
• If for any $n$ there exists a set of $n$ vectors that can be shattered by the set of functions, then the VC dimension is equal to infinity.
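VC dimension
• The VC dimension of a set of real functions
Let $A \le Q(z,\alpha) \le B$, $\alpha\in\Lambda$, be a set of real functions bounded by constants $A$ and $B$. Consider the set of indicators
$$I(z,\alpha,\beta) = \theta\{Q(z,\alpha) - \beta\}, \quad \alpha\in\Lambda,\ \beta\in(A,B),$$
where $\theta(u)$ is the step function ($\theta(u) = 0$ if $u < 0$, $\theta(u) = 1$ if $u \ge 0$). The VC dimension of a set of real functions $Q(z,\alpha)$ is defined to be the VC dimension of the set of corresponding indicators with parameters $\alpha\in\Lambda$ and $\beta\in(A,B)$.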
VC dimension-Example
• The VC dimension of the set of linear indicator functions $Q(z,\alpha) = \theta\left\{\sum_{p=1}^{n}\alpha_p z_p + \alpha_0\right\}$ in $n$-dimensional coordinate space is equal to $h = n + 1$, since by using functions of this set one can shatter at most $n + 1$ vectors.
• The VC dimension of the set of linear functions $Q(z,\alpha) = \sum_{p=1}^{n}\alpha_p z_p + \alpha_0$ in $n$-dimensional coordinate space is also equal to $h = n + 1$, because the VC dimension of the corresponding linear indicator functions is equal to $n + 1$.
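For the simplest case $n = 1$ the value $h = n + 1 = 2$ can be verified by brute force. The sketch below enumerates every labeling that a one-dimensional linear indicator $\theta(wx + b)$ can produce on a set of distinct points. The helper names `realizable_labelings` and `is_shattered` are illustrative, and the enumeration relies on the assumption that in one dimension such an indicator is a half-line (or a constant).

```python
def realizable_labelings(points):
    """All 0/1 labelings of distinct 1-D points produced by theta(w*x + b)."""
    xs = sorted(points)
    # One candidate threshold below all points, one between each pair of
    # neighbors, and one above all points covers every distinct half-line.
    cuts = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    labelings = set()
    for t in cuts:
        for sign in (+1, -1):  # half-line pointing right or left
            labelings.add(tuple(1 if sign * (x - t) >= 0 else 0 for x in points))
    return labelings

def is_shattered(points):
    """True if all 2**len(points) labelings are realizable."""
    return len(realizable_labelings(points)) == 2 ** len(points)

print(is_shattered([0.0, 1.0]))       # two points
print(is_shattered([0.0, 1.0, 2.0]))  # three points
```

Two points admit all four labelings, while three points on a line cannot receive the alternating labeling, so no three points are shattered.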
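VC dimension-Example
• The VC dimension of the set of functions $f(z,\alpha) = \theta(\sin\alpha z)$, $\alpha\in R^{1}$, is infinite.
• The points on the line $z_1 = 10^{-1}, \ldots, z_\ell = 10^{-\ell}$ can be shattered by functions from this set. To separate these data into two classes determined by the sequence $\delta_1, \ldots, \delta_\ell$, $\delta_i\in\{0,1\}$, it is sufficient to choose the value of the parameter $\alpha$ to be
$$\alpha = \pi\left(\sum_{i=1}^{\ell}(1-\delta_i)\,10^{i} + 1\right).$$
• By choosing an appropriate coefficient $\alpha$ one can, for any number of appropriately chosen points, approximate the values of any function bounded by $(-1, +1)$ using $\sin\alpha x$.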
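The shattering construction for the sine family can be checked directly for a small number of points. This is a minimal sketch assuming the points $z_i = 10^{-i}$ and the parameter choice $\alpha = \pi\left(\sum_i (1-\delta_i)10^{i} + 1\right)$ from the cited book. The helper names are illustrative, and exact rational arithmetic is used to reduce the angle modulo $2\pi$ before calling `sin`.

```python
import math
from fractions import Fraction
from itertools import product

def alpha_over_pi(deltas):
    """Return alpha / pi = sum_i (1 - delta_i) * 10**i + 1 as an exact integer."""
    return sum((1 - d) * 10 ** i for i, d in enumerate(deltas, start=1)) + 1

def classify(deltas):
    """Labels theta(sin(alpha * z_i)) at the points z_i = 10**(-i)."""
    a = alpha_over_pi(deltas)
    labels = []
    for i in range(1, len(deltas) + 1):
        # alpha * z_i / pi, reduced modulo 2 exactly (sin has period 2*pi).
        half_turns = Fraction(a, 10 ** i) % 2
        labels.append(1 if math.sin(math.pi * float(half_turns)) >= 0 else 0)
    return tuple(labels)

l = 4
shattered = all(classify(d) == d for d in product((0, 1), repeat=l))
print(shattered)
```

Every one of the 16 target labelings of the four points is reproduced by a single parameter value, and the same construction works for any number of points, which is why a one-parameter family can have infinite VC dimension.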
VC dimension
• The VC dimension of a set of functions does not, in general, coincide with the number of parameters. It can be either larger or smaller than the number of parameters.
• In the following we present the bounds on the risk functional that are used in Chapter 4 for constructing methods of controlling the generalization ability of learning machines.
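Constructive distribution-independent bounds
• Consider sets of functions that possess a finite VC dimension $h$. For such sets the growth function satisfies
$$G^{\Lambda}(\ell) \le h\left(\ln\frac{\ell}{h} + 1\right), \quad \ell > h.$$
• Therefore, in all inequalities of the preceding section the following constructive expression can be used (in the case of finite VC dimension):
$$\mathcal{E} = 4\,\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln(\eta/4)}{\ell}$$
• We will also consider the case where the set of loss functions contains a finite number $N$ of elements; in that case
$$\mathcal{E} = 2\,\frac{\ln N - \ln\eta}{\ell}.$$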
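Because the constructive expression depends only on the sample size, the VC dimension, and the confidence level, it can be evaluated directly. The sketch below assumes the two forms given in the cited book, $\mathcal{E} = 4\,(h(\ln(2\ell/h) + 1) - \ln(\eta/4))/\ell$ for finite VC dimension $h$ and $\mathcal{E} = 2\,(\ln N - \ln\eta)/\ell$ for a finite set of $N$ functions. The function names and the numeric inputs are illustrative.

```python
import math

def capacity_term_vc(l: int, h: int, eta: float) -> float:
    """E for a set of functions with finite VC dimension h (requires l > h)."""
    return 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l

def capacity_term_finite(l: int, n_functions: int, eta: float) -> float:
    """E for a set containing a finite number of functions."""
    return 2.0 * (math.log(n_functions) - math.log(eta)) / l

eta = 0.05
for l in (100, 1000, 10000, 100000):
    print(l, round(capacity_term_vc(l, h=10, eta=eta), 4),
          round(capacity_term_finite(l, n_functions=1000, eta=eta), 4))
```

For a fixed VC dimension the term only drops below 1, the point at which the bounds become nontrivial, once the sample size is large relative to h.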
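Constructive distribution-independent bounds
• For the set of totally bounded functions ($A \le Q(z,\alpha) \le B$), with $\mathcal{E}$ the constructive expression:
• With probability at least $1 - \eta$, simultaneously for all functions $Q(z,\alpha)$, $\alpha\in\Lambda$:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \frac{B-A}{2}\sqrt{\mathcal{E}}$$
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha) \le (B-A)\sqrt{\frac{-\ln\eta}{2\ell}} + \frac{B-A}{2}\sqrt{\mathcal{E}}$$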
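Constructive distribution-independent bounds
• For the set of totally bounded nonnegative functions ($0 \le Q(z,\alpha) \le B$), with $\mathcal{E}$ the constructive expression:
• With probability at least $1 - \eta$, simultaneously for all functions $Q(z,\alpha) \le B$, $\alpha\in\Lambda$:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4R_{\mathrm{emp}}(\alpha)}{B\mathcal{E}}}\right)$$
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha) \le B\sqrt{\frac{-\ln\eta}{2\ell}} + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4}{\mathcal{E}}}\right)$$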
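The difference between the bound for totally bounded functions and the bound for nonnegative bounded functions is easiest to see with numbers. The sketch below assumes the forms from the cited book, $R \le R_{\mathrm{emp}} + \frac{B-A}{2}\sqrt{\mathcal{E}}$ and $R \le R_{\mathrm{emp}} + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + 4R_{\mathrm{emp}}/(B\mathcal{E})}\right)$. The function names, the value of $\mathcal{E}$, and the empirical risks are illustrative, not results from the slides.

```python
import math

def bound_bounded(r_emp: float, e: float, a: float = 0.0, b: float = 1.0) -> float:
    """Risk bound R_emp + (B - A)/2 * sqrt(E) for losses in [A, B]."""
    return r_emp + 0.5 * (b - a) * math.sqrt(e)

def bound_nonnegative(r_emp: float, e: float, b: float = 1.0) -> float:
    """Risk bound R_emp + (B*E/2) * (1 + sqrt(1 + 4*R_emp/(B*E))) for losses in [0, B]."""
    return r_emp + 0.5 * b * e * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / (b * e)))

e = 0.04  # hypothetical value of the capacity term
for r_emp in (0.0, 0.05, 0.2):
    print(r_emp, round(bound_bounded(r_emp, e), 4), round(bound_nonnegative(r_emp, e), 4))
```

When the empirical risk is zero the confidence term of the second bound is of order $\mathcal{E}$ rather than $\sqrt{\mathcal{E}}$, which is where the improvement from relative uniform convergence shows; the advantage shrinks as the empirical risk grows.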
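Constructive distribution-independent bounds
• For the set of unbounded nonnegative functions ($0 \le Q(z,\alpha)$) with a given pair $(p, \tau)$, $p > 2$, and with $\mathcal{E}$ the constructive expression:
• With probability at least $1 - \eta$, simultaneously for all functions $Q(z,\alpha)$, $\alpha\in\Lambda$:
$$R(\alpha) \le \frac{R_{\mathrm{emp}}(\alpha)}{\left(1 - a(p)\,\tau\sqrt{\mathcal{E}}\right)_{+}}, \qquad a(p) = \sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}$$
• With probability at least $1 - 2\eta$, for the function $Q(z,\alpha_\ell)$ that minimizes the empirical risk:
$$\frac{R(\alpha_\ell) - \inf_{\alpha\in\Lambda} R(\alpha)}{\inf_{\alpha\in\Lambda} R(\alpha)} \le \frac{\tau\,a(p)\sqrt{\mathcal{E}}}{\left(1 - \tau\,a(p)\sqrt{\mathcal{E}}\right)_{+}} + O\!\left(\frac{1}{\ell}\right)$$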
References
• Vapnik, Vladimir N., "The Nature of Statistical Learning Theory", 2000