CSI5388: Functional Elements of Statistics for Machine Learning Part I


Contents of the Lecture • Part I (this set of lecture notes): Definitions and Preliminaries; Hypothesis Testing: Parametric Approaches • Part II (the next set of lecture notes): Hypothesis Testing: Non-Parametric Approaches; Power of a Test; Statistical Tests for Comparing Multiple Classifiers

Definitions and Preliminaries I • A Random Variable is a function that assigns a unique numerical value to each possible outcome of a random experiment under fixed conditions. • If X takes on N values x1, x2, ..., xN, such that each xi ∈ R, then: • The Mean of X is μ = (1/N) Σ xi • The Variance is σ^2 = (1/N) Σ (xi – μ)^2 • The Standard Deviation is σ = sqrt(σ^2)

Definitions and Preliminaries II • Sample Variance: s^2 = (1/(N – 1)) Σ (xi – X)^2, where X is the sample mean • Sample Standard Deviation: s = sqrt(s^2)
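The definitions on the two slides above can be checked numerically. The following is a minimal sketch, assuming the usual convention that the variance of X divides by N while the sample variance divides by N - 1. The data values are hypothetical and not taken from the lecture.

```python
import math
import statistics

# Hypothetical data: accuracies of one classifier on five test folds.
x = [0.80, 0.85, 0.78, 0.90, 0.82]
n = len(x)

mean = sum(x) / n

# Variance of X: divide the sum of squared deviations by N.
variance = sum((xi - mean) ** 2 for xi in x) / n
sd = math.sqrt(variance)

# Sample variance: divide by N - 1 instead (assumed convention).
sample_variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
sample_sd = math.sqrt(sample_variance)

# The standard library implements both conventions.
lib_variance = statistics.pvariance(x)        # divides by N
lib_sample_variance = statistics.variance(x)  # divides by N - 1

print(f"mean={mean:.4f} variance={variance:.6f} sample variance={sample_variance:.6f}")
```

The sample variance is always at least as large as the N-divisor variance computed on the same data, since only the divisor differs.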

Hypothesis Testing • Generalities • Sampling Distributions • Procedure • One- versus Two-tailed tests • Parametric approaches

Generalities • Purpose: If we assume a given sampling distribution, we want to establish whether or not a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data. • Approaches to Hypothesis Testing: There are two different approaches to hypothesis testing: Parametric and Non-Parametric.

Sampling Distributions • Definition: The sampling distribution of a statistic (for example, the mean, the median, or any other description/summary of a data set) is the distribution of values obtained for that statistic over all possible samples of the same size from a given population. • Note: Since the populations under study are usually infinite or, at least, very large, the true sampling distribution is usually unknown. Therefore, rather than being derived exactly, it has to be estimated. Nonetheless, we can do so quite well, especially when considering the mean of the data.

Procedure I • Idea: If we assume a given sampling distribution, we want to establish whether or not a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data. • Example: If a sample mean we obtain on a particular data sample is representative of the sampling distribution, then we can conclude that our data sample is representative of the whole population. If not, it means that the values in our sample are unrepresentative. (Perhaps this sample contained data that were particularly easy or particularly difficult to classify.)

Procedure II • State your research hypothesis • Formulate a null hypothesis stating the opposite of your research hypothesis. In particular, the null hypothesis regards the relationship between the sampling statistics of the basic population and the sample result you obtained from your specific set of data. • Collect your specific data and compute the statistic’s sample result on it. • Calculate the probability of obtaining the sample result you obtained if the sample emanated from the data set that gave you the original sample statistic. • If this probability is low, reject the null hypothesis, and state that the sample you considered does not emanate from the data set that gave you the original sample statistic.

One- and Two-Tailed Tests • If H0 is expressed as an equality, then there are two ways to reject H0. Either the statistic computed from your sample at hand is lower than the sampling statistics or it is higher. If you are only concerned about either lower or higher statistics, then you should perform a one-tailed test. If you are simultaneously concerned about the two ways in which H0 can be rejected, then you should perform a two-tailed test.

Parametric Approaches to Hypothesis Testing • The classical approach to hypothesis testing is parametric. This means that, in order to be applied, this approach makes a number of assumptions regarding the distribution of the population and the available sample. • Non-parametric approaches, discussed later, do not make these strong assumptions, although they do make some assumptions of their own, as will be discussed there.

Why are Hypothesis Tests often applied to means? • Hypothesis tests are often applied to means. The reason is that, unlike for other statistics, the standard deviation of the mean is known and simple to calculate. • Having access to this standard deviation is essential: without it, hypothesis testing could not be performed, since the probability that the sample under consideration emanates from the population represented by the original sampling statistics is linked to this standard deviation.

Why is the standard deviation of the mean easy to calculate? • Because of the important Central Limit Theorem, which states that no matter how your original population is distributed, if you use large enough samples, then the sampling distribution of the mean of these samples approaches a normal distribution. If the mean of the original population is μ and its standard deviation is σ, then the mean of the sampling distribution is μ and its standard deviation is σ/sqrt(N), where N is the size of each sample.
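The claim on the slide above can be checked by simulation. The sketch below is an illustration only: the exponential parent population, the sample size N = 50, and the number of trials are all assumptions chosen for the example, not values from the lecture.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

N = 50         # size of each sample (assumed for illustration)
TRIALS = 5000  # number of samples drawn (assumed for illustration)

# Skewed parent population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

# Draw many samples of size N and record the mean of each one.
sample_means = []
for _ in range(TRIALS):
    sample = [random.expovariate(1.0) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.pstdev(sample_means)
predicted_sd = sigma / math.sqrt(N)

print(f"mean of the sample means: {observed_mean:.3f} (predicted {mu:.3f})")
print(f"sd of the sample means:   {observed_sd:.3f} (predicted {predicted_sd:.3f})")
```

Even though the parent population is strongly skewed, the spread of the sample means should land close to σ/sqrt(N).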

When is the sampling distribution of the mean Normal? • The sample size necessary for the sampling distribution of the mean to approach normal depends on the distribution of the parent population. • If the parent population is normal, then the sampling distribution of the mean is also normal. • If the parent population is not normal, but symmetrical and uni-modal, then the sampling distribution of the mean will be approximately normal, even for small sample sizes. • If the population is very skewed, then sample sizes of at least 30 will be required for the sampling distribution of the mean to be approximately normal.

How are hypothesis tests set up? t-tests • Hypothesis Tests are used to find out whether a sample mean comes from a sampling distribution with a specified mean. • We will consider: • One-sample t-tests (σ known; σ unknown) • Two-sample t-tests (two matched samples; two independent samples)

One-sample t-test: σ known • If σ is known, we can use the central limit theorem to obtain the sampling distribution of this population’s mean (its mean is μ and its standard deviation is σ/sqrt(N)). • Letting X be the mean of our data sample, we compute z = (X – μ)/(σ/sqrt(N)) (1) • Using the z-table, we find the probability of obtaining a z at least as large as the value computed. We output this probability if we are solely interested in a one-tailed test, and double it before outputting it if we are interested in a two-tailed test. • If this output probability is smaller than .05, we reject H0 at the .05 level of significance. Otherwise, we state that we have no evidence to conclude that H0 does not hold.
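The steps on the slide above can be sketched as follows. This is a minimal illustration: the function name `z_test` and the numbers are hypothetical, and the standard normal CDF from Python's `statistics.NormalDist` stands in for the z-table.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu, sigma, n, two_tailed=True):
    """Return (z, p) for equation (1); hypothetical helper for illustration."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    # Probability of a z at least as extreme as the one observed, in one tail.
    # Using abs(z) assumes the observed direction is the one being tested.
    p_one_tail = 1.0 - NormalDist().cdf(abs(z))
    p = 2.0 * p_one_tail if two_tailed else p_one_tail
    return z, p

# Hypothetical numbers: population mean 0.80, known sigma 0.10,
# and a sample of 25 cases with mean 0.86.
z, p = z_test(sample_mean=0.86, mu=0.80, sigma=0.10, n=25)
reject_h0 = p < 0.05

print(f"z = {z:.2f}, two-tailed p = {p:.4f}, reject H0 at .05: {reject_h0}")
```

Passing `two_tailed=False` returns the undoubled probability, which corresponds to the one-tailed case described on the slide.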

What are the meaning and purpose of z? • Normal distributions can all be easily mapped into a single one, using a specific transformation. • This means that, in our hypothesis tests, we can use the same information about the sampling distribution over and over (if we assume that our population is normally distributed), no matter what the mean and variance of our actual population are. • Any observation X can be changed into a standard score, z, with respect to mean = 0 and standard deviation = 1, as follows: z = (X – mean)/sd
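As a small illustration of the transformation on the slide above, the sketch below standardizes a hypothetical list of observations and confirms that the resulting scores have mean 0 and standard deviation 1. The helper name `standardize` and the data are assumptions made for the example.

```python
import math
import statistics

def standardize(values, mean, sd):
    """Map each observation to its standard score z = (X - mean) / sd."""
    return [(v - mean) / sd for v in values]

# Hypothetical observations with mean 5 and standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = statistics.fmean(values)
sd = statistics.pstdev(values)

z_scores = standardize(values, mean, sd)

print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```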

One-sample t-test: σ unknown • In most situations, σ, the standard deviation of the population, is unknown. In this case, we replace σ by s, the sample standard deviation, in equation (1), yielding t = (X – μ)/(s/sqrt(N)) (2) • Because s is likely to under-estimate σ and thus return a t-value larger than z would have been had σ been known, it is inappropriate to use the distribution of z to accept or reject the null hypothesis. • Instead, we use Student’s t distribution, which corrects for this problem, and compare t to the t-table with N-1 degrees of freedom. We then proceed as we did for z on the slide about σ known, above.
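Equation (2) can be sketched in a few lines. The helper name `one_sample_t` and the data below are hypothetical. Since the Python standard library has no t-table, the critical value 2.776 (two-tailed, .05 level, 4 degrees of freedom) is typed in by hand from a standard t-table.

```python
import math
import statistics

def one_sample_t(data, mu):
    """Return (t, degrees of freedom) for equation (2)."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (N - 1 divisor)
    t = (statistics.fmean(data) - mu) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical data: five accuracy scores tested against mu = 0.80.
data = [0.80, 0.85, 0.78, 0.90, 0.82]
t, df = one_sample_t(data, mu=0.80)

T_CRIT = 2.776  # t-table value for df = 4, two-tailed, .05 level
reject_h0 = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df}, reject H0 at .05: {reject_h0}")
```

With these numbers, t stays below the critical value, so the sample gives no evidence against H0 at the .05 level.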

What are the meaning and purpose of t? • t follows the same principle as z, except that t should be used when the standard deviation is unknown. • t, however, represents a family of curves rather than a single curve. The shape of the t distribution changes from sample size to sample size. • As the sample size grows larger and larger, t looks more and more like a normal distribution.
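The convergence of t toward the normal curve can be seen numerically. The sketch below uses the standard closed-form density of Student's t distribution, which is not given in the slides, and compares its height at 0 with that of the standard normal for a few illustrative degrees of freedom.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_ratio = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    scale = math.exp(log_ratio) / math.sqrt(df * math.pi)
    return scale * (1 + t * t / df) ** (-(df + 1) / 2)

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Gap between the two curves at their peak, for growing degrees of freedom.
dfs = (1, 5, 30, 100)
gaps = [abs(t_pdf(0.0, df) - normal_pdf(0.0)) for df in dfs]

for df, gap in zip(dfs, gaps):
    print(f"df = {df:3d}: |t_pdf(0) - normal_pdf(0)| = {gap:.4f}")
```

The gap shrinks as the degrees of freedom grow, which matches the last bullet on the slide.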

Assumption of the t-test with σ unknown • Note that one assumption is made in the use of the t-test: we assume that the sample was drawn from a normally distributed population. • This is required because Student’s derivation of t was based on the assumption that the sample mean and the sample variance are independent, an assumption that is true in the case of a normal distribution. • In practice, however, the assumption about the distribution from which the sample was drawn can be lifted whenever the sample size is sufficiently large to produce a normal sampling distribution of the mean. In general, n = 25 or 30 (number of cases in a sample) is sufficiently large. Often, it can be smaller than that.

Two-sample t-tests: matched samples • Given two matched populations, we want to test whether the difference in means between these two populations is significant or not. We do so by computing the mean, D, and the standard deviation, SD, of the paired differences between the two samples, and comparing D to a mean of 0. • We can then apply the t-test as we did above, in the case where σ was unknown. • This time, we have t = (D – 0)/(SD/sqrt(n)) (3) • We use the t-table as before, with n-1 degrees of freedom, and the same assumptions about the normality of the distribution.
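Equation (3) can be sketched as below, reading D as the mean of the paired differences and SD as their sample standard deviation. The helper name `paired_t` and the two score lists are hypothetical, for instance two classifiers evaluated on the same five folds.

```python
import math
import statistics

def paired_t(a, b):
    """Return (t, degrees of freedom) for equation (3) on matched samples."""
    if len(a) != len(b):
        raise ValueError("matched samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_mean = statistics.fmean(diffs)  # D in equation (3)
    d_sd = statistics.stdev(diffs)    # SD in equation (3)
    t = (d_mean - 0) / (d_sd / math.sqrt(n))
    return t, n - 1

# Hypothetical matched scores of two classifiers on the same folds.
scores_a = [0.82, 0.85, 0.79, 0.88, 0.84]
scores_b = [0.80, 0.82, 0.78, 0.85, 0.80]

t, df = paired_t(scores_a, scores_b)
print(f"t = {t:.3f}, df = {df}")
```

The resulting t is then compared with the t-table entry for n-1 degrees of freedom, exactly as in the one-sample case.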

Two-sample t-tests: independent samples • This time, we are interested in comparing two populations with different means and variances. The two populations are completely independent. • We can again apply the t-test, with the same conditions applying, using the formula: t = (X1 – X2)/sqrt((s1^2/n1) + (s2^2/n2)), where X1 and X2 are the two sample means, s1^2 and s2^2 the two sample variances, and n1 and n2 the two sample sizes.
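The statistic for independent samples can be sketched as follows. The helper name `independent_t` and the data are hypothetical. The slide does not state which degrees of freedom to use with this formula, so the sketch stops at the statistic itself.

```python
import math
import statistics

def independent_t(a, b):
    """t statistic for two independent samples with separate variances."""
    n1, n2 = len(a), len(b)
    var1 = statistics.variance(a)  # sample variance s1^2
    var2 = statistics.variance(b)  # sample variance s2^2
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error

# Hypothetical scores from two independent groups of different sizes.
group_1 = [0.82, 0.85, 0.79, 0.88, 0.84]
group_2 = [0.80, 0.82, 0.78, 0.85, 0.80, 0.81]

t = independent_t(group_1, group_2)
print(f"t = {t:.3f}")
```

Unlike the matched case, the two samples need not have the same length, because each contributes its own variance and sample size.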

Confidence Intervals • Sample means represent point estimates of the mean parameter. Here, we are interested in interval estimates, which tell us how large or small the true value of μ could be without causing us to reject H0, given that we ran a t-test on the mean of our sample. • To calculate these intervals, we simply take the equations presented on the previous slides and express them in terms of μ, as a function of t. • We then substitute for t the two-tailed value we are interested in from the t-table. This value can be positive or negative, meaning that we will obtain two values for μ: μupper and μlower. This gives us the limits of the confidence interval. • The confidence interval means that μ has a certain probability (attached to the value of t chosen) of belonging to this interval. The larger the interval, the greater the probability that μ is included. Conversely, the smaller the interval, the smaller the probability that it is included.
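For the one-sample case with σ unknown, solving t = (X – μ)/(s/sqrt(N)) for μ gives μ = X ± t·s/sqrt(N). The sketch below applies this to hypothetical data. The helper name `confidence_interval` is an assumption, and the critical value 2.776 (two-tailed, .05 level, 4 degrees of freedom) is typed in by hand from a standard t-table.

```python
import math
import statistics

def confidence_interval(data, t_crit):
    """Return (mu_lower, mu_upper) from mu = X +/- t_crit * s / sqrt(N)."""
    n = len(data)
    centre = statistics.fmean(data)
    half_width = t_crit * statistics.stdev(data) / math.sqrt(n)
    return centre - half_width, centre + half_width

# Hypothetical data: five accuracy scores.
data = [0.80, 0.85, 0.78, 0.90, 0.82]
T_CRIT = 2.776  # t-table value for df = 4, two-tailed, .05 level

mu_lower, mu_upper = confidence_interval(data, T_CRIT)
print(f"95% interval for mu: [{mu_lower:.4f}, {mu_upper:.4f}]")
```

Any hypothesized μ inside this interval, such as 0.80, would not be rejected by the corresponding two-tailed t-test at the .05 level.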
