Lecture # 2: MATHEMATICAL STATISTICS
Plan of the lecture
• Key concepts of mathematical statistics
• Methods of point estimation of the statistical characteristics of a random variable by a sample
• Method of confidence interval estimation of the statistical characteristic of a random variable by a sample
• Testing of statistical hypotheses
• Correlation dependence between random variables. Regression
• Estimating the regression function by a sample
• Method of estimating the correlation dependence between qualitative (not quantitative) attributes
Mathematical statistics is the science of mathematical methods for analyzing, systematizing, and using statistical data to solve scientific and practical problems.

Key Concepts of Mathematical Statistics
1) An assembly is a collection of objects (assembly elements) having at least one common property.
2) The assembly size is the number of assembly elements (N or n).
Two Types of Assemblies:
• Entire assembly
• Sample assembly (sample)
• The entire assembly (population) is the largest assembly uniting all the elements having at least one common property (attribute).
• The sample assembly (sample) is the part of the entire assembly selected for studying.
For example, the students of our university have at least one common property – "student" – so they form an entire assembly. Suppose we want to study their height. We randomly choose 100 of them; these students form a sample assembly.
Conclusions drawn from studying a sample are valuable if they apply not only to the elements studied, but also to all elements with the common property, i.e. to all elements of the entire assembly.
• The entire assembly usually includes a very large, and often infinitely large, volume of data, making it impossible to study the entire assembly directly. Therefore, only part of the entire assembly is studied.
The results of studying the properties of the sample must reflect the properties of the entire assembly.
• For this the sample must be REPRESENTATIVE, that is, satisfy two conditions:
1 – objects for the sample are selected randomly (in a random manner);
2 – the sample size is sufficiently large.
When studying an assembly, we must not simply present all the data in a table, but give numerical characteristics that describe the properties of the assembly.
• Numerical parameters characterising assemblies are known as statistical characteristics of the assembly: M(X) – mathematical expectation, D(X) – variance, σ(X) – standard deviation.
When studying a sample it is impossible to determine exactly the values of the statistical characteristics of the entire assembly, but these values can be estimated with greater or lesser accuracy.
• The values obtained when studying a sample assembly are used instead of the true values of the statistical characteristics of the entire assembly. They are known as estimates (sample estimates) of the statistical characteristics.
For the entire assembly the values of the statistical characteristics are called TRUE VALUES: M(X), D(X), σ(X).
• For the sample assembly they are called ESTIMATED VALUES: $\bar{x}$, $S^2(X)$, $S(X)$.
Estimation of statistical characteristics by a sample can be carried out by two methods:
1. Method of Point Estimation of the statistical characteristic
2. Method of Confidence Interval Estimation of the statistical characteristic
Method of Point Estimation of the statistical characteristics of a random variable by a sample

1) The optimal sample estimate of the mathematical expectation of random variable X is the sample mean $\bar{x}$:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

where X is a random variable; $x_1, x_2, \ldots, x_n$ are the observed values of X; and n is the sample size.
2) The optimal sample estimate of the variance of random variable X, $S^2(X)$:

$S^2(X) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$
3) The optimal sample estimate of the standard deviation of variable X, S(X):

$S(X) = \sqrt{S^2(X)}$

4) The error in mean, $m_x$:

$m_x = \frac{S(X)}{\sqrt{n}}$
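These point estimates can be computed directly. A minimal Python sketch (the height data are hypothetical, purely for illustration):

```python
import math

# Hypothetical sample: heights (cm) of n randomly chosen students
sample = [172.0, 168.5, 181.2, 175.4, 169.9, 177.3]
n = len(sample)

# 1) Sample mean: estimate of the mathematical expectation M(X)
mean = sum(sample) / n

# 2) Sample variance S^2(X) (note the n - 1 divisor) and
# 3) sample standard deviation S(X)
s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s2)

# 4) Error in mean: m_x = S(X) / sqrt(n)
m_x = s / math.sqrt(n)

print(f"mean = {mean:.2f}, S^2 = {s2:.2f}, S = {s:.2f}, m_x = {m_x:.2f}")
```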
2. Method of Confidence Interval Estimation of the statistical characteristic of a random variable by a sample

The estimate of a statistical characteristic found from the sample is a random value, and its deviation from the true value of the statistical characteristic can be indefinitely large. Therefore an interval can be determined within which the true value falls with a certain acceptable probability α. Such an interval is known as the confidence interval.
The confidence interval for a statistical characteristic is a random interval which covers the true value of the characteristic with the given probability α.
• The boundaries of the confidence interval are determined entirely by the results of trials, and are therefore also random values.
• Probability α is known as the confidence probability.
• Probability p = 1 − α is known as the significance level. In medical and biological studies it is usually taken that α = 0.95; p = 1 − 0.95 = 0.05.
Confidence interval for the mathematical expectation M(X) (if the random variable has a normal distribution)
• 1) If the variance is estimated by the sample fairly accurately (n ≥ 30), the sample variance estimate can be accepted as the true variance value, i.e. the quantity D(X) can be considered known. Then:
$\bar{x} - t(\alpha)\,\frac{\sigma(X)}{\sqrt{n}} < M(X) < \bar{x} + t(\alpha)\,\frac{\sigma(X)}{\sqrt{n}}$

where $\bar{x}$ is the sample mean; n is the sample size; t(α) is the argument of the Laplace function. If α = 0.95, then t(α) = 1.96.
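As a quick illustrative sketch (hypothetical numbers; this assumes n ≥ 30, so the sample estimate S is accepted as σ(X)):

```python
import math

mean, s, n = 174.1, 6.3, 100   # hypothetical sample mean, SD estimate, size
t_alpha = 1.96                 # argument of the Laplace function for alpha = 0.95

half_width = t_alpha * s / math.sqrt(n)
print(f"M(X) lies in ({mean - half_width:.2f}; {mean + half_width:.2f}) "
      f"with probability 0.95")
```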
2) If variance D(X) is unknown and n < 30:

$\bar{x} - t(\alpha, k)\,\frac{S(X)}{\sqrt{n}} < M(X) < \bar{x} + t(\alpha, k)\,\frac{S(X)}{\sqrt{n}}$

where S is the sample estimate of the standard deviation; k is the number of degrees of freedom (k = n − 1); t(α, k) is Student's coefficient (from the table).
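A sketch of the same interval for a small sample, looking up Student's coefficient with scipy instead of the printed table (data hypothetical):

```python
import math
from scipy import stats

sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2]   # hypothetical sample, n < 30
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

alpha = 0.95                          # confidence probability (lecture's notation)
k = n - 1                             # degrees of freedom
t = stats.t.ppf((1 + alpha) / 2, k)   # two-sided Student's coefficient t(alpha, k)

half = t * s / math.sqrt(n)
print(f"M(X) lies in ({mean - half:.3f}; {mean + half:.3f})")
```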
Testing of Statistical Hypotheses
• Point and interval estimation of the statistical characteristics of the assemblies studied is the most common initial stage of statistical data analysis. The next stage consists in stating and testing hypotheses about the assemblies studied. Testing of statistical hypotheses is a large and significant branch of mathematical statistics.
• Let us consider a simple problem of this kind, viz. testing the hypothesis on the validity of the difference of the mean values of two sample assemblies.
This problem often occurs in practice when sampling is done under different conditions. For example, one sample contains measurements of a quantity affected by a certain factor, whereas the other was not affected. It is necessary to find out whether the observed differences are purely random or due to the different sampling conditions.
• In other words, it is necessary to find out whether the samples studied belong to one entire assembly or to different ones. In the former case, the differences between the sample means are purely random. In the latter case, the difference of the sample means is valid (significant).
To answer the question posed, the following procedure is applied. We compute the values

$k = n_1 + n_2 - 2$

$T = \frac{\left|\bar{x}_1 - \bar{x}_2\right|}{\sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{k}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$

where k is the number of degrees of freedom; n1 and n2 are the sizes of the samples compared; S1 and S2 are the sample estimates of the standard deviations of the first and second samples respectively; and $\bar{x}_1$ and $\bar{x}_2$ are the sample means. Then, using the table of Student's coefficients, for the computed value k and the specified significance level p, we find the value of Student's coefficient t. If T > t, the difference of the sample means is valid: it cannot be accounted for by random factors alone, and the samples stem from different entire assemblies. If T < t, the difference is invalid.
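A sketch of this test in Python, using the standard pooled-variance form of T written above (samples are hypothetical; the table lookup is replaced by scipy):

```python
import math
from scipy import stats

x1 = [12.1, 11.8, 12.6, 12.3, 11.9]          # hypothetical sample 1
x2 = [11.2, 11.5, 11.0, 11.6, 11.3, 11.1]    # hypothetical sample 2
n1, n2 = len(x1), len(x2)

mean1, mean2 = sum(x1) / n1, sum(x2) / n2
s1_sq = sum((v - mean1) ** 2 for v in x1) / (n1 - 1)
s2_sq = sum((v - mean2) ** 2 for v in x2) / (n2 - 1)

k = n1 + n2 - 2                               # degrees of freedom
pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / k
T = abs(mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

t_crit = stats.t.ppf(1 - 0.05 / 2, k)         # Student's coefficient for p = 0.05
print("difference is valid" if T > t_crit else "difference is invalid")
```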
If the sizes of both samples are large and approximately equal, a simpler formula can be used to compute T:

$T = \frac{\left|\bar{x}_1 - \bar{x}_2\right|}{\sqrt{m_1^2 + m_2^2}}$

where m1 and m2 are the errors in mean for the first and second samples respectively. We emphasise that the method of determining the validity of the difference of two sample means considered here is strictly valid only when variables X1 and X2 have a normal distribution.
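For the large-sample shortcut, only the two means and their errors in mean are needed (numbers hypothetical):

```python
import math

mean1, m1 = 174.1, 0.63    # sample mean and error in mean, sample 1
mean2, m2 = 171.8, 0.58    # sample mean and error in mean, sample 2

T = abs(mean1 - mean2) / math.sqrt(m1 ** 2 + m2 ** 2)
print(f"T = {T:.2f}; compare with t = 1.96 for p = 0.05")
```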
CORRELATION DEPENDENCE BETWEEN RANDOM VARIABLES. REGRESSION

If we consider random variables, the relationship between them need not be functional. For example, children grow taller with age, i.e. there is an objective dependence between the height and the age of children. At the same time, this dependence is not functional, since children of the same age have significantly different heights. Clearly, the height at a given age is a continuous random variable having a certain distribution. In this case, the relationship between the variables consists in that each value of one random variable corresponds to a certain distribution law of the other variable. Here we refer not to the unconditional probability density of the random variable, but to its conditional probability density.
The conditional probability density of variable Y, f(Y/x), is the probability density of Y at the given value of variable X.
• If there exist conditional probability densities of variables Y and X, f(Y/x) and φ(X/y), then a correlation dependence is said to exist between variables X and Y.
If the probability density of random variable Y depends on the value of random variable X, then the expectation of variable Y also depends on this value, so we can speak of the conditional expectation of random variable Y at the given value of variable X, M(Y/x). Hence, the conditional expectation of variable Y is a function of variable X, or in mathematical form

$M(Y/x) = \Psi(x)$

where the function Ψ(x) is known as the regression function of Y on X. The graph of the regression function is known as the regression line. The constant factors in the mathematical expression for Ψ(x) are known as the regression coefficients. The regression function of X on Y is introduced similarly: if

$M(X/y) = \xi(y)$

then the function ξ(y) is the regression function of X on Y. In the majority of cases, the regression line of Y on X and that of X on Y are different lines.
Estimating the Regression Function by a Sample
• A strict definition of the regression function involves studying the entire assembly, which is practically impossible. Therefore, an important task is to estimate the regression function from experimental data, i.e. by a sample.
• Let there be a sample of n elements, for each of which the values of random variables Y and X are defined, it being assumed that there is a correlation dependence between these variables.
• If we plot the points with coordinates (xi, yi) (i = 1, 2, …, n) on the coordinate plane XOY, we obtain the so-called correlation field.
• A visual study of the correlation field can be a basis for selecting an appropriate analytical expression for the regression function.
To define an optimal analytical expression for the regression function means finding the values of the regression coefficients.
• The more complex the analytical expression selected for the regression function, and the more coefficients it contains, the more involved the task of estimating these coefficients from the sample. The task of computing the regression coefficients is simplest in the case of a linear regression function.
• The correlation field points almost never fall exactly along a straight line. Therefore, when selecting a linear function as the regression function, it is necessary first to substantiate the assumption that the regression function is linear.
Using experimental data, one finds the sample estimate of the correlation coefficient (the sample correlation coefficient):

$R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}$

• where R is the sample correlation coefficient.
• The values of the sample correlation coefficient lie in the interval −1 ≤ R ≤ 1.
• If R > 0, the regression functions of Y on X and of X on Y are increasing functions; if R < 0, these functions are decreasing.
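A direct Python sketch of the sample correlation coefficient (the paired data are hypothetical):

```python
import math

# Hypothetical paired observations (x_i, y_i), i = 1..n
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                sum((y - y_bar) ** 2 for y in ys))

R = num / den   # sample correlation coefficient, -1 <= R <= 1
print(f"R = {R:.3f}")
```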
The closer the value of │R│ to unity, the more closely the correlation field points cluster about a straight line, giving more reason to consider the regression function linear. In this case we speak of a strong correlation dependence.
• The closer the value of │R│ to zero, the more loosely the correlation points cluster about a straight line, giving less reason to consider the regression function linear. At the same time, a small value of the correlation coefficient by no means implies the absence of a correlation dependence between variables Y and X; it implies only that there is no reason to consider this dependence linear.
• Therefore, the correlation coefficient is a measure of the linearity of the dependence between random variables, not a measure of the degree of dependence between these variables in general.
At a large correlation coefficient, a linear function, i.e. the dependence y = ax + b, can be used to describe the regression function. To determine this function, the regression coefficients should be estimated. Optimal estimates of the regression coefficients are obtained by the least squares method. The essence of the least squares method is that the best estimates of the regression coefficients for the function y = Ψ(x) are considered to be those for which the sum

$\sum_{i=1}^{n}\bigl(y_i - \Psi(x_i)\bigr)^2$

takes the least value. For the particular case of linear regression of the kind y = ax + b, the values of coefficients a and b are found by minimising the sum

$\Phi(a, b) = \sum_{i=1}^{n}\left(y_i - a x_i - b\right)^2$
For this, the partial derivatives of this expression with respect to variables a and b are set equal to zero, and the resulting system of equations is solved. As a result, we obtain the sample estimates of the regression coefficients a and b:

$a = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \bar{y} - a\bar{x}$

The latter expressions can be rearranged as

$a = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = R\,\frac{S(Y)}{S(X)}, \qquad b = \bar{y} - a\bar{x}$
In the case of regression of X on Y, the regression function has the form x = a1 y + b1, and the regression coefficients a1 and b1 are found from the formulas

$a_1 = R\,\frac{S(X)}{S(Y)}, \qquad b_1 = \bar{x} - a_1\bar{y}$

Note that the regression lines of Y on X and of X on Y coincide only if │R│ = 1. In this case, there is a linear functional relation between variables Y and X.
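A sketch of both least-squares fits on the same hypothetical data as above, including the │R│ = 1 remark:

```python
import math

# Hypothetical paired observations
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Regression of Y on X: y = a*x + b
a = sxy / sxx
b = y_bar - a * x_bar

# Regression of X on Y: x = a1*y + b1
a1 = sxy / syy
b1 = x_bar - a1 * y_bar

R = sxy / math.sqrt(sxx * syy)
print(f"y = {a:.3f}x + {b:.3f};  x = {a1:.3f}y + {b1:.3f};  R = {R:.3f}")
# The two regression lines coincide only when |R| = 1 (then a * a1 == 1)
```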
The method described above for studying the relations between the characteristics (attributes) of a sample is applicable only to quantitative attributes. At the same time, it is often necessary to investigate the relations between characteristics of other types, whose values are not expressed quantitatively.
• Let us consider a method of estimating the correlation dependence when at least one of the attributes is not quantitative. One way to define a qualitative (not quantitative) attribute for the sample elements is to compare (rank) them according to the principle "more or less".
In this case we compute the rank correlation coefficient (Spearman correlation coefficient) by the formula

$\rho = 1 - \frac{6\sum_{i=1}^{n}(x_i - y_i)^2}{n(n^2 - 1)}$

where xi is the rank of the i-th sample element by one attribute and yi is the rank of the same element by the other attribute (i = 1, 2, …, n; n is the sample size). The Spearman correlation coefficient is a measure of the correlation dependence between the attributes: the greater the modulus of the rank correlation coefficient, the closer the relationship.
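A sketch of the Spearman coefficient on hypothetical ranks:

```python
# Hypothetical ranks of n sample elements by two qualitative attributes
x_ranks = [1, 2, 3, 4, 5, 6]
y_ranks = [2, 1, 4, 3, 6, 5]
n = len(x_ranks)

d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))   # Spearman rank correlation coefficient
print(f"rho = {rho:.3f}")               # larger |rho| => closer relationship
```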