RESEARCH METHODOLOGY

RESEARCH METHODOLOGY RESULT AND ANALYSIS (part 2)

HYPOTHESIS TESTING A hypothesis • is a conjecture about a population parameter. This conjecture may or may not be true. • An educated guess based on theory and background information • A proposed explanation for a phenomenon. Hypothesis Testing is a process of using sample data and statistical procedures to decide whether to reject or not reject a hypothesis (statement) about a population parameter value.

Examples • Whether seat belts will reduce the severity of injuries caused by accidents • Whether the public prefers a certain colour of fabric lining • Whether adding a chemical will improve water quality • The average life expectancy for men in the next decade will be more than 100 years • Education increases income

Education increases income • This states a positive relationship between the concepts "education" and "income." • In this abstract or conceptual form the hypothesis cannot be tested. First, it must be operationalized, that is, situated in the real world by rules of interpretation. • To test the hypothesis, the abstract concepts of education and income must be given measurable definitions. Education could be measured by "years of school completed" or "highest degree completed", etc. Income could be measured by "hourly rate of pay" or "yearly salary", etc.

Two types of statistical hypothesis • The Null Hypothesis: symbolised by Ho, states that there is no difference between a parameter and a specific value OR that there is no difference between two parameters. NULL means NO CHANGE. It is a statement of equality. • The Alternative Hypothesis: symbolised by Ha (or H1), states a specific difference between a parameter and a specific value OR states that there is a difference between two parameters. It is the TEST or Research Hypothesis.

Situation A: A researcher is interested in finding out whether a new medicine will have any undesirable side effects on the pulse rate of the patient: will the pulse rate increase, decrease or remain unchanged? Since the researcher knows the mean pulse rate of the population under study is 82 beats per minute, the hypotheses will be Ho : μ = 82 (remains unchanged) H1 : μ ≠ 82 (will be different) This is a two-tailed test since the possible effect could be to raise or lower the pulse rate.

Situation B: A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of an ordinary battery is 36 months. The hypotheses will be: Ho : μ ≤ 36 Ha : μ > 36 The chemist is interested only in increasing the lifespan of the battery. His alternative hypothesis is that the mean is larger than 36. Therefore the test is called right-tailed: it is concerned with an increase only.

Situation C: A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average monthly bill is RM100, his hypotheses will be Ho : μ ≥ RM 100 H1 : μ < RM 100 This is a left-tailed test since the contractor is only interested in reducing the bill.

General Procedure for testing the hypothesis. Can be done statistically. • Step 1: State the hypotheses (Ho and Ha). • Step 2: Formulate an analysis plan: select a level of significance (e.g. 0.1, 0.05 or 0.01), decide whether the test is one-tailed or two-tailed, and find the corresponding critical value. • Step 3: Analyze sample data to compute the test value. • Step 4: Interpret results, that is, decide whether or not to reject the hypothesis. If the test value falls in the critical region (for a right-tailed test, test value > critical value), reject Ho; otherwise do not reject Ho.
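The decision in Step 4 depends on whether the test is two-tailed, right-tailed or left-tailed. The following is a minimal Python sketch of that rule, assuming the test value is a z statistic (standard normal distribution). The names `critical_values` and `reject_null` are illustrative and do not come from the original notes.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def critical_values(alpha, tail):
    """Return (lower, upper) critical z values; None marks an unused side."""
    if tail == "two":
        z = STD_NORMAL.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if tail == "right":
        return (None, STD_NORMAL.inv_cdf(1 - alpha))
    if tail == "left":
        return (STD_NORMAL.inv_cdf(alpha), None)
    raise ValueError("tail must be 'two', 'right' or 'left'")


def reject_null(test_value, alpha, tail):
    """True when the test value falls in the critical (rejection) region."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and test_value < lower
    above = upper is not None and test_value > upper
    return below or above
```

With this sketch, `reject_null(2.13, 0.05, "two")` is true because 2.13 lies beyond the two-tailed critical value of about 1.96, while `reject_null(-2.13, 0.01, "left")` is false because the left-tailed critical value at 0.01 is about -2.33.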

Significant difference A significant difference occurs if the difference between the hypothesized (null) value and the sample statistic value is too large to be attributed to chance. A significant difference strongly suggests that the null hypothesis is not true. A significant difference at p<0.05 means that, if the null hypothesis were true, a difference at least this large would arise by chance less than 5% of the time.

TESTING THE DIFFERENCE AMONG MEANS AND VARIANCE Situations: • To compare the average lifetime of two different brands of tires • Two different brands of fertilizer, to test whether one is better than the other for growing plants • Two brands of cough syrup, to test whether one brand is more effective than the other

Problem 1: Two-Tailed Test Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is equally effective for men and women. To test this claim, they choose a simple random sample of 100 women and 200 men from a population of 100,000 volunteers. At the end of the study, 38% of the women caught a cold; and 51% of the men caught a cold. Based on these findings, can we reject the company's claim that the drug is equally effective for men and women? Use a 0.05 level of significance.

Solution: • State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis. Null hypothesis: P1 = P2 Alternative hypothesis: P1 ≠ P2 Note that these hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the proportion from population 1 is too big or if it is too small. • Formulate an analysis plan. For this analysis, the significance level is 0.05. The test method is a two-proportion z-test.

Analyze sample data. Using sample data, we calculate the pooled sample proportion (p) and the standard error (SE). Using those measures, we compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 * 200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] } = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt [0.003733] = 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women), p2 is the sample proportion in sample 2 (men), n1 is the size of sample 1, and n2 is the size of sample 2.

Since we have a two-tailed test, the P-value is the probability that the z-score is less than -2.13 or greater than 2.13. • We use the Normal Distribution Calculator to find P(z < -2.13) = 0.017, and P(z > 2.13) = 0.017. Thus, the P-value = 0.017 + 0.017 = 0.034. • Interpret results. Since the P-value (0.034) is less than the significance level (0.05), we reject the null hypothesis: the data do not support the claim that the drug is equally effective for men and women.
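The two-proportion z-test in Problem 1 can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name `two_proportion_z` is an assumption made for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1, n1, p2, n2):
    """Return (z, two-tailed P-value) using the pooled sample proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed: add the area in both tails beyond |z|.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Sample 1: women (38% of 100); sample 2: men (51% of 200).
z, p_value = two_proportion_z(0.38, 100, 0.51, 200)
print(round(z, 2), round(p_value, 3))
```

The unrounded calculation gives a P-value close to the 0.034 obtained from the table lookup; small differences come from rounding z to two decimals before looking it up. Either way the P-value is below 0.05.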

Problem 2: One-Tailed Test Suppose the previous example is stated a little bit differently. Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is more effective for women than for men. To test this claim, they choose a simple random sample of 100 women and 200 men from a population of 100,000 volunteers. At the end of the study, 38% of the women caught a cold; and 51% of the men caught a cold. Based on these findings, can we conclude that the drug is more effective for women than for men? Use a 0.01 level of significance.

Solution: • State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis. Null hypothesis: P1 >= P2 Alternative hypothesis: P1 < P2 Note that these hypotheses constitute a one-tailed test. The null hypothesis will be rejected if the proportion of women catching cold (p1) is sufficiently smaller than the proportion of men catching cold (p2). • Formulate an analysis plan. For this analysis, the significance level is 0.01. The test method is a two-proportion z-test.

Analyze sample data. Using sample data, we calculate the pooled sample proportion (p) and the standard error (SE). Using those measures, we compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 * 200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] } = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt [0.003733] = 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women), p2 is the sample proportion in sample 2 (men), n1 is the size of sample 1, and n2 is the size of sample 2.

Since we have a one-tailed test, the P-value is the probability that the z-score is less than -2.13. We use the Normal Distribution Calculator to find P(z < -2.13) = 0.017. Thus, the P-value = 0.017. • Interpret results. Since the P-value (0.017) is greater than the significance level (0.01), we cannot reject the null hypothesis.
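For the one-tailed version, only the lower tail contributes to the P-value. A short Python sketch of Problem 2, written as a plain script with illustrative variable names (not part of the original notes):

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.38, 100  # women who caught a cold
p2, n2 = 0.51, 200  # men who caught a cold
alpha = 0.01        # significance level for this problem

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Left-tailed test: only the area below z counts toward the P-value.
p_value = NormalDist().cdf(z)
reject = p_value < alpha
print(f"z = {z:.2f}, P = {p_value:.3f}, reject Ho: {reject}")
```

The same data that led to rejection at the 0.05 level in the two-tailed problem do not lead to rejection here, because the stricter 0.01 level requires a P-value smaller than the 0.017 observed.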

Commonly used Methods 1. z-test • For detecting a difference between two means for large samples (two samples) • Assumptions required • The samples must be independent, that is, there is no relationship between the subjects in the samples • The populations from which the samples are drawn must be normally distributed

Example problem Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean, that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low? Calculate the z-score.

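solution • We begin by calculating the standard error (SE) of the mean:
SE = sd / sqrt(n) = 12 / sqrt(55) = 1.62
• Next we calculate the z-score, which is the distance from the sample mean to the population mean in units of the standard error:
z = (sample mean - population mean) / SE = (96 - 100) / 1.62 = -2.47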
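The two steps of this solution can be checked with a few lines of Python. This is a sketch with illustrative variable names; the one-tailed P-value at the end is an extra step not asked for in the problem.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 100, 12      # regional mean and standard deviation
n, sample_mean = 55, 96  # school sample size and sample mean

se = sigma / sqrt(n)           # standard error of the mean
z = (sample_mean - mu) / se    # distance from mu in units of SE
p_lower = NormalDist().cdf(z)  # one-tailed probability P(Z < z)

print(f"SE = {se:.2f}, z = {z:.2f}")
```

A z-score this far below zero corresponds to a lower-tail probability of less than 0.01, which is why the school's scores would be called surprisingly low.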

problem
1. The mean and standard deviation of scores on a calculating test are 120 points and 18 points, respectively. Our interest is in the scores of 81 students in a particular school who received a mean score of 92. We can ask whether this mean score is significantly lower than the regional mean, that is, are the students in this school comparable to a simple random sample of 81 students from the region as a whole, or are their scores surprisingly low? Calculate the z-score.
2. Every year, 50,000 runners compete in the Peachtree Road Race. They run 10 kilometers (a little over 6 miles). The average finishing time is 55 minutes, with a standard deviation of 10 minutes. Fred and Wilma completed the race in 61 and 51 minutes, respectively. Barney and Betty had finishing times with z-scores of -0.3 and 0.7, respectively. List the runners in order, starting with the fastest runner and ending with the slowest runner.
(A) Wilma, Barney, Fred, Betty
(B) Barney, Wilma, Fred, Betty
(C) Wilma, Barney, Betty, Fred
(D) Betty, Fred, Barney, Wilma
(E) None of the above

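solution 1. Calculate the standard error (SE) of the mean:
SE = sd / sqrt(n) = 18 / sqrt(81) = 2
Next we calculate the z-score:
z = (sample mean - population mean) / SE = (92 - 120) / 2 = -14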

solution 2. The answer is A. This problem can be solved by converting Fred and Wilma's raw scores into z-scores. To do this, we use the z-score equation:
z = (x - M) / sd
where z is the z-score, x is the runner's raw score, M is the mean finishing time, and sd is the standard deviation of finishing times. Solving first for Fred's z-score, we get
z = (x - M) / sd = (61 - 55) / 10 = 0.60
Using the same approach to compute Wilma's z-score, we get
z = (x - M) / sd = (51 - 55) / 10 = -0.40
A lower finishing time means a faster runner, so ordering the z-scores from lowest to highest gives the runners from fastest to slowest: Wilma (z = -0.4), Barney (z = -0.3), Fred (z = 0.6), and Betty (z = 0.7).
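The ordering can be reproduced with a short Python sketch. The helper name `z_score` is illustrative; Barney's and Betty's z-scores are taken directly from the problem statement.

```python
mean_time, sd_time = 55, 10  # minutes


def z_score(x, mean, sd):
    """Standardize a raw score x against a mean and standard deviation."""
    return (x - mean) / sd


z_scores = {
    "Fred": z_score(61, mean_time, sd_time),
    "Wilma": z_score(51, mean_time, sd_time),
    "Barney": -0.3,  # given in the problem
    "Betty": 0.7,    # given in the problem
}

# Lower finishing time means faster, so sort z-scores in ascending order.
fastest_first = sorted(z_scores, key=z_scores.get)
print(fastest_first)
```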

problem • Each year, a national achievement test is administered to 3rd graders. The test has a mean score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on the test? (A) 82 (B) 88 (C) 100 (D) 112 (E) 118

solution • The correct answer is (E). From the z-score equation, we know
z = (x - M) / sd
where z is the z-score, x is the value of Jane's test score, M is the mean test score, and sd is the standard deviation of test scores. Solving for Jane's test score (x), we get
x = (z * sd) + M = (1.20 * 15) + 100 = 18 + 100 = 118
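Recovering a raw score from a z-score is the z-score equation solved for the score. A minimal Python sketch, with `raw_score` as an illustrative helper name:

```python
def raw_score(z, mean, sd):
    """Invert the z-score equation: score = z * sd + mean."""
    return z * sd + mean


jane = raw_score(1.20, mean=100, sd=15)
print(round(jane))
```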

2. F test • For the comparison of two variances or standard deviations, e.g. the variation in cholesterol level in men and women • Assumptions • The populations from which the samples were obtained must be normally distributed • Samples must be independent of each other

Example problem • Consider an experiment to study the effect of three different levels of a factor on a response (e.g. three levels of a fertilizer on plant growth). If we had 6 observations for each level, we could write the outcome of the experiment in a table like this, where a1, a2, and a3 are the three levels of the factor being studied.

solution Step 1: Calculate the mean within each group: group mean = (sum of the n values in the group) / n. Step 2: Calculate the overall mean: overall mean = (sum of the group means) / a, where a is the number of groups.

Step 3: Calculate the "between-group" sum of squares: • SB = n * Σ (group mean - overall mean)² = 84 • where n is the number of data values per group. • The between-group degrees of freedom is one less than the number of groups • fb = 3 − 1 = 2 • so the between-group mean square value is • MSB = 84 / 2 = 42

Step 4: Calculate the "within-group" sum of squares. Begin by centering the data in each group • The within-group sum of squares is the sum of squares of all 18 values in this table • SW = 1 + 9 + 1 + 0 + 4 + 1 + 1 + 9 + 0 + 4 + 9 + 1 + 9 + 1 + 1 + 4 + 9 + 4 = 68 • The within-group degrees of freedom is • fW = a(n − 1) = 3(6 − 1) = 15

Thus the within-group mean square value is • MSW = SW / fW = 68 / 15 ≈ 4.5 • Step 5: The F-ratio is • F = MSB / MSW ≈ 42 / 4.5 ≈ 9.3
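Steps 1 to 5 can be sketched in Python as follows. The data table is not reproduced in the text above, so the observations below are illustrative values chosen to be consistent with the sums quoted in the steps (SB = 84, SW = 68); they are an assumption, as is the function name `one_way_f`.

```python
def one_way_f(groups):
    """Return (SB, SW, F) for equally sized groups of observations."""
    a = len(groups)     # number of groups
    n = len(groups[0])  # observations per group
    means = [sum(g) / n for g in groups]  # Step 1: group means
    overall = sum(means) / a              # Step 2: overall mean
    # Step 3: between-group sum of squares and mean square
    sb = n * sum((m - overall) ** 2 for m in means)
    msb = sb / (a - 1)
    # Step 4: within-group sum of squares and mean square
    sw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    msw = sw / (a * (n - 1))
    # Step 5: F-ratio
    return sb, sw, msb / msw


# Illustrative data for levels a1, a2, a3 (assumed, see note above).
groups = [
    [6, 8, 4, 5, 3, 4],
    [8, 12, 9, 11, 6, 8],
    [13, 9, 11, 8, 7, 12],
]
sb, sw, f_ratio = one_way_f(groups)
print(sb, sw, round(f_ratio, 1))
```

The resulting F-ratio would then be compared with the critical value of the F distribution with 2 and 15 degrees of freedom at the chosen significance level.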

3. t-test • To test the difference between two means for small independent samples (n<30) • Assumptions • Samples must be independent • The populations are normally distributed

CORRELATION AND REGRESSION • Correlation is a statistical method used to determine whether a relationship between variables exists. Correlation studies the strength of the mutual relationship between two variables. In correlation we assume that the variables are random and that no dependence of one on the other is involved. • Regression describes the nature of the relationship between variables. Regression studies relationships in which dependence is necessarily involved: one variable depends on one or more other variables. Regression can be used for predicting the values of the variable which depends upon the other variables.

Linear and Non Linear Correlation • Linear Correlation: Correlation is said to be linear if the ratio of change is constant. For example, if the output of a factory doubles when the number of workers is doubled, the correlation is linear. Equivalently, if all the points on the scatter diagram tend to lie near a straight line, the correlation is said to be linear, as shown in the figure.

Non Linear (Curvilinear) Correlation: Correlation is said to be non linear if the ratio of change is not constant. Equivalently, if all the points on the scatter diagram tend to lie near a smooth curve, the correlation is said to be non linear (curvilinear), as shown in the figure.

Positive and Negative Correlation • Positive Correlation: Correlation in the same direction is called positive correlation. If one variable increases the other also increases, and if one variable decreases the other also decreases. For example, the length of an iron bar will increase as the temperature increases. • Negative Correlation: Correlation in opposite directions is called negative correlation. If one variable increases the other decreases, and vice versa. For example, the volume of a gas will decrease as the pressure increases, or the demand for a particular commodity increases as its price decreases. • No Correlation or Zero Correlation: If there is no relationship between the two variables, so that the value of one variable changes while the other remains constant, there is said to be no or zero correlation.

Perfect Correlation If any change in the value of one variable changes the value of the other variable in a fixed proportion, the correlation between them is said to be perfect. It is indicated numerically as +1 or -1. • Perfect Positive Correlation: If the values of both variables move in the same direction in a fixed proportion, the correlation is called perfect positive correlation. It is indicated numerically as +1. • Perfect Negative Correlation: If the values of both variables move in opposite directions in a fixed proportion, the correlation is called perfect negative correlation. It is indicated numerically as -1.

Coefficient of Correlation For sample data the correlation coefficient denoted by “r” is a measure of strength of the linear relation between X and Y variables, where “r” is a pure number and lies between -1 and +1.
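One common way to compute r is to divide the sum of products of the deviations of X and Y from their means by the square root of the product of the two sums of squared deviations. A minimal Python sketch follows; the function name `pearson_r` and the study and sleep values are hypothetical, chosen only so that the points lie exactly on a falling straight line.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)


study_hours = [2, 4, 6, 8]   # hypothetical values
sleep_hours = [10, 9, 8, 7]  # hypothetical values
r = pearson_r(study_hours, sleep_hours)
print(r)
```

Because every extra two hours of study here matches exactly one hour less sleep, these hypothetical values give a perfect negative correlation.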

Examples of Correlation • Calculate and analyze the correlation coefficient between the number of study hours and the number of sleeping hours of different students.

Solution: • The necessary calculation is given below: There is perfect negative correlation between the number of study hours and the number of sleeping hours.

Problem • From the following data, compute the coefficient of correlation between X and Y: Summation of products of deviations of X and Y series from their arithmetic means = 122.

Solution:

LINEAR REGRESSION • If the plot of n pairs of data (x , y) for an experiment appears to indicate a "linear relationship" between y and x, then the method of least squares may be used to write a linear relationship between x and y. • The least squares regression line for the set of n data points is given by y = ax + b where a and b are given by a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (1/n)(Σy - aΣx)

Example • Consider the following set of points: {(-2 , -1) , (1 , 1) , (3 , 2)} a) Find the least squares regression line for the given data points. b) Plot the given points and the regression line in the same rectangular system of axes.

Solutions a) Let us organize the data in a table.
x     y     xy    x²
-2    -1    2     4
1     1     1     1
3     2     6     9
Σx = 2    Σy = 2    Σxy = 9    Σx² = 14
We now use the above formula to calculate a and b as follows
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (3*9 - 2*2) / (3*14 - 2²) = 23/38
b = (1/n)(Σy - aΣx) = (1/3)(2 - (23/38)*2) = 5/19
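The same calculation can be sketched in Python. Exact fractions are used so that the results can be compared directly with 23/38 and 5/19; the function name `least_squares_line` is illustrative.

```python
from fractions import Fraction


def least_squares_line(points):
    """Return slope a and intercept b of the line y = ax + b."""
    n = len(points)
    sx = sum(Fraction(x) for x, _ in points)       # sum of x
    sy = sum(Fraction(y) for _, y in points)       # sum of y
    sxy = sum(Fraction(x) * y for x, y in points)  # sum of x*y
    sxx = sum(Fraction(x) * x for x, _ in points)  # sum of x squared
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b


a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
print(a, b)  # 23/38 5/19
```

The regression line is therefore y = (23/38)x + 5/19, which can be evaluated at any x to plot the line or to make a prediction.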

b) We now graph the regression line given by y = ax + b and the given points.

Problems
2 a) Find the least squares regression line for the following set of data {(-1 , 0),(0 , 2),(1 , 4),(2 , 5)} b) Plot the given points and the regression line in the same rectangular system of axes.
3 The values of x and their corresponding values of y are shown in the table below a) Find the least squares regression line y = ax + b. b) Estimate the value of y when x = 10.
4 The sales of a company (in million dollars) for each year are shown in the table below. a) Find the least squares regression line y = ax + b. b) Use the least squares regression line as a model to estimate the sales of the company in 2012.
