CHAPTER 16 FURTHER DATA ANALYSIS • 16.1 Introduction
16.2 FURTHER DATA ANALYSIS (MEASURED V ATTRIBUTE) • FDA is a procedure that enables a decision to be made, based on the sample evidence, between two viewpoints: • There is no relationship • There is a relationship • These statistical procedures are called hypothesis tests
Hypothesis • A statement about a population developed for the purpose of testing. • Hypothesis test • A procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement. • Four stages of a hypothesis test • Stage 1: Specifying the hypotheses. • Stage 2: Defining the test parameters and the decision rule. • Stage 3: Examining the sample evidence. • Stage 4: The conclusions.
FDA for Measured v Attribute requires two different hypothesis tests, depending on the attribute explanatory variable: • Test 1: exactly two levels of the attribute explanatory variable • Test 2: three or more levels of the attribute explanatory variable
16.3 HYPOTHESIS TEST 1: Measured Response v Attribute Explanatory Variable with exactly two levels • Illustrative Example • Response Variable: AMOUNT spent on clothes per month • Attribute Explanatory Variable: GENDER (Male/Female) • If Males and Females have the same 'spending on clothes' characteristics then the average amounts spent monthly by Males and by Females should be the same. • If Males and Females have different 'spending on clothes' characteristics then the average amounts spent monthly by Males and by Females would be different.
The total population can be split into sub-populations according to the level of the attribute: here, a population of Males and a population of Females.
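Stage 1: Specifying the hypotheses. • NULL HYPOTHESIS H0: the average amount spent by Males equals the average amount spent by Females (no relationship) • ALTERNATIVE HYPOTHESIS H1: the average amount spent by Males differs from the average amount spent by Females (there is a relationship)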
Stage 2: The Decision Rule • Results of IDA for Illustrative Example • Outcome 1 • Male Mean = £45 (Stand Dev = £20) • Female Mean = £55 (Stand Dev = £20) • Not enough evidence to form a clear judgement • FDA is required.
Outcome 2 • Male Mean = £45 (Stand Dev = £10) • Female Mean = £55 (Stand Dev = £10) • The widths of the boxes would lead to the decision from the I.D.A. that there is definitely a link.
Outcome 3 • Male Mean = £45 (Stand Dev = £40) • Female Mean = £55 (Stand Dev = £40) • The standard deviations are larger, so the boxes overlap even more than in Outcome 1 • FDA is required.
Measure of Relative Separation of the boxplots • Considers not only the MEANS but also the STANDARD DEVIATIONS of the two samples • Requires a "Threshold value" • If Measure of Relative Separation > Threshold value, • there is a connection • If Measure of Relative Separation < Threshold value, • there is no connection
Student's t Ratio (a measure of the relative separation of the boxplots) • Assumes the sample data are Normally distributed • Student's t-test • tcalc: the calculated value of the t-ratio
Bigger |tcalc| means larger separation • Ordering by |tcalc|: Outcome 2 > Outcome 1 > Outcome 3 • Set up the decision rule
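The ordering of the three outcomes can be checked with a short calculation. The sketch below uses the pooled-variance form of the t-ratio, which is an assumption because the slides do not give the formula, and a hypothetical sample size of 50 per group, which the slides also do not state. Only the means and standard deviations come from the illustrative example.

```python
from math import sqrt

def pooled_t_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t-ratio using a pooled variance (assumed form)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error

N = 50  # hypothetical sample size per group; not given in the slides

# Standard deviation shared by both groups in each outcome (from the slides)
outcome_sd = {"Outcome 1": 20, "Outcome 2": 10, "Outcome 3": 40}

# Male mean = 45, Female mean = 55 in every outcome
t_values = {
    name: pooled_t_ratio(45, sd, N, 55, sd, N)
    for name, sd in outcome_sd.items()
}

for name, t in t_values.items():
    print(f"{name}: tcalc = {t:.2f}")
```

With the same difference in means, halving the standard deviation doubles |tcalc|, which is why Outcome 2 shows the largest separation and Outcome 3 the smallest.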
Decision Rule • If the tcalc value lies within the range -tcrit to +tcrit then the decision rule flags H0, supporting the viewpoint that there is no relationship • If the tcalc value lies outside the range -tcrit to +tcrit then the decision rule flags H1, supporting the viewpoint that there is a relationship. • Value of tcrit • Depends upon the sample size, through a measure called Degrees of Freedom (DF) • Can be looked up in statistical tables.
The hypothesis test described above is called Student's t-test and is a two-tailed test using the 5% level of significance. • Formally, the level of significance may be defined as the chance the tester is prepared to take of coming to the wrong conclusion about H0 (rejecting H0 when it is in fact true).
Stage 3: Doing the calculations • If the tcalc value lies within the range -tTable to +tTable then the decision rule flags H0: there is no relationship • If the tcalc value lies outside the range -tTable to +tTable then the decision rule flags H1: there is a relationship
Stage 4: The conclusions • Stated in terms of the original business problem specification • For example: on the basis of the sample evidence there is evidence to suggest that there is a link between the amount spent on clothes and gender. Males spend on average about £45 per month and Females spend on average about £55.
Worked Example CREDIT • IDA
FDA • Stage 1: Define the hypotheses: • μ0: true average amount borrowed on credit by house owners • μ1: true average amount borrowed on credit by non house owners • H0: μ0 = μ1 (no relationship) • H1: μ0 ≠ μ1 (there is a relationship)
Stage 2: Defining the test parameters and the decision rule • Student’s t-test
Stage 3: Examining the sample evidence • MINITAB is used to do the calculations on the sample data • tTable = 1.96 • tcalc = -4.51 lies outside the range -1.96 to 1.96, so reject H0 and accept H1
Stage 4: The conclusions. • Based on the sample evidence there is a connection between Amount Borrowed on Credit and House-ownership. On average house owners borrow £869.50 and non house owners borrow £1009.00.
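The two-tailed decision rule can be written as a small function. This is a minimal sketch: the function name is hypothetical, and treating a value exactly equal to the critical value as supporting H0 is an assumption, since the slides only say "between". The values -4.51 and 1.96 are the ones quoted in the worked example.

```python
def t_test_decision(t_calc, t_crit):
    """Two-tailed rule: H0 if t_calc lies within [-t_crit, +t_crit], else H1."""
    if -t_crit <= t_calc <= t_crit:
        return "H0"  # no relationship
    return "H1"      # there is a relationship

# Values quoted in the CREDIT worked example
decision = t_test_decision(t_calc=-4.51, t_crit=1.96)
print(decision)
```

The sign of tcalc only reflects which group mean was subtracted from which; the rule depends on its magnitude.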
16.4 HYPOTHESIS TEST 2: • Measured Response v Attribute Explanatory Variable with three or more levels • For example • Response variable: amount spent in a supermarket • Explanatory variable: the customer's marital status, with four categories: Single, Married, Divorced, or Widowed • The common data analysis methodology applies and has the following three stages: • Initial Data Analysis • Further Data Analysis • Describing the Relationship
Example 1: • No evidence of a connection. • Example 2: • Some degree of separation • Measure of relative separation
Hypothesis Test: Four stages • Stage 1: Specifying the hypotheses. • Stage 2: Defining the test parameters and the decision rule. • Stage 3: Examining the sample evidence. • Stage 4: The conclusions.
Stage 1: Specifying the hypotheses. • By definition, if there is no connection then all the population means are equal, whilst if there is a connection at least one of the means must be different. • Null hypothesis H0: all the population means are equal • Alternative hypothesis H1: at least one of the population means is different
Stage 2: Defining the test parameters and the decision rule. • Decision rule: based on the F-Ratio. • Test procedure: One-way Analysis of Variance • ANalysis Of VAriance: ANOVA • Fcrit is the particular value of F that splits the area under the distribution in the proportions 95%/5%.
Decision rule • If the value of Fcalc is between 0 and Fcrit then conclude that there is no link • If the value of Fcalc is greater than Fcrit then conclude that on the basis of the sample evidence there is a link.
Stage 3: Examining the sample evidence • Example 1: • Fcalc would be small. • The F-Ratio is defined in such a way that if the null hypothesis is true, i.e. all the means are equal, then Fcalc is expected to be about 1. • Example 2: • Fcalc measures the relative separation • The wider the separation, the larger the Fcalc value
To find the Threshold Value, Fcrit • The F-Ratio has two degrees of freedom (they depend on the number of levels and the sample size) • Look up the statistical tables: Ftable • Suppose: • Fcalc = 8.91 • The degrees of freedom are (3, 80) • Then Ftable = 2.72
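To see how the F-Ratio responds to separation, the sketch below computes a one-way ANOVA F-Ratio from raw data using the standard between-group and within-group mean squares. The two small data sets are hypothetical and exist only to contrast a "no separation" case with a "clear separation" case; they are not taken from the examples in the slides.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F-Ratio: between-group mean square / within-group mean square."""
    k = len(groups)                           # number of attribute levels
    n = sum(len(g) for g in groups)           # total sample size
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)         # first degrees of freedom: k - 1
    ms_within = ss_within / (n - k)           # second degrees of freedom: n - k
    return ms_between / ms_within

# Hypothetical data: three levels with nearly identical means
close_groups = [[50, 52, 48, 51], [49, 51, 50, 52], [51, 49, 50, 48]]
# Same spread within each level, but the level means are shifted apart
separated_groups = [[50, 52, 48, 51], [54, 56, 55, 57], [61, 59, 60, 58]]

f_close = f_ratio(close_groups)
f_separated = f_ratio(separated_groups)
print(f"close: {f_close:.2f}, separated: {f_separated:.2f}")
```

The within-group spread is identical in both data sets, so the increase in the F-Ratio comes entirely from the wider separation of the level means.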
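Applied to these figures, the decision rule is a one-sided comparison. A minimal sketch follows, with a hypothetical function name and with the tabulated value 2.72 taken as given rather than recomputed:

```python
def anova_decision(f_calc, f_crit):
    """One-sided rule: a link is flagged only when f_calc exceeds f_crit."""
    return "link" if f_calc > f_crit else "no link"

# Values from the supermarket example: Fcalc = 8.91, Ftable = 2.72 for DF (3, 80)
decision = anova_decision(f_calc=8.91, f_crit=2.72)
print(decision)
```

Unlike the t-test, there is no lower critical value to check, because the F-Ratio cannot be negative.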
Stage 4: The conclusions. • Since the value of Fcalc (8.91) is larger than the value of Ftable (2.72), the conclusion is that, on the basis of the sample evidence, there is enough evidence to suggest that there is a link between the amount spent by customers in a supermarket and the customer's marital status. The remaining issue is to describe the connection.
Worked Example • CREDIT data scenario • Question: • Does the explanatory variable 'REGION' influence the response variable 'CREDIT'? • That is, is the amount borrowed on credit dependent upon the region of the country where the customer lives?
FDA • Stage 1:Specifying the hypotheses. • Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence • MINITAB: ANOVA, One-Way Analysis of Variance for CREDIT

Source   DF         SS       MS     F     P
REGION    4    3445125   861281  5.10   0.0
Error   649  109631953   168924
Total   653  113077078

• Ftable = 2.39 • Since Fcalc = 5.10 > Ftable = 2.39, the sample evidence is indicating a link between "Amount borrowed on credit" and "The region the customer lives in"

REGION       AMOUNT
SOUTH-WEST   £977.10
SOUTH-EAST   £958.40
LONDON       £1061.80
MIDLANDS     £898.10
NORTH        £864.30

• Stage 4: The conclusions • Examination of the average values shows London to be the region with the highest amount on credit, then the South-West and South-East with similar average credits, with the North having the lowest amount on credit.
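The entries in the MINITAB table are linked by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below re-derives MS and F from the SS and DF values quoted above. The variable names are illustrative, and the sample size of 654 is inferred from the Total DF of 653 rather than stated in the slides.

```python
# SS and DF values as quoted in the MINITAB output
ss_region, df_region = 3445125, 4
ss_error, df_error = 109631953, 649

# Degrees of freedom: 5 regions give 5 - 1 = 4; 654 observations give 654 - 5 = 649
n_regions = 5
n_obs = 654  # inferred from Total DF = 653
assert df_region == n_regions - 1
assert df_error == n_obs - n_regions

# Mean square = sum of squares / degrees of freedom
ms_region = ss_region / df_region
ms_error = ss_error / df_error

# F-Ratio = between-region mean square / error mean square
f_calc = ms_region / ms_error
print(f"MS(REGION) = {ms_region:.0f}, MS(Error) = {ms_error:.0f}, F = {f_calc:.2f}")
```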
Examine the diagram displaying the 95% confidence intervals for each level of the attribute variable • Interpretation: • The decision rule is that if the confidence intervals for two levels of the attribute don't overlap then there is a real difference in the means for those two levels. • For example, Region 3 (London) has an average amount on credit that is statistically significantly larger than the average amount on credit for Region 4 (the Midlands), because the two confidence intervals don't overlap.

          level 2        level 3        level 4        level 5
level 1   No Difference  No Difference  No Difference  No Difference
level 2                  No Difference  No Difference  No Difference
level 3                                 Difference     Difference
level 4                                                No Difference

• The final description of the link can be summarised as: the amount borrowed on credit in London is significantly higher than in the Midlands and the North.
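The non-overlap rule can be applied mechanically to every pair of regions. The sketch below uses the regional means quoted earlier, but the confidence intervals themselves are not given in the slides, so a common half-width of £70 is assumed for illustration. That figure is roughly what the error mean square of 168924 would give if the 654 customers were spread equally across the five regions, which is itself an assumption.

```python
from itertools import combinations

# Regional means from the worked example
region_means = {
    "SOUTH-WEST": 977.10,
    "SOUTH-EAST": 958.40,
    "LONDON": 1061.80,
    "MIDLANDS": 898.10,
    "NORTH": 864.30,
}

HALF_WIDTH = 70.0  # hypothetical 95% CI half-width; not given in the slides

def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

intervals = {
    region: (m - HALF_WIDTH, m + HALF_WIDTH)
    for region, m in region_means.items()
}

# Pairs whose intervals do not overlap are flagged as a real difference
different = [
    (r1, r2)
    for r1, r2 in combinations(intervals, 2)
    if not intervals_overlap(intervals[r1], intervals[r2])
]
print(different)
```

Under these assumed intervals only London v Midlands and London v North are flagged, which matches the pattern in the pairwise table; different interval widths could change which pairs are flagged.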