STAT131/171 W4L2 Modelling Variation: Introduction to modelling and GOF

STAT131/171 W4L2 Modelling Variation: Introduction to modelling and GOF, by Anne Porter (alp@uow.edu.au)

Activity: Let’s play Beat the Butcher • Morning radio, 6am-7am, weekdays • Contestant telephones in to play • Contestant has to say stop before the gong rings to win the meat • Radio personality reads the meat items (2 slices of scotch fillet, …, 3 kg mince) until the gong is reached

The list: Let’s play. All stand, I’ll read, and you sit when you have enough meat. Last ones standing before the gong win. 1) Three kilos scotch fillet 2) 1 chicken 3) 3 kilos of sausages 4) 12 chicken kebabs 5) 12 lamb kebabs 6) 3 livers 7) 1 kg bacon 8) lamb chops 9) 2 kg salmon rissoles

How might you increase your chances of winning? What information would be useful before you play again? • What is the maximum and minimum number of items ever read out? • What is the voice pattern over the gonged items? • What is the average number of items read out before the gong? • What is the frequency of gongs over time for each item?

Frequency distribution of the number of items before the gong What is a more informative way of presenting the data so we optimise our chance of where to stop?

Relative frequency table What is a better way of presenting this information so it is easier to use?

Cumulative frequency: What will be the median number of items before the gong? It is the (n+1)/2th value = the 50.5th value = 8
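
The relative frequency, cumulative frequency and median steps can be sketched in Python. This is a minimal illustration only: the frequency table below is hypothetical (the lecture's own table is not reproduced in these notes), and the helper names `value_at` and `median` are made up for the example.

```python
from itertools import accumulate

# HYPOTHETICAL frequency table (not the lecture's data):
# number of items read before the gong -> number of games
freq = {2: 1, 3: 2, 4: 3, 5: 2, 6: 2}

n = sum(freq.values())                       # number of games played
values = sorted(freq)                        # distinct numbers of items, ascending
relative = {x: freq[x] / n for x in values}  # relative frequency of each value
cumulative = dict(zip(values, accumulate(freq[x] for x in values)))


def value_at(position):
    """Return the position-th smallest observation (1-based)."""
    for x in values:
        if cumulative[x] >= position:
            return x
    raise ValueError("position is outside 1..n")


def median():
    """Median as the (n+1)/2 th ordered value.

    When n is even, (n+1)/2 is a half position, so the two
    middle observations are averaged.
    """
    lower = (n + 1) // 2
    upper = (n + 2) // 2
    return (value_at(lower) + value_at(upper)) / 2


print(relative)
print(cumulative)
print(median())
```

With n = 100 games, as in the lecture, the same function would average the 50th and 51st ordered values, which is what "the 50.5th value" means.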

Frequency distribution of the number of items before the gong What is the average number of items read before the gong?

What do we do to calculate the mean number of items before the gong? Multiply each number of items by its frequency, add the products to get the total number of items read before the gong, and divide by the number of games played.

Calculate the mean: mean = 784/100 = 7.84 items before the gong
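
The calculation of a mean from a frequency table can be sketched as follows. The frequency table here is hypothetical and is only meant to show the multiply, add and divide steps; the totals 784 and 100 in the last line are the ones quoted on the slide. The function name `mean_from_frequencies` is an illustrative choice.

```python
def mean_from_frequencies(freq):
    """Mean of grouped data: sum of (value * frequency) / sum of frequencies."""
    total_items = sum(x * f for x, f in freq.items())  # total items read over all games
    games = sum(freq.values())                         # number of games played
    return total_items / games


# HYPOTHETICAL frequency table (not the lecture's data):
# number of items before the gong -> number of games
toy_freq = {2: 1, 3: 2, 4: 3, 5: 2, 6: 2}
print(mean_from_frequencies(toy_freq))

# Totals quoted on the slide: 784 items over 100 games
print(784 / 100)
```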

Will your stopping strategy be the same for this set of data? Why not? For these values of x we have a much smaller spread.

In the long run what should be the probability of stopping at each number if stopping at random?

P(X=x) and number expected for each item for the random stopping model: Does it appear that the data fit the random stopping model? Why? The number expected differs from the number observed.
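
Under a random stopping model every stopping point is equally likely, so the expected count is the same in every cell. A minimal sketch, assuming the 100 games and 10 cells used elsewhere in this lecture (the function name is illustrative):

```python
from fractions import Fraction


def random_stopping_expected(n_games, n_cells):
    """Return (P(X = x), expected count per cell) when all cells are equally likely."""
    p = Fraction(1, n_cells)   # same probability for every stopping point
    return p, n_games * p      # expected count = n * P(X = x)


p, expected = random_stopping_expected(n_games=100, n_cells=10)
print(p, expected)
```

Exact fractions are used so that the probabilities sum to exactly 1, which is a convenient check on any model table.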

Bar Chart: Compare observed & expected frequencies

Measuring the difference between O and E: How do we measure (compare, calculate) the difference between observed and expected?

P(X=x) and number expected for each item for the random stopping model: How might we calculate the difference between observed and expected? If the data fit, will this be big or small? Small.

Calculating $\chi^2$

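Model Fit Using $\chi^2$ • Calculate $\chi^2 = \sum (O - E)^2 / E$, where O is the observed and E the expected frequency in each cell • And see if it is too large for the data to be considered to fit the model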
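
A minimal sketch of the goodness-of-fit statistic, assuming the usual Pearson form (the sum over cells of (O - E)^2 / E). The observed counts below are hypothetical and are not the lecture's data; the function name `chi_square` is an illustrative choice.

```python
def chi_square(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL example: 50 games spread over 5 equally likely cells
observed = [4, 6, 10, 14, 16]
expected = [10, 10, 10, 10, 10]
print(chi_square(observed, expected))
```

A perfect fit gives 0, and the statistic grows as observed counts move away from expected counts, which is why a small value supports the model.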

Model Fit Informal: Is $\chi^2$ too big? • If $\chi^2 > d + 2\sqrt{2d}$ • where d = g - p - 1 • g is the number of cells • p is the number of parameters estimated from the data • then there is evidence the data do not fit the model. For our example g = 10 cells, therefore d = 10 - 0 - 1 = 9 and $d + 2\sqrt{2d} = 9 + 2\sqrt{18} = 17.49$. Decision: As $\chi^2 = 65.6 > 17.49$ there is evidence that the data do not fit the random stopping model.
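
The informal check can be sketched as below. The threshold formula d + 2*sqrt(2d) is an assumption: it is the form that reproduces the slide's value of 17.49 for d = 9 (a chi-square variable with d degrees of freedom has mean d and variance 2d, so this is the mean plus two standard deviations). The function name is illustrative.

```python
import math


def informal_threshold(g, p=0):
    """Return (d, d + 2*sqrt(2d)) with d = g - p - 1.

    g: number of cells; p: number of parameters estimated from the data.
    ASSUMPTION: the informal cut-off is mean + 2 standard deviations
    of a chi-square distribution with d degrees of freedom.
    """
    d = g - p - 1
    return d, d + 2 * math.sqrt(2 * d)


d, threshold = informal_threshold(g=10, p=0)
print(d, round(threshold, 2))

chi_sq = 65.6                 # value quoted on the slide
print(chi_sq > threshold)     # True means evidence of lack of fit
```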

Percentage points of the $\chi^2$ distribution
df    $\alpha$ = 0.995    0.99     0.05      0.025     0.01      0.005
1                                  3.841     5.024     6.635     7.879
9     1.735               2.088    16.919    19.023    21.666    23.589
Model Fit Formal • Decision: If the calculated $\chi^2$ > the critical value of $\chi^2$ (from tables) then there is evidence of lack of fit • $\alpha$ = 0.05 (typical, and the value we will use) • df = number of cells - number of estimated parameters - 1 • df = 10 - 0 - 1 = 9

Model Fit Formal • Decision: As the calculated $\chi^2 = 65.6$ exceeds the critical value of 16.919 found in the tables ($\alpha$ = 0.05, df = 9), there is evidence of lack of fit between the data and the random stopping model.
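
The formal decision rule can be sketched as a table lookup. The critical values below are the ones printed on the slide (df = 1 and df = 9 only); the dictionary name `CRITICAL` and the function name `lack_of_fit` are illustrative, and a fuller version would need a complete table or a statistics library.

```python
# Critical values copied from the slide's table: {df: {alpha: critical value}}
CRITICAL = {
    1: {0.05: 3.841, 0.025: 5.024, 0.01: 6.635, 0.005: 7.879},
    9: {0.995: 1.735, 0.99: 2.088, 0.05: 16.919,
        0.025: 19.023, 0.01: 21.666, 0.005: 23.589},
}


def lack_of_fit(chi_sq, cells, estimated_params=0, alpha=0.05):
    """Return (evidence of lack of fit?, df, critical value).

    df = number of cells - number of estimated parameters - 1.
    Raises KeyError if the (df, alpha) pair is not in the small table above.
    """
    df = cells - estimated_params - 1
    critical = CRITICAL[df][alpha]
    return chi_sq > critical, df, critical


# Values quoted on the slide: chi-square = 65.6 with 10 cells, no estimated parameters
print(lack_of_fit(65.6, cells=10))
```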

Lack of fit: Looking at the table we can see that most lack of fit occurs for items 2, 3, 8 and 9. Lots of meat before the gong.

Sampling Distributions • We will explore how these types of sampling distributions, such as $\chi^2$, are generated in our lecture on sampling distributions. • We will also explore how we choose a value of $\alpha$. • We will look at using the data to estimate parameters later.

Model fit approaches • Use a bar chart to compare observed and expected frequencies • Compare observed and expected frequencies in a table • Calculate $\chi^2$ and use it • informally • formally. The $\chi^2$ test assumes that the expected count in each cell is at least 5; if not, combine cells. Other literature uses other rules; there is debate over this. (Check the Utts & Heckard (2004) definition.)
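
One way to combine cells when expected counts are small is to merge neighbouring cells until each merged cell reaches the minimum. The sketch below assumes a minimum expected count of 5 and a simple left-to-right merge; this is only one possible strategy and is not taken from the lecture. The counts are hypothetical.

```python
def combine_cells(observed, expected, minimum=5.0):
    """Merge adjacent cells left to right until each merged cell has expected >= minimum.

    Any leftover cells at the end are folded into the last merged cell.
    """
    merged_obs, merged_exp = [], []
    acc_o, acc_e = 0.0, 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e >= minimum:          # merged cell is now large enough
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
            acc_o, acc_e = 0.0, 0.0
    if acc_e > 0:                     # leftover tail below the minimum
        if merged_exp:
            merged_obs[-1] += acc_o
            merged_exp[-1] += acc_e
        else:                         # everything together is still below the minimum
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
    return merged_obs, merged_exp


# HYPOTHETICAL counts: the first two and last two cells are too small on their own
obs, exp = combine_cells([1, 2, 10, 12, 3, 2], [2, 3, 9, 11, 3, 2])
print(obs, exp)
```

After combining, the number of cells g (and so the degrees of freedom) is the number of merged cells, not the original number.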

Mean (expected value, E(X)) for the random stopping model

Expected value for the random stopping model is? E(X)=6.5
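
The expected value of a discrete model is the sum of x times P(X = x). The sketch below assumes the random stopping model puts equal probability on ten stopping points, 2 through 11. That support is an assumption inferred from the 10 cells and E(X) = 6.5 quoted in this lecture; it is not stated on the slide.

```python
from fractions import Fraction


def expected_value(pmf):
    """E(X) = sum of x * P(X = x) over all x in the model."""
    return sum(x * p for x, p in pmf.items())


# ASSUMPTION: ten equally likely stopping points, 2..11
support = range(2, 12)
pmf = {x: Fraction(1, len(support)) for x in support}

assert sum(pmf.values()) == 1      # probabilities must sum to 1
print(float(expected_value(pmf)))
```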

Spread of the Population Model We will leave calculation of these till a little later on a simpler example

What have we been doing? • We have been looking at the centre, spread, outliers and shape of samples of data, • with a view to improving decision making. • Why are we concerned with looking at models?

Describing characteristics of data: We collect data on samples • Time in seconds until two species of flies released together mate • The number of lost articles found in a large municipal office • The average carbohydrate content per 100 g serve in a sample of different species • The number of items of meat read before the gong

Improving our decisions Looking at • The shape of the distribution • Centre • Spread • Whether or not the data fit some model • May even look at outliers, points not fitting the model

Describing Batches of Data • Comparing midterm marks from the different versions of the test. • Are the papers completed in a similar manner?

What we are really looking at is NOT • The mating behaviour of these particular flies • Past lost articles • Or last year's exam papers • Or the last 100 games of Beat the Butcher. We are interested in them because they may suggest a model for the characteristics of the data in general. This involves probability models. We shall continue to explore probability models in future lectures.
