STAT131 W10L2 Review

STAT131W10L2 Review by Anne Porter alp@uow.edu.au

Relax, close your eyes • The start of this lecture involves closing one's eyes, thinking of some relaxing music and a quiet place and, when in this comfort zone, beginning to confront the accumulation of formulae and language that has occurred over the past few weeks. Now waken to...

The student’s nightmare

Organising assumptions • Events in one period are independent of those in the next • Variances should be equal • Expected frequency > 5 • Sample size large or population normally distributed • σ known • n fixed trials • Constant probability • In any small period of time only 0 or 1 events

Making Sense of Formulae • To make sense of the formulae and language encountered we need to organise it. When exploring samples of data and developing models for the random variables, whether discrete or continuous, we have sought to identify: • shape (including the probability of some event), • centre (mean, median), • spread (variance and standard deviation) and • fit of models to data (including outliers).

Map your concepts • See if you can organise the various formulae into related groups, attaching words to the symbols as you do so. You may also like to include those formulae which I have omitted. There is a symbolic language to learn in order to give voice to our statistical ideas.

Maybe some concept maps!

Summary • Maps • Frames • Worked solutions • Interpretations of printout • Interpretations of statistics • Assumptions

Let me show you some of the questions I ask as a teacher-researcher

What do the results of my midterm reveal about students?

Or what does this reveal?

Are the results normally distributed?

Does there appear to be a relationship between mark in Ass2 and the midterm?

What does Pearson’s r reveal? • What does r² suggest about the strength of the relationship? • What other factors must we take into account when we try to assess the strength of the relationship?

What estimate would we give someone for the midterm if they missed the test and the only information we had was the mark (6) for the second assignment? • Why would you be reluctant to use this method? • Why would you use the method?

Is there any difference according to the paper completed?

Think about me marking all…. • Why might a time series showing the marks versus the order of marking be of interest?

What is P(failing)? • What is P(student was from 2002)? • What is the probability that a student passed, given that they are a 2002 student? • What is the probability that a student will be from 2002 and failed? • Is the grade independent of year?

Now let us do a past midterm!

Review (2003 paper)

Q1: Describing data • Compare the groups male and female on four key features of the data (e.g. centre…) • Centre: median for males about the same as for females, at approximately 2010 • Spread: IQR for males (80) about twice that for females (about 40), or look at the range • Outliers: one for females at about 1800, none for males • Shape: females reasonably symmetric with a smaller third quartile; males skewed to the left with a longer low-valued tail

Q1: Describing data • Four key features • Outliers: one for females (<1800), none for males • Shape: normal for females, bimodal for males • Spread: females 1920 to 2060 (excluding outlier), possibly narrower than for males, 1890 to 2080 • Centre: mode 2010 for females, higher than mode 1990 for males

Q1: Describing data • Four main features • Outliers: one male (2200), none for females • Shape: females reasonably symmetric (third quartile narrow), males skewed with a longer tail to low numbers • Centre: medians both about 2010 • Spread: male IQR about twice female IQR; range for males wider than for females

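Q2: Poisson • The number of questions posted to the STAT131 forum follows a Poisson distribution at the rate of 3.3 per day. • Var(X) = σ² = λt = 3.3 × 1 = 3.3 • Standard deviation(X) = σ = √3.3 ≈ 1.82 • P(X=0) = e^(-3.3) ≈ 0.037 • P(Y>1), where Y is the exponential waiting time (in days) until the next question: no question arrives in the first day, therefore P(Y>1) = P(X=0)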
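The Q2 quantities can be checked numerically. This is a minimal sketch, assuming the rate of 3.3 questions per day from the question and a period of one day; the variable names are illustrative and not part of the original paper.

```python
import math

rate_per_day = 3.3   # lambda, from the question
t_days = 1           # length of the period (assumed to be one day)

mean = rate_per_day * t_days       # E(X) = lambda * t
variance = rate_per_day * t_days   # Var(X) = lambda * t for a Poisson variable
std_dev = math.sqrt(variance)      # sigma

# P(X = 0) from the Poisson probability function e^(-m) * m^x / x!
p_zero = math.exp(-mean) * mean**0 / math.factorial(0)

# Y = waiting time to the next question is exponential with rate lambda,
# so P(Y > 1) = e^(-lambda * 1), the same value as P(X = 0) for one day.
p_wait_over_one_day = math.exp(-rate_per_day * 1)

print(round(variance, 2), round(std_dev, 3), round(p_zero, 4))
```

The last two quantities agree because "waiting more than one day" and "no questions in one day" describe the same event.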

Q2 vi) Use of time series to test assumption of constant probability • Data centred about 3.3 • What does this reveal about the suitability of the Poisson(3.3) distribution? • Lack of trend suggests constant probability • Smoothing (running means or medians) can help

Q2 vi) Use of time series to test assumption of independence • Data is centred about 2.1 • What does this reveal about the suitability of the Poisson(2.1) distribution? • Shows random variation about that centre • The absence of trends suggests constant probability

Q2 vi) Use of time series to test assumption of independence • Data is centred about 4.2 • What does this reveal about the suitability of the Poisson(4.2) distribution? • Shows random variation about that centre • Smoothing (running median or mean) may show whether there is a possible change in probability in the middle time periods
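Smoothing with running medians or means, as mentioned for Q2 vi), might be sketched as follows. The daily counts are invented purely for illustration (they are not the data plotted in the paper), and the function names and the window of 3 are assumptions.

```python
from statistics import median

def running_median(values, window=3):
    """Median of each run of `window` consecutive values (odd window)."""
    half = window // 2
    return [median(values[i - half:i + half + 1])
            for i in range(half, len(values) - half)]

def running_mean(values, window=3):
    """Mean of each run of `window` consecutive values (odd window)."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]

# HYPOTHETICAL counts of forum questions per day, for illustration only
daily_counts = [3, 5, 2, 4, 9, 3, 4, 1, 5]

print(running_median(daily_counts))
print(running_mean(daily_counts))
```

The running median is less affected by the single high count (9) than the running mean, which is one reason to try both when looking for a change in level.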

Q3 Probability • Table totals: 110, 120 and 100 by nationality; 150 and 180 by response; 330 overall • (i) P(Other Nationality) = 100 others / 330 total = 100/330 • (ii) P(Yes) = 150 yes / 330 total responses = 150/330

Q3 Probability • (iii) P(Yes|Other Nationality) = 50 yes / 100 other nationalities = 50/100 • (iv) P(Yes and Other Nationality) = 50/330

Q3 Independence • (v) From looking at the data, does it appear that the response is independent of nationality? Yes / No. Explain your answer. • No, they are not independent: the patterns of responses are different for different nationalities. Knowing nationality helps in knowing the probability of agreeing or not.
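The Q3 answers can be reproduced from the counts quoted in the solutions (330 respondents, 100 of other nationalities, 150 "Yes" responses, 50 "Yes" among other nationalities). The sketch below is illustrative only; the variable names are assumptions, and comparing the two probabilities is an informal check rather than a formal test.

```python
total = 330          # all respondents
other = 100          # respondents of other nationalities
yes = 150            # all "Yes" responses
yes_and_other = 50   # "Yes" responses from other nationalities

p_other = other / total                    # (i)
p_yes = yes / total                        # (ii)
p_yes_given_other = yes_and_other / other  # (iii)
p_yes_and_other = yes_and_other / total    # (iv)

# Under independence P(Yes | Other) would equal P(Yes),
# equivalently P(Yes and Other) would equal P(Yes) * P(Other).
print(round(p_yes, 3), round(p_yes_given_other, 3))
print(round(p_yes_and_other, 3), round(p_yes * p_other, 3))
```

Here P(Yes|Other) is 0.5 while P(Yes) is about 0.45, so knowing the nationality changes the probability of a "Yes".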

Q4: Discrete Random Variables E(X) • The number of passengers' packages being mislaid (X) per week is defined by the following probabilities and outcomes [Table: outcomes x = 0, 1, 2 with probabilities 0.3, 0.4, 0.3; surviving table entries 40, 30, 100] • (i) Complete the table • (ii) What is E(X)? E(X) = 0×0.3 + 1×0.4 + 2×0.3 = 1

Q4: Discrete Random Variables • The number of passengers' packages being mislaid (X) per week is defined by the following probabilities and outcomes • X²: 0, 1, 4 • E(X) = 1 • (iii) What is the standard deviation of X? We need σ², where Var(X) = σ² = E(X²) - (E(X))² • E(X²) = 0×0.3 + 1×0.4 + 4×0.3 = 1.6 • Var(X) = 1.6 - 1² = 0.6 • Hence σ = √0.6 ≈ 0.77
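The expected value and standard deviation in Q4 can be checked with a few lines. This sketch assumes the probabilities 0.3, 0.4 and 0.3 for x = 0, 1, 2 that appear in the E(X) working; the variable names are illustrative.

```python
import math

outcomes = [0, 1, 2]     # x values
probs = [0.3, 0.4, 0.3]  # P(X = x), taken from the E(X) working

e_x = sum(x * p for x, p in zip(outcomes, probs))      # E(X)
e_x2 = sum(x**2 * p for x, p in zip(outcomes, probs))  # E(X^2)
var_x = e_x2 - e_x**2                                  # Var(X) = E(X^2) - (E(X))^2
sd_x = math.sqrt(var_x)                                # standard deviation

print(round(e_x, 3), round(e_x2, 3), round(var_x, 3), round(sd_x, 3))
```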

Q4: Discrete Random Variables - Graphical Fit • Sketch an appropriate graph to assess goodness of fit. • Do the data seem to fit the model? [Figure: column graph of observed and expected frequencies (freq, 0 to 50) against x = 0, 1, 2] • There appears to be a lack of fit, as seen by the discrepancy between the observed and expected columns

Q4: Discrete Random Variables - Chi-square goodness of fit • Use the chi-square goodness of fit test to see if the data fit the model • χ² = Σ (Observed - Expected)² / Expected = 26.6

Q4: Discrete Random Variables - Chi-square goodness of fit • Use the chi-square goodness of fit test to see if the data fit the model • If the calculated χ² is greater than the critical value, then there is lack of fit • Conclusion: As 26.6 > 3.828 there is evidence that the data do not fit the model • There appear to be fewer observed for X=0 and more for X=2
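The chi-square statistic sums (observed - expected)² / expected over the categories. The sketch below assumes a total of 100 weeks and the model probabilities 0.3, 0.4, 0.3, giving expected counts of 30, 40 and 30. The observed counts did not survive in these notes, so the ones used here are HYPOTHETICAL placeholders chosen only to show the calculation; the function name is also an assumption.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 100                        # assumed total number of weeks
model_probs = [0.3, 0.4, 0.3]  # P(X = 0), P(X = 1), P(X = 2)
expected = [n * p for p in model_probs]

# HYPOTHETICAL observed counts (not taken from the paper):
# fewer than expected at x = 0 and more than expected at x = 2.
observed = [10, 40, 50]

stat = chi_square_statistic(observed, expected)
print(round(stat, 2))
```

The statistic is then compared with the critical value from chi-square tables for the appropriate degrees of freedom.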

Review by a researcher • Putting statistics to work
