Lab 4 Wednesday, February 04, 2004. Manipulating your data. Example Problem.
Example Problem
Suppose you have GPAs for 45 students and you want to answer two questions:
• Do psychology majors have higher GPAs than other majors?
• Do psychology students who work full time have lower GPAs than students who do not work at all?
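Part of the Data
 113389
 213400
 311268
 411156
 511203
 622245
 721100
 821356
 911210
1012310
(IDs 1-9 are right-aligned in columns 1-2, so those records begin with a blank.)
Column 1-2: Participant ID #
Column 3: Major (1=psychology, 2=math, 3=english)
Column 4: Employment status (1=full-time, 2=part-time, 3=don’t work)
Column 5-7: GPA (three digits with two implied decimal places, e.g. 389 = 3.89)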
Infile and input statement
data d1;
  infile 'C:\My Documents\exdatalab4.txt';   /* plain quotes; curly quotes will not run */
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
proc print;
run;
Proc print output
Obs   id   major   work    gpa
  1    1       1      3   3.89
  2    2       1      3   4.00
  3    3       1      1   2.68
  4    4       1      1   1.56
  5    5       1      1   2.03
  6    6       2      2   2.45
  7    7       2      1   1.00
  8    8       2      1   3.56
  9    9       1      1   2.10
 10   10       1      2   3.10
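Clean data
data d1;
  infile 'C:\My Documents\exdatalab4.txt';
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
proc freq;
  tables id;              /* each id should appear exactly once */
proc univariate;
  var work major gpa;     /* check the extreme observations for out-of-range values */
run;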
Clean data (cont)
The FREQ Procedure
                            Cumulative   Cumulative
id   Frequency   Percent     Frequency      Percent
---------------------------------------------------
 1           1      2.22             1         2.22
 2           1      2.22             2         4.44
 3           1      2.22             3         6.67
 4           1      2.22             4         8.89
 5           1      2.22             5        11.11
 6           1      2.22             6        13.33
 7           1      2.22             7        15.56
 8           1      2.22             8        17.78
 9           1      2.22             9        20.00
10           1      2.22            10        22.22
Clean data (cont)
The UNIVARIATE Procedure
Variable: work
Extreme Observations
----Lowest----    ----Highest---
Value     Obs     Value     Obs
    1      42         3      32
    1      41         3      35
    1      38         3      39
    1      37         3      40
    1      36         3      45
Clean data (cont)
The UNIVARIATE Procedure
Variable: major
Extreme Observations
----Lowest----    ----Highest---
Value     Obs     Value     Obs
    1      45         3      28
    1      44         3      29
    1      43         3      34
    1      40         3      35
    1      39         6      23
The highest value, 6 (observation 23), is outside the valid major codes 1-3.
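Clean data (cont)
GPA - Extreme Observations
----Lowest----    ----Highest---
Value     Obs     Value     Obs
 1.00       7      3.95      23
 1.56       4      3.95      41
 1.67      33      4.00       2
 1.80      26      4.00      39
 2.03       5      5.56      31
The highest value, 5.56 (observation 31), is treated as out of range and is deleted in the next step.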
Clean data (cont)
Obs   id   major   work    gpa
 22   22       2      1   3.67
 23   23       6      1   3.95
 24   24       3      3   3.80
 25   25       3      1   2.80
 26   26       1      1   1.80
 27   27       3      3   2.50
 28   28       3      3   2.90
 29   29       3      3   3.15
 30   30       1      3   3.00
 31   31       1      3   5.56
 32   32       1      3   3.89
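Rather than reading through the full listing, the suspect records can be printed directly with a WHERE statement. The sketch below is only an illustration: the data set name demo is hypothetical, the five records are copied from the listing above, and the GPA ceiling of 4.00 is an assumption rather than something the lab states.

```sas
/* Five records from the listing above (ids 22, 23, 24, 31, 32), */
/* laid out in the same fixed columns as exdatalab4.txt.          */
data demo;
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
  datalines;
2221367
2361395
2433380
3113556
3213389
;
run;

/* Print only the records with a code outside 1-3 or a GPA above 4.00 */
/* (the 4.00 ceiling is an assumption).                                */
proc print data=demo;
  where major not in (1, 2, 3)
     or work  not in (1, 2, 3)
     or gpa > 4;
run;
```

With these five records the listing should contain ids 23 (major = 6) and 31 (gpa = 5.56).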
Delete bad item
data d1;
  infile 'C:\My Documents\exdatalab4.txt';
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
data d2;
  set d1;
  if _n_ = 23 then delete;   /* major = 6 */
  if _n_ = 31 then delete;   /* gpa = 5.56 */
proc print data=d2;
run;
Output
(The majorr column is created by the recode step on the next slide.)
Obs   id   major   work    gpa   majorr
 21   21       2      1   3.05        1
 22   22       2      1   3.67        1
 23   24       3      3   3.80        1
 24   25       3      1   2.80        1
 25   26       1      1   1.80        0
 26   27       3      3   2.50        1
 27   28       3      3   2.90        1
 28   29       3      3   3.15        1
 29   30       1      3   3.00        0
 30   32       1      3   3.89        0
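Deleting by _n_ works only as long as the records stay in the same order in the file. As a sketch of an alternative (not the approach the lab uses), the bad records can be deleted by value instead. The data set names demo and demo_clean are hypothetical, the records are copied from the earlier listing, and the 4.00 GPA ceiling is an assumption.

```sas
/* Same five sample records as in the earlier listing (ids 22, 23, 24, 31, 32). */
data demo;
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
  datalines;
2221367
2361395
2433380
3113556
3213389
;
run;

data demo_clean;
  set demo;
  /* Delete by value instead of by position.                     */
  /* Note: "not in" is also true for a missing major, so records */
  /* with a missing major code would be dropped as well.         */
  if major not in (1, 2, 3) or gpa > 4 then delete;
run;

proc print data=demo_clean;
run;
```

With these records, ids 23 and 31 are dropped and ids 22, 24, and 32 remain.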
Recode Major to dichotomous: Psychology major vs. not psych
data d1;
  infile 'C:\My Documents\exdatalab4.txt';
  input id 1-2 major 3 work 4 @5 (gpa) (3.2);
data d2;
  set d1;
  if _n_ = 23 then delete;
  if _n_ = 31 then delete;
  if major = 1 then majorr = 0;   /* psychology */
  if major = 2 then majorr = 1;   /* math */
  if major = 3 then majorr = 1;   /* english */
proc print data=d2;
run;
Proc print
Obs   id   major   work    gpa   majorr
  1    1       1      3   3.89        0
  2    2       1      3   4.00        0
  3    3       1      1   2.68        0
  4    4       1      1   1.56        0
  5    5       1      1   2.03        0
  6    6       2      2   2.45        1
  7    7       2      1   1.00        1
  8    8       2      1   3.56        1
  9    9       1      1   2.10        0
 10   10       1      2   3.10        0
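A quick way to confirm a recode without reading the whole listing is to cross-tabulate the old variable against the new one. The sketch below is an illustration only: the data set names demo and demo_r are hypothetical and the three rows simply cover one id from each major code.

```sas
/* One record per major code (id and major only), for illustration. */
data demo;
  input id major;
  datalines;
1 1
6 2
24 3
;
run;

data demo_r;
  set demo;
  if major = 1 then majorr = 0;                /* psychology */
  else if major in (2, 3) then majorr = 1;     /* math or english */
run;

/* Each major code should map to exactly one majorr value;   */
/* MISSING also shows any record that was left unrecoded.    */
proc freq data=demo_r;
  tables major*majorr / list missing;
run;
```

The else-if form gives the same result as the three separate IF statements for codes 1-3 and leaves majorr missing for any other code.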
Hypothesis
• H0: psychology majors' GPAs = other majors' GPAs
• H1: psychology majors have higher GPAs than other majors
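Testing means
proc sort data=d2;
  by majorr;                 /* BY-group processing requires sorted data */
proc univariate normal plot data=d2;
  var gpa;
  by majorr;                 /* normality test (W) within each group */
proc glm data=d2;
  class majorr;
  model gpa = majorr;
  means majorr / hovtest;    /* homogeneity of variance test */
run;
quit;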
Proc GLM output
The GLM Procedure
Dependent Variable: gpa
                                Sum of
Source              DF         Squares    Mean Square   F Value   Pr > F
Model                1      0.01065652     0.01065652      0.02   0.8905
Error               41     22.78548766     0.55574360
Corrected Total     42     22.79614419

R-Square    Coeff Var    Root MSE    gpa Mean
0.000467     25.58931    0.745482    2.913256
Proc GLM output (cont)
The GLM Procedure
Level of         -------------gpa-------------
majorr     N           Mean         Std Dev
0         22     2.92863636      0.74777317
1         21     2.89714286      0.74306893
Therefore, no support was found for the hypothesis that psychology majors have higher GPAs than English and math majors (F(1, 41) = .02, ns).
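The F Value in the GLM table is the Model mean square divided by the Error mean square, and Pr > F is the upper-tail probability of that F with (1, 41) degrees of freedom. The following is a minimal sketch that recomputes both from the printed mean squares. The variable names are made up for the illustration.

```sas
/* Recompute F and its p-value from the printed mean squares. */
data _null_;
  ms_model = 0.01065652;            /* Model Mean Square from the output */
  ms_error = 0.55574360;            /* Error Mean Square from the output */
  f = ms_model / ms_error;          /* F = MS(model) / MS(error) */
  p = 1 - probf(f, 1, 41);          /* upper-tail probability, df = (1, 41) */
  put f= 8.4 p= 8.4;                /* written to the SAS log */
run;
```

The values written to the log should agree with the printed F Value and Pr > F up to rounding.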
Sample problem
Do psychology students who work full time have lower GPAs than students who do not work at all?
H0: full-time GPAs = no-work GPAs
H1: students who do not work have higher GPAs than students who work full time
Create data set of only Psychology majors and delete part-time data
data d3;
  set d2;
  if major = 1;                  /* keep psychology majors (same as majorr = 0) */
  if work = 2 then delete;       /* drop part-time students */
proc print data=d3;
run;
Output
Obs   id   major   work    gpa   majorr
  1    1       1      3   3.89        0
  2    2       1      3   4.00        0
  3    3       1      1   2.68        0
  4    4       1      1   1.56        0
  5    5       1      1   2.03        0
  6    9       1      1   2.10        0
  7   18       1      3   3.00        0
  8   19       1      3   3.15        0
  9   20       1      3   2.56        0
  …
 18   45       1      3   3.15        0
Compare means
proc sort data=d3;
  by work;                   /* BY-group processing requires sorted data */
proc univariate normal plot data=d3;
  var gpa;
  by work;
proc glm data=d3;
  class work;
  model gpa = work;
  means work / hovtest;
run;
quit;
Proc GLM output
The GLM Procedure
Dependent Variable: gpa
                                Sum of
Source              DF         Squares    Mean Square   F Value   Pr > F
Model                1      5.10034028     5.10034028     15.98   0.0010
Error               16      5.10768750     0.31923047
Corrected Total     17     10.20802778

R-Square    Coeff Var    Root MSE    gpa Mean
0.499640     19.86733    0.565005    2.843889
Proc GLM output (cont)
The GLM Procedure
Level of         -------------gpa-------------
work       N           Mean         Std Dev
1          8     2.24875000      0.54971259
3         10     3.32000000      0.57661850
Therefore, students who don’t work have significantly higher GPAs (M = 3.32, SD = .58) than those students who work full time (M = 2.25, SD = .55; F(1, 16) = 15.98, p < .05).
Notes on what to include in the write-up
• Talk about the tests of assumptions and, if there is a violation, talk about the consequences. Include the test statistics (W for normality and F for homogeneity of variance).
• Report means, SDs, and the F-value. What do the results mean in terms of the hypotheses?
• Overall conclusions
Exercises
• 4 variables:
  • Id (participant id number)
  • Gender (female = 0, male = 1)
  • Depress (depression scale ranging from 29 to 45)
  • Age (range 6 to 98)
• Open “lab4.sas” on Dr. Brannick’s website
Exercise (cont)
• Clean the data. Check the gender variable with a proc univariate statement to see if there are any out-of-bounds values.
• Delete the out-of-bounds value.
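The check itself might look like the sketch below. It does not use the real lab4.sas data: the data set demo is hypothetical, the first three rows are copied from the exercise output shown later, and the last row (id 99, gender 7) is invented purely to show what an out-of-bounds value looks like.

```sas
/* Hypothetical demo data: the row with id 99 is made up for illustration. */
data demo;
  input id gender depres age;
  datalines;
1 0 39 18
9 0 34 14
32 1 36 79
99 7 40 30
;
run;

/* The Extreme Observations table shows the largest and smallest */
/* gender values along with their observation numbers.           */
proc univariate data=demo;
  var gender;
run;

/* List the offending records directly (valid codes are 0 and 1). */
proc print data=demo;
  where gender not in (0, 1);
run;
```

In the real exercise the same two steps would be run against the data set created in lab4.sas, and the observation number reported by proc univariate is the one to delete.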
Program
cards;
[data lines from lab4.sas]
;
data d2;
  set d1;
  if _n_ = 15 then delete;
proc print;
run;
Exercise 2
• Create a data set of females who are under 21.
Program
data d2;
  set d1;
  if _n_ = 15 then delete;
data d3;
  set d2;
  if gender = 0 and age < 21;    /* keep females under 21 */
proc print;
run;
Output
Obs   id   gender   depres   age
  1    1        0       39    18
  2    9        0       34    14
  3   10        0       37    19
  4   23        0       33     7
  5   24        0       39    10
  6   25        0       36    12
  7   26        0       45    18
  8   27        0       40    16
  9   28        0       41    12
 10   29        0       33     8
Exercise 3
• Create a data set of males who have a score under 38 on the depress scale.
Program
data d4;
  set d2;
  if gender = 1 and depres < 38;   /* keep males scoring under 38 */
proc print;
run;
Output
Obs   id   gender   depres   age
  1   32        1       36    79
  2   33        1       35    72
  3   35        1       29    63
  4   37        1       33    56
  5   38        1       34    51
  6   39        1       36    46
  7   40        1       32    41
  8   41        1       32    35
  9   42        1       31    30
 10   45        1       32    18
 11   46        1       37    25
 12   47        1       36    36
 13   49        1       36     5