Biostat 201: Winter 2011. Lab Session 1 Week 1 and Week 2. Introduction. Wendy Shih wendyshi@ucla.edu Office Hours: Tues 2-3pm or by appointment A1-228 or Biostat Consulting Room (two doors to the left of the Lab). Access to SAS/STATA. In the lab: login= sph , password=hello
Biostat 201: Winter 2011 Lab Session 1 Week 1 and Week 2
Introduction • Wendy Shih, wendyshi@ucla.edu • Office Hours: • Tues 2-3pm or by appointment • A1-228 or Biostat Consulting Room (two doors to the left of the Lab)
Access to SAS/STATA • In the lab: login=sph, password=hello • one year SAS student license • Check with your department • www.softwarecentral.ucla.edu • Computers/laptops at UCLA library • TLC lab at Biomed library • STATA Only • shortcut.clicc.ucla.edu
Typical lab session • 4 assignments total • Brief (very brief!) overview of the assignment • Introduce statistical tools/methods that may be helpful with accompanying SAS/STATA code fragments • Further discussion (time permitting) • Go analyze!
Some additional notes • Both SAS and STATA code will be introduced, but you only need to know how to use one (so use whichever is most familiar to you) • Code will not be given to you in electronic format • You might want to bring a USB drive or have another way to save your documents • No raw outputs from SAS or STATA. All submitted results must be formatted.
Please Do NOT Paste Raw Outputs

. tabstat dage, by(grad) stat(n mean semean min max)

Summary for variables: dage
     by categories of: grad (Center Grade)

    grad |         N      mean  se(mean)       min       max
---------+--------------------------------------------------
excellen |        36  29.13889  1.993702        18        68
    good |        36  30.27778  1.581446        18        60
    fair |        36  37.13889  1.792911        18        55
    poor |        36  37.97222  1.853134        19        69
---------+--------------------------------------------------
   Total |       144  33.63194  .9552307        18        69
------------------------------------------------------------

The MEANS Procedure
Analysis Variable : dage

          N     N
grad    Obs  Miss        Mean   Std Error     Minimum     Maximum
-----------------------------------------------------------------
   1     36     0  29.1388889   1.9937015  18.0000000  68.0000000
   2     36     0  30.2777778   1.5814455  18.0000000  60.0000000
   3     36     0  37.1388889   1.7929105  18.0000000  55.0000000
   4     36     0  37.9722222   1.8531338  19.0000000  69.0000000
-----------------------------------------------------------------
Formatted Results Table 1: Summary Statistics for Donor Age (Years) by Center Grades
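One way to get closer to a presentable table before moving the numbers into Word is to round inside Stata. The sketch below reuses the tabstat call from the raw-output slide and adds the format() option; one decimal place is an assumption here, not a stated course requirement.

```stata
* Same summary as the raw output, displayed with one decimal place
tabstat dage, by(grad) stat(n mean semean min max) format(%9.1f)
```

The display format is applied to the whole table, so the counts may also show a decimal. The result still has to be moved into a labeled table with a title, units, and readable group names.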
The assignments • All four assignments are reports, not problem sets • Introduction • Methods • Results • Can be submitted via e-mail as a Microsoft Word file • E-mail: wendyshi@ucla.edu • Subject: Biostat 201 W10 hw# Last First • Filename: Biostat 201 W10 hw# Last First • ex: Biostat 201 hw1 Shih Wendy
Assignment grades • Graded on a 0.0 – 4.0 scale • 0.0 to 1.9: major errors / misunderstandings • 2.0 to 2.5: a few major or multiple minor errors • 2.6 to 3.0: a few minor errors • 3.1 to 3.5: good/excellent job • 3.6 to 4.0: very impressive!
Assignment expectations • Brief • 2.5-3.5 pages (with tables and figures), 12pt, double-spaced is often sufficient • Complete • Requested analyses were performed and properly interpreted • Logical • Has an easy-to-follow flow • Easy to see how the analyses guided each step of the investigation • No ambiguity on what you were thinking
Common pitfalls • Lack of explanation • Why are you doing what you are doing? • Example: • We run a multivariate linear regression. (why?) • We run a multivariate linear regression to evaluate the association between crime rate and depression while adjusting for socioeconomic factors. (ah, that’s better!)
Common pitfalls • Lack of interpretation • On what basis are you making your claims? • Example: • There is a significant difference between the IQ’s of UCLA and USC students. (what makes you say this?) • The two-sample t-test result indicates that the SAT scores of UCLA and USC freshmen are statistically different (p=0.0032), with UCLA students having an average SAT score that is 220 points greater than USC students. (note: method used, measure used, statistical significance, magnitude, direction)
Common pitfalls • Lack of follow-up • How exactly did your findings guide you in your investigation? • Example: • A scatterplot of SAT score vs. GPA suggests a positive linear relationship among males, but a negative linear relationship among females. (How does this finding influence your analysis?) • A scatterplot of SAT score vs. GPA suggests a positive linear relationship among males, but a negative linear relationship among females. Therefore, the association between SAT score and GPA was evaluated separately for males and females.
Questions to ask yourself • What are you investigating? • What analytical method are you using to investigate it? • What do the results of that analysis tell you? • How do those results guide your subsequent analyses, or what conclusions do you draw from them?
SAS/STATA code key • I will use the following convention in these slides: • statements: bold • keywords: italics • options: underlined • Variables, or something you specify yourself: courier font
What do we need to do? • Import data • Summary statistics and plots • Choose and specify a model • Investigate if the model is appropriate • Predicted mean differences for covariate profiles • Conduct and interpret the model results
SAS: Importing data
• http://www.ats.ucla.edu/stat/sas/faq/rwxls8.htm
• http://www.ats.ucla.edu/stat/sas/faq/read_delim.htm
• Can use the import wizard: File > Import Data…
• proc import out=dataset
    datafile="directory_of_excel_file"
    dbms=excel replace;
    sheet="sheet_name";
  run;
SAS: Importing data
• http://www.ats.ucla.edu/stat/sas/faq/rwxls8.htm
• http://www.ats.ucla.edu/stat/sas/faq/read_delim.htm
• Can use the import wizard: File > Import Data…
• proc import out=hdldata
    datafile="C:\SAS\data\hdltable.csv"
    dbms=csv replace;
  run;
• Note: the sheet statement applies only to Excel files, so leave it out when dbms=csv
STATA: Importing data
• http://www.ats.ucla.edu/stat/stata/faq/readcommatab.htm
• cd "directory_of_csv_file"
• insheet using file_name
Example: Kidney Data
SAS
  proc import datafile="G:\TA - Biostat 201 Winter 2011\KIDNEY.csv"
    out=kidney dbms=csv replace;
  run;
STATA
  cd "G:\TA - Biostat 201 Winter 2011"
  insheet using "KIDNEY.csv"
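In newer versions of Stata, insheet has been superseded by import delimited. The sketch below is an alternative to the insheet call above, assuming Stata 13 or later and the same file location, followed by two quick checks that the import worked.

```stata
* Assumes Stata 13 or later; insheet is the older command for the same job
cd "G:\TA - Biostat 201 Winter 2011"
import delimited using "KIDNEY.csv", clear

* Quick checks: variable names and types, then the first few rows
describe
list in 1/5
```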
SAS: Summary statistics
• proc means data=dataset [options]; var var1 var2 var3; run;
• proc means data=dataset [options]; class grpvar; var var1 var2 var3; run;
• proc univariate data=dataset; var var1 var2 var3; run;
SAS: Summary statistics
  proc means data=kidney nmiss mean stderr min max; var dage cith; run;
  proc means data=kidney nmiss mean stderr min max; class grad; var dage cith; run;
  proc univariate data=kidney; var dage cith; run;
  proc univariate data=kidney; class grad; var dage cith; run;
STATA: Summary statistics
• summarize var1 var2
• bysort grpvar: summarize var1 var2
• summarize var1 var2, detail
• sum dage cith
• sum dage cith, detail
• bysort grad: sum dage cith, detail
SAS: Bivariate statistics (continuous variables)
• proc ttest data=dataset; class grpvar; var var1 var2 var3; run;
• proc npar1way data=dataset; class grpvar; var var1 var2 var3; run;
SAS: Bivariate statistics (continuous variables)
• proc ttest data=kidney; class cens; var cith; run;
• proc npar1way data=kidney; class cens; var cith; run;
STATA: Bivariate statistics (continuous variables)
• ttest var1, by(grpvar)
• kwallis var1, by(grpvar)
• ttest cith, by(cens)
• ttest cith, by(cens) unequal
• kwallis cith, by(cens)
SAS: Plots
• proc gplot data=dataset; plot yvar * xvar = grpvar; run; quit;
• proc gplot data=kidney; plot dage*cith=cens; run; quit;
STATA: Plots
• twoway (scatter yvar xvar if grpvar==value, mcolor(color))
• twoway (scatter dage cith if cens==0, ms(o) mcolor(red))
         (lfit dage cith if cens==0, clcolor(red))
         (scatter dage cith if cens==1, ms(o) mcolor(blue))
         (lfit dage cith if cens==1, clcolor(blue)), legend(off)
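For a two-group comparison such as cens, Stata also has a dedicated Wilcoxon rank-sum command, which is the two-sample counterpart of kwallis. A minimal sketch, assuming cens takes exactly two values:

```stata
* Wilcoxon rank-sum (Mann-Whitney) test of cith between the two cens groups
ranksum cith, by(cens)
```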
Choose a model • Right now, we assume that this assignment is driving toward a linear regression model. Just know that this may not always be appropriate in real-world situations.
SAS: Linear model
• proc reg data=dataset; model yvar = x1 x2 x3; run; quit;
• proc reg data=kidney; model cith = cens dage; run; quit;
STATA: Linear model
• regress yvar x1 x2 x3
• regress cith cens dage
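The outline lists "Investigate if the model is appropriate", but no commands for that step appear on these slides. The following is a minimal sketch of common post-estimation checks in Stata. Which checks the assignment expects is an assumption, and the variable name resid is made up for illustration.

```stata
regress cith cens dage

* Residuals vs. fitted values: look for curvature or a funnel shape
rvfplot, yline(0)

* Save the residuals and check approximate normality with a Q-Q plot
predict resid, residuals
qnorm resid

* Breusch-Pagan / Cook-Weisberg test for non-constant variance
estat hettest
```

Each of these commands uses the most recent regress results, so rerun them after every model you fit.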
SAS: Stratified model
• proc sort data=dataset; by grpvar; run;
  proc reg data=dataset; model yvar = x1 x2 x3; by grpvar; run; quit;
• You must SORT by the grouping variable before you run the stratified model.
SAS: Stratified model
• proc sort data=kidney; by cens; run;
• proc reg data=kidney; model cith = dage; by cens; run; quit;
STATA: Stratified model
• bysort grpvar: regress yvar x1 x2
• bysort cens: regress cith dage
SAS: Dummy encoded model
• proc reg data=dataset; model yvar = x1 x2 x3 z1 z2; run; quit;
• Note: “z” represents dummy-encoded variables
• proc reg data=kidney; model cith = dage cens excel good fair; run; quit;
• excel, good, and fair are newly created dummy variables.
STATA: Dummy encoded model
• regress yvar x1 x2 z1 z2
• Note: “z” represents dummy-encoded variables
• regress cith cens dage excel good fair
• excel, good, and fair are newly created dummy variables.
SAS: Interaction model
• data dataset; set dataset; intnvar = x1 * x2; run;
  proc reg data=dataset; model yvar = x1 x2 intnvar; run; quit;
SAS: Interaction model
• data kidney; set kidney; d_c = dage*cens; run;
• proc reg data=kidney; model cith = dage cens d_c; run; quit;
STATA: Interaction model
• gen intnvar = x1 * x2
  regress yvar x1 x2 intnvar
• gen d_c = dage*cens
  regress cith dage cens d_c
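The dummy variables have to be created before the model is run. The sketch below does this in Stata, assuming grad is stored as 1 = excellent, 2 = good, 3 = fair, 4 = poor (this matches the order of the raw output shown earlier, but check it with tabulate) and that "poor" is the reference group.

```stata
* Check how grad is actually coded before building indicators
tabulate grad, nolabel

* One indicator per non-reference level; a missing grad stays missing
gen excel = (grad == 1) if !missing(grad)
gen good  = (grad == 2) if !missing(grad)
gen fair  = (grad == 3) if !missing(grad)

regress cith cens dage excel good fair
```

In Stata 11 or later, factor-variable notation should fit the same model without creating new variables: regress cith cens dage ib4.grad.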
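Stata 11 and later can build the product term on the fly with factor-variable notation, which avoids keeping a hand-made variable in sync with the data. A sketch, assuming cens is coded 0/1 so that the fit matches the d_c version above:

```stata
* c. marks dage as continuous and i. marks cens as categorical;
* ## includes both main effects and their interaction
regress cith c.dage##i.cens
```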
Predicted mean differences • Question: Observation 1 has “this” particular profile, and observation 2 has “that” particular profile. Is there a difference in their predicted mean response/outcome? • Example: Obs1: 56 years old and censored; Obs2: 61 years old and censored
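Under the additive model used in these slides (cith regressed on dage and cens), the example reduces to a multiple of one coefficient. In this short derivation, the symbols are introduced here for illustration: \hat\beta_1 is the estimated dage coefficient, \hat\beta_2 the estimated cens coefficient, and c the cens value shared by both observations.

```latex
\begin{aligned}
\hat{y}_1 &= \hat\beta_0 + 56\,\hat\beta_1 + \hat\beta_2\,c \\
\hat{y}_2 &= \hat\beta_0 + 61\,\hat\beta_1 + \hat\beta_2\,c \\
\hat{y}_2 - \hat{y}_1 &= (61 - 56)\,\hat\beta_1 = 5\,\hat\beta_1
\end{aligned}
```

Because both observations have the same cens value, the intercept and the cens term cancel, leaving only the age difference times the dage coefficient.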
Predicted mean differences • Strategy • Add observations with the specified covariate profiles with the outcome missing • Run the linear regression model and request the predicted outcome with standard error of the prediction • Look at the results
SAS: Predicted mean differences
• Add observations
• data profiles;
    input dage cens;
    cards;
  56 0
  61 0
  ;
  run;
  data kidney; set kidney profiles; run;
SAS: Predicted mean differences
• Analyze and request standard error of the prediction
• proc reg data=kidney;
    model cith = dage cens;
    output out=kidney_new p=ypred stdp=yprese;
  run; quit;
• Now if you open the “kidney_new” dataset, you can scroll down and view the predicted values and the standard error of the prediction
STATA: Predicted mean differences
• Add observations
• It’s probably easiest to do this using the data editor
• Suppose our dataset has 144 observations:
  set obs 146
  replace dage=56 in 145
  replace cens=0 in 145
  replace dage=61 in 146
  replace cens=0 in 146
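Hard-coding the observation numbers only works if you know the exact size of the dataset. The sketch below shows a version that adapts to whatever is in memory, using the built-in _N. The local macro name n is made up for illustration, and the lines should be run together as one block in a do-file.

```stata
* Remember the current number of observations, then add two empty rows
local n = _N
set obs `=`n' + 2'

* Fill in the covariate profiles; cith is left missing on purpose
replace dage = 56 in `=`n' + 1'
replace cens = 0  in `=`n' + 1'
replace dage = 61 in `=`n' + 2'
replace cens = 0  in `=`n' + 2'
```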
STATA: Predicted mean differences • Analyze and request the standard error of the prediction • regress cith cens dage • predict ypred • predict yprese, stdp • Now if you open the data browser, you can scroll down and view the predicted values and the standard error of the prediction
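The predicted values and stdp describe each profile on its own. They do not directly give a standard error for the difference between the two profiles. One option that is not shown on these slides is lincom, which estimates a linear combination of coefficients after regress. A sketch, assuming the two profiles differ only in dage (56 vs. 61):

```stata
regress cith cens dage

* Difference in predicted mean cith for dage 61 vs. 56 at the same cens:
* (61 - 56) times the dage coefficient, with its standard error and 95% CI
lincom 5*dage
```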