Error check in data
Hein Stigum
Presentation, data and programs at: http://folk.uio.no/heins/
H.S.
Example data
• HUMIS
  • Birth cohort, 5 counties in Norway
  • N=475 mother-child pairs
  • Repeated questionnaires
• Purpose
  • Outcome: Growth after birth
  • Exposure: Contaminants in mother’s milk
Agenda
• Potential problems
  • String variables, Missing, …
• Univariate
• Bivariate
• Multivariable
• Individual growth
Potential problems
String variables
String to numeric:
    encode KJONN if KJONN!=" ", generate(sex3)
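A quick check of the conversion might look like the sketch below. It assumes the variables named on the slide (KJONN, sex3) and that the encode command has already been run. The checks are a suggestion and are not part of the original presentation.

```stata
* Sketch: verify that encode mapped every string value as intended
tabulate KJONN sex3, missing            /* cross-check string values against the new codes */
label list sex3                         /* by default encode names the value label after the new variable */
count if missing(sex3) & KJONN!=" "     /* expected 0: only the excluded blank strings should be missing */
```

If the count is not zero, some non-blank string values were left without a numeric code and deserve a closer look.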
Missing
Univariate outliers
Commands for previous plot
    local i=1
    foreach var of varlist age1 weight1 fHCB BMI1 mHeight mWeight {
        graph hbox `var', marker(1, mlabel(id) msymbol(i) mlabpos(0) mlabangle(-90)) ///
            name(plt`i', replace)
        local ++i
    }
    graph combine plt1 plt2 plt3 plt4 plt5 plt6, col(2)
Bivariate outliers
Commands for previous plot
    twoway (scatter mWeight mHeight) ///
           (scatter mWeight mHeight if BMI1>35 | BMI1<16, mcol(red)) ///
           (qfit mWeight mHeight) ///
           (qfit mWeight mHeight if mHeight<185) ///
           , legend(off) text(110 195 "BMI>35", col(red)) ///
           ytitle("Mother's weight") xtitle("Mother's height")
Multivariable outliers: weight
Commands for previous plot
    gen agesq=age^2
    gen ageqb=age^3
    regress weight age agesq ageqb if age>=0 & age<1000
    capture: drop xb res
    predict xb, xb        /* predicted value */
    predict res, res      /* residuals */
    tw (scatter weight age) ///
       (scatter weight age if abs(res)>4000, mcol(red)) ///
       (line xb age, sort lcol(red)) ///
       if age>=0 & age<1000, legend(off)
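To follow up on the red points, the flagged observations could be listed as in this minimal sketch. It assumes the res variable from the regression above and an id variable as used in the other plots. The explicit missing-value check is an addition, needed because Stata treats missing values as larger than any number.

```stata
* Sketch: list the observations behind the red markers
* abs(res)>4000 is also true when res is missing, so exclude missing explicitly
list id age weight res if abs(res)>4000 & !missing(res)
```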
Plot of individual growth patterns: weight versus age
[Figures: Weight by age 1 through Weight by age 16, one plot page per slide]
Commands for previous plots
    * Individual growth patterns. Note: 16 pages of 30 plots each
    * Repeated measurements, long format, age nested in id
    sort id age                       /* sort by id-number and age */
    global d=30                       /* 30 plots per page */
    forvalues i=1(1)16 {              /* 16 pages*30 plots=480 subjects */
        local j=(`i'-1)*$d+1          /* plot subjects in id-interval: j<=id<=k */
        local k=`i'*$d
        twoway (line weight age, connect(ascending)) if id>=`j' & id<=`k' ///
            , by(id, compact title("Weight by age, `i'") note("")) ///
            ylabel(0(5000)15000) xlabel(0(200)800)
        graph export "H:\Projects\HUMIS\Weight gain\plt`i'.emf", replace   /* Enhanced Metafile Format */
    }                                 /* end of loop */
    * Make a new Photo album in PowerPoint and add all plots. This gives one graph per slide at maximum size.
After new data merge
Plot of individual growth patterns: weight versus age
Individual plots in large datasets?
• Scan 1 page (=30 curves) in 5 sec
• Hours used = 5N/(30*60*60)
• Scan all
  • If N=50 000, need 2.3 hours
• May instead scan curves of subjects with medium to large residuals
  • Residual>1000
  • Finds 190 of the 470 children = 40%
  • Finds 12 of the 15 deviant growth patterns = 80%
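The time estimate and the residual-based selection above could be sketched as follows. This is an illustration under assumptions, not the author's code: it assumes long-format data with id, age, weight and the res variable from the earlier regression, it reads "Residual>1000" as at least one absolute residual above 1000 per subject, and the names maxres and sel are hypothetical.

```stata
* Sketch: scan only subjects with at least one large residual
display 5*50000/(30*60*60)               /* hours to scan all curves when N=50 000, about 2.3 */

capture drop maxres sel
egen maxres = max(abs(res)), by(id)      /* largest absolute residual per subject (assumed rule) */
egen sel = group(id) if maxres>1000 & !missing(maxres)   /* 1,2,3,... over selected subjects */

summarize sel, meanonly
local pages = ceil(r(max)/30)            /* 30 plots per page, as in the original loop */
sort id age                              /* connect(ascending) expects age sorted within id */
forvalues i=1/`pages' {
    local j=(`i'-1)*30+1
    local k=`i'*30
    twoway (line weight age, connect(ascending)) if sel>=`j' & sel<=`k' ///
        , by(id, compact title("Large residuals, page `i'") note(""))
}
```

Renumbering the selected subjects with group() keeps 30 curves on each page even though their id numbers are no longer consecutive.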
Summing up
• Graph, outliers
  • Uni: Boxplots
  • Bi: Scatterplots
  • Multi: Scatterplots+residuals
  • Individual growth
• Merge errors are not rare!