Summarizing Data: Numeric Methods
Rcmdr • Features for loading, viewing and analyzing data • Help system • Packages
Data in R • Several formats: vectors, arrays, matrices, lists, data.frames • Generally we use data.frames as they have the advantage of letting us store different kinds of data and linking them by row. • Rcmdr uses data.frames
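As a rough illustration of the formats listed above, the following sketch builds one object of each kind. All values and object names here are hypothetical toy data, not part of the course data sets.

```r
# Hypothetical toy values, used only to illustrate the storage formats
v  <- c(34.5, 36.0, 32.4)                # vector: one data type, one dimension
m  <- matrix(1:6, nrow = 2)              # matrix: one data type, rows x columns
a  <- array(1:24, dim = c(2, 3, 4))      # array: one data type, any number of dimensions
l  <- list(id = "A1", sizes = v)         # list: components of any type and length
df <- data.frame(                        # data.frame: equal-length columns,
  Type     = factor(c("Darl", "Darl", "Pedernales")),  # each column its own type,
  Length   = v,                                        # linked row by row
  Complete = c(TRUE, FALSE, TRUE)
)
str(df)                                  # compact description of the structure
```

Each row of `df` ties a factor, a number, and a logical value to the same case, which is the property that makes data frames convenient for the analyses that follow.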
Referencing Data.frames • R allows you to refer to rows, columns and individual cells in a data frame in multiple ways • Every cell has a row and column number that identifies it (like Excel) • Every cell is the intersection of a named row and a named column
Types of Data in R • Numeric – integer and decimal • Categorical – factors • Ranks – ordered factors • Logical (TRUE/FALSE) • Character
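A minimal sketch of how each of these types can be created and checked in R. The object names and values are made up for illustration.

```r
# Hypothetical values illustrating each data type
len   <- c(34.5, 36.0, 41.8)                        # numeric (decimal)
n     <- c(3L, 5L, 2L)                              # numeric (integer); note the L suffix
type  <- factor(c("Darl", "Pedernales", "Darl"))    # categorical: factor
size  <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)                     # ranks: ordered factor
whole <- c(TRUE, FALSE, TRUE)                       # logical
site  <- c("41CV0270", "41CV1023", "41CV0495")      # character

# First class of each object
sapply(list(len, n, type, size, whole, site), function(x) class(x)[1])

size[1] < size[2]   # comparisons are meaningful only for ordered factors
```

The `levels` argument sets the rank order explicitly; without it R sorts the levels alphabetically, which would rank "large" below "medium".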
Data sets • Darl and Pedernales points from Fort Hood archaeological surveys • Data on village and house sizes among California Indian tribes from an article by Sherburne Cook and Robert Heizer
[Figure: Darl and Pedernales dart points]
> head(DartPoints)
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
36-3321 Darl 41CV1023 12/58   58    12   36.0  17.1   4.0
36-3520 Darl 41CV0495 16/62   62    16   32.4  14.5   5.2
35-2382 Darl 41CV0611 22/62   62    22   31.2  15.6   5.1
40-0847 Darl 41CV1287 05/48   48     5   33.6  15.8   5.1
35-2959 Darl 41CV0235 21/63   63    21   41.8  16.8   4.1
> tail(DartPoints)
              Name     TARL  QUAD East North Length Width Thick
38-0098 Pedernales 41BL0416 39/45   45    39   74.0  34.0   6.6
35-2951 Pedernales 41CV0235 21/63   63    21   64.5  28.5   8.2
35-0173 Pedernales 41CV0869 22/66   66    22   78.3  28.1   8.5
36-4266 Pedernales 41CV0240 15/65   65    15   64.1  27.2  12.0
41-0239 Pedernales 41CV0493 16/62   62    16   67.2  27.1  12.0
35-2855 Pedernales 41CV0843 24/65   65    24   49.3  19.5   7.5
> str(DartPoints)
'data.frame': 55 obs. of 8 variables:
 $ Name  : Factor w/ 2 levels "Darl","Pedernales": 1 1 1 1 1 1 1 1 1 1 ...
 $ TARL  : Factor w/ 43 levels "41BL0183","41BL0205",..: 13 34 18 21 ...
 $ QUAD  : Factor w/ 38 levels "05/48","08/63",..: 30 9 17 28 1 26 15 ...
 $ East  : num 62 58 62 62 48 63 33 63 59 63 ...
 $ North : num 24 12 16 22 5 21 16 17 26 20 ...
 $ Length: num 34.5 36 32.4 31.2 33.6 41.8 33.5 32 42.8 37.5 ...
 $ Width : num 15.9 17.1 14.5 15.6 15.8 16.8 16.6 16 15.8 16.3 ...
 $ Thick : num 4.8 4 5.2 5.1 5.1 4.1 4.9 5.4 5.8 6.1 ...
 - attr(*, "na.action")=Class 'omit' Named int [1:2] 39 53
  .. ..- attr(*, "names")= chr [1:2] "35-2650" "35-2384"
> attributes(DartPoints)
$names
[1] "Name" "TARL" "QUAD" "East" "North" "Length" "Width" "Thick"
$row.names
 [1] "35-3026" "36-3321" "36-3520" "35-2382" "40-0847" "35-2959"
 [7] "41-0257" "36-3619" "41-0322" "35-2921" "36-3036" "35-2905"
[13] "35-2866" "36-3487" "36-4247" "35-2928" "35-2871" "36-3898"
[19] "35-2946" "38-0736" "35-2325" "35-0164" "41-0323" "35-3043"
[25] "35-2004" "35-2960" "41-0237" "44-0643" "43-0110" "36-3549"
[31] "41-0008" "36-4320" "44-1315M" "35-2901" "41-0220" "35-2873"
[37] "47-0041" "36-3879" "41-0054" "50-0092" "44-1492M" "36-3880"
[43] "35-2875" "36-3081" "36-3897" "44-1253M" "36-3229" "41-0058"
[49] "35-2391" "38-0098" "35-2951" "35-0173" "36-4266" "41-0239"
[55] "35-2855"
$class
[1] "data.frame"
> DartPoints[1,]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints["35-3026",]
        Name     TARL  QUAD East North Length Width Thick
35-3026 Darl 41CV0270 24/62   62    24   34.5  15.9   4.8
> DartPoints[,6]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[,"Length"]
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints$Length
 [1] 34.5 36.0 32.4 31.2 33.6 41.8 33.5 32.0 42.8 37.5 38.1 41.8 43.0 43.0 33.1
[16] 40.0 35.5 38.0 45.0 42.2 47.7 42.3 48.5 44.2 47.6 56.0 54.2 35.4 65.0 43.7
[31] 48.1 47.1 45.2 60.0 49.0 43.2 44.8 84.0 52.8 49.1 55.0 57.2 59.0 48.0 52.0
[46] 61.3 55.3 61.1 66.0 74.0 64.5 78.3 64.1 67.2 49.3
> DartPoints[1,6]
[1] 34.5
> head(CAIndians)
  Region   Tribe   Language AreaHouse FamilySize FpHouse PpHouse AreapPer
1      1   Yurok   Algonkin       439        7.5       1     7.5     58.5
2      2   Wiyot   Algonkin       254        7.5       1     7.5     33.8
3      3   Karok      Hokan        NA        7.5       1     7.5       NA
4      4    Hupa Athabaskan       400        7.0       1     7.0     57.1
5      5 Chilula Athabaskan        NA        7.5       1     7.5       NA
6      6  Shasta      Hokan       264        7.0       1     7.0     33.0
  HpVillage PpVillage AreapVil AreaVillage VpHouse VpPer PctFloor
1       7.8        60     3434       25450    3263   424     13.5
2       7.6        57     1930       28400    3738   498      6.8
3       4.1        31       NA          NA      NA    NA       NA
4      10.9        76     4360          NA      NA    NA       NA
5       7.0        52       NA          NA      NA    NA       NA
6       6.0        48     1584       18950    3158   394      8.4
> str(CAIndians)
'data.frame': 30 obs. of 15 variables:
 $ Region     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Tribe      : Factor w/ 30 levels "Achomawi","Athabascans",..: 30 27 8 7 5...
 $ Language   : Factor w/ 6 levels "Algonkin","Athabaskan",..: 1 1 3 2 2 3 3 ...
 $ AreaHouse  : int 439 254 NA 400 NA 264 110 118 100 125 ...
 $ FamilySize : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ FpHouse    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ PpHouse    : num 7.5 7.5 7.5 7 7.5 7 6 6 6 6 ...
 $ AreapPer   : num 58.5 33.8 NA 57.1 NA 33 18.3 19.6 16.7 20.8 ...
 $ HpVillage  : num 7.8 7.6 4.1 10.9 7 6 5.3 5.4 3.6 5 ...
 $ PpVillage  : int 60 57 31 76 52 48 32 32 22 30 ...
 $ AreapVil   : int 3434 1930 NA 4360 NA 1584 583 637 360 625 ...
 $ AreaVillage: int 25450 28400 NA NA NA 18950 14000 27100 61500 6390 ...
 $ VpHouse    : int 3263 3738 NA NA NA 3158 2641 5019 17084 1278 ...
 $ VpPer      : int 424 498 NA NA NA 394 438 847 2795 214 ...
 $ PctFloor   : num 13.5 6.8 NA NA NA 8.4 4.2 2.4 0.6 9.8 ...
Central Tendency • Mean (Average) = Sum/Number • Dichotomous data – percentage present • Median = Middle value • Mode = Predominant value
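Base R has `mean()` and `median()`, but no built-in function for the statistical mode, and the "percentage present" for dichotomous data is simply the mean of a logical vector times 100. The sketch below shows one way to compute each; the data and the helper name `stat_mode()` are hypothetical.

```r
# Hypothetical measurements
x <- c(4, 7, 7, 9, 13)
mean(x)      # sum(x) / length(x)
median(x)    # middle value of the sorted data

# Dichotomous data: TRUE counts as 1 and FALSE as 0,
# so the mean is the proportion present
present <- c(TRUE, FALSE, TRUE, TRUE)
mean(present) * 100

# Hypothetical helper: most frequent value(s), returned as character strings.
# Ties return every value that reaches the maximum count.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[as.vector(counts == max(counts))]
}
stat_mode(x)
```

Note that `mode()` already exists in R but reports the storage mode of an object, not the most common value, which is why the helper uses a different name.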
> mean(DartPoints$Length)
[1] 48.64
> median(DartPoints$Length)
[1] 47.1
> mean(CAIndians$AreaHouse)
[1] NA
> mean(CAIndians$AreaHouse, na.rm=TRUE)
[1] 299.4815
> median(CAIndians$AreaHouse)
[1] NA
> median(CAIndians$AreaHouse, na.rm=TRUE)
[1] 129
> # mean() no longer accepts a data frame in current versions of R (3.0.0 and later); use colMeans()
> colMeans(DartPoints[,6:8])
   Length     Width     Thick
48.640000 22.052727  7.283636
> colMeans(DartPoints[DartPoints$Name=="Darl",6:8])
   Length     Width     Thick
40.574074 18.003704  5.981481
> colMeans(DartPoints[DartPoints$Name=="Pedernales",6:8])
   Length     Width     Thick
56.417857 25.957143  8.539286
Dispersion • Range (max – min) • Standard Deviation, Variance (Sample vs. Population) • Coefficient of Variation = StDev/Mean * 100 • Quartiles and the Interquartile Range
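The sample versus population distinction comes down to the divisor: R's `var()` and `sd()` divide by n - 1 (sample), so a population variance has to be computed by hand. A minimal sketch with hypothetical values:

```r
# Hypothetical values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

samp_var <- var(x)                       # sample variance: divides by n - 1
pop_var  <- sum((x - mean(x))^2) / n     # population variance: divides by n
samp_var * (n - 1) / n                   # rescaling the sample variance gives pop_var

cv <- sd(x) / mean(x) * 100              # coefficient of variation, in percent

q <- quantile(x, c(0.25, 0.75))          # first and third quartiles
unname(q[2] - q[1])                      # interquartile range, same as IQR(x)
```

For large samples the two variances are nearly identical; the difference matters mainly for small n.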
> range(DartPoints$Length)
[1] 31.2 84.0
> diff(range(DartPoints$Length))
[1] 52.8
> sd(DartPoints$Length)
[1] 12.22144
> var(DartPoints$Length)
[1] 149.3636
> sd(DartPoints$Length)/mean(DartPoints$Length)*100
[1] 25.12631
> quantile(DartPoints$Length)
   0%   25%   50%   75%  100%
31.20 40.90 47.10 55.65 84.00
> IQR(DartPoints$Length)
[1] 14.75
> diff(range(CAIndians$AreaHouse, na.rm=TRUE))
[1] 1175
> sd(CAIndians$AreaHouse, na.rm=TRUE)
[1] 339.4273
> var(CAIndians$AreaHouse, na.rm=TRUE)
[1] 115210.9
> quantile(CAIndians$AreaHouse, na.rm=TRUE)
    0%    25%    50%    75%   100%
  75.0  110.5  129.0  310.0 1250.0
Shape • Symmetry, Skewness • Normal distribution = 0; a positive or negative value indicates a longer tail in that direction • Peaked vs Flat, Kurtosis • Reported as excess kurtosis, so normal distribution = 0; positive – more clustered (peaked) than normal; negative – more spread (flatter) than normal
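To make these definitions concrete, here is a sketch of the simple moment-based versions of skewness and excess kurtosis. The function names `moment_skew()` and `moment_exkurt()` and the data are hypothetical; packaged functions apply various small-sample corrections, so their results will generally differ slightly from these.

```r
# Moment-based skewness: third central moment scaled by the
# second central moment raised to the power 3/2
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over the squared
# second central moment, minus 3 so that a normal distribution scores 0
moment_exkurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 10)   # hypothetical data with a long right tail
symmetric    <- c(1, 2, 3, 4, 5)             # hypothetical symmetric, flat data

moment_skew(right_tailed)    # positive: tail to the right
moment_skew(symmetric)       # zero: symmetric
moment_exkurt(symmetric)     # negative: flatter than normal
```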
> library(e1071)
Loading required package: class
> skewness(DartPoints$Length)
[1] 0.7749526
> kurtosis(DartPoints$Length)
[1] 0.12126
> skewness(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.708035
> kurtosis(CAIndians$AreaHouse, na.rm=TRUE)
[1] 1.498035
Descriptive Stats • summary() – in base R • numSummary() – in Rcmdr • describe() – in psych • describe() – in prettyR • stat.desc() – in pastecs
> summary(DartPoints)
         Name          TARL         QUAD         East            North
 Darl      :27   41CV0235: 4   21/63  : 4   Min.   :33.00   Min.   : 5.00
 Pedernales:28   41CV0859: 3   14/62  : 3   1st Qu.:55.00   1st Qu.:14.50
                 41CV1092: 3   16/62  : 3   Median :62.00   Median :20.00
                 41BL0205: 2   20/63  : 3   Mean   :58.24   Mean   :19.02
                 41CV0132: 2   22/66  : 3   3rd Qu.:63.50   3rd Qu.:23.00
                 41CV0493: 2   24/66  : 3   Max.   :70.00   Max.   :39.00
                 (Other) :39   (Other):36
     Length          Width           Thick
 Min.   :31.20   Min.   :14.50   Min.   : 4.000
 1st Qu.:40.90   1st Qu.:16.95   1st Qu.: 5.850
 Median :47.10   Median :22.00   Median : 7.200
 Mean   :48.64   Mean   :22.05   Mean   : 7.284
 3rd Qu.:55.65   3rd Qu.:26.95   3rd Qu.: 8.050
 Max.   :84.00   Max.   :34.00   Max.   :12.000
> numSummary(DartPoints[,6:8])
            mean        sd   0%   25%  50%   75% 100%  n
Length 48.640000 12.221438 31.2 40.90 47.1 55.65   84 55
Width  22.052727  5.194579 14.5 16.95 22.0 26.95   34 55
Thick   7.283636  1.891870  4.0  5.85  7.2  8.05   12 55
> library(psych)
> describe(DartPoints[,6:8])
       var  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
Length   1 55 48.64 12.22   47.1   47.63 12.16 31.2  84  52.8 0.77     0.38 1.65
Width    2 55 22.05  5.19   22.0   21.85  7.41 14.5  34  19.5 0.24    -1.16 0.70
Thick    3 55  7.28  1.89    7.2    7.13  1.48  4.0  12   8.0 0.69     0.50 0.26
> detach(package:psych)
> library(prettyR)
> describe(DartPoints[,6:8])
Description of DartPoints[, 6:8]
Numeric
        mean median   var    sd valid.n
Length 48.64   47.1 149.4 12.22      55
Width  22.05     22 26.98 5.195      55
Thick  7.284    7.2 3.579 1.892      55
> library(pastecs)
> stat.desc(DartPoints[,6:8])
                   Length        Width       Thick
nbr.val        55.0000000   55.0000000  55.0000000
nbr.null        0.0000000    0.0000000   0.0000000
nbr.na          0.0000000    0.0000000   0.0000000
min            31.2000000   14.5000000   4.0000000
max            84.0000000   34.0000000  12.0000000
range          52.8000000   19.5000000   8.0000000
sum          2675.2000000 1212.9000000 400.6000000
median         47.1000000   22.0000000   7.2000000
mean           48.6400000   22.0527273   7.2836364
SE.mean         1.6479384    0.7004369   0.2550997
CI.mean.0.95    3.3039176    1.4042914   0.5114441
var           149.3635556   26.9836498   3.5791717
std.dev        12.2214384    5.1945789   1.8918699
coef.var        0.2512631    0.2355527   0.2597425
Decimals • The summary functions offer only limited control over rounding and number formatting in their output: • The options digits= and scipen= can be set with options() before running the summary • Wrapping the function call in round() works for some of them.
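A small sketch of the two approaches on a plain named vector. The three numbers are the mean, standard deviation, and coefficient of variation of Length reported by stat.desc() above; the object name `stats` is hypothetical.

```r
# Values copied from the stat.desc() output for Length
stats <- c(mean = 48.64, std.dev = 12.2214384, coef.var = 0.2512631)

# Approach 1: round the result itself
round(stats, 2)     # fixed number of decimal places
signif(stats, 3)    # fixed number of significant digits

# Approach 2: change how R prints, then restore the previous settings.
# options() returns the old values, so they can be put back afterwards.
op <- options(digits = 3, scipen = 100)
print(stats)
options(op)
```

`round()` changes the stored values, while `options(digits=)` only changes how they are displayed, so the second approach is safer when the numbers feed later calculations.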
> stat.desc(DartPoints[,6:8], norm=TRUE)
                 Length       Width      Thick
. . . . .
skewness   7.749526e-01  0.23775656 0.68608894
skew.2SE   1.204307e+00  0.36948312 1.06620943
kurtosis   1.212600e-01 -1.22700641 0.23109312
kurt.2SE   9.570531e-02 -0.96842355 0.18239189
normtest.W 9.435694e-01  0.92900685 0.94970998
normtest.p 1.207084e-02  0.00300526 0.02233218
> op <- options(digits=3, scipen=100)
> stat.desc(DartPoints[,6:8], norm=TRUE)
           Length    Width  Thick
. . . . .
skewness   0.7750  0.23776 0.6861
skew.2SE   1.2043  0.36948 1.0662
kurtosis   0.1213 -1.22701 0.2311
kurt.2SE   0.0957 -0.96842 0.1824
normtest.W 0.9436  0.92901 0.9497
normtest.p 0.0121  0.00301 0.0223
> options(op)
Publishable Tables • Much of the focus in producing publishable tables in R is on LaTeX • Most anthropologists are more familiar with HTML • xtable() provides both if there is an xtable method for your output
Using xtable() • xtable(function-output) produces a LaTeX version of the table • print(xtable(function-output), type="html") converts it to HTML • Adding file="mytable.html" to the print() call writes the output to a file
> library(psych)
> library(xtable)
> print(xtable(describe(DartPoints[,6:8])), type="html")
<!-- html table generated in R 2.13.1 by xtable 1.5-6 package -->
<!-- Mon Sep 05 11:54:20 2011 -->
<TABLE border=1>
<TR> <TH> </TH> <TH> var </TH> <TH> n </TH> <TH> mean </TH> <TH> sd </TH> <TH> median </TH> <TH> trimmed </TH> <TH> mad </TH> <TH> min </TH> <TH> max </TH> <TH> range </TH> <TH> skew </TH> <TH> kurtosis </TH> <TH> se </TH> </TR>
. . . . .
</TABLE>
Use print(xtable(x), type="html"), select the HTML commands (<TABLE> to </TABLE>), copy, and paste into Excel; or use print(xtable(x), type="html", file="filename.html") and insert the file into Excel or Word.
Summary 1
Extract the table part of the results:
print(numSummary(x)$table, digits=4)
round(numSummary(x)$table, 3)
print(xtable(numSummary(x)$table), type="html")