R Installation
R is an open-source software package for statistical data analysis.
R Installation
• R download: http://www.r-project.org/
  • Germany, Stefan Drees Bonn: http://cran.r-mirror.de/
  • This is the main program!
• RStudio (auxiliary program for editing R files): http://www.rstudio.com/ide/download/
• Install the programs (R should already be there).
Working with the R command prompt (without RStudio)
• Practical session
Working with RStudio
• Start RStudio
• "File -> New -> R Script" lets you edit an R command or an R script (= small program = several consecutive commands)
Working with RStudio
• Select:
  • Files
  • Plots
  • Packages (for advanced analyses)
  • Help
[Screenshot labels: New R files; The command prompt]
Working with RStudio
• Commands and programs can be stored in R files
• Execute one command line:
  • Ctrl+Enter, or
  • the "Run" button
• Execute several lines:
  • Mark the lines and use Ctrl+Enter or the "Run" button
What is statistics?
• Statistics is a means of connecting empirical knowledge and theory. It consists of:
  • Data representation (empirics)
  • Methods for the description, analysis, and interpretation of data, in order to allow predictions, conclusions, and decisions (statistical theory)
What is statistics?
• Descriptive statistics
• Probability theory
• Test theory
Descriptive statistics
• Basic concepts:
  • Population: collection of objects about which a conclusion is to be drawn (these can be human beings, but also a collection of atoms when applied in physics)
  • Sample: a representative part (subset) of the population
  • Random sample: elements of the population drawn randomly and independently of each other
• Example: "Mietspiegel" (= rent statistics) for the city of Bonn
  • Population: all rooms, flats, etc. for rent in Bonn (too many to investigate them all)
  • Sample: a selected part, e.g. all flats from Poppelsdorf
  • Random sample: investigation of n = 100, 200, … random objects from Bonn
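
Drawing a random sample can be sketched in R with `sample()`. This is only an illustration: the population size of 5000 flats and the object numbering are assumptions, not figures from the slides.

```r
# Hypothetical population: 5000 flats for rent, identified by a running number.
# (The population size is an assumption made for this sketch.)
set.seed(1)                      # makes the random draw reproducible
population <- 1:5000

n <- 100                         # sample size, as in the Bonn example
random_sample <- sample(population, size = n)  # draw n objects without replacement

head(random_sample)              # first few selected flat numbers
```

Every flat has the same chance of being selected, which is what separates a random sample from a convenience sample such as "all flats from Poppelsdorf".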
Descriptive statistics
• Observational objects: patients, blood samples, DNA samples, houses, atoms
• Attributes / traits: blood pressure, weight, age, blood group, number of siblings, marital status, rent
• Values:
  • qualitative: blood group, marital status
  • quantitative:
    • discrete: number of siblings
    • continuous: blood pressure, weight, age, rent
Descriptive statistics
• Scaling:
  • Nominal scale: attribute values that are not directly comparable (sex, subject of studies, country of origin) (qualitative)
  • Ordinal scale: attribute values that have a "natural" order (grades; font sizes: tiny, small, medium, large, huge)
  • Interval scale: the difference between attribute values is interpretable (temperature in °C) (quantitative)
• To be distinguished:
  • Discrete attributes: attribute values can be counted
  • Continuous attributes: all real numbers, or at least all numbers from an interval, are possible
Descriptive statistics
• Frequencies:
  • Absolute frequency n_i:
    • Number of observations with attribute value i (counts)
  • Relative frequency h_i:
    • Proportion of elements with attribute value i
    • Computed as the absolute frequency divided by the total number of objects N: h_i = n_i / N
    • Relative frequencies lie between 0 and 1
    • Relative frequencies have to add up to 1 (this can be used to check the computation)
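
As a small sketch of these definitions, `table()` counts absolute frequencies in R, and dividing by N gives the relative frequencies. The ten observations below are invented for illustration.

```r
# Hypothetical observations of a qualitative attribute (blood group)
blood <- c("A", "0", "A", "B", "0", "A", "AB", "0", "A", "0")

n_i <- table(blood)    # absolute frequencies n_i (counts per attribute value)
N   <- length(blood)   # total number of objects N
h_i <- n_i / N         # relative frequencies h_i = n_i / N

print(n_i)
print(h_i)
print(sum(h_i))        # check of the computation: should be 1
```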
Descriptive statistics
AB0 blood group (tally sheet)

No.  Value   n_i Bonn   n_i Köln   h_i Bonn   h_i Köln
1    0       17         78         0.34       0.39
2    A1      19         76         0.38       0.38
3    A2      6          20         0.12       0.10
4    B       5          18         0.10       0.09
5    A1B     2          6          0.04       0.03
6    A2B     1          2          0.02       0.01
7    other   0          0          0.00       0.00
     Total   N = 50     N = 200    1.00       1.00
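
The relative frequencies in the blood-group table can be recomputed from the counts. The sketch below uses the Bonn and Köln counts from the slide; the variable names are chosen here for illustration.

```r
# Absolute frequencies from the AB0 blood group table
groups  <- c("0", "A1", "A2", "B", "A1B", "A2B", "other")
n_bonn  <- c(17, 19, 6, 5, 2, 1, 0)
n_koeln <- c(78, 76, 20, 18, 6, 2, 0)

# Relative frequencies: counts divided by the respective total N
h_bonn  <- n_bonn  / sum(n_bonn)
h_koeln <- n_koeln / sum(n_koeln)

result <- data.frame(group = groups, n_bonn, n_koeln, h_bonn, h_koeln)
print(result)
print(colSums(result[, -1]))   # totals: N per city, and 1 for each h column
```

Relative frequencies make the two cities comparable even though the sample sizes (50 and 200) differ.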
Descriptive statistics
• Frequencies:
  • Cumulative frequency:
    • Sum of all frequencies up to a given value i
    • Denoted as N_i for absolute frequencies and as H_i for relative frequencies
    • Often used when values are subdivided into classes
• Classification:
  • Arrangement of attribute values into disjoint groups, so-called "classes"
  • Classes are disjoint (i.e. non-overlapping), neighbouring intervals of attribute values, each defined by a lower and an upper bound. "Neighbouring" implies that every value belongs to a class and none lies outside (completeness of the classification).
Descriptive statistics: height
• Classification:
  • complete
  • disjoint (each value belongs to only one class)
• Class limits: (160; 170] contains all values that are > 160 but ≤ 170.
[Figure: height axis from 150 to 200 cm, divided into the half-open classes (150; 160], (160; 170], (170; 180], (180; 190], (190; 200]]
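
Half-open classes of the form (a; b] can be produced in R with `cut()`. A minimal sketch follows; the height values are invented so that some fall exactly on a class limit.

```r
# Hypothetical heights in cm, including values exactly on class limits
heights <- c(158, 160, 160.5, 170, 170.1, 185, 199)

# Class limits 150, 160, ..., 200; right = TRUE gives intervals (a, b]
limits  <- seq(150, 200, by = 10)
classes <- cut(heights, breaks = limits, right = TRUE)

print(data.frame(heights, classes))
print(table(classes))   # absolute frequency per class
```

A height of exactly 170 lands in (160,170], not in (170,180], so every value belongs to exactly one class.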
Descriptive statistics: height [cm] (tally sheet)

Class      Class limits    Frequency               Cumulative frequency
number i   (a_{i-1}; a_i]  absolute   relative     absolute   relative
                           n_i        h_i          N_i        H_i
1          ≤ 150           0          0.00         0          0.00
2          (150; 160]      5          0.05         5          0.05
3          (160; 170]      30         0.30         35         0.35
4          (170; 180]      35         0.35         70         0.70
5          (180; 190]      25         0.25         95         0.95
6          (190; 200]      5          0.05         100        1.00
7          > 200           0          0.00         100        1.00
           Total           N = 100    1.00
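
The cumulative columns of the height table are running sums of the frequency columns. In R this can be sketched with `cumsum()`, using the class counts from the slide.

```r
# Absolute class frequencies n_i from the height table (classes 1 to 7)
n_i <- c(0, 5, 30, 35, 25, 5, 0)

N   <- sum(n_i)      # total number of observations
h_i <- n_i / N       # relative frequencies
N_i <- cumsum(n_i)   # cumulative absolute frequencies
H_i <- cumsum(h_i)   # cumulative relative frequencies

print(data.frame(class = 1:7, n_i, h_i, N_i, H_i))
```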
Descriptive statistics
Graphical representation
Descriptive statistics
• Pie chart (R function: pie())
  • Shows absolute frequencies
  • Example: blood groups
Descriptive statistics
• Bar chart (R function: barplot())
  • Shows relative frequencies
  • Example: blood groups
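
Both chart types can be tried with the Bonn blood-group counts from the tally sheet. This is a sketch only; the titles and axis labels are chosen here, and the empty category "other" is left out.

```r
# Bonn counts from the AB0 blood group table (category "other" = 0 omitted)
n_bonn <- c("0" = 17, A1 = 19, A2 = 6, B = 5, A1B = 2, A2B = 1)

# Pie chart of the absolute frequencies
pie(n_bonn, main = "AB0 blood groups, Bonn (N = 50)")

# Bar chart of the relative frequencies
h_bonn <- n_bonn / sum(n_bonn)
barplot(h_bonn,
        xlab = "blood group",
        ylab = "relative frequency",
        main = "AB0 blood groups, Bonn")
```

In RStudio the charts appear in the "Plots" tab.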
Descriptive statistics
• Representation of cumulative frequencies with the empirical distribution function F
• Discrete trait: number of children (tally sheet)

No.  Number of   Frequencies             Cumulative frequencies
     children    absolute   relative     absolute   relative
                 n_i        h_i          N_i        H_i
1    0           5          0.10         5          0.10
2    1           20         0.40         25         0.50
3    2           15         0.30         40         0.80
4    3           5          0.10         45         0.90
5    4           3          0.06         48         0.96
6    > 4         2          0.04         50         1.00
     Total       N = 50     1.00
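Descriptive statistics: number of children
[Figure, left: bar chart of the relative frequencies h_i over the number of children (0, 1, 2, 3, 4, >4)]
[Figure, right: empirical distribution function F (cumulative relative frequencies H_i) over the number of children (0, 1, 2, 3, 4, >4)]
Since the attribute is quantitative and discrete, we obtain a step function.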
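
The step function can be reproduced with R's `ecdf()`. In this sketch the raw data are rebuilt from the frequency table for the number of children; coding the open class ">4" as 5 is an assumption made only so the values are numeric.

```r
# Rebuild the 50 observations from the absolute frequencies 5, 20, 15, 5, 3, 2.
# Assumption: the open class ">4" is coded as 5.
children <- rep(0:5, times = c(5, 20, 15, 5, 3, 2))

Fn <- ecdf(children)   # empirical distribution function
print(Fn(0:5))         # cumulative relative frequencies H_i at each value

plot(Fn,
     xlab = "number of children (5 stands for >4)",
     ylab = "F",
     main = "Empirical distribution function")
```

The jump of F at each value equals the relative frequency h_i of that value.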
Descriptive statistics
• Histograms (R function: hist())
• Construction:
  • Data is subdivided into classes
  • The surface area of each column is proportional to the respective frequency
  • Columns are adjacent since the classes are adjacent
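
A histogram along these lines can be sketched with `hist()`. The individual heights are not given in the slides, so each observation is placed at the midpoint of its class from the height table (an assumption for illustration).

```r
# Assumption: every observation sits at its class midpoint; the class counts
# 5, 30, 35, 25, 5 are taken from the height table.
heights <- rep(c(155, 165, 175, 185, 195), times = c(5, 30, 35, 25, 5))

# freq = FALSE scales the columns so that area = relative frequency
h <- hist(heights,
          breaks = seq(150, 200, by = 10),
          freq   = FALSE,
          xlab   = "height [cm]",
          main   = "Histogram of height")

print(h$counts)                     # absolute frequency per class
print(h$density * diff(h$breaks))   # column areas = relative frequencies h_i
```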
Descriptive statistics
Example: height [cm]
[Figure: histogram of the relative frequencies h_i (vertical axis, 0 to 1) over height [cm] (horizontal axis, 150 to 200)]
Descriptive statistics
[Figure, left: empirical density function f (column areas h_i) over height [cm], 150 to 200]
[Figure, right: empirical distribution function F (for a continuous trait) over height [cm], 150 to 200]
Descriptive statistics
• Note: Slides 23 and 26 both show empirical distribution functions. In the first case, we obtain a step function since the trait under investigation is discrete.
Descriptive statistics
Measures of central tendency, dispersion and spread
Descriptive statistics
• Measures of central tendency:
  • A number to characterize the "center" of the data
  • Most important:
    • Mean
    • Median
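Descriptive statistics
• Median (R function: median())
  • Sample: $x_1, \dots, x_n$
  • Order according to: $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ (ordered sample)
  • Median:
    • n even: $\tilde{x} = \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$
    • n odd: $\tilde{x} = x_{((n+1)/2)}$

n = 8 (even)
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6, x8=7
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=7, x(6)=8, x(7)=9, x(8)=19

n = 7 (odd)
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=8, x(6)=9, x(7)=19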
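
The two cases (odd and even n) can be checked in R with the samples from this slide. This is a short sketch; the variable names are chosen here.

```r
# Sample with n = 7 (odd): the median is the middle value of the ordered sample
x7 <- c(5, 9, 3, 8, 19, 4, 6)
print(sort(x7))      # ordered sample: 3 4 5 6 8 9 19
print(median(x7))    # x(4) = 6

# Sample with n = 8 (even): the median is the average of the two middle values
x8 <- c(x7, 7)
print(sort(x8))      # ordered sample: 3 4 5 6 7 8 9 19
print(median(x8))    # (x(4) + x(5)) / 2 = (6 + 7) / 2 = 6.5
```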
Descriptive statistics
• Mean (R function: mean())
  • Sample: $x_1, \dots, x_n$
  • Sample size: n
  • Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
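
As a quick check, the mean can be computed by hand and compared with `mean()`. The sketch reuses the seven-value sample from the median slide.

```r
x <- c(5, 9, 3, 8, 19, 4, 6)   # sample from the median example
n <- length(x)                 # sample size

x_bar_manual <- sum(x) / n     # mean computed from its definition
x_bar        <- mean(x)        # built-in R function

print(c(manual = x_bar_manual, builtin = x_bar))
```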
Descriptive statistics
• Comparison of median and mean:
  • Both samples have median 2500
  • … and … are the mean values
  • The mean can be strongly influenced by a single value
  • The median is more robust against extreme values ("outliers")
Nevertheless, the mean is used more often in practice since it has other desirable properties (see later).
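
The effect of a single extreme value can be illustrated with two small samples. The numbers below are invented for this sketch (they are not the samples from the slide) and are chosen so that both have median 2500.

```r
# Hypothetical samples: identical except for the largest value
sample_a <- c(2000, 2400, 2500, 2600, 3000)
sample_b <- c(2000, 2400, 2500, 2600, 30000)   # last value is an outlier

print(c(median_a = median(sample_a), median_b = median(sample_b)))
print(c(mean_a = mean(sample_a), mean_b = mean(sample_b)))
```

The median is the same in both samples, while the mean of the second sample is pulled far upwards by the single outlier.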
Descriptive statistics
How to treat outliers?
1) Discard: No!
2) Check the value and correct it: Yes!
[Figure: an outlier marked on the x axis]
Descriptive statistics
The mean (or median) is not sufficient to describe a sample.
• Measure the amount of variation of the data!
[Figure: sample A and sample B, each plotted on an x axis]
Descriptive statistics
• Measures of dispersion and spread:
  • Numbers to characterize the amount of variation around the center (= mean)
  • Most important:
    • Minimum, maximum, range (dispersion)
    • Empirical variance (spread)
    • Empirical standard deviation (spread)
Descriptive statistics
• Range:
  • minimum: min = x(1)
  • maximum: max = x(n)
  • range: R = x(n) - x(1)

n = 7
  sample: x1=5, x2=9, x3=3, x4=8, x5=19, x6=4, x7=6
  ranks:  x(1)=3, x(2)=4, x(3)=5, x(4)=6, x(5)=8, x(6)=9, x(7)=19
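
For the sample on this slide, minimum, maximum, and range can be obtained in R as sketched below. Note that R's `range()` returns the pair (min, max), so the range R in the sense of the slide is their difference.

```r
x <- c(5, 9, 3, 8, 19, 4, 6)   # sample with n = 7

x_min <- min(x)                # x(1) = 3
x_max <- max(x)                # x(n) = 19
R     <- x_max - x_min         # range R = x(n) - x(1) = 16

print(range(x))                # returns c(min, max), here 3 and 19
print(diff(range(x)))          # same value as R
```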
Descriptive statistics
• Variance (R function: var()):
  • A measure that expresses the spread around the center (mean) by a single value
  • The squared deviation of each attribute value from the mean is considered.
  • Formula for the empirical variance from a sample of n elements: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • The empirical standard deviation is just the square root of the variance, $s = \sqrt{s^2}$ (R function: sd()).
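
The variance formula with the divisor n - 1 can be checked against R's built-in `var()` and `sd()`. The sketch reuses the seven-value sample from the earlier slides.

```r
x <- c(5, 9, 3, 8, 19, 4, 6)
n <- length(x)

# Empirical variance from the definition: squared deviations from the mean,
# summed and divided by (n - 1)
s2_manual <- sum((x - mean(x))^2) / (n - 1)
s_manual  <- sqrt(s2_manual)   # empirical standard deviation

print(c(manual = s2_manual, builtin = var(x)))
print(c(manual = s_manual,  builtin = sd(x)))
```

`var()` and `sd()` also divide by n - 1, so the manual and built-in results agree.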
Why divide by (n - 1) instead of n?
Example: n = 4, $\bar{x}$ = 100
  x1 = 75
  x2 = 2
  x3 = 270
  x4 = ?
x4 = 53 is not free, but given by the other values when the mean is known.
s² has (n - 1) degrees of freedom (f).
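
The arithmetic behind this example can be sketched in a few lines of R: once the mean and three of the four values are known, the fourth value is fixed.

```r
n     <- 4
x_bar <- 100                # known mean
known <- c(75, 2, 270)      # x1, x2, x3 can be chosen freely

# The sum of all values must equal n * mean, so x4 is determined
x4 <- n * x_bar - sum(known)
print(x4)                   # 53

print(mean(c(known, x4)))   # check: the mean is 100 again
```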
Data
• If you have data you want to analyse, please bring it along!