General additive models
Variance and covariance. Sums of squares. M contains the mean.
The coefficient of correlation. We deal with samples. For a matrix X that contains several variables, the analogous matrix expressions hold. The diagonal matrix S_X contains the standard deviations as entries. X - M is called the central matrix. The matrix R is a symmetric distance matrix that contains all correlations between the variables.
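The matrix route from X to R can be sketched in a few lines. This is a minimal illustration, assuming the usual sample definitions (covariance matrix = (X - M)'(X - M) / (n - 1), R = S_X^-1 covariance S_X^-1), since the slide's own formulas are not preserved in the transcript. The function name `correlation_matrix` and the example data are hypothetical.

```python
import numpy as np

def correlation_matrix(X):
    """Correlation matrix of the columns of X, built from the central matrix X - M."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # M repeats the column means in every row (assumed layout of M).
    M = np.ones((n, 1)) @ X.mean(axis=0, keepdims=True)
    C = X - M                                      # central matrix
    cov = C.T @ C / (n - 1)                        # sample covariance matrix
    S_inv = np.diag(1.0 / np.sqrt(np.diag(cov)))   # inverse of the diagonal matrix S_X
    return S_inv @ cov @ S_inv

# Hypothetical data: 5 cases (rows), 3 variables (columns).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 1.0],
              [3.0, 3.5, 0.0],
              [4.0, 3.0, 2.5],
              [5.0, 5.0, 2.0]])
R = correlation_matrix(X)
```

R is symmetric with ones on the diagonal, and it should agree with NumPy's built-in `np.corrcoef(X, rowvar=False)`.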
Pre- and postmultiplication. Premultiplication. Postmultiplication. A special case holds for diagonal matrices X.
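The difference between the two products is easiest to see with a diagonal matrix. The sketch below is an illustration under the standard rule that premultiplying by a diagonal matrix scales rows and postmultiplying scales columns; the slide's own identity is not preserved, and the matrices `A` and `D` are hypothetical.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
D = np.diag([10.0, 100.0])   # hypothetical diagonal matrix

pre = D @ A    # premultiplication: row i of A is multiplied by D[i, i]
post = A @ D   # postmultiplication: column j of A is multiplied by D[j, j]

# In general the two products differ, so the order of multiplication matters.
products_differ = not np.allclose(pre, post)
```

This is the operation behind R = S_X^-1 (covariance) S_X^-1: the premultiplication rescales the rows and the postmultiplication rescales the columns of the covariance matrix.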
Linear regression. European bat species and environmental correlates.
N = 62. Matrix approach to linear regression. X is not a square matrix, hence X^-1 doesn't exist.
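Because X cannot be inverted directly, the standard workaround is to premultiply by X', which gives the square matrix X'X and the least-squares solution b = (X'X)^-1 X'Y. The sketch below assumes this is the approach the slide refers to; the function name `ols` and the toy data are hypothetical, not the bat data set.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients [intercept, slopes...] from the normal equations."""
    y = np.asarray(y, dtype=float)
    # Add a column of ones so that the first coefficient is the intercept.
    Xd = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    # Solve (X'X) b = X'y; this fails if X'X is singular.
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# Hypothetical noise-free data generated from y = 1.5 + 2.0 x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.5 + 2.0 * x
coef = ols(x, y)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is numerically safer than computing (X'X)^-1 and multiplying.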
The species-area relationship of European bats. What about the part of variance explained by our model? 1.16: average number of species per unit area (species density). 0.24: spatial species turnover.
How to interpret the coefficient of determination. Total variance. Rest (unexplained) variance. Explained variance. Statistical testing is done by an F-test or a t-test.
The general linear model. A model that assumes that a dependent variable Y can be expressed by a linear combination of predictor variables X is called a linear model. The vector E contains the error terms of each regression. The aim is to minimize E.
The general linear model. If the errors of the predictor variables are Gaussian, the error term e should also be Gaussian, and means and variances are additive. Total variance. Explained variance. Unexplained (rest) variance.
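The additivity of the variance components can be checked numerically. The sketch below assumes an ordinary least-squares line with intercept and works with sums of squares (each variance is its sum of squares divided by the same n - 1); the function name `variance_partition` and the data are hypothetical.

```python
import numpy as np

def variance_partition(x, y):
    """Return (total, explained, rest) sums of squares for a fitted straight line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)          # least-squares line
    y_hat = intercept + slope * x                   # model predictions
    ss_total = np.sum((y - y.mean()) ** 2)          # total
    ss_explained = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
    ss_rest = np.sum((y - y_hat) ** 2)              # unexplained (rest)
    return ss_total, ss_explained, ss_rest

# Hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])
ss_total, ss_explained, ss_rest = variance_partition(x, y)
r2 = ss_explained / ss_total   # coefficient of determination
```

For a least-squares fit with intercept, total = explained + rest, so r2 can equally be written as 1 - ss_rest / ss_total.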
Multiple regression. Model formulation. Estimation of model parameters. Estimation of statistical significance.
The coefficient of determination. Diagram: predictor variables x1, x2 to xm and the dependent variable y. The correlation matrix can be divided into four compartments.
R: correlation matrix. n: number of cases. k: number of independent variables in the model. D < 0 is statistically not significant and should be eliminated from the model. Adjusted R2.
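The adjustment can be sketched as a one-line function. The slide's formula is not preserved, so this assumes the common definition, adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1), with n and k as defined above; the function name and the example values other than n = 62 are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination (common definition, assumed here).

    r2: unadjusted coefficient of determination
    n:  number of cases
    k:  number of independent variables in the model
    """
    if n - k - 1 <= 0:
        raise ValueError("need more cases than parameters: n - k - 1 must be positive")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: R2 = 0.80 with n = 62 cases and k = 4 predictors.
example = adjusted_r2(0.80, 62, 4)
```

The adjusted value never exceeds the unadjusted R2, and the penalty grows as k approaches n.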
The final model. Negative species density. Realistic increase of species richness with area. Increase of species richness with winter length. Increase of species richness at higher latitudes. A peak of species richness at intermediate latitudes. Is this model realistic? The model makes a series of unrealistic predictions. Our initial assumptions are wrong despite the high degree of variance explanation. Our problem arises in part from the intercorrelation between the predictor variables (multicollinearity). We solve the problem by a step-wise approach, eliminating the variables that are either not significant or give unreasonable parameter values. The variance explanation of this final model is higher than that of the previous one.
Multiple regression solves systems of intrinsically linear algebraic equations. Polynomial regression. General additive model.
• The matrix X'X must not be singular. That is, the variables have to be independent. Otherwise we speak of multicollinearity. Collinearities of r < 0.7 are in most cases tolerable.
• To be applied safely, multiple regression needs at least 10 times as many cases as variables in the model.
• Statistical inference assumes that errors have a normal distribution around the mean.
• The model assumes linear (or algebraic) dependencies. Check first for non-linearities.
• Check the distribution of residuals Yexp - Yobs. This distribution should be random.
• Check whether the parameters have realistic values.
Multiple regression is a hypothesis-testing and not a hypothesis-generating technique!
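Polynomial regression is "intrinsically linear" in the sense that the powers of x can be treated as separate predictor columns, so the same least-squares machinery applies. The sketch below illustrates this under that assumption; the function name `polynomial_fit` and the noise-free data are hypothetical.

```python
import numpy as np

def polynomial_fit(x, y, degree):
    """Fit y = c0 + c1*x + ... + cd*x**d by treating each power of x as a predictor."""
    x = np.asarray(x, dtype=float)
    # Design matrix with columns 1, x, x**2, ..., x**degree.
    X = np.column_stack([x ** p for p in range(degree + 1)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Hypothetical noise-free data generated from y = 1 - 0.5 x + 2 x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 0.5 * x + 2.0 * x ** 2
coef = polynomial_fit(x, y, 2)
```

The powers of x are usually strongly intercorrelated, so the multicollinearity warning in the list above applies to polynomial models in particular.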
Standardized coefficients of correlation. Z-transformed distributions have a mean of 0 and a standard deviation of 1. In the case of bivariate regression Y = aX + b, R_XX = 1. Hence B = R_XY. Hence the use of Z-transformed values results in standardized correlation coefficients, termed beta-values.
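The bivariate statement B = R_XY is the special case of a more general relation. The sketch below assumes the standard result that regression on Z-transformed variables gives B = R_XX^-1 R_XY, where R_XX holds the correlations among the predictors and R_XY the correlations of the predictors with Y; the function name `beta_weights` and the simulated data are hypothetical.

```python
import numpy as np

def beta_weights(X, y):
    """Standardized regression coefficients B = R_XX^-1 R_XY (assumed standard result)."""
    data = np.column_stack([np.asarray(X, dtype=float), np.asarray(y, dtype=float)])
    R = np.corrcoef(data, rowvar=False)   # full correlation matrix, y in the last column
    R_xx = R[:-1, :-1]                    # correlations among the predictors
    R_xy = R[:-1, -1]                     # correlations of each predictor with y
    return np.linalg.solve(R_xx, R_xy)

# Hypothetical simulated data: two predictors, 50 cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50)
betas = beta_weights(X, y)
beta_single = beta_weights(X[:, :1], y)   # bivariate case: equals r_XY
```

With a single predictor R_XX is the 1 x 1 matrix [1], so the beta-value reduces to the simple coefficient of correlation, as the slide states.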
How to interpret beta-values
• If ..., then ...
• Beta-values are generalisations of simple coefficients of correlation. However, there is an important difference. The higher the correlation between two or more predictor variables (multicollinearity), the less r will depend on the correlation between X and Y. Hence other variables might have more and more influence on r and b. At high levels of multicollinearity it might therefore become more and more difficult to interpret beta-values in terms of correlations. Because beta-values are standardized b-values, they should allow comparisons to be made about the relative influence of predictor variables. High levels of multicollinearity might lead to misinterpretations. Beta-values above one are always a sign of too high multicollinearity.
• Hence high levels of multicollinearity might
• reduce the exactness of beta-weight estimates
• change the probabilities of making type I and type II errors
• make it more difficult to interpret beta-values.
• We might apply an additional parameter, the so-called coefficient of structure. The coefficient of structure c_i is defined in terms of r_iY and R2, where r_iY denotes the simple correlation between predictor variable i and the dependent variable Y, and R2 the coefficient of determination of the multiple regression.
• Coefficients of structure therefore measure the fraction of total variability a given predictor variable explains. Again, the interpretation of c_i is not always unequivocal at high levels of multicollinearity.
Partial correlations. The partial correlation r_xy/z is the correlation of the residuals DX and DY, that is, of what remains of X and Y after each has been regressed on z. Semipartial correlation. A semipartial correlation correlates a variable with one residual only.
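Both definitions translate directly into code: remove the linear effect of z by regression, then correlate what is left. This is a minimal sketch assuming straight-line regressions on a single control variable z; the function names and the simulated data are hypothetical.

```python
import numpy as np

def residuals(a, z):
    """Residuals of a after a straight-line regression on z."""
    slope, intercept = np.polyfit(z, a, 1)
    return a - (intercept + slope * z)

def partial_correlation(x, y, z):
    """Correlation of the residuals of x and of y, both regressed on z."""
    return np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

def semipartial_correlation(x, y, z):
    """Correlation of y with the residual of x only (z removed from x, not from y)."""
    return np.corrcoef(residuals(x, z), y)[0, 1]

# Hypothetical simulated data: x and y are related only through z.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
x = z + rng.normal(size=200)
y = z + rng.normal(size=200)
r_partial = partial_correlation(x, y, z)
r_semipartial = semipartial_correlation(x, y, z)
```

In this simulation x and y share only the common driver z, so the partial correlation should be much closer to zero than the simple correlation between x and y.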
Path analysis and linear structure models. Multiple regression. The error term e contains the part of the variance in Y that is not explained by the model. These errors are called residuals. Regression analysis does not study the relationships between the predictor variables. Path analysis defines a whole model and tries to separate correlations into direct and indirect effects. Path analysis tries to do something that is logically impossible: to derive causal relationships from sets of observations.
Path analysis is largely based on the computation of partial coefficients of correlation. Path coefficients. Path analysis is a model confirmatory tool. It should not be used to generate models or even to search for models that fit the data set. We start from regression functions.
From Z-transformed values we get e Z_Y = 0, Z_Y Z_Y = 1, Z_X Z_Y = r_XY. Path analysis is a nice tool to generate hypotheses. It fails at low coefficients of correlation and circular model structures.
Statistical inference. Rounding errors due to different precisions cause the residual variance to be larger than the total variance.
Logistic and other regression techniques. We use odds. The logistic regression model.
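The link between odds and the logistic model can be sketched as follows. The slide's formulas are not preserved, so this assumes the standard definitions: odds = p / (1 - p), and a logistic model in which the log-odds are a linear function b0 + b1 x. The function names and parameter values are hypothetical.

```python
import math

def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))

def logistic(x, b0, b1):
    """Probability predicted by a logistic model with linear predictor b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical parameters: the log-odds of the prediction recover b0 + b1*x.
p = logistic(2.0, -1.0, 0.7)
log_odds = logit(p)
```

Taking the logit of the predicted probability returns the linear predictor, which is why the model can be fitted and interpreted on the log-odds scale.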
Generalized non-linear regression models. A special regression model that is used in pharmacology. b0 is the maximum response at dose saturation. b1 is the concentration that produces a half maximum response. b2 determines the slope of the function; that is, it measures how fast the response increases with increasing drug dose.
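The transcript does not preserve the model equation. One function that matches the three parameter descriptions is the Hill-type dose-response curve, response = b0 x^b2 / (b1^b2 + x^b2); the sketch below assumes that form, and it may differ from the one shown on the slide. The function name and the parameter values are hypothetical.

```python
def dose_response(x, b0, b1, b2):
    """Hill-type dose-response curve (assumed form, not confirmed by the source).

    b0: maximum response at dose saturation
    b1: dose that produces half of the maximum response
    b2: slope parameter controlling how fast the response rises with dose
    """
    if x < 0 or b1 <= 0:
        raise ValueError("dose must be non-negative and b1 positive")
    return b0 * x ** b2 / (b1 ** b2 + x ** b2)

# Hypothetical parameters: maximum response 10, half-maximum dose 5, slope 2.
half_max = dose_response(5.0, 10.0, 5.0, 2.0)      # response at dose b1
near_max = dose_response(5000.0, 10.0, 5.0, 2.0)   # response at a very high dose
```

Under this form the response at dose b1 is exactly b0 / 2, and the response approaches b0 as the dose grows, which is consistent with the parameter descriptions on the slide.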