Learn about statistical methods like hypothesis testing and Bayesian analysis through ecological examples. Explore how to fit models and assess their accuracy in ecological statistics.
Ten Lectures on Ecological Statistics. John Bunge, jab18@cornell.edu, Department of Statistical Science, Cornell University
An example. Mean number of eggs laid by birds nesting in a forest: is it changing? Population: closed (no birth, death, or immigration/emigration). Population is the target of inference: from known (data) to unknown (population). Statistical epistemology: “All models are wrong, but some are useful” (George E. P. Box).
Collect sample – finite-population; infinite-population definitions. Sampling unit; sample size. n = 41. 1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0, 3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0 Probability model. Probability distribution controlled/determined by one or more parameters. Parameter: population. Statistic: data.
Model: the Poisson distribution. (Siméon-Denis Poisson, 1781-1840) Assigns probability to nonnegative integers One-parameter (λ>0) model λ = mean of distribution – mean, median, mode
How to fit model? • Statistical inference • Parameter estimation • Hypothesis testing • How to assess fit of model?
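Statistic vs. parameter. Estimator vs. estimate. Assumptions: data arise from a Poisson distribution; the notion of i.i.d. (independent and identically distributed). Maximum likelihood estimate (MLE): for the Poisson model this is the sample mean, here 106/41 ≈ 2.59. MLE is optimal in several senses: consistent, efficient, asymptotically normal.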
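A minimal Python sketch of the fitting step, assuming the 41 counts from the data slide and the i.i.d. Poisson model. The function names are illustrative and not from the lectures. It computes the MLE as the sample mean and checks that the Poisson log-likelihood is lower at nearby values of λ.

```python
import math

# Egg counts per nest from the data slide (n = 41).
eggs = [1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0,
        3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0]

def poisson_log_likelihood(lam, counts):
    """Log-likelihood of i.i.d. Poisson(lam) counts."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)

def poisson_mle(counts):
    """For the Poisson model the MLE of the mean is the sample mean."""
    return sum(counts) / len(counts)

lam_hat = poisson_mle(eggs)
print(f"n = {len(eggs)}, lambda_hat = {lam_hat:.2f}")

# The log-likelihood is highest at lam_hat compared with nearby values.
for lam in (lam_hat - 0.1, lam_hat, lam_hat + 0.1):
    print(f"lambda = {lam:.3f}, log-likelihood = {poisson_log_likelihood(lam, eggs):.3f}")
```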
Still not much use without error term: 2.59 +/- ?? Theory also provides standard error (SE). Program to find SE: (1) find theoretical variance of MLE; (2) find empirical approximation to (1); (3) take square root of (2). In our example, SE = 0.33. So we can write: 2.59 w/ SE of 0.33.
Still not much use. Notion of confidence interval: We are 95% confident that the true value lies in a certain range. Often: estimate +/- 1.96*SE ≈ estimate +/- 2*SE. Example: 2.59 +/- 2*0.33 = 2.59 +/- 0.66 = (1.93, 3.25); using 1.96 and unrounded values gives (1.940, 3.231). We are 95% confident that the true mean # of eggs per nest lies in this range.
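The interval can be checked numerically. This sketch assumes the empirical SE is the sample standard deviation divided by sqrt(n); that choice is an inference, made because it reproduces the 0.33 on the slide (the Poisson plug-in sqrt(lambda_hat/n) would give about 0.25). Variable names are illustrative.

```python
import math
import statistics

# Egg counts per nest from the data slide (n = 41).
eggs = [1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0,
        3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0]

n = len(eggs)
lam_hat = sum(eggs) / n

# Assumption: empirical SE = sample SD / sqrt(n).
se = statistics.stdev(eggs) / math.sqrt(n)

# 97.5th percentile of the standard normal (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

ci_low = lam_hat - z * se
ci_high = lam_hat + z * se
print(f"estimate = {lam_hat:.2f}, SE = {se:.2f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```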
Claim: Historical norm for mean # of eggs/nest is 3.6. Question: Does new data contradict this? Hypothesis testing. Conceptual framework: Null hypothesis H0: situation unchanged, no difference, “nothing is happening.” Alternative hypothesis HA (or H1): difference from H0. “Hypothesis of interest.”
How far is 2.59 from 3.6? Test statistic T measures distance of observed data from null hypothesis. H0: mean μ = μ0 = 3.6. HA: mean μ ≠ μ0 = 3.6. Two-sided alternative. T = (2.59 - 3.6)/0.33 = -3.06. What does -3.06 mean? Need null distribution of test statistic.
In hypothesis testing, assume H0 true throughout. Test evaluates whether data is consistent with H0. Two approaches (overall equivalent): Fixed-level: compare test statistic T to some cutoff value. If T > (<) cutoff, “reject” H0, otherwise “accept” or “fail to reject” H0. α = 0.05 “significant”; α = 0.01 “highly significant.” P-value: compute p-value of test = probability, given H0 true, of observing data “as or more extreme” than what actually occurred.
Standard normal (Gaussian) distribution T = -3.06 LH tail area = 0.001107 p-value = 2*0.001107 = 0.002213 < .01 < .05 Reject H0.
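A short sketch of the p-value computation, assuming T is the estimate minus the null value, divided by the SE, and that its null distribution is standard normal. It starts from the rounded slide values (2.59, 0.33, 3.6), so the result agrees with the slide only up to rounding.

```python
from statistics import NormalDist

# Rounded values taken from the slides.
estimate, se, mu0 = 2.59, 0.33, 3.6

# Test statistic: distance of the estimate from the null value, in SE units.
t_stat = (estimate - mu0) / se

# Two-sided p-value under a standard normal null distribution.
left_tail = NormalDist().cdf(t_stat)
p_value = 2 * min(left_tail, 1 - left_tail)

print(f"T = {t_stat:.2f}, two-sided p-value = {p_value:.4f}")
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```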
Bayesian (parametric) statistical analysis: the notion of a prior distribution. Prior represents investigator's prior belief or information, before performing experiment/collecting data, regarding value(s) of parameter(s). Subjective, objective Bayesianism. Elicitation of priors; noninformative or objective priors (Jeffreys’, reference). Modern Bayesian computation: MCMC etc.
Parametric Bayesian (point) estimation Poisson case: conjugate prior Γ(α,β) Bayesian program: (1) establish prior; (2) collect data; (3) update prior based on data, to obtain posterior.
Posterior mode = 2.65 ≠ 2.59. 95% highest posterior density (HPD) region; credible region ≠ 95% confidence interval.
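The conjugate update can be sketched in a few lines. With a Gamma(α, β) prior (shape α, rate β) and i.i.d. Poisson counts, the posterior is Gamma(α + sum of counts, β + n). The prior values below are hypothetical, chosen only for illustration, so the output is not expected to reproduce the posterior mode of 2.65 quoted above.

```python
# Egg counts per nest from the data slide (n = 41).
eggs = [1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0,
        3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0]

def gamma_poisson_update(alpha, beta, counts):
    """Posterior (shape, rate) for a Gamma(shape, rate) prior and Poisson data."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical prior hyperparameters (not from the lectures).
prior_alpha, prior_beta = 2.0, 1.0

post_alpha, post_beta = gamma_poisson_update(prior_alpha, prior_beta, eggs)
post_mean = post_alpha / post_beta
post_mode = (post_alpha - 1) / post_beta  # valid when post_alpha >= 1

print(f"posterior: Gamma(shape={post_alpha}, rate={post_beta})")
print(f"posterior mean = {post_mean:.3f}, posterior mode = {post_mode:.3f}")
```

A different prior shifts the posterior mode; with 41 observations the data dominate unless the prior is strongly informative.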
Bayesian hypothesis testing Assign prior probabilities to null & alternative hypotheses. Noninformative/objective: 0.5. Collect data; compute posterior probability that each hypothesis is true Bayes factor: roughly, likelihood of data under H0 / likelihood of data under HA. (more advanced topic)
Quantitative goodness-of-fit assessment for bird nesting data. Naïve chi-square test, 10 - 1 - 1 = 8 d.f.
Test accepts Poisson model @ level α = .01, rejects @ α = .05. Actually, there is a problem with the test: the chi-square distribution of the test statistic is asymptotic and requires cell counts >= 5 (but see literature); this fails in the example. Alternative GOF tests are possible.
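A sketch of the naive chi-square calculation. The choice of cells is an assumption: counts 0 through 8 plus one cell for 9 or more, which gives the 10 cells implied by 10 - 1 - 1 = 8 d.f. The critical values 15.507 (α = .05) and 20.090 (α = .01) are the standard chi-square table entries for 8 d.f.

```python
import math
from collections import Counter

# Egg counts per nest from the data slide (n = 41).
eggs = [1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0,
        3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0]

n = len(eggs)
lam_hat = sum(eggs) / n

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Assumed cells: 0, 1, ..., 8 and ">= 9".
tally = Counter(eggs)
observed = [tally.get(k, 0) for k in range(9)]
observed.append(sum(c for value, c in tally.items() if value >= 9))

expected = [n * poisson_pmf(k, lam_hat) for k in range(9)]
expected.append(n - sum(expected))  # remaining mass for the ">= 9" cell

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1  # one d.f. lost for the total, one for estimating lambda

CRIT_05, CRIT_01 = 15.507, 20.090  # chi-square critical values, 8 d.f.
print(f"chi-square = {chi2:.2f} on {df} d.f.")
print("reject at .05:", chi2 > CRIT_05, "| reject at .01:", chi2 > CRIT_01)
print(f"smallest expected cell count = {min(expected):.3f}")
```

The smallest expected counts are far below 5, which is the problem with the asymptotic approximation noted above.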
Will look @ classical nonparametrics in multiple-sample context
Permeability constants of human chorioamnion (a placental membrane) at term (X) and between 12 to 26 weeks gestational age (Y). Alternative of interest is greater permeability for term pregnancy.
Mann-Whitney-Wilcoxon (rank-sum) test One-sided p-value = 0.1272 T-test: H0: μ1 = μ2 HA: μ1 ≠ μ2 W-test: H0: Δ = 0 HA: Δ ≠ 0
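A minimal sketch of an exact one-sided rank-sum test by enumerating all rank assignments. The two small samples are hypothetical (the permeability measurements are not reproduced here), so the p-value is not the 0.1272 above. The function assumes no ties.

```python
from itertools import combinations

def rank_sum_exact(x, y):
    """Rank sum W of x in the pooled sample and exact P(W >= observed) under H0.

    Assumes no tied values. Under H0 every subset of ranks of size len(x)
    is equally likely to be the ranks of x.
    """
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w_obs = sum(rank[value] for value in x)
    all_sums = [sum(subset)
                for subset in combinations(range(1, len(pooled) + 1), len(x))]
    p_upper = sum(s >= w_obs for s in all_sums) / len(all_sums)
    return w_obs, p_upper

# Hypothetical data for illustration only.
x = [3.1, 4.2, 5.0, 6.3]
y = [1.0, 2.2, 3.5]

w_obs, p_upper = rank_sum_exact(x, y)
print(f"W = {w_obs}, one-sided exact p-value = {p_upper:.4f}")
```

Enumeration is only practical for small samples; for larger ones a normal approximation or a statistics package is the usual route.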
Multiple populations or groups ANOVA: H0: μ1 = μ2 = … = μk HA: not H0 Nonparametric version: Kruskal-Wallis test H0: no location shift HA: not H0 SAS: PROC NPAR1WAY Followups: multiple comparisons (for selection of the best) Simultaneous inference.
k confidence intervals or tests simultaneously: use α/k. Example: simultaneous 95% confidence intervals for 2 means (X & Y) in permeability example. k = 2; α = 0.05 (95% = 100*(1 - .05)% = 100*(1 - α)%). α/k = 0.05/2 = 0.025, so use 100*(1 - α/k)% = 97.5% confidence intervals for simultaneous 95% confidence. z = 2.2414: use est +/- 2.2414*SE.
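The multiplier 2.2414 can be recovered from the standard normal quantile function. A small sketch, assuming two-sided intervals and the α/k adjustment; the function name is illustrative.

```python
from statistics import NormalDist

def bonferroni_z(alpha, k):
    """Normal multiplier for k simultaneous two-sided intervals at overall level alpha."""
    per_interval_alpha = alpha / k
    return NormalDist().inv_cdf(1 - per_interval_alpha / 2)

z_single = bonferroni_z(0.05, 1)  # one interval: the familiar 1.96
z_pair = bonferroni_z(0.05, 2)    # two intervals: each at 97.5% confidence

print(f"k = 1: z = {z_single:.4f}")
print(f"k = 2: z = {z_pair:.4f}")
```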
Qualitative-qualitative: contingency tables. The 1894-96 Calcutta cholera study: chi-square (Χ2) statistic and p-value.
Quantitative-quantitative: Linear regression Basic model equation: Fitted equation: Basic hypothesis test:
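A least-squares sketch for the simple linear case. The notation is an assumption: model y = b0 + b1*x + error, fitted line with estimated intercept and slope, and the basic test of H0: slope = 0 via the t statistic slope/SE(slope). The data are hypothetical.

```python
import math

# Hypothetical (x, y) pairs for illustration; not data from the lectures.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance on n - 2 d.f., then the standard error of the slope.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, se_slope

intercept, slope, se_slope = fit_line(x, y)
t_stat = slope / se_slope  # compare with a t distribution on n - 2 d.f.

print(f"fitted: y = {intercept:.2f} + {slope:.2f} x")
print(f"t statistic for H0: slope = 0 is {t_stat:.1f}")
```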
Multiple regression Simultaneous inference Large p, small n problems
Estimating biodiversity: species richness from abundance data. ICoMM dataset ABR 0005 2005 01 07. Application of the 454 technology to active-but-rare biosphere in the oceans: large-scale basin-wide comparison in the Pacific Ocean (Hamasaki & Taniguchi)
True (population) vs. observed (sample) richness Sample size, bias, & standard error Asymptotic normality Parametric vs. coverage-based nonparametric estimation Bayesian methods Nonparametric maximum likelihood estimation Linear modeling of ratios of successive frequency counts Standard error computation The issue of τ
Two sets, A and B. Jaccard index: J = |A∩B| / |A∪B|. Sorensen index: S = 2|A∩B| / (|A| + |B|). Comparison of two populations. As stated: sample-based. If separate estimates of true, population |A|, |B| & |A∩B| are available, can estimate Jaccard (& Sorensen).
Bray-Curtis: Morisita-Horn: ai = abundance of ith species in population A bi = abundance of ith species in population B Based on existing sample (abundance) data
Sample-based (from the abundances ai, bi): J = 0.375; B-C = 0.351852. There exist adjusted versions of Jaccard & Sorensen which do (attempt to) account for unseen species: see SPADE; Chao/Chazdon/Colwell/Shen (2005).
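A sketch of the four sample-based indices using their common textbook definitions, which are assumptions here. In particular Bray-Curtis is computed as a dissimilarity, sum|ai - bi| / sum(ai + bi), which may or may not be the convention behind the 0.351852 above. The abundance vectors are hypothetical.

```python
# Hypothetical abundances of 6 species in two samples (not the lecture data).
a = [10, 0, 3, 5, 0, 2]
b = [4, 6, 0, 5, 1, 0]

def jaccard(a, b):
    """Shared species / species seen in either sample (presence/absence only)."""
    shared = sum(1 for ai, bi in zip(a, b) if ai > 0 and bi > 0)
    either = sum(1 for ai, bi in zip(a, b) if ai > 0 or bi > 0)
    return shared / either

def sorensen(a, b):
    """2 * shared species / (species in A + species in B)."""
    shared = sum(1 for ai, bi in zip(a, b) if ai > 0 and bi > 0)
    in_a = sum(1 for ai in a if ai > 0)
    in_b = sum(1 for bi in b if bi > 0)
    return 2 * shared / (in_a + in_b)

def bray_curtis_dissimilarity(a, b):
    """sum|ai - bi| / sum(ai + bi); 0 means identical abundances."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / (sum(a) + sum(b))

def morisita_horn(a, b):
    """2 * sum(ai*bi) / ((sum(ai^2)/Na^2 + sum(bi^2)/Nb^2) * Na * Nb)."""
    n_a, n_b = sum(a), sum(b)
    d_a = sum(ai ** 2 for ai in a) / n_a ** 2
    d_b = sum(bi ** 2 for bi in b) / n_b ** 2
    cross = sum(ai * bi for ai, bi in zip(a, b))
    return 2 * cross / ((d_a + d_b) * n_a * n_b)

j = jaccard(a, b)
s = sorensen(a, b)
bc = bray_curtis_dissimilarity(a, b)
mh = morisita_horn(a, b)
print(f"Jaccard = {j:.4f}, Sorensen = {s:.4f}")
print(f"Bray-Curtis dissimilarity = {bc:.4f}, Morisita-Horn = {mh:.4f}")
```

Jaccard and Sorensen use only presence/absence; Bray-Curtis and Morisita-Horn use the abundances themselves.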
Incidence data (vs. abundance). Example: 6 “occasions” (samples, etc.); 15 total species observed. Only presence/absence recorded (not abundance). Also capture-recapture, multiple recapture, etc. Models: M0, Mh, Mt, Mb, Mth, Mtb, Mtbh
Model subscripts: 0: null model, homogeneous capture probabilities; h: heterogeneous capture probabilities; t: time (occasion, sample) effect; b: “behavioral” effect.
Program DENSITY
M(0) - ML estimator (Otis et al. 1978)
  Population size (CI): 16.0 ± 1.4 (15.0 - 20.0)
  Capture probability (per occasion): 0.3438
  Capture probability (overall): 0.9201
  Npar: 2
  Log likelihood: -59.003
  AIC: 122.005
  AICc: 123.005
M(h) - 2-point finite mixture
  Population size (CI): 15.7 (15.0 - 1567.9)
  Capture probability (per occasion): 0.3508
  Capture probability (overall): 0.9251
  Npar: 4
  Log likelihood: -31.079
  AIC: 70.157
  AICc: 74.157
PCA creates a new coordinate system, i.e., a new set of variables. New coordinate system is orthogonal. New variables are linear combinations of original variables. New variables capture variance of original variables (data) in optimal way. Dimension reduction: often possible to use only a few (e.g., 2) principal components instead of original variables, but still capture most of the information (variance) in data.
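A pure-Python sketch of PCA restricted to two variables, where the eigenvalues and eigenvectors of the 2x2 covariance matrix have a closed form. The data points are hypothetical. Real analyses with more variables would use a linear algebra library; this only illustrates the ideas listed above: orthogonal new axes, linear combinations, and share of variance captured.

```python
import math

def pca_2d(points):
    """PCA for two variables via the closed-form eigen-decomposition of the
    2x2 sample covariance matrix. Returns eigenvalues (largest first) and
    the matching unit-length, mutually orthogonal axes."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)

    # Eigenvalues of [[sxx, sxy], [sxy, syy]] from trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    lam1, lam2 = trace / 2 + disc, trace / 2 - disc

    # Eigenvector for lam1; fall back to a coordinate axis if uncorrelated.
    if abs(sxy) > 1e-12:
        v1 = (sxy, lam1 - sxx)
    else:
        v1 = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # second axis is orthogonal to the first
    return (lam1, lam2), (v1, v2)

# Hypothetical two-variable data for illustration.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

(lam1, lam2), (v1, v2) = pca_2d(points)
explained = lam1 / (lam1 + lam2)
print(f"first axis = ({v1[0]:.3f}, {v1[1]:.3f})")
print(f"share of variance on first component = {explained:.3f}")
```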
Multidimensional scaling. Represents N data points in low-dimensional (2 or 3) space. Visualization technique. Data analysis (exploration, description) vs. statistical inference. Input is dissimilarity or distance matrix. Attempts to reproduce given distances with minimum stress. Metric: preserves distances as closely as possible. Nonmetric: preserves order of distances as closely as possible.
Distance matrix: Symmetric, zeroes on diagonal Distances may be derived from, e.g., community comparison metrics etc. Be careful of similarity vs. dissimilarity (distance)
Absolute MDS Non-metric MDS Invariant to rotation & symmetry