Ten Lectures on Ecological Statistics

Ten Lectures on Ecological Statistics. John Bunge (jab18@cornell.edu), Department of Statistical Science, Cornell University

An example. Mean number of eggs laid by birds nesting in a forest: changing? Population: closed (no birth, death or immigration/emigration). Population is the target of inference: from known (data) to unknown (population). Statistical epistemology: “All models are wrong, but some are useful” (George E. P. Box).

Collect sample: finite-population and infinite-population definitions. Sampling unit; sample size. n = 41. Data: 1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0, 3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0. Probability model: probability distribution controlled/determined by one or more parameters. Parameter: describes the population. Statistic: computed from the data.

Model: the Poisson distribution (Siméon-Denis Poisson, 1781-1840). Assigns probability to nonnegative integers. One-parameter (λ > 0) model. λ = mean of distribution; compare mean, median, mode.
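A minimal sketch of the Poisson probability function, P(X = k) = exp(-λ) λ^k / k!, added as an illustration. The function name `poisson_pmf` and the value λ = 2.59 are choices made for this example; the checks only confirm that the probabilities sum to about 1 and that the mean is about λ.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam > 0."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.59  # illustrative value; any lam > 0 works
probs = [poisson_pmf(k, lam) for k in range(60)]   # tail beyond 59 is negligible here
total = sum(probs)                                  # close to 1
mean = sum(k * p for k, p in enumerate(probs))      # close to lam
```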

How to fit model? • Statistical inference • Parameter estimation • Hypothesis testing • How to assess fit of model?

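Statistic vs. parameter. Estimator vs. estimate. Assumptions: data arise from a Poisson distribution; the notion of i.i.d. (independent and identically distributed) observations. Maximum likelihood estimate (MLE): for the Poisson model the MLE of λ is the sample mean, here 106/41 ≈ 2.59. MLE is optimal in several senses: consistent, efficient, asymptotically normal.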
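As a small illustration, the MLE for the egg data can be computed directly. The helper `log_likelihood` is a name introduced for this sketch; it can be used to check numerically that the sample mean gives a higher Poisson log-likelihood than nearby values of λ.

```python
import math

# The 41 egg counts from the sample slide.
eggs = [1, 2, 2, 0, 1, 2, 3, 5, 5, 0, 0, 4, 0, 1, 3, 3, 2, 4, 2, 7, 0,
        3, 1, 2, 3, 2, 8, 3, 3, 2, 7, 2, 4, 1, 1, 6, 5, 0, 5, 1, 0]

def log_likelihood(lam, data):
    """Poisson log-likelihood of i.i.d. counts."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

n = len(eggs)              # 41 nests
lam_hat = sum(eggs) / n    # Poisson MLE = sample mean
print(n, round(lam_hat, 2))  # prints: 41 2.59
```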

Still not much use without an error term: 2.59 +/- ?? Theory also provides the standard error (SE). Program to find SE: (1) find theoretical variance of MLE; (2) find empirical approximation to (1); (3) take square root of (2). In our example, SE = 0.33. So we can write: 2.59 with SE of 0.33.
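A sketch of the SE program, under an assumption about step (2): the transcript does not say which empirical approximation was used. Two common choices are shown. Plugging the MLE into the Poisson variance λ/n gives about 0.25, while using the sample variance s²/n gives about 0.33, which agrees with the value quoted on the slide.

```python
import math

# Frequency table of the 41 egg counts listed earlier: value -> number of nests.
freq = {0: 7, 1: 7, 2: 9, 3: 7, 4: 3, 5: 4, 6: 1, 7: 2, 8: 1}

n = sum(freq.values())                                   # 41
mean = sum(x * c for x, c in freq.items()) / n           # MLE, about 2.59
s2 = sum(c * (x - mean) ** 2 for x, c in freq.items()) / (n - 1)  # sample variance

se_poisson = math.sqrt(mean / n)   # plug-in for Var(MLE) = lambda / n
se_sample = math.sqrt(s2 / n)      # based on the sample variance
print(round(se_poisson, 2), round(se_sample, 2))  # prints: 0.25 0.33
```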

Still not much use. Notion of confidence interval: we are 95% confident that the true value lies in a certain range. Often: estimate +/- 1.96*SE ≈ estimate +/- 2*SE. Example: 2.59 +/- 2*0.33 = 2.59 +/- 0.66 = (1.93, 3.25); with 1.96 and unrounded values, (1.940, 3.231). We are 95% confident that the true mean # of eggs per nest lies in this range.
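The interval arithmetic can be sketched as follows. The inputs are the unrounded mean 106/41 and an SE of 0.3294 (the sample-variance SE of the egg data, rounded to four places); both are taken from the data rather than from the slide's rounded figures.

```python
from statistics import NormalDist

estimate, se = 106 / 41, 0.3294    # mean and SE from the egg data (SE rounded)
z = NormalDist().inv_cdf(0.975)    # two-sided 95% critical value, about 1.96
lower, upper = estimate - z * se, estimate + z * se
print(round(z, 2), round(lower, 3), round(upper, 3))  # prints: 1.96 1.94 3.231
```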

Claim: historical norm for mean # of eggs/nest is 3.6. Question: do the new data contradict this? Hypothesis testing. Conceptual framework: null hypothesis H0: situation unchanged, no difference, “nothing is happening.” Alternative hypothesis HA (or H1): difference from H0. “Hypothesis of interest.”

How far is 2.59 from 3.6? Test statistic T measures distance of observed data from null hypothesis. H0: mean μ = μ0 = 3.6. HA: mean μ ≠ μ0 = 3.6 (two-sided alternative). T = (estimate - μ0)/SE = (2.59 - 3.6)/0.33 ≈ -3.06. What does -3.06 mean? Need null distribution of test statistic.

In hypothesis testing, assume H0 true throughout. Test evaluates whether data is consistent with H0. Two approaches (overall equivalent): Fixed-level: compare test statistic T to some cutoff value. If T > (<) cutoff, “reject” H0, otherwise “accept” or “fail to reject” H0. α = 0.05 “significant”; α = 0.01 “highly significant.” P-value: compute p-value of test = probability, given H0 true, of observing data “as or more extreme” than what actually occurred.

Standard normal (Gaussian) distribution. T = -3.06. LH tail area = 0.001107. p-value = 2*0.001107 ≈ 0.002213 < .01 < .05. Reject H0.
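A sketch of the two-sided test using the rounded values quoted on the slides (estimate 2.59, SE 0.33, μ0 = 3.6) and a standard normal null distribution. Variable names are chosen for this example.

```python
from statistics import NormalDist

estimate, se, mu0 = 2.59, 0.33, 3.6   # rounded values quoted on the slides

t_stat = (estimate - mu0) / se               # distance from H0 in SE units
left_tail = NormalDist().cdf(t_stat)         # P(Z <= T) under H0
p_value = 2 * min(left_tail, 1 - left_tail)  # two-sided p-value
reject_at_01 = p_value < 0.01                # fixed-level decision at alpha = 0.01
print(round(t_stat, 2), round(p_value, 4))   # prints: -3.06 0.0022
```

The fixed-level and p-value approaches agree here: the p-value is below both 0.05 and 0.01, so H0 is rejected at either level.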

Foregoing: frequentist, parametric analysis.

Bayesian (parametric) statistical analysis: the notion of a prior distribution. Prior represents investigator's prior belief or information, before performing experiment/collecting data, regarding value(s) of parameter(s). Subjective, objective Bayesianism. Elicitation of priors; noninformative or objective priors: Jeffreys’, reference. Modern Bayesian computation: MCMC etc.

Parametric Bayesian (point) estimation. Poisson case: conjugate prior Γ(α,β). Bayesian program: (1) establish prior; (2) collect data; (3) update prior based on data, to obtain posterior.
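A minimal sketch of the conjugate update, assuming the Γ(α, β) prior is parametrized by shape α and rate β. In that case Poisson data with total count Σx over n observations give a Γ(α + Σx, β + n) posterior. The prior values below are hypothetical and are not the prior used in the lecture, which the transcript does not state.

```python
def gamma_poisson_update(alpha, beta, total, n):
    """Gamma(alpha, beta) prior (shape, rate) + Poisson data -> Gamma posterior."""
    return alpha + total, beta + n

alpha0, beta0 = 2.0, 1.0    # hypothetical prior, NOT the one used in the lecture
total, n = 106, 41          # sum of the egg counts and number of nests

alpha1, beta1 = gamma_poisson_update(alpha0, beta0, total, n)
post_mean = alpha1 / beta1          # posterior mean
post_mode = (alpha1 - 1) / beta1    # posterior mode (valid when alpha1 > 1)
```

With a different prior the posterior mode shifts accordingly, which is why a Bayesian point estimate need not equal the MLE.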

Posterior mode = 2.65 ≠ 2.59 (the MLE). 95% highest posterior density (HPD) region; credible region ≠ 95% confidence interval.

Bayesian hypothesis testing Assign prior probabilities to null & alternative hypotheses. Noninformative/objective: 0.5. Collect data; compute posterior probability that each hypothesis is true Bayes factor: roughly, likelihood of data under H0 / likelihood of data under HA. (more advanced topic)

Quantitative goodness-of-fit assessment for bird nesting data. Naïve chi-square test, 10 - 1 - 1 = 8 d.f.

Test accepts Poisson model @ level α = .01, rejects @ α = .05. Actual problem with test: chi-square distribution of test statistic is asymptotic; requires (expected) cell counts >= 5 (but see literature); this fails in the example. Alternative GOF tests are possible.
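A sketch of the naive chi-square test, under an assumption about the cells: the 10 cells are taken to be the counts 0 through 8 plus a pooled "9 or more" cell, which is consistent with 10 - 1 - 1 = 8 d.f. but is not stated in the transcript. The critical values are the standard upper 5% and 1% points of the chi-square distribution with 8 d.f.

```python
import math

freq = {0: 7, 1: 7, 2: 9, 3: 7, 4: 3, 5: 4, 6: 1, 7: 2, 8: 1}  # observed egg counts
n = sum(freq.values())
lam_hat = sum(x * c for x, c in freq.items()) / n               # Poisson MLE

def pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Assumed cells: 0, 1, ..., 8 and a pooled "9 or more" cell (10 cells in total).
expected = [n * pmf(k, lam_hat) for k in range(9)]
expected.append(n - sum(expected))
observed = [freq.get(k, 0) for k in range(9)] + [0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1                 # cells - 1 - estimated parameters
small_cells = sum(e < 5 for e in expected) # cells violating the ">= 5" rule

CRIT_05, CRIT_01 = 15.507, 20.090          # chi-square(8) upper 5% and 1% points
print(round(chi2, 2), df, small_cells)
```

Under this cell choice the statistic falls between the two critical values, and most expected counts are below 5, which illustrates why the asymptotic chi-square approximation is doubtful here.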

Will look @ classical nonparametrics in multiple-sample context

Permeability constants of human chorioamnion (a placental membrane) at term (X) and between 12 and 26 weeks gestational age (Y). Alternative of interest is greater permeability for term pregnancy.

Mann-Whitney-Wilcoxon (rank-sum) test: one-sided p-value = 0.1272. T-test: H0: μ1 = μ2 vs. HA: μ1 ≠ μ2. W-test: H0: Δ = 0 vs. HA: Δ ≠ 0.
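A sketch of an exact one-sided rank-sum test by full enumeration, assuming no ties and small samples. The function name `rank_sum_lower_p` and the data are made up for illustration; they are not the permeability measurements from the lecture, which the transcript does not include.

```python
from itertools import combinations

def rank_sum_lower_p(x, y):
    """Exact one-sided p-value P(W <= w_obs), where W = sum of the ranks of y.

    Assumes no ties; a small p-value supports 'y tends to be smaller than x'.
    """
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..N (no ties)
    w_obs = sum(rank[v] for v in y)
    # Under H0 every set of len(y) ranks is equally likely.
    all_sums = [sum(c) for c in combinations(range(1, len(pooled) + 1), len(y))]
    return w_obs, sum(s <= w_obs for s in all_sums) / len(all_sums)

# Made-up illustration data, NOT the permeability measurements from the lecture.
x = [1.1, 2.3, 3.5, 4.2]
y = [0.5, 0.9, 1.7]
w, p = rank_sum_lower_p(x, y)
print(w, round(p, 4))  # prints: 7 0.0571
```

Enumeration is only feasible for small samples; statistical packages use recursions or a normal approximation for larger ones.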

Multiple populations or groups. ANOVA: H0: μ1 = μ2 = … = μk; HA: not H0. Nonparametric version: Kruskal-Wallis test; H0: no location shift; HA: not H0. SAS: PROC NPAR1WAY. Follow-ups: multiple comparisons (for selection of the best); simultaneous inference.

k confidence intervals or tests simultaneously: use α/k. Example: simultaneous 95% confidence intervals for 2 means (X & Y) in permeability example. k = 2; α = 0.05 (95% = 100*(1 - 0.05)% = 100*(1 - α)%). α/k = 0.05/2 = 0.025, so use 100*(1 - α/k)% = 97.5% confidence intervals for simultaneous 95% confidence. z = 2.2414: use est +/- 2.2414*SE.
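The critical value can be reproduced with a short sketch. The function name `bonferroni_z` is introduced for this example; it splits α over k intervals and returns the two-sided normal critical value for each one.

```python
from statistics import NormalDist

def bonferroni_z(alpha: float, k: int) -> float:
    """Two-sided normal critical value for k simultaneous intervals at overall level alpha."""
    per_interval_alpha = alpha / k
    return NormalDist().inv_cdf(1 - per_interval_alpha / 2)

z_single = bonferroni_z(0.05, 1)   # about 1.96
z_pair = bonferroni_z(0.05, 2)     # about 2.2414, as on the slide
```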

Qualitative-qualitative: contingency tables. The 1894-96 Calcutta cholera study. Chi-square (Χ²) statistic and p-value.

Quantitative-quantitative: linear regression. Basic model equation: y = β0 + β1*x + ε. Fitted equation: ŷ = b0 + b1*x. Basic hypothesis test: H0: β1 = 0 vs. HA: β1 ≠ 0.

Multiple regression Simultaneous inference Large p, small n problems

Estimating biodiversity: species richness from abundance data. ICoMM dataset ABR 0005 2005 01 07: Application of the 454 technology to active-but-rare biosphere in the oceans: large-scale basin-wide comparison in the Pacific Ocean (Hamasaki & Taniguchi).

Analysis output from CatchAll

True (population) vs. observed (sample) richness. Sample size, bias, & standard error. Asymptotic normality. Parametric vs. coverage-based nonparametric estimation. Bayesian methods. Nonparametric maximum likelihood estimation. Linear modeling of ratios of successive frequency counts. Standard error computation. The issue of τ.

Comparison of two populations A and B. Jaccard index: J = |A∩B| / |A∪B|. Sorensen index: S = 2|A∩B| / (|A| + |B|). As stated: sample-based. If separate estimates of the true, population |A|, |B| & |A∩B| are available, can estimate population Jaccard (& Sorensen).
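A small sketch of the two sample-based indices using their standard set definitions, J = |A∩B| / |A∪B| and S = 2|A∩B| / (|A| + |B|). The species lists are hypothetical and serve only to show the arithmetic.

```python
def jaccard(a: set, b: set) -> float:
    """Shared species divided by all species seen in either sample."""
    return len(a & b) / len(a | b)

def sorensen(a: set, b: set) -> float:
    """Twice the shared species divided by the sum of the two richness counts."""
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical observed species lists for two samples.
site_a = {"sp1", "sp2", "sp3", "sp4"}
site_b = {"sp3", "sp4", "sp5", "sp6", "sp7"}

j = jaccard(site_a, site_b)    # 2 shared / 7 total
s = sorensen(site_a, site_b)   # 2*2 / (4 + 5)
```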

Bray-Curtis and Morisita-Horn indices: a_i = abundance of ith species in population A; b_i = abundance of ith species in population B. Based on existing sample (abundance) data.
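A sketch of both abundance-based indices, under stated assumptions: Bray-Curtis is written here in its dissimilarity form, Σ|a_i - b_i| / Σ(a_i + b_i), and Morisita-Horn in its usual similarity form. The transcript does not show which forms the lecture used, and the abundance vectors below are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

def morisita_horn(a, b):
    """Morisita-Horn similarity: 2*sum(a_i*b_i) / ((d_a + d_b) * N_a * N_b)."""
    na, nb = sum(a), sum(b)
    da = sum(x * x for x in a) / na**2   # sum of squared relative abundances in A
    db = sum(y * y for y in b) / nb**2   # same for B
    return 2 * sum(x * y for x, y in zip(a, b)) / ((da + db) * na * nb)

# Hypothetical abundances of four species in two samples.
a = [10, 5, 0, 3]
b = [6, 5, 4, 0]
bc = bray_curtis(a, b)      # 11 / 33
mh = morisita_horn(a, b)
```

Identical samples give a Bray-Curtis dissimilarity of 0 and a Morisita-Horn similarity of 1, which is a quick way to check which orientation an index has.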

Sample-based (from the table of a_i, b_i): J = 0.375; B-C = 0.351852. There exist adjusted versions of Jaccard & Sorensen which do (attempt to) account for unseen species: see SPADE; Chao/Chazdon/Colwell/Shen (2005).

Incidence data (vs. abundance). Example: 6 “occasions” (samples, etc.); 15 total species observed. Only presence/absence recorded (not abundance). Also capture-recapture, multiple recapture, etc. Models: M0, Mh, Mt, Mb, Mth, Mtb, Mtbh.

Program DENSITY output. Model subscripts: 0: null model, homogeneous capture probabilities; h: heterogeneous capture probabilities; t: time (occasion, sample) effect; b: “behavioral” effect.

M(0) - ML estimator (Otis et al. 1978)
  Population size (CI)   16.0 ± 1.4 (15.0 - 20.0)
  Capture probability    0.3438 (per occasion)
  Capture probability    0.9201 (overall)
  Npar                   2
  Log likelihood         -59.003
  AIC                    122.005
  AICc                   123.005

M(h) - 2-point finite mixture
  Population size (CI)   15.7 (15.0 - 1567.9)
  Capture probability    0.3508 (per occasion)
  Capture probability    0.9251 (overall)
  Npar                   4
  Log likelihood         -31.079
  AIC                    70.157
  AICc                   74.157

Multivariate data: Principal components analysis

PCA creates a new coordinate system, i.e., a new set of variables. New coordinate system is orthogonal. New variables are linear combinations of original variables. New variables capture variance of original variables (data) in optimal way. Dimension reduction: often possible to use only a few (e.g., 2) principal components instead of original variables, but still capture most of the information (variance) in data.

Multidimensional scaling. Represents N data points in low-dimensional (2 or 3) space. Visualization technique. Data analysis (exploration, description) vs. statistical inference. Input is dissimilarity or distance matrix. Attempts to reproduce given distances with minimum stress. Metric: preserves distances as closely as possible. Nonmetric: preserves order of distances as closely as possible.

Distance matrix: Symmetric, zeroes on diagonal Distances may be derived from, e.g., community comparison metrics etc. Be careful of similarity vs. dissimilarity (distance)

Absolute MDS

Non-metric MDS

Absolute MDS vs. non-metric MDS: solutions are invariant to rotation & reflection (symmetry).
