Adam Butler Biomathematics & Statistics Scotland (BioSS)

Statistical analysis of species-level data on distribution and traits Adam Butler Biomathematics & Statistics Scotland (BioSS) Tartu University, October 2007

The ALARM project • Assessing Large-scale risks to biodiversity with tested methods • Project of the 6th framework programme of the European Union Key objectives • Develop an integrated risk assessment for biodiversity in terrestrial and freshwater ecosystems at the European scale • Focus on four key pressures – climate change, invasive species, chemical pollution, pollinator loss – and their interactions • Contribute to the dissemination of scientific knowledge and to the development of evidence-based policy

BioSS “BioSS undertakes research, consultancy and training in mathematics and statistics as applied to agriculture, the environment, food and health…” • It is a publicly funded organisation based in Scotland, employing approximately 40 people • BioSS is a partner in ALARM, with three staff currently working on the project: Glenn Marion, Stijn Bierman, Adam Butler • Our role involves a mix of research/consultancy and training

Research-consultancy BioSS have two key themes of work within ALARM… • Statistical analysis of species-level data on distribution & traits • Quantifying uncertainty in complex mechanistic models …this talk focuses mainly on the first of these

Species level data • Species atlas data: presence/absence of species for each cell on a regular grid (Florkart: Germany, vascular plants; National Biodiversity Network: UK) • Traits data: physiological & genetic traits (BiolFlor: Germany, vascular plants) • Invasive success: date of arrival, establishment or naturalisation; local or national • Environmental data: land use, climate; future projections under different scenarios of socio-economic change

These data are observational, rather than experimental… • Advantages • Available at large spatial scales for a wide range of species • A basis for predicting impacts of long-term environmental change • Limitations • Can be used to infer correlative relationships, but not causal ones • Analysis must be model-based rather than design-based

Our work in this area • Spatial distribution of individual species • Spread of invasive species across time and space • Spatial distribution of trait compositions • Trait-based prediction of invasive success

Spatial distribution of individual species Galium pumilum in Germany • Atlas data: presence/absence • Data derive from individual records, but much heterogeneity in recording effort • Data aggregated over time & space, with the aim of reducing or removing this heterogeneity Contact: Stijn Bierman (stijn@bioss.ac.uk) Reference: Bierman, S.M., Wilson, I.J., Elston, D.A., Marion, G., Butler, A. & Kühn, I. (in preparation) Bayesian image restoration techniques to analyze species atlas data with spatially varying non-detection probabilities.

Logistic regression • We are interested in exploring relationships between environment & distribution • Data on presence/absence are binary: y_i = 1 if present, y_i = 0 if absent • So we need to use logistic regression, rather than standard linear regression • Logistic regression: y_i ~ Binomial(p_i, 1), log(p_i / (1 – p_i)) = a + b x_i, where x_i denotes the explanatory variable(s), e.g. mean annual temperature 1960-1990 (°C)
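As an illustration of the model on this slide, the following minimal sketch fits log(p / (1 – p)) = a + b x by Newton-Raphson on simulated data. The data, the "true" values a = -4 and b = 0.5, and the function names are all hypothetical; in practice a statistics package would normally be used instead.

```python
import math
import random


def sigmoid(t):
    """Inverse logit, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(x, y, n_iter=25):
    """Maximum-likelihood fit of log(p/(1-p)) = a + b*x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient (g) and information matrix (h) of the Bernoulli log-likelihood.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g_a += yi - p
            g_b += (yi - p) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        # Newton step: solve the 2x2 system h * step = g.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Simulated cells (hypothetical): temperature in deg C, presence drawn from the model.
rng = random.Random(1)
temps = [rng.uniform(4.0, 12.0) for _ in range(2000)]
presence = [1 if rng.random() < sigmoid(-4.0 + 0.5 * t) else 0 for t in temps]

a_hat, b_hat = fit_logistic(temps, presence)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

The estimates should land close to the simulated values, with the intercept less precisely determined than the slope because temperature is not centred.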

The “climate envelope” approach • Use logistic regression to infer the current relationship between climate and presence/absence, and assume that this relationship will continue to hold in future • Use climate predictions and the regression model to estimate the probability of presence in future years • Predict presence if Probability(presence) > threshold • Ignores dynamics of spatial spread, ignores adaptation, and assumes that climate is the primary determinant of range
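The thresholding step can be sketched as follows. The coefficients a and b, the temperatures, the uniform 2 °C warming and the threshold of 0.5 are all hypothetical values chosen for illustration, not results from the talk.

```python
import math


def prob_presence(temp_c, a, b):
    """Probability of presence from a fitted logistic regression."""
    return 1.0 / (1.0 + math.exp(-(a + b * temp_c)))


def predict_presence(temps_c, a, b, threshold=0.5):
    """Predict presence wherever the probability exceeds the threshold."""
    return [prob_presence(t, a, b) > threshold for t in temps_c]


a, b = -4.0, 0.5                      # hypothetical fitted coefficients
current = [5.0, 7.0, 9.0]             # hypothetical current temperatures (deg C)
future = [t + 2.0 for t in current]   # hypothetical scenario: uniform +2 deg C

print(predict_presence(current, a, b))  # [False, False, True]
print(predict_presence(future, a, b))   # [False, True, True]
```

Note that the same coefficients are applied to both periods, which is exactly the assumption that the current climate relationship continues to hold.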

Residual spatial autocorrelation • Standard logistic regression ignores the effects of residual spatial autocorrelation – nearby cells will tend to have a similar response (either presence/absence), and this similarity persists even after accounting for known environmental variables • Possible sources: • Distribution depends on a variable about which we have no data • Species not in equilibrium, so range is expanding or contracting • Spatial autocorrelation leads us to underestimate uncertainty, and can lead to bias in estimates of environmental effects

The autologistic model • There are many methods to deal with spatial autocorrelation – one of the simplest is to use an autologistic model: • y_i ~ Binomial(p_i, 1) • log(p_i / (1 – p_i)) = a + b x_i + c (y_A + y_B + y_L + y_R) / 4 • where A, B, L and R are the cells immediately above, below, to the left and to the right of cell i • Different neighborhoods & weights can be used • The parameter c measures the strength of autocorrelation [Diagram: cell i with its neighbours A (above), L (left), R (right) and B (below)]
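A minimal sketch of the autologistic linear predictor on a small grid is given below. The grid, the covariate values and the parameter values are hypothetical, and treating off-grid neighbours as absences (0) is an assumption made here for simplicity; other edge treatments are possible.

```python
import math


def neighbour_mean(grid, r, c):
    """(y_A + y_B + y_L + y_R) / 4; off-grid neighbours count as 0 (an assumption)."""
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n_rows and 0 <= cc < n_cols:
            total += grid[rr][cc]
    return total / 4.0


def autologistic_prob(grid, x, r, c, a, b, c_auto):
    """p_i from log(p_i / (1 - p_i)) = a + b*x_i + c_auto * neighbour mean."""
    eta = a + b * x[r][c] + c_auto * neighbour_mean(grid, r, c)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical 3x3 presence/absence grid and a constant covariate.
y = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
x = [[8.0] * 3 for _ in range(3)]

p_with = autologistic_prob(y, x, 1, 1, a=-4.0, b=0.5, c_auto=2.0)
p_without = autologistic_prob(y, x, 1, 1, a=-4.0, b=0.5, c_auto=0.0)
print(round(p_with, 3), round(p_without, 3))
```

With all four neighbours occupied, a positive value of the autocorrelation parameter raises the probability of presence in the centre cell relative to the plain logistic model.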

Non-detection If the species is recorded present (yi = 1), we can be pretty confident it is actually present If the species is recorded absent (yi = 0) then it could either be genuinely absent (true absence) or just undetected (false absence) If we have an additional source of reliable data on detection, then we can modify the analysis to correct for the effects of false absences

Proxy data for detection effort [Map: proxy for detection effort, scale from lowest to highest] • Edge effects?

Dealing with non-detection • Introduce a new variable: z_i = 1 if the species is actually present in cell i, 0 otherwise • Let p_i = probability of non-detection = Prob(y_i = 0 given z_i = 1) • Assume a relationship between p_i and our proxy variable, and estimate the value of p_i • We can thereby estimate Prob(z_i = 1 given y_i = 0)
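The last step is an application of Bayes' rule. The sketch below assumes there are no false presences (a recorded presence always means true presence) and that a prior probability of presence is available, for example from the regression model; the function and argument names are hypothetical.

```python
def prob_present_given_absent_record(prior_presence, p_nondetect):
    """Prob(z = 1 given y = 0) by Bayes' rule, assuming no false presences.

    prior_presence: Prob(z = 1) before seeing the record
    p_nondetect:    Prob(y = 0 given z = 1)
    Undefined (division by zero) when prior_presence = 1 and p_nondetect = 0.
    """
    numerator = p_nondetect * prior_presence
    # Prob(y = 0) = Prob(undetected and present) + Prob(truly absent)
    return numerator / (numerator + (1.0 - prior_presence))


# Hypothetical cell: 50% prior probability of presence, 20% chance of being missed.
print(prob_present_given_absent_record(0.5, 0.2))
```

When detection is perfect the function returns 0, and when the species is never detected a recorded absence carries no information, so the prior is returned unchanged.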

Estimated current distribution of Galium pumilum [Map: probability of presence]

Combining things • The model for non-detection can be combined with the autologistic model, to create an approach that accounts for spatial dependence, false absences and non-normality • This results in a complicated model, so we cannot estimate the parameters using standard approaches such as least squares or maximum likelihood – instead, we use Bayesian methods…

Bayesian inference • Statistical analyses involve modelling and inference • The two major approaches to statistical inference are classical (frequentist) and Bayesian • There are deep philosophical differences between them, but they often yield similar answers in practice • It is usually best to concentrate on finding an appropriate model first, before worrying about which method of inference to use

Before the advent of modern computing and Markov chain Monte Carlo, Bayesian methods were difficult to use • Today, it is possible to fit complicated and realistic models using Bayesian methods, many of which would be difficult or impossible to fit using frequentist methods • The development of a powerful piece of software (WinBUGS) has opened up the methods to non-statisticians • This largely explains why they have become very popular in ecology and other disciplines

Spread of invasive species • For invasive species we may also have data on arrival, establishment or naturalization, at either national or local levels • We can, with care, use such data to draw inferences about the spatio-temporal spread of a species across a landscape • Allows us to assess the risks associated with future expansion • Key issue: environmental heterogeneity (land use & climate) Contact: Glenn Marion (glenn@bioss.ac.uk) Reference: Cook, A., Marion, G., Butler, A. and Gibson, G. (2007) Bayesian inference for the spatio-temporal invasion of alien species. Bulletin of Mathematical Biology, 69(6), 2005-2025.

Spread of Giant Hogweed (Heracleum mantegazzianum) in the UK • Data: National Biodiversity Network [Map: recorded distribution by 1910]

Statistical framework 1. Dispersal rate from cell i to cell j depends only on the distance d between them 2. Arrival rate = Sum of dispersal rates from all currently occupied cells j 3. Colonization rate = Arrival rate * Suitability • It is assumed that there is no decolonization

Model for dispersal • Power law function: dispersal rate proportional to d^(-2α), where d = distance and α = decay parameter • Truncated at 150 km + background rate [Plots: α = 1/2; α = 1/10]
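A rough sketch of a truncated power-law kernel and the resulting arrival rate follows. The exact kernel form, the symbol alpha, the way the background rate is added (once per source cell) and the coordinates are assumptions made for illustration; the cited paper should be consulted for the model actually used.

```python
import math


def dispersal_rate(d_km, alpha, cutoff_km=150.0, background=0.0):
    """Assumed kernel: d^(-2*alpha) up to cutoff_km, zero beyond, plus a background rate."""
    kernel = d_km ** (-2.0 * alpha) if 0.0 < d_km <= cutoff_km else 0.0
    return kernel + background


def arrival_rate(target, occupied, alpha, cutoff_km=150.0, background=0.0):
    """Sum of dispersal rates into `target` from all currently occupied cells."""
    total = 0.0
    for (cx, cy) in occupied:
        d = math.hypot(target[0] - cx, target[1] - cy)
        total += dispersal_rate(d, alpha, cutoff_km, background)
    return total


# Hypothetical cell centres in km; the third source lies beyond the 150 km cutoff.
target = (0.0, 0.0)
occupied = [(10.0, 0.0), (0.0, 20.0), (300.0, 0.0)]

rate = arrival_rate(target, occupied, alpha=0.5)
print(rate)
```

With alpha = 1/2 the kernel reduces to 1/d, so the two sources within range contribute 1/10 and 1/20.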

Suitability: homogeneous landscape • For a particular grid cell i: Suitability_i = ρ * proportion of cell i that is land • ρ = unknown parameter • Interpretation of ρ is quite ambiguous

Suitability: heterogeneous landscape • Suitability_i = (Σ_k ρ_k * Land_ik) * exp(β * Temperature_i + γ * Altitude_i) • Land_ik = proportion of cell i with land use k • ρ_1,…,ρ_10, β, γ: unknown parameters • Land use, climate & altitude currently treated as constant over time • Land use classes: 1 sea, 2 coastal, 3 arable, 4 broadleaf, 5 built, 6 conifer, 7 improved grassland, 8 open water, 9 semi-natural, 10 upland
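The suitability formula for a heterogeneous landscape can be written as a short function. The parameter names (rho, beta, gamma) and all numerical values below are hypothetical; in the analysis these parameters are unknown and estimated from the data.

```python
import math

LAND_USES = ["sea", "coastal", "arable", "broadleaf", "built", "conifer",
             "improved grassland", "open water", "semi-natural", "upland"]


def suitability(land_props, rho, beta, gamma, temperature, altitude):
    """(sum_k rho_k * Land_ik) * exp(beta * Temperature_i + gamma * Altitude_i)."""
    land_term = sum(r * p for r, p in zip(rho, land_props))
    return land_term * math.exp(beta * temperature + gamma * altitude)


# Hypothetical cell: 60% arable and 40% built, 10 deg C, 200 m altitude.
land_props = [0.0, 0.0, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
rho = [1.0] * len(LAND_USES)   # hypothetical: every land use weighted equally

s = suitability(land_props, rho, beta=0.1, gamma=-0.001,
                temperature=10.0, altitude=200.0)
print(round(s, 4))
```

Setting a land-use weight to zero (for example for sea) makes cells dominated by that class effectively uncolonizable, whatever their climate.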

Inference • The parameters and colonisation history are unknown • We adopt a Bayesian approach, which involves treating both sets of quantities as random • Plausible values are simulated using a computer-intensive algorithm known as Markov chain Monte Carlo
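To give a flavour of Markov chain Monte Carlo, here is a minimal random-walk Metropolis sampler for a much simpler problem than the invasion model: the posterior of a single presence probability p, given k presences in n cells and a flat prior. The problem, the data (k = 30, n = 100) and the tuning values are hypothetical, and the sampler used in the work described here also has to simulate the unknown colonisation history.

```python
import math
import random


def log_posterior(p, k, n):
    """Log posterior (up to a constant) for a binomial likelihood and flat prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis: propose p + Normal(0, step), accept or reject."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for _ in range(n_samples):
        proposal = p + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio); 1 - U avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            p, current = proposal, candidate
        samples.append(p)
    return samples


samples = metropolis(k=30, n=100)
kept = samples[1000:]                 # discard an initial burn-in period
post_mean = sum(kept) / len(kept)
print(round(post_mean, 3))
```

For this toy problem the exact posterior is Beta(31, 71), with mean 31/102, so the simulated mean can be checked against a known answer.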

[Maps: posterior mean colonization suitability; posterior mean colonization probability (10 year prediction)]

Cumulative rate of colonization [Plots: homogeneous landscape; with landscape heterogeneity]

Spatial distribution of functional traits • Rather than the distribution of a particular species, we might be interested in spatial patterns in the properties of species groups or whole ecosystems • In particular, we might be interested in the proportion of species having a particular qualitative trait – e.g. main pollen vector (insect, selfing, wind) Contact: Stijn Bierman (stijn@bioss.ac.uk) Reference: Kühn, I., Bierman, S.M., Durka, W. & Klotz, S. (2006) Relating geographical variation in pollination types to environmental and spatial factors using novel statistical methods. New Phytologist, 172(1), 127-139.

Main pollen vectors in German flora [Maps: % wind, % insect, % selfing, wind speed, altitude]

Compositional data • Data on trait proportions are compositional – they must sum to one • Standard methods (linear regression, principal components etc.) are invalid • Instead of analysing the raw data, we can analyse the log-ratios, e.g. • log(selfing / wind) = log(selfing) – log(wind) • log(insect / wind) = log(insect) – log(wind)
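The log-ratio transform and its inverse can be sketched as below, using wind as the reference category as on the slide. The proportions are hypothetical, and the transform is undefined if any proportion is exactly zero.

```python
import math


def log_ratios(insect, selfing, wind):
    """Return (log(selfing / wind), log(insect / wind))."""
    return math.log(selfing / wind), math.log(insect / wind)


def inverse_log_ratios(lr_selfing, lr_insect):
    """Map two log-ratios back to proportions (insect, selfing, wind) summing to one."""
    e_selfing, e_insect = math.exp(lr_selfing), math.exp(lr_insect)
    total = e_selfing + e_insect + 1.0
    return e_insect / total, e_selfing / total, 1.0 / total


# Hypothetical grid cell: 50% insect, 30% selfing, 20% wind.
lr_s, lr_i = log_ratios(0.5, 0.3, 0.2)
back = inverse_log_ratios(lr_s, lr_i)
print(lr_s, lr_i, back)
```

The two log-ratios are unconstrained real numbers, so regression models can be fitted to them, and any fitted values map back to valid proportions.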

Spatial compositions • The multivariate conditional autoregressive (MCAR) model describes spatial dependence in multivariate data • By applying it to the log-ratios, we can get a model for compositional data on a spatial grid • This allows us to estimate the relationship between trait proportions and environmental variables, whilst accounting for residual spatial dependence

Trait-based prediction of invasive success • Part of ALARM involves assessing the risks associated with invasive species • Atlas data provide us with a measure of invasive success • Data on traits can be used as explanatory variables • We can use regression modelling again – but now modelling across species, rather than across spatial locations Contact: Adam Butler (adam@bioss.ac.uk) Reference: Work in progress: Butler, A., Küster, I., Kühn, I., Bierman, S.M. and Marion, G.

Response variables: • Number of cells occupied (count data), or… • Whether or not species is naturalised (binary data) • …which are both crude measures of invasive success • Explanatory variables: • A whole range of biological traits • e.g. size, ploidy, length of flowering season, type of reproduction • Some are continuous, others binary or ordinal

Statistical modelling • y_j = number of cells occupied by species j • x_j = trait data for species j • There are (at least) two possible models: • Log-normal: log y_j ~ Normal(a + b x_j, σ²) • Binomial: y_j ~ Binomial(p_j, M), where log(p_j / (1 - p_j)) = a + b x_j
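The two candidate models can be compared concretely through their log-likelihoods, sketched below for a single trait. The data and parameter values are hypothetical, and M is taken here to be the total number of grid cells, which is an assumption since the slide does not define it.

```python
import math


def lognormal_loglik(y, x, a, b, sigma):
    """Sum of Normal(a + b*x_j, sigma^2) log-densities evaluated at log y_j."""
    total = 0.0
    for yj, xj in zip(y, x):
        resid = math.log(yj) - (a + b * xj)
        total += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                  - resid ** 2 / (2.0 * sigma ** 2))
    return total


def binomial_loglik(y, x, a, b, m):
    """Sum of Binomial(p_j, m) log-probabilities with logit(p_j) = a + b*x_j."""
    total = 0.0
    for yj, xj in zip(y, x):
        p = 1.0 / (1.0 + math.exp(-(a + b * xj)))
        log_choose = (math.lgamma(m + 1) - math.lgamma(yj + 1)
                      - math.lgamma(m - yj + 1))
        total += log_choose + yj * math.log(p) + (m - yj) * math.log(1.0 - p)
    return total


# Hypothetical data: cells occupied by three species and one standardised trait.
cells = [12, 150, 900]
trait = [0.2, 1.0, 1.8]

print(lognormal_loglik(cells, trait, a=2.0, b=2.5, sigma=0.5))
print(binomial_loglik(cells, trait, a=-6.0, b=2.5, m=3000))
```

The log-normal version requires y_j > 0, whereas the binomial version respects the upper limit M on the number of occupied cells.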

Coping with phylogeny • Phylogenetic dependence is a key issue - species that have a similar evolutionary history will tend to have similar spatial distributions, even after accounting for trait effects • Sources and impacts are similar to those associated with spatial dependence – and the statistical methods to deal with the dependence are also similar… • One important difference – we often (as in BiolFlor) know the phylogenetic tree but not the branch lengths, & measure distance to be the number of branches (“patristic distance”)
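Counting branches between two species is straightforward when the tree is stored as a child-to-parent mapping. The tree and its labels below are hypothetical, and every branch is given length one because branch lengths are unknown.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def patristic_distance(a, b, parent):
    """Number of branches on the path between a and b (unit branch lengths)."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:
            # First shared ancestor: add the branches walked from each side.
            return depth_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


# Hypothetical tree: root -> families -> genera -> species.
parent = {
    "family1": "root", "family2": "root",
    "genusA": "family1", "genusB": "family1", "genusC": "family2",
    "sp1": "genusA", "sp2": "genusA", "sp3": "genusB", "sp4": "genusC",
}

print(patristic_distance("sp1", "sp2", parent))  # 2
print(patristic_distance("sp1", "sp3", parent))  # 4
print(patristic_distance("sp1", "sp4", parent))  # 6
```

These pairwise distances then play the role that distances between grid cells play in the spatial models.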

Prediction • We want to predict how many cells an existing or newly introduced plant species will occupy in Germany after, say, 50 years, based on the traits of that species • We can answer this using national data on residence time, e.g. time since first introduction or naturalisation • We include residence time as another explanatory variable within the regression model

Examples of predictions Using BiolFlor/Florkart data for German neophyte plant species

Missing data • Data on residence times are typically sparse, e.g. only available for 36% of neophyte plant species in BiolFlor • We need to impute these values, and to account for the uncertainty introduced by imputation • Easiest to do this in a Bayesian framework, using WinBUGS

Key statistical themes • Environmental heterogeneity & environmental change • Spatial & phylogenetic dependence • Use of data on traits and invasive history • False absences in atlas data, missing data on traits • Model-based approach to statistics, use of Bayesian inference

Role of the statistician • Develop strong collaborative relationships with scientists • Contribute to formulation of appropriate models • Ensure that assumptions are clear and explicit, and that sensitivity of results to these assumptions is explored • Advise & assist with software and computational issues • Undertake methodological research to develop new statistical techniques, when appropriate
