Machine Learning Problems in Species Occupancy Modeling

Rebecca Hutchinson, March 25, 2010

Toy Example

Adding Covariates

Challenge #2

Birds move. And hide.

Multiple Visits
• Visit each site more than once, recording detection histories Yit.
• E.g. a history of 0, 1, 0 records a detection on the second of three visits only.
• Population closure assumption: the species occupancy status does not change over the course of the visits to a site.

Assumptions
• Species is never misidentified.
• Occupancy status is constant across visits.
• Visits are separated enough to be conditionally independent, given the occupancy status.
• Sites are independent.

[Figure: graphical model with nodes Xi, oi, Zi, Wit, dit, Yit and parameters a, b; plates over sites i=1,…,M and visits t=1,…,T]
Key: square=discrete, circle=continuous, unshaded=latent, grey=observed, pink=parameter, blue=deterministic function of inputs, dashed=repeated section
• Xi = occupancy covariates at site i
• oi = probability of occupancy at site i
• Zi = true, unobserved occupancy status of site i
• a = parameters of occupancy model
• Wit = detection covariates at site i, visit t
• dit = probability of detection at site i, visit t
• Yit = observed presence/absence at site i, visit t
• b = parameters of detection model
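The generative story behind the diagram can be sketched in a few lines. This is a minimal illustration, not the presenter's code: it assumes logistic links from linear scores (oi = logistic(a·Xi), dit = logistic(b·Wit)), Zi ~ Bernoulli(oi), and Yit ~ Bernoulli(Zi·dit). The function names are hypothetical.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simulate_site(x, w_visits, a, b, rng):
    """Sample (z, y) for one site under the assumed model.

    x        -- occupancy covariates Xi (list of floats)
    w_visits -- one detection covariate vector Wit per visit
    a, b     -- occupancy and detection parameters (assumed linear weights)
    rng      -- a random.Random instance
    """
    o = logistic(sum(ai * xi for ai, xi in zip(a, x)))
    z = 1 if rng.random() < o else 0          # latent occupancy status Zi
    y = []
    for w in w_visits:
        d = logistic(sum(bi * wi for bi, wi in zip(b, w)))
        # No false positives: a detection requires the site to be occupied.
        y.append(1 if (z == 1 and rng.random() < d) else 0)
    return z, y
```

Because Zi is drawn once per site and reused for every visit, the sketch also encodes the closure assumption.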

Some details
• Conditional distributions:
• Conditional log-likelihood
• Expected joint log-likelihood
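The formulas on this slide did not survive in the transcript. As a hedged sketch, the function below computes one site's log-likelihood with the latent Zi summed out, which is one common reading of "conditional log-likelihood" (conditional on the covariates). It assumes Zi ~ Bernoulli(oi) and Yit | Zi ~ Bernoulli(Zi·dit); the function name is hypothetical.

```python
import math


def site_log_likelihood(y, o, d):
    """log P(Yi | oi, di) with Zi marginalized out.

    y -- detection history, list of 0/1
    o -- occupancy probability oi
    d -- detection probabilities dit, one per visit
    """
    # Probability of the history if the site is occupied.
    p_hist_given_occupied = 1.0
    for yt, dt in zip(y, d):
        p_hist_given_occupied *= dt if yt == 1 else (1.0 - dt)
    # If unoccupied, only the all-zero history is possible.
    p_hist_given_unoccupied = 0.0 if any(y) else 1.0
    return math.log(o * p_hist_given_occupied
                    + (1.0 - o) * p_hist_given_unoccupied)
```

Summing this quantity over sites gives the objective; an all-zero history gets probability from both terms, which is why occupancy and detection can trade off when explaining it.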

Typical Usage
• Fit a small number of models with differing (small) sets of covariates, using the conditional log-likelihood objective
• E.g. model 1 vs. model 2, where
  • o1 ~ rainfall + elevation, d1 ~ weather + time-of-day
  • o2 ~ rainfall + temperature, d2 ~ underbrush-density
• Evaluate models with AIC
• Books on this approach: MacKenzie et al 2006, Royle et al 2007.
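For reference, AIC is 2k − 2·log L, where k is the number of fitted parameters and log L is the maximized log-likelihood; the model with the lower value is preferred. A minimal sketch follows. The log-likelihood values in the usage comment are hypothetical placeholders, not results from the talk.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2.0 * num_params - 2.0 * log_likelihood


# Hypothetical usage: each model has an intercept plus two covariates for
# occupancy; model 1 has an intercept plus two detection covariates,
# model 2 an intercept plus one.
# aic(-120.0, 6) vs. aic(-123.5, 5)
```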

Outline
• Citizen Science: 2 motivating datasets
• Problem 1: Integrating more flexible models for occupancy and detection
  • Regularization
  • Boosted regression trees
  • (Joint work with Tom Dietterich)
• Problem 2: Alternative detection models
  • Experts vs. novices
  • Relaxing assumptions
  • (Joint work with Weng-Keen Wong and Jun Yu)

Cornell Lab of Ornithology
Mission: To interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

Birds in Forested Landscapes (BFL)
• Goals:
  • Determine habitat/landscape requirements of forest-dwelling birds (especially thrushes)
  • Translate results into management recommendations for conservation
  • Develop a network of experienced citizen scientists
• BFL is a continent-wide project that has engaged over 1,000 volunteers who surveyed over 3,000 study sites.
• Have data from 1997-2006
• Participants follow a rigorously tested protocol that includes:
  • selecting suitable study sites
  • visiting these sites at least twice during the breeding season
  • measuring a variety of habitat variables
• http://www.birds.cornell.edu/bfl/

BFL data
• Select forest patches, then survey points, and one or more species of interest.
• Visit 1: earliest date when all your study species have arrived
  • Want beginning of breeding period, but no birds still migrating.
• Visit 2: 2-4 weeks later
  • Breeding should be underway, different evidence available.
• Record presence/absence of 22 possible breeding behaviors observed in each period on each visit.
• Record presence/absence of competitors/predators on each visit.
• Record environmental variables at large, medium, and small scales.
• Observers work in teams of 1-4 people.

BFL data: visit protocol example
• Observation Period (mandatory 10 minutes)
  • Look and listen for predators, cowbirds, and study species
• Playback Period (mandatory 5 minutes per species)
  • Species 1: play songs, calls, or drums for 1 minute
  • Species 1: watch/listen for 1 minute
  • Species 1: repeat songs, calls, or drums for 1 minute
  • Species 1: watch/listen for 2 minutes
  • Species 2: play songs, calls, or drums for 1 minute
  • Species 2: watch/listen for 1 minute
  • Species 2: repeat songs, calls, or drums for 1 minute
  • Species 2: watch/listen for 2 minutes
• Behavior Watch Period (mandatory 10 minutes)
  • Play eastern or western mobbing calls for 5 minutes while looking and listening for study species
  • Watch/listen for 5 minutes

BFL data: habitat characteristics
• Survey point (where observer stands)
  • Latitude/longitude
  • Elevation
  • Distance to nearest edge, road, water, occupied building
• Study site (radius=150m)
  • Hydrology during breeding season
  • Forest cover type
  • Slope
  • Land use
  • Land ownership
  • Canopy characteristics
  • Low vegetation characteristics
• Landscape level (2500 acres)
  • Patch edge (what habitats are adjacent)
  • Forest patch size
  • Percentage of forest
  • Linear distance of edge
  • Distance to nearest 100 & 500 acre patches (if patch is less than 1000 acres)

Increasing model flexibility
• Why?
  • Many possible habitat variables
  • Interactions?
  • Exploratory modeling with many covariates rather than hypothesis testing with few
• 2 ideas:
  • Regularization
  • Boosted regression trees

How to regularize these models?
• One possible penalty:
• How should the two components be weighted?
  • Tug-of-war between occupancy and detection to explain the all-zero detection histories
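The penalty shown on the slide is not recoverable from the transcript. The sketch below is therefore an assumption for illustration only: an L2 penalty with separate weights for the occupancy parameters a and the detection parameters b, which makes the weighting question concrete. The names lam_occ and lam_det are hypothetical.

```python
def penalized_objective(log_lik, a, b, lam_occ, lam_det):
    """Log-likelihood minus separately weighted L2 penalties (assumed form).

    log_lik -- log-likelihood of the data at parameters (a, b)
    a, b    -- occupancy and detection coefficient lists
    lam_occ -- penalty weight on the occupancy component
    lam_det -- penalty weight on the detection component
    """
    penalty_occ = lam_occ * sum(ai * ai for ai in a)
    penalty_det = lam_det * sum(bi * bi for bi in b)
    return log_lik - penalty_occ - penalty_det
```

With lam_occ much larger than lam_det, the fit is pushed toward attributing all-zero histories to missed detections rather than to covariate-driven absence, and the reverse weighting pushes the other way.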

Preliminary synthetic data results
• 8 covariates for each model, half of which truly had non-zero coefficients
• Choice of objective function seems more important than regularization parameters

Posterior Regularization
• [Ganchev, Gillenwater, Graca, and Taskar, 2009]
• Regularization constraints on posterior expectations instead of parameters, for example:
  • Expected occupancy is less than 60%
  • Of the all-zero detection histories, only half can be ‘explained away’ by the detection model
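The quantities such constraints act on are posterior occupancy probabilities. As a hedged sketch under the basic model without false positives, the function below computes P(Zi = 1 | all-zero history) by Bayes' rule. Reading the second constraint as a bound on the average of this posterior over all-zero sites is an interpretation, not something the slide states, and the function names are hypothetical.

```python
def posterior_occupied_given_all_zero(o, d):
    """P(Z=1 | Y=(0,...,0)) for one site.

    o -- prior occupancy probability oi
    d -- detection probabilities dit, one per visit
    """
    p_miss_every_visit = 1.0
    for dt in d:
        p_miss_every_visit *= (1.0 - dt)
    occupied_and_missed = o * p_miss_every_visit
    return occupied_and_missed / (occupied_and_missed + (1.0 - o))


def mean_posterior(posteriors):
    """Average posterior occupancy, the kind of expectation a constraint bounds."""
    return sum(posteriors) / len(posteriors)
```

A site with a detection has posterior occupancy 1 under this model, so only the all-zero sites contribute uncertainty to the expected occupancy.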

Boosted Regression Trees
• Popular in species distribution modeling
  • [Elith et al 2006]
• Functional gradient ascent [Friedman 2001]
  • Regression trees predict F(X) and G(W)
  • F and G are fed through logistic() to get o and d
• Current challenge: tuning
  • learning rate (shrinkage)
  • number of trees to grow at each stage
  • depth of trees
  • number of stages
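In functional gradient ascent, each stage fits regression trees to the gradient of the objective with respect to the current function values. The sketch below computes those per-site targets for the marginal log-likelihood, assuming oi = logistic(F(Xi)) and dit = logistic(G(Wit)) with no false positives. Tree fitting and the shrinkage step are omitted, and the function name is hypothetical.

```python
import math


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def functional_gradients(y, f_value, g_values):
    """Gradients of one site's log-likelihood w.r.t. F(Xi) and each G(Wit).

    y        -- detection history, list of 0/1
    f_value  -- current F(Xi)
    g_values -- current G(Wit), one per visit
    """
    o = logistic(f_value)
    d = [logistic(g) for g in g_values]
    # A: probability of the history given the site is occupied.
    a_term = 1.0
    for yt, dt in zip(y, d):
        a_term *= dt if yt == 1 else (1.0 - dt)
    all_zero = 0.0 if any(y) else 1.0
    lik = o * a_term + (1.0 - o) * all_zero
    # dL/dF = (A - 1[all zero]) * o(1-o) / lik
    grad_f = (a_term - all_zero) * o * (1.0 - o) / lik
    # dL/dG_t = o * A * (y_t - d_t) / lik
    grad_g = [o * a_term * (yt - dt) / lik for yt, dt in zip(y, d)]
    return grad_f, grad_g
```

For a site with at least one detection these reduce to 1 − oi and yit − dit, the familiar logistic-regression residuals; all-zero sites get smaller, shared-out gradients.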

Where Birding Meets Science!

eBird: Current Stats (2009)
• ~70,000 users
• ~540,000 site visitors
• 173 countries/territories
• >1,500,000 checklists submitted
• 2,945 species reported
• 21 million observations reported

Northern Cardinal Distribution (Frequency of Detection)
• Gray – not reported
• Tan – insufficient data
• White – not covered

Extensions needed for eBird?
• Alternative detection model
  • Add a node for expertise of observer
• Relax the assumption of no misidentifications
  • Y|Z=1 ~ Bernoulli(d)
  • Y|Z=0 ~ Bernoulli(h), instead of Y=0 whenever Z=0
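Allowing false positives changes the unoccupied branch of the site likelihood: a history with detections no longer has probability zero when Z=0. A minimal sketch, assuming a single false-positive rate h shared across visits (the slide does not say whether h varies by visit or observer); the function name is hypothetical.

```python
def site_likelihood_with_false_positives(y, o, d, h):
    """P(Yi) with Y|Z=1 ~ Bernoulli(d_t) and Y|Z=0 ~ Bernoulli(h).

    y -- detection history, list of 0/1
    o -- occupancy probability
    d -- true-detection probabilities, one per visit
    h -- false-positive probability (assumed constant here)
    """
    p_given_occupied = 1.0
    p_given_unoccupied = 1.0
    for yt, dt in zip(y, d):
        p_given_occupied *= dt if yt == 1 else (1.0 - dt)
        p_given_unoccupied *= h if yt == 1 else (1.0 - h)
    return o * p_given_occupied + (1.0 - o) * p_given_unoccupied
```

Setting h = 0 recovers the original model, in which any detection implies the site is occupied.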

Model with expertise node
[Figure: graphical model with nodes Xi, Zis, Wics, Yics, Bic, Ej, Uj; plates indexed by i, c, s, j]

Preliminary results: Synthetic data
• Synthetic data generated from EOM with different levels of false positives
• [Figure: area under ROC curve]
Slide courtesy of Jun Yu

Preliminary results: eBird data
• Data from New York State from May and June of 2006, 2007, and 2008.
• 27 by 64 checkerboarding (New York State: width 285 miles (455 km), length 330 miles (530 km)).
• Each cell is roughly 16.8 km by 8.3 km.
• There are roughly 200 sites generated during training.
Slide courtesy of Jun Yu
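The cell dimensions follow from dividing the state's extent by the grid counts. The sketch below assumes 27 cells across the 455 km width and 64 along the 530 km length, which is consistent with the stated cell size. The alternating train/test assignment is only one plausible reading of "checkerboarding" and is an assumption; the function names are hypothetical.

```python
def cell_size_km(width_km, length_km, cells_across_width, cells_along_length):
    """Return (cell width, cell length) in km for a regular grid."""
    return width_km / cells_across_width, length_km / cells_along_length


def checkerboard_fold(col, row):
    """Assign alternating cells to fold 0 or 1, like squares on a checkerboard."""
    return (col + row) % 2


# 455 / 27 is about 16.85 km and 530 / 64 is about 8.28 km.
cell_w, cell_l = cell_size_km(455.0, 530.0, 27, 64)
```

Alternating folds keep neighboring cells in different splits, so a model is evaluated on locations adjacent to, but not inside, the cells it was trained on.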

More challenges
• Sampling bias
• Spatial autocorrelation
• For BFL, modeling multiple occupancy states
• For eBird, modeling abundance
• Multi-species approaches
• Dynamic models
  • migration
  • range shift

Questions? Comments? Suggestions?
