Machine Learning Problems in Species Occupancy Modeling

Rebecca Hutchinson, March 25, 2010


Presentation Transcript


  1. Machine Learning Problems in Species Occupancy Modeling Rebecca Hutchinson March 25, 2010

  2. Toy Example

  3. Adding Covariates

  4. Challenge #2

  5. Birds move. And hide.

  6. Multiple Visits • Visit each site more than once, recording detection histories Yit • E.g. • Population closure assumption: the species occupancy status does not change over the course of the visits to a site.

  7. Assumptions • Species is never misidentified. • Occupancy status is constant across visits. • Visits are separated enough to be conditionally independent, given the occupancy status. • Sites are independent.

  8. Graphical model (plate diagram; figure not reproduced) • Nodes: Xi, Wit, oi, Zi, Yit, dit • Plates: t=1,…,T and i=1,…,M • Parameters: a, b • Key: square=discrete, circle=continuous, unshaded=latent, grey=observed, pink=parameter, blue=deterministic function of inputs, dashed=repeated section • Xi = occupancy covariates at site i • oi = probability of occupancy at site i • Zi = true, unobserved occupancy status of site i • a = parameters of occupancy model • Wit = detection covariates at site i, visit t • dit = probability of detection at site i, visit t • Yit = observed presence/absence at site i, visit t • b = parameters of detection model
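The generative process described by these variables can be sketched as a small simulation. This is a minimal illustration, not the presenter's code: it assumes one occupancy covariate and one detection covariate per visit, standard-normal covariates, and logistic links for both o and d. The function names `logistic` and `simulate` are hypothetical.

```python
import math
import random


def logistic(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(M, T, a, b, rng):
    """Simulate M sites with T visits each.

    a = (intercept, slope) of the occupancy model (assumed logistic).
    b = (intercept, slope) of the detection model (assumed logistic).
    """
    sites = []
    for _ in range(M):
        x = rng.gauss(0.0, 1.0)                    # Xi: occupancy covariate
        o = logistic(a[0] + a[1] * x)              # oi: occupancy probability
        z = 1 if rng.random() < o else 0           # Zi: latent occupancy status
        w = [rng.gauss(0.0, 1.0) for _ in range(T)]       # Wit
        d = [logistic(b[0] + b[1] * wt) for wt in w]      # dit
        # Yit can only be 1 at an occupied site (no misidentification).
        y = [1 if (z == 1 and rng.random() < dt) else 0 for dt in d]
        sites.append({"x": x, "w": w, "z": z, "y": y})
    return sites


if __name__ == "__main__":
    for site in simulate(5, 3, (0.0, 1.0), (0.0, 1.0), random.Random(0)):
        print(site["z"], site["y"])
```

Unoccupied sites always produce all-zero detection histories, while occupied sites may also produce them. That overlap is what the repeated visits are meant to disentangle.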

  9. Some details • Conditional distributions: • Conditional log-likelihood • Expected joint log-likelihood
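The equations on this slide did not survive in the transcript. The forms below are the standard ones for a single-season occupancy model under the assumptions listed on slide 7, written with the symbols from slide 8. They are a reconstruction under that assumption, not a copy of the slide. Here "conditional log-likelihood" is read as the log-likelihood of the detection histories with Z summed out, and q_i denotes the posterior expectation of Z_i given the data.

```latex
% Conditional distributions
Z_i \mid X_i \sim \mathrm{Bernoulli}(o_i), \qquad
Y_{it} \mid Z_i, W_{it} \sim \mathrm{Bernoulli}(Z_i \, d_{it})

% Log-likelihood with Z_i summed out
\ell(a, b) = \sum_{i=1}^{M} \log \Big[ o_i \prod_{t=1}^{T} d_{it}^{Y_{it}} (1 - d_{it})^{1 - Y_{it}}
  + (1 - o_i) \, \mathbb{1}\big[ \textstyle\sum_{t} Y_{it} = 0 \big] \Big]

% Expected joint log-likelihood, with q_i = E[Z_i \mid Y_i, X_i, W_i]
Q(a, b) = \sum_{i=1}^{M} \Big[ q_i \log o_i + (1 - q_i) \log (1 - o_i)
  + q_i \sum_{t=1}^{T} \big( Y_{it} \log d_{it} + (1 - Y_{it}) \log (1 - d_{it}) \big) \Big]
```

For a site with at least one detection, q_i = 1, so the (1 - q_i) term vanishes. For an all-zero history, q_i is strictly between 0 and 1 whenever 0 < o_i < 1 and no d_it equals 1.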

  10. Typical Usage • Fit a small number of models with differing (small) sets of covariates, using the conditional log-likelihood objective • E.g. model 1 vs. model 2 where • o1 ~ rainfall + elevation, d1 ~ weather + time-of-day • o2 ~ rainfall + temperature, d2 ~ underbrush-density • Evaluate models with AIC • Books on this approach: MacKenzie et al. 2006, Royle et al. 2007.
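A rough sketch of the comparison step, assuming each candidate model has already been fit and yields per-site probabilities o and d. `site_log_lik` uses the standard occupancy likelihood with Z summed out, and `aic` is the usual 2k - 2 log L. Both function names and all numbers in the usage example are illustrative.

```python
import math


def site_log_lik(o, d, y):
    """Log-likelihood of one detection history y with Z summed out.

    o: occupancy probability for the site.
    d: list of per-visit detection probabilities.
    y: list of 0/1 detections, same length as d.
    """
    p_y_if_occupied = 1.0
    for dt, yt in zip(d, y):
        p_y_if_occupied *= dt if yt == 1 else (1.0 - dt)
    # An unoccupied site can only produce an all-zero history.
    p_y_if_empty = 1.0 if sum(y) == 0 else 0.0
    return math.log(o * p_y_if_occupied + (1.0 - o) * p_y_if_empty)


def aic(log_lik, num_params):
    """Akaike information criterion; lower is better."""
    return 2.0 * num_params - 2.0 * log_lik


if __name__ == "__main__":
    # Two made-up sites scored under one hypothetical fitted model.
    histories = [(0.6, [0.5, 0.4], [1, 0]), (0.3, [0.5, 0.5], [0, 0])]
    total = sum(site_log_lik(o, d, y) for o, d, y in histories)
    print(aic(total, num_params=6))
```

The parameter count would cover both the occupancy coefficients (a) and the detection coefficients (b), intercepts included.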

  11. Outline • Citizen Science: 2 motivating datasets • Problem 1: Integrating more flexible models for occupancy and detection • Regularization • Boosted regression trees • (Joint work with Tom Dietterich) • Problem 2: Alternative detection models • Experts vs. novices • Relaxing assumptions • (Joint work with Weng-Keen Wong and Jun Yu)

  12. Cornell Lab of Ornithology Mission: To interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

  13. Birds in Forested Landscapes (BFL) • Goals: • Determine habitat/landscape requirements of forest-dwelling birds (especially thrushes) • Translate results into management recommendations for conservation • Develop a network of experienced citizen scientists • BFL is a continent-wide project that has engaged over 1,000 volunteers who surveyed over 3,000 study sites. • Have data from 1997-2006 • Participants follow a rigorously tested protocol that includes: • selecting suitable study sites • visiting these sites at least twice during the breeding season and • measuring a variety of habitat variables. • http://www.birds.cornell.edu/bfl/

  14. BFL data • Select forest patches, then survey points, and one or more species of interest. • Visit 1: earliest date when all your study species have arrived • Want beginning of breeding period, but no birds still migrating. • Visit 2: 2-4 weeks later • Breeding should be underway, different evidence available. • Record presence/absence of 22 possible breeding behaviors observed in each period on each visit. • Record presence/absence of competitors/predators on each visit. • Record environmental variables at large, medium, and small scales. • Observers work in teams of 1-4 people.

  15. BFL data: visit protocol example • Observation Period (mandatory 10 minutes): look and listen for predators, cowbirds, and study species • Playback Period (mandatory 5 minutes per species) • Species 1: play songs, calls, or drums for 1 minute • Species 1: watch/listen for 1 minute • Species 1: repeat songs, calls, or drums for 1 minute • Species 1: watch/listen for 2 minutes • Species 2: play songs, calls, or drums for 1 minute • Species 2: watch/listen for 1 minute • Species 2: repeat songs, calls, or drums for 1 minute • Species 2: watch/listen for 2 minutes • Behavior Watch Period (mandatory 10 minutes) • Play eastern or western mobbing calls for 5 minutes while looking and listening for study species • Watch/listen for 5 minutes

  16. BFL data: habitat characteristics • Survey point (where observer stands) • Latitude/longitude • Elevation • Distance to nearest edge, road, water, occupied building • Study site (radius=150m) • Hydrology during breeding season • Forest cover type • Slope • Land use • Land ownership • Canopy characteristics • Low vegetation characteristics • Landscape level (2500 acres) • Patch edge (what habitats are adjacent) • Forest patch size • Percentage of forest • Linear distance of edge • Distance to nearest 100 & 500 acre patches (if patch is less than 1000 acres)

  17. Increasing model flexibility • Why? • Many possible habitat variables • interactions? • Exploratory modeling with many covariates rather than hypothesis testing with few • 2 ideas: • Regularization • Boosted regression trees

  18. How to regularize these models? • One possible penalty: • How should the two components be weighted? • tug-of-war between occupancy and detection to explain the all-zero detection histories
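The penalty shown on the slide was not preserved. As one possible illustration, the sketch below uses an L2 penalty with a separate weight for each component (`lam_o` for occupancy, `lam_d` for detection) and leaves intercepts unpenalized. All of these choices are assumptions, not necessarily the slide's formulation. The helper `all_zero_prob` illustrates the tug-of-war: quite different (o, d) pairs can assign almost the same probability to an all-zero history.

```python
def all_zero_prob(o, d, T):
    """Probability of T non-detections when each visit has detection prob d."""
    return o * (1.0 - d) ** T + (1.0 - o)


def penalized_objective(log_lik, a, b, lam_o, lam_d):
    """Log-likelihood minus separately weighted L2 penalties.

    a, b: coefficient lists with the intercept at index 0 (left unpenalized,
    an assumption of this sketch).
    """
    penalty_o = lam_o * sum(c * c for c in a[1:])
    penalty_d = lam_d * sum(c * c for c in b[1:])
    return log_lik - penalty_o - penalty_d


if __name__ == "__main__":
    # "Rarely present, easy to detect" vs. "usually present, hard to detect":
    print(all_zero_prob(0.2, 0.9, 2))
    print(all_zero_prob(0.9, 0.1168, 2))
```

Because the data alone barely distinguish these two explanations, the relative size of the two penalty weights can decide which component absorbs the all-zero histories.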

  19. Preliminary synthetic data results • 8 covariates for each model, half of which truly had non-zero coefficients • Choice of objective function seems more important than regularization parameters

  20. Posterior Regularization • [Ganchev, Gillenwater, Graca, and Taskar, 2009] • Regularization constraints on posterior expectations instead of parameters, for example: • Expected occupancy is less than 60% • Of the all-zero detection histories, only half can be ‘explained away’ by the detection model
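The two example constraints are stated in terms of posterior expectations of Z, which can be computed per site from o, d, and the detection history. The sketch below only evaluates those quantities. It does not implement the constrained optimization of Ganchev et al., and it reads "explained away" as the mean posterior occupancy among all-zero sites, which is an interpretation. Function names are hypothetical, and 0 < o < 1 is assumed.

```python
def posterior_occupancy(o, d, y):
    """P(Z = 1 | detection history y) for one site (assumes 0 < o < 1)."""
    if any(y):
        return 1.0  # a detection implies occupancy (no misidentification)
    miss = 1.0
    for dt in d:
        miss *= 1.0 - dt  # probability of missing the species on every visit
    return o * miss / (o * miss + (1.0 - o))


def constraint_quantities(sites):
    """Return (expected occupancy, share of all-zero sites explained away).

    sites: list of (o, d, y) tuples.
    """
    posts = [posterior_occupancy(o, d, y) for o, d, y in sites]
    zero_posts = [q for q, (_, _, y) in zip(posts, sites) if not any(y)]
    expected_occupancy = sum(posts) / len(posts)
    explained_away = sum(zero_posts) / len(zero_posts) if zero_posts else 0.0
    return expected_occupancy, explained_away


if __name__ == "__main__":
    demo = [(0.5, [0.5], [0]), (0.5, [0.5], [1])]
    occ, away = constraint_quantities(demo)
    print(occ <= 0.6, away <= 0.5)  # the two example constraints
```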

  21. Boosted Regression Trees • Popular in species distribution modeling • [Elith et al 2006] • Functional gradient ascent [Friedman 2001] • regression trees predict F(X) and G(W) • F and G are fed through logistic() to get o and d • Current challenge: tuning • learning rate (shrinkage) • number of trees to grow at each stage • depth of trees • number of stages
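For functional gradient ascent, each stage needs the gradient of the log-likelihood with respect to the scores F(Xi) and G(Wit). With o = logistic(F) and d = logistic(G), and the standard likelihood with Z summed out, these gradients take a simple form in terms of the posterior q = P(Z=1 | Y). The sketch below computes them for one site. The tree-fitting step is omitted: in a boosted model, regression trees would be fit to these values and added to F and G scaled by the learning rate. The names are hypothetical, and the derivation is a reconstruction, not taken from the slides.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def site_log_lik(F, G, y):
    """Log-likelihood of history y given scores F (site) and G (per visit)."""
    o = logistic(F)
    lik_occ = 1.0
    for g, yt in zip(G, y):
        dt = logistic(g)
        lik_occ *= dt if yt == 1 else (1.0 - dt)
    all_zero = 1.0 if sum(y) == 0 else 0.0
    return math.log(o * lik_occ + (1.0 - o) * all_zero)


def functional_gradients(F, G, y):
    """Gradients of site_log_lik with respect to F and each G[t]."""
    o = logistic(F)
    d = [logistic(g) for g in G]
    lik_occ = 1.0
    for dt, yt in zip(d, y):
        lik_occ *= dt if yt == 1 else (1.0 - dt)
    all_zero = 1.0 if sum(y) == 0 else 0.0
    q = o * lik_occ / (o * lik_occ + (1.0 - o) * all_zero)  # P(Z=1 | y)
    grad_F = q - o
    grad_G = [q * (yt - dt) for dt, yt in zip(d, y)]
    return grad_F, grad_G


if __name__ == "__main__":
    print(functional_gradients(0.3, [-0.2, 0.5], [0, 1]))
```

The detection gradient is the usual logistic residual (y - d) weighted by q, so visits to sites that are probably unoccupied contribute little to the detection model.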

  22. Where Birding Meets Science!

  23. eBird—Current Stats (2009) • ~70,000 users • ~540,000 site visitors • 173 countries/territories • >1,500,000 checklists submitted • 2,945 species reported • 21 million observations reported

  24. Northern Cardinal Distribution (Frequency of Detection) • Gray – not reported • Tan – insufficient data • White – not covered

  25. Extensions needed for eBird? • Alternative detection model • add a node for expertise of observer • Relax the assumption of no misidentifications • Y|Z=1 ~ Bernoulli(d) • Y|Z=0 ~ Bernoulli(h), where h is a false-positive probability (instead of Y=0 with certainty)
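With misidentifications allowed, an unoccupied site can produce detections, so the second term of the site likelihood no longer vanishes for histories that contain a 1. A minimal sketch, assuming a single false-positive probability h shared by all visits to a site (the slide does not say whether h varies by visit or observer), and with a hypothetical function name:

```python
def site_lik_false_pos(o, d, h, y):
    """Likelihood of history y when Y|Z=1 ~ Bernoulli(d_t), Y|Z=0 ~ Bernoulli(h)."""
    lik_occupied = 1.0
    lik_empty = 1.0
    for dt, yt in zip(d, y):
        lik_occupied *= dt if yt == 1 else (1.0 - dt)
        lik_empty *= h if yt == 1 else (1.0 - h)
    return o * lik_occupied + (1.0 - o) * lik_empty


if __name__ == "__main__":
    # With h = 0 this reduces to the original model with no misidentification.
    print(site_lik_false_pos(0.5, [0.5, 0.5], 0.0, [1, 0]))
    print(site_lik_false_pos(0.5, [0.5, 0.5], 0.1, [1, 0]))
```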

  26. Model with expertise node • Graphical model (figure not reproduced); labels as extracted: Bic, Zis, Yics, Ej, Uj, Xi, Wics, with plate indices s, j, c, i

  27. Preliminary results: Synthetic data • Synthetic data generated from EOM with different levels of false positives • Figure (not reproduced): area under ROC curve • Slide courtesy of Jun Yu

  28. Preliminary results: eBird data • Data from New York State for May and June of 2006, 2007, and 2008 • 27 by 64 checkerboarding (New York State: width 285 miles (455 km), length 330 miles (530 km)) • Each cell is roughly 16.8 km by 8.3 km • Roughly 200 sites are generated during training • Slide courtesy of Jun Yu
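The quoted cell size follows from dividing the state's extent by the grid dimensions, assuming the 27 cells span the width and the 64 cells span the length (the slide does not say which is which, but this assignment matches the stated figures). A quick check with a hypothetical helper:

```python
def cell_size_km(width_km, length_km, cells_across_width, cells_along_length):
    """Return (cell width, cell length) in km for a regular grid."""
    return width_km / cells_across_width, length_km / cells_along_length


if __name__ == "__main__":
    cell_w, cell_l = cell_size_km(455.0, 530.0, 27, 64)
    print(round(cell_w, 2), round(cell_l, 2))
    print(27 * 64)  # total number of grid cells
```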

  29. More challenges • Sampling bias • Spatial autocorrelation • For BFL, modeling multiple occupancy states • For eBird, modeling abundance • Multi-species approaches • Dynamic models • migration • range shift

  30. Questions? Comments? Suggestions?
