Gaussian Processes for Statistical Soil Modeling of the Tropics CMU/TechBridgeWorld: Juan Pablo Gonzalez Drew Bagnell CIAT Team: Simon Cook, Thomas Oberthur, Andrew Jarvis, Mauricio Rincon
Introduction: What is CIAT? • International Center for Tropical Agriculture • Is a not-for-profit organization • Conducts socially and environmentally progressive research in developing countries aimed at • reducing hunger and poverty • preserving natural resources • Works through partnerships with farmers, scientists, and policy makers • 800 people, 120 researchers from 37 different countries
Introduction: CIAT locations • One of 15 Future Harvest Centers, located in • Cali, Colombia (headquarters) • Kampala, Uganda (African Regional Office) • Vientiane, Laos (Asian Regional Office) • Honduras, Ecuador, Nicaragua, Bolivia, Kenya, Brazil, Sri Lanka and Thailand, amongst others. • Funded by CGIAR • Consultative Group on International Agricultural Research • 58 countries, private foundations, and international organizations CGIAR Members: World Bank, FAO, Ford Foundation, Rockefeller Foundation, Kellogg Foundation, USA, Canada, U.K., Australia, New Zealand, Sweden, Portugal, Norway, Denmark, Austria, Italy, India, Pakistan, Kenya, Nigeria, Bangladesh, Belgium, Brazil, China, Colombia, Cote d'Ivoire, Egypt, Finland, France, Germany, Indonesia, Iran, Ireland, Israel, Japan, Korea, Luxembourg, Malaysia, Mexico, Morocco, The Netherlands, Peru, The Philippines, Republic of South Africa, Romania, Russian Federation, Spain, Switzerland, Syrian Arab Republic, Thailand, Turkey, Uganda
Introduction: What is TechBridgeWorld? • An initiative within Carnegie Mellon University • Mission: • To collaboratively design and implement creative technological solutions that will benefit developing communities around the world • “To bridge the world with technology”
Introduction: Task at Hand • Input: • Soil scientists from CIAT • Computer Scientists from CMU/TechBridgeWorld • 2500 Field samples from Honduras • Result: • Statistical Soil Modeling for The Tropics
Introduction • Statistical soil modeling: • The development of statistical soil models for large areas based on soil samples and digital maps of environmental variables • Exploiting easy-to-measure variables • Also known as predictive soil mapping (PSM)
Introduction • Importance • To detect opportunities • Target soil-sensitive crops confidently within new areas • To reduce risk of failure in new crops • To detect threats • Assess impact of climate change • To understand soil interactions with land use • Understand local hydrology • Make decisions about appropriate changes in land use
Introduction • Why in the tropics? • Most developing countries are located in the tropics • Most funding for soil analysis and modeling does not go to the tropics • The tropics have climate patterns distinct from the rest of the globe • Only dry and wet seasons (instead of four seasons) • Almost constant day length • The main determinant of temperature is elevation
Introduction: Current Soil Map Coverage Throughout the World • Detailed soil maps: • USA: complete coverage at 1:24,000 – very extensive and expensive (~30 m grid size) • 68% of the countries (31% by area) have complete coverage at 1:1,000,000 or better (1 km grid size) • Rest of the World • 69% by area • FAO World Map
Introduction: Current Soil Map Coverage Throughout the World • Food and Agriculture Organization (FAO) Worldwide Soil Map • Published in 1974 • Worldwide coverage at 1:5,000,000 (~5 km grid size) • Based on U.S. Soil Taxonomy • 26 classes with subcategories • Example class: NITOSOLS (N) – Subclass UHTa-3: Soils having an argillic B horizon with a clay distribution where the percentage of clay does not decrease from its maximum amount by as much as 20 percent within 150 cm of the surface; lacking plinthite within 125 cm of the surface; lacking vertic and ferric properties. Low pH (high acidity)
Previous Work: FAO Soil Map • Problems: • Made with information and technology of 1960 • Significant changes in technologies such as GPS, remote sensing and GIS • Categorical data • Most soil types explain only a small proportion of the actual variation of properties • Soil variation is continuous • Soil attributes do not cluster perfectly: a cut on the basis of one attribute may split the variance of another attribute near its peak • Dependent on subjective expert opinion • Dependent on soil classification used • Low resolution
Traditional Soil Survey • Three steps: • Observation and measurement of ancillary data and soil profile • Observations incorporated into implicit conceptual model • Apply conceptual model to predict soil variation in unobserved sites • Conceptual model uses factors of soil formation • Soil is a function of climate, topography, organisms, parent material, time (H. Jenny, 1941)
Predictive Soil Mapping (PSM) • Statistical model using factors of soil formation • Soil is a function of climate, topography, organisms, parent material, time • Goals • Exploit relationships between environmental variables and soil properties to improve data collection efficiency • Produce and present data that better represent soil landscape continuity • Explicitly incorporate expert knowledge in the design
PSM: Existing Approaches • Ordinary Kriging • Weighted local spatial averaging • Spatial interpolation • Does not use knowledge of soil materials or processes • Requires a large number of closely-spaced samples • Block Kriging, Indicator Kriging, Co-Kriging • Extensions to include ancillary data • Difficult to extend to more than one ancillary variable
PSM: Existing Approaches • Expert Systems • Use expert knowledge to establish rule-based relationships between environment and soil properties • Do not use soil data to determine soil-landscape relationships • Regression Trees • Decision trees with linear models • Promising – Good results in Australia (Henderson, 2004)
New Approach: Gaussian Processes • Generalization of Gaussian distribution to function space of infinite dimension • Probabilistic (Bayesian) model • Completely determined by mean and covariance function • Prediction with mean and variance (confidence intervals) • Non-parametric • Very powerful • Complexity of model increases with more data • Not new. It started as kriging and has evolved as a replacement for supervised Neural Networks
Gaussian Processes • Interpolation technique equivalent to: • Neural Network with infinite number of hidden units • Radial Basis functions, with infinite number of basis functions • Least squares SVMs • Kernel Ridge Regression
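A minimal sketch of Gaussian process regression may make the "prediction with mean and variance" and the kernel ridge regression equivalence concrete. It assumes NumPy, a zero prior mean, a squared-exponential covariance, and toy one-dimensional data; none of the function names, parameter values, or data come from the original slides.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise_var=0.1):
    """Posterior mean and variance of a zero-mean GP at the rows of X_test."""
    K = rbf_kernel(X_train, X_train)
    K_s = rbf_kernel(X_train, X_test)
    K_ss_diag = np.diag(rbf_kernel(X_test, X_test))
    # Cholesky factor of the noisy training covariance (numerically stable solve).
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = K_ss_diag - np.sum(v**2, axis=0)
    return mean, var

# Toy data (assumed): noisy samples of sin(x) on [0, 10].
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(30)

# One query inside the sampled region and one far away from every sample.
X_test = np.array([[5.0], [50.0]])
mean, var = gp_predict(X_train, y_train, X_test, noise_var=0.01)
```

The posterior mean is the same expression as a kernel ridge regression prediction whose regularization constant equals the noise variance. The variance is what the GP adds: it is small near the samples and returns to the prior variance far from them, which is what allows a confidence estimate to accompany each predicted map cell.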
Gaussian Processes • Covariance function
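The formula from this slide did not survive in the transcript. As an illustration only, one covariance function of the kind commonly used for GP regression, with terms matching the hyperparameter names reported later in the results (bias, noise, per-variable length scales, vertical scales, linear coefficients), could be written as follows. The symbols are assumptions, and this is not necessarily the exact form used by the authors.

```latex
k(\mathbf{x}, \mathbf{x}') =
  v_0 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\ell_d^2} \right)
  + v_1 \sum_{d=1}^{D} a_d \, x_d \, x'_d
  + b
  + \sigma_n^2 \, \delta_{\mathbf{x}\mathbf{x}'}
```

Here $\ell_d$ is the length scale of input variable $d$, $v_0$ and $v_1$ are vertical scales for the squared-exponential and linear terms, $a_d$ are linear coefficients, $b$ is a bias, and $\sigma_n^2$ is the noise variance, which applies only when the two inputs are the same sample.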
Available Data • 2500 soil samples from Honduras • Digital maps of Honduras with: • Climate: • Temperatures (max, min, average, etc) • Precipitation (max, min, average, etc) • Topography • 90-m elevation maps • Vegetation Index • Measurement of vegetation cover • And derived variables
Gaussian Processes • Learning the hyperparameters • Maximize the probability of the hyperparameters given the data • Use scaled conjugate gradient descent • Takes approximately 20 minutes with current data set • Selecting variables • Select most promising variables and incrementally add them to the model • Would take 54 hrs for each variable selected!
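A minimal sketch of hyperparameter learning, assuming NumPy, a flat prior over the hyperparameters (so that maximizing their posterior probability reduces to maximizing the marginal likelihood), and toy data. A coarse grid search over one length scale stands in for the scaled conjugate gradient optimizer mentioned on the slide; the function names and values are illustrative, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, length_scale, signal_var=1.0):
    """Squared-exponential covariance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return signal_var * np.exp(-0.5 * np.sum(diff**2, axis=2) / length_scale**2)

def neg_log_marginal_likelihood(X, y, length_scale, noise_var=0.01):
    """Negative log p(y | X, hyperparameters) for a zero-mean GP."""
    n = len(y)
    K = rbf_kernel(X, length_scale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha                 # how well the model explains y
    complexity = np.sum(np.log(np.diag(L)))    # 0.5 * log det(K)
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

# Toy data (assumed): noisy samples of sin(x), which varies on a scale of about 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Compare a far-too-short, a plausible, and a far-too-long length scale.
candidates = [0.01, 1.0, 100.0]
scores = {ls: neg_log_marginal_likelihood(X, y, ls) for ls in candidates}
best_length_scale = min(scores, key=scores.get)
```

Each evaluation requires factorizing an n-by-n matrix, so the cost grows roughly with the cube of the number of training samples. That is why learning on the full training set for every candidate variable becomes so expensive.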
Gaussian Processes: Variable Selection • Greedy search on R² of validation set • Learn parameters for all candidate variables @10% of training set • Calculate R² on validation set for all candidate variables @10% of training set • Select variable with best R² • Learn parameters @80% of training set with selected variables • Calculate R² with selected variables @80% of training set • Decide whether to continue based on the validation-set R² obtained with those parameters • R²: coefficient of determination. Percentage of the variance explained by the model
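The greedy procedure above can be sketched as follows. This is an illustration under stated assumptions: `cheap_score` and `full_score` are hypothetical callbacks that would train the model on 10% and 80% of the training set and return the validation-set R², and the toy scoring function and variable gains at the bottom are invented purely to exercise the loop (`NDVI` is a hypothetical variable name).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def greedy_forward_selection(candidates, cheap_score, full_score, max_vars=8):
    """Add one variable at a time while the validation R^2 keeps improving."""
    selected, best_full = [], float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_vars:
        # Cheap pass: rank every remaining variable using the small training subset.
        best_var = max(remaining, key=lambda v: cheap_score(selected + [v]))
        # Expensive pass: refit once on the large subset with the winning variable.
        new_full = full_score(selected + [best_var])
        if new_full <= best_full:
            break  # validation R^2 stopped improving
        selected.append(best_var)
        remaining.remove(best_var)
        best_full = new_full
    return selected, best_full

# Toy stand-in for model training (assumed): each variable contributes a fixed
# gain and every added variable costs a small penalty.
toy_gain = {"XUTM": 0.20, "YUTM": 0.15, "P5": 0.10, "NDVI": 0.005}

def toy_score(variables):
    return sum(toy_gain[v] for v in variables) - 0.01 * len(variables)

selected, final_r2 = greedy_forward_selection(list(toy_gain), toy_score, toy_score)
```

Splitting the work into a cheap ranking pass and a single expensive refit is what keeps the cost per selected variable manageable: only one model per round is trained on the large subset.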
Training Time • With 10%/80% approach: • 15 s per R² calculation @10% • 50 minutes for all variables (68), with three length scale priors on each • 20 minutes per R² calculation @80% • Total: 1 h 10 min per variable. Up to 9 h for 8 variables • With 25%/80% approach: • 1 minute per R² calculation • Total: 3 h 30 min per variable. Up to 27 h for 8 variables • With 80% approach: • 20 minutes per R² calculation • Total: 54 h per variable. Up to 18 days for 8 variables
Results: FAO Map of Honduras • NITOSOLS (N): Soils having an argillic B horizon with a clay distribution where the percentage of clay does not decrease from its maximum amount by as much as 20 percent within 150 cm of the surface; lacking plinthite within 125 cm of the surface; lacking vertic and ferric properties. • Low pH (high acidity)
Results: Accuracy Of Current Techniques • “A soil survey is good if the map units have the right soil more than 50% of the time” • Most measurements have a variability of 20% or more between laboratories • Most quantitative prediction methods explain less than 10% of variation • Exception: Henderson 2004 in Australia
Results: pH in Topsoil • %Experiment: 554, PHW1 vs. inputs. Training set= 82% • out_variable = PHW1 • variables = { 'XUTM' 'YUTM' 'P5' } • %final hyperparameters: • in_params = [ 0.1414 -1.3439 4.3123 3.5009 -1.9544 -0.8364 -1.3607 ] • Train/Test2 error: • Data 0.7547/0.7567 • Model 0.4800/0.5590 • Train/Test2 r^2: • 0.5954/0.4544 • bias: 1.151939 • noise: 0.260834 (std = 0.51072) • lengthscale: • XUTM 0.115770 (11067.51) • YUTM 0.173696 (11198.10) • P5 2.656948 ( 6.60) • vertical scale: 0.256473 • linear coefficients: • XUTM -0.039257 • YUTM -0.133942 • P5 0.276685 • vertical scale: 0.433256 P5: Maximum temperature of warmest month
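The numbers in this experiment log can be cross-checked with a short calculation. The sketch below assumes that the "Data" error is the standard deviation of the measured pH and the "Model" error is the root-mean-square prediction error; under that reading, r² = 1 − (model error / data error)². That interpretation is inferred from the numbers, not stated on the slide.

```python
import math

def r_squared_from_errors(model_rms, data_std):
    """Fraction of variance explained, given RMS model error and data spread."""
    return 1.0 - (model_rms / data_std) ** 2

# Values copied from the experiment log above.
train_r2 = r_squared_from_errors(0.4800, 0.7547)
test_r2 = r_squared_from_errors(0.5590, 0.7567)

# The log reports the noise both as a variance and as a standard deviation.
noise_std = math.sqrt(0.260834)

print(f"train r^2 ~ {train_r2:.4f}, test r^2 ~ {test_r2:.4f}, noise std ~ {noise_std:.5f}")
```

Under this assumption the computed values agree with the reported 0.5954 and 0.4544 to about three decimal places, and the noise standard deviation matches the reported 0.51072.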
Results: pH in Topsoil P5: Maximum temperature of warmest month
Prediction Time • 21 ms/cell – 1700 training points, Pentium 4 1.8 GHz • Honduras (112,000 km²) • 40 minutes @ 1 km • 3.4 days @ 90 m • 30 days @ 30 m • Africa (30,000,000 km²) • 7.2 days @ 1 km • 2.4 years @ 90 m • 22 years @ 30 m • USA (9,158,000 km²) • 2.2 days @ 1 km • World (148,940,000 km²) • 37 days @ 1 km
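The estimates above follow from multiplying the number of grid cells by the per-cell prediction time. A small sketch of that arithmetic, assuming square cells, a single machine, and the 21 ms/cell figure from the slide (the function name is illustrative):

```python
SECONDS_PER_CELL = 0.021  # 21 ms per cell, as reported for 1700 training points

def prediction_seconds(area_km2, cell_size_m):
    """Total prediction time for a region covered by square grid cells."""
    cells = area_km2 / (cell_size_m / 1000.0) ** 2
    return cells * SECONDS_PER_CELL

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

honduras_1km_minutes = prediction_seconds(112_000, 1000) / 60
honduras_90m_days = prediction_seconds(112_000, 90) / SECONDS_PER_DAY
africa_30m_years = prediction_seconds(30_000_000, 30) / SECONDS_PER_YEAR

print(f"Honduras @ 1 km: {honduras_1km_minutes:.1f} minutes")
print(f"Honduras @ 90 m: {honduras_90m_days:.1f} days")
print(f"Africa @ 30 m: {africa_30m_years:.1f} years")
```

Because the cell count grows with the inverse square of the cell size, going from 90 m to 30 m cells multiplies the prediction time by nine. The results approximately reproduce the figures listed on the slide.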
Results: Impact • Gaussian Processes for PSM: • Provide quantitative predictions • Provide quantitative estimate of confidence • Combine pedogenic factors and spatial interpolation • Allow for complete coverage • Enable continued improvement • Match or advance state of the art in predictive soil mapping
Future Work • In Gaussian Processes for Predictive Soil Mapping • Validate Results • Improve existing variables • Find new variables to improve results • Compare with leading approach: Regression Trees • Participate in international workshop to assess viability of worldwide coverage with latest techniques
Future Work: Weather Index Insurance for Small Farmers • Rather than insuring yield loss… • Insure for weather: the most likely cause of yield loss is lack or excess of rain • Reduces fraud • Reduces cost • Challenges: • Event timing is critical • Needs very low false positive and false negative rates • Impact of rainfall depends on terrain and soil type
Future Work: Analysis of Digital Aerial Imagery • Captured with low-cost hot air balloon or kite • Automatic image mosaicing • Generation of elevation maps from images
Future Work: Automatic Coast Line Extraction • 90 m Digital Elevation Maps available for the world, from shuttle mission.
Future Work: Temporal Analysis of Vegetation Cover • To monitor natural changes and human impact
Conclusions • Great contributions can be made by applying computer science techniques to other fields • Scientists in other fields are frequently limited to off-the-shelf solutions • Working with existing groups in developing countries can maximize impact of short-term work