A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS

A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS. Xun Shi 1 , Anand Padmanabhan 2 , and Shaowen Wang 2 1 Department of Geography, Dartmouth College


Presentation Transcript


  1. A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS • Xun Shi (1), Anand Padmanabhan (2), and Shaowen Wang (2) • (1) Department of Geography, Dartmouth College • (2) Department of Geography and Geographic Information Science, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign • September 2013

  2. Basic functionality of CyberGIS • Accessibility: Making GIS capabilities accessible to a large number of users for research and education through the online CyberGIS Gateway; • Computational Capability: Embedding geospatial software capabilities into advanced cyberinfrastructure environments; • Interoperability: Managing heterogeneous and distributed resources and services through GISolve middleware.

  3. Basic functionality of CyberGIS • Accessibility: Making GIS capabilities accessible to a large number of users for research and education through the online CyberGIS Gateway; • Computational Capability: Embedding geospatial software capabilities into advanced cyberinfrastructure environments; • Interoperability: Managing heterogeneous and distributed resources and services through GISolve middleware.

  4. A computational approach to spatial epidemiology • Disaggregate polygon-level location data using restricted and controlled Monte Carlo (RCMC). • Calculate local statistics, e.g., calculate intensity of disease occurrence using kernel ratio estimation (KRE). • Estimate statistical significance of the intensity using unrestricted and controlled Monte Carlo (UCMC).

  5. Disaggregate polygon-level location data [Figure: example polygon labeled "23 births with defects" and "1202 births". Legend: birth with defect(s); normal birth; population shaded from high to low.]

  6. Restricted and Controlled Monte Carlo (RCMC) for Disaggregation • Assign polygon-level addresses to random locations. • The randomization is restricted by the smallest polygon to which a polygon-level address belongs. • The randomization is controlled by the detailed background data. • The randomization is repeated many times (Monte Carlo).
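The four RCMC steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rcmc_disaggregate`, the cell-based representation of a polygon, and the toy population values are all assumptions made for the example. The restriction is expressed by drawing only from cells inside the polygon, and the control by weighting each cell with its background population.

```python
import random

def rcmc_disaggregate(n_cases, polygon_cells, background, n_iterations, seed=0):
    """Place n_cases at random cells inside one polygon, n_iterations times.

    polygon_cells: list of (row, col) cells inside the polygon (the restriction).
    background: dict mapping (row, col) -> population weight (the control).
    """
    rng = random.Random(seed)
    weights = [background.get(cell, 0.0) for cell in polygon_cells]
    if sum(weights) <= 0:
        raise ValueError("polygon has no background population")
    # Each realization is one weighted random placement of all cases.
    return [rng.choices(polygon_cells, weights=weights, k=n_cases)
            for _ in range(n_iterations)]

# Hypothetical toy polygon of four cells; the unpopulated cell never receives a case.
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
population = {(0, 0): 50.0, (0, 1): 30.0, (1, 0): 20.0, (1, 1): 0.0}
realizations = rcmc_disaggregate(23, cells, population, n_iterations=100)
```

Each element of `realizations` is one plausible set of case locations. Running the downstream statistic on every realization is what exposes the spatial uncertainty caused by the imprecise input.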

  7. Advantages of RCMC • Allows analyses designed for individual/precise locations to be conducted. • Maximizes the utilization of available spatial information. • Explicitly evaluates the spatial uncertainty caused by the imprecision in the data.

  8. Kernel ratio estimation (KRE) for Estimating Local Disease Intensity • Essentially, calculate the ratio between cases and cohort for each and every location. [Figure legend: birth with defect(s); normal birth.]
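The ratio idea can be illustrated for a single location as follows. This is a sketch under stated assumptions: the slides do not specify the kernel function, so the quadratic distance decay used here, the fixed bandwidth, and the names `kernel_weight` and `kre_at_site` are hypothetical choices for illustration only.

```python
import math

def kernel_weight(distance, bandwidth):
    """Quadratic distance-decay weight; zero at or beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return 1.0 - (distance / bandwidth) ** 2

def kre_at_site(site, cases, cohort, bandwidth):
    """Ratio of kernel-weighted cases to kernel-weighted cohort at one site."""
    num = sum(kernel_weight(math.dist(site, p), bandwidth) for p in cases)
    den = sum(kernel_weight(math.dist(site, p), bandwidth) for p in cohort)
    return num / den if den > 0 else float("nan")

# Hypothetical points: the cohort contains every birth, the cases are a subset of it.
cohort = [(0, 0), (1, 0), (0, 1), (5, 5)]
cases = [(0, 0)]
ratio_near = kre_at_site((0, 0), cases, cohort, bandwidth=2.0)    # 1.0 / 2.5
ratio_far = kre_at_site((10, 10), cases, cohort, bandwidth=2.0)   # no cohort in reach
```

Sites with no cohort inside the kernel return NaN rather than a ratio, which mirrors treating them as "nodata" in a raster.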

  9. Settings of KRE • Fixed bandwidth vs. adaptive bandwidth • Site-side kernel vs. case-side kernel

  10. Types of KRE • Case-side fixed bandwidth • Site-side fixed bandwidth • Case-side adaptive bandwidth • Site-side adaptive bandwidth

  11. Unrestricted and Controlled Monte Carlo (UCMC) for Estimating Statistical Significance [Diagram: RCMC → KRE and UCMC → KRE; the two KRE results are compared to produce a P-value.]
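One common way to turn such a comparison into a P-value is the Monte Carlo rank test, sketched below for a single location. The slides do not state the exact comparison rule, so the upper-tail rank formula and the function name `monte_carlo_p_value` are assumptions; the simulated values are made up for the example.

```python
def monte_carlo_p_value(observed, simulated):
    """Upper-tail rank P-value: share of values at least as large as observed.

    The observed value counts as one of the (n + 1) outcomes, so the
    smallest attainable P-value with n simulations is 1 / (n + 1).
    """
    n_as_extreme = sum(1 for s in simulated if s >= observed)
    return (n_as_extreme + 1) / (len(simulated) + 1)

# Hypothetical KRE values at one pixel from 99 simulated layers.
simulated = [i / 1000 for i in range(99)]   # 0.000, 0.001, ..., 0.098
p = monte_carlo_p_value(0.0975, simulated)  # only 0.098 is at least as large
```

Under this rule, 99 simulations give a denominator of 100, so P-values fall on a 0.01 grid.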

  12. Epidemiological Confounding factors 2

  13. [Maps: hot spots; mean P-value; standard deviation of P-value. Legend values: 0.020, 0.000, 1.000, 0.006.]

  14. RCMC-UCMC-based Simulated Case-Control Study for Detecting Disease-Environment Association • Case location from RCMC • Control location from UCMC • Environmental exposure

  15. Spatial variation in disease-environment association: A map of P-value [Map legend: P-value, from 1 to 0.0001.]

  16. Computational Demand I: Number of local statistic computing (e.g., KRE) iterations in RCMC and UCMC • Scenario: stratification is needed for addressing confounding factors; case data are at the polygon level; cohort data are at the polygon level; detailed background data are available • RCMC iterations: No. of strata X No. of iterations for cases X No. of iterations for cohort, e.g., 2 X 100 X 100 = 20,000 • UCMC iterations: No. of strata X No. of iterations for simulation X No. of iterations for cohort, e.g., 2 X 99 X 100 = 19,800

  17. Computational Demand II: Number of layer-on-layer comparisons for estimating P-value • No. of iterations for cases X No. of iterations for simulation X No. of iterations for cohort, e.g., 100 X 99 X 100 = 990,000
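The counts on slides 16 and 17 follow directly from the iteration settings. The short calculation below reproduces them; the variable names are chosen for the example only.

```python
n_strata = 2
n_case_iter = 100      # RCMC iterations for cases
n_cohort_iter = 100    # RCMC iterations for cohort
n_sim_iter = 99        # UCMC simulation iterations

# One local-statistic (e.g. KRE) run per combination.
rcmc_runs = n_strata * n_case_iter * n_cohort_iter    # 20,000
ucmc_runs = n_strata * n_sim_iter * n_cohort_iter     # 19,800

# Layer-on-layer comparisons needed to estimate the P-value.
comparisons = n_case_iter * n_sim_iter * n_cohort_iter  # 990,000
```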

  18. Computational Demand III: Pixel-wise statistic computing • Major operations, using case-side adaptive-bandwidth KRE as an example: expand the kernel outward in a spiral fashion; accumulate the distance-decayed kernel value for each case encountered; accumulate the cohort value; check whether the threshold is met • Number of pixels processed: No. of pixels that are not "nodata" pixels, e.g., about 3 million in a 1652 X 2912 raster
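The per-pixel loop described above can be approximated as follows. This is a simplified sketch, not the authors' raster code: instead of spiraling over pixels it sorts cohort points by distance, which finds the same stopping radius. The quadratic decay, the unweighted cohort count, and the name `adaptive_kre_at_site` are assumptions.

```python
import math

def adaptive_kre_at_site(site, cases, cohort, min_cohort):
    """Grow the kernel until it covers min_cohort cohort members, then take the ratio."""
    distances = sorted(math.dist(site, p) for p in cohort)
    if len(distances) < min_cohort:
        return float("nan")                    # threshold can never be met
    bandwidth = distances[min_cohort - 1]      # radius at which the threshold is met

    def decay(d):
        # Guard against a zero bandwidth when all members sit on the site itself.
        return 1.0 if bandwidth == 0 else 1.0 - (d / bandwidth) ** 2

    cohort_total = sum(1 for d in distances if d <= bandwidth)
    case_total = sum(decay(math.dist(site, p))
                     for p in cases if math.dist(site, p) <= bandwidth)
    return case_total / cohort_total

# Hypothetical points; with min_cohort=3 the kernel stops at radius 2.
cohort = [(0, 0), (1, 0), (0, 2), (3, 0), (0, 4)]
cases = [(0, 0), (1, 0)]
intensity = adaptive_kre_at_site((0, 0), cases, cohort, min_cohort=3)
```

Because the stopping radius depends on local cohort density, the cost per pixel varies across the raster, which is part of why this step dominates the computation.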

  19. Computational Demand IV: Memory • Number of raster layers generated during the process: No. of RCMC iterations + No. of UCMC iterations + No. of parallel comparisons, e.g., 20,000 + 19,800 + 10,000 = 49,800 • Memory: Size of data type X No. of columns X No. of rows X No. of raster layers, e.g., 4 bytes X 1652 X 2912 X 49,800 = 550 gigabytes
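The memory estimate can be checked with a short calculation. Evaluated literally over the full 1652 X 2912 raster, the formula gives roughly 958 GB. Counting only the approximately 3 million non-nodata pixels mentioned on slide 18 gives roughly 598 GB (about 557 GiB), which is closer to the quoted figure; whether that is how the slide's number was derived is an assumption.

```python
bytes_per_cell = 4
n_cols, n_rows = 1652, 2912
n_layers = 20_000 + 19_800 + 10_000            # 49,800 raster layers

# Formula as written on the slide: every cell of every layer is stored.
full_raster_bytes = bytes_per_cell * n_cols * n_rows * n_layers

# Alternative reading (an assumption): only ~3 million valid pixels are stored.
valid_only_bytes = bytes_per_cell * 3_000_000 * n_layers

full_raster_gb = full_raster_bytes / 1e9       # about 958
valid_only_gb = valid_only_bytes / 1e9         # about 598
```

Either way, the total is far beyond the 32 GB of RAM on a single workstation, which motivates the migration to cyberinfrastructure.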

  20. On an HP Z800 Workstation (2 Xeon CPUs 3.07GHz, 32GB RAM) • Mapping birth defects for New Hampshire • 1400 birth defect cases for 2003-2009 • 99,000 births for 2003-2009 • 2 age categories • 220 town polygons • 100-m resolution female population raster (1652 x 2912) • 100 RCMC iterations for cases • 100 RCMC iterations for cohort • 99 UCMC iterations • 40 hours

  21. Migrating to CyberGIS • Set up infrastructure • New repository created in CyberGIS SVN • Establish a development environment • Define the application interface using GISolve Open Service APIs • Build and deploy the code on cyberinfrastructure resources from SVN • Publish the application • Test application execution

  22. Computation Management through GISolve Open Service APIs • Compress the input into a single zip file and make it available at a Web-accessible location • Inputs to the program include files for point cases, zone cases, cohort, background, and zones, plus associated settings needed by the application • The URL of the zip file is the single parameter to the Open Service APIs • Code execution and input/output data are put into a computation sandbox • Simply run php job-submit.php and the GISolve middleware will take care of the rest
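The first step, bundling all inputs into one zip file, could be done with Python's standard `zipfile` module as sketched below. The helper name `bundle_inputs` and the file names `cases.csv` and `settings.txt` are hypothetical placeholders, not the application's real input names. Uploading the archive and calling the GISolve Open Service APIs are not shown.

```python
import tempfile
import zipfile
from pathlib import Path

def bundle_inputs(input_dir, archive_path):
    """Compress every regular file in input_dir into one zip archive."""
    input_dir, archive_path = Path(input_dir), Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(input_dir.iterdir()):
            # Skip subdirectories and the archive itself if it lives in input_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                zf.write(path, arcname=path.name)
    return archive_path

# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    inputs = Path(tmp) / "inputs"
    inputs.mkdir()
    (inputs / "cases.csv").write_text("x,y\n1,2\n")
    (inputs / "settings.txt").write_text("bandwidth=1000\n")
    archive = bundle_inputs(inputs, Path(tmp) / "job_input.zip")
    with zipfile.ZipFile(archive) as zf:
        bundled_names = zf.namelist()
```

The resulting archive would then be placed at a Web-accessible URL, and that URL passed as the single job parameter.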

  23. Parallel computing through CIGI local cluster and XSEDE • Original MFC (Windows) code was extracted and adapted to run in a Linux environment • Application code has been checked into the CyberGIS SVN for co-development and deployment on a CIGI local cluster and XSEDE • Developed a set of parallel and distributed computing strategies based on a spatial computational domain construct • Optimizing the computational performance of these strategies
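The slide does not spell out the parallel strategies, so the following is only a generic illustration of spatial domain decomposition, not the authors' method. It splits the raster rows into contiguous bands, one per worker, and gives each band extra "halo" rows to read so that kernels near a band edge can still reach their neighbors. The function name `row_bands`, the worker count, and the halo width are assumptions.

```python
def row_bands(n_rows, n_workers, halo):
    """Split rows into contiguous bands; each band also reads `halo` rows per side."""
    base, extra = divmod(n_rows, n_workers)
    bands, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        bands.append({
            "own": (start, stop),                                       # rows this worker writes
            "read": (max(0, start - halo), min(n_rows, stop + halo)),   # rows it must load
        })
        start = stop
    return bands

# Hypothetical split of the 2912-row raster across 8 workers with a 50-row halo.
bands = row_bands(n_rows=2912, n_workers=8, halo=50)
```

With a fixed bandwidth the halo can be set to the bandwidth in pixels. With an adaptive bandwidth the required reach is not known in advance, which is one reason decomposition for this workload needs more careful design.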

  24. Ongoing … • Accessibility: Making GIS capabilities accessible to a large number of users for research and education through the online CyberGIS Gateway; • Computational Capability: Embedding geospatial software capabilities into advanced cyberinfrastructure environments; • Interoperability: Managing heterogeneous and distributed resources and services through GISolve middleware.

  25. Designing and constructing a secure data transport protocol and tunnel …

  26. Acknowledgements • National Science Foundation - OCI-1047916 • XSEDE SES070004 • NIH P20RO18787 • NIH P20ES018175 and EPA RD83459901 • Dartmouth Neukom/IQBS CompX Faculty Grant

  27. Thanks! Questions …