Metadata, Provenance and Web Service for Spatial Analysis --the case of spatial weights Luc Anselin, Sergio Rey, Wenwen Li GeoDa Center School of Geographical Sciences and Urban Planning Arizona State University
Some Specific Project Goals • Integrate and sustain a core set of composable, interoperable, manageable, and reusable CyberGIS software elements based on community-driven and open source strategies
Challenge • most current spatial analysis/spatial econometrics software is written for a single CPU • rethink and rewrite analytical, algorithmic and processing facilities to integrate them into a cyberinfrastructure • address the lack of interoperability
Spatial Econometrics Workbench • framework for supporting spatial econometric research in a cyberscience era (Anselin and Rey, IJGIS 2012) • leverage PySAL and CyberGIS • support scientific workflows
PySAL • open source library of Python routines for spatial analysis: geocomputation, spatial weights, spatial autocorrelation, spatial econometrics, regionalization • http://pysal.org • hosted on github
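For context, a minimal sketch of the PySAL 1.x weights API described above; the shapefile name is illustrative and stands in for any local polygon shapefile.

```python
# Minimal sketch using the PySAL 1.x API; NAT.shp stands in for any
# local polygon shapefile.
import pysal

# build rook contiguity weights directly from a polygon shapefile
w = pysal.rook_from_shapefile("NAT.shp")

print(w.n)            # number of observations
print(w.pct_nonzero)  # sparseness of the implied weights matrix
```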
PySAL Progress Report • current version is 1.6 (7th release) • 3.5 years of on-time bi-annual releases • 20,000+ downloads (10,000 in 2012) • recognized in the open source scientific community (included in the Anaconda distribution)
Migrating to CyberGIS • performance = need for parallelization + refined algorithms • interoperability = provide functionality as web services • replicability = need for metadata and provenance tracking
Example: Spatial Weights • includes spatial data source, type of weights (e.g., contiguity, distance), any standardization or manipulation (e.g., higher order)
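A hypothetical wmd fragment capturing these items; the field names are taken from the WPS request shown later in this deck.

```python
# Hypothetical wmd (weights metadata) record; the keys mirror the
# metadata passed to the web service in the Get Request slide.
import json

wmd = {
    "input1": {"type": "shp", "uri": "http://toae.org/pub/NAT.shp"},  # data source
    "weight_type": "rook",         # contiguity criterion
    "transform": "O",              # "O" = original scale, no standardization
    "parameters": {"p": 2, "k": 4},
}

with open("NAT_rook.wmd", "w") as f:
    json.dump(wmd, f, indent=2)
```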
Lack of Interoperability • different implementations • no standards • duplication of effort • hinders workflow chaining and reuse
Example: PySAL spreg • what do we know about south_k6.gwt and south_ep_k20.kwt?
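To make the gap concrete, a hedged sketch of the problem: spreg will happily consume a weights file, but nothing inside the file records how it was built. The variable names below come from PySAL's south example dataset; the weights file is the one named on the slide.

```python
# Provenance gap illustrated with PySAL 1.x spreg; HR90/RD90/UE90 are
# columns in the south example dataset, used here for illustration.
import numpy as np
import pysal

db = pysal.open("south.dbf")
y = np.array([db.by_col("HR90")]).T
X = np.array([db.by_col("RD90"), db.by_col("UE90")]).T

# k = 6 nearest neighbors? distance band? the file itself does not say
w = pysal.open("south_k6.gwt").read()

ols = pysal.spreg.OLS(y, X, w=w, spat_diag=True)
```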
Conceptual Framework • separate data source from operations • data source: polygon or coordinate files with standard metadata (projection, origin, etc.) • operations: weights metadata
Web service implementation (OGC WPS) • wraps the PySAL weights module • (re)creates the weights object from information in the wmd file • makes the weights object available as a file
Workflow • wmd file (json) → Weights Parser → PySAL Dispatcher → Weights → Output Metadata
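A sketch of that parse-dispatch-write pipeline, assuming the wmd fields shown earlier and a locally available shapefile (rook/queen only; error handling and shapefile sidecars omitted):

```python
# Sketch of the wmd -> parser -> PySAL dispatcher -> weights pipeline;
# assumes the wmd schema illustrated earlier in this deck.
import json
import pysal

def weights_from_wmd(wmd_path, out_path):
    with open(wmd_path) as f:
        meta = json.load(f)

    # dispatch on the weight type recorded in the metadata
    builders = {
        "rook": pysal.rook_from_shapefile,
        "queen": pysal.queen_from_shapefile,
    }
    w = builders[meta["weight_type"]](meta["input1"]["uri"])

    if "transform" in meta:
        w.transform = meta["transform"]  # e.g. "O" original, "R" row-standardized

    out = pysal.open(out_path, "w")      # a .gal extension selects the GAL writer
    out.write(w)
    out.close()
    return w
```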
Generate Weights from Shapefile • NAT.shp available on server • output format = gal
Get Request • http://spatial.gdta.asu.edu/cgi-bin/wps.cgi?request=Execute&service=WPS&version=1.0.0&identifier=weights_ws&status=false&datainputs=[outputformat=gal;metadata={"input1":{"type":"shp","uri":"http://toae.org/pub/NAT.shp"},"weight_type":"rook","transform":"O","parameters":{"p":2,"k":4}}] • the metadata input (the wmd content) travels inline in the datainputs parameter
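The same call issued programmatically; a sketch only, since whether the live service accepts a URL-encoded datainputs string in exactly this form depends on the WPS deployment.

```python
# Issuing the Execute request from Python; endpoint and parameter
# values are copied from the GET request above.
import json
import requests

metadata = {
    "input1": {"type": "shp", "uri": "http://toae.org/pub/NAT.shp"},
    "weight_type": "rook",
    "transform": "O",
    "parameters": {"p": 2, "k": 4},
}
params = {
    "request": "Execute",
    "service": "WPS",
    "version": "1.0.0",
    "identifier": "weights_ws",
    "status": "false",
    "datainputs": "[outputformat=gal;metadata=%s]" % json.dumps(metadata),
}

r = requests.get("http://spatial.gdta.asu.edu/cgi-bin/wps.cgi", params=params)
print(r.text)  # WPS response pointing at the generated .gal and .wmd files
```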
Sample gal output http://spatial.gdta.asu.edu/wpsoutput/e66df128-14ed-11e3-bde9-0050455c0671.gal
metadata (wmd) file http://spatial.gdta.asu.edu/wpsoutput/e66df128-14ed-11e3-bde9-0050455c0671.wmd
Performance Evaluation • How does PySAL scale when the amount of input data increases? • Is the overhead of web service framework acceptable? • How does the web service framework scale in handling massive concurrent requests?
Scale-up vs. Scale-out • scale-up: single high-end computer • configuration: 2 x 2.93 GHz quad-core Intel Xeon, 16 GB 1066 MHz DDR3 ECC memory, Mac OS X Lion 10.7.4 (11E53) • scale-out: web server cluster
Performance • experiments using a grid layout for N = 10,000 to N = 100,000 • rook contiguity and k nearest neighbors (k = 4) • input shapefiles on a server in Utah, web service on a server at ASU
Experiment 1 • timing: average over 5 runs • measures web server overhead, data transfer and computation • explores the effect of data size
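A rough local harness in the spirit of Experiment 1; it uses PySAL's regular-lattice generator instead of shapefiles, so it isolates the computation component only (no web server overhead or data transfer).

```python
# Timing rook contiguity construction on regular grids, averaged over
# 5 runs as in the slides; measures computation only.
import time
import pysal

for side in (100, 200, 300):          # N = side * side polygons
    runs = []
    for _ in range(5):
        t0 = time.time()
        w = pysal.lat2W(side, side)   # rook weights on a side x side lattice
        runs.append(time.time() - t0)
    print(side * side, sum(runs) / len(runs))
```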
Experiment 2 • scalability of the web service framework • high-end computer (8 cores) vs. cluster (4 computing nodes, 2 cores each) • metrics: total processing time and speedup
Experiment 3 • scalability of the cluster as computing nodes are added • metric: average response time for 128 concurrent requests • dataset: 10,000 polygons
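A sketch of a concurrent-load probe in the spirit of Experiment 3; the actual experiment exercised the weights service, whereas this simplified client fires a lightweight standard GetCapabilities call against the endpoint shown earlier.

```python
# Firing 128 simultaneous requests and averaging the response time;
# the endpoint is the one shown earlier, the payload is simplified.
import time
from multiprocessing.pool import ThreadPool
import requests

URL = ("http://spatial.gdta.asu.edu/cgi-bin/wps.cgi"
       "?request=GetCapabilities&service=WPS")

def timed_get(_):
    t0 = time.time()
    requests.get(URL)
    return time.time() - t0

latencies = ThreadPool(128).map(timed_get, range(128))  # 128 concurrent requests
print("average response time: %.3f s" % (sum(latencies) / len(latencies)))
```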
Towards a Standard • refine specification: flexible, expandable, deal with edge cases • improve performance (parallelization) • implement seek operations on distributed files • interoperability with other software