Explore innovation in census delivery mechanisms, with a focus on geo-linking capabilities. Develop value-added services, such as a cartogram service and a BitTorrent network, for dissemination.
DIaD: Data Integration and Dissemination
May 2009
James.Reid@ed.ac.uk
Background
Who?
• EDINA: a JISC-funded National Data Centre delivering online resources to UK Higher and Further Education
• The ESRC's Geography Data Unit for the Census Programme
What?
• DIaD: an ESRC-funded project aimed at exploring innovation in census delivery mechanisms
• The primary objective of this work is to develop a data dissemination model which demonstrates a more generic capability: that of ‘geo-linking’
Background - What?
• The secondary objective of the work was to develop value-added services exploiting the results of the automated linkage outputs, specifically:
  • A cartogram service
  • A BitTorrent-based network for dissemination of the linked outputs
More on these later... first the rationale...
Background - Why?
• The two most heavily used data sources are the small area statistics provided by the Census Dissemination Unit (CDU) and the digital boundary datasets provided by the Geography Data Unit (UKBORDERS). Together these sources allow end users (notably researchers) to undertake a wide range of analytical and visualisation tasks, from simple choropleth mapping, through cartogram transformations, to detailed small area spatial analyses. Each resource (the statistics on the one hand, the boundary data on the other) is extremely valuable in its own right, but in combination they provide a data resource of almost unparalleled versatility and richness to social science investigators.
Background - Why?
Evidence from a recent ESRC survey of geospatial services and requirements
Source: ESRC interim survey results, March 2009. n=512
How?
• Via open standards* (a la the Open Geospatial Consortium)
• Specifically using:
  • the Geographic Linkage Service (GLS) Specification
  • the Web Feature Service (WFS) Specification
  • the Web Processing Service (WPS) Specification (under investigation)
• Implicitly via use of Open Source Software
* An open standard is a standard that is publicly available and has various rights to use associated with it, and may also have various properties of how it was designed (e.g. open process).
How? Geographic Linkage Service (GLS) Specification Purpose: to provide a simple way to describe and exchange data that contains geographically related information, but which does not include the detailed geometry of the geographic object. A GLS provides a simple standardized way to exchange attribute information that applies to a well-known geospatial dataset known as a Framework dataset. Attribute information delivered from a GLS can be used in a variety of ways, including use by models to perform calculations, or visualization as a web map.
How?
Geographic Linkage Service (GLS) Specification
GLS includes two related sets of operations.
1. GetData - Attribute data is provided to other computers on the network by implementing the GetData (and related) operations. The response to a GetData operation is an XML file, in a format known as GDAS (Geographic Data Attribute Set)*.
2. JoinData - At some other node on the network, another GLS configured for the JoinData operation allows a computer to incorporate the contents of the XML file into a local spatial framework dataset. This local dataset would normally in turn be used to support mapping of this information.
* In early versions of the GLS the specification was split into two separate specifications, one of which was known as the Geographic Data Access Service (GDAS). Subsequent revision integrated the two specifications. Note that GDAS (original) and GDAS (current) are not the same thing!
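The JoinData idea can be sketched in a few lines of Python: attribute rows carry only a key, and the join attaches them to locally held framework features that share that key. This is a minimal illustration of the concept and not the GLS implementation; the function name `join_data`, the `zone_code` key field, the zone codes, the geometries and the population values are all hypothetical.

```python
from typing import Any

Feature = dict[str, Any]


def join_data(framework: dict[str, Feature],
              rows: list[dict[str, Any]],
              key_field: str = "zone_code") -> tuple[dict[str, Feature], list[str]]:
    """Attach attribute rows to framework features that share the same key.

    Returns the joined features plus the keys of any rows that had no
    matching feature in the local framework dataset.
    """
    # Copy each feature so the local framework dataset is left untouched.
    joined = {key: dict(feature) for key, feature in framework.items()}
    unmatched: list[str] = []
    for row in rows:
        key = row[key_field]
        if key not in joined:
            unmatched.append(key)
            continue
        for name, value in row.items():
            if name != key_field:
                joined[key][name] = value
    return joined, unmatched


# Hypothetical local framework dataset: geometry only, keyed by zone code.
framework = {
    "Z001": {"geometry": "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))"},
    "Z002": {"geometry": "POLYGON((1 0, 2 0, 2 1, 1 1, 1 0))"},
}
# Hypothetical attribute rows, as might be read from a GetData response.
rows = [
    {"zone_code": "Z001", "population": 310},
    {"zone_code": "Z002", "population": 275},
    {"zone_code": "Z999", "population": 12},
]

joined, unmatched = join_data(framework, rows)
```

Reporting unmatched keys separately is a design choice of this sketch: it makes visible any statistics that reference zones missing from the local boundary data.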
GML vs GDAS
In comparison to GML, GDAS provides the following specific benefits:
• It is a single logical encoding for attribute data.
• It is extremely light-weight.
• It is optimized for the efficient discovery of vector attributes.
• It includes attributes to support automated mapping, including titles, legends, and the classification of attributes.
• It includes attributes to address the presence of null values in the dataset to facilitate their exclusion from calculations and legends.
• It includes attributes to support the joining of tabular data to geometry in an N:1 or N:N fashion.
• It is easy to validate its content and convert it into HTML or other formats.
• It is easy to manipulate its content and enables the performance of calculations using XSLT.
• It is easy to generate directly from corporate database management technology, using languages such as XQuery.
GDAS in more detail
The GDAS format is designed to support simple as well as rich and complicated attribute databases that may not always be easy to interpret.
• The metadata included in the encoding is designed to ensure that the user knows exactly what the content of the dataset is as well as which spatial framework it references, and has easy access to any associated documentation.
• GDAS is produced in response to a GLS GetData request.
The general structure of the GDAS XML encoding is as follows.
<GDAS>
  <Framework>
    ... [spatial framework metadata]
    <Dataset>
      ... [attribute dataset metadata]
      <Attribute>
        ... [attribute metadata]
      </Attribute>
      <Rowset>
        ... [attribute data]
      </Rowset>
    </Dataset>
  </Framework>
</GDAS>
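To show how a client might consume this nesting, here is a rough sketch using Python's standard `xml.etree.ElementTree`. Only the `GDAS`, `Framework`, `Dataset`, `Attribute` and `Rowset` element names come from the outline above. The `Row`, `K` and `V` elements, the `name` attribute and every sample value are assumptions made for illustration and should be checked against the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Row/K/V and the "name" attribute are
# illustrative assumptions, not taken from the GLS specification.
SAMPLE = """\
<GDAS>
  <Framework>
    <Dataset>
      <Attribute name="population"/>
      <Rowset>
        <Row><K>Z001</K><V>310</V></Row>
        <Row><K>Z002</K><V>275</V></Row>
      </Rowset>
    </Dataset>
  </Framework>
</GDAS>
"""


def read_rows(gdas_xml: str) -> dict[str, dict[str, str]]:
    """Return {key: {attribute name: value}} for the first Dataset found."""
    root = ET.fromstring(gdas_xml)  # root is the <GDAS> element
    dataset = root.find("./Framework/Dataset")
    if dataset is None:
        raise ValueError("no Framework/Dataset element found")
    # Attribute metadata gives the column names, in order.
    names = [attr.get("name", "") for attr in dataset.findall("Attribute")]
    table: dict[str, dict[str, str]] = {}
    for row in dataset.findall("./Rowset/Row"):
        key = row.findtext("K", default="")
        values = [v.text or "" for v in row.findall("V")]
        table[key] = dict(zip(names, values))
    return table


table = read_rows(SAMPLE)
```

Values are left as strings here; a real client would presumably use the attribute metadata to decide how to type them and how to treat nulls.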
Value Added Services (1) Cartograms
A Cartogram Generation Service
“Cartograms represent map feature surfaces in such a way, as to make them proportional to a given statistical variable. This representation method mostly derives from "classical" maps (i.e., maps representing ground topography) in the sense that the transformation can only be processed on an already given geometry. Topographical polygon layers are thus mostly used as a starting point for the production of any cartogram.” - ScapeToad
In reality it looks more like this...
• Uses the excellent ScapeToad code at the back end to generate and output cartograms (chorogram.choros.ch/scapetoad)
• Uses the Gastner/Newman algorithm
Value Added Services (1) Cartograms
We have developed a simple Cartogram Generation Service which takes a number of parameters, some of which have default values. The parameters mimic those of ScapeToad's API.
layer=england_oa_2001
attribute=population
attrType=mass|density (a mass, e.g. a population or a wealth, is measured or estimated over the whole surface of each polygon; a density can be a mass:mass ratio or a mass:surface ratio)
url= (example - must be encoded) http://diad.edina.ac.uk/service/joinedData?dataset=dataset_name
quality=50
grid=true
rows=100
Example request:
http://a.website.ac.uk/service/cartogram?layer=england_oa_2001&attribute=ks0080001&url=http://anothersite.ac.uk/test_1240418191235.zip
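Because the `url` parameter is itself a URL with its own query string, it has to be percent-encoded before it is embedded in the request. The sketch below builds such a request with Python's standard `urllib.parse`. The parameter names and the base address are those shown on the slide. The helper name `cartogram_url` and the choice of which parameters receive defaults are assumptions, since the slide does not say which values the service defaults.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def cartogram_url(base: str, layer: str, attribute: str, data_url: str,
                  attr_type: str = "mass", quality: int = 50,
                  grid: bool = True, rows: int = 100) -> str:
    """Build a cartogram request URL with every parameter percent-encoded."""
    if attr_type not in ("mass", "density"):
        raise ValueError("attrType must be 'mass' or 'density'")
    params = {
        "layer": layer,
        "attribute": attribute,
        "attrType": attr_type,
        "url": data_url,  # urlencode escapes ':', '/', '?' and '=' here
        "quality": quality,
        "grid": "true" if grid else "false",
        "rows": rows,
    }
    return base + "?" + urlencode(params)


data_url = "http://diad.edina.ac.uk/service/joinedData?dataset=dataset_name"
request = cartogram_url(
    "http://a.website.ac.uk/service/cartogram",
    layer="england_oa_2001",
    attribute="population",
    data_url=data_url,
)

# Decoding the query string recovers the original nested URL unchanged.
decoded = parse_qs(urlsplit(request).query)
```

Without the encoding, the `?` and `=` inside the nested address would be read as part of the outer query string.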
Value Added Services (1) Cartograms
Worldmapper example: Age of Death (http://www.worldmapper.org/)
DIaD-generated example: Deprivation (ONS) Income scores, Swindon
Value Added Services (2) BitTorrent
• A peer-to-peer file sharing protocol used for distributing large amounts of data. BitTorrent is one of the most common protocols for transferring large files, and by some estimates it accounts for about 35% of all traffic on the entire Internet.
• The protocol works initially when a file provider makes a file (or group of files) available to the network. This is called a seed and allows others, named peers, to connect and download the file. Each peer that downloads a part of the data makes it available to other peers to download. After the file is successfully downloaded by a peer, many continue to make the data available, becoming additional seeds.
• This distributed nature of BitTorrent leads to a viral spreading of a file throughout peers. As more seeds get added, the likelihood of a successful connection increases. Relative to standard Internet hosting, this provides a significant reduction in the original distributor's hardware and bandwidth resource costs.
• Provides redundancy against system problems and reduces dependence on the original distributor.
Value Added Services (2) BitTorrent
• A BitTorrent creation service will be added to the linked outputs and a tracker established to allow geolinked results to be downloaded via a p2p client, e.g. uTorrent
• Rationale: increasingly researchers/students want access to resources from home
• Home machines tend to have lower bandwidth than those connected directly to the SuperJANET backbone
• So the BitTorrent approach means users can 'share the load' of large files
• Note that boundary data + statistics data can produce quite large file sizes (GB vs MB)
Whither open source?
• Our demo client uses a front and back-end stack of OSS:
  • OpenLayers
  • PostGIS
  • GeoServer
  • OGR
  • ScapeToad
• Our own code for the GLS is (will be) open source
General Observations
• Open standards (and OSS) have a definite role but...
  • They are not an end in themselves
  • They are not always as mature (or static) as you might wish
  • Things evolve - often in short time periods
  • Users (!)
• Interoperability (the holy grail) is possible but there are significant barriers:
  • A&A issues (UKAMF + web services - GeoXACML?)
  • Scalability (Cloud/Grid ??)
  • Evolving delivery paradigms e.g. mobile
  • User expectations vs resourcing constraints