This talk outlines the requirements, architectural principles, and importance of standards in building distributed statistical systems. It also discusses the efficiency and communication overhead challenges faced in such systems.
Building Distributed Statistical Services using XML Jostein Ryssevik RSS Statistical Computing / ASC Joint Event System Architecture for Statistical Applications 25th Jan 2006 RYSSEVIK RESEARCH
Outline • This talk is mostly about distributed access to statistical data and processes • Requirements for distributed systems (and how they differ from the requirements for stand-alone systems) • Architectural principles of distributed systems in general and how they apply to statistical software • The emerging role of Web-services and registries • The importance of standards (like DDI and SDMX) for distributed systems • The general lack of standardization in the area of statistics (as compared to related areas like GIS) and how this is holding back development • ...examples taken from Nesstar (and other systems).
Requirements • Scalability • The ability to add nodes and users to the system without affecting the stability, performance or user experience • “Organic growth” – the ability to add nodes and functionality to the system without reconfiguration or central intervention • Openness and interoperability • Based on well-defined and fully documented open standards • Shallow standardization (think about the success of e-mail and the Web) • Efficiency • Processing speed. Distributed systems are multi-user systems. Any single operation taking more than a split second is not only holding up the requesting user, it is creating a queue • Fault-tolerance • No single point of failure • Resource location • Mechanisms that allow for efficient location of service providers as well as resources distributed across service providers • Web-enabled • Providing platform-independent access through standard Web browsers • Using the standard lightweight protocols of the Web
Efficiency • Efficient processing: • Users of desktop systems can easily tolerate response times measured in seconds, mainly because the response times are predictable. • In a distributed system any process requiring more than a split second of CPU time is likely to generate a queue, and in effect longer and unpredictable response times. • Efficient use of memory: • On a desktop system memory allocation between applications can normally be controlled by the user. • In a distributed system parallel processes invoked by parallel users compete for the same fixed amount of memory. • This has consequences for the amount of data that can be held in memory, as well as the length of time data stays in memory. • Efficient communication: • On a desktop system the communication overhead is insignificant. • In a distributed system information is transported between system layers and across the wire. Input/output operations as well as transport are therefore crucial.
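The queueing argument on this slide can be made concrete with a small worked example. The sketch below is not taken from the talk: it assumes a single processing unit serving requests first-come-first-served, requests arriving at a fixed interval, and a fixed CPU time per request. The function name and the chosen numbers are illustrative only.

```python
def response_times(arrival_interval, service_time, n_requests):
    """Response time per request on one FIFO server (assumed model)."""
    server_free_at = 0.0
    times = []
    for k in range(n_requests):
        arrival = k * arrival_interval          # request k arrives here
        start = max(arrival, server_free_at)    # wait if the server is busy
        server_free_at = start + service_time   # server is busy until then
        times.append(server_free_at - arrival)  # waiting time + CPU time
    return times

# One request every 0.5 s; the operation needs 0.25 s or 1.0 s of CPU time.
fast = response_times(0.5, 0.25, 5)
slow = response_times(0.5, 1.0, 5)
print(fast)  # [0.25, 0.25, 0.25, 0.25, 0.25]
print(slow)  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

Under these assumptions the split-second operation keeps a constant response time, while the one-second operation makes every later user wait longer than the one before.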
Efficiency – benchmark examples • Dataset 1: 21 variables x 24,495 cases (double precision matrix = approx. 4 MB) • Dataset 2: 1,614 variables x 30,227 cases (double precision matrix = approx. 390 MB)
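The approximate sizes quoted for the two benchmark datasets follow from simple arithmetic, assuming 8 bytes per double-precision value and 1 MB = 1,000,000 bytes (both assumptions, since the slide does not state them). A minimal check:

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision (assumed storage type)

def matrix_megabytes(n_variables, n_cases):
    """Size of a dense variables-by-cases matrix of doubles, in MB."""
    return n_variables * n_cases * BYTES_PER_DOUBLE / 1_000_000

dataset1_mb = matrix_megabytes(21, 24_495)
dataset2_mb = matrix_megabytes(1_614, 30_227)
print(f"Dataset 1: {dataset1_mb:.1f} MB")  # Dataset 1: 4.1 MB
print(f"Dataset 2: {dataset2_mb:.1f} MB")  # Dataset 2: 390.3 MB
```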
Efficiency - lessons learned • Disk versus RAM: • Loading all the data into memory is an easy way to increase the efficiency of a desktop system. • For multi-user systems, holding entire datasets in memory is not a viable option. • Focus must be put on • How to get the relevant parts of the data from disk into memory quickly enough • How to speed up the processing in order to reduce the period data has to stay in memory • How to release memory quickly enough • Ways of reducing data load times: • Read/load only the parts of the data that are required by the query. • Invert the data matrix • Index the data matrix • Intelligent caching • Keep metadata in memory • Data processing versus communication overhead • When optimising statistical systems, too much effort is spent on improving the algorithms. • At least in multi-user systems, data processing (post data loading) represents a small fraction of the overall response time. • The largest fraction is represented by the communication overhead.
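One way to read "invert the data matrix" is to store the data variable by variable rather than case by case, so that a query touching one variable loads only that variable from disk. The sketch below illustrates that reading with one binary file per variable; the file layout, function names and sample values are assumptions for illustration, not the storage format of Nesstar or any other system.

```python
import os
import tempfile
from array import array

def write_inverted(rows, names, directory):
    """Store each variable (column) in its own file of doubles."""
    for j, name in enumerate(names):
        column = array("d", (row[j] for row in rows))
        with open(os.path.join(directory, name + ".col"), "wb") as fh:
            column.tofile(fh)

def read_variable(directory, name, n_cases):
    """Load a single variable without touching the other columns."""
    column = array("d")
    with open(os.path.join(directory, name + ".col"), "rb") as fh:
        column.fromfile(fh, n_cases)
    return column

names = ["age", "income", "weight"]
rows = [(34.0, 41000.0, 1.2), (51.0, 38500.0, 0.8), (27.0, 52000.0, 1.0)]

with tempfile.TemporaryDirectory() as tmp:
    write_inverted(rows, names, tmp)
    age = read_variable(tmp, "age", len(rows))  # only one file is opened
    mean_age = sum(age) / len(age)
    values_loaded = len(age)                    # 3 values instead of 9
```

A query for the mean of one variable then loads a third of the values in this toy dataset; for the 1,614-variable benchmark dataset the saving would be correspondingly larger.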
Communication overhead • Steps in the execution of a remote query (simplified): • Sending the query from the client to the server • Parsing the query • Sending the message to the relevant data processing unit • Retrieving and loading the relevant data • Executing the analysis • Writing the output (normally as XML) • Passing the XML to the Web-application • Creating Web-pages of the output (HTML) • Sending the HTML over the net to the client's browser • Parsing and displaying the Web-page in the browser • Ways of reducing the communication overhead • Don’t send data, send outputs. Always do the processing as close to the data as possible. • Use XML, but only when it makes sense for interoperability and other important reasons. • XML zips very well • Don’t carry a bigger load than necessary. Create intelligent data/metadata objects that can call “home” for supplementary information on demand.
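The remark that XML zips very well is easy to check on a repetitive tabular output. The sketch below builds a hypothetical XML table (the element and attribute names are invented for illustration) and compares its size before and after gzip compression using only the Python standard library. The exact byte counts depend on the content, so none are claimed here.

```python
import gzip
import xml.etree.ElementTree as ET

def build_table_xml(n_rows, n_cols):
    """Serialise a small table as verbose, repetitive XML (made-up schema)."""
    table = ET.Element("table")
    for r in range(n_rows):
        row = ET.SubElement(table, "row", {"id": str(r)})
        for c in range(n_cols):
            cell = ET.SubElement(row, "cell", {"col": str(c)})
            cell.text = str((r * n_cols + c) % 97)  # arbitrary cell values
    return ET.tostring(table, encoding="utf-8")

raw = build_table_xml(50, 10)
packed = gzip.compress(raw)          # what would travel over the wire
restored = gzip.decompress(packed)   # what the receiver reconstructs
print(len(raw), len(packed))
```

Because the markup repeats the same tag and attribute names for every cell, the compressed message is a small fraction of the original, which reduces the transport part of the overhead but not the parsing part.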
Nesstar system architecture • End-user client: standard JavaScript-enabled Web browser (HTML over HTTP) • Nesstar Web engine (Tomcat): Web Client Application, object cache, proxy objects • NEOOM protocol (XML/RDF over HTTP) • Web Server/Container (Tomcat): HTTP Interface Servlet, RDF Class Interface Definitions, BridgeRemote Bean • J2EE compliant EJB Container (JBoss): LocalBean, Persistence manager (MVCSoft) • Metadata database: Oracle/MySQL
Characteristics of the Nesstar architecture • Fully distributed architecture with no central server: • Meeting the requirement of fault-tolerance, but scalability only up to a certain level • All statistical objects “live” at a URL • Integrated by NEOOM (the Nesstar Object Oriented Middleware) – a pre-SOAP and pre-Web Services integration framework. • Interface Description Language based on RDF/XML (resembling WSDL in Web Services): • Self-describing objects – when a client accesses the URL of the object, the object returns a description of its current state (and its available methods) • Remote Procedure Calls: RDF/XML messages over HTTP (resembling SOAP in Web Services): • The calls can be stored as URLs, specifying the location of the relevant object as well as the method parameters. • This allows for client-side storage of statistical operations that can easily be rerun at a later stage, thereby creating a simple batch language for operations on remote statistical objects. • Metadata Object Model based on DDI: • But extended to integrate essential parts of ISO 11179 • Resource location done through cross-server searching organized by a single Nesstar Web-engine • No concept of a central index or registry in the original architecture • A Nesstar Registry service is currently under development • A single point of reference for information about Nesstar servers, their services and resources • Allowing server owners to register their servers • Automatic harvesting of metadata into a central index
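The idea of storing a remote call as a URL, and replaying a list of such URLs as a simple batch, can be sketched as follows. The URL layout, the `method` query parameter and the example host are hypothetical; the slide does not document the actual NEOOM encoding, so this only illustrates the principle.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

def call_to_url(object_url, method, **params):
    """Encode 'which object, which method, which parameters' as one URL."""
    query = urlencode([("method", method)] + sorted(params.items()))
    return f"{object_url}?{query}"

def url_to_call(url):
    """Recover object location, method name and parameters from a stored URL."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)
    method = dict(pairs)["method"]
    params = {key: value for key, value in pairs if key != "method"}
    object_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return object_url, method, params

# A stored "batch": two operations on the same (hypothetical) remote dataset.
dataset = "http://stats.example.org/obj/dataset/42"
batch = [
    call_to_url(dataset, "tabulate", row="sex", column="region"),
    call_to_url(dataset, "frequencies", variable="age"),
]
for stored in batch:
    print(url_to_call(stored))
```

Since each stored line is self-contained, a client could save the list in a plain text file and resend the requests later without keeping any other state.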
The essential part of this story • A well-defined namespace: • Statistical objects are uniquely identified by a universal naming system, a URL. • Self-describing objects: • Point a browser to the object's URI and a description of the object will be returned (including a description of the methods that can be used to access/manipulate the object). • A protocol for performing actions on the statistical objects: • XML over HTTP • A cataloguing service allowing for registration and location of statistical objects and services • A metadata registry
Web Services (first generation) • 1. Communication protocol: SOAP (Simple Object Access Protocol) • 2. Interface description: WSDL (Web Services Description Language) • 3. Resource location mechanism, registry or cataloguing service: UDDI (Universal Description, Discovery & Integration) • Relationships: UDDI is accessed using SOAP; UDDI enables discovery of Web services; WSDL describes Web services; WSDL binds to SOAP; SOAP enables communication between Web services • Second-generation WS adds • security • orchestration • etc.
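To give a feel for the communication-protocol layer, the sketch below builds a SOAP 1.1 request envelope with the Python standard library and reads the operation back out of it. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `Frequencies` operation and its parameters are hypothetical and do not describe any real statistical service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
STAT_NS = "http://example.org/stat"                    # hypothetical service

def build_request(operation, params):
    """Wrap one operation and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{STAT_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{STAT_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

message = build_request("Frequencies", {"dataset": "survey2005", "variable": "age"})

# The receiving side locates the operation inside the Body element.
parsed = ET.fromstring(message)
operation = parsed.find(f"{{{SOAP_NS}}}Body")[0]
print(operation.tag, [child.text for child in operation])
```

In a full Web Services setup the shape of such a message would be dictated by the service's WSDL description rather than written by hand.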
The importance of Registries • Enable discovery of services and resources • Provide a single point of access for end users or software agents to look for relevant resources and services • Describe the services and the resources in enough detail for the end user or software system to access them. • Provide the address where the service and the resource are located. • Normally include subscription services notifying users or software systems when new services or resources are made available or updated. • Registries hold metadata, not data • Registries provide service or resource providers with a publishing mechanism • Metadata can be published by the provider to a registry (push) • ...or metadata can be harvested by the Registry, following agreed policies between the providers and the Registry. • Registries are an important part of the Web Service model, but are used in other technical environments as well • UDDI (normally used in Web Service registries). • An alternative is the ebXML Registry (an OASIS standard)
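The registry roles listed on this slide (push publishing, harvesting, discovery, and subscription notices) can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the UDDI or ebXML Registry interface: the class names, record fields and example URLs are all invented.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceRecord:
    """Metadata about a resource; the data itself stays with the provider."""
    url: str
    title: str
    keywords: frozenset

@dataclass
class Registry:
    records: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def publish(self, record):
        """Push: a provider registers metadata; subscribers are notified."""
        self.records[record.url] = record
        for notify in self.subscribers:
            notify(record)

    def harvest(self, provider_catalogue):
        """Pull: the registry collects metadata from a provider's catalogue."""
        for record in provider_catalogue:
            self.publish(record)

    def find(self, keyword):
        """Discovery: return the addresses of matching resources."""
        return sorted(r.url for r in self.records.values() if keyword in r.keywords)

registry = Registry()
notifications = []
registry.subscribers.append(lambda rec: notifications.append(rec.title))

registry.publish(ResourceRecord("http://a.example.org/ds/1", "Labour force survey",
                                frozenset({"labour", "survey"})))
registry.harvest([ResourceRecord("http://b.example.org/ds/7", "Household survey",
                                 frozenset({"household", "survey"}))])
hits = registry.find("survey")
print(hits)
```

The registry answers "where is it and what is it" only; a client still has to contact the returned address to run an analysis, which keeps the registry small and the data under the provider's control.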
Distributed computing using WS • [Diagram: four applications, each exposing a Service Interface (WSDL) and communicating over SOAP, with a UDDI/ebXML Registry used to locate the services]
Web Services & statistical systems • Non-existent? • Extensive googling combining the term “Web Services” with terms like “statistical” & “statistics” returns few relevant hits. • Exceptions: • SAS BI Web Services • SDMX – Statistical Data and Metadata Exchange Initiative • Others?
SDMX – Statistical Data and Metadata Exchange Initiative • Sponsored by: • BIS, ECB, Eurostat, UN, OECD, IMF, World Bank • Objectives: • Develop standards for efficient exchange of statistical data • Focus on aggregated statistics and time series, typically the type of data that travels from NSOs to international statistical organizations • Modelled in UML, represented as XML Schemas • Version 2.0 released November 2005. • Relevance: • Metadata standards like SDMX are essential for distributed statistical systems • SDMX is being taken up rapidly among its sponsors and beyond • SDMX 2.0 includes a specification for the structure and logical interfaces of an SDMX registry (SDMX RR) • A prototype/demonstration registry is already implemented • A demonstration of how SDMX-ML data/metadata, the Registry and standard Web-services can work together to establish new ways of accessing, processing and analysing statistical data is available on the SDMX web-site. • Testing and implementation work already underway in OECD, Eurostat (SODI = SDMX Open Data Interchange), Federal Reserve (US) and others
DDI – Data Documentation Initiative • Member organizations: • Data archives, universities, some data producers (StatCan, World Bank), SPSS, Nesstar • Objectives: • Develop standards for efficient exchange of statistical data • Focus on survey data, but lately extended to support aggregated data as well. • Version 2.0 released 2003 as an XML DTD; a Schema was added later • Version 3.0 in the making, modelled using UML, represented as an XML Schema. • Relevance: • Fast uptake among data archives, growing acceptance among data producers. • StatCan & HealthCan • World Bank and the International Household Survey Network • WHO and the World Health Surveys • A growing number of tools and systems support and build on the DDI • Nesstar, IHSN’s Micro Data Management Toolkit • SDMX and DDI (along with Triple-S) are for the time being the two most ambitious attempts to create standardization in the world of statistics • Both are user driven • Radically different in approach from vendor-driven standards like the SPSS MR model
WHY standardisation? • Standards are essential for distributed systems • Standards lower the costs of entering the technology game and promote development and competition • Standards remove big chunks of technology from the competitive part of the game and allow vendors and developers to focus on new and additional functionality.
Looking over the fence.... • An area like GIS (Geographical Information Systems) has been far more successful in implementing and accepting standards than the statistical world. • OGC (Open Geospatial Consortium): Supported by the user community as well as the majority of vendors. • Developed GML – an XML standard for structuring geographical data and metadata (the parallel to SDMX and DDI) • Developed standards like WMS (Web Map Service) and WFS (Web Feature Service), which basically are standards describing minimum sets of functionality for standard GIS systems as well as external interfaces and communication protocols for this functionality. • ...and their standards are supported by the majority of vendors. • In practice this has created a situation where software from different vendors (as well as open source software) can easily be mixed and swapped in and out without making changes to the overall system. • In a statistical information system this would be parallel to swapping between different statistical engines without making any changes to the underlying data store or user front ends.
Concluding questions • Why is GIS so far ahead of SIS when it comes to standardisation? • What are the consequences of this for software development within the field of statistics? • What would it take to establish the level of standardisation in SIS as we see in GIS?