Efficiency and Reliability of the Transit Data Lifecycle • A study of multimodal migration, storage, and retrieval techniques for public transit data Presented by: Matthew Ahrens Faculty Mentor: Dr. Uma Shama
Overview- Background • GeoGraphics Lab • Maintain public transit data for Regional Transit Authorities (RTAs) in the Commonwealth of Massachusetts. • Services • Digitizing of static schedule data • Dynamic and real-time vehicle location data • Consultation and expert advice role
Overview- Background • This project • Interdisciplinary between Mathematics and Computer Science • Focus on real-world / business applications of data analysis • Time Span • Spring 2013 • exploratory analysis • Summer 2013 and ATP summer grant • Modeling experiments • Fall 2013 • Implementation and integration
Overview- Background • This project – cont. • Evolved through several iterations • Original Purpose: Spatial analysis on ridership and vehicle location data • Four issues emerged that shifted the project's focus over time • 1. Concepts were unclear among authorities • 2. Data collection tools were inconsistent for historical analysis purposes • 3. Development on systems affected core features • 4. Documentation for systems lived only in code, with no clear point of injection
Overview- Outline • Four sections • Abstraction and modeling of transit data • Analysis of design patterns and algorithms with comparison to existing systems • The design and implementation of a context free data model • The design and implementation of a multimodal, application-level interface
Abstraction • Research Questions • How can the different transit data protocols be described to compromise between conflicting definitions and structures? • Is there a compromise that can be reached that is still purposeful and clear? • Purpose • Comparison of three authorities • GTFS / GTFS-realtime • TCIP • Proprietary (various).
Abstraction • GTFS Example • Pros: • Descriptive, data type or storage inclusive. • Separates fields required for definition from optional metadata • Cons: • Written from the perspective of the transit user • Many definitions do not have explicit relationships
Abstraction • GTFS-Realtime Example • Pros: • Descriptive, data type or storage inclusive. • Separates fields required for definition from optional metadata • Cons: • Defined as a feed, with no distinction or limitation of rate • Optional fields not purposeful for minimum definition or structure.
Abstraction • TCIP Example • Pros: • Complete, covers every aspect of transit • Cons: • Vague • Concerned with relationships between data systems • Specifies medium over message, requires XML/XSD format but does not clearly define data elements
Abstraction • Proprietary Example - ESRI • Pros: • Shows relationships between geospatial definitions • Standard Leader for GIS protocols (GML, OpenGeo) • Cons: • Concerned with GIS and use definitions over technical definitions • Missing most transit data concepts
Abstraction • Methodology • Create an understandable, unambiguous definition for common transit concepts • Use as few primitives as possible to ease implementation • Use composition to aggregate data • Two options considered • Define an object-method relationship • Define a set-theoretical model of transit data structures
Abstraction • Methodology • Remove implementation-specific and use-specific context from transit data structures • Find minimum required composition • Acknowledge commonly attributed metadata • Define data by production mechanism rate
Abstraction • Disambiguation • Real-time • Produced frequently in real-time • Best represented as a signal or a message stream • Dynamic • Infrequent but unknown rate of production • Best represented as a feed • Static • Infrequent, known interval rate of production • File system or other static resource
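The three categories above can be read as a decision on production rate. The following is a minimal sketch of that reading, not the project's own code; the `Source` fields and the `classify` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProductionClass(Enum):
    REAL_TIME = "real-time"  # frequent: treat as a signal or message stream
    DYNAMIC = "dynamic"      # infrequent, unknown rate: treat as a feed
    STATIC = "static"        # infrequent, known interval: file or static resource


@dataclass
class Source:
    # Hypothetical description of a data producer.
    name: str
    frequent: bool                     # True when data is produced continuously
    known_interval_s: Optional[float]  # None when the production rate is unknown


def classify(source: Source) -> ProductionClass:
    """Map a producer to one of the three categories by its production rate."""
    if source.frequent:
        return ProductionClass.REAL_TIME
    if source.known_interval_s is None:
        return ProductionClass.DYNAMIC
    return ProductionClass.STATIC


avl_class = classify(Source("vehicle locations", True, None))
alert_class = classify(Source("service alerts", False, None))
schedule_class = classify(Source("published schedule", False, 90 * 24 * 3600.0))
```

Which real data sets fall into which category is a judgment for each authority; the three example sources are assumptions.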
Abstraction • Results • Data flow model influenced the decision
Abstraction • Results • Set Theoretical Model • Description • Define implementation independent definition of primitives • Compose transit data structure from those primitives • Define complex data structures as supersets of simple structures
Abstraction • Commonly used examples • Primitives • Geolocation • Datetime • Unique, Index-friendly ID (numeric, simple text) • Simple structure • Stop • Trip • Composite Structures • AVL • ETA
Abstraction • Composition Example
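One way to picture the composition of primitives into simple and composite structures is with plain data classes. This is a sketch under stated assumptions: the slides name the primitives (geolocation, datetime, ID) and the structures (Stop, Trip, AVL, ETA), but the specific fields chosen for each structure below are guesses made for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class Geolocation:          # primitive
    lat: float
    lon: float


@dataclass(frozen=True)
class Stop:                 # simple structure: ID + geolocation (assumed fields)
    stop_id: int
    location: Geolocation


@dataclass(frozen=True)
class Trip:                 # simple structure: ID + datetime (assumed fields)
    trip_id: int
    start: datetime


@dataclass(frozen=True)
class AVL:                  # composite: vehicle ID + geolocation + datetime
    vehicle_id: int
    location: Geolocation
    observed_at: datetime


@dataclass(frozen=True)
class ETA:                  # composite built from simple structures
    stop: Stop
    trip: Trip
    expected_at: datetime


PRIMITIVES = {Geolocation, datetime, int}


def primitives_of(cls) -> set:
    """Return the set of primitive types a structure is composed from."""
    if cls in PRIMITIVES:
        return {cls}
    found = set()
    for field in fields(cls):
        found |= primitives_of(field.type)
    return found


# A composite structure is a superset of the simple structures it contains.
eta_covers_stop = primitives_of(ETA) >= primitives_of(Stop)
```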
Data Migration • Research Questions • What technologies, techniques, or models most efficiently and reliably move transit data from producer to consumer? • Which of those best embody the concepts of reuse and extensibility? • Which ones best resist the need for modification and internal maintenance?
Data Migration • Purpose • Perform exploratory work to set standards for handling data in transit
Data Migration • Methodology • Study of BusLocator: current data migration technology for AVL and route-specific data • Duplication of Timer-event concurrency model for real-time data • Pull design pattern vs. Push design pattern • Approximation Algorithms
Data Migration • BusLocator • C# Microsoft Solution in two parts • Windows Service using Timer-event concurrency • Pulls AVL data every 30 minutes • Pulls route data every 5 minutes • Sends via SOAP to WCF service • WCF • Webservice endpoint • Accepts data • Parses and stores in SQL tables
Data Migration • Graphical Depiction
Data Migration • Major bottlenecks • Event timer • Problems • Pulls too slowly to deliver real-time data for real-time consumption • Pulls over a fixed time frame, so duplicates are sent over the wire • Does not scale or load balance • SOAP XML message is large, metadata heavy • Not optimal for real-time
Data Migration • Effort to duplicate for ETA • Pull from ETA feed as REST service via XML
Data Migration • Effort to duplicate for ETA • Purposes • Analytical use of AVL data as static resource, not real-time • Made easier to organize by set-theory model • Able to composite ETA from other sources • Able to automate analysis
Data Migration • Effort to duplicate for ETA • Problems • AVL not complete for historical use • Led to development of clear definition of AVL and other transit data structures • Showed need for new system • Replace BusLocator • Define development framework for transit applications • Eliminate pull or approximate push design pattern
Data Migration • Pull vs. Push • Pull design pattern • A.k.a. Request-response, on-demand • Client (unknown) sends request to Server/Source (known) • Server processes and responds • Push design pattern • Subscription pattern • Client establishes connection to Server • Server pushes response to client upon local event
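The difference between the two patterns can be sketched in a few lines. This is an in-process illustration only, with no networking; the `Source` class and its method names are hypothetical and do not come from BusLocator or any other system mentioned here.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class Source:
    """A data producer that supports both consumption patterns."""

    def __init__(self) -> None:
        self._latest: Record = {}
        self._subscribers: List[Callable[[Record], None]] = []

    def request_latest(self) -> Record:
        # Pull: the client asks, the source answers with whatever it holds now.
        return dict(self._latest)

    def subscribe(self, callback: Callable[[Record], None]) -> None:
        # Push: the client registers once and is called on every local event.
        self._subscribers.append(callback)

    def produce(self, record: Record) -> None:
        # A local event: store the record and notify every subscriber.
        self._latest = record
        for callback in self._subscribers:
            callback(record)


source = Source()
received: List[Record] = []
source.subscribe(received.append)

source.produce({"vehicle": 1, "lat": 41.99, "lon": -70.97})
source.produce({"vehicle": 1, "lat": 42.00, "lon": -70.96})

pulled = source.request_latest()
```

The subscriber receives both records as they are produced, while a single pull afterwards sees only the most recent one.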
Data Migration • Pull best use cases • When data is not consumed as a stream • Need the most recent data once or on demand • Example
Data Migration • Push approximating • Push is appropriate for real-time produced data • Goal • minimize time between production and availability for use • Problem • Push not supported by all web communication • Solution • Pull approximation
Data Migration • Approximation 1: timer event approximation • Goal • Predict the rate of production using historical data • Method • Exponential Moving Average • Use previous history and predictions to make future predictions • Keep track of the average interval between data updates • Take proportion of history for accuracy • Take proportion of predictions for smoothing
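An exponential moving average blends the most recent observed interval with the previous prediction: next = alpha * observed + (1 - alpha) * previous. The sketch below applies that to the interval between data updates. The class name, the `alpha` weight, and the starting interval are assumptions for illustration; the slides do not state the values used in the study.

```python
from typing import Optional


class IntervalPredictor:
    """Predict the next update interval with an exponential moving average."""

    def __init__(self, alpha: float, initial_interval: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha                  # weight given to observed history
        self.prediction = initial_interval  # current predicted interval
        self._last_update: Optional[float] = None

    def observe(self, update_time: float) -> float:
        """Record the time of a data update and return the new prediction."""
        if self._last_update is not None:
            interval = update_time - self._last_update
            # Proportion of history for accuracy, of prediction for smoothing.
            self.prediction = (
                self.alpha * interval + (1.0 - self.alpha) * self.prediction
            )
        self._last_update = update_time
        return self.prediction


predictor = IntervalPredictor(alpha=0.5, initial_interval=10.0)
first = predictor.observe(0.0)    # no interval yet, prediction unchanged
second = predictor.observe(20.0)  # 0.5 * 20 + 0.5 * 10
third = predictor.observe(40.0)   # 0.5 * 20 + 0.5 * 15
```

A timer that polls the source would sleep for `predictor.prediction` between pulls, so the polling rate drifts toward the observed production rate.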
Data Migration • Exponential Moving Average example • Real data hard to monitor, simulation was created • Simulate 10 vehicles • 10% chance of packet drop • Measurement criteria • Minimize difference between production time and consumption time • Minimize redundant data packets • Minimize dropped packets
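A simulation along these lines might look like the following sketch. Only the ten vehicles and the 10% drop chance come from the slide; the fixed production interval, the staggered vehicle offsets, the run length, and the function name `simulate` are all assumptions, and this version polls at a fixed interval rather than an adaptive one.

```python
import random


def simulate(poll_interval: int, produce_interval: int = 10, vehicles: int = 10,
             drop_chance: float = 0.10, duration: int = 1000, seed: int = 0) -> dict:
    """Poll staggered vehicles on a timer and tally the measurement criteria."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    last_seen = [None] * vehicles
    lags = []
    counts = {"dropped": 0, "redundant": 0, "empty": 0, "fresh": 0}
    for now in range(0, duration, poll_interval):
        for v in range(vehicles):
            if rng.random() < drop_chance:
                counts["dropped"] += 1      # packet lost on the wire
                continue
            offset = v % produce_interval   # vehicle v first reports at t = offset
            if now < offset:
                counts["empty"] += 1        # nothing produced yet
                continue
            # Production time of the newest record available at `now`.
            produced = offset + ((now - offset) // produce_interval) * produce_interval
            if produced == last_seen[v]:
                counts["redundant"] += 1    # same record delivered again
            else:
                counts["fresh"] += 1
                lags.append(now - produced)
                last_seen[v] = produced
    counts["mean_lag"] = sum(lags) / len(lags) if lags else 0.0
    counts["polls"] = len(range(0, duration, poll_interval)) * vehicles
    return counts


fast = simulate(poll_interval=5)      # polls twice per production interval
matched = simulate(poll_interval=10)  # polls once per production interval
```

Polling faster than the production rate lowers the lag but delivers redundant packets, which is the trade-off the measurement criteria are meant to expose.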
Data Migration • Exponential Moving Average example • Cache free model was developed • Emulating current system • Adaptable to batch query and changing vehicle configuration • Measure average previous interval
Data Migration • Exponential Moving Average example • Pseudocode
Data Migration • Exponential Moving Average example • Results
Implementation: GLaaS Model and API • Goals • Taking the knowledge gained so far, implement and document a framework that exhibits best practices • Avoid anti-patterns • Choose the best medium for the job • Separate data, metadata, and implementation data • Keep business logic separate from data management • Migrate data near production rate • Multimodal retrieval and consumption mechanisms
Implementation: GLaaS Model and API • Considerations • Security • Closed Pipe vs. Open Pipe • Authentication • Access level • Differential Privacy • Analysis protection • Reusability • Maintenance • Scalability • Documentation and Training
GLaaS Model • Database Schema • Feature oriented • Consider transit data primitives as features • Make set defined elements required fields • Make metadata Optional fields • Design iterations • Trigger based trickle down model • Purpose • Fight over-index anti-pattern • Minimize select time purposefully • Output chain, batch-oriented
GLaaS Model • Structure • Tables • Primary • Insert Entry point • Guaranteed for analysis use • Acts as contract and definition of feature • Trigger • On insert, pushes and updates specific tables • Specific • Select / update point • Only accessible by stored procedure • Info • Metadata chainable by indexed fields
GLaaS Model • Refactoring • Triggers did not work the way intended • Appearance • Separate files, separate queries • Resemble event handling • Simple and Concurrent in imperative languages • Function • Append to insert query • Not concurrent • Artificial dependency • Traced • One failure invalidates entire insert -- including original
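The failure mode described above, where a trigger failure invalidates the original insert, can be reproduced in a small experiment. This sketch uses SQLite through Python's `sqlite3` module as a stand-in; the slides do not name the database engine, and the table names, columns, and the latitude check are hypothetical. Other engines may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Primary table: the insert entry point.
    CREATE TABLE avl_primary (
        vehicle_id  INTEGER NOT NULL,
        lat         REAL NOT NULL,
        lon         REAL NOT NULL,
        observed_at TEXT NOT NULL
    );
    -- Specific table: latest position per vehicle, with a stricter rule.
    CREATE TABLE avl_latest (
        vehicle_id  INTEGER PRIMARY KEY,
        lat         REAL NOT NULL CHECK (lat BETWEEN -90 AND 90),
        lon         REAL NOT NULL,
        observed_at TEXT NOT NULL
    );
    -- Trigger: on insert, push the row down to the specific table.
    CREATE TRIGGER avl_trickle AFTER INSERT ON avl_primary
    BEGIN
        INSERT OR REPLACE INTO avl_latest
        VALUES (NEW.vehicle_id, NEW.lat, NEW.lon, NEW.observed_at);
    END;
""")


def insert_avl(vehicle_id: int, lat: float, lon: float, observed_at: str) -> None:
    conn.execute(
        "INSERT INTO avl_primary VALUES (?, ?, ?, ?)",
        (vehicle_id, lat, lon, observed_at),
    )


insert_avl(1, 41.99, -70.97, "2013-10-01T08:00:00")

try:
    # Valid for the primary table, but the trigger's insert fails the CHECK.
    insert_avl(2, 123.0, -70.97, "2013-10-01T08:00:05")
    trigger_failed = False
except sqlite3.IntegrityError:
    trigger_failed = True

primary_rows = conn.execute("SELECT COUNT(*) FROM avl_primary").fetchone()[0]
latest_rows = conn.execute("SELECT COUNT(*) FROM avl_latest").fetchone()[0]
```

After the second call both tables still hold one row: the trigger body runs as part of the insert statement, so its failure undoes the primary insert as well.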
GLaaS Model • Output variable • Represents inserted data similar to trigger • Called from the stored procedures that insert into primary tables • Calls down the chain, separated by query delimiter • Enforces statically declared batching • Concurrent, let SQL environment make dependency decisions • Responsible for populating specific tables
GLaaS Model • Results, integrity and protocol
GLaaS Model • Explicit use of API and Stored Procedures • No direct application level queries • API only approved access point • Explicit enforcement of authentication by function not by data type • Eliminates need for application specific tables • Fights SQL injection
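The injection point above follows from never building query text out of caller input. The sketch below illustrates the idea with a single access function and a bound parameter. SQLite has no stored procedures, so the Python function stands in for one; the `stop` table, its rows, and `get_stop_by_name` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stop (stop_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO stop VALUES (?, ?)",
    [(1, "Main St"), (2, "Campus Center")],
)


def get_stop_by_name(name: str) -> list:
    """The only approved way to look up a stop: the query text is fixed."""
    # The value is bound as a parameter, never spliced into the SQL string,
    # so quotes in `name` cannot change the structure of the query.
    return conn.execute(
        "SELECT stop_id, name FROM stop WHERE name = ?", (name,)
    ).fetchall()


normal = get_stop_by_name("Main St")
hostile = get_stop_by_name("x' OR '1'='1")  # treated as a literal stop name
```

The hostile input matches no rows instead of returning the whole table, because it is compared as data rather than parsed as SQL.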
GLaaS API • Multimodal approach to consumption • Mechanism for static, on-demand, and real-time consumption • File system and known URI • Similar to GTFS-realtime implementation • Application specific feed format • Request-Response • REST in several mediums • Binds to specific URI and HTTP Verb • Eliminates need for expensive header • SOAP backwards compatibility • Subscription model via push pattern • WebSocket
GLaaS API • SOAP vs. REST • SOAP • XML defined package • URIs surrogate for Endpoints • 1 URI per service • Message header contains definitions and method bindings • RPC • Message data contains payload
GLaaS API • SOAP vs. REST • SOAP definition example for AVL
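To give a rough sense of the envelope overhead, the sketch below wraps one AVL record in a SOAP 1.1 envelope and compares its size with the same record as compact JSON. This is a hypothetical message, not the deck's original example: the `SubmitAvl` operation, the `Action` header element, and the AVL field names are assumptions.

```python
import json

# Hypothetical AVL record; field names are assumptions.
avl = {
    "vehicleId": 12,
    "lat": 41.987,
    "lon": -70.972,
    "observedAt": "2013-10-01T08:00:00Z",
}

# SOAP: the payload travels inside an envelope with a header and a body.
payload_xml = "".join(f"<{key}>{value}</{key}>" for key, value in avl.items())
soap_message = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Header><Action>SubmitAvl</Action></soap:Header>"
    f"<soap:Body><SubmitAvl>{payload_xml}</SubmitAvl></soap:Body>"
    "</soap:Envelope>"
)

# REST-style request body: only the data needed for the call.
json_message = json.dumps(avl, separators=(",", ":"))

soap_bytes = len(soap_message.encode("utf-8"))
json_bytes = len(json_message.encode("utf-8"))
```

The exact ratio depends on the message, but the envelope and header are a fixed cost paid on every real-time update.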
GLaaS API • SOAP vs. REST • REST • URI multiplexing via routes • URI structure relative to root bound to request definition • Request object definition and HTTP verb binds to method and response • Request messages • Only contain data needed for functionality • No header, light-weight • JSON, XML, URI-embedded, any custom data organization
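Binding a route and an HTTP verb to a method can be sketched as a lookup table keyed on the pair. This is a framework-free illustration, not the GLaaS API itself; the routes, handler names, and response shapes are hypothetical.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]

# One table multiplexes many operations over routes under a single root.
ROUTES: Dict[Tuple[str, str], Handler] = {}


def route(verb: str, path: str) -> Callable[[Handler], Handler]:
    """Bind an HTTP verb and a route to the decorated handler."""
    def register(handler: Handler) -> Handler:
        ROUTES[(verb.upper(), path)] = handler
        return handler
    return register


@route("GET", "/avl/latest")
def latest_avl(request: dict) -> dict:
    # The request carries only what this method needs.
    return {"vehicleId": request["vehicleId"], "status": "ok"}


@route("POST", "/avl")
def submit_avl(request: dict) -> dict:
    return {"accepted": True}


def dispatch(verb: str, path: str, request: dict) -> dict:
    """Find the handler bound to this verb and route, then call it."""
    handler = ROUTES.get((verb.upper(), path))
    if handler is None:
        return {"error": 404}
    return handler(request)


found = dispatch("get", "/avl/latest", {"vehicleId": 12})
posted = dispatch("POST", "/avl", {"vehicleId": 12, "lat": 41.99, "lon": -70.97})
missing = dispatch("DELETE", "/avl", {})
```

The same path can serve different methods depending on the verb, which is why no method binding has to travel in a message header.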
GLaaS API • SOAP vs. REST • REST
GLaaS API • Goals • Maintenance • Dynamically generated use documentation • Compartmentalized object definition • Requests • Response • Global Entry Point • Configuration • Application level authentication • Service Definition
GLaaS API • Goals • Extensibility • Add data functionality to feature • Add specific tables • Add metadata specific data columns • Add application level functionality • Add request, response DTOs • Add service method bindings • Replication • Feature encapsulates protocol defined parts • Replicate abstraction model and appropriate retrieval mechanisms for new feature