First Data Investigation on the Grid: FirstDIG Terry Sloan , Paul Graham, Adam Carter

First Data Investigation on the Grid: FirstDIG Terry Sloan, Paul Graham, Adam Carter Edinburgh Parallel Computing Centre (EPCC) Telephone: +44 131 650 5155 Email: t.sloan@epcc.ed.ac.uk p.graham@epcc.ed.ac.uk a.carter@epcc.ed.ac.uk

Overview • The Project • Motivation • Methodology • Data Sources, Cleaning, Analysis • OGSA-DAI • Future Work

The Project

The Project Two aims: • Demonstrate deployment of OGSA-DAI within the First South Yorkshire bus operational environment and learn from it • Short data analysis using OGSA-DAI service enabled data sources to answer business questions posed by First South Yorkshire

The Project (cont) • Partners • First plc represented by First South Yorkshire • National e-Science Centre represented by EPCC • Timescales • 9 months • Start May 2003 • End JanDec 2004 • Nov 2003 = Project Month 7 (PM7)

Motivation

Motivation • First plc • Few UK e-Science projects involve service companies such as First plc • Operate worldwide in variety of transport sectors • Over 10000 vehicles in the UK, 23% of the market • UK’s largest operator • Challenge is meeting the needs of the travelling public whilst making money • Data Mining may assist but huge range of fragmented data sources • OGSA-DAI : Data Access and Integration • Potentially provides a solution • Need business users to make transition from science to commerce

Methodology

Methodology • Business questions • Data sources • Data cleaning/analysis • OGSA-DAI service-enabled data sources • Replicate data cleaning/data analysis • Feedback on OGSA-DAI suitability and areas for improvement.

Data Sources, Cleaning & Analysis

Data Sources in the Bus Industry • Many different kinds of data involved with running a bus company • Mileage, revenue, customer contact, schedule, fuel consumption, vehicle maintenance, routes… • Many means to collect data • Manually entered data at depot • Data collected on buses from ticket machines • Data collected on buses from GPS systems • GPS system notes when bus passes through a predefined “footprint” and records the time at which this happens

Disparate Databases • Data is typically stored in disparate databases • Various reasons for this: Incremental construction of systems. • Not a problem for day-to-day running and querying but… • Introduces challenges for Data Analysis • Systems introduced at different times • Different database engines • Different front-ends • Different operating systems • Different physical locations • Different ways of representing data

An Example Process CLEAN AGGREGATE RE-FORMAT JOIN ANALYSE RE-FORMAT CLEAN AGGREGATE

Cleaning and Reformatting • One Bus, Many Names • e.g. Service 25A might be “025A”, “25A”, “25a” • Sometimes referring to individual depots, and sometimes to operating regions which may include various depots. • Furthermore, if data is stored separately for each depot, data might not explicitly include a reference to a depot – this has to be added when the data is aggregated • Pre-processing can often be done with SQL after some initial analysis • e.g. Create tables with entries corresponding to the depot and columns containing data on how this depot is labelled in the different databases.

Cleaning and Reformatting 2 • Pre-processing with SQL (continued) • Harder for example of service names: Need larger table. Requires effort, but need only be created once. • Alternative: • Read data from database • Process data with other tools (Perl, SPSS, …) • Load results to new table in database

Aggregation • Data can be aggregated in various ways • e.g. By Service, By Day • SQL can do much of the simple aggregation: SELECT Service, Region, SUM(Revenue) AS TotalRevenueFROM RevenueTableGROUP BY Service, Region • In practice SQL can be somewhat more complicated

Aggregation 2 • As before, an alternative is: • Read data from database • Aggregate with external program (SPSS, Perl, even Excel) • Load data back into database • Whether or not this is worth doing depends on • Availability of Aggregation Functions in database engine • Extent of processing required: If a database is stored on a small or heavily-used machine, it may be quicker to export, process, and import.

Joins • Can combine data from more than one database: • Complaints versus Lateness • Revenue versus Lost Miles • Complaints versus Lost Miles • Often Joins are on data aggregated in some way: • By Service • By Day • Subsets of the data can also be considered • e.g. no weekends

Hurdles: Non-Standard SQL • Non-Standard SQL introduces some hurdles for transparent integration of data • Date Formats: • No standard data format: DD/MM/YYYY or MM/DD/YY • No standard date handling functions • Compare MS Access and mySQL: SELECT * FROM AccessTable WHERE IncidentDate BETWEEN #11/30/2000# AND #11/30/2002# SELECT * FROM MySQLTable WHERE IncidentDate BETWEEN '2000-11-30' AND '20021130'

Hurdles: How representative is Data? • Data available for mining can influence results • Representative data required for meaningful results • Since data is not collected for the purposes of data mining, it may be incomplete • For example, data might only be collected to analyse a perceived problem with a particular route

OGSA-DAI OGSA-DAI Required Datamining Tools • SQL can be used for basic data analysis but OGSA-DAI doesn’t replace datamining tools • More complicated data analysis requires external tools: e.g. C5, Perl, SPSS, Excel • OGSA-DAI’s use here is to extract data required for analysis and deliver it to the system on which analysis is to be performed in a useful format GRID Machine performing analysis

OGSA-DAI

The problem • Access to databases at First • The databases: • Are located at different sites • Are hosted on different operating systems • Are not all available via the internal network • Have different DBMS • Require ability to analyse their contents in a uniform manner and include cross-database analysis

The solution • OGSA-DAI • Open Grid Services Architecture Data Access and Integration • DAIS-WG at GGF • Grid middleware: • Assists with the access and integration of data from separate data sources via the Grid • Represents databases as Grid Services • Enables access from other machines in a secure manner • OGSA-DAI Partners • Funded under UK e-Science Core program • Universities of Edinburgh, Manchester and Newcastle • IBM and Oracle • http://www.ogsadai.org.uk

OGSA Data Access and Integration • Based on Grid Services concept • Stateful web services with an associated lifetime • Has a set of behaviours, and conforms to a set of interfaces through which a client may interact • Three main Grid Services: • DAI Service Group Registry (DAISGR) • Holds a list of … • Grid Data Service Factory (GDSF) • Associated with a single database • Grid Data Service (GDS) • A “session” with a database

OGSA-DAI typical interaction 1/3

First and OGSA-DAI • Our remit: • To evaluate the suitability of the use of OGSA-DAI in a commercial environment • Need to find out if OGSA-DAI: • Is appropriate • Is secure • Is straightforward to deploy and use • Does what we need! • Feedback from project goes straight to OGSA-DAI team

Progress • Have a test deployment running at EPCC • Using two of the databases identified in the data analysis WP • The Customer Contact System • Microsoft Access • Information on customer complaints e.g. time, service, nature • The Mileage database • dBASE IV • Information on bus mileage e.g. lost miles

Issue • OGSA-DAI currently does not officially support Access or dBASE IV ! • However, does support JDBC-accessible databases • Solution • Use the Microsoft provided ODBC driver • Use the Sun provided JDBC-ODBC bridge

Set up • Using three machines within our firewall • One to host the CCS database • One to host the Mileage database • One to act as the client

Limitations • Data type support • The BIT data type (Yes/No fields) • The Date format • “Out of range” character codes • Limitation of XML • Firewalls • General Grid computing discussion • Usability • Use of XML can be confusing

Future Work

Future work • Deploy at First • And test within their network • A client tool • To improve usability • Additional databases and DBMS • First have other databases under different DBMS they want to integrate • Single DAI Service Group Registry • These databases should be registered centrally • More complex interactions • Joins across databases …

First Data Investigation on the Grid: FirstDIG Terry Sloan , Paul Graham, Adam Carter