A Case Study of Bayesian Modeling on a Real World Problem

RAM Energy Energester/Enziro • Bob Mattheys, Malcolm Farrow, Giles Oatley, Garen Arevian, Souvik Banerjee

ISS – Intelligent Systems Solutions • Group of researchers/academics • Working with CAS (Centre for Adaptive Systems) • Remit: • Provide Technology Transfer and Expertise to Industry • Assist NE SMEs and stimulate business growth • Obtain funding, e.g. SMART Awards, GONE, etc.

ISS Projects • RAM Energy – Intelligent Data Analysis • Neptune Engineering – Intelligent Diagnostics • HASS – Back-office system/DBase • Hart Biological – Back-office system/Dbase, process manufacturing • Etc.

RAM Energy • Founded 2000 • Clients in Oil/Gas, Energy, Process, Manufacturing, Haulage Industry • Products: Energester + Enziro • Ester-based synthetic lubricants and greases, enzymatic cleaning solutions, absorbents and blasting media • Better lubrication, heat dissipation and vibration reduction than oil or grease in isolation and conventional additives

RAM Energy • Problem • Demonstrate effectiveness and cost efficiency • Data collected by RAM Energy • very large • major differences across the various sectors • Assist RAM Energy in structuring their data collection and storage in general • Heavy haulage industry

RAM Energy • Trials • RAM Energy carried out selected trials with clients. These included: • Monitored consumption prior to Energester use • Monitored consumption post Energester use • Use of control vehicles (no Energester use) • Temperature data collected

RAM Energy Haulage • Data collected via diesel receipts • Information consisted of • Card number (allocated to registration number) • Vehicle registration • Date • Fuel • Mileage

RAM Energy • Analysis • Performed using Excel spreadsheets • Discrete mpg (mileage since last fill/diesel input) • Some cumulative mpg (total mileage/total diesel input to date) • Attempt to normalise using mean temperature records • Some regression analysis
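The two mpg figures named on this slide can be sketched as follows. This is a minimal illustration, not the original spreadsheet logic: it assumes fuel is recorded in litres, that mpg means miles per imperial gallon, and that each fill replaces the fuel used since the previous fill. The function names and the record layout are hypothetical.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609  # assumption: fuel is recorded in litres


def discrete_mpg(fills):
    """Per-interval mpg: miles since the last fill / fuel added at this fill.

    fills: list of (odometer_miles, litres_added) in date order.
    """
    result = []
    for (prev_miles, _), (miles, litres) in zip(fills, fills[1:]):
        gallons = litres / LITRES_PER_IMPERIAL_GALLON
        result.append((miles - prev_miles) / gallons)
    return result


def cumulative_mpg(fills):
    """Running mpg: total miles to date / total fuel added to date.

    The first record only supplies the starting odometer reading.
    """
    start_miles = fills[0][0]
    total_litres = 0.0
    result = []
    for miles, litres in fills[1:]:
        total_litres += litres
        gallons = total_litres / LITRES_PER_IMPERIAL_GALLON
        result.append((miles - start_miles) / gallons)
    return result
```

The discrete figure is noisy because a partial fill distorts a single interval, while the cumulative figure smooths this out at the cost of reacting slowly to a change such as introducing an additive.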

RAM Energy Results

RAM Energy Problems • Missing data consisted of • Driver information (who?) • Loading information (full/empty) • Length of journey • Type of journey (long haul vs short haul) • Urban or motorway conditions • Etc.

RAM Energy Conclusion • Results very poor and inconclusive

Database • Excel sheets were converted to an Access database with deletion of unnecessary rows and columns. • The Access database was then imported into SQL Server for data query and subsequent analysis

Data Cleansing • Brief outline of most obvious problems with the data • 1. Card Number • 2. Registration Number • 3. Date • 4. Fuel Added • 5. Mileage

Card Number • There were duplicate Card Numbers for (presumably) the same Card, e.g. • 85944 and 0085944 • In a few cases, for a given Registration Number, there appear additional Card Numbers, e.g. for ‘N151EUB’ there are the Card Numbers: • 38195 0038195 56408
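One way to reconcile such duplicates is to strip leading zeros before comparing card numbers, then list the distinct cards seen per registration. The sketch below assumes that leading zeros are the only difference between duplicate entries; the function names are hypothetical.

```python
from collections import defaultdict


def normalise_card(card):
    """Strip whitespace and leading zeros so '0085944' matches '85944'."""
    stripped = str(card).strip().lstrip("0")
    return stripped or "0"


def cards_by_registration(rows):
    """Group the distinct normalised card numbers under each registration.

    rows: iterable of (registration, card_number) pairs.
    """
    cards = defaultdict(set)
    for registration, card in rows:
        cards[registration].add(normalise_card(card))
    return dict(cards)
```

For the 'N151EUB' example this leaves two cards, 38195 and 56408, so the registration still needs a manual decision about the second card.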

Registration Number • Registration numbers seemed to be always entered correctly • However, the field Reg Entered did not always tally with this • RAM recommendation to ignore

Date • Dates were entered very consistently • preserved the ordering • distance between dates • the actual date • An important question was: CAN WE PRESUME THE DATE IS ALWAYS ENTERED CORRECTLY? • If so, this provides a convenient check on the Mileage, as Date and Mileage should both increase together.

Fuel • Outlier identification • Very small and very large values easily detected over a large dataset • Take the mean of the sample and flag data more than 3 or 4 SDs from the mean as outliers • Very small values, e.g. 0 or 1, assumed to be bogus • 9999, 999, etc. taken to be bogus values • Some small and large values mistyped, with either the decimal place occurring too soon (e.g. 38.6 instead of 386) or extra digits added (e.g. 3860 instead of 386)
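The flagging rule described here might look like the sketch below. The set of sentinel values and the choice to exclude them before computing the mean and SD are assumptions made for illustration; the slides do not say how the statistics were computed.

```python
import statistics

# Assumption: these placeholder entries are treated as bogus outright.
BOGUS_VALUES = {0, 1, 999, 9999}


def flag_fuel(values, k=3.0):
    """Label each fuel value as 'bogus', 'outlier' or 'ok'.

    Bogus sentinels are removed before the mean and SD are computed,
    otherwise entries such as 9999 would inflate the SD and hide outliers.
    """
    plausible = [v for v in values if v not in BOGUS_VALUES]
    mean = statistics.mean(plausible)
    sd = statistics.stdev(plausible)
    flags = []
    for v in values:
        if v in BOGUS_VALUES:
            flags.append("bogus")
        elif sd > 0 and abs(v - mean) > k * sd:
            flags.append("outlier")
        else:
            flags.append("ok")
    return flags
```

A rule based on SDs catches gross errors such as 3860 for 386, but a misplaced decimal such as 38.6 can sit within a few SDs of the mean and pass unflagged, which is one reason the later slides move the error handling into the model itself.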

Fuel • Difficult errors • e.g. 693392: could it be 69392? What if 693399? • Such data must be flagged as erroneous

Mileage • Some values were entered as {0,1,999,9999,2,3,5,10,111,1111,123,789, etc} • If we can presume that the Date is a sensible value, then in a dataset where there are only a few missing or obviously incorrect values for the Mileage, these values can be amended as follows

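Mileage • We do not know whether the day 13 entry or the day 14 entry is wrong, so we can look ahead: [example table not preserved in transcript]

Mileage • Or: [example table not preserved in transcript]

Mileage • Collapsed to: [example table not preserved in transcript]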

Mileage • Small and very large values could be ignored • Problem was determining whether any of the remaining data was valid – data validation • Evaluating the degree of correlation between the increasing Date, and the supposed increasing Mileage • Useful approaches for estimating rank-orderedness and correlation between lists • Spearman’s coefficient of rank correlation • Kendall’s Tau
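Both rank statistics can be computed directly from a vehicle's (date, mileage) sequence. The sketch below uses the textbook formulas and assumes there are no tied values (Spearman via squared rank differences, Kendall's tau-a via pair counting); it is an illustration rather than the validator's actual code.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A vehicle whose mileage rises with every date scores 1.0 on both measures; each spurious odometer entry pulls the score down, so a low value marks a vehicle whose records need closer inspection.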

Data Cleansing

RAM Energy Data Validator

Bayesian - Approach • In the Bayesian approach to statistical inference, we express uncertain beliefs about things in terms of probability • E.g. that there is a 50% chance that the average fuel consumption of a vehicle will be less than 30 mpg • Can use probabilities in this way to describe uncertainty about things we do not know • E.g. amount of fuel in a vehicle’s tank at 10.00am yesterday

Bayesian - Approach • Once we accept this view of probability, the principle for learning from data is simple • Before we see the data, we have a probability distribution based on our knowledge up to that point • prior distribution • When we see the data our probability distribution changes, in the light of new information in the data • posterior distribution.

Bayesian - Approach • Calculation used to get from the prior distribution to the posterior distribution • Uses Bayes’ theorem • Hence Bayesian statistics • Very straightforward interpretation of the results when using this method • Posterior distribution tells us how likely it is that various things are true, after we have used the evidence in the data
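For reference, the update from prior to posterior can be written out as below. The symbols are assumptions introduced here, not taken from the slides: θ stands for the unknown quantities (for example a vehicle's fuel consumption rate) and y for the observed data.

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'}
```

Here p(θ) is the prior distribution, p(y | θ) is the likelihood of the data, and p(θ | y) is the posterior distribution; the denominator only rescales the result so that it integrates to one.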

Bayesian - Approach • Different observers can have different prior beliefs and this means that their posterior distributions will also be different • Can make the prior distribution represent very little information • In practice, with enough data, the prior then tends to have little effect on the posterior • One advantage of this approach is that it is straightforward to calculate what we expect various things to be after seeing the data • For example, can calculate a posterior probability distribution for the cost savings of applying the fuel additive to a whole vehicle fleet

Bayesian - Model • The basic model used is a regression, with fuel used as the dependent variable and distance travelled as one of the explanatory variables • Each observation corresponds to the time between two successive additions of fuel to the fuel tank • Expect zero fuel to be used if zero distance were travelled, but the amount of fuel used is not necessarily proportional to the distance travelled • For example, fuel efficiency may be greater on longer journeys

Bayesian - Model • In the simplest form of the model, assume that fuel used is proportional to distance travelled • The constant of proportionality is the slope of the line on a graph of fuel used against distance travelled • Various other forms of relationship were also investigated. • While distance travelled is the most obvious explanatory variable, there are several other variables and factors which must be taken into account
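The simplest proportional model can be illustrated with a textbook conjugate calculation. The sketch below is a deliberate simplification and not the authors' model: it assumes fuel = slope × distance + noise, Gaussian noise with a known SD, and a normal prior on the slope. It ignores vehicle, driver, season and additive effects, and all names are hypothetical.

```python
from math import sqrt


def slope_posterior(distances, fuel, prior_mean, prior_sd, noise_sd):
    """Posterior mean and SD of the slope (fuel per unit distance).

    Model (assumed): fuel_i = slope * distance_i + e_i, e_i ~ N(0, noise_sd^2),
    with prior slope ~ N(prior_mean, prior_sd^2).
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sum(d * d for d in distances) / noise_sd ** 2
    weighted_sum = sum(d * f for d, f in zip(distances, fuel)) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision + weighted_sum) / post_precision
    return post_mean, sqrt(1.0 / post_precision)
```

With a wide prior (large `prior_sd`) the posterior mean is close to the least-squares slope through the origin, which matches the earlier point that a weak prior has little effect once there is enough data.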

Bayesian - Factors • Vehicle Types • Type of vehicle has effect • Individual vehicles of same type may also have different characteristics • Effect of individual vehicles (within a type) was regarded as a random effect • Vehicles seen as a sample from all vehicles of that type

Bayesian - Factors • Drivers • Driver identified by card number • Drivers closely associated with vehicles • In this case, difficult to separate effects of vehicles from the effects of drivers • However, if this were not the case, then it would be possible to make inferences about individual drivers as well as individual vehicles

Bayesian - Factors • Time of year • Fuel efficiency may be affected by ambient temperature/meteorological variables • Ideally use meteorological data • Obtained data for this purpose • But, as a first step, a simple substitute is to use the time of year, e.g. month

Bayesian - Factors • Presence of fuel additive • The main question of interest is, “How does the use of the fuel additive affect fuel consumption?”

Bayesian - Complications • Fuel • How full the fuel tank was before or after fuel was added • Precisely how much fuel was used between fills • True tank content regarded as a latent or “hidden” variable • Such variables can be built into a Bayesian analysis

Bayesian - Complications • Data entry errors • Graph of odometer readings against date for a single vehicle shows the general pattern, with some spurious values • This is built into the model by allowing certain prior probabilities for errors of different types • The analysis can thus “recognise” errors by calculating posterior probabilities that a reading is an error of the various types • Those values which have large posterior probabilities of being erroneous are, in effect, ignored by the rest of the analysis.
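The idea of a posterior error probability can be shown with a two-component sketch. It assumes a single error type, where an erroneous odometer reading is equally likely to be any value up to some maximum, and a normal distribution around the expected reading otherwise. The slides mention several error types, so this is a reduced illustration, and every name and default value here is hypothetical.

```python
from statistics import NormalDist


def posterior_error_prob(reading, expected, sd,
                         prior_error=0.05, max_reading=1_000_000):
    """Posterior probability that an odometer reading is a data entry error.

    Assumptions: a valid reading ~ N(expected, sd^2); an erroneous reading is
    uniform on [0, max_reading]; prior_error is the prior error probability.
    """
    if not 0 <= reading <= max_reading:
        raise ValueError("reading outside the modelled range")
    likelihood_ok = NormalDist(expected, sd).pdf(reading)
    likelihood_error = 1.0 / max_reading
    numerator = prior_error * likelihood_error
    return numerator / (numerator + (1 - prior_error) * likelihood_ok)
```

A reading close to the expected value gets an error probability near zero, while an entry such as 9999 on a vehicle expected to show about 120,000 miles gets a probability near one and so contributes almost nothing to the fit.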

Bayesian - Conclusions • Prototype Bayesian models were successfully run • Demonstrated feasibility of approach for this problem • However: • Need to overcome problems of missing data • Uncertainty over when additive would be expected to have an effect • Pattern of this effect • Confounding of additive effect with the effects of other factors such as the changing seasons

Bayesian Results Posterior probability density for the effect of the additive, in litres per mile

Conclusions • Recommendations: • Design of better trials and data acquisition • Collection of ambient temperatures, etc. • Future Directions • Fraud detection • Efficiency of individual drivers/vehicles • Patterns of work, optimisation
