A Case Study of Bayesian Modeling on a Real World Problem RAM Energy Energester/Enziro Bob Mattheys, Malcolm Farrow, Giles Oatley, Garen Arevian, Souvik Banerjee
ISS – Intelligent Systems Solutions • Group of researchers/academics • Working with CAS (Centre for Adaptive Systems) • Remit: • Provide Technology Transfer and Expertise to Industry • Assist NE SMEs and stimulate business growth • Obtain funding, e.g. SMART Awards, GONE, etc.
ISS Projects • RAM Energy – Intelligent Data Analysis • Neptune Engineering – Intelligent Diagnostics • HASS – Back-office system/DBase • Hart Biological – Back-office system/Dbase, process manufacturing • Etc.
RAM Energy • Founded 2000 • Clients in Oil/Gas, Energy, Process, Manufacturing, Haulage Industry • Products: Energester + Enziro • Ester-based synthetic lubricants and greases, enzymatic cleaning solutions, absorbents and blasting media • Better lubrication, heat dissipation and vibration reduction than oil or grease in isolation and conventional additives
RAM Energy • Problem • Demonstrate effectiveness and cost efficiency • Data collected by RAM Energy • very large • major differences across the various sectors • Assist RAM Energy in structuring their data collection and storage in general • Heavy haulage industry
RAM Energy • Trials • RAM Energy carried out select trials with clients. These included: • Monitored consumption prior to Energester use • Monitored consumption post Energester use • Use of control vehicles (no Energester use) • Temperature data collected
RAM Energy Haulage • Data collected via diesel receipts • Information consisted of • Card number (allocated to regn number) • Vehicle registration • Date • Fuel • Mileage
RAM Energy • Analysis • Performed using Excel spreadsheets • Discrete mpg (mileage since last fill/diesel input) • Some cumulative mpg (total mileage/total diesel input to date) • Attempt to normalise using mean temperature records • Some regression analysis
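The two mpg figures mentioned above can be sketched as follows. This is only an illustration, not the spreadsheet logic RAM Energy used: it assumes fuel is recorded in litres, that the tank is filled to a similar level each time, and that each record is a hypothetical `(odometer_miles, litres_added)` pair in date order.

```python
LITRES_PER_UK_GALLON = 4.54609  # assumption: fuel recorded in litres, mpg in UK gallons


def discrete_mpg(fills):
    """mpg for each interval between successive fills.

    fills: list of (odometer_miles, litres_added) in date order.
    The fuel added at a fill is taken to replace what was burnt
    since the previous fill (an assumption about tank levels).
    """
    result = []
    for (prev_miles, _), (miles, litres) in zip(fills, fills[1:]):
        gallons = litres / LITRES_PER_UK_GALLON
        result.append((miles - prev_miles) / gallons)
    return result


def cumulative_mpg(fills):
    """Total miles covered divided by total fuel added after the first fill."""
    total_miles = fills[-1][0] - fills[0][0]
    total_gallons = sum(litres for _, litres in fills[1:]) / LITRES_PER_UK_GALLON
    return total_miles / total_gallons
```

The cumulative figure smooths out fill-to-fill noise, whereas the discrete figure is sensitive to any single mistyped mileage or fuel value.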
RAM Energy Problems • Missing data consisted of • Driver information (who?) • Loading information (full/empty) • Length of journey • Type of journey (long haul vs short haul) • Urban or motorway conditions • Etc.
RAM Energy Conclusion • Results very poor and inconclusive
Database • Excel sheets were converted to an Access database with deletion of unnecessary rows and columns. • The Access database was then imported into SQL Server for data query and subsequent analysis
Data Cleansing • Brief outline of most obvious problems with the data • 1. Card Number • 2. Registration Number • 3. Date • 4. Fuel Added • 5. Mileage
Card Number • There were duplicate Card Numbers for (presumably) the same Card, e.g. • 85944 and 0085944 • In a few cases, for a given Registration Number, there appear additional Card Numbers, e.g. for ‘N151EUB’ there are the Card Numbers: • 38195 0038195 56408
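A minimal sketch of how the leading-zero duplicates could be collapsed, and how registrations with more than one distinct card could be listed for review. The function names and the `(registration, card_number)` record shape are assumptions for illustration; the original cleansing was done in Access and SQL Server.

```python
def normalise_card_number(raw):
    """Strip whitespace and leading zeros so '0085944' and '85944' match."""
    digits = str(raw).strip().lstrip("0")
    return digits or "0"  # keep a single zero if the value was all zeros


def cards_by_registration(records):
    """Map each registration to the set of distinct normalised card numbers.

    records: iterable of (registration, card_number) pairs.
    """
    mapping = {}
    for registration, card in records:
        mapping.setdefault(registration, set()).add(normalise_card_number(card))
    return mapping


def registrations_with_multiple_cards(records):
    """Registrations that still have more than one card after normalisation."""
    return {
        reg: cards
        for reg, cards in cards_by_registration(records).items()
        if len(cards) > 1
    }
```

Applied to the example above, 'N151EUB' would still be reported with two cards (38195 and 56408), which needs a manual decision rather than an automatic merge.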
Registration Number • Registration numbers seemed to be always entered correctly • However, the field Reg Entered did not always tally with this • RAM recommendation to ignore
Date • Dates were entered very consistently • preserved the ordering • distance between dates • the actual date • An important question was: CAN WE PRESUME THE DATE IS ALWAYS ENTERED CORRECTLY? • If this was so, then this provided us with a convenient check on the Mileage, as Date and Mileage should both increase together.
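If the dates are trusted, the check that Date and Mileage increase together might be sketched like this. It is an illustrative rule, not the authors' procedure: a reading is flagged when its two date-neighbours are consistent with each other but the reading itself does not lie between them. The `(date_ordinal, mileage)` record shape is an assumption.

```python
def suspect_mileage_indices(readings):
    """Indices of odometer readings that break the date/mileage ordering.

    readings: list of (date_ordinal, mileage) sorted by date.
    The first and last readings are not checked because they have
    only one neighbour to compare against.
    """
    miles = [mileage for _, mileage in readings]
    suspects = []
    for i in range(1, len(miles) - 1):
        prev_m, cur, next_m = miles[i - 1], miles[i], miles[i + 1]
        # Only judge the middle reading when its neighbours agree with each other.
        if prev_m <= next_m and not (prev_m <= cur <= next_m):
            suspects.append(i)
    return suspects
```

Two adjacent bad readings would not be caught by this simple rule, so it is best treated as a first pass.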
Fuel • Outlier identification • Very small and very large values easily detected over large dataset • Take the mean of the sample and flag as outliers any data more than 3 or 4 SDs away from the mean • Very small values, e.g. 0 or 1, assumed to be bogus values • 9999, 999, etc. taken to be bogus values • Some small and large values mistyped, with either the decimal place occurring too soon (e.g. 38.6 instead of 386) or extra digits added (e.g. 3860 instead of 386)
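The outlier rule above can be sketched as below. The sentinel set and the 3 SD default are taken from the slide; excluding the sentinel values before computing the mean and SD is an added assumption, made so that entries such as 9999 do not inflate the spread.

```python
import statistics

SENTINELS = {0, 1, 999, 9999}  # placeholder entries treated as bogus


def flag_fuel_values(values, n_sd=3.0):
    """Label each fuel value as 'bogus', 'outlier' or 'ok'.

    Sentinel values are labelled 'bogus' and left out of the mean/SD.
    Remaining values more than n_sd standard deviations from the mean
    are labelled 'outlier'.
    """
    candidates = [v for v in values if v not in SENTINELS]
    mean = statistics.mean(candidates)
    sd = statistics.stdev(candidates)
    flags = []
    for v in values:
        if v in SENTINELS:
            flags.append("bogus")
        elif sd > 0 and abs(v - mean) > n_sd * sd:
            flags.append("outlier")
        else:
            flags.append("ok")
    return flags
```

Note that a single extreme value inflates the SD it is judged against, so this rule only works on reasonably large samples, in line with the "over large dataset" remark.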
Fuel • Difficult errors • e.g. 693392: could this be 69392? What if it were 693399? • Such data must be flagged as erroneous
Mileage • Some values were entered as {0,1,999,9999,2,3,5,10,111,1111,123,789, etc} • If we can presume that the Date is a sensible value, then in a dataset where there are only a few missing or obviously incorrect values for the Mileage, these values can be amended as follows
Mileage We do not know if the day 13 entry is wrong, or day 14. So we can look ahead:
Mileage Or
Mileage Collapsed to:
Mileage • Small and very large values could be ignored • Problem was determining whether any of the remaining data was valid – data validation • Evaluating the degree of correlation between the increasing Date, and the supposed increasing Mileage • Useful approaches for estimating rank-orderedness and correlation between lists • Spearman’s coefficient of rank correlation • Kendall’s Tau
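Both coefficients named above can be computed directly for a vehicle's date and mileage lists. The sketch below uses the textbook formulas and assumes there are no tied values in either list; with ties, a tie-corrected variant (as provided by statistical packages) would be needed.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)
```

A vehicle whose mileage rises perfectly with date scores 1.0 on both measures; values well below 1 suggest that some of the remaining readings are out of order.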
Bayesian - Approach • In the Bayesian approach to statistical inference, we express uncertain beliefs about things in terms of probability • E.g. that there is a 50% chance that the average fuel consumption of a vehicle will be less than 30mpg • We can use probabilities in this way to describe uncertainty about things we do not know • E.g. the amount of fuel in a vehicle’s tank at 10.00am yesterday
Bayesian - Approach • Once we accept this view of probability, the principle for learning from data is simple • Before we see the data, we have a probability distribution based on our knowledge up to that point • prior distribution • When we see the data our probability distribution changes, in the light of new information in the data • posterior distribution.
Bayesian - Approach • Calculation used to get from the prior distribution to the posterior distribution • Uses Bayes’ theorem • Hence Bayesian statistics • Very straightforward interpretation of the results when using this method • Posterior distribution tells us how likely it is that various things are true, after we have used the evidence in the data
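For reference, the step from prior to posterior can be written as follows. The symbols are notation introduced here, not taken from the slides: $\theta$ stands for the unknown quantities (e.g. the additive effect) and $y$ for the observed data.

```latex
p(\theta \mid y)
  = \frac{p(y \mid \theta)\, p(\theta)}
         {\int p(y \mid \theta')\, p(\theta')\, d\theta'}
  \;\propto\; p(y \mid \theta)\, p(\theta)
```

Here $p(\theta)$ is the prior distribution, $p(y \mid \theta)$ is the likelihood of the data, and $p(\theta \mid y)$ is the posterior distribution.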
Bayesian - Approach • Different observers can have different prior beliefs and this means that their posterior distributions will also be different • make prior distribution represent very little information • in practice prior tends to have little effect on posterior • One advantage of this approach is that it is straightforward to calculate what we expect various things to be after seeing the data • For example, can calculate a posterior probability distribution for the cost savings of applying the fuel additive to a whole vehicle fleet
Bayesian - Model • The basic model used is a regression, with fuel used as the dependent variable and distance travelled as one of the explanatory variables • Each observation corresponds to the time between two successive additions of fuel to the fuel tank • We expect zero fuel to be used if zero distance were travelled, but the amount of fuel used is not necessarily proportional to the distance travelled • For example, fuel efficiency may be greater on longer journeys
Bayesian - Model • Simplest form of the model, assume that fuel used is proportional to distance travelled • Constant of proportionality which is the slope of the line on a graph • Various other forms of relationship were also investigated. • While distance travelled is most obvious explanatory variable, there are several other variables and factors which must be taken into account
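For the simplest proportional form, the posterior for the slope has a closed form under some strong simplifying assumptions: a normal prior on the slope and normally distributed errors with a known standard deviation. The sketch below illustrates that textbook conjugate update only; it is not the authors' model, which also included vehicle, driver and seasonal effects, and all parameter names are hypothetical.

```python
def posterior_slope(distances, fuel, prior_mean, prior_sd, noise_sd):
    """Posterior mean and SD of the slope in fuel = slope * distance + error.

    Assumes error ~ Normal(0, noise_sd^2) with noise_sd known, and
    slope ~ Normal(prior_mean, prior_sd^2) a priori.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sum(d * d for d in distances) / noise_sd ** 2
    posterior_precision = prior_precision + data_precision
    weighted_sum = sum(d * y for d, y in zip(distances, fuel)) / noise_sd ** 2
    posterior_mean = (prior_mean * prior_precision + weighted_sum) / posterior_precision
    return posterior_mean, posterior_precision ** -0.5
```

With a very large `prior_sd` the posterior mean approaches the least-squares slope through the origin, which matches the remark that a weak prior tends to have little effect on the posterior.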
Bayesian - Factors • Vehicle Types • Type of vehicle has effect • Individual vehicles of same type may also have different characteristics • Effect of individual vehicles (within a type) was regarded as a random effect • Vehicles seen as a sample from all vehicles of that type
Bayesian - Factors • Drivers • Driver identified by card number • Drivers closely associated with vehicles • In this case, difficult to separate effects of vehicles from the effects of drivers • However, if this were not the case, then it would be possible to make inferences about individual drivers as well as individual vehicles
Bayesian - Factors • Time of year • Fuel efficiency may be affected by ambient temperature/meteorological variables • Ideally use meteorological data • Obtained data for this purpose • But, as a first step, a simple substitute is to use the time of year, e.g. month
Bayesian - Factors • Presence of fuel additive • The main question of interest is, “How does the use of the fuel additive affect fuel consumption?”
Bayesian - Complications • Fuel • How full the fuel tank was before or after fuel was added • Precisely how much fuel was used between fills • True tank content regarded as a latent or “hidden” variable • Such variables can be built into a Bayesian analysis
Bayesian - Complications • Data entry errors • Graph of odometer readings against date for a single vehicle shows the general pattern - spurious values • This built into the model by allowing certain prior probabilities for errors of different types • The analysis can thus “recognise” errors by calculating posterior probabilities that a reading is an error of the various types • Those values which have large posterior probabilities of being erroneous are, in effect, ignored by the rest of the analysis.
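The idea of a posterior probability that a reading is an error can be illustrated with a two-component mixture. This is a simplified stand-in for the model described above, which allowed several error types: here a valid reading is assumed to be normally distributed around its expected value, an erroneous one is assumed to be uniform over a wide range, and every parameter name is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_error_prob(reading, expected, sd, prior_error, low, high):
    """P(reading is an error | reading) via Bayes' theorem.

    valid reading  ~ Normal(expected, sd^2)
    error reading  ~ Uniform(low, high)
    prior_error    = prior probability that any reading is an error
    """
    like_valid = normal_pdf(reading, expected, sd)
    like_error = 1.0 / (high - low)
    numerator = prior_error * like_error
    return numerator / (numerator + (1.0 - prior_error) * like_valid)
```

A reading close to its expected value gets a posterior error probability near zero, while a value such as 9999 on a vehicle with well over 100,000 miles gets a probability near one and would then carry almost no weight in the rest of the analysis.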
Bayesian - Conclusions • Prototype Bayesian models were successfully run • Demonstrated feasibility of approach for this problem • However: • Need to overcome problems of missing data • Uncertainty over when additive would be expected to have an effect • Pattern of this effect • Confounding of additive effect with the effects of other factors such as the changing seasons
Bayesian Results Posterior probability density for the effect of the additive, in litres per mile
Conclusions • Recommendations: • Design of better trials and data acquisition • Collection of ambient temperatures, etc. • Future Directions • Fraud detection • Efficiency of individual drivers/vehicles • Patterns of work, optimisation