Battling entropy: the development of the MLwiN statistical modelling package: the confessions of a well-intentioned hacker. Jon Rasbash, Centre for Multilevel Modelling, University of Bristol
The way it is. Here is Edward Bear, coming downstairs now, bump, bump, bump, on the back of his head. It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if he could stop bumping for a moment and think of it. And then he feels that perhaps there isn’t.
Another relevant opening paragraph. A doctor, a civil engineer and a computer scientist were arguing about what was the oldest profession in the world. The doctor said, "Well, in the Bible it says that God created Eve from a rib taken from Adam. Clearly this required surgery, so my profession must be the oldest in the world." The civil engineer interrupted: "But earlier in the book of Genesis it says that God created the order of the heavens and the earth out of chaos. That was certainly a most spectacular feat of civil engineering. So, Doctor, my profession is older." The computer scientist smiled confidently: "Who do you think created the chaos?" Grady Booch, Object-Oriented Analysis and Design.
Origins of MLwiN. Mike Healy's NANOSTAT (~1981), a Minitab clone written in RATFOR. • B. W. Kernighan, RATFOR – A Rational Fortran, Workshop on Fortran Preprocessors, Pasadena, Calif., pp. 3, November 1974. Mike wanted to do something on his Osborne portable computer, so he wrote NANOSTAT.
NANOSTAT architecture
• Like MINITAB, data are represented as a set of columns.
• Command verbs take columns, numbers and boxes as arguments.
• Commands can be strung together, the outputs from one command acting as inputs to another.
• A simple architecture: a command parser, functions to create columns, and a series of a hundred or so commands that take inputs and create outputs with no side-effects.
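A minimal sketch of that architecture in C++ (the names and the LOGE verb are illustrative assumptions, not NANOSTAT code): columns live in a worksheet, command verbs sit in a dispatch table, and each command reads and writes columns with no other side effects.

```cpp
#include <cmath>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Columns of data, as in MINITAB/NANOSTAT: each named column is a vector of doubles.
using Column = std::vector<double>;
using Worksheet = std::map<std::string, Column>;

// A command verb: reads input columns, writes output columns, no other side effects.
using Command = std::function<void(Worksheet&, const std::vector<std::string>&)>;

int main() {
    Worksheet ws;
    ws["c1"] = {1.0, 2.0, 3.0};

    // Hypothetical command table mapping verbs to implementations.
    std::map<std::string, Command> commands;
    commands["LOGE"] = [](Worksheet& w, const std::vector<std::string>& a) {
        // LOGE c1 c2 : natural log of column a[0] stored in column a[1]
        Column out;
        for (double x : w[a[0]]) out.push_back(std::log(x));
        w[a[1]] = out;
    };

    // A parsed command line "LOGE c1 c2" dispatches to the registered verb;
    // the output column c2 can then feed the next command in the chain.
    commands["LOGE"](ws, {"c1", "c2"});
    return 0;
}
```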
ML2, ML3, MLN: DOS programs. We added capabilities to fit a two-level multilevel model in 1988 and called the program ML2. ML3 was released in 1990, and the source code was translated to C. MLN was released in 1995, and the new N-level algorithm was written in C++.
MLN and C++. The N-level computational algorithm (never published) is a set of C++ classes for handling problem-specific, highly patterned matrices. To illustrate, consider a model of the kind shown below. One computationally intensive step in the IGLS algorithm is to estimate the variances and covariances of the random effects. Let's look at what that involves from a computing perspective.
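As an illustration of the kind of model under discussion (an assumed example, not necessarily the exact model on the original slide), a two-level random-slopes model can be written as:

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij},\\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} &\sim N(\mathbf{0},\ \Omega_u), \qquad
e_{ij} \sim N(0,\ \sigma_e^2),
\end{aligned}
```

where $i$ indexes the $n_j$ level-1 units within level-2 unit $j$, and $\Omega_u$ holds the variances and covariance of the random effects that IGLS must estimate.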
Estimating equation for the random parameters. Given the block diagonality of $V^{*-1}$ the equation simplifies, which greatly reduces the computational load. Working with the full matrices, storage is proportional to $n_j^4$ and flop counts to $n_j^3$; if $n_j = 100$, RAM requirements exceed 100MB. In the early 1990s this was not possible on PCs, so...
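In this notation (a standard GLS form consistent with the C++ expression shown two slides later, not a verbatim copy of the slide's own equations), the estimating step and its block-diagonal simplification are:

```latex
\hat{\theta} \;=\; \bigl(Z^{*\prime} V^{*-1} Z^{*}\bigr)^{-1} Z^{*\prime} V^{*-1} y^{*}
\;\;\longrightarrow\;\;
\hat{\theta} \;=\; \Bigl(\sum_{j} Z_j^{*\prime} V_j^{*-1} Z_j^{*}\Bigr)^{-1}
\sum_{j} Z_j^{*\prime} V_j^{*-1} y_j^{*},
```

where the sum over level-2 units $j$ replaces operations on the full block-diagonal $V^{*}$.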
Exploiting patterns. All the large matrices were highly structured and could be represented in terms of complex expressions using smaller building-block matrices. Doing this reduces computation: storage from $n_j^4$ to $n_j p$ and flop counts from $n_j^3$ to $n_j p^2$.
Creating the C++ matrix class hierarchy. In designing this class hierarchy I wanted to be able to take expressions such as the estimating equation above and program them directly as

theta = inv(~Zstar*inv(Vstar)*Zstar) * (~Zstar*inv(Vstar)) * ystar

However, we are working here in terms of the big matrices, which directly reflect the statistical logic but are hopelessly inefficient computationally. Each big matrix is represented internally as a patterned set of smaller rectangular and symmetric matrices. The statistical logic can then be expressed at an abstract level, while the details of storage and computation are handled efficiently by subclasses.
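A minimal sketch, assuming hypothetical class names, of how operator overloading can make such expressions programmable directly while subclasses handle the patterned storage (this is not the MLN source):

```cpp
#include <memory>

// Hypothetical base class: a matrix whose internal storage may exploit structure
// (block-diagonal, symmetric, etc.) chosen by the subclass.
class PatternedMatrix {
public:
    virtual ~PatternedMatrix() = default;
    virtual std::shared_ptr<PatternedMatrix> multiply(const PatternedMatrix& rhs) const = 0;
    virtual std::shared_ptr<PatternedMatrix> inverse() const = 0;
    virtual std::shared_ptr<PatternedMatrix> transpose() const = 0;
};

// Thin handle so that expressions read like the algebra.
class Mat {
    std::shared_ptr<PatternedMatrix> impl_;
public:
    explicit Mat(std::shared_ptr<PatternedMatrix> impl) : impl_(std::move(impl)) {}
    friend Mat operator*(const Mat& a, const Mat& b) { return Mat(a.impl_->multiply(*b.impl_)); }
    friend Mat operator~(const Mat& a) { return Mat(a.impl_->transpose()); }
    friend Mat inv(const Mat& a) { return Mat(a.impl_->inverse()); }
};

// With subclasses of PatternedMatrix for the block-diagonal V* and stacked Z*,
// the IGLS step can then be written almost verbatim:
//   Mat theta = inv(~Zstar * inv(Vstar) * Zstar) * (~Zstar * inv(Vstar)) * ystar;
```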
Success? The code was fast and efficient and has been pumping away for over a decade. But did C++ and OOD help? Not sure. C++ syntax, compiler error messages and the lack of garbage collection were difficult. For example, I would get some complex message about why a variable could not be seen, when I thought I had followed C++/OOD principles and syntax. Then I would think "oh sod it, I'll just make the variable global" – which breaks the encapsulation principle. I have not touched the code for at least 5 years and have no intention of extending it. I ignored the advice "don't do a new application and learn C++/OOD at the same time". How well does OOD, which conceptualises problems around a series of communicating objects with taxonomic relationships specified by class hierarchies, work for the highly procedural business of statistical algorithm development? It would have helped to have a mentor with good applied experience of OOP/OOD.
Is there a macho (or perhaps lawyer-like) culture lurking in software engineering? COM example?? My early experience contacting computer scientists.
MLwiN. In 1996 we began work on a Windows version of MLN. The key difference between the console-based MLN and the Windows-based MLwiN: in MLN you only see something – e.g. model setup, graph, prediction, data, multilevel residuals, model constraints, hypothesis tests – when you ask for it with a command. In MLwiN all these interdependent objects can be displayed simultaneously on screen in different windows; an action changing one can have effects on the objects viewed in all the other windows, and those windows must be redrawn. We therefore require an architecture that passes messages to windows when their displays have become out of date; the windows can then respond by redrawing themselves as they see fit. Objects responding to messages: the OOD paradigm.
MLwiN implementation. GUI front end written in VB. Turn the command-driven console app from an EXE into a DLL. Simultaneously we had an application to JISC for a parallel and distributed processing version of MLN/MLwiN, where the GUI runs on a PC and computation is done on a server or a grid. This required minimising data transfer from the GUI to the DLL handling the computation. Recording of system state and task processing are handled by the C++ DLL; the VB front end is a view on the system (collecting input and displaying output).
MLwiN architecture to handle simultaneous interdependent displays and buffering of GUI/back-end data. [Architecture diagram] The VB GUI (user interface windows) and the C++ command-driven program (command interpreter, data structures) communicate through an action manager (dispatcher) and data buffers with invalid flags (one per data item). Windows register interest in actions and request actions; the action manager sends commands to the back end; each action records which data structures it sets out of date; the dispatcher notifies interested windows of the action; windows then request data, which is copied from the back end only when flagged invalid.
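A minimal sketch of the dispatcher idea in C++ (the class and member names are assumptions for illustration, not the MLwiN code):

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Actions declare which data items they invalidate, windows declare which actions
// affect them, and the action manager notifies interested windows after each action.
struct Window {
    std::string name;
    std::set<std::string> interestedActions;   // "what actions affect it"
    std::function<void()> redraw;              // the window redraws itself as it sees fit
};

class ActionManager {
    std::map<std::string, std::set<std::string>> invalidates_;  // action -> data items set out of date
    std::map<std::string, bool> invalid_;                       // one invalid flag per data item
    std::vector<Window*> windows_;
public:
    void registerWindow(Window* w) { windows_.push_back(w); }
    void defineAction(const std::string& action, std::set<std::string> dataItems) {
        invalidates_[action] = std::move(dataItems);
    }
    // A window (or menu) requests an action: send the command to the back end,
    // mark the affected data buffers invalid, then notify interested windows.
    void requestAction(const std::string& action) {
        // sendCommandToBackEnd(action);  // e.g. pass the command string to the C++ DLL
        for (const auto& item : invalidates_[action]) invalid_[item] = true;
        for (Window* w : windows_)
            if (w->interestedActions.count(action)) w->redraw();
    }
    // Windows request data; buffers are re-copied only when flagged invalid.
    bool needsCopy(const std::string& dataItem) { return invalid_[dataItem]; }
    void markCopied(const std::string& dataItem) { invalid_[dataItem] = false; }
};
```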
Done with some help. The above architectural framework has worked well. A friend, Bruce Cameron, was hired as a project consultant to design the framework. We benefited greatly from the input of an experienced software engineer/systems analyst; Bruce's input was probably crucial to MLwiN's success, such as it is. MLwiN 1.0 was released in 1998.
The Equations window. One of the design features was to allow users to work with statistical equations directly to specify and explore multilevel models. This is because expository materials were all based around equation representations, so users learning multilevel modelling faced a double whammy: understanding how the equations operationalised the techniques, translating from that representation to the software's representation in order to run the model, and then back-translating text-based tables of results into the equation representation. This translation placed an unnecessary cognitive load on learners. Many quantitative social scientists were resistant to equations, but the influential quantitative social scientists loved it.
…and view results (after running the model). The Equations window is an IO device that allows, via direct manipulation, models to be specified and changed and results to be viewed – an IO device embedded in the statistical context, not an open-ended declarative symbolic language processor. [Screenshot: a multilevel regression model with random intercepts, already specified by pointing and clicking, about to be extended to random slopes.]
Programming the Equations window. The Equations window was a great success – but extremely straightforward to implement. This was because we had the right frameworks: • VB's GUI programming model • Bruce's synchronisation architecture.
MCMC. In 1998 the project was joined by Bill Browne, who implemented MCMC algorithms for multilevel models in MLwiN. Bill implemented special-case, optimised code. It became apparent that MCMC algorithms were easier to extend to a wide range of statistical models than the IGLS and other algorithms we had been working with. These algorithms also scaled well in terms of computational load. Bill worked with the Centre for Multilevel Modelling from 1998 to 2003; much of his work on the program is recorded in: Browne, W.J. (2003). MCMC Estimation in MLwiN (Version 2.0). Institute of Education, University of London.
Extensibility problems. By 1999, although the architecture for the move to Windows was reasonably sound, another architectural problem was coming into focus. The software architecture reflecting the representation of statistical models was ten years out of date, with new developments being "shoe-horned" into the old architecture. A few key differences over the decade:
1989: Normal responses; hierarchical population structures; IGLS estimation.
1999: Normal, Poisson, Binomial, Multinomial responses; hierarchical, crossed, multiple membership structures; IGLS, bootstrap and MCMC estimation.
Time for a major redesign of the software
• Update the architecture to reflect the new types of models that we had developed.
• Make the new model information structures estimation-method independent, e.g. convenient to plug in IGLS, MCMC, quadrature, SIM_ML, bootstrapping, AIP. The current model structures are IGLS-centric.
• A central strand of statistical analysis is the process of working through a series of models and comparing them: update the software architecture to support multiple "live" statistical models.
• Create an object model of the objects that are the stuff of statistical modelling: data, models, estimates, predictions, graphs, estimation engines, etc.
• Design in interoperability with other software (via COM, CORBA).
A big task – could UML help? After reading quite a bit of Grady Booch and other Three Amigos texts I got excited about using UML and OOD to help us implement the next generation of the MLwiN software. I thought this was a great opportunity to learn OO design and process skills and bring some much-needed rigour, clarity and good practice to our software design and development procedures. I set to work… A year later I crumpled into a heap and simply could not continue.
What went wrong? UML helping communication: a key feature claimed for UML diagrams is that they serve as a representation that software developers and application experts (statisticians) can use to communicate reasonably unambiguously. This helps ensure that the developers build the system the application experts want, and that the objects in the system (and their inter-relationships) correspond to objects in the application knowledge domain, facilitating extensibility. When I tried to use UML diagrams to talk about statistical structures and processes with statisticians, I found they got in the way. This could be due to my inexpert use of the diagrams. They got frustrated and I got defensive.
Lost in the process. I got lost in the UML multi-phase, iterative process. Had I spent enough time developing use cases? Should I now move on to static class diagrams? How detailed should they be at this stage? Had I got the fundamental class design right? Would these interaction diagrams be useful now? And what exactly was this Rational Unified Process anyway? At first I thought that if I read enough I would be able to get things clear, which seemed to work until I tried to apply what I had read. Then I thought I would just plough on anyway and it would become clear through doing. Then: oh, I am still confused, better go back and read some more. After a year of this I had failed to produce a single line of code.
Not another bloody ticket sales application. All the UML texts used airline ticket sales or loyalty card schemes as their exemplars – sometimes hundreds of pages for a single worked example. But I found it hard to transpose those exemplars onto using UML to design and implement a statistical modelling system.
A victim of hype. Although the UML texts contain statements like "there is no silver bullet", they are very persuasive: they are selling a methodology and, in the case of Rational Rose, software products to go with it. Some stronger health warnings on the packet might have been helpful, and also some case studies of where and why UML failed.
Mentor required. In hindsight I realised that I needed a mentor to guide me through the process. Mea culpa: I could have sought out a mentor, but I had the feeling that I really had better clarify things a bit before seeking help from an expert. A possibly fatal lack of confidence on my part. Friendly, accessible experts required.
Current development strategy for new statistical models. We are currently developing MCMC estimation for:
• Multilevel latent category models (aka growth trajectories)
• Multilevel mover/stayer models
• Multilevel factor analysis and structural equation models
• Multilevel multivariate response models with responses of different types defined at different levels: useful for simultaneous equation models, multiprocess models, causal models, and as an engine for multiple imputation for missing data.
All these models are being developed in MATLAB.
MATLAB as a prototyping environment. Relevant MATLAB features:
• Excellent features for matrix programming, and thus good for prototyping algorithms.
• A GUI RAD programming framework (combos, sliders, buttons, radio boxes, check boxes, textboxes, menus, list boxes, button groups, panels) with all the obvious event hooks defined; if that is not enough, a container for any ActiveX control.
• Renders TeX strings into equations.
• Excellent external interfaces to other systems: DLL (with extensive examples for C and FORTRAN), COM, DDE and SOAP.
• The MATLAB compiler will translate a set of .m files to C or C++, compile and link them. This allows easy creation of a royalty-free EXE or DLL.
MATLAB compiler development process. For each new model to be implemented:
1. Develop the algorithms, model set-up interface, model output display and model diagnostic devices in MATLAB.
2. Compile to a C DLL interfaced to MLwiN; the model appears on the MLwiN menu.
3. MLwiN calls the DLL, converting the MLwiN data matrix to a MATLAB matrix.
4. Results are passed back to MLwiN structures.
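A sketch of what the C boundary between MLwiN and such a compiled DLL could look like; the function name, arguments and calling convention are assumptions for illustration, not the actual MLwiN or MATLAB Compiler interface.

```cpp
// Hypothetical boundary between MLwiN and a MATLAB-compiled model DLL.
extern "C" {

// MLwiN passes the data matrix (column-major, nrows x ncols) and a model
// specification string; the DLL returns parameter estimates and standard errors
// in caller-allocated buffers, so no MATLAB-specific types cross the boundary.
int fit_mcmc_model(const double* data, int nrows, int ncols,
                   const char* model_spec,
                   double* estimates, double* std_errors, int nparams);

}  // extern "C"

// On the MLwiN side the call might look like:
//   int rc = fit_mcmc_model(worksheet, n, p, "mover-stayer", est, se, k);
//   if (rc != 0) { /* report estimation failure to the GUI */ }
```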
Are we using MATLAB as a development engine or a prototyping environment? The code is not as fast as handcrafted C/C++ – by about an order of magnitude. The architecture is a little piecemeal, treating each new model type as a separate entity, and lacks extensibility: what happens if you want to combine model types? However, two project programmers and two statisticians have a very immediate need to learn MCMC, and this provides a good platform for that. As team members develop a better understanding of MCMC we can then think about a more general, extensible architecture.
MCMC learning group. We are seeking funding to set up an MCMC learning group for about 10 people associated with the team: mathematical statisticians, programmers/software engineers and applied social statisticians. The group will use an online learning environment and work through simple to more complex models using MCMC estimation, covering: MCMC estimation theory for each model; implementation in MATLAB, handcrafted C/C++ routines, BUGS and OpenBUGS; applications of the model to substantive problems.
Outputs of the learning group. A better understanding across the team of the potential of MCMC estimation. A better understanding of computing issues for the specification, estimation and interpretation of statistical models using MCMC. This increased understanding will guide decisions on a future, more general architecture – which could be for MLwiN to become a front end for OpenBUGS, which provides general model specification structures and access to samplers for nodes. Leaving a learning ladder for others to follow.
LEMMA. We recently received funding to do some statistical methodology development but mostly capacity building for social scientists. From our workshops we know that many quantitative social science researchers in government and academic departments don't understand the mechanics of a multiple regression equation with interactions between continuous and categorical variables. We are thinking hard about who we can target for progression, where conceptually and socially (e.g. work environment) they are getting stuck, and what software tools, training materials and formats they need. The architecture of the learning environment we are developing could be the subject of another whole presentation.
A cross-disciplinary model for development? [Diagram linking social scientists, statisticians, software engineers, informed hackers, and ICT/learning technology/e-learning specialists, with usability as a shared concern, producing software and training materials.]
Standards for statistical model representation. Many tools exist for transferring primary data between proprietary formats and existing standards. However, no standards exist for the secondary data of statistical model structure, and no tools exist for transferring between proprietary representations of model structure (some exceptions in data mining). Development of a cross-platform, language-independent component for storing model specifications is highly desirable…
[Component diagram: a standard model component holding a generic statistical model representation, with three interfaces – a Model–GUI interface to other GUI components (e.g. equation window, graphical model), a Model–Data-source interface to data sources, and a Model–EE interface to estimation engines.]
Usual advantages of component-based design
• New estimation engine (EE) algorithms can be plugged into the model, making comparison of estimation engines much easier – good science.
• Different data sources, e.g. EXCEL or SAS worksheets, can easily be bound to the model.
• Alternative GUI devices can be plugged into the model for developing model specification and exploration tools.
• Facilitates collaborative working.
Is graphical modelling a good framework to use to build the model representation component? A sketch of such a component's interfaces follows.
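A minimal sketch, with assumed names and signatures, of the three interfaces around such a standard model component (an illustration of the idea, not a proposed implementation):

```cpp
#include <map>
#include <string>
#include <vector>

// Model <-> data-source interface: bind columns from EXCEL, SAS, etc.
class IDataSource {
public:
    virtual ~IDataSource() = default;
    virtual std::vector<double> getColumn(const std::string& name) const = 0;
};

// Model <-> estimation-engine interface: IGLS, MCMC, bootstrap engines plug in here.
class IEstimationEngine {
public:
    virtual ~IEstimationEngine() = default;
    virtual std::map<std::string, double> estimate(const class StatisticalModel& model) = 0;
};

// The standard model component: a generic model representation, independent of
// any particular GUI, data source or estimation method.
class StatisticalModel {
    std::vector<std::string> responses_, predictors_;
    IDataSource* data_ = nullptr;
public:
    void bindData(IDataSource* source) { data_ = source; }
    void addResponse(const std::string& r) { responses_.push_back(r); }
    void addPredictor(const std::string& x) { predictors_.push_back(x); }
    const IDataSource* data() const { return data_; }

    // Model <-> GUI interface: an equation window or graphical-model view reads the
    // specification; an engine chosen by the user produces the estimates.
    std::map<std::string, double> fit(IEstimationEngine& engine) { return engine.estimate(*this); }
};
```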
Reflections
• It's papers, not programs, stupid. Software engineering is not credited. ESRC for many years explicitly did not fund it: they had a policy of funding prototyping only and leaving commercial outfits to exploit and further develop software into widely usable systems. Misguided – interaction between software engineers, statisticians and applied researchers is crucial, and commercial outfits take too long to respond. We "sneaked in under the radar". This is now changing with the rising profile of GRID and e-learning/ICT.
• The academic environment produces organic rather than structured development.
• Software engineering can be very valuable, but software modelling techniques can be complex and easy to get lost in. Again, good cross-disciplinary communication is required.