
a starting point for: “Using simulation in parallel computing for faster sample size calculations in complex random effects models” Toni Price, University of Bristol

MLPowSim
• Developed in a separate ESRC-funded project
• Generates both MLwiN macro code and R language code for performing sample size calculations on multilevel models
• Works for a selection of multilevel nested and crossed designs
• Text-based interface
• Uses C code to gather user input and generate output

Initial objective: Use MLPowSim as a basis and extend it to support a broader range of models
• Good starting point, but would benefit from an automated way of testing that generated code matches expected output (especially as new and more complex models are added)

First step
Put into a cohesive framework:
• Streamline duplicated code (e.g. for user input, which is similar across different models)
  • Also improves code maintenance (e.g. bug fixes impact fewer lines of code)
• Improve input validation
  • Makes for a better user experience and reduces crashes
• Automate testing of generated code and results
• Add multiple user interfaces, e.g. command line / file input / web-based
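One way to picture the "automate testing of generated code" step is a comparison between freshly generated code and a stored expected version. The sketch below is only an illustration: the template text, `generate_r_code` and `mismatched_lines` are hypothetical names, and the three-line R fragment stands in for the much longer code that MLPowSim actually generates.

```ruby
require "erb"

# Hypothetical template for a small fragment of generated R code.
R_TEMPLATE = ERB.new(<<~TEMPLATE)
  set.seed(<%= seed %>)
  n_sims <- <%= n_sims %>
  sample_sizes <- seq(<%= low %>, <%= hi %>, by = <%= step %>)
TEMPLATE

# Render the template from a hash of input parameters.
def generate_r_code(params)
  R_TEMPLATE.result_with_hash(params)
end

# Compare generated code with the expected version line by line and
# return the 1-based numbers of the lines that differ (empty if none).
def mismatched_lines(generated, expected)
  gen = generated.lines.map(&:chomp)
  exp = expected.lines.map(&:chomp)
  (0...[gen.length, exp.length].max).reject { |i| gen[i] == exp[i] }.map { |i| i + 1 }
end

# Stored expected output for one known set of inputs.
expected = <<~EXPECTED
  set.seed(1)
  n_sims <- 1000
  sample_sizes <- seq(20, 600, by = 20)
EXPECTED

generated = generate_r_code(seed: 1, n_sims: 1000, low: 20, hi: 600, step: 20)
diff = mismatched_lines(generated, expected)
puts diff.empty? ? "generated code matches expected output" : "mismatch on lines #{diff.join(', ')}"
```

Reporting line numbers rather than a plain pass/fail makes it easier to see which part of a template changed when a new model is added.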

Ruby is …
• Much like Python in a number of ways
• Cross-platform
• A good choice for metaprogramming
• Excellent for text processing
… though in the end it boils down to personal preference

… moving to Ruby
In the words of the official Ruby site (http://www.ruby-lang.org/en/), Ruby is “A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.” (… I agree!)

Input methods
• Command line
  • Current input method
• File input
  • Useful during development
  • Facilitates automated testing
• Web interface
  • Familiar mode of input
  • ‘Easy’ to use

File input – Example for a 1-level model

    # Input params
    #
    # Example 1 (p. 8 in MLPowSim user manual)
    # MLwiN code output
    general:
      output_lang: mlwin
      rnd_num_seed: 1
      sig_level: 0.025
      n_sims: 1000
    model:
      n_levels: 1
      response_type: normal
      est_method: igls
      include_fixed_intercept: yes
      n_explanatory_vars: 0
    estimates:
      beta_0: -0.140
      sigma_sq_e: 1.051
    sample_size:
      level_1:
        low: 20
        hi: 600
        step: 20
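As a rough illustration of how such a file might be consumed, the sketch below parses a shortened copy of the 1-level input with Ruby's standard YAML library and expands the `low` / `hi` / `step` entry into the list of sample sizes to simulate. The nesting of the keys and the helper name `sample_sizes` are assumptions made for this example, not taken from the MLPowSim code.

```ruby
require "yaml"

# Shortened inline copy of the 1-level input; a real run would read the
# same text from a file with File.read. The key nesting is assumed.
INPUT_TEXT = <<~INPUT
  general:
    output_lang: mlwin
    rnd_num_seed: 1
    sig_level: 0.025
    n_sims: 1000
  sample_size:
    level_1:
      low: 20
      hi: 600
      step: 20
INPUT

# Expand a {low, hi, step} mapping into the list of sample sizes to try.
def sample_sizes(range_spec)
  low, hi, step = range_spec.values_at("low", "hi", "step")
  low.step(hi, step).to_a
end

params = YAML.safe_load(INPUT_TEXT)
sizes  = sample_sizes(params["sample_size"]["level_1"])
puts "#{sizes.length} sample sizes, from #{sizes.first} to #{sizes.last}"
```

Because the input is plain data, the same hash could equally be filled in from command-line prompts or a web form.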

File input – Example for a 2-level model

    # Input params
    #
    # Example 8 (p. 39 in MLPowSim user manual)
    # MLwiN code output
    general:
      output_lang: mlwin
      rnd_num_seed: 1
      sig_level: 0.025
      n_sims: 1000
    model:
      n_levels: 2
      is_balanced: yes
      structure: nested  #=> nested | cross-classified
      response_type: normal
      est_method: igls
      include_fixed_intercept: yes
      include_random_intercept: yes
      n_explanatory_vars: 0
    estimates:
      beta_0: -0.177
      sigma_sq_u: 0.151
      sigma_sq_e: 0.916
    sample_size:
      level_2:
        low: 10
        hi: 50
        step: 10
      level_1:
        low: 10
        hi: 60
        step: 10
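The 2-level input lends itself to a small validation pass before any code is generated. The sketch below checks the `structure` value against the two options named in the file comment, checks each level's range, and then lists every combination of level-2 and level-1 sizes. The function names, the error wording, and the reading of `level_1` as units per level-2 unit are assumptions for illustration only.

```ruby
# Values allowed for model.structure, as listed in the input file comment.
ALLOWED_STRUCTURES = %w[nested cross-classified].freeze

# Return a list of human-readable problems; an empty list means the
# input passed these (hypothetical) checks.
def validate(params)
  errors = []
  model = params.fetch("model", {})
  unless ALLOWED_STRUCTURES.include?(model["structure"])
    errors << "structure must be one of: #{ALLOWED_STRUCTURES.join(', ')}"
  end
  (1..model["n_levels"].to_i).each do |level|
    spec = params.dig("sample_size", "level_#{level}")
    if spec.nil?
      errors << "sample_size.level_#{level} is missing"
    elsif !(spec["step"].to_i.positive? && spec["low"].to_i <= spec["hi"].to_i)
      errors << "sample_size.level_#{level} needs step > 0 and low <= hi"
    end
  end
  errors
end

# Every [level-2 size, level-1 size] pair to simulate.
def design_grid(sample_size)
  l2 = sample_size["level_2"]
  l1 = sample_size["level_1"]
  l2_sizes = l2["low"].step(l2["hi"], l2["step"]).to_a
  l1_sizes = l1["low"].step(l1["hi"], l1["step"]).to_a
  l2_sizes.product(l1_sizes)
end

# The relevant parts of the 2-level example, already parsed into a hash.
params = {
  "model" => { "n_levels" => 2, "structure" => "nested" },
  "sample_size" => {
    "level_2" => { "low" => 10, "hi" => 50, "step" => 10 },
    "level_1" => { "low" => 10, "hi" => 60, "step" => 10 }
  }
}

problems = validate(params)
grid = design_grid(params["sample_size"])
puts problems.empty? ? "#{grid.length} designs to simulate" : problems
```

Collecting all problems in one list, rather than stopping at the first, lets every interface (command line, file, web) report them in its own way.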

Advantages of adding a Web interface
• More accessible
  • No download required
  • Indexed by search engines
  • Cross-platform (Windows/Mac/Linux)
• Up-to-date version available as soon as deployed
  • Centralised bug fixes
  • New features
  • No distribution overhead
• Opportunity to collect usage information
  • E.g. model parameters
… aligned with e-Stat objectives

Disadvantages of Web interface
• “Constrained” by browser functionality
• Need to be online to use it
• Needs hosting resources
… fine for the code-generation app as it stands, but it would be too resource-intensive to run simulations and model-fitting on the server

[Demo of command-line and Web-based interfaces for MLPowSim]

Improving speed
• Another, parallel (so to speak ☺) objective is using parallelization to speed up the run-time of generated power calculation code
• Have taken an initial look at using the capabilities of multi-core processors by executing more than one run simultaneously
• Exploratory code makes use of Unix (Linux) ‘forking’ to create sub-processes
  • This approach will not work on Windows (since Windows does not support forks)
  • Precludes possibility of using this approach for MLwiN

Improving speed … contd.
• For now, doing tests on R code in Linux
Initial results (very rough, just a starting point):
• Model: 1-Level, Normal response, Fixed intercept, No explanatory variables
• R code with sample sizes from 400 to 600 in steps of 100 (i.e. 400, 500, 600)
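The forking idea can be sketched in a few lines of Ruby: one child process per sample size, each sending its result back to the parent through a pipe. This is not the exploratory code from the talk, which runs generated R code. Here a simple z-test on simulated normal data stands in for model fitting, and the helper names, the Box-Muller generator, the per-size seeds, and the use of `Marshal` over a pipe are all assumptions made for the example. It runs only on Unix-like systems, since `fork` is unavailable on Windows.

```ruby
BETA_0     = -0.140  # intercept estimate from the 1-level example input
SIGMA_SQ_E = 1.051   # level-1 variance from the same example
N_SIMS     = 1000
Z_CRIT     = 1.96    # normal critical value leaving 0.025 in each tail

# Draw one standard normal deviate with the Box-Muller transform.
def std_normal(rng)
  Math.sqrt(-2.0 * Math.log(1.0 - rng.rand)) * Math.cos(2.0 * Math::PI * rng.rand)
end

# Proportion of simulated datasets of size n in which the estimated
# intercept differs significantly from zero.
def estimate_power(n, seed)
  rng = Random.new(seed)
  hits = N_SIMS.times.count do
    ys   = Array.new(n) { BETA_0 + Math.sqrt(SIGMA_SQ_E) * std_normal(rng) }
    mean = ys.sum / n
    var  = ys.sum { |y| (y - mean)**2 } / (n - 1)
    (mean / Math.sqrt(var / n)).abs > Z_CRIT
  end
  hits.to_f / N_SIMS
end

# Fork one child per sample size; each child writes its result to a pipe
# and the parent collects the results once all children have started.
def parallel_power(sizes)
  jobs = sizes.map do |n|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(Marshal.dump(estimate_power(n, n)))
      writer.close
      exit!(0) # leave without running the parent's at_exit handlers
    end
    writer.close # parent keeps only the read end
    [n, pid, reader]
  end
  jobs.to_h do |n, pid, reader|
    power = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    [n, power]
  end
end

results = parallel_power([400, 500, 600])
results.each { |n, power| puts format("n = %d  power ~ %.3f", n, power) }
```

All children are started before any result is read, so on a multi-core machine the three sample sizes can be simulated at the same time.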

Improving speed … contd.

Improving speed … contd. Summary

Where to from here?
… this is just a small start …
• Extend MLPowSim to support more models
• Add test cases for code generation to cope with more models
• Add automated tests for verifying actual numerical output
• Further develop Web interface
• Continue investigating speed improvements through parallelization
