Linking administrative data to RCTs

John Jerrim (UCL Institute of Education)

What do we mean by administrative data?
- Central government records
- Typically available for every person in the population
- Not typically collected for research purposes, but rather for ‘record keeping’ / registration purposes
- Examples include: health, education, finance (tax), criminal records

Example: The National Pupil Database (NPD)
- One of the most widely used administrative datasets in England
- Data for all state school children in England; excludes (or is missing a lot of information for) private school pupils
- Test scores at ages 5, 7, 11, 14, 16 and 18
- Demographic information (e.g. FSM (free school meal eligibility), ethnicity, EAL (English as an additional language))
- Available from around the mid-1990s
- Now routinely linked to education RCTs in England (via EEF)
- England is lucky to have this data. Most countries don’t have it!

Example: Cluster (school) level data
- ‘Admin’ data doesn’t have to be at the individual level
- Can have information on an administrative unit that a person attends, e.g. a school, hospital, police station or prison
- Often more easily accessible than pupil-level data
- Particularly useful in cluster RCTs (when you are randomising the cluster itself)
Education examples
- School inspection (OFSTED) ratings
- School-level demographics (e.g. % of children eligible for FSM)
- School-level prior achievement

What are the benefits of admin data?
- Low cost
- Not intrusive to collect from participants
- Regularly updated with new information
- Often collected in a consistent way across individuals
- Low levels of missing data
- Low levels of measurement error
Together, this makes administrative data very attractive to include in our analysis of RCTs!

What are some of the main challenges we face in RCTs? (And how can admin data be used to try and resolve them?)

1. Boost statistical power

A lack of statistical power
- In education: mostly cluster RCTs
- Rather than randomise individuals, randomise whole schools
- Issue = the intra-cluster correlation, ICC (ρ), which leads to low power
EXAMPLE
- Secondary schools (clusters) = 100
- 200 children per school
- ρ = 0.20
- 20,000 pupils in trial
- Minimum detectable effect = 0.25 standard deviations
- 95% CI = 0 to 0.50 standard deviations

Example: admin data to ↑ power
- One way to ↑ power is to control for variables that are linked to the outcome
- Use the NPD for this purpose
EXAMPLE: Maths Mastery
- Year 7 pupils
- New way of teaching them maths
- Test at end of Year 7
- CONTROL for KS2 maths scores from the NPD
- Detectable effect = 0.36 without controls (CI = 0 to 0.72)
- Detectable effect = 0.22 with NPD controls (CI = 0 to 0.44)
MASSIVE BOOST TO POWER
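The detectable effects quoted in the two power examples can be roughly reproduced with a standard minimum detectable effect size (MDES) formula for a two-arm cluster RCT. The sketch below is not taken from the slides. It assumes equal allocation, equal cluster sizes, 80% power and a two-sided 5% test. For Maths Mastery it assumes 50 schools of 200 pupils (the figures given in the cost example) and carries over ρ = 0.20 from the first example. The R² of 0.63 for the NPD controls is hypothetical, chosen only because it gives roughly the 0.22 quoted, and is assumed to apply at both pupil and school level.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters, cluster_size, icc, r2=0.0, power=0.80, alpha=0.05, p_treat=0.5):
    """Approximate MDES (in SD units) for a two-arm cluster RCT.

    r2 is the share of outcome variance explained by baseline covariates,
    assumed here to be the same at cluster and pupil level.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # about 2.80
    between = icc * (1 - r2) / n_clusters
    within = (1 - icc) * (1 - r2) / (n_clusters * cluster_size)
    return multiplier * sqrt((between + within) / (p_treat * (1 - p_treat)))


# First example: 100 schools of 200 pupils, rho = 0.20, no covariates.
first_example = mdes(100, 200, 0.20)

# Maths Mastery: assumed 50 schools of 200 pupils and rho = 0.20.
mastery_no_controls = mdes(50, 200, 0.20)
mastery_with_npd = mdes(50, 200, 0.20, r2=0.63)  # r2 is a hypothetical value

print(round(first_example, 2))        # 0.25
print(round(mastery_no_controls, 2))  # 0.36
print(round(mastery_with_npd, 2))     # 0.22
```

Because the pupil-level term is divided by 10,000 or more, almost all of the variance comes from the between-school term, which is why adding schools or explaining between-school variance matters far more than adding pupils per school.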

2. Reduce evaluation cost.

Costly (including to test)
- Imagine it costs £5 to test each child in this trial: you have spent £100,000 just on a post-test!
- Got to deliver the intervention in 50 schools (expensive)
- Many EEF secondary school RCTs cost > £500,000
- Average detectable effect across trials = 0.25
- Big ££ for quite wide confidence intervals

Example: administrative data to reduce cost
- In the previous example, could have conducted a pre-test rather than use the NPD
- Maths Mastery in 50 schools of 200 children = 10,000 pupils
- £5 per test. Hence a pre-test would have cost a minimum of £50,000
- ADMINISTRATIVE DATA SAVED THIS MONEY
- NPD data is there, ready to use. LET’S USE IT!
- Doing a separate pre-test here would have had almost no benefit

3. Minimise attrition

Attrition
- Schools (and pupils within schools) drop out of the trial, particularly when assigned to the control group!
Problems
- Breaks randomisation. Loses key advantage of the RCT
- Lose power
Example (my trial)
- 50 schools: 25 treatment and 25 control
- Treatment follow-up = 23 / 25 schools
- Control follow-up = 9 / 25 schools
Worst of all worlds:
- Bias (selection effects)
- Low power
- High cost

Example: NPD to reduce attrition
- Schools would have had to take time out of maths lessons to conduct this pre-test
- There would be a significant administrative burden on them to conduct the test
- This burden is a major reason for control schools dropping out
Administrative data has:
(i) massively reduced the burden on schools
(ii) improved the validity of the trial

4. Allow long-run follow-up

Administrative data for long-run follow-up
- Test / follow-up is often immediately at the end of the trial, often when the intervention is most effective
- BUT we are really interested in long-run, lasting effects
- I.e. is there much point in ↑ age 11 test scores if pupils don’t do any better at age 16?
- Ideally want short, medium and long-term follow-up, but this again ↑ cost
- However, administrative data may include long-run follow-up information about individuals

5. Insight into external validity

External validity
- Most RCTs recruit participants via convenience sampling, not from a well-defined population
- How “weird” is our sample of trial participants? Mainly rich pupils? Only high-performing schools?
- How far can we generalise results?
BIG ISSUE:
- Will we still get an effect when we scale up / roll out?
BUT, FRANKLY, OFTEN IGNORED IN RCTs

NPD for external validity / generalisability
- Most RCTs are based upon non-random samples of willing participants. Big issue, but often glossed over!
- Without random samples, how do we know if study results generalise to a wider (target) population?
- Admin data gives us some handle on this
- As we have data for (almost) every child/person in the country, we can examine how similar trial participants are to the target population in terms of observable characteristics
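One simple way to examine how similar trial participants are to the target population is a standardised difference for each observable characteristic. The sketch below illustrates that general idea only; it is not the method used in any trial mentioned here, and every figure in it is made up.

```python
from math import sqrt


def standardised_difference(sample_mean, pop_mean, pop_sd):
    """Gap between trial sample and population, in population SD units."""
    return (sample_mean - pop_mean) / pop_sd


# Hypothetical population summaries (mean, SD) from an administrative dataset.
# For a binary variable such as FSM eligibility, SD = sqrt(p * (1 - p)).
population = {
    "fsm": (0.15, sqrt(0.15 * 0.85)),
    "ks2_points": (28.0, 4.0),
}

# Hypothetical means for the pupils recruited to a trial.
trial_sample = {"fsm": 0.40, "ks2_points": 26.5}

report = {
    name: standardised_difference(trial_sample[name], *population[name])
    for name in trial_sample
}

for name, gap in report.items():
    print(f"{name}: {gap:+.2f} SD")
```

Large gaps on characteristics that plausibly moderate the treatment effect are a warning that the trial estimate may not carry over to the target population. Similarity on observables does not guarantee similarity on unobservables.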

6. Additional characteristics in dataset

Additional characteristics
- Administrative records may include information we did not collect as part of our RCT:
  - because it was too difficult to
  - because it was too costly
  - because we forgot!?
- These are additional variables we can use in the analysis of our trial
- E.g. additional variables we can perform ‘balance checks’ with
- E.g. additional variables to examine heterogeneous effects

Example: Maths Mastery heterogeneous effects
- Linked in cluster (school) level administrative data on OFSTED (inspection) ratings
- Found big heterogeneity by OFSTED rating!
- ONLY POSSIBLE AFTER WE LINKED TO ADMIN DATA!

7. Potential for clever designs
See this paper: Improving recruitment of older people to clinical trials: use of the cohort multiple randomised controlled trial design. Age Ageing 2015. doi:10.1093/ageing/afv044

Step 1: Admin data on population
Step 2: Randomly ask people if they want to receive treatment. Control group = individuals not approached
Step 3: Follow up both groups in admin data
Points to note
1. You never make any contact with the control group!
2. If everyone you ask says yes, then you have a perfect RCT (both internal & external validity)
3. Statistical power very high
4. ‘Business as usual’ control (by necessity)
5. Non-compliance (people saying no when you approach them) is the issue (ITT vs CA-ITT analysis)

Issues with linking to administrative data

Sensitive data = high levels of data security
- Most administrative data is potentially identifiable: you know who the person is!
- Some data probably won’t be given to you (e.g. names)
- You may not be the one doing the linking; it may be left up to others (who may not do this correctly!)
- When you have access to linked data, you need to store it securely, e.g. in the UCL Safe Data Haven: https://www.ucl.ac.uk/isd/itforslms/services/handling-sens-data/tech-soln
- Potential for big penalties if you don’t abide by the rules: £500,000 fine, jail

Ethics and consent
- Participants usually need to give you consent to link their admin data to the RCT
- Opt-in consent = they need to tick the box saying that you can link
- Opt-out consent = they only need to contact you if they don’t consent
- Sometimes the person giving consent is not the participant themselves
Example (education)
- Opt-in consent from schools needed to access children’s NPD data
- Parents typically asked for opt-out consent
Ethical issue with long-term linking?
- What happens if your school and parent give consent to link when you are 10, but then you decide you don’t want this at age 18?
- Should we have to re-ask for consent once children become adults?

Practicalities. How do you link?
1. Unique ID
- Variable that uniquely identifies an individual in both datafiles to be merged
- E.g. UPN (unique pupil number) in the NPD; national insurance number in tax records
2. By name
- Individuals named in both datafiles
- Not as straightforward as it may sound!
- Names spelt wrong/differently across files
- Maiden vs married names
- Individuals with the same name (e.g. NPD and children called Mohammed in London)
3. By individual characteristics
- AKA ‘fuzzy matching’
- Need enough characteristics to identify individuals
- E.g. gender, date of birth, FSM etc. The more, the better!
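The three linking routes can be sketched in a few lines. This is a toy illustration under stated assumptions, not how the NPD or any data owner actually performs linkage: all records, field names and the 0.85 similarity threshold are hypothetical, and name similarity is scored with Python's standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical trial and administrative records (all values made up).
trial = [
    {"trial_id": "T1", "upn": "U100", "name": "Mohammed Ali", "dob": "2004-03-02", "gender": "M"},
    {"trial_id": "T2", "upn": None, "name": "Sara Khan", "dob": "2003-11-20", "gender": "F"},
    {"trial_id": "T3", "upn": None, "name": "Mohammed Ali", "dob": "2004-03-02", "gender": "M"},
]
admin = [
    {"upn": "U100", "name": "Mohammad Ali", "dob": "2004-03-02", "gender": "M", "ks2_maths": 29},
    {"upn": "U101", "name": "Mohammed Aly", "dob": "2004-03-02", "gender": "M", "ks2_maths": 25},
    {"upn": "U102", "name": "Sarah Khan", "dob": "2003-11-20", "gender": "F", "ks2_maths": 31},
]


def link_by_id(trial_rows, admin_rows):
    """Route 1: exact match on a unique identifier present in both files."""
    known = {row["upn"] for row in admin_rows}
    return {t["trial_id"]: t["upn"] for t in trial_rows if t["upn"] in known}


def name_similarity(a, b):
    """Similarity in [0, 1]; tolerant of small spelling differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_link(trial_rows, admin_rows, threshold=0.85):
    """Routes 2 and 3: similar name plus identical date of birth and gender."""
    linked, ambiguous = {}, []
    for t in trial_rows:
        matches = [
            a["upn"] for a in admin_rows
            if a["dob"] == t["dob"] and a["gender"] == t["gender"]
            and name_similarity(t["name"], a["name"]) >= threshold
        ]
        if len(matches) == 1:
            linked[t["trial_id"]] = matches[0]
        elif len(matches) > 1:
            ambiguous.append(t["trial_id"])  # too risky to link automatically
    return linked, ambiguous


by_id = link_by_id(trial, admin)
remaining = [t for t in trial if t["trial_id"] not in by_id]
by_fuzzy, ambiguous = fuzzy_link(remaining, admin)
print(by_id, by_fuzzy, ambiguous)
```

Here T1 links cleanly on its ID, T2 links despite the "Sara"/"Sarah" spelling difference, and T3 is left unlinked because two administrative records are equally plausible. Adding more characteristics is what resolves cases like T3.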

Case study: Chess in Schools and Communities. www.bbc.co.uk/news/education-13343943

The intervention
→ Children to receive 30 hours of chess lessons during one academic year (Year 5)
→ Follows a fully developed curriculum by the Chess in Schools and Communities (CSC) team
→ Chess lessons likely to be accompanied by an after-school chess club
RQ. Does teaching primary school children how to play chess lead to an improvement in their educational attainment?

Why is this of interest?
• In 30 countries (e.g. Russia) chess is part of the national curriculum
• ‘Well-known’ that it influences maths test scores (at least within the chess world!): “we have scientific support for what we have known all along--chess makes kids smarter!” (Chess Life, November, p. 16 / Johan Christiaen)
• Reasonably strong previous evidence: a cluster RCT in Italy produced an effect size of 0.35. Though caution: external validity!

Big previous effect sizes, but poor research designs

Why is this of interest?
• Intervention is VERY cheap to implement. If +ive impact, then also likely cost-effective!
• Fairly serious money invested in the project: £700K ($1m) for this RCT alone
• Putting men into primary schools
For more information see: http://www.psmcd.net/otherfiles/BenefitsOfChessInEdScreen2.pdf

An interesting feature of this particular RCT is that it used administrative data only!!

Step 1. Defined the population using administrative data
→ 11 LEAs (geographic areas) in England purposefully selected
→ Year 5 (age 9 / 10) children in the 2013/14 academic year (born Sep 2003 – Aug 2004)
→ Disadvantaged schools: > 37% of KS2 pupils eligible for FSM in the last six years
→ Total of 450 schools on population list (sampling frame)

Step 2. Pre-specified use of administrative data in study protocol
→ Primary outcome = Key Stage 2 maths test score
- National examination in England
- Children will sit it 1 year after end of intervention
- Due to sit tests in June 2015 (children age 11)
- ‘Intention to treat’ (ITT) analysis
- Information from NPD (administrative data)
- Should get 100% follow-up (very rare for an RCT!)
→ Secondary outcomes
- Maths sub-domains (e.g. mental arithmetic)
- English & science test scores

Step 3: Power calculation
Assumptions
- Between-school ICC = 0.15
- 60 children per school on average
- Correlation between pre- and post-test (Key Stage 1 and Key Stage 2 test scores) = 0.65
- 80% power for 95% CI
NOTE: We can base these assumptions on analysis of admin data from previous years! Strong basis!
With 100 schools, we can detect an effect size of 0.20. Hence recruit 100 schools.
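The slide does not say which formula produced the 0.20 figure. As a rough check only, the sketch below plugs the stated assumptions into a standard MDES formula for a two-arm cluster RCT with equal allocation, treating the squared pre/post correlation (0.65² ≈ 0.42) as the variance explained by KS1 scores. Whether that applies at pupil level only or at both pupil and school level is an assumption made here, so both variants are shown; they come out at about 0.22 and 0.17, either side of the 0.20 quoted.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters, cluster_size, icc, r2_school=0.0, r2_pupil=0.0,
         power=0.80, alpha=0.05, p_treat=0.5):
    """Approximate MDES (SD units) with separate R-squared at each level."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    between = icc * (1 - r2_school) / n_clusters
    within = (1 - icc) * (1 - r2_pupil) / (n_clusters * cluster_size)
    return multiplier * sqrt((between + within) / (p_treat * (1 - p_treat)))


r2 = 0.65 ** 2  # squared KS1/KS2 correlation, used as variance explained

# Variant A (assumption): KS1 scores explain pupil-level variance only.
pupil_only = mdes(100, 60, 0.15, r2_pupil=r2)

# Variant B (assumption): the same R-squared applies at school level too.
both_levels = mdes(100, 60, 0.15, r2_school=r2, r2_pupil=r2)

print(round(pupil_only, 2))   # 0.22
print(round(both_levels, 2))  # 0.17
```

The gap between the two variants shows how much the result depends on how well the pre-test explains between-school differences, which is exactly the kind of quantity that previous years of admin data can be used to estimate.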

Step 4: Selection of the ‘sample’ (and external validity)
→ Chess in Schools given list of all 450 schools
→ Asked to recruit 100 from this list
→ Sampling fraction of around 22%
→ How does our sample of children from the 100 recruited schools compare to the ‘population’ of children from the 450 schools?
→ USE ADMINISTRATIVE DATA TO FIND OUT!

Example: Using the NPD to investigate external validity (Chess in Schools)
- Able to show participants very similar to the population of interest (in terms of observables)
- But very different to the population of England as a whole!

Step 5: Random assignment
→ Stratify schools into 9 groups
- 3*3 matrix of %FSM and KS2 test scores at school level
→ Randomly select schools from within each of these strata
→ 50 treatment schools (children taught chess)
→ 50 control schools (business as usual)
→ All children within these schools take part in the trial
→ Q. WAS BALANCE ACHIEVED? A. USE ADMINISTRATIVE DATA TO FIND OUT!
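A stratified school-level randomisation of this kind can be sketched as follows. This is one possible scheme under stated assumptions, not the trial's actual procedure: the strata are formed from rank-based thirds of each variable, schools are shuffled within strata and then allocated alternately so that the arms end up 50/50, and the school data are randomly generated placeholders.

```python
import random
from collections import Counter, defaultdict


def thirds(values):
    """Map each value to 0, 1 or 2 according to its rank-based third."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * 3 // len(values)
    return groups


def stratified_assignment(schools, rng):
    """Allocate schools to arms within a 3x3 grid of %FSM by KS2 score."""
    fsm_group = thirds([s["pct_fsm"] for s in schools])
    ks2_group = thirds([s["ks2_mean"] for s in schools])
    strata = defaultdict(list)
    for school, f, k in zip(schools, fsm_group, ks2_group):
        strata[(f, k)].append(school["id"])

    arms = ["treatment", "control"]
    rng.shuffle(arms)  # random choice of which arm is allocated first
    allocation, stratum_of, position = {}, {}, 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within the stratum
        for school_id in members:
            allocation[school_id] = arms[position % 2]  # alternate arms
            stratum_of[school_id] = key
            position += 1
    return allocation, stratum_of


rng = random.Random(2013)  # fixed seed so the allocation is reproducible
# Hypothetical school-level admin data for 100 recruited schools.
schools = [
    {"id": i, "pct_fsm": rng.uniform(37, 80), "ks2_mean": rng.gauss(27, 2)}
    for i in range(100)
]
allocation, stratum_of = stratified_assignment(schools, rng)
counts = Counter(allocation.values())

# Within every stratum the two arms differ by at most one school.
per_stratum = defaultdict(Counter)
for school_id, arm in allocation.items():
    per_stratum[stratum_of[school_id]][arm] += 1
max_gap = max(abs(c["treatment"] - c["control"]) for c in per_stratum.values())
print(counts, max_gap)
```

Stratifying on variables taken from administrative data guarantees near-balance on those variables by construction; balance on everything else still has to be checked afterwards.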

Balance on prior achievement using admin data
- Balance on KS1 average points scores
- These are tests children took at age 7, two years before the intervention took place (but that’s OK!)

Balance on other characteristics
[Figure panels: ethnicity, pre-test, SES]

By using NPD, we almost eliminated attrition
- Clever design with NPD data means we can (almost) eliminate drop-out
EXAMPLE: Chess in Schools
- Year 5 children learn how to play chess during one school year
- 50 treatment schools receive chess
- 50 control schools = ‘business as usual’
- Use age 7 (Key Stage 1) scores as the pre-test
- Use age 11 (Key Stage 2) scores as the post-test
- Almost no burden on schools (no testing to be done)
- Key Stage 2 results for all children
- Have test scores even if they move schools
- Should have very little attrition

Allocation
- Randomised: school n = 100, pupil n = 4,009
- Intervention: school n = 50, pupil n = 2,055
- Control: school n = 50, pupil n = 1,954
Analysis
- Intervention analysed: school n = 50, pupil n = 1,965
- Control analysed: school n = 50, pupil n = 1,900
Almost zero attrition!
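The follow-up rates implied by these counts are simple to compute, and they can be set against the school-level figures from the earlier attrition example (23 of 25 treatment schools and 9 of 25 control schools followed up). The snippet below is only arithmetic on numbers already given in the slides.

```python
def follow_up_rate(analysed, randomised):
    """Share of randomised units that appear in the final analysis."""
    return analysed / randomised


# Chess in Schools: pupils analysed vs pupils randomised, by arm.
chess = {"intervention": (1965, 2055), "control": (1900, 1954)}
chess_rates = {arm: follow_up_rate(a, r) for arm, (a, r) in chess.items()}
chess_overall = follow_up_rate(1965 + 1900, 2055 + 1954)

# Earlier attrition example: schools followed up vs schools randomised.
earlier = {"treatment": (23, 25), "control": (9, 25)}
earlier_rates = {arm: follow_up_rate(a, r) for arm, (a, r) in earlier.items()}

for arm, rate in chess_rates.items():
    print(f"chess {arm}: {rate:.1%}")      # 95.6% and 97.2%
print(f"chess overall: {chess_overall:.1%}")  # 96.4%
for arm, rate in earlier_rates.items():
    print(f"earlier {arm}: {rate:.0%}")    # 92% and 36%
```

The gap between arms matters as much as the overall rate: in the chess trial it is under two percentage points, whereas in the earlier example it is 56 points.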

Did it work?
- Outcomes 1 year post-intervention
- Answer: NO!
- Note: able to look at heterogeneous effects by FSM due to the admin data link

Planned long-run follow-up (using admin data)
- Trial conducted in Year 5 (age 9/10)
- First follow-up at end of Year 6 (age 10/11)
- Treatment and control children then move on to secondary school
- Will be able to track these children via their unique pupil number
Hence long-run follow-up:
- Do treatment children do better in maths GCSE (age 16)?
- Are they more likely to study maths post-16?
- Are they more likely to enter a high-status university?
Administrative data means we can answer these questions at little extra cost.
Can answer the question: is there a lasting impact of the treatment?

Limitations
Exclusive use of administrative data meant:
1. Could only look at educational attainment measures, and not at impact upon ‘non-cognitive’ skills.
2. Outcome measured one year after intervention. Might there have been an immediate effect?
3. Statistical power would probably have been higher with a specific pre-test, but this would also have been costly!
4. Balance checks and heterogeneous effects limited to characteristics observable in administrative data only.
