Crime Section, Central Statistics Office.

Case Study: Matching Criminal Justice Administrative Datasets in the Absence of Common Unique Identifiers

Acknowledgments • The Crime Section would like to acknowledge the assistance provided by the Probation Service in this project. • In particular, we would like to thank Michael Donnellan and Aidan Gormley.

Areas of Discussion • Connectivity between the various Criminal Justice Database Systems • The Challenge - Absence of unique identifier • The Solution – CSO statistical matching. • Results of matching exercise • Future Goals

Connectivity between the various Criminal Justice Database Systems • Robust links between PULSE and CCTS. • Tenuous link between PULSE/CCTS and Probation • Need to make these links into strong links - but how?

The Challenge – Absence of common unique identifier. • Common unique identifier allows rapid integration of datasets. • The common identifiers between PULSE and CCTS include Charge No., Summons No. • These are linked to the Person PULSE ID in PULSE, to allow linking by individual. • Result: Able to produce statistics combining police and court outcome data. • However, there is a problem....

The Challenge – Linking Probation and PULSE data • No such common identifier exists between CCTS/PULSE and Probation. • The Probation Service uses its own unique identifiers. • There is no link between these and PULSE identifiers such as Person PULSE ID and Court Outcome number. • Result: the datasets cannot be linked, so combined statistics cannot be produced.

The Challenge, and its solution • But a solution exists. • If persons in the separate systems can be matched on variables that exist in both systems, then a table linking their unique identifiers can be produced. • Variables such as first name, surname, date of birth and address exist in both systems. • These can be used to link the two systems. • This is the basis of the CSO solution.

The Solution – CSO statistical matching. • The CSO received a test dataset from the Probation Service, for years 2007 and 2008. • Over 8700 data orders with corresponding info. • First, a manual matching exercise was carried out to test feasibility. • Matching by first name, surname, address and date of birth on over 7800 probation records. • A random sample of 800 records was matched by hand. • It took 8.5 person-days to process this 10% sample. • At this rate, it would have taken over 90 days to process the entire dataset.

The Solution – CSO statistical matching. • The next step was to automate the matching process for the entire dataset. • A fully automated matching solution was not really possible. • Instead, a mixed-model method combining automatic and manual matching was used to achieve 99% matching. • 70% of records were matched automatically, with no human involvement. • This match was on first name, surname and date of birth.

The Solution – CSO statistical matching. • Additional sorting/matching algorithms were used to simplify manual matching of the remaining 28%. • There were four additional stages, with progressively increasing human involvement. • These identify cases where, for example, age or address data does not match. • The processes are still mainly automated and algorithm-based, so they are fast to run. • The entire process was completed in 2 man-days, with 99% of all the records (7,800+) matched. • This compares with the projected 90+ man-days for fully manual matching.
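The staged approach described above can be sketched as a simple cascade in which each stage only sees the records that earlier stages failed to match. This is an illustrative Python sketch, not the CSO's SAS implementation: the function `run_cascade`, the toy matchers and all record values are assumptions invented for the example.

```python
def run_cascade(records, stages):
    """Apply matching stages in order; each stage sees only unmatched records.

    `stages` is a list of (name, matcher) pairs. A matcher takes a record and
    returns a matched identifier, or None when it cannot match the record.
    """
    if not records:
        return {}, [], []
    matched = {}
    remaining = list(records)
    progress = []  # cumulative match rate after each stage
    for name, matcher in stages:
        still_unmatched = []
        for rec in remaining:
            target = matcher(rec)
            if target is None:
                still_unmatched.append(rec)
            else:
                matched[rec["probation_id"]] = target
        remaining = still_unmatched
        progress.append((name, len(matched) / len(records)))
    return matched, remaining, progress


# Invented illustrative data, not real records.
pulse_index = {"mary byrne": "PU1", "john obrien": "PU2"}
records = [
    {"probation_id": "PR1", "name": "mary byrne"},
    {"probation_id": "PR2", "name": "john o'brien"},
    {"probation_id": "PR3", "name": "liz murphy"},
    {"probation_id": "PR4", "name": "tom walsh"},
]
stages = [
    ("exact", lambda rec: pulse_index.get(rec["name"])),
    ("apostrophes removed", lambda rec: pulse_index.get(rec["name"].replace("'", ""))),
]

matched, remaining, progress = run_cascade(records, stages)
```

Whatever is left in `remaining` after the last automatic stage is what would be passed on for manual review.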

The Solution – CSO statistical matching. • Step One. • Both datasets are sorted by names, addresses and dates of birth. • NB: All datasets shown are merely representations, not actual data.

The Solution • These are large datasets.


The Solution – CSO statistical matching. • Step Two. • The probation and PULSE records are matched automatically by names and date of birth, using SAS. • 70% of entries are matched automatically this way. • For each probation ID, the corresponding PULSE IDs are listed. • A person may have multiple PULSE IDs for a single probation ID.
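The exact-match step can be illustrated with a short sketch. The original work used SAS; the Python below is only an approximation of the idea, and the field names (`first_name`, `surname`, `dob`, `probation_id`, `pulse_id`), the case-insensitive comparison and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def match_key(record):
    """Build a comparison key from first name, surname and date of birth."""
    return (
        record["first_name"].strip().lower(),
        record["surname"].strip().lower(),
        record["dob"],
    )


def exact_match(probation_records, pulse_records):
    """Return {probation_id: sorted PULSE IDs that share the same key}."""
    pulse_by_key = defaultdict(set)
    for rec in pulse_records:
        pulse_by_key[match_key(rec)].add(rec["pulse_id"])

    links = {}
    for rec in probation_records:
        ids = pulse_by_key.get(match_key(rec))
        if ids:
            links[rec["probation_id"]] = sorted(ids)
    return links


# Invented illustrative records, not real data.
probation = [
    {"probation_id": "PR1", "first_name": "Mary", "surname": "Byrne", "dob": "1980-02-14"},
    {"probation_id": "PR2", "first_name": "John", "surname": "O'Brien", "dob": "1975-07-01"},
]
pulse = [
    {"pulse_id": "PU10", "first_name": "MARY", "surname": "Byrne", "dob": "1980-02-14"},
    {"pulse_id": "PU11", "first_name": "Mary", "surname": "BYRNE", "dob": "1980-02-14"},
    {"pulse_id": "PU12", "first_name": "John", "surname": "OBrien", "dob": "1975-07-01"},
]

links = exact_match(probation, pulse)
match_rate = len(links) / len(probation)  # share of probation IDs with a match
```

Here `PR1` is linked to two PULSE IDs, while `PR2` stays unmatched because its surname is spelled differently in the two datasets.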

The Solution – CSO statistical matching (ctd.) • Step Three. • The next step is to ensure that surnames with the prefix “O’” are recorded in the same manner in both datasets • Step has minimal human involvement. • One dataset records “O’ ” as “O” • This is not detected or matched in initial stage • This can be performed with an automatic software “Replace” function • When the automatic matching (Step Two) is run again: • Now 85% of records match automatically.
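A minimal sketch of this kind of prefix normalisation follows. It assumes the chosen convention is to drop the apostrophe (so "O'Brien" becomes "OBrien"), which is one plausible reading of the slide; the function name `normalise_o_prefix` is hypothetical and the original used a software "Replace" function rather than this code.

```python
import re

# A leading "O", then a straight or curly apostrophe, then optional spaces.
_O_PREFIX = re.compile(r"^O['\u2019]\s*", re.IGNORECASE)


def normalise_o_prefix(surname):
    """Rewrite "O'Brien" or "O’ Brien" as "OBrien"; leave other surnames alone."""
    return _O_PREFIX.sub("O", surname.strip())


examples = ["O'Brien", "O\u2019 Neill", "OBrien", "Byrne"]
normalised = [normalise_o_prefix(name) for name in examples]
```

Applying the same function to the surname column of both datasets before re-running the exact match means both spellings end up with the same key.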

The Solution – CSO statistical matching (ctd.) • Step Four. • The next step is to match cases where the surname and date of birth match and the first names are closely related. • This step has more human involvement. • Geographical info is used as a further check. • This allows us to find aliases. • Example shown here: • Although “Liz” and “Elizabeth”, and “Alex” and “Lex”, differ, it is clear that each pair refers to the same person.
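One way to surface such alias candidates for a human reviewer is a nickname lookup, sketched below. The `NICKNAMES` table, the `county` field used as the geographical check, and the sample records are all hypothetical; the slides do not say how related first names were detected.

```python
# Hypothetical lookup from diminutives to a canonical first name.
NICKNAMES = {
    "liz": "elizabeth",
    "alex": "alexander",
    "lex": "alexander",
}


def canonical_first_name(name):
    """Map a first name to its canonical form, or return it lower-cased."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def alias_candidates(probation_records, pulse_records):
    """List pairs with equal surname and date of birth whose first names
    differ but share a canonical form. The third element flags whether the
    geographical field agrees, as a further check for the reviewer."""
    candidates = []
    for p in probation_records:
        for q in pulse_records:
            same_surname_dob = (
                p["surname"].lower() == q["surname"].lower() and p["dob"] == q["dob"]
            )
            names_differ = p["first_name"].lower() != q["first_name"].lower()
            related = canonical_first_name(p["first_name"]) == canonical_first_name(q["first_name"])
            if same_surname_dob and names_differ and related:
                candidates.append((p["probation_id"], q["pulse_id"], p["county"] == q["county"]))
    return candidates


# Invented illustrative records, not real data.
probation = [
    {"probation_id": "PR1", "first_name": "Liz", "surname": "Murphy", "dob": "1982-03-05", "county": "Cork"},
    {"probation_id": "PR2", "first_name": "Alex", "surname": "Kelly", "dob": "1990-11-20", "county": "Dublin"},
]
pulse = [
    {"pulse_id": "PU1", "first_name": "Elizabeth", "surname": "Murphy", "dob": "1982-03-05", "county": "Cork"},
    {"pulse_id": "PU2", "first_name": "Lex", "surname": "Kelly", "dob": "1990-11-20", "county": "Galway"},
    {"pulse_id": "PU3", "first_name": "Liz", "surname": "Murphy", "dob": "1970-01-01", "county": "Cork"},
]

candidates = alias_candidates(probation, pulse)
```

The output is a list of suggestions rather than confirmed matches, which reflects the greater human involvement at this stage.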

The Solution – CSO statistical matching (ctd.) • Step Five. • Additional matching steps are then carried out. • One is to check for matching first names, surnames and geographical info where the dates of birth differ. • Special checks can identify matching cases here. • Another set of checks searches for matching first name and date of birth but slightly different surnames. • Together, all these steps lead to a match rate of over 95%. • The final step is a fully manual operation to match the remaining 5%.
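The two kinds of check described in Step Five might look something like the sketch below. The slides do not specify the actual rules, so everything here is an assumption: the similarity measure (Python's `difflib.SequenceMatcher`), the 0.85 threshold, the `county` field and the day/month swap test are all illustrative choices.

```python
from difflib import SequenceMatcher


def surname_similarity(a, b):
    """Similarity ratio between two surnames, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dob_fields_swapped(dob_a, dob_b):
    """True when two YYYY-MM-DD dates differ only by swapped day and month."""
    year_a, month_a, day_a = dob_a.split("-")
    year_b, month_b, day_b = dob_b.split("-")
    return dob_a != dob_b and year_a == year_b and month_a == day_b and day_a == month_b


def classify_pair(p, q, surname_threshold=0.85):
    """Return a review label for a candidate pair, or None if neither check fires."""
    same_first = p["first_name"].lower() == q["first_name"].lower()
    same_surname = p["surname"].lower() == q["surname"].lower()
    same_dob = p["dob"] == q["dob"]
    if same_first and same_surname and not same_dob and p["county"] == q["county"]:
        return "review: date of birth differs"
    if (same_first and same_dob and not same_surname
            and surname_similarity(p["surname"], q["surname"]) >= surname_threshold):
        return "review: surname differs slightly"
    return None


# Invented illustrative records, not real data.
a = {"first_name": "Tom", "surname": "Byrne", "dob": "1980-02-03", "county": "Cork"}
b = {"first_name": "Tom", "surname": "Byrne", "dob": "1980-03-02", "county": "Cork"}
c = {"first_name": "Tom", "surname": "Byrnes", "dob": "1980-02-03", "county": "Cork"}

label_dob = classify_pair(a, b)
label_surname = classify_pair(a, c)
```

A helper such as `dob_fields_swapped` could be used to rank the "date of birth differs" cases so that likely data-entry errors are reviewed first.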

Results • The CSO produced detailed results from this linkage. • Tables were produced showing: • Table A: Number of subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08 • Table B: Subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08, as a percentage of the Original Primary Offence • Table C: Subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08, as a percentage of total original primary offences • Table D: Subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08, as a percentage of total subsequent First Offences • Unfortunately, we can show only sample data here.

Future Goals • Further development of matching model. • To incorporate text analysis, fuzzy matching. • To develop a fully automatic process to match to 99%.

Conclusion • This project shows a simple, effective solution to integrating datasets in the absence of a common identifier. • This project doesn’t invalidate the importance of development of unique identifiers. • But it does allow matching of records where it is not feasible to retroactively apply any planned common identifier. • This method is not limited to Criminal Justice Administrative Data. • It can be applied to any datasets with common information on names, dates of birth etc.
