160 likes | 299 Views
Finding a Person!. Building an algorithm to search for existing people in a system. Rahn Lieberman Manager Emdeon Corp (Emdeon.com). rahn@theliebermans.com http://rahnsworld.com. Understanding the problem. We work with hospitals around the county
E N D
Finding a Person! Building an algorithm to search for existing people in a system Rahn Lieberman Manager Emdeon Corp (Emdeon.com) rahn@theliebermans.com http://rahnsworld.com
Understanding the problem • We work with hospitals around the county • These hospitals (our customers) send us sets of people to work with • We put these people into our internal system for our workflow rahn@theliebermans.com http://rahnsworld.com
Understanding the problem • A little bit of complication: • A person can visit a hospital more than one time. • When this happens, we need to recognize the person so we don’t load them repeatedly • And more complication: - The person can then visit another hospital, and we still need to recognize them rahn@theliebermans.com http://rahnsworld.com
Our Environment • We get data from customers in batch files and through real-time (HL7) interfaces • Data base be entered into our system via our interfaces or manually • We process upwards of a quarter million records a day, importing approximately 10K new records. • Electronic entry accounts for approximately 85% of the data entry rahn@theliebermans.com http://rahnsworld.com
Bragging • We get reports of duplicate data being entered approximate once every 2-3 months. • On investigation, these are almost always user errors. • These are so rare, I don’t have statistics on actual number of errors. rahn@theliebermans.com http://rahnsworld.com
How’s it Done? • What to Know: • Know your data! • Know that it will take some time to get it right • And of course, know your users • The data is the actual elements of what you’re working with. Names, birthdates, etc. rahn@theliebermans.com http://rahnsworld.com
How’s it Done? (cont.) • Start with your basic object first. In my case, it’s a person • Take stock of common elements your object that should always be know, and are commonly part of your data: • First Name, Last Name, Middle Initial • Date of Birth • Gender • Is that enough to uniquely ID someone? • If not, keep adding, but realize you may not get everything you want rahn@theliebermans.com http://rahnsworld.com
How’s it Done? (cont.) • How about adding an address and a social security number? • That gets close to giving enough information. • What? SSN’s are unique! • Not everyone has one, or know what it is • People may share them or make them up rahn@theliebermans.com http://rahnsworld.com
How’s it Done? (cont.) Do the same thing for other items in your domain. In my case, hospitals have standard data elements: • Account Number, Medical Record Number • Unique identifier in our system to distinguish the hospitals Once you have these core data elements, setup a query in your database for them, allowing elements to by null • You’ll need to know how data is stored (null versus empty strings, default values, etc.) rahn@theliebermans.com http://rahnsworld.com
How’s it Done? (cont.) • Search and weigh the results • Go through all reasonable variations of your data, and collect the PersonID found in your system if one is found. • Create a dictionary of all values found • After all values are collected, sort them and count up the number of like PersonID’s • The PersonID with the most hits is most likely the correct person • Note that if none are found, then you end up with a “0” as the most common PersonID. Be sure to account for this. rahn@theliebermans.com http://rahnsworld.com
Example of weighing • First + Last + DOB + Account Number + CustomerID: 20 • Last + SSN: 20 • Last + Account Number + DOB: 20 • First + Medical Record: 19 • First + Address + CustomerID: 19 • Total count of 20: 3 • Total count of 19: 2 • Our person is PersonID 20 rahn@theliebermans.com http://rahnsworld.com
More details • The key is the number of searches, and making sure the searches provide unique results. • In my production system, we do 30+ searches • F+M+L names + DOB + Address information • SSN searches with first and last names (need to account for last name changes) • Customer searches based on our entire customer hierarchy • I even go as far as cleaning the data to remove dashes, hyphens and other possible data that may not be entered consistently rahn@theliebermans.com http://rahnsworld.com
Testing • Search for a know person in your database (i.e. PersonID 1, where you know all the data about them) • More complicated tests: • Pull X number of records, remove some of the data from them, and try to find them again. • This works well because you always know what you’re expected to find • Sorry, SQL Compact doesn’t support TOP, so I can’t demonstrate. rahn@theliebermans.com http://rahnsworld.com
Known issues • Babies are hard to match because they don’t have SSN’s and they don’t have names • Most hospitals will enter a baby with a generic name, Baby Boy Smith, then change it in their system at a later date(after paperwork has been filed) • If a Senior and a Junior live at the same address, and we don’t have both of them in our system already, we may get a false result • Speed! 30 queries take a long time to run. I’m looking into ways to speed this up • Running multiple instances of the import/search at the same time will fail. We have safeguards against this in place rahn@theliebermans.com http://rahnsworld.com
Demo • This sample data is ALL FAKE! • Do not think these are real people. If they are, it’s purely a coincidence • Social Security numbers are the phone numbers minus 1 digit • This is not my full production system. It’s been simplified to show proof of concept, and it uses SQL Compact for ease of demonstrating offline rahn@theliebermans.com http://rahnsworld.com
Questions • Slides and code available at http://rahnsworld.com rahn@theliebermans.com http://rahnsworld.com