150 likes | 301 Views
Cleansing and Geocoding Spatial Data. Jyoti Kamal, PhD . Introduction.
E N D
Cleansing and Geocoding Spatial Data Jyoti Kamal, PhD
Introduction In developing an enterprise-wide clinical and financial data warehouse at the Ohio State University Medical Center (OSUMC), we invested tremendous time and effort in analyzing the quality of historical data coming from the source system as a first step to our commitment to data quality. The OSUMC has developed an in house technique and process to cleanse and geocode spatial data in a single pass on the city, state and zip code combination. In developing this technique and process there were several goals in mind.
Goals Identify potential marketing regions for hospital outreach clinic services. Identify regions by patient demographics for direct marketing mailings. Correlate abnormal diagnosis with patient regional distributions. Identify geographic correlations with other patient trends. Identify regions by physician referrals. Bio-surveillance: Retrospective public health surveillance.
Available Data 7 years worth of patient addresses 7 years worth of patient demographics Data Issues Unknown integrity of data No geocoding of data
Cleansing Logic and Process Purchased list of valid city, state and zip code combinations along with zip code to county mapping, zip code centered latitude and longitude coordinates Data loaded into warehouse and refreshed monthly to keep synchronized with United States Postal Service Tool: Informix Data Stage TM used for extraction, transformation and load.
Cleansing Logic and Process continued Figure 1. Cleansing logic showing different transform stages and output files.
Cleansing Logic and Process continued Table 1. Sample of data before and after cleansing and the additional geographic information attached with each output record. Changes in the cleansed data are italicized
Spatial Analysis To provide analytical ease we divided the geographic area around OSUMC in predefined rings and sectors. To define the rings we chose imaginary concentric circles in increments of ten miles (larger increment after hundred miles) and partitioned them in eight equal sectors of 45 degrees, each indicating a broad different direction. One can, in fact, choose any increment and angle depending on the granularity of distribution desired. With this picture in mind, we could write an application, which in combination with patient zip code and the data shown in table 2, could determine which ring and sector the patient zip codes would fall. From this information, in turn, it was easy to determine what percent of the OSUMC patient population fell in which ring sector
Spatial Analysis Table 2. Sample of distance and bearing calculations in reference to the OSUMC
Spatial Analysis Figure 2. Center of these circles is the OSUMC. The rings are drawn in increments of 10 miles and the circles are divided in eight equal sectors of 45 degrees, each signifying a broad direction. N: North; NE: North East; E: East; SE: South East; S: South; SW: South West; W: West; NW: North West
Results Figure 3. Patient distribution by distance and bearing in relationship to the OSUMC
Results Figure 4. Top five services by patient count and gender distribution for patients that are in the age range of 40-45, live within 10 to 20 miles on the south east direction of the OSUMC
Results Figure 5. Patient distribution by in relationship to the OSUMC for oncology services
Conclusions Process was sufficient for the OSUMC Inexpensive process Extended the application beyond patient data to physician data Satisfied business analysis and marketing personnel with service, gender and age patient distribution data
Acknowledgments Authors thank Dr. Joel Saltz, Dr. Hagop Mekhjian and Maxine Moehring for their support and encouragement. The ongoing help of the Data warehouse team, in particular, Jennifer Santangelo, Mike Ostrander and Israel Rosales is greatly appreciated. Kevin Li and Mike Ostrander’s work on the validation process and OLAP application is invaluable.