Data cleaning: hints and tips

Data cleaning: hints and tips. Felicity Clemens, Stata Users’ Group meeting, London, 17 & 18th May 2005. Introduction: data cleaning is one of the most time-consuming jobs of all! There are many ways of attacking the same problem when using Stata.


Presentation Transcript


  1. Data cleaning: hints and tips Felicity Clemens Stata Users’ Group meeting London, 17 & 18th May 2005 Felicity Clemens 18 May 2005

  2. Introduction
     • Data cleaning is one of the most time-consuming jobs of all!
     • There are many ways of attacking the same problem when using Stata
     • The talk will describe some common problems and propose possible solutions
     • These are mostly reminders!

  3. Contents
     • Introduction to the first datasets
     • Identifying and removing duplicates – by hand
     • Merging data and uses of the merge command
     • Generating a moving target variable

  4. The study
     • A case-control study carried out across three central European countries
     • Exposure of interest: exposure to chemicals in the environment
     • Outcome of interest: cancer

  6. Identifying duplicates in a dataset
     • This can be done automatically (using the duplicates set of commands)
     • We will demonstrate a manual method of identifying duplicates
     • Two different possibilities:
       • The same data have been entered on more than one occasion;
       • Different data have been entered using the same identifier (id numbers)
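The two kinds of duplicate described on this slide can be told apart by counting records per id and records per id with identical data. The following is only a minimal sketch of one manual approach, not the code shown in the talk; the toy dataset and the variable names (`age`, `sex`, `n_id`, `n_same`, `duptype`) are assumptions made for illustration.

```stata
* Toy data (made up): id 2 was entered twice with the same data,
* id 3 was used for two different people
clear
input id age str1 sex
1 34 "M"
2 51 "F"
2 51 "F"
3 47 "M"
3 62 "F"
4 29 "F"
end

* By hand: how many records share this id, and how many share id AND data?
bysort id: generate n_id = _N
bysort id age sex: generate n_same = _N

* Classify each record
generate byte duptype = 0
replace duptype = 1 if n_id > 1 & n_same == n_id   // same data re-entered
replace duptype = 2 if n_id > 1 & n_same <  n_id   // same id, different data
list, sepby(id)

* Keep one copy of each exact duplicate; type 2 needs checking by hand
bysort id age sex: drop if _n > 1 & duptype == 1

* The automatic route mentioned on the slide
duplicates report id
```

Records flagged as type 2 are left in the data on purpose, since deciding which id is wrong usually means going back to the original forms.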

  7. The merge command
     A necessary command in the data management of most big studies.
     There are many different uses of the merge command. We look at two of them:
     • Simple merge on id
     • Multiple merge on id
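As a rough illustration of the two uses listed on this slide, the sketch below merges a master file with one file on id, then with a second file. It assumes the `merge 1:1` syntax of Stata 11 or later, which postdates the talk, and it reads "multiple merge" as merging several files on the same id in succession, which may not be exactly what was presented. All datasets, variable names and values are made up.

```stata
* Toy using-file 1: exposure measurements (hypothetical)
clear
input id exposure
1 0.8
2 1.4
3 0.3
end
tempfile exposure_data
save `exposure_data'

* Toy using-file 2: interview data (hypothetical)
clear
input id smoker
1 1
2 0
4 1
end
tempfile interview_data
save `interview_data'

* Master file: case-control status
clear
input id case
1 1
2 0
3 1
4 0
end

* Simple merge on id; _merge records where each row came from
merge 1:1 id using `exposure_data'
tabulate _merge
rename _merge merge_exposure

* Second merge on the same id, with its own indicator variable
merge 1:1 id using `interview_data', generate(merge_interview)
list
```

Tabulating the merge indicator after every merge is a cheap cleaning check: rows found only in the master or only in the using file often point to mistyped ids. Because the example uses temporary files, it should be run as a single do-file.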

  8. Identifying a moving target
     • Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
     • Problem: we need to identify the year, counting backwards from 2002, in which the chemical concentration changed from its 2002 level
     • Why? We need to overwrite the 2002 value with a new value, and keep overwriting backwards until the year in which the value changed

  9. Identifying a moving target (2)

  10. Identifying a moving target (3)
     We will use the forval loop to examine the relationship between each year’s observed value and the observed value for the previous year
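The slide does not reproduce the loop itself, so the following is only a sketch of how such a `forvalues` loop might look. It assumes the data are in wide form with one row per town and one variable per year (`conc1998` to `conc2002` here, shortened from 1982 to 2002 to keep the toy data small); the variable names `firstsame` and `changeyear` are hypothetical.

```stata
* Toy data (made up): one row per town, one concentration per year
clear
input str1 town conc1998 conc1999 conc2000 conc2001 conc2002
"A" 5 5 7 7 7
"B" 3 4 4 4 4
"C" 2 2 2 2 2
end

* firstsame = earliest year of the unbroken run that ends in 2002
generate firstsame = 2002

* Step backwards, comparing each year with the previous year
forvalues y = 2002(-1)1999 {
    local prev = `y' - 1
    replace firstsame = `prev' if firstsame == `y' & conc`prev' == conc`y'
}

* Last year with a different value; missing if nothing changed in the window
generate changeyear = firstsame - 1 if firstsame > 1998
list town firstsame changeyear
```

The years from `firstsame` to 2002 are then the ones to overwrite with the new value. Note that Stata treats two missing values as equal, so gaps in the yearly data would need handling before a loop like this is run.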

  11. Summary
     • Identifying duplicates – can be done by hand or automatically using the “duplicates” set of commands
     • Use of the merge command – to merge on a specific variable, or to merge multiple datasets
     • Generating a moving target variable – the use of the “forval” loop
