250 likes | 402 Views
Managing Your Own Data (…if you have to). Kathryn A. Carson, Sc.M . Senior Research Associate Department of Epidemiology Johns Hopkins Bloomberg School of Public Health kcarson@jhmi.edu. Overview. Principles of Data System Design Data entry and management systems Self managed systems
E N D
Managing Your Own Data (…if you have to) Kathryn A. Carson, Sc.M. Senior Research Associate Department of Epidemiology Johns Hopkins Bloomberg School of Public Health kcarson@jhmi.edu
Overview • Principles of Data System Design • Data entry and management systems • Self managed systems • How to manage data in MS Excel • Programmer managed systems • How to manage data in MS Access • Sample data sets • How to prepare data for statistical analysis • Confidentiality/security
Principles of Data System Design • Data Input/Entry • What resources are available • Amount of data • Set up time versus usage • Double data entry • Data Validation • Data type (e.g., numeric, date, text ) • Range checks • Missing data • Violation of protocol checks • Coding and spelling errors • Consistency checks
Principles of Data System Design (2) • Data Audit • Computer audits • Manual audits • Data Edit • Single line through incorrect value on data form, write correct value, initial and date • Make same changes to database file • Data maintenance • Single database file • Data archive • Have plans to archive data after the end of the study • Data can be archived on a CD • Data need to be stored for at least five years
Principles of Data System Design (3) • Identification of Study Participant • Do not use names, hospital history numbers or Social Security numbers • Patients should be identified with a unique study assigned ID number • Maintain a log linking the patient’s name and other personal information to the study ID • Kept separately under lock and key or encrypted and password protected • Only personnel who need access to information should have it • HIPAA guideline compliant – collect the least amount of protected health information (PHI) needed for the study
Data Management Computer Systems • Self-managed computer systems: • Spreadsheets • Excel, Lotus • Programmer managed computer systems: • Database management software • Access, dBase • Statistical software • SAS, SPSS, Stata • Web-based systems • Gsurvey, REDCap • Fax/Teleform systems
Self-managed Systems • Advantages • Self managed • Convenient for small data sets • Descriptive statistics and graphics available • Disadvantages • Data types are defined by first few entries • Not conducive to data validation • Cumbersome for very large data sets • Forms need to be designed separately • Repeated column names or no column names allowed • Data codes are entered manually into a separate file • Unable to do consistency checks across forms
Creating an Excel Spreadsheet Unique variable names should be in the first row Data should be in column format Data in the same column should be of the same data type Some data validation features are available Data audit features are available for existing spreadsheets
Programmer-managed Systems • Advantages • Friendlier data entry environment • Computerized data validation • Ability to perform consistency checks within and across tables • Ability to track editing changes • More manageable for large data sets • Disadvantages • Require more up front planning and resources • Require database knowledge to develop a file • User training
Creating an Access Database • Primary Key(s) • Must be unique and not missing • Indexes on this value • Be careful of “Default value” • Default setting is zero for numeric data (pre 2007 versions) • Use “Required” only when necessary • Will not allow field to be left blank
Sample Datasets What not to do! Real life examples
Preparing Files for Statistical Analysis • Allow adequate time for data preparation • Better the quality of data, less time analyzing • Know your data • Look at frequency distributions and scatterplots • Multiple checks for errors • Minimize missing data if at all possible • Be aware of amount of data missing and why • Freeze the dataset • Copy to another file and date the file • Document any corrections made to file and also correct in original database and on forms • Plan on recoding categorical variables so each group has a sufficient sample size • Prepare a separate code sheet for data
Preparing Files for Statistical Analysis (2) • General spreadsheet design • One line header row with a unique one word name for each variable • Do not mix data types within one column • Unique identifying number for each case • Only include raw, un-summarized data, i.e., no summary statistics or graphs in spreadsheet • Date format with four digit years • Avoid underlining, bold fonts, or italics • Do not leave blank rows or columns in between data • Do not use a row to label a group, use a grouping variable (column)
Preparing Files for Statistical Analysis (3) • Missing Data • Consider what software will be used for analysis • Use different codes to indicate reason missing • e.g., not applicable, unable to complete, or missing • If numeric field • Must not be a valid data point • Do not use text, such as “NA”, “missing”, “*”
Exporting Excel data into Stata • Save the Excel file as comma delimited • In the Save As dialog box choose CSV(comma delimited) for Save As type • In Stata • Go to drop down menu “File” – “Import” – “ASCII data created from a spreadsheet” • or use the command ‘insheet using “filepath.csv”, comma names’
Exporting Access Data into Stata • Save Access file as a Comma Delimited File • Open the Access table • From the File Menu, select Export • In the pop-up dialog box click on “Save as file type” and select Text Files • Click “Save All” and inthe Export Text Wizard select delimited and comma • Follow instructions for importing a comma delimited file into Stata
Data Transfer Software • Software is available to transfer data between applications • Stat/Transfer and DBMS/COPY • Access, ASCII, dBase, Epi Info, Excel, JMP, Paradox, QuattroPro, SAS, S-Plus, SPSS, Stata, Statistica • Need to update as software updates
ICTR Resources • Data management and statistical support available for ICTR (CRU) protocols • http://ictr.johnshopkins.edu/connection/ • Computer facilities with data management and statistical software • Located in Carnegie 446 • http://www.hopkinsmedicine.org/gcrc/ • Gsurvey, Teleform, scanning, sample size programs, statistical software, data transfer software
Data Security and Confidentiality • Do not include unnecessary protected health information on research data files • No names, addresses, phone numbers, social security numbers • No medical record numbers • Files that link the study ID to the PHI should not be maintained on removable storage drives or laptops • E-mailing of data files should be limited • If PHI are on the file, then the files should be encrypted and password protected • Do not e-mail the password • Use JShare or SharePoint to share or transfer data files
Summary • Well designed systems minimize data errors and future problems • Data management systems should be chosen based on resources and individual needs • Spreadsheets are appropriate for small and simple data sets • Databases provide more options for data management • Add simple validations to check data entry • Following guidelines for preparing files for statistical analysis will save time • Data transfer software is available to transfer data between applications • Limit PHI and keep data secure