280 likes | 468 Views
Preparing Research Datasets. Data Request Data Cleaning Dataset Preparation Documentation. Beverly Musick. Research Data Request. A concept proposal is a detailed plan for a research project and typically includes: The aims of the study Associated hypotheses
E N D
Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick
Research Data Request A concept proposal is a detailed plan for a research project and typically includes: • The aims of the study • Associated hypotheses • Statistical analysis plan • Description of the cohort • Specific data/variables needed (example Hunt Proposal)
Steps to Fulfill Research Data Request • Identify and resolve questions regarding requirements (may be repeated) • Determine data source and variables needed • Define any derived variables • Define cohort and subset of visits/observations to be included • Data cleaning • Dataset preparation • Documentation
Raw Data Cleaning For data that are stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet or file. • Back-up the original data files. • Eliminate blank records and any records used for testing. • Locate duplicate records and resolve. • For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, weights etc. • For categorical variables such as gender or marital status, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender). • Review the frequency of missing data when records are sorted. Does it make sense that x records have no value for variable y?
Raw Data to SAS Datasets Create a SAS program that converts the database file(s) to permanent SAS dataset(s). • For Access or Excel files can use ‘Proc Import’ PROC IMPORT OUT= WORK.demog DATATABLE= "tblDEMOG" DBMS=ACCESS REPLACE; DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb"; dbpwd=‘password' ; RUN; • For text files can write specific input statement data copd ; infile 'c:\kenya\hiv\copd.txt' ; input @1 patientid $9. @@ ; run ;
SAS Dataset Creation • Merge or append (concatenate) tables as necessary. • Double-check the merging process by looking at the number of observations in each dataset before and after the merge. • The number of records is dependent on the overlap among the datasets. This relationship should be known in advance and the expected outcome confirmed.
Merge Example data vis1 ; set h.visitDemo(keep=patient_id apptdate age weight height bmi cd4) ; if patient_id in (1,2,3,4,5) ; run ; NOTE: There were 933654 observations read from the data set H.visitDemo. NOTE: The data set WORK.VISIT has 71 observations and 7 variables. data vis2 ; set h.visitDemo(keep=patient_id apptdate clinic hgb sao2) ; if patient_id in (4,5,6) ; run ; NOTE: There were 933654 observations read from the data set H.visitDemo. NOTE: The data set WORK.VIS2 has 46 observations and 5 variables. data bothvis ; merge visit vis2 ; by patient_id apptdate ; run ; NOTE: There were 71 observations read from the data set WORK.VIS1. NOTE: There were 46 observations read from the data set WORK.VIS2. NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.
SAS Dataset Creation (cont.) • Confirm that the total number of variables in the merged dataset is correct. • The number should be the sum of all variables minus the (number of key fields*(number of datasets in merge minus 1)). In the previous example: 7 + 5 – 2*(2-1) = 10 • If the number of variables is less than this, then you know that you have the same variable(s) in one or more of the datasets. This should be strictly avoided.
Merge Example data vis1 ; set h.visitDemo(keep=patient_id apptdate age weight height bmi cd4) ; if patient_id in (1,2,3,4,5) ; run ; NOTE: There were 933654 observations read from the data set H.visitDemo. NOTE: The data set WORK.VISIT has 71 observations and 7 variables. data vis2 ; set h.visitDemo(keep=patient_id apptdate clinic hgb sao2) ; if patient_id in (4,5,6) ; run ; NOTE: There were 933654 observations read from the data set H.visitDemo. NOTE: The data set WORK.VIS2 has 46 observations and 5 variables. data bothvis ; merge visit vis2 ; by patient_id apptdate ; run ; NOTE: There were 71 observations read from the data set WORK.VIS1. NOTE: There were 46 observations read from the data set WORK.VIS2. NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.
SAS Dataset Creation (cont.) Always review the SAS log ERROR, WARNING, and NOTE messages. The following messages are often overlooked but do require action: "NOTE: MERGE statement has more than one data set with repeats of BY values.“ • This indicates that one or more of the datasets that you are trying to merge contains multiple observations that are not uniquely distinguishable based on the variables listed in the by statement. Merged dataset will contain spurious and unexpected results. Further processing should not continue until this note has been resolved. “NOTE: Variable _____ is uninitialized” “NOTE: Variable _____ has never been referenced” • These indicate that the variable has not been properly defined. Many times a variable name has just been misspelled. “NOTE: Character values have been converted to numeric values…” • This indicates that SAS has automatically converted a character variable to numeric. Because unexpected results can occur, it’s best to do the conversion manually with the input or put function. “WARNING: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results.” • This indicates that the by variable is not consistent across all data sets.
SAS Dataset Creation (cont.) To create permanent datasets for analysis: • Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table then these should be converted to .'s. • Calculate appropriate summary scores (ex. AUDIT-3, BMI) • Calculate differences between dates such as time from enrollment to ART initiation. • Label all calculated and created variables. • Attach formats to the variable values where necessary.
Cleaning Data in SAS Create a cleanup program. • Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data. • Plot the data. • For the numeric and date fields look at minimums and maximums to verify all values are within expected range. • Locate duplicate records and resolve. • Compare fields when appropriate (i.e. dob and age, confirm date of initial visit < date of follow-up).
Cleaning Data in SAS (cont.) • Identify important fields such as summary scores and verify their values. • Merge all longitudinal datasets to identify date inconsistencies, variable format inconsistencies, and to locate missing questionnaires. • Merge cross-sectional (demographics) dataset with longitudinal datasets to identify subjects in one but not the other.
SAS Program Files • Save all logs and outputs from SAS programs especially when creating analysis datasets for publication • Naming conventions – studyx.sas, studyx.log, studyx.lst • Only the program that generates the permanent dataset should overwrite it. • Never overwrite a permanent dataset (even with a proc sort) from any other program.
Documentation • Internally document SAS programs. At minimum include file name, location, purpose, author, date, and revisions. • May be helpful to include the names of any permanent SAS datasets created within the program • All SAS printouts should have at least one title, which includes the project name (ex. title ‘Treatment Interruptions Analysis Dataset’ ;) • It’s helpful to use the footnote option to display the path and file name of the SAS program on the listing (ex. options footnote ‘R:\AMPATH\Research\Braitstein\TxInt\TxInterrupt.sas’ ; )
Documentation (cont.) • If any variable values have been formatted, include a copy of the “proc format” section in the documentation. • Generate form keys. • Provide a description of any variables included in the datasets that are not found on the form keys.
Summary Score Documentation Detailed algorithms of how summary scores are calculated should include the following: a. which variables are used to calculate which summary scores b. which variables (if any) are recoded and how c. what is the minimum number of non-missing items needed to calculate the score d. how are missing values addressed. Typically when calculating a total or sum score the mean should be imputed for missing data. If the summary score is a mean itself then the missing data can be ignored. In both of these cases it is essential that c. above is followed and that summary scores are coded as missing if there is insufficient data to calculate. e. what is the meaning of the score and how is it scaled. Indicate the possible range and how a high score differs from a low score. For example include something like “Higher score indicates more depression”.
Dataset Cover Sheet Notes on Analysis Datasets Project Name: Principal Investigator: Date of Original Data Request {please attach}: Datasets Created: Name and Location of (SAS) Program used to Generate Datasets: Creation Date: Created By: Biostatisticians: Cohort: Derived Variables (name of variable, coding, precise description): SAS Formats: procformat Preliminary Statistics:
PracticumCreate a Data Cover Sheet for dataset created during Programming Standards Practicum (male patients with at least 2 CD4’s)
SAS General Notes • If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time; and one containing the data for multiple time points. • Never put cross-sectional variables such as gender in the longitudinal dataset. • Format all date fields with 4-digit year (ddmmyy10. or date9.) • Choose data type numeric whenever possible.
Distributing SAS Datasets • If possible, have another data manger review the datasets and documentation before distributing • The following should be included: • The form keys • All appropriate SAS datasets (should have the extension .sas7bdat) • Dataset Cover Sheet • Latest Data Request Form • Any other documents or notes which would further explain the data.
Distributing SAS Datasets (cont.) In most cases the following should not be distributed: • Any Protected Health Information (PHI) such as study subject’s name, address, phone numbers, social security number, hospital id number. Date of birth should only be included if absolutely necessary. But usually age can be calculated and given instead. • SAS generation programs. These often contain PHI.
File Maintenance & Archiving For your own records at minimum, you should have: • A copy of everything you give to the biostatistician or investigator and the date given. • A copy of the log of all the SAS programs especially those that create any permanent SAS datasets which were passed along to others • Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc. • It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the pc where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables. • For longitudinal studies in particular, it is important to archive datasets and SAS programs/logs, which were used for analysis for abstracts, papers, grant proposals, and other publications.
Data Managers Working with Investigators and Biostatisticians • Attend study meetings • Date all documents and meeting notes • Comment on proposed study changes • Understand the statistical analysis plan • Review statistical reports (preferably before presented to research team) • Review and critique abstracts/manuscripts Your contribution is EXTREMELY important!