290 likes | 301 Views
Learn data entry techniques using Excel, cleaning data effectively, utilizing existing data, and tracking datasets efficiently in this comprehensive workshop. Enhance your research data management skills now.
E N D
Managing Research Data Statlab Workshop Spring 2009 Amelia Aldao, M.S. Jeremy Green
Managing Data Outline Data entry Cleaning data Using someone else’s data Keeping track of data From data to results
Data Entry Excel is the best software for data entry: Easy to use Files can be easily imported to data analysis packages
Data Entry - Excel Earlier versions (before 2007) only had 256 columns per sheet You can get around this by creating several sheets, then merging them with data analysis packages Although Excel has some basic statistical functions (e.g. mean, standard deviation), they are limited, so it is best to only use it for entering data and leaving all calculations for statistical packages
Data Entry - Excel - Useful Tips Each column should contain a variable. Use the first row to label the variables Given that most modern data analyses packages allow for long variables names (e.g. more than 8 characters), be specific with your variable names If you want to see a particular set of columns while entering data (say, to make sure you are entering values in the ID), go to Window-Spilt Sometimes it can be useful to sort data while entering it. To do so, go to Data - Sort and select the variables you would like to sort your data by
Data Entry - Excel - Useful Tips When entering data from categorical variables you have several options: Enter the data in string format and keep it that way Enter the data in string format and convert it to numerical values later Enter values that have already been converted to numerical If you will be assigning numerical values to categorical ones, please keep track of what your codes mean. It is usually helpful to write a codebook where you keep track of all this information
Data Entry - Codebook Always create a codebook that contains: Instructions for entering data Instructions for making decisions when data are ambiguous Instructions for handling missing observations Numerical codes you will use for categorical data General troubleshooting information Treat it as a working document
Data Entry - Multiple Coders If you will have multiple people working on your data, you should encourage them to append the codebook with relevant information as they work on it Also, you should create a column for them to enter their initials next to each line of data they enter. This way, if any issues come up, you can always find out who entered that particular line of data Every time they open the master sheet, ask them to resave it with today’s date and their initials. This will also help you track who did what
Data Entry - Missing Values The easiest thing is to leave them blank Sometimes, however, you might want to make distinctions between different kinds of missing data (e.g. non-applicable, non-available, unclear). In this case, use a different code for each (and write this on the codebook). When using codes for missing values always remember to: Use values outside of the range of your variable Define all missing values in your data analysis package
Data Entry - Long vs. short format Long form: one line of data per participant • Short form: multiple lines of data per participant
Cleaning your data In order to minimize errors while manually entering data, you can set ranges in Excel so that if a value outside the range is entered, the cell will change color. To to this go to Format - Conditional Formatting and specify the ranges for which a different format should show up. Also, you can use the data validation options. Go to Data - Validation
Cleaning Data - Tips Take advantage of labeling options in software packages If possible, sort your variables so that you can easily find them Ask for descriptive statistics of all variables to check if values outside the range were entered Also, ask for skewness. Sometimes a highly skewed statistic might signal something was entered wrong
Cleaning Data - Tips It is always a GREAT idea to plot your data
Dealing with existing data Sometimes you might download the data (e.g. ICPSR) or get it form a colleague, advisor, etc As tempting as it is to start running analysis because the data look clean, you need to make sure the data set is cleaned and organized in a way that makes sense to you
Dealing with existing data To do this, always ask for: Codebook Syntaxes Outputs they have produced Papers published with that data Suggestions on how to handle that particular data if you are not very familiar with it
Keeping Track of Data Sets Ever time you make changes to your data, save it with the current date Keep a document with a list of the major changes with each version A good idea is to keep a folder with the original data sets and create different subfolders as you make changes to the data set. Sometimes it is also a good idea to keep a working directory for currently active files Always make backup copies
Keeping Track of Syntaxes and Outputs Save all the syntax you write Save all the output you produce and try to annotate it as much as possible Save your syntax and output with the data file name a brief description of the analyses and the current date Save syntax and output in a separate folder from your data
Variables Dependent Variable (response, outcome, criterion) Independent Variables (explanatory or predictor variables) • Treatment Variable • Covariates / Confounding Variables Categorical and Continuous Variables Remember: Types of variables we choose, determine the statistics we use
Endogeneity Problem: “Independent” variables are not really independent • The “dependent” and “independent” variables are determined in equilibrium (example: effect of education on wages) • Treatment effects will be biased Modeling approaches to deal with this • Assumption-based methods • Instruments
So, how do I analyze my data? Correlation • Correlation allows you to quantify relationships between variables (r, r-squared) • Regression allows prediction of dependent variable based on one or more independent variables Group differences • t-test & ANOVA • Chi-square for categorical and frequency data Significance v. effect size More Complex Model
Descriptive Statistics Variables Dependent Variable(s) Independent Variable(s) Important Covariates Graphs Summary Statistics on Key Variables Number, Mean, Minimum, Maximum, Standard Deviation Cross-Tabs
Putting Output into a Paper Cut and Paste Graphs Cut and Paste into Word Processing document Save as .jpeg or .tif file Tables Cut and Paste Format in Word Processing document Import into Excel, format, and then place in Word
Last piece of advice…. As tempting as it is to start running analyses as soon as you get a chance, please make sure that you have cleaned your data! There is nothing worse than having to redo analysis….and finding out they have changed! Invest your time wisely. Plan ahead and be organized!