Data Manipulation III Concatenating & Merging, Randomize Sample, and Data Restructuring

Data Manipulation III: Concatenating & Merging, Randomize Sample, and Data Restructuring • SPSS Training • Thomas V. Joshua, MS • July 2012 • College of Nursing

Lecture Overview • Concatenating & Merging • Randomize Sample • Creating an ID variable • Data Restructuring

Concatenating SPSS Files • Concatenating means stacking one data file on top of another. • Concatenating data files adds cases to an existing data file. • This option works most cleanly when the two files have the same variables with the same names; variables found in only one file are handled separately as unpaired variables.

Concatenating SPSS Files • From the menu in the Data Editor: Data -> Merge Files -> Add Cases • Choose the source of the new cases: an open dataset or an external SPSS data file.

Concatenating SPSS Files • You can rename variables by clicking the "Rename" button, but it works better if the variables in the two files already have the same names before merging. • The option "Indicate case source as variable" adds a variable that records which data file each case came from. • If the data set to be merged contains additional variables, they show up in the Unpaired Variables box. • Number of observations in the result = number of observations in file A + number of observations in file B
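The idea behind Add Cases can be sketched outside SPSS. The following is a minimal Python illustration, not SPSS's own implementation: the rows, the function name add_cases, and the indicator name source are all hypothetical, chosen only to show the stacking, the case-source indicator, and the observation count.

```python
# Hypothetical stand-ins for two data files with the same variables.
file_a = [
    {"id": 1, "salary": 57000},
    {"id": 2, "salary": 40200},
]
file_b = [
    {"id": 3, "salary": 21450},
]


def add_cases(active, external):
    """Stack the external cases under the active ones.

    Each row gets a 'source' indicator (0 = active file, 1 = external file),
    mimicking the idea of "Indicate case source as variable".
    """
    stacked = []
    for source, rows in ((0, active), (1, external)):
        for row in rows:
            stacked.append({**row, "source": source})
    return stacked


combined = add_cases(file_a, file_b)
print(len(combined), "cases after concatenation")
```

The number of rows in combined equals the number of rows in file_a plus the number in file_b, which is the counting rule stated above.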

Concatenating SPSS Files • What about the data type for the variable Salary?

Merging Files: one-to-one • Goal: combine the working data file, EmployMain.sav, with an external file, EmployBeginSalary.sav, that contains two variables: employee ID (id) and beginning salary (salbegin). • From the menu in the Data Editor, with EmployMain.sav open: Data -> Merge Files -> Add Variables

The variable id has the same values as the variable with the same name in EmployMain.sav, whereas the variable salbegin is unique to the external data file, EmployBeginSalary.sav. • Variable names that appear in both datasets are listed in the box labeled Excluded Variables (here, id). • At least one variable must be common to both files in order to perform a merge.

In this example, only the variable id appears in the Key Variables box because it is the only variable present in both data files.

Merging Files: one-to-one • Both data sets must be sorted by the same variable(s).
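To see why both files need the same sort order, here is a rough Python sketch of a one-to-one match on a key. It is an illustration under assumptions, not SPSS's algorithm: the rows and the function merge_one_to_one are hypothetical, and the sketch sorts both inputs itself and then walks them in a single pass, which only works because they share the same order.

```python
# Hypothetical rows standing in for EmployMain.sav and EmployBeginSalary.sav.
employ_main = [
    {"id": 2, "jobcat": 1},
    {"id": 1, "jobcat": 3},
]
employ_begin_salary = [
    {"id": 1, "salbegin": 27000},
    {"id": 2, "salbegin": 18750},
]


def merge_one_to_one(left, right, key):
    """Match rows from two files on a key variable in one sorted pass."""
    left_sorted = sorted(left, key=lambda row: row[key])
    right_sorted = sorted(right, key=lambda row: row[key])
    merged = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        l_row, r_row = left_sorted[i], right_sorted[j]
        if l_row[key] == r_row[key]:
            # Same key in both files: combine the variables into one case.
            merged.append({**l_row, **r_row})
            i += 1
            j += 1
        elif l_row[key] < r_row[key]:
            # Key only in the left file (a choice of this sketch: keep it).
            merged.append(dict(l_row))
            i += 1
        else:
            # Key only in the right file (also kept in this sketch).
            merged.append(dict(r_row))
            j += 1
    merged.extend(dict(row) for row in left_sorted[i:])
    merged.extend(dict(row) for row in right_sorted[j:])
    return merged


merged = merge_one_to_one(employ_main, employ_begin_salary, "id")
for row in merged:
    print(row)
```

If either input were left unsorted, the single pass would compare the wrong pairs of rows and miss matches, which is the practical reason for sorting on the key first.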

Merging Files: one-to-one • Resulting (merged) data set

Merging Files: match on the basis of a particular variable • For example, you may want to add to your data file a data column containing the average salary for a person's job category. • The dataset for this example, MeanSalary.sav, is shown below, containing a variable for job category, jobcat, and a variable representing the average salary for that job category, meansalary.

Merging Files: match on the basis of a particular variable • This example assumes that EmployMain.sav is open in the Data Editor. • The non-active dataset is the external file. • Both files need to be sorted on the key variable, jobcat in this example. • Specify the variable on which the two files are matched.
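Matching on a key where one file acts as a lookup table can be sketched as follows. This is a hedged Python illustration with made-up values; add_lookup and the sample rows are hypothetical, and the dictionary-based lookup is a convenience of the sketch rather than a description of how SPSS performs the match.

```python
# Hypothetical cases from the working file and a hypothetical lookup table
# standing in for MeanSalary.sav (one row per job category).
employees = [
    {"id": 1, "jobcat": 3},
    {"id": 2, "jobcat": 1},
    {"id": 3, "jobcat": 1},
]
mean_salary = [
    {"jobcat": 1, "meansalary": 27800.0},
    {"jobcat": 3, "meansalary": 64000.0},
]


def add_lookup(cases, table, key):
    """Attach the table's variables to every case that shares the key value."""
    lookup = {row[key]: row for row in table}
    result = []
    for case in cases:
        # Cases whose key is absent from the table are left unchanged.
        match = lookup.get(case[key], {})
        result.append({**case, **match})
    return result


with_means = add_lookup(employees, mean_salary, "jobcat")
for row in with_means:
    print(row)
```

Every employee in the same job category receives the same meansalary value, and the number of cases stays the same as in the working file.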

Randomly Select the Subset • From the menu in the Data Editor: Transform -> Random Number Generators • This step sets the starting point SPSS uses when generating random numbers. • When doing research involving random numbers (for example, when randomly assigning cases to experimental treatment groups), you should explicitly set the random number seed value if you want to be able to reproduce the same results.

Randomly Select the Subset • Data -> Select Cases • Select "Random sample of cases" • For example, we could select 60% of cases for model building and the remaining 40% for model validation.
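The effect of fixing a seed before drawing a sample can be illustrated with a short Python sketch. It uses Python's own random module, so it will not reproduce the cases SPSS would pick; the function split_cases, the seed value 2012, and the ten case numbers are assumptions made for the example.

```python
import random


def split_cases(case_ids, fraction, seed):
    """Randomly split case IDs into a building set and a validation set.

    The same seed always yields the same split, which is the point of
    setting the seed explicitly before sampling.
    """
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


# Hypothetical data set of 10 cases, split 60% / 40%.
building, validation = split_cases(range(1, 11), 0.6, seed=2012)
building_again, _ = split_cases(range(1, 11), 0.6, seed=2012)

print("building:", building)
print("validation:", validation)
print("reproducible:", building == building_again)
```

Note that this sketch takes exactly 60% of the cases, whereas a percentage-based random selection may return only approximately that share.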

Creating an ID variable • Often we need to construct an ID variable to identify each observation. • An ID variable is very useful for merging, grouping, or stacking data. • The data file smoking.sav has no ID variable, only the case number. • From the menu in the Data Editor: Transform -> Compute

Creating an ID variable • $casenum is a system variable that holds the sequence number of each case; it can be used as the identifier.
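The same idea, copying the running case number into a permanent ID variable, can be sketched in Python. The rows below are hypothetical stand-ins for smoking.sav, and the variable name id is an assumption.

```python
# Hypothetical rows standing in for smoking.sav, which has no ID variable.
smoking = [
    {"smoker": "yes"},
    {"smoker": "no"},
    {"smoker": "no"},
]

# Number the cases 1, 2, 3, ... in their current order and store that
# number as a regular variable, so it survives later sorting or merging.
with_id = [{"id": n, **row} for n, row in enumerate(smoking, start=1)]

for row in with_id:
    print(row)
```

Because the ID is stored with each row, it keeps identifying the same observation even after the file is re-sorted, which a bare case number does not.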

Restructure Data • In order to analyze the data, either in SAS or SPSS, each observation (not each subject) must be on its own line. • In the following example, Anxiety 2.sav, we need to restructure the 4 treatments (trial1 to trial4) from columns into rows.

We would like to turn the selected variables trial1 to trial4 into cases. Arranging the data by treatment in this way is typical for repeated-measures analysis.

From the menu in the Data Editor: Data -> Restructure • Step 1: The data set will be restructured from wide to long.

Restructure Data • Step 2: The variables trial1 to trial4 become cases of a new variable, score.

Step 3

Restructure Data Step 3 Completed

Restructure Data Step 4

Restructure Data Step 5 You could use the old variable names as the values of this new index variable. The index variable is given a default name.

Step 6 Step 7

Anxiety2.sav RestructuredAnxiety.sav
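The wide-to-long restructuring walked through in the steps above can be sketched in Python as follows. The scores are invented, and the names vars_to_cases, subject, and trial are hypothetical; only trial1 to trial4 and score come from the slides. The old variable names are used as the values of the index variable, as described in Step 5.

```python
# Hypothetical wide data: one row per subject, one column per trial.
wide = [
    {"subject": 1, "trial1": 18, "trial2": 14, "trial3": 12, "trial4": 6},
    {"subject": 2, "trial1": 19, "trial2": 12, "trial3": 8, "trial4": 4},
]


def vars_to_cases(rows, value_vars, target, index):
    """Turn each listed variable into its own case (wide to long).

    'target' receives the value; 'index' records which original
    variable the value came from.
    """
    long_rows = []
    for row in rows:
        # Variables not being restructured are repeated on every new case.
        fixed = {k: v for k, v in row.items() if k not in value_vars}
        for name in value_vars:
            long_rows.append({**fixed, index: name, target: row[name]})
    return long_rows


long_rows = vars_to_cases(
    wide, ["trial1", "trial2", "trial3", "trial4"], target="score", index="trial"
)
for row in long_rows:
    print(row)
```

Each subject now contributes four rows, one per trial, so every observation sits on its own line.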

Thank You
