A Demonstration and Exercise Investigation into Wide Reshaping
The dataset “bp long.dta” purports to be a long dataset relating to blood pressure measurements in three groups of male and female volunteers before and after taking medication. Our investigation here will focus on the application of reshaping to explore this data graphically using Stata 8’s advanced graphics.

The name of the dataset strongly suggests that the dataset is in long form … how could you confirm this? What are the major properties of long data?

    . use "bp long", clear
    (fictional blood-pressure data)

    . describe

    Contains data from bp long.dta
      obs:           240                          fictional blood-pressure data
     vars:             5                          11 Feb 2004 17:54
     size:         2,640 (99.9% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    patient         int    %8.0g                  Patient ID
    when            byte   %9.0g
    bp              int    %9.0g
    sex             byte   %9.0g       sex        Sex
    agegrp          byte   %9.0g       agegrp     Age Group
    -------------------------------------------------------------------------------
    Sorted by:  patient  when
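One way to test the long layout directly is sketched below. It is not the deck's own method, and it assumes “bp long.dta” sits in the current working directory. In long form, each patient should appear once per measurement occasion, so patient and when together should identify every row.

```stata
use "bp long", clear

* isid exits with an error if patient and when together
* do not uniquely identify the observations.
isid patient when

* Every patient should contribute exactly two rows (before and after).
* assert exits with an error if any patient has a different row count.
bysort patient: assert _N == 2

* All measurements sit in the single column bp; tabulating the key
* variable shows how the rows split across the two occasions.
tabulate when
```

If both checks pass silently, the data are keyed by patient and occasion, which is what the long form requires.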
    . summarize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         patient |       240        60.5    34.71221          1        120
            when |       240         1.5    .5010449          1          2
              bp |       240    153.9042     13.0837        125        185
             sex |       240          .5    .5010449          0          1
          agegrp |       240           2    .8182029          1          3

    . gen When="Before" if when==1
    (120 missing values generated)

    . replace When="After" if when==2
    (120 real changes made)

    . graph bar bp, by(age sex, rows(3)) over(When) ylabel(,angle(0))

The only simple way of describing this data graphically is with a bar plot over the ‘key’ variable of interest, ‘when’, in regard to taking the medicine.
Improvements become available when we widen the dataset. Why is widening the dataset likely to improve our graphics options?

Notice that we ‘quietly’ replaced the integer version of the variable ‘when’ with a string version of the variable … ‘When’. Why did we do this? How did we accomplish this? What does this mean for widening the dataset? What would we need to do to this dataset before widening it?

Now reshape this dataset to a wide form using ‘When’ as the group variable.
First remove the variable ‘when’

    . drop when

Now we reshape as requested

    . reshape wide bp, i(patient) j(When) string
    (note: j = After Before)

    Data                               long   ->   wide
    ----------------------------------------------------------------
    Number of obs.                      240   ->     120
    Number of variables                   5   ->       5
    j variable (2 values)              When   ->   (dropped)
    xij variables:
                                         bp   ->   bpAfter bpBefore
    ----------------------------------------------------------------

What indications do we have here that the dataset has been widened? What new variables have been created? What old variables have vanished? Where have the variables gone?
How do these commands, describe and summarize, help us confirm that we have achieved what we needed with the reshape?

    . describe

    Contains data
      obs:           120                          fictional blood-pressure data
     vars:             5
     size:         1,440 (99.9% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    patient         int    %8.0g                  Patient ID
    bpAfter         int    %9.0g                  After bp
    bpBefore        int    %9.0g                  Before bp
    sex             byte   %9.0g       sex        Sex
    agegrp          byte   %9.0g       agegrp     Age Group
    -------------------------------------------------------------------------------
    Sorted by:  patient
         Note:  dataset has changed since last saved

    . summarize

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         patient |       120        60.5    34.78505          1        120
         bpAfter |       120    151.3583    14.17762        125        185
        bpBefore |       120      156.45    11.38985        138        185
             sex |       120          .5    .5020964          0          1
          agegrp |       120           2    .8199201          1          3
Using the command

    graph bar bp*, by(age, cols(3)) over(sex)

we produce the bar graph below … what advantages have we gained over the ‘long’ bar plot? What additional improvement could be sought?
A simple variation to the graph command …

    graph bar bp*, over(age) over(sex)

yields the following. Are further plot improvements indicated?
Reshaping in either direction, wide or long, can be performed multiple times. Indeed, it is often more manageable to reshape in a series of steps than in a single, complex, and possibly confusing step. So armed with this we will widen the dataset again, this time by sex.

Is ‘sex’ currently in an appropriate form for reshaping, or would it be better if its form were changed? How would you recommend that we alter the ‘sex’ variable to assist with the reshape step? How will your recommendation actually help us? Use this dataset description to help you here.

    . describe

    Contains data
      obs:           120                          fictional blood-pressure data
     vars:             5
     size:         1,440 (99.9% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    patient         int    %8.0g                  Patient ID
    bpAfter         int    %9.0g                  After bp
    bpBefore        int    %9.0g                  Before bp
    sex             byte   %9.0g       sex        Sex
    agegrp          byte   %9.0g       agegrp     Age Group
    -------------------------------------------------------------------------------
    . decode sex, gen(Sex)

    . reshape wide bp*, i(patient) j(Sex) string
    (note: j = Female Male)

    Data                               long   ->   wide
    -----------------------------------------------------------------------------
    Number of obs.                      120   ->     120
    Number of variables                   6   ->       7
    j variable (2 values)               Sex   ->   (dropped)
    xij variables:
                                    bpAfter   ->   bpAfterFemale bpAfterMale
                                   bpBefore   ->   bpBeforeFemale bpBeforeMale
    -----------------------------------------------------------------------------

What was accomplished by the above ‘decode’? How were values of the variable ‘Sex’ derived? What does the ‘*’ mean in the reshape command? How many variables were created from the reshape? How many variables were lost? Has the dataset length changed? Are there redundancies in the dataset?

With this data can you see how we can easily separate the ‘sex’ effects of the medicine response, by age group?
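The redundancy question can be explored by listing a few rows. The sketch below assumes the twice-widened dataset from the reshape above is still in memory. Because each patient has only one sex, the columns for the other sex can hold nothing but missing values.

```stata
* Assumes the dataset produced by the second reshape wide is in memory.
* Show the first few patients: one of the two bpAfter columns is
* missing (.) on every row.
list patient sex bpAfterFemale bpAfterMale in 1/6

* Count the empty cells in each column. Every patient is missing one
* of the two, so together the counts should cover all 120 patients.
count if bpAfterFemale == .
count if bpAfterMale == .
```

The same pattern holds for bpBeforeFemale and bpBeforeMale.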
The female pattern by treatment (medicine) and age group is

    graph bar bp*Female, by(age, cols(3))

How does a single entity enable us to see all 6 responses?
Here we show a box plot of the same results but for the males in our study group

    graph box bp*Male, by(age, cols(3)) ylabel(,angle(0))

What new information do we obtain from ‘box’ versus ‘bar’ plots? Comparing this plot with the last one what sex difference do we see?
The code below, from our ‘wide plot’ demonstration, loops through the 3 age groups in our dataset and plots the male and female, before and after, responses …

    forvalues Age=1/3 {
        graph bar bp*Male bp*Female if `Age'==age, title("Age group: `Age'")
        more
    }

What is the meaning of ‘Age’ in this code? How do we ensure that only one age group is plotted at once? Are we seeking ‘box’ or ‘bar’ plots?

Try changing this code so that the other plot style is produced. That is, if the segment calls for box plots modify it to produce bar plots, and if it produces bar plots change it to produce box plots.
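One possible answer to the last exercise is sketched here, assuming the twice-widened dataset is still in memory. Only the graph subcommand changes. The variable is written out in full as agegrp, which the slides abbreviate to age.

```stata
* Assumes the twice-widened dataset is in memory.
forvalues Age = 1/3 {
    * Same loop as above, with graph box in place of graph bar.
    * The if qualifier keeps only the rows whose agegrp equals the
    * current value of the loop macro Age.
    graph box bp*Male bp*Female if `Age' == agegrp, title("Age group: `Age'")
    * Pause until a key is pressed so each plot can be inspected.
    more
}
```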
By reshaping long twice we will naturally be able to return the dataset from its current form to its original form. Can you think of some special precautions we will need to negotiate to accomplish this?

Immediately following a ‘reshape wide’ command, what command, without any arguments, will reshape the dataset long … that is, to its state just prior to the reshape wide? Try this command and see what it achieves (hint: ‘reshape long’).

After taking the dataset back one reshape long, check the dataset to:
a: confirm that you have achieved what was needed
b: isolate the next step, or need for data refinement, to successfully perform the next reshape long.
First reshape command

    . reshape long
    (note: j = Female Male)

    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                      120   ->     240
    Number of variables                   7   ->       6
    j variable (2 values)                     ->   Sex
    xij variables:
                  bpAfterFemale bpAfterMale   ->   bpAfter
                bpBeforeFemale bpBeforeMale   ->   bpBefore
    -----------------------------------------------------------------------------

What has this achieved? Can you spot a problem? How can we correct the problem? Try the correction and see if the problem has been resolved.
The problem was that the dataset organization after the second reshape wide was not as compact as it could have been, and this led to missing values being placed in the dataset. We remove the missing values as follows:

    . drop if bpAfter==.
    (120 observations deleted)

Now we are in good shape for our next ‘reshape long’.

    . reshape long bp, i(patient) j(When) string
    (note: j = After Before)

    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                      120   ->     240
    Number of variables                   6   ->       6
    j variable (2 values)                     ->   When
    xij variables:
                           bpAfter bpBefore   ->   bp
    -----------------------------------------------------------------------------

Note that the reshape long commands applied to restore our dataset bear a close resemblance to the reshape wide commands which led to our widened dataset.

All that remains is to put the variable ‘when’ back in its appropriate form

    . encode When, gen(when)

    . drop When
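A plain encode numbers the strings alphabetically, so After becomes 1 and Before becomes 2, the reverse of the original coding (when==1 was Before). The sketch below is one way to recover the original coding and is not part of the deck. It is meant to run in place of the encode step above, while the string variable When still exists. The label name whenlbl is a hypothetical name chosen for illustration.

```stata
* Run instead of the plain "encode When, gen(when)" step.
* whenlbl is a hypothetical label name, not one from the original data.
label define whenlbl 1 "Before" 2 "After"

* With label(), encode looks each string up in the existing value label
* whenlbl rather than assigning codes in alphabetical order.
encode When, gen(when) label(whenlbl)
drop When

* The string copy of sex made by decode is still in the dataset;
* the original numeric sex variable carries the same information.
drop Sex

* Restore the original sort order and inspect the result.
sort patient when
describe
```

The restored when carries a value label that the original did not have, and its storage type may differ, so the result matches the original in content rather than byte for byte.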
Answers to questions

Slide 1: How could you confirm that the data is in long form?
    Variable name is single purposed. As many observations in the dataset as there are ‘actual’ observations.

Slide 1: What are the major properties of long data?
    Is always analyzable, as is, by statistical software. Is an inefficient format for data entry. Does NOT expose data traits by simple inspection.

Slide 3: What are the strengths and weaknesses of this plot?
    Single color representation. Does NOT easily expose key traits of the study outcome. Not in an acceptable form for most journals, in spite of conservative color format.

Slide 5: How is it likely that widening the dataset does improve our graphics options?
    We can configure plotting characteristics for each of the variables in each plot command. This means that it is easier to expose and separate the features.

Slide 5: Why did we do this?
    To give meaning to the new variables. Numbers per se do not have meanings by their values.
Slide 5: How did we accomplish this?
    With the code
        . gen When="Before" if when==1
        . replace When="After" if when==2

Slide 5: What does this mean for widening the dataset?
    Specifying, with the string option, that our group (or heritable) variable is a string.

Slide 5: What would we need to do to this dataset before widening it?
    Drop the old ‘when’ variable. It adds nothing.

Slide 6: What indications do we have here that the dataset has been widened?
    Number of observations has been reduced. Number of variables has NOT increased.

Slide 6: What new variables have been created?
    bpBefore and bpAfter.
  What old variables have vanished?
    bp and When.
  Where have the variables gone?
    bp -> bpBefore and bpAfter; When was dropped.

Slide 7: How do these commands, describe and summarize, help us confirm that we have achieved what we needed with the reshape?
    They show a) the new variables, b) the size of the new data organization, c) details of the values of the newly created variables.
Slide 8: What advantages have we gained over the ‘long’ bar plot?
    Control over the portrayal of the individual plot objects.
  What additional improvement could be sought?
    We could explore gaining control of the ‘sex’ display traits by additional reshaping wide.

Slide 9: Are further plot improvements indicated?
    Perhaps … though this is up to the journal and the material to communicate.

Slide 10: Is ‘sex’ currently in an appropriate form for reshaping, or would it be better if its form was changed?
    Strings work better than numbers for wide reshapes because numbers don’t convey any meaning.
  How would you recommend that we alter the ‘sex’ variable to assist with the reshape step?
    Recode it as a string.
  How will your recommendation actually help us?
    By tagging the newly created variables appropriately.

Slide 11: What was accomplished by the above ‘decode’?
    Sex was coded as a string.
  How were values of the variable ‘Sex’ derived?
    From the label of the encoded variable sex.
  What does the ‘*’ mean in the reshape command?
    Accept any ‘group’ of characters in this position as a means of identifying the variable.
Slide 11: How many variables were created from the reshape? How many were lost? Has the dataset length changed?
    This is why we describe the reshaped dataset as soon as we have reshaped it.
  Are there redundancies in the dataset?
    Yes, and we must correct the dataset in this regard if we intend to work with it further. We went from 120*6 data items to 120*7 data items. It would be good if you listed the dataset now to see where the problem has arisen. This is the most common problem associated with multiple reshaping.

Slide 12: How does a single entity enable us to see all 6 responses?
    Three age groups, and two times. Note that the time points Before and After are dealt with implicitly.

Slide 13: What new information do we obtain from ‘box’ versus ‘bar’ plots?
    The distribution of the bp values, as well as any extremes.
  Comparing this plot with the last one what sex difference do we see?
    We see that the men seem better controlled in the mid-age group. The box plot is much richer than the bar plot, enabling us to make finer empirical judgments.
Slide 14: What is the meaning of ‘Age’ in this code?
    It’s a loop index variable, running from 1 to 3.
  How do we ensure that only one age group is plotted at once?
    Through the if qualifier: Age and age must have the same value.
  Are we seeking ‘box’ or ‘bar’ plots?
    Bar.

Slide 16: Can you think of some special precautions we will need to negotiate to accomplish this?
    We’ll need to track down the redundancy associated with our last wide reshape.
  Immediately following a ‘reshape wide’ command what command, without any arguments, will reshape the dataset long … that is to its state just prior to the reshape wide?
    reshape long

Slide 17: What has this achieved?
    Check for yourself using describe.
  Can you spot a problem?
    There is a redundancy.
  How can we correct the problem?
    Delete the observations with missing values.
* A demonstration of reshaping in conjunction
* with graphics
use "bp long", clear
describe
summarize
gen When="Before" if when==1
replace When="After" if when==2
graph bar bp, by(age sex, rows(3)) over(When) ylabel(,angle(0))
drop when
reshape wide bp, i(patient) j(When) string
describe
summarize
graph bar bp*, by(age, cols(3)) over(sex)
graph bar bp*, over(age) over(sex)
decode sex, gen(Sex)
reshape wide bp*, i(patient) j(Sex) string
graph bar bp*Female, by(age, cols(3))
graph box bp*Male, by(age, cols(3)) ylabel(,angle(0))
forvalues Age=1/3 {
    graph bar bp*Male bp*Female if `Age'==age, title("Age group: `Age'")
    more
}
reshape long
drop if bpAfter==.
reshape long bp, i(patient) j(When) string
encode When, gen(when)
drop When