Understand the importance of getting to know your variables in multivariate analysis, including context, unit of analysis, restrictions on your sample, level of measurement, missing values, and interpretation of values.
Getting to know your variables Jane E. Miller, PhD The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
Overview • Why “get to know” your variables? • Context • Unit of analysis • Restrictions on your analytic sample • Information about each variable • Level of measurement • Missing values • Valid range of values • Substantive interpretation of values • Distribution of observed values in your data set
Why is it important to get to know your variables? • Each variable measures • A specific concept • Numeric values have particular meanings that differ depending on the nature of that concept • In a particular context • When, where, to whom do those numbers pertain? • Collected with a specific study design • Need to understand why some values are missing • By design • Due to non-response
Example of failing to get to know variables • In a nationally representative survey sample from a developing country circa 2002, observed values of birth weight in grams ranged up to 9999. • Data set downloaded from a research data web site; not cleaned or evaluated before use. • Mean birth weight over 8000 in the sample • First red flag: Implausible as an actual birth weight, given its meaning and units. 9,999 grams ~= 22 lbs. • 9999 was a code for missing value • Lesson learned: Must become familiar with what a particular value means for that concept and context.
Second red flag • 2/3 of sample had a birth weight value of 9999 • Very high value for a substantial share of the sample, unlikely to be explained solely by • outliers • data entry errors • Lesson learned: Look at study documentation and questionnaire to find out why this distribution was observed. • Occurred due to a skip pattern designed to minimize recall bias in birth weight reporting.
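The two red flags above can be reproduced with a small sketch. The birth weights below are invented for illustration (they are not the survey data), and the sentinel 9999 is taken from the example; in practice the missing value code must be confirmed in the codebook.

```python
from statistics import mean

MISSING_CODE = 9999  # sentinel from the example; confirm the real code in the codebook

# Invented birth weights in grams; two thirds carry the missing value code.
birth_weight = [3200, 9999, 9999, 2900, 9999, 9999, 3500, 9999, 9999]

# Treating the code as a real measurement inflates the mean far beyond
# any plausible birth weight (first red flag).
naive_mean = mean(birth_weight)

# Dropping the coded values gives a mean based only on reported weights.
valid = [w for w in birth_weight if w != MISSING_CODE]
clean_mean = mean(valid)

# A large share of cases with the same extreme value (second red flag)
# points to a code or skip pattern, not to outliers or entry errors.
share_missing = birth_weight.count(MISSING_CODE) / len(birth_weight)

print(f"naive mean: {naive_mean:.0f} g")
print(f"mean of valid values: {clean_mean:.0f} g")
print(f"share coded missing: {share_missing:.2f}")
```

Checking the share of cases at the maximum value is a cheap first test before computing any summary statistic.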
Resources needed for this exercise • Documentation on the data source • Description of study design • Questionnaire • Codebook for electronic data file • Electronic file of database • Statistical software • Your research question • Articles, books, web sites etc. on your topic • Dependent and key independent variables
Time needed for this exercise • This is not an assignment that can be done overnight • Multi-step process involving multiple resources • About data source • About topic • Results from early steps will inform later steps in the exercise • May also need feedback from • Mentors • Colleagues • Persons involved in original data collection
Getting to know your variables is project-specific • Several issues that inform this assignment are specific to research question and data set. • Unit of analysis • Restrictions on your analytic sample • Roles of different variables in your analysis • Dependent, independent, control, filter • Even experienced researchers should complete this assignment when undertaking a project with a new topic or data set.
Valuable information for all parts of a well-written research paper • Reading the literature on your topic will provide information needed for • Introduction • Literature review • Discussion • Detailed knowledge of study design and variables from documentation, questionnaire and codebook will provide information for • A comprehensive data section • Appropriate model specification • Interpretation of statistical results
Context of the data • When • One point in time? • Several points in time (e.g., rounds of data collection) • Where • Geographic location • Institution(s) • Whom • All people in that place and time? • Limited to demographic, health or other subgroups? • Important because the topic alone is often insufficient to identify unrealistic values of variables.
Unit of analysis • Do data pertain to • Individual people? • Families? • Census tracts? • Institutions? • Knowing unit of analysis helps ascertain plausible range of values, • E.g., the mean number of family members will be much lower than population of a census tract or a school.
Restrictions on the analytic sample • Before you acquaint yourself with the range of values for each variable in your analysis, impose any limits related to your research question. E.g., • particular demographic traits • minimum test scores • a specific disease • Exclude subgroups that don’t meet minimum sample size if: • there aren’t enough cases in one or more subgroups of a key variable to provide sufficient statistical power • it would not be theoretically sensible to combine them with other subgroups used in your analysis
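A minimal sketch of these two steps, assuming a hypothetical data set of person records with `age` and `region` fields and an arbitrary minimum of 30 cases per subgroup. The threshold is a placeholder; the right value depends on the statistical power your analysis requires.

```python
from collections import Counter

MIN_CASES = 30  # hypothetical threshold; set it from your own power requirements

# Invented records standing in for a real data file.
records = (
    [{"age": 25, "region": "north"}] * 40
    + [{"age": 17, "region": "north"}] * 10
    + [{"age": 40, "region": "south"}] * 35
    + [{"age": 33, "region": "island"}] * 5
)

# Step 1: impose the restriction implied by the research question (adults only).
adults = [r for r in records if r["age"] >= 18]

# Step 2: count cases per subgroup of a key variable after that restriction.
sizes = Counter(r["region"] for r in adults)
too_small = sorted(group for group, n in sizes.items() if n < MIN_CASES)

# Step 3: exclude subgroups that fall below the minimum sample size.
analytic_sample = [r for r in adults if r["region"] not in too_small]

print(dict(sizes), "excluded:", too_small, "analytic n:", len(analytic_sample))
```

Counting subgroup sizes after the restriction matters, because a subgroup that is large in the full file can shrink below the threshold in the analytic sample.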
Attributes of each variable to familiarize yourself with prior to analysis
Labeling, coding, and missing value information for your variables • To help you create a comprehensive record of information on each of the variables in your analysis, fill out a grid like this one • An electronic version can be found online in the “getting to know your variables” assignment.
“Old” and “new” variables • Familiarize yourself with all variables to be used in your analysis. • Variables analyzed in the same form in which they appeared in the original data set • Variables you created from those variables, e.g., • Categorical versions of continuous variables • Aggregated variables, e.g., • income calculated from several sources • scales that combine responses to multiple items • Calculated variables • E.g., body mass index calculated from weight and height • Transformed variables (logged, standardized)
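As an illustration of "new" variables, the sketch below derives a calculated variable (body mass index), a categorical version of it, and a logged variable. The function names are hypothetical, and the cutoffs are the commonly used adult BMI categories; confirm the standard that applies in your field and population before adopting them.

```python
import math

def bmi(weight_kg, height_m):
    """Calculated variable: body mass index in kg per square meter."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Categorical version of a continuous variable.

    Cutoffs are an assumption (commonly used adult categories);
    check the standards used in your own field.
    """
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"

def log_income(income):
    """Transformed variable: natural log, undefined for zero or negative income."""
    return math.log(income) if income > 0 else None

person_bmi = bmi(70, 1.75)
print(round(person_bmi, 1), bmi_category(person_bmi), log_income(0))
```

Each created variable has its own valid range and its own source of missing values (for example, logged income is undefined for zero income), so it needs its own row in the grid.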
Organizing variables within the grid • Using major row headings, label sections for each of the following, based on their role in your analysis • Dependent variable(s) • Key independent variables • Control variables • Sampling weights • Filter questions (e.g., used to restrict sample)
Variable names and labels • For each variable, fill in: • Variable name: a short (up to ~8 character) acronym used to identify the variable in the software program you are using • Variable label: a descriptive phrase of up to 40 characters that helps convey the meaning of the variable • If you rename an item with a more informative variable name (e.g., “gender” instead of Q117), include the original question name in the variable label
Level of measurement • Categorical variables are those that are classified into ranges or categories. • Continuous variables • Measured in numeric units, but not grouped. • Two types of continuous variables: • Interval • Zero is not lowest possible value • e.g., temperature °Fahrenheit • Ratio • Zero is lowest possible value • e.g., temperature Kelvin, height, weight • Helps to anticipate limits on range of values
Categorical variables • Nominal variables • No inherent order to the categories • Numeric value labels have NO mathematical interpretation • Examples: • Gender • Race • Geographic region • Ordinal variables • Categories have an inherent numeric order • Examples: • Letter grades • Age group • Likert scale items • E.g., from strongly disagree to strongly agree
Units of measurement • System of measurement • E.g., Metric or British or other? • income in dollars or euros or pesos? • Level of aggregation • E.g., income per hour or per week or per year? • Scale • E.g., income in dollars or thousands of dollars or millions of dollars? • See also podcast on “reporting one number”
Missing values • Missing values on a variable can occur because they are • Not applicable • Missing by design • Non-response • See chapters 4 and 10 of The Chicago Guide to Writing about Numbers, 2nd Edition for more on missing values and missing by design.
Not applicable • Some questions are not asked of specific respondents because they don’t pertain. • E.g., if someone reports that they are unemployed, they wouldn’t be asked about their current job type or earnings • Look • At the questionnaire or form used to collect the data for • a filter question • a skip pattern • At the codebook for one or more missing value codes for non-response
Missing by design • Some topics are not asked of specific subgroups due to concern about the accuracy of their responses. • E.g., to minimize recall bias, mothers asked birth weight only for children under age 5 years. • Surveys sometimes administer specialized topic modules only to a randomly selected subsample of respondents. • Used to obtain a smaller representative sample that meets statistical power requirements while reducing study costs. • Read the study design documentation to find out whether these pertain to the question you are using.
Item non-response • Another reason for missing values is when a respondent does not answer a question that was asked of them. • Item non-response is particularly common for • stigmatized topics • questions that require complex or detailed answers • unclear instructions about number of allowed responses • Examples: Respondent was asked • to report income, but didn’t know it • immigration status, but had concerns about deportation
Types of non-response • Don’t know • Refused to answer the question • Didn’t answer the question (unspecified reason) • Other • marked too many answers to a single response question • wrote an illegible answer • Look up the pertinent missing value codes for each of your variables in the codebook for your data set.
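Once the codes are known, they can be tabulated by reason before any analysis. The sketch below assumes a hypothetical codebook in which 96 to 99 are missing value codes on a 1 to 5 item; actual codes differ across data sets and often across variables within one data set.

```python
from collections import Counter

# Hypothetical codebook entries; look up the real codes for each variable.
MISSING_CODES = {
    96: "not applicable",
    97: "refused",
    98: "don't know",
    99: "no answer",
}

# Invented responses to a 1-5 item, mixed with missing value codes.
responses = [3, 1, 98, 4, 97, 96, 96, 2, 99, 5]

# Tabulate missing cases by reason, keeping the reasons distinct.
reasons = Counter(MISSING_CODES[v] for v in responses if v in MISSING_CODES)

# Keep only substantive answers for analysis.
valid = [v for v in responses if v not in MISSING_CODES]

print(dict(reasons))
print("valid responses:", valid)
```

Keeping the reasons separate, instead of collapsing them into a single missing category, makes it possible to tell not applicable cases from item non-response later.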
Valid range of values • Definitional limits • Conceptually plausible range • Context of measurement • Observed range • Watch for numeric values for missing values • Label them in your electronic database, so they are treated correctly during analysis.
Definitional limits on values • Some variables by definition have limits on the range of values they can assume: • A percentage share of a whole must fall between 0 and 100 • Likewise, a proportion must fall between 0 and 1 • However, a percentage change can be • Negative (<0) • Greater than 100 • Variables at the ratio level of measurement cannot take on negative values • Other topic- or field-specific variables also have such restrictions • E.g., a Gini coefficient must fall between 0 and 1
Plausible range of values for the concept being measured A value of 10,000 • Makes sense in at least some contexts for • Annual family income in dollars • Population of a census tract • An annual death rate per 100,000 persons • Does NOT make sense for • Hourly income in dollars • Birth weight in grams • Number of persons in a family • A Likert scale item • A proportion • An annual death rate per 1,000 persons
Another example of plausible range of values A value of –1 • Makes sense in at least some contexts for • Temperature in degrees Fahrenheit or Celsius • Change in rating on a 5 point scale • Change in death rate • Percentage change in income • Does NOT make sense for • Temperature in degrees Kelvin • Number of persons in a family • Death rate • A Likert scale item • A proportion
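Definitional and plausible limits can be written down once and checked mechanically. In the sketch below, the helper `out_of_range` and the `LIMITS` table are hypothetical; the bounds shown are the definitional ones named in the slides, and plausible limits for your own variables would come from the literature and the study context.

```python
# Hypothetical table of (low, high) limits; None means no bound on that side.
LIMITS = {
    "proportion": (0, 1),
    "pct_share": (0, 100),
    "gini": (0, 1),
    "temp_kelvin": (0, None),
}

def out_of_range(values, low=None, high=None):
    """Return the values that fall outside the stated limits."""
    return [
        v for v in values
        if (low is not None and v < low) or (high is not None and v > high)
    ]

# Invented data: a proportion above 1 and below 0, a negative Kelvin reading.
data = {
    "proportion": [0.2, 1.4, -0.1],
    "temp_kelvin": [-1, 273.15],
}

for name, values in data.items():
    low, high = LIMITS[name]
    print(name, "violations:", out_of_range(values, low, high))
```

A value flagged here is not automatically an error: it may be a missing value code, a different unit, or a different scale, which is why the flag should send you back to the codebook.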
Descriptive statistics on your variables • After you have • Imposed restrictions on your analytic sample • Filled in missing value codes for each variable • Complete a grid like the one below, with descriptive statistics on each of the variables in your analysis. • An electronic version of this grid can be found online in the “getting to know your variables” assignment.
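One row of such a grid could be computed as follows. This is a sketch under stated assumptions: the function name `describe` and the choice of statistics are illustrative, not the layout of the grid in the assignment, and the sample standard deviation requires at least two valid values.

```python
from statistics import mean, stdev

def describe(values, missing_codes=()):
    """Summarize one continuous variable after setting aside missing codes."""
    valid = [v for v in values if v is not None and v not in missing_codes]
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "sd": stdev(valid),  # sample standard deviation; needs >= 2 valid values
    }

# Invented scores with one case carrying a hypothetical missing code of 99.
row = describe([2, 4, 4, 4, 5, 5, 7, 9, 99], missing_codes={99})
print(row)
```

Reporting the number of missing cases next to the minimum and maximum makes it easy to see whether a missing value code has leaked into the valid range.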
Familiarizing yourself with the concepts under study • To identify plausible ranges of values for each of your variables, read the literature on your dependent and key independent variables. • Read for • how each concept is operationalized in the data set • standards, cutoffs, or transformations commonly used for that variable in your field • range of values observed • but pay attention to differences in who, when, where studied
Check each distribution against the codebook for the original source • Codebooks for some data sets provide information on • frequency distribution of categorical variables • range and/or mean values for continuous variables • number of cases with missing values, by reason for missing value (not applicable, refused, etc.) • Check the distribution of values observed in your analytic sample for each variable against the codebook for your data set. • If any distributions are inconsistent, do NOT analyze the data until you have resolved the discrepancies!
Identify reasons for inconsistencies • Review your answers to the previous steps in this exercise to identify possible reasons for discrepancies between your statistics and the codebook, such as: • Units of analysis • e.g., family instead of individual • Restrictions on your analytic sample • e.g., excluding a subgroup that is included in the statistics shown in the overall codebook • Scale • e.g., grams instead of kilograms • Transformations you have made to the variables, e.g., • logged values • multiples of standard deviations rather than original units
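For categorical variables, the comparison with the codebook can be sketched as below. The function `compare_to_codebook`, the category labels, and the tolerance of 2 percentage points are all assumptions for illustration; a sensible tolerance depends on your sample size and on any restrictions you imposed on the analytic sample.

```python
from collections import Counter

def compare_to_codebook(values, codebook_pct, tolerance=2.0):
    """Flag categories whose observed percentage differs from the codebook.

    Returns {category: (observed_pct, codebook_pct)} for every category
    that differs by more than `tolerance` percentage points.
    """
    if not values:
        raise ValueError("no observations to compare")
    counts = Counter(values)
    total = len(values)
    flagged = {}
    for category in sorted(set(codebook_pct) | set(counts)):
        observed = 100 * counts.get(category, 0) / total
        expected = codebook_pct.get(category, 0.0)
        if abs(observed - expected) > tolerance:
            flagged[category] = (round(observed, 1), expected)
    return flagged

# Invented sample and codebook percentages.
sample = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
codebook = {"a": 50, "b": 40, "c": 10}

print(compare_to_codebook(sample, codebook))
```

Any flagged category is a prompt to work through the list of possible reasons above before analyzing the data.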
Check each distribution against the literature on similar variables • Track down information in the published literature on each of your main variables for a similar population. • Check the distribution of values for each variable in your data set against the values from the external source of information about that variable. • Again, if the values of your data are substantially different from those used in other studies of the same concepts, do NOT analyze the data until you have resolved the discrepancies!
Identify reasons for inconsistencies • Review your answers to the previous steps in this assignment to explain possible reasons for discrepancies between your data and other similar data sets, such as: • Population studied, e.g., substantially different time, place, and/or subgroup • Units of analysis, e.g., family instead of individual • Units of measurement, e.g., metric instead of British units • Scale, e.g., grams instead of kilograms • Transformations of the variables, e.g., percentiles instead of original value
Summary • Before you conduct your analyses, it is critical that you and other members of your research team become familiar with the following for each variable • Levels of measurement • Units and categories • Plausible and observed ranges of values • Missing values and their reasons • Compare observed values against • Documentation for the data set you use in your analysis • The published literature on your topic
Summary, continued • These attributes are essential information for • Data preparation • Inclusion criteria for your analytic sample • Creation of new variables • Choice of pertinent descriptive and multivariate statistics • Design of correct charts and tables • Writing correct prose descriptions for the data and methods and results sections of your paper. • Even experienced researchers should complete this assignment when they undertake a project with a new topic or data set.
Reasons for getting to know your variables, redux • Exercises in this podcast are time-consuming but very valuable for generating in-depth knowledge needed for your paper • Reading the literature on your topic will yield information needed for the introduction, literature review and discussion sections. • Detailed knowledge of study design and variables from documentation, questionnaire and codebook will yield information for the data and methods and results sections.
Resources on your topic • Articles, books, reports, or web sites related to the main independent and dependent variables in your data • Definitions of concepts under study • Operationalization (how those concepts are actually measured in a particular data set) • Observed distributions in populations similar to those from which your sample is drawn • Commonly used transformations of those variables prior to analysis
Resources on your data set • Documentation on study design • Context (who, when, where) • Sampling • Unit of analysis • Questionnaire or other data collection instrument • Modules • Wording of questions • Skip patterns • Codebook • Levels of measurement, categories, units • Missing value codes • Distribution of observed values for the study sample
Suggested readings • Miller, J. E. 2015. The Chicago Guide to Writing about Numbers, 2nd Edition. • chapter 4 on levels of measurement, units, standards and cutoffs • chapter 10 on data and methods • chapters 4 and 10 on missing values and missing by design • Chambliss, Daniel F., and Russell K. Schutt. 2012. Making Sense of the Social World: Methods of Investigation, 4th Edition. Thousand Oaks, CA: Sage Publications, or other research methods book for information on • study design, conceptualization, and measurement
Suggested online resources • Podcasts on • Reporting one number (re: units) • Comparing two numbers or series of numbers (re: levels of measurement) • Planning how to create the variables you need from the variables you have
Suggested practice exercises • Study guide to The Chicago Guide to Writing about Numbers, 2nd Edition. • Problem sets for • chapter 4, questions # 6 and 13 • chapter 10, questions #1, 3, and 5 • Suggested course extensions for • chapter 4 • “Reviewing” exercise #1 • “Estimating statistics and writing” exercises #1 and 2 • chapter 10 • “Reviewing” exercises #1 and 3 • “Writing” exercise #4 • “Revising” exercise #3
Contact information Jane E. Miller, PhD jmiller@ifh.rutgers.edu Online materials available at http://press.uchicago.edu/books/miller/numbers/index.html