Measurement Class 8
Purposes of Measurement • Make connections between concepts and data • A measure is a representation of a variable, or a construct • A measure also has to capture the true meaning of the construct. The accuracy of this capture is called validity • But there are many ways to operationalize and measure a construct such as “damages,” “job satisfaction,” “productivity,” “social class,” “segregation,” or “attitudes toward the law” • What is critical to understand is that the decisions we make in measuring any of these constructs can bear on the results of the research and on the decision to reject the null hypothesis. The gap between what the measure captures and what the construct means is measurement error • There is no bigger cancer on social research than measurement error
Variables and Measures • The critical idea of a variable is that it has to vary. That is, it must represent the range of values implicit in the construct. • Variables are operationally defined by how they are measured. Precision in measurement, and a close relationship between measure and construct, allow other researchers to replicate findings from other studies • No replication → No validity in the causal claim • By operationally measuring a variable, you allow others to reach an independent judgment about the meaning of your results, and to try to reach those conclusions on their own.
Illustrations of Theoretical Meanings in Measurement • Some variables can be operationalized only in one way: gender, for example, or jury size. • But most variables can be operationalized and measured in many different ways. • Age can be conceptualized as a number, as a range, or as a descriptive, qualitative assessment • Race and ethnicity – two constructs or one? • Problematic – census categories – see Brent Staples in today’s NYT Editorial • But gender too can be operationalized in more than one way • Consider dangerousness in a civil proceeding, or exposure in a tort case • Violence – acts or consequences? threats or just physical acts? what about robbery?
Types or Levels of Measurement • Examples of nominal measures (measures that indicate distinct categories or types) • Gender • Religious preference • Region • Type of defense counsel • Examples of ordinal measures (measures that enable the ranking of categories but offer no information about the meanings of the intervals between ranks) • Birth order • School grade • Types of criminal sentences
Examples of interval scales (scales where the distances between points are equal and signify actual differences in the construct), but where the "zero" point remains arbitrary • Attitude scale scores • Crime rates (do we actually know what the "zero" point is?) • Temperature • Examples of ratio scales (scales built on a meaningful "zero" point, where ratios between values are meaningful as well as the distances between them) • Age, income • Exposure to toxins • Prison sentence lengths • Punitive damage awards • Segregation • Consumer confusion • A sketch of how these levels map onto data types in code follows below.
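To make the four levels concrete, here is a minimal Python (pandas) sketch using the slide's own examples – type of defense counsel, criminal sentences, and prison sentence lengths. The variable names and values are hypothetical, for illustration only.

```python
import pandas as pd

# Nominal: distinct categories with no inherent order
counsel = pd.Categorical(
    ["public defender", "retained", "assigned", "retained"],
    ordered=False,
)

# Ordinal: ranked categories; the intervals between ranks carry no meaning
sentence_type = pd.Categorical(
    ["probation", "jail", "prison", "jail"],
    categories=["probation", "jail", "prison"],
    ordered=True,
)

# Ratio: numeric values with a meaningful zero (sentence length in months)
sentence_months = pd.Series([12, 36, 120, 24])

print(sentence_type.max())     # ordered comparisons are valid for ordinal data
print(sentence_months.mean())  # means and ratios are valid for ratio data
```

The point of the distinction is that the representation licenses only the operations the level of measurement permits: ordered comparison for ordinal data, arithmetic for interval and ratio data.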
Scales • Scales are composites of separate items that together form complex representations of constructs • Scales avoid reliance on any individual item to represent a construct or phenomenon • Most scales demand that the face validity of the items making up the scale be high • Never trust a single-item scale!!!
Types of Scales • Likert Scales • Scores are obtained directly from respondents, with no discarding of items on which there is disagreement. Arbitrary numbers are assigned to low and high values. Reversals within related items are used to avoid “response sets.” • The scale is developed in stages: we begin with a large number of items and reduce them through item analysis. Items that fail to discriminate (those that high and low scorers answer in the same way) are eliminated from the analysis. Scale scores represent the total of responses to the items in the scale
Example: Collective Efficacy Scale For each of these statements, please tell me whether you (1) strongly agree, (2) agree, (3) neither agree nor disagree, (4) disagree, or (5) strongly disagree • If there is a problem around here the neighbors get together to deal with it. • This is a close-knit neighborhood. • When you get right down to it, no one in this neighborhood cares much about what happens to me • There are adults in this neighborhood that children can look up to • People around this neighborhood are willing to help their neighbors • People in this neighborhood generally don’t get along with each other • If I had to borrow $30 in an emergency, I could borrow it from a neighbor • People in this neighborhood do not share the same values • People in the neighborhood can be trusted • Parents in this neighborhood know their children’s friends • (Note the negatively worded items – the third, sixth, and eighth – which must be reverse-coded; see the scoring sketch below.)
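As a worked illustration of the reversal and totaling described on the previous slide, here is a short Python sketch that scores the scale above. The response data are invented for illustration.

```python
import numpy as np

# Hypothetical responses: rows are respondents, columns are the ten
# collective-efficacy items above, coded 1 (strongly agree) through
# 5 (strongly disagree).
responses = np.array([
    [1, 2, 4, 1, 2, 5, 1, 4, 2, 1],
    [3, 3, 3, 2, 3, 3, 2, 3, 3, 2],
    [2, 1, 5, 2, 1, 4, 1, 5, 1, 2],
])

# Items 3, 6, and 8 are negatively worded ("no one ... cares," "don't
# get along," "do not share the same values"), so reverse-code them
# before summing; on a 1-5 scale the reversal is 6 - x.
negatives = [2, 5, 7]                        # zero-based column indices
responses[:, negatives] = 6 - responses[:, negatives]

# Scale score = total of responses to items in the scale; with this
# coding, lower totals indicate higher collective efficacy.
scores = responses.sum(axis=1)
print(scores)
```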
Guttman Scales • A scale where the responses indicate the precise order of response to the constituent items. That is, responses to one item predict responses to other items. It constructs a scale on the logic that a respondent who scores positively or high at the upper boundaries of the scale will also (at least 90% of the time) score high or positively on the lower ranges of the scale. It is a convenient way of organizing ordinal or even nominal data into an interval or ratio scale. • Statistical tests are available to assess whether items fit into a Guttman scale and what the scale properties are (the coefficient of reproducibility; see the sketch after the example below). • Raises questions of the temporal dimension of scaling: items may scale well when the time span is long, but the scale may be plagued by errors if the time span is short. • Example: Spouse assault scale -- What happens if we lengthen or shorten the time frame? Validity threats from this concern?
Example: Conflict Tactics Scale We are interested in whether any of these things have happened to you during any relationship you’ve been in over the past year. Please tell me if this has happened in the past [YEAR], and, if so, how many times you think it has happened. In the last [YEAR]… (No / Yes / # Times) • Has your [PARTNER] pushed, grabbed, shoved, slapped, or shaken you? • Has your [PARTNER] punched, choked, strangled, kicked, bitten, or hit you? • Has your [PARTNER] thrown an object at you or tried to hit you with an object? • Has your [PARTNER] threatened you with a knife or gun? • Has your [PARTNER] ever shot at or stabbed you? • Has your [PARTNER] tried to stop you from working or studying? • Has your [PARTNER] tried to stop you from having contact with family, friends or co-workers? • Has your [PARTNER] become angry (e.g., yelled, gotten real upset) when you disagreed with his or her point of view? • Has your [PARTNER] damaged, destroyed, hidden, or thrown out any of your clothes or possessions? • Has your [PARTNER] damaged or destroyed any other property when angry with you? • Has your [PARTNER] locked you out of the house? • Has your [PARTNER] insulted or shamed you in front of others? • Has your [PARTNER] threatened to leave you? • Has your [PARTNER] called you stupid, fat or ugly? • Has your [PARTNER] used physical force or threats of force to make you have sex when you didn’t want to?
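The coefficient of reproducibility mentioned under Guttman scales can be sketched in a few lines of Python. This shows one common way of counting errors (against the nearest perfect cumulative pattern); the severity-ordered items and response patterns below are hypothetical.

```python
import numpy as np

def coefficient_of_reproducibility(data):
    """CR = 1 - (errors / total responses) for a respondents x items 0/1
    matrix whose columns are ordered from most to least commonly endorsed.
    Errors are counted against the nearest perfect pattern: a respondent
    with total score k should endorse exactly the first k items."""
    data = np.asarray(data)
    errors = 0
    for row in data:
        ideal = np.zeros_like(row)
        ideal[: row.sum()] = 1
        errors += np.sum(row != ideal)
    return 1 - errors / data.size

# Hypothetical severity-ordered assault items (threats, pushing,
# hitting, weapon use), coded 1 = reported within the time frame.
patterns = np.array([
    [1, 1, 1, 0],    # fits the cumulative structure
    [1, 1, 0, 0],    # fits
    [0, 1, 0, 1],    # two errors against its nearest perfect pattern
])
print(coefficient_of_reproducibility(patterns))  # 0.90+ suggests scalability
```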
Simple additive scales or arbitrary scales • Example: social class – a combination of income, job category, and location (plus subjective or perceptual factors) • Factor analysis • Rather than use an additive or other computational technique to arrange items in a scale, factor analysis offers the possibility of identifying and quantifying underlying patterns or dimensions among the items in a scale. It avoids the assumption that all items tap the same dimension of the phenomenon or construct. • Examples: • Neighborhoods • Childhood exposure to violence • Organizational climate
Procedures: The researcher constructs a correlation matrix, and items that correlate highly with one another are organized computationally as a "factor." Each item is given a factor "loading" that shows its relative correlation within the factor. • The researcher has the choice of either selecting those items within each factor that best represent the factor (those with the highest loadings), or using the factor score, a composite index of the items weighted by their loadings. (A bare-bones sketch follows below.) • Criticisms: • It is sample-specific. The results will vary with the response patterns of the sample. Adding a few cases can alter the factor scores. • It often is abused and used in a context devoid of theory. In this regard, it is simply an exercise in "barefoot empiricism": aimlessly tiptoeing through the data. • When used to analyze data that may itself contain measurement error (e.g., arrest records or bad scales), the errors can approach a level that cannot be tolerated.
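The procedure just described – correlation matrix, factors, loadings, factor scores – can be sketched minimally in Python with numpy. This omits the rotation and communality estimation that real statistical packages perform, and all data here are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # stand-in respondents x items matrix

R = np.corrcoef(X, rowvar=False)         # item-by-item correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the leading factors and scale eigenvectors into loadings
n_factors = 2
loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
print(np.round(loadings, 2))             # each item's loading on each factor

# A factor score as a simple loading-weighted composite of standardized items
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Xz @ loadings
```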
Evaluating Scales or Measures: Validity and Reliability • Validity and reliability are assessment tools that are, like a Greek chorus, a constant presence in the background, standing off-center or off-stage, commenting on the state of affairs and suggesting the plots and weaknesses that undermine the affairs of state at center stage.
Validity • Validity asks, quite simply, did we measure what we thought we measured? • A variable is validly measured if it accurately measures what you want it to measure. Thus, we ask whether what we observed is a function of the actual phenomena and relationships we have hypothesized, or an artifact of the research design (especially the measurement) we used to generate these data. • In fact, some of the trends that we think have held up consistently regarding simple relationships in criminality are explained better by artifacts of the study design than by the behaviors themselves.
Types of Validity • Face validity • Am I measuring what I think I am measuring? • Example: Family supervision -- LIKE PARENTS • Content validity • Does the item measure the concept it addresses? This is a dimension of validity similar to face validity, yet it differs in one important way: it refers to the ability of the item to distinguish among people within a population. • Examples: Test scores -- all students score the same • SRD scales -- kids with black eyes report NONE on the item relating to fighting
Construct validity • Am I measuring what the theory states and what the construct implies? The match between the theoretical and the operational definitions of the concept. The error could lie in the measurement, or in the formulation of the construct, but something is mismatched in this critical relationship. • Example: Fighting -> deviance -- maybe some behaviors are normative! • Concurrent validity • Does the item accurately reflect the present state of another variable measured at the same time? • Example: Measures of spouse assault from one member of the couple • Predictive validity • Does the item accurately forecast behaviors or outcomes in the future? • Examples: "Dangerousness" and future behavior after release • "Rehabilitation" and parole outcome
Convergent validity • How consistent or distinctive are multiple measures of the same or different constructs? Multiple measures of the same construct should point in the same direction; measures of different constructs should yield distinctive results. (A simulated illustration follows below.) • Examples: Using husband and wife self-reports of both victimization and offending to measure spouse assault • Using MMPI (standardized) and Rorschach (projective, subjective) tests to measure psychopathology • Another element of this strategy is the use of multiple methods. For example, using participant observation to determine whether drinking precedes gang fights • Techniques for assessing validity in surveys • Social desirability scales • Lie detector tests • Known-group tests • Secondary sources of data
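The logic of convergent and distinctive results can be illustrated with simulated data in Python. The variables below are invented stand-ins for the husband/wife example, not real measures.

```python
import numpy as np

# Two reports of the same underlying construct (assault) should agree;
# either report should be nearly unrelated to a different construct.
rng = np.random.default_rng(1)
true_assault = rng.normal(size=500)
husband_report = true_assault + rng.normal(scale=0.5, size=500)
wife_report = true_assault + rng.normal(scale=0.5, size=500)
unrelated = rng.normal(size=500)             # a different construct entirely

print(np.corrcoef(husband_report, wife_report)[0, 1])  # high: convergent
print(np.corrcoef(husband_report, unrelated)[0, 1])    # near zero: distinctive
```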
Reliability • A reliability coefficient is a measure of the consistency and stability of measurement across subjects and populations. • Consistency: a reliable variable is one where you keep getting the same value every time you measure it. • Stability refers to the consistency of measurement across periods. The simplest example is the test-retest score: scores should be consistent across time (after controlling for rival causal factors such as history or maturation) • Internal consistency refers to the associations among items within the measurement of a complex phenomenon. • Examples: • Weight – how consistent is a digital vs. mechanical scale? • Parental supervision
Types of Reliability • Inter-rater reliability refers to the degree of agreement between two independent people on a measure. The judges of Olympic skating provide a good test each year of the reliability of the judging technique. When the East German judge holds up a number very dissimilar from the other scores, we question whether the measure of scoring has good inter-rater reliability. • Test-retest reliability refers to the relationship between the score a person gets on one occasion and the score he or she receives on a subsequent occasion. The LSAT has good test-retest reliability: in general, scores on one administration are similar to scores on a second administration (compared to others who also are taking a second administration). • Multiple instrumentation -- e.g., randomization of items within multiple administrations of a test (used widely by the Educational Testing Service and the major polling organizations). • Split-half methods -- randomly splitting the items of a test into two halves and correlating the half scores to determine internal consistency (see the sketch below).
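Here is a minimal Python sketch of the split-half method as described above, including the standard Spearman-Brown correction for the fact that each half is only half as long as the full test. It assumes a respondents-by-items matrix of item scores.

```python
import numpy as np

def split_half_reliability(items, rng=None):
    """Randomly split the items of a test into two halves, correlate the
    half scores across respondents, then step up with Spearman-Brown to
    estimate the reliability of the full-length test."""
    rng = rng or np.random.default_rng()
    items = np.asarray(items, dtype=float)   # respondents x items
    cols = rng.permutation(items.shape[1])
    half = len(cols) // 2
    a = items[:, cols[:half]].sum(axis=1)    # score on one half
    b = items[:, cols[half:]].sum(axis=1)    # score on the other half
    r = np.corrcoef(a, b)[0, 1]
    return 2 * r / (1 + r)                   # Spearman-Brown step-up formula
```

Because the split is random, different splits give somewhat different estimates; averaging over many splits approaches Cronbach's alpha, discussed next.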
Measuring Reliability • Cronbach's alpha is used most widely and is available in most statistical packages: alpha = (N × r-bar) / (1 + (N − 1) × r-bar), where N is the number of items and r-bar is the average inter-item correlation among the items. • If you increase the number of items, you increase Cronbach's alpha. Additionally, if the average inter-item correlation is low, alpha will be low. As the average inter-item correlation increases, Cronbach's alpha increases as well. • CFI, RMSEA, etc. – fit indices for “latent construct” (confirmatory factor analysis) models – show whether all items form a single construct when measured concurrently.
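A minimal Python implementation of Cronbach's alpha, using the equivalent variance-decomposition form rather than the r-bar formula above (the two agree for standardized items). Input is assumed to be a respondents-by-items matrix.

```python
import numpy as np

def cronbach_alpha(items):
    """Raw alpha: (k / (k - 1)) * (1 - sum of item variances / variance of
    the total score). For standardized items this equals the formula on
    the slide, k * r_bar / (1 + (k - 1) * r_bar)."""
    items = np.asarray(items, dtype=float)   # respondents x items
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)
```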
The Relationship between Validity and Reliability • You must have reliability in order to have (measurement) validity. If everyone rating a variable came up with a different score every time the variable was measured, you can't draw an inference about the measure of the variable. You don't know if the measure is accurate, or if it is tapping the dimension you really want to measure. • But you can have reliability without validity. For example, you could use hat size as a measure of IQ. Reliable, yes, but valid? Hardly. • There is nothing wrong with the tape measure; readings aren't likely to vary much from one measurement to the next. But hat size has little to do with IQ, so why measure it? • Reasonable people will argue about validity. It is rarely an all-or-nothing assessment, as it is in the case of age. Validity means not just that the measure was accurate and clear, but that it was true. • Consider the different measures of faculty productivity: article credits, pages per faculty member, prestige of law reviews or journals, footnotes per faculty member, etc. (What about teaching productivity???) All these measures are quite reliable, but how valid are they as measures of "productivity"?