Content-based Interpretations of Test Scores

Michael Kane, National Conference of Bar Examiners. Maryland Assessment Research Center for Education Success, October 2008

Overview • Argument-based framework for validation • Three content-based interpretations: observable attributes, operationally defined attributes, and traits • Limitations of content-based validity evidence • “Begging the question”

Validation • To validate test score interpretations and uses is to evaluate the plausibility of the interpretations and the appropriateness of the uses. • Validation is therefore contingent; the evidence relevant to validation depends on the proposed interpretations and uses.

Argument-based Framework for Validation

Interpretations/Uses of scores • In order to evaluate an interpretation, it is necessary to specify what it claims. • What inferences are being drawn? • What rules of inference are being relied on? • What supporting assumptions are being made? • The format used to specify the interpretation and uses is not important. That they be specified is essential.

Argument-based Approach to Validation • The interpretive argument specifies the interpretations and uses of the test performances in terms of the inferences and assumptions used to get from a person’s test performance to the conclusions and decisions based on the test results. • The validity argument provides a critical evaluation of the interpretive argument.

Toulmin’s Model of Inference • Datum → [Warrant] → so {Qualifier} Claim • Backing supports the warrant • Exceptions mark the conditions under which the claim does not hold
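One way to make the parts of Toulmin’s model concrete is to write a single inference as a small record. The sketch below is only illustrative: the class name `Inference`, its field names, the default qualifier, and the example strings are assumptions introduced here, not part of the presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Inference:
    """One inference in an interpretive argument, in Toulmin's terms (hypothetical sketch)."""
    datum: str                  # what the inference starts from
    claim: str                  # what is concluded
    warrant: str                # generic rule licensing the step from datum to claim
    backing: List[str] = field(default_factory=list)     # evidence supporting the warrant
    qualifier: str = "presumably"                        # how strongly the claim is asserted
    exceptions: List[str] = field(default_factory=list)  # conditions that defeat the inference

    def is_backed(self) -> bool:
        """Return True if at least one piece of backing is recorded for the warrant."""
        return len(self.backing) > 0

    def render(self) -> str:
        """Lay the inference out in the datum -> [warrant] -> claim pattern."""
        text = f"{self.datum} -> [{self.warrant}] -> so, {self.qualifier}, {self.claim}"
        if self.exceptions:
            text += " unless " + "; ".join(self.exceptions)
        return text


# Illustrative content only; the strings are made up for this example.
generalization = Inference(
    datum="observed score on the administered tasks",
    claim="expected score over all similar tasks",
    warrant="scores generalize across samples of tasks",
    backing=["generalizability study"],
    exceptions=["the administered tasks were compromised"],
)
print(generalization.render())
```

Writing each inference this way makes it easy to see which warrants still lack backing, which is where validation effort would presumably be directed.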

Warrants as Generic Inferences

Characteristics of the Interpretive Argument • “Informal”: involves substantive inferences and assumptions, not just logical or statistical ones. • “Presumptive”: does not prove the conclusions, but develops a presumption in favor of them. • “Tentative”: conclusions are uncertain. • “Defeasible”: can be overturned in particular cases.

Criteria for Validating/Evaluating Interpretive Arguments • Clarity of the interpretive argument • Coherence of the interpretive argument • Plausibility of inferences • Plausibility of assumptions

Asking the Right Questions • An essential step in validation is the clear, explicit, and complete specification of the proposed interpretations and uses of test scores. • In the absence of a clear and complete understanding of the proposed interpretations and uses, validators literally do not know what they are doing. • To evaluate/validate the claims based on test scores, it is important to know what is being claimed.

Three Distinct Content-based Interpretations: Observable attributes Operationally defined attributes Traits

A Family of Content-based Interpretations • A cluster of closely related attributes that derive much of their meaning from content domains (Observable Attributes, Operational Definitions, and Traits). • These attributes are interesting in themselves. • And they illustrate the dependence of validation on the details of the proposed interpretations and uses.

Observable Attributes • Some kind of behavior is of interest. • A target domain (TD) of possible observations (often large and somewhat fuzzy) is specified. • The target score (TS), the person’s expected value over the TD, is taken to be the value of the observable attribute (OA) for that person. • Because it is not generally possible to make all of the observations in the TD, the TS has to be estimated using samples from the TD.
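The definition of the target score can be written compactly. The notation below is an assumption introduced for illustration (the slides state the definition in words only), and the estimator presumes a random sample of observations from the TD.

```latex
% Notation assumed for illustration only.
% X_{po}: scored outcome for person p on observation o in the target domain (TD).
\[
  \mathrm{TS}_p = \mathbb{E}_{o \in \mathrm{TD}}\left[ X_{po} \right],
  \qquad
  \widehat{\mathrm{TS}}_p = \frac{1}{n} \sum_{k=1}^{n} X_{p o_k},
  \quad o_1, \dots, o_n \text{ sampled from the TD}.
\]
```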

Possible Observations • Observable attributes are dispositions. • They report a tendency to respond in some way to some kind of stimulus or to perform in some way given a task. • Each possible observation in the TD involves some task or stimulus, some conditions of observation, some context, some response, and a categorization of the response (e.g., good, adequate, marginal, inadequate).

Notes on OAs • OAs are “observable” in the sense that they are expected values over (very large) domains (or sets) of potential observations. • They are inductive summaries. • OAs do not require an explanation for the observations, and they do not assume any latent trait that accounts for the observations. • But they do not rule out explanations in terms of theories, latent traits, etc. Rather, they invite explanation.

What Shapes TDs? • Why do we include some observations and not others in the TD? • Practical needs: performances involved in a job, sport, or other activity • Theoretical context: performances serve the same role or are accounted for in the same way by a theory. • Experience: performances seem to hang together • However, once the TD is specified, it defines the observable attribute.

Examples of Observable Attributes • Performance in shooting free throws in basketball • Performance in responding appropriately to written materials in English • Performance in a job • Performance in a trade or profession • Tendency to respond in some way to some kind of stimulus

Measuring Observable Attributes • Typically, it is not feasible to draw random or representative samples from the TD. • Rather, a measurement procedure is defined in terms of a subset of the TD, from which we can draw random or representative samples. • I will refer to this subset of the TD defining the measurement procedure as the universe of generalization (UG) for the procedure. • I will refer to a person’s expected value over the UG as the person’s universe score (US).

Interpretive Arguments for OAs • Evaluation: from observations to an observed score (OS) • Generalization: from the observed score (OS) to a universe score (US) • Extrapolation: from the universe score (US) to the target score (TS)
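The three inferences can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the task names, the person’s expected scores, the equal weighting of tasks, and the helper `expected_value` are assumptions made only to show how OS, US, and TS differ. In practice the expected scores are unknown, which is why each inference needs evidence.

```python
def expected_value(values):
    """Equal-weight average; equal weighting of tasks is an assumption of this sketch."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical person: expected score (0 to 1) on every task in the target domain (TD).
target_domain = {
    "test_item_1": 0.90,
    "test_item_2": 0.80,
    "test_item_3": 0.70,
    "test_item_4": 0.80,
    "nontest_task_1": 0.60,
    "nontest_task_2": 0.50,
}

# Universe of generalization (UG): the subset of the TD the procedure can sample from.
universe_of_generalization = ["test_item_1", "test_item_2", "test_item_3", "test_item_4"]

# Evaluation: scored responses on the administered tasks yield the observed score (OS).
scored_responses = {"test_item_1": 1, "test_item_3": 1, "test_item_4": 0}
observed_score = expected_value(scored_responses.values())

# Generalization: OS is taken as an estimate of the universe score (US),
# the expected value over the UG.
universe_score = expected_value(target_domain[t] for t in universe_of_generalization)

# Extrapolation: US is taken as an estimate of the target score (TS),
# the expected value over the whole TD.
target_score = expected_value(target_domain.values())

print(f"OS={observed_score:.2f}  US={universe_score:.2f}  TS={target_score:.2f}")
print(f"OS - US (sampling of tasks within the UG): {observed_score - universe_score:+.2f}")
print(f"US - TS (UG differs from the TD):          {universe_score - target_score:+.2f}")
```

In this made-up case the universe score overstates the target score because the tasks left out of the UG are harder for this person, which is the kind of discrepancy that evidence for extrapolation is meant to detect.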

Validity Arguments for OAs • Evaluation: expert judgment supporting the scoring rule • Generalization: generalizability study • Extrapolation: criterion-related data study; analyses of relationships between UG performances and TD performances
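As a rough illustration of the evidence a generalizability study can supply for the generalization inference, the sketch below estimates a generalizability coefficient for the simplest case: a crossed persons × items design with one score per cell. The design, the function name `g_coefficient`, and the data are assumptions; the presentation does not say which design or analysis is intended.

```python
def g_coefficient(scores):
    """Generalizability coefficient for a crossed persons x items design (sketch).

    scores[p][i] is the score of person p on item i, one observation per cell.
    Negative variance-component estimates are not handled here.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(scores[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

    # Mean squares from a two-way ANOVA without replication.
    ms_person = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ss_resid = sum(
        (scores[p][i] - person_means[p] - item_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_i - 1))

    # Estimated variance components.
    var_person = (ms_person - ms_resid) / n_i  # universe-score variance
    var_resid = ms_resid                       # person-by-item interaction plus error

    # Ratio of universe-score variance to itself plus relative error variance
    # for a mean over n_i items.
    return var_person / (var_person + var_resid / n_i)


# Hypothetical data: 4 persons (rows) by 3 items (columns), scored 0 or 1.
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(g_coefficient(example_scores), 3))
```

For this single-facet design the coefficient coincides with coefficient alpha; designs with additional facets (raters, occasions) would need more variance components than this sketch estimates.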

Operational Definitions • In some cases, OAs may be defined in terms of a domain from which it is possible to draw random or representative samples, and the attribute can be operationally defined in terms of a measurement procedure. • For such operationally defined attributes (ODAs), there is no extrapolation to a broader domain, and therefore no need for evidence supporting extrapolation. • So validation is much easier for an ODA than it is for a broadly defined OA.

Interpretive Arguments for Operationally Defined Attributes • Evaluation: from observations to an observed score (OS) • Generalization: from the observed score (OS) to a universe score (US)

Uses of Operational Definitions • An operationally-defined attribute is interpreted in terms of expected test performance. • Any inferences about non-test performances will generally require specific criterion-related evidence. • An ODA can also be used as an indicator for a theoretical construct, but this use requires construct-related validity evidence.

Traits • Trait definitions incorporate target domains of possible observations, but add assumptions about underlying causal traits that account for performance in the target domain. • As a result, trait interpretations are much richer than the interpretations of observable attributes or operationally defined attributes.

Trait Language 1 • A trait is a disposition to behave or perform in some way in response to some kinds of stimuli or tasks, under some range of circumstances. Much of the meaning of the trait is given by the domain of observations over which the disposition is defined, but trait interpretations also assume, at least implicitly, that some underlying or latent attribute accounts for the observed regularities in performance (Loevinger, 1957).

Trait Language 2 • Messick defined a trait as: “a relatively stable characteristic of a person ... which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances” (Messick, 1989, p. 15). • Trait language tends to be implicitly causal, but it specifies no mechanisms by which the trait influences performance or behavior.

Traits • One can think of a trait as an observable attribute with an added dimension, the underlying latent attribute that accounts for the observed performances. • Alternatively, one can start with a latent “trait” (e.g., anxiety, quantitative aptitude), and then specify a corresponding target domain of possible observations. • Either way, we have a target domain and an underlying latent trait.

Interpretive Arguments for Traits • Evaluation: from observations to an observed score (OS) • Generalization: from the observed score (OS) to a universe score (US) • Extrapolation: from the universe score (US) to the target score (TS) • Explanation/Implications: from the target score (TS) to the latent trait and to any implications associated with the trait

Validating Trait Interpretations • Validation requires backing for the scoring and generalization inferences, and typically for an extrapolation inference. • In addition, validation calls for backing for any additional inferences associated with the trait claims: • Unidimensionality • Agreement with theory (as in Cronbach and Meehl, 1955) • Relationship to other variables • Fit to an IRT model

Limitations of Content-based Validity Evidence

Criticisms of the Content Model • Content-based judgments about content relevance and representativeness are typically made during test development and have a confirmationist bias. • Messick (1989) saw content-validity evidence as playing a minor role in validation because it doesn’t apply directly to “inferences to be made from test scores” (p. 17). • Cronbach (1971, p. 452) maintained that: “Judgments about content validity should be restricted to the operational, externally observable side of testing. Judgments about the subject’s internal processes state hypotheses, and these require empirical construct validation.” (italics in original)


Confirmationist Bias and the Stages of Validation • Development Stage: creating the test and the interpretive argument (done by test developers; tends to be confirmationist; most content-related evidence is collected at this stage) • Appraisal Stage: challenging the interpretive argument

The Begging-the-question Fallacy • Begging the question occurs if a large part of the question at issue is simply taken for granted or “begged”. • In the weakest applications of content-validity models, content judgments are used to justify very expansive interpretations (e.g., in terms of traits, theoretical constructs) and uses (accountability).

To validate the interpretations and uses of test scores is to evaluate all of the claims being made.
