Comprehensive Exam Review
Appraisal Part 2
Statistical Concepts for Appraisal
A frequency distribution is a tabulation of scores in numerical order showing the number of persons who obtain each score or group of scores. A frequency distribution is usually described in terms of its measures of central tendency (i.e., mean, median, and mode), range, and standard deviation.
The (arithmetic) mean is the sum of a set of scores divided by the number of scores. The median is the middle score or point above or below which an equal number of ranked scores lie; it corresponds to the 50th percentile. The mode is the most frequently occurring score or value in a distribution of scores. The range is the arithmetic difference between the highest and the lowest scores obtained on a test by a given group.
Variability is the dispersion or spread of a set of scores; it is usually discussed in terms of standard deviations. The standard deviation is a measure of the variability in a set of scores (i.e., a frequency distribution). It is the square root of the average of the squared deviations around the mean (i.e., the square root of the variance for the set of scores).
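As a minimal illustration (using a hypothetical score list, not data from the review), these descriptive measures map directly onto Python's standard statistics module:

```python
# Sketch of the descriptive statistics above; the score list is hypothetical.
import statistics

scores = [82, 75, 91, 75, 88, 69, 95, 75, 80, 84]

mean = statistics.mean(scores)            # sum of scores / number of scores
median = statistics.median(scores)        # middle score (50th percentile)
mode = statistics.mode(scores)            # most frequently occurring score
score_range = max(scores) - min(scores)   # highest minus lowest score
stdev = statistics.pstdev(scores)         # population SD: square root of variance

print(mean, median, mode, score_range, round(stdev, 2))
```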
The normal distribution curve is a bell-shaped curve derived from the assumption that variations from the mean occur by chance, an assumption supported by the frequency distributions repeatedly observed in measurements of human characteristics in the behavioral sciences. Scores are symmetrically distributed above and below the mean, with the proportion of scores decreasing in fixed, known amounts as scores move away from the mean in standard deviation units.
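A quick sketch of those fixed proportions, using only the Python standard library:

```python
# Proportions of scores within 1, 2, and 3 standard deviations of the mean
# of a normal distribution.
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, SD 1
for k in (1, 2, 3):
    pct = z.cdf(k) - z.cdf(-k)
    print(f"within ±{k} SD: {pct:.1%}")
# Prints roughly 68.3%, 95.4%, 99.7% -- the familiar symmetric pattern.
```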
Skewness is the degree to which a distribution curve with one mode departs horizontally from symmetry, resulting in a positively or negatively skewed curve. In a positively skewed distribution, the “tail” of the curve is on the right and the “hump” is on the left. In a negatively skewed distribution, the “tail” is on the left and the “hump” is on the right.
Kurtosis is the degree to which a distribution curve with one mode departs vertically from the normal curve. A leptokurtic distribution is more “peaked” than the normal distribution. A platykurtic distribution is “flatter” than the normal distribution.
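Both shape statistics can be computed directly from the moments of a score distribution; this sketch uses hypothetical data and population (biased) moment formulas:

```python
# Moment-based skewness and excess kurtosis, computed from first principles.
def shape(scores):
    n = len(scores)
    mean = sum(scores) / n
    devs = [x - mean for x in scores]
    m2 = sum(d ** 2 for d in devs) / n   # variance (second moment)
    m3 = sum(d ** 3 for d in devs) / n   # third moment
    m4 = sum(d ** 4 for d in devs) / n   # fourth moment
    skew = m3 / m2 ** 1.5                # > 0: tail on the right (positive skew)
    kurt = m4 / m2 ** 2 - 3              # > 0: leptokurtic; < 0: platykurtic
    return skew, kurt

print(shape([60, 62, 65, 65, 66, 68, 70, 75, 88]))  # long right tail: positive skew
```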
Percentiles result from dividing a distribution into one hundred parts, each containing one percent of the scores. A percentile rank is the percentage of scores that fall below a particular score. Equal percentile differences do not correspond to equal raw-score differences: near the middle of the normal distribution, where scores are densely packed, a small raw-score difference spans many percentile points, whereas in the tails a large raw-score difference may span only a few.
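A minimal sketch of a percentile rank computation, with a hypothetical reference group:

```python
# Percentile rank: the percentage of scores in a reference group that fall
# below a given score.
def percentile_rank(scores, score):
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)

group = [55, 60, 62, 67, 70, 71, 74, 78, 83, 90]
print(percentile_rank(group, 74))  # 60.0 -- six of the ten scores fall below 74
```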
Standardization, sometimes called “normalizing,” is the conversion of a distribution of scores so that the mean equals zero and the standard deviation equals 1.0 for a particular sample or population. “Normalizing” a distribution is appropriate when the sample size is large and the actual distribution is not grossly different from a normal distribution.
Standardization, or normalizing, is an intermediate step in the derivation of standardized scores, such as T scores, SAT scores, or Deviation IQs. Stanines are a system for converting any particular score to a single digit from one through nine; they are derived from a distribution having a mean of five and a standard deviation of two.
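A sketch of these conversions on hypothetical raw scores. The T-score scale (mean 50, SD 10) is a standard convention; the stanine here is a simplified linear version (operational stanines assign fixed percentages of cases to each digit):

```python
# Standard-score conversions: z (mean 0, SD 1), T (mean 50, SD 10), and a
# simplified linear stanine (mean 5, SD 2, rounded and clipped to 1-9).
import statistics

scores = [82, 75, 91, 75, 88, 69, 95, 75, 80, 84]
mean, sd = statistics.mean(scores), statistics.pstdev(scores)

for x in scores[:3]:
    z = (x - mean) / sd
    t = 50 + 10 * z
    stanine = min(9, max(1, round(5 + 2 * z)))
    print(f"raw {x}: z={z:+.2f}, T={t:.0f}, stanine={stanine}")
```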
A correlation coefficient is a measure of relationship between two or more variables or attributes that ranges in value from -1.00 (perfect negative relationship) through 0.00 (no relationship) to +1.00 (perfect positive relationship). A regression coefficient is a measure of the linear relationship between a dependent variable and a set of independent variables.
The probability level (referred to as the alpha level when set in advance as a significance criterion) is the likelihood that a particular statistical result occurred simply on the basis of chance. The coefficient of determination is the square of a correlation coefficient; it is used to interpret the percentage of variance shared between two sets of test scores.
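A sketch of both statistics on hypothetical paired scores (statistics.correlation requires Python 3.10 or later):

```python
# Pearson correlation between two sets of scores and the coefficient of
# determination (r squared: the proportion of shared variance).
from statistics import correlation  # Python 3.10+

test_a = [12, 15, 11, 18, 20, 14, 16, 19]
test_b = [30, 35, 28, 40, 44, 33, 36, 43]

r = correlation(test_a, test_b)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")  # r^2 is the shared-variance proportion
```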
Error of measurement is the discrepancy between the value of an observed score and the value of the corresponding theoretical true score. The standard error of measurement is an indicator of how closely an observed score compares with the true score. This statistic is derived by computing the standard deviation of the distribution of errors for the given set of scores.
Measurement error variance is the portion of the observed score variance that is attributed to one or more sources of measurement error (i.e., the square of the standard error of measurement). Random error is error associated with statistical analyses that is unsystematic, often indirectly observed, and appears to be unrelated to any measured variables.
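The review does not give a computing formula, but a standard psychometric estimate is SEM = SD × √(1 − reliability); the sketch below uses hypothetical values:

```python
# Standard error of measurement via the common formula
# SEM = SD * sqrt(1 - reliability); values are hypothetical.
import math

sd = 10.0           # standard deviation of the observed scores
reliability = 0.91  # reliability coefficient of the test

sem = sd * math.sqrt(1 - reliability)
error_variance = sem ** 2  # measurement error variance (the square of the SEM)
print(f"SEM = {sem:.2f}, error variance = {error_variance:.2f}")
```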
Differential item functioning is a statistical property of a test item in which, conditional upon total test score or equivalent measure, different groups of test takers have different rates of correct item response. The item difficulty index is the percentage of a specified group that answers a test item correctly.
The item discrimination index is a statistic that indicates the extent to which a test item differentiates between high and low scorers. Extrapolation is the process of estimating values of a function beyond the range of the available data.
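A minimal sketch of both item statistics, using a hypothetical response pattern and the common upper-versus-lower-thirds method for discrimination:

```python
# Item difficulty (proportion answering correctly) and a simple
# upper-minus-lower discrimination index. The response list is hypothetical:
# one entry per examinee, sorted from lowest to highest total test score,
# with 1 = correct on this item.
def difficulty(responses):
    return sum(responses) / len(responses)

def discrimination(responses_sorted_by_total):
    k = len(responses_sorted_by_total) // 3          # upper and lower thirds
    lower = responses_sorted_by_total[:k]
    upper = responses_sorted_by_total[-k:]
    return sum(upper) / k - sum(lower) / k           # ranges from -1.0 to +1.0

item = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
print(difficulty(item), discrimination(item))        # about 0.67 and 0.75
```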
A confidence interval is the interval between two points on a scale within which a score of interest lies, at a specified level of probability. The error of estimate (standard or probable) is the degree to which scores predicted from a test correspond with the actual criterion scores.
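As a small worked sketch (hypothetical values), a 95% confidence interval around an observed score can be built from the standard error of measurement described earlier:

```python
# 95% confidence band around an observed score, built from the SEM.
observed = 104.0
sem = 3.0     # standard error of measurement for the test (hypothetical)
z_95 = 1.96   # z value bounding the middle 95% of the normal curve

low, high = observed - z_95 * sem, observed + z_95 * sem
print(f"95% confidence interval: {low:.1f} to {high:.1f}")
```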
The regression effect is the tendency of a predicted score to lie nearer to the mean of its distribution than the score from which the prediction was made. A factor is a hypothetical dimension underlying a psychological construct that is used to describe the construct and the intercorrelations associated with it.
Factor analysis is a statistical procedure for analyzing intercorrelations among a group of variables, such as test scores, by identifying a set of underlying hypothetical factors and determining the amount of variation in the variables that can be accounted for by the different factors. The factorial structure is the set of factors resulting from a factor analysis.
Reliability is the degree to which an individual would obtain the same score on a test if the test were re-administered to the individual with no intervening learning or practice effects. The reliability coefficient is an index that indicates the extent to which scores are free from measurement error. It is an approximation of the ratio of true variance to observed score variance for a particular population of test takers.
The coefficient of equivalence is a correlation between scores on two forms of a test given at essentially the same time. Also referred to as alternate-form reliability, it measures the extent to which two equivalent or parallel forms of a test are consistent in what they measure. The coefficient of stability is a correlation between scores on two administrations of the same test, such as a test administration and a retest after some intervening time period.
The coefficient of internal consistency is a reliability index based on interrelationships of item responses, or of scores on sections of a test, obtained during a single administration. The most common examples include the Kuder-Richardson and split-half coefficients. Coefficient alpha is a coefficient of internal consistency for a measure in which there are more than two response choices per item, such as a Likert scale.
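A sketch of coefficient alpha computed with the standard formula, alpha = (k / (k − 1)) × (1 − sum of item variances / total-score variance); the Likert responses are hypothetical:

```python
# Coefficient alpha from an item-response matrix
# (rows = examinees, columns = items on a 1-5 Likert scale).
import statistics

rows = [
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 3, 4],
]
k = len(rows[0])                                  # number of items
item_vars = [statistics.pvariance(col) for col in zip(*rows)]
total_var = statistics.pvariance([sum(r) for r in rows])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"coefficient alpha = {alpha:.2f}")
```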
The split-half reliability coefficient estimates the internal consistency of a power test by correlating the scores on two halves of the test (usually the odd-numbered items and the even-numbered items, provided their respective means and variances are equal). The Spearman-Brown Prophecy Formula projects the reliability of the full-length test from the reliability calculated on its halves, which reflect a test only half as long. It is a “correction” appropriate for use only with a split-half reliability coefficient.
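A sketch combining the two ideas, assuming hypothetical half-test scores; the Spearman-Brown correction for doubling test length is r_full = 2r / (1 + r):

```python
# Split-half reliability (odd vs. even items) corrected to full test length
# with the Spearman-Brown formula.
from statistics import correlation  # Python 3.10+

odd_half = [10, 14, 9, 16, 18, 12]   # examinees' scores on odd-numbered items
even_half = [11, 13, 10, 15, 17, 13]  # same examinees' scores on even items

r_half = correlation(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)    # projected reliability of the full test
print(f"half-test r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```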
Interrater reliability is an index of the consistency of two or more independent raters’ judgments in an assessment situation. Intrarater reliability is an index of the consistency of each independent rater’s judgments in an assessment situation.
Validity is the extent to which a given test measures or predicts what it purports to measure or predict. The two basic approaches to the determination of validity include logical analysis, which applies to content validity and item structure, and empirical analysis, which applies to predictive validity and concurrent validity. Construct validity falls under both logical and empirical analyses.
Validity is application specific, not a generalized concept. That is, a test is not in and of itself valid, but rather is valid for use for a specific purpose for a specific group of people in a specific situation. Validation is the process by which the validity of an instrument is measured. Face validity is a measure of the acceptability of a given test and test situation by the examinee or user, in terms of the apparent uses of the test.
Concurrent validity is a measure of how well a test score matches a measure of criterion performance. Example applications include comparing a distribution of scores for men in a given occupation with those for men in general, correlating a personality test score with an estimate of adjustment made in a counseling interview, and correlating an end-of-course achievement or ability test score with a grade-point average.
Content validity is a measure of how well the content of a given test represents the subject matter (domain or universe) or situation about which conclusions are to be drawn. A construct is a grouping of variables or behaviors considered to vary across people. A construct is not directly observable but rather is derived from theory. Construct validity is a measure of how well a test score yields results in line with theoretical implications associated with the construct label.
Predictive validity is a measure of how well predictions made from a given test are confirmed by data collected at a later time. Example applications of predictive validity include correlating intelligence test scores with course grades or correlating test scores obtained at the beginning of the year with grades earned at the end of the year.
Factorial validity is a measure of how well the factor structure resulting from a factor analysis of the test matches the theoretical framework for the test. Cross-validation is the process of determining whether a decision resulting from one set of data is truly effective when used with another relevant and independent data set.
Convergent evidence is validity evidence derived from correlations between test scores and other measures of the same construct, in which the relationships are in the predicted directions. Discriminant evidence is validity evidence derived from correlations between test scores and measures of different constructs, in which the relationships (typically weak or absent) are also in the predicted directions.
Appraisal of Intelligence
A very general definition of intelligence is that it is a person’s global or general level of mental (or cognitive) ability. However, there is considerable debate as to what intelligence is, and a corresponding amount of debate about how it should be measured.
Perhaps the biggest debate in the assessment of intelligence is how to use intelligence tests effectively. Given that intelligence is a “global” construct, what are the implications of intelligence test results for relatively specific circumstances and/or sets of behaviors? In general, intelligence test results have been most useful for interpretation in contexts calling for use of mental abilities, such as in educational processes.
Another argument concerns whether intelligence is “a (single) thing,” which is reflected in unifactor theories of intelligence, or a unique combination of things, which is reflected in multifactor theories of intelligence. The measurement implication of this debate is that some intelligence tests attempt to measure a single construct while others attempt to measure a unique set of interrelated constructs.
Another debate centers on what proportion of intelligence is genetic or inherited and what proportion is environmentally determined. This is the so-called “nature-nurture” controversy. So-called “fluid” intelligence (theoretically a person’s inherent capacity to learn and solve problems) is largely nonverbal and is a relatively culture-reduced form of mental efficiency.
So-called “crystallized” intelligence (theoretically) represents what a person has already learned, is most useful in circumstances calling for learned or habitual responses, and is heavily culturally laden. The nature-nurture concern has significant implications for how intelligence is assessed (e.g., what types of items and/or tasks are included), but there has not been full or consensual resolution of the debate.
A fourth major debate concerns the extent to which intelligence tests are racially, culturally, or otherwise biased. Although evidence of such biases was found in some “early” intelligence tests, improvements in psychometrics have done much to alleviate these biases, at least with regard to the resulting psychometric properties of “newer” intelligence tests.
In light of these and other considerations, the primary focus for the assessment of intelligence is on the construct validity of intelligence tests. In general, individually administered intelligence tests have achieved the greatest credibility. Individual intelligence tests typically are highly verbal in nature, i.e., necessitate command of language for effective performance.
Individual intelligence tests typically include both verbal (e.g., response selection or item completion) and performance (e.g., manipulation task) subsets of items. However, nonverbal and nonlanguage intelligence tests have been developed. Group administered intelligence tests, such as those commonly used in schools, are typically highly verbal and non-performance in nature.
An aptitude is a relatively clearly defined cognitive or behavioral ability. An aptitude is a much more focused ability than general intelligence, and the measurement of aptitudes also has been more focused. Literally hundreds of aptitude tests have been developed and are available for a substantial number of rather disparate human abilities.
Theoretically, aptitude tests are intended to measure “innate” abilities (or capacities) rather than learned behaviors or skills. There remains considerable debate as to whether this theoretical premise is actually achieved in practice. However, this debate lessens in importance if the relationship between a current aptitude test result and a future performance indicator is meaningful and useful.
Aptitude tests are used primarily for prediction of future behavior, particularly in regard to the application of specific abilities in specific contexts. Predictive validity is usually the foremost concern in aptitude appraisal and is usually established by determining the correlation between test results and some future behavioral criterion.
Although there are many individual aptitude tests, aptitude appraisal is much more commonly achieved through use of multiple-aptitude test batteries. There are two primary advantages to the use of multiple-aptitude batteries (as opposed to a collection of individual aptitude tests from different sources):
First, the subsections of multiple-aptitude test batteries are designed to be used as a collection; therefore, there is usually a common item and response format, greater uniformity in score reporting, and generally better understanding of subsection and overall results. Second, the norms for the various subtests are from a common population; therefore, comparison of results across subtests is facilitated.