
Unit 2: Test Worthiness and Making Meaning out of Raw Scores







  1. Unit 2: Test Worthiness and Making Meaning out of Raw Scores and Common Assessment Instruments for Today’s World

  2. Test Worthiness: What Does it Take? Four requirements of test worthiness: • Validity: the test measures what it is supposed to measure • Reliability: the score is an accurate measure of the test-taker's true score • Cross-Cultural Fairness: the test is a true reflection of the individual & not a function of cultural bias inherent in the test • Practicality: the test is appropriate for the situation

  3. Correlation Coefficient • Correlation Coefficient: Relationship between two sets of test scores. Ranges from -1.0 to +1.0 • Positive Correlation: Tendency for scores to be related in the same direction • Negative Correlation: Tendency for scores to be related in opposite directions (inverse relationship)
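
The direction and strength of a correlation can be illustrated with a minimal Python sketch (the score lists here are hypothetical, chosen only to show a positive relationship):

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(xs), mean(ys)
    # Average product of deviations (covariance), scaled by both spreads
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data: hours studied vs. exam score (scores rise together)
study = [1, 2, 3, 4, 5]
exam = [55, 60, 70, 75, 90]
r = pearson_r(study, exam)
print(round(r, 2))
```

A value near +1.0, as here, indicates a strong positive relationship; reversing one list's direction would drive r toward -1.0.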

  4. Strong Correlation (Relationship) • Indication of Strong Relationship: Correlations near -1.0 or +1.0 indicate a strong relationship • Weak or No Relationship: Correlations near 0 • Scatterplot: Graph showing two or more sets of test scores • Positive correlation: Diagonal line rises from left to right • Negative correlation: Diagonal line falls from left to right

  5. Scatterplots: Positive and Negative Correlation (figures)

  6. Scatterplot: Weak or No Correlation

  7. Coefficient of Determination: Shared Variance • Coefficient of Determination: The proportion of shared variance, i.e., the common factors that account for a relationship • Calculated as the correlation coefficient squared (r²) • Example: A correlation of .85 was found between tests of depression & anxiety. Square .85: .85 x .85 = .7225; .7225 x 100 = 72.25, or about 72% • This shows that anxiety & depression share a large number of factors - but not all factors.
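
The slide's worked example (r = .85 between depression and anxiety measures) reduces to two lines of Python:

```python
# Coefficient of determination: square the correlation coefficient
r = 0.85                   # correlation from the slide's example
shared_variance = r ** 2   # 0.7225
print(f"{shared_variance:.2%} of variance is shared")  # prints 72.25% ...
```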

  8. Test Worthiness: Validity • Validity: The degree to which a test measures what it’s supposed to measure • Forms of Validity: • Content Validity • Criterion-related Validity • Concurrent Validity • Predictive Validity • Construct Validity • Experimental Design Validity • Convergent Validity • Discriminant Validity

  9. Validity: Content Validity • Content Validity: The content of the test is appropriate for what the test intends to measure • Face Validity: The superficial appearance of the test. A valid test may or may not have face validity. *Face validity is not a true measure of validity

  10. Validity: Criterion-related Validity • Criterion-related Validity: Relationship between test scores and another standard • Concurrent Validity: Relationship between test scores & another currently obtainable benchmark • Predictive Validity: Relationship between test scores & a future standard • Standard Error of Estimate: Range where a predicted score might lie • False Positive: A test incorrectly predicts a test-taker will have an attribute or be successful • False Negative: A test incorrectly predicts a test-taker will not have an attribute or be successful

  11. Validity: Construct Validity • Construct Validity: Evidence that an idea or concept is actually being measured by the test (Is the test for intelligence truly measuring intelligence?) • Evidence used to measure construct validity: • a) Experimental design: Using experimentation to show that a test measures a concept • b) Factor analysis: Statistically examining relationship between subscales and larger construct (between individual subject areas and the test as a whole)

  12. Validity: Construct Validity • Convergent Validity: Relationship between a test and other similar tests (highly correlated - say .75 range) • Discriminant Validity: Showing a lack of relationship between a test and tests of unrelated concepts (test between depression and anxiety)

  13. Reliability • Reliability: The degree to which test scores are free from errors of measurement “Perfect world” scenario: Test is well-made, the environment is optimal, & the test taker is at his/her best • Reliability Coefficient: Are test scores consistent and dependable?

  14. Reliability: Measuring Reliability • Test-retest Reliability: Relationship between test scores from one test given at two different administrations to the same people • The closer the two sets of scores, the more reliable the test • Test-retest reliability is more effective in areas that are less likely to change over time

  15. Reliability: Measuring Reliability • Alternate Forms Reliability: Relationship between scores from two similar versions of the same test • Examiner designs alternate, parallel, or equivalent forms of the original test and administers this alternate form as the second test • One of the problems is ensuring that both forms are truly equivalent

  16. Reliability: Internal Consistency • Internal Consistency: Reliability measured statistically by going “within” the test (how scores on individual items relate to each other or the test as a whole) • Types of Internal Consistency: 1) Split-half (odd-even) 2) Cronbach’s Coefficient Alpha 3) Kuder-Richardson

  17. Reliability: Internal Consistency • Split-half Reliability: Correlating one half of a test against the other half • Advantages of Split-half: 1) Having to give only one test 2) Not having to create a separate alternate form • Disadvantages of Split-half: 1) False reliability if the two halves are not parallel or equivalent 2) Makes the test half as long (shortening a test may decrease reliability)

  18. Reliability: Internal Consistency • Spearman-Brown Formula: Mathematical compensation for the shortened test length created by splitting the test in half • Spearman-Brown reliability = 2r_hh / (1 + r_hh), where r_hh is the split-half reliability estimate *If a test manual states that split-half reliability was used, check whether the Spearman-Brown formula was applied. If not, the test may be more reliable than is noted.
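
The Spearman-Brown correction is a one-line function; the .70 input below is a hypothetical split-half estimate:

```python
def spearman_brown(r_hh):
    """Correct a split-half reliability estimate for full test length."""
    return 2 * r_hh / (1 + r_hh)

# A split-half estimate of .70 corrects upward for the full-length test
print(round(spearman_brown(0.70), 2))
```

Note that the corrected value is always at least as large as the split-half estimate (they are equal only at 0 and 1), which is why an uncorrected split-half figure understates reliability.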

  19. Reliability: Internal Consistency • Cronbach’s Coefficient Alpha and Kuder-Richardson: • Methods that attempt to estimate the reliability of all the possible split-half combinations by correlating the scores for each item on the test with the total score on the test and finding the average correlation for all of the items Kuder-Richardson can only be used with tests that have right and wrong answers (achievement) Coefficient Alpha can be used with tests with various types of responses (rating scales)
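
Cronbach's alpha can be sketched from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The item data below is hypothetical (a 3-item rating scale answered by 4 respondents):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns (one list per item)."""
    k = len(items)
    item_vars = sum(pvariance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

# Hypothetical 3-item scale; each inner list holds one item's 4 responses
items = [[3, 4, 2, 5],
         [2, 4, 3, 5],
         [3, 5, 2, 4]]
print(round(cronbach_alpha(items), 2))
```

Because the formula works from item variances and total-score variance rather than right/wrong scoring, it accommodates rating scales, as the slide notes.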

  20. Reliability: Item Response Theory • Item Response Theory: Examines each item individually for its ability to measure the trait being examined • Item Characteristic Curve: Assumes that as people’s abilities increase, their probability of answering an item correctly increases

  21. Reliability: Item Characteristic Curve • If the "S" curve flattens out: Less ability to discriminate or provide a range of probabilities of a correct or incorrect response • If the "S" curve is steep: The item differentiates strongly across ability levels (Figure: probability of a correct answer, 0.0 to 1.0, plotted against IQ/ability, 55 to 145)

  22. Cross-cultural Fairness • Cross-cultural Fairness: Degree to which cultural background, class, disability, and gender do not affect test results • Tests must be carefully selected to prevent bias • Test scores must be interpreted in light of the cultural, ethnic, disability, or linguistic factors that may impact scores

  23. Practicality • Practicality: Feasibility considerations in test selection and administration • Major Practical Concerns: 1) Time: Amount of time to administer 2) Cost: Budgeting issues 3) Format: Print, type of questions 4) Readability: Understandability 5) Ease of Administration, Scoring, & Interpretation

  24. Selecting & Administering a Good Test 1) Determine goals of your client 2) Choose instrument to reach client goals 3) Access information about possible instruments, e.g., source books on testing (Buros Mental Measurements Yearbook, Tests in Print) 4) Examine validity, reliability, cross-cultural fairness, & practicality of the possible instruments 5) Choose an instrument wisely

  25. Unit 2: Statistical Concepts Making Meaning Out of Raw Scores

  26. Raw Scores are Meaningless • Raw Scores: Untreated score before manipulation or processing • Norm Group Comparisons Are Helpful: 1) Tells us relative position within the norm group 2) Allows us to compare the results among test-takers 3) Allows us to compare test results on two or more different tests taken by the same person

  27. Procedures for Normative Comparisons • Frequency Distribution: List of scores & number of times a score occurred • Orders a set of scores from highest to lowest & lists corresponding frequency of each score • Allows identification of most frequent scores and helps identify where an individual’s score falls relative to the rest of the group
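
Building a frequency distribution is straightforward in Python; the raw scores below are hypothetical:

```python
from collections import Counter

# Hypothetical raw scores from one administration
scores = [85, 90, 85, 70, 90, 85, 100, 70, 85]

# Frequency distribution: each score with the number of times it occurred,
# listed from highest score to lowest
freq = Counter(scores)
for score in sorted(freq, reverse=True):
    print(score, freq[score])
```

Reading the output top-down shows at a glance which scores are most frequent and where any individual's score falls relative to the group.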

  28. Histograms & Frequency Polygons • Histogram: Bar graph of class intervals & frequency of a set of scores Class Intervals: Grouping scores by a pre-determined range • Frequency Polygon: Line graph of class intervals & frequency of a set of scores

  29. Cumulative Distributions (Ogive Curve) • Cumulative Distribution: Line graph to examine percentile rank of a set of scores • Applications: Good for conveying information about percentile rank

  30. Normal Curves & Skewed Curves • Normal Curve: Bell-shaped curve that human traits tend to fall along • Predictable pattern that occurs whenever we measure human traits and abilities • Skewed Curves: Test scores that do not fall along a normal curve • Negatively Skewed Curve: Majority of scores at the upper end • Positively Skewed Curve: Majority of scores at the lower end

  31. Measures of Central Tendency • Central Tendency: Give you a sense of how close a score is to the middle of the distribution • Three Measures of Central Tendency: 1) Mean: Arithmetic average of all scores: add all scores and divide by # of scores 2) Median: Middle score: 50% fall above; 50% fall below 3) Mode: Most frequently occurring score *In a skewed distribution, median is a better measure of central tendency.
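
Python's standard library computes all three measures directly; the score set below is hypothetical:

```python
from statistics import mean, median, mode

scores = [70, 80, 80, 90, 100]  # hypothetical test scores
print(mean(scores), median(scores), mode(scores))
```

Here the mean (84) sits above the median (80), the pattern expected in a positively skewed set, which is why the median is preferred for skewed distributions.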

  32. Measures of Variability • Measures of Variability: How much scores vary in a distribution • Three Measures of Variability: 1) Range: Difference between highest & lowest score plus 1 2) Interquartile Range: Middle 50% of scores around the median 3) Standard Deviation: How scores vary from the mean

  33. Measures of Variability: Range • Range: Tells you the distance from the highest to lowest score • Calculated by subtracting the lowest score from the highest score and adding 1

  34. Measures of Variability: Interquartile Range • Interquartile Range: Provides the range of the middle 50% of scores around the median • Useful with skewed curves because it offers a more representative picture of where a large percentage of scores fall • Calculate: Subtract the score that is 1/4 of the way from the bottom from the score that is 3/4 of the way from the bottom & divide by 2. Next, add this number to, and subtract it from, the median

  35. Measures of Variability: Standard Deviation • Standard Deviation: Describes how scores vary from the mean • In all normal curves, the percentage of scores between standard deviation units is the same • About 99.7% of people fall within the first three standard deviations *Adequate scores are in the "eye of the beholder"
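
The range (using the slide's add-1 convention) and standard deviation can be sketched together; the score set is hypothetical:

```python
from statistics import pstdev

scores = [55, 60, 65, 70, 70, 75, 80, 85]  # hypothetical test scores

# Range: highest minus lowest, plus 1 (inclusive convention from the slide)
rng = max(scores) - min(scores) + 1

# Standard deviation: typical distance of scores from the mean
sd = pstdev(scores)
print(rng, round(sd, 2))
```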

  36. Common Assessments: Situation Specific • Developmental Disabilities: Impairment in Cognitive, Communication, Social/Emotional, & Adaptive (daily living skills) Functioning • Assessments Used: 1) Bayley Scales of Infant Development 2) Wechsler Preschool & Primary Scales of Intelligence, 3rd Edition 3) Wechsler Intelligence Scale for Children, 4th Ed. 4) Autism Diagnostic Observation Schedule 5) Vineland Adaptive Behavior Scale, 2nd Ed.

  37. Common Assessments: Situation Specific • Learning Disabilities: Disorders that affect a broad range of academic & functional skills, e.g., speaking, listening, reading, writing, spelling, & completing math calculations. Deficit in one or more ways the brain processes information • Assessments 1) Wechsler Preschool & Primary Scale of Intelligence 2) Wechsler Intelligence Scale for Children, 4th Ed. 3) Wechsler Adult Intelligence Scale, 3rd Ed. 4) Wechsler Individual Achievement Test, 2nd Ed.

  38. Learning Disabilities Assessments,Continued 5) Wechsler Memory Scale, 3rd Ed. 6) Woodcock-Johnson Test of Achievement, 3rd Ed. 7) Comprehensive Test of Phonological Processing 8) Attention Deficit Disorder Evaluation Scale (Home, Self-report, & School version) 9) Beck Depression Inventory, 2nd Ed. 10) Beck Anxiety Inventory

  39. Common Assessments: Situation Specific • Attention Deficit/Hyperactivity Disorder 1) Wechsler Intelligence Scale for Children, 4th Ed. 2) Processing Speed Index 3) Wechsler Adult Intelligence Scale, 3rd Ed. 4) Woodcock-Johnson Test of Achievement, 3rd Ed. 5) Understanding Directions Subtest 6) Attention Deficit Disorder Evaluation Scale (Home, Self-report, & School version) 7) Behavior Assessment System for Children, 2nd Ed. (Parent report, Teacher report, Self-report)

  40. Common Assessments: Situation Specific • Gifted and Talented Evaluation: Individuals who are so gifted or advanced, they need special provisions to meet their educational needs • Assessments 1) Wechsler Preschool & Primary Scale of Intelligence (3rd Ed.) 2) Wechsler Intelligence Scale for Children, 4th Ed. 3) Wechsler Adult Intelligence Scale, 3rd Ed.
