1. Assessing the Items: Using SME judgments for Linkages, Sensitivity, and Cut Off Scores MAPAC Presentation
Dennis Doverspike, PhD, ABPP
University of Akron
Akron, Ohio
2. Our Purpose
In constructing locally developed tests, assessment professionals frequently collect judgments concerning item properties from SMEs.
Use of such judgments is increasing
SMEs
Instructors
Supervisors
In constructing locally developed tests, assessment professionals frequently collect judgments concerning item properties from subject matter experts (SMEs). A locally developed test can be defined as a tailored or customized instrument based on local information (that is, a locally validated test, including tests such as job knowledge and work sample tests).
How many of you develop this type of test – hope it is all of you or my talk may not mean much?
In recent years, the use of such judgments appears to have been increasing. This increase has been a result of pressure from federal agencies (including the OFCCP and EEOC) for the greater quantification and documentation of procedures. This has led to increased use of questionnaires completed by SMEs, incumbents and supervisors. Thus, there has been a transfer in decision making from experts, us, me and you, to the mathematics of SMEs, incumbents and supervisors. And that is the general topic we will be discussing today.
Now, how many of you use SMEs? Raise hands?
As a side note, I have also asked that two papers be distributed to you. The first one is on job analysis for test development. That is a chapter that Winfred Arthur and I wrote for a new book on job analysis. I know you had a separate workshop on this here at MAPAC. I am not going to go over the material in the chapter, as you did have that separate workshop, but wanted to pass it out to you. We are going to be dealing with the more specific topic of SME judgments.
Second, since some of you may be involved in public safety selection, I have passed out a paper on Fire and Police selection that was written by myself, Gerald Barrett, and Candace Young. We also will not be discussing that paper, but again, I wanted you to have it.
This is a little like Oprah, you come for the talk and you get free gifts.
3. SME Ratings
Ratings are collected on
Tasks
  Importance
  Time Spent
KSAs, Competencies
  Criticality
  When Learned
In locally developed tests, job incumbents or SMEs may be asked to make many ratings. These include ratings of tasks (importance and time spent) and ratings of KSAs or competencies (criticality and when learned).
4. SME Ratings
Items
Difficulty (cutoff scores)
Linkages (forward and reverse)
Cultural sensitivity
Importance (for weighting)
Correctness - Appeals
Other very important or critical ratings SMEs may be asked to make on items include:
Difficulty (cutoff scores, Angoff method)
Linkages (forward and reverse)
Cultural sensitivity
Importance (for weighting)
Correctness - Appeals
5. The Big Question
Are we asking the correct questions, especially in the context of cutoff scores and cultural sensitivity?
In this talk, we will look at the questions we ask in making such judgments and whether we are asking the correct questions, especially in the context of cutoff scores. In addition to a discussion of philosophical and psychological issues, we will also look at practical methods or forms for collecting judgments regarding items.
In part, we will also be asking whether SMEs can really make these ratings. Surprisingly, or not surprisingly, we really don’t know a lot about whether SMEs can make these ratings, or about the optimal conditions or training for such ratings.
With all the research we do, and all that appears in journals, it is surprising we don’t do more research on these topics. Or maybe it is not surprising; there are different groups of researchers trying to make progress, although a lot of the research tends to be in education rather than adult employment settings.
6. What We Know - SMEs
The mathematics we apply to SME judgments (especially to arrive at weights) are not technically permissible.
The weights probably do not matter much anyway.
SMEs can make reasonable estimates of item difficulty.
SMEs cannot tell us how they make those judgments.
Before we get further into the topic, let's look at some of the things we do know. Be forewarned, though: like learning how sausage is made, a lot of the detail regarding research on such judgments is not that pretty. You might be better off not knowing.
The mathematics we apply to SME judgments (especially to arrive at weights) are not technically permissible. That is, we frequently multiply together numbers that should not and cannot be multiplied together. We treat our data as being measured on a ratio scale when at best they are measured on an interval or ordinal scale. So, we should not be multiplying the numbers together.
We frequently come up with weights where the weights really do not matter that much. That is, unit weights tend to give us the same outcomes as relative weights or judgment-based weights.
This is a good thing: SMEs can make reasonable estimates of item difficulty.
However, SMEs cannot tell us how they make those judgments. Or maybe we just have not asked.
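The point about unit weights versus judgment-based weights can be illustrated with a small sketch. The subtest scores and the SME weights below are hypothetical values invented for this illustration, not data from any study. The sketch only shows that when subtests are positively correlated, a unit-weighted composite and a judgment-weighted composite tend to order candidates in nearly the same way.

```python
from math import sqrt

# Hypothetical scores on three subtests for six candidates (invented for illustration).
scores = {
    "A": [80, 75, 70],
    "B": [60, 65, 55],
    "C": [90, 85, 95],
    "D": [50, 60, 45],
    "E": [70, 65, 80],
    "F": [40, 50, 60],
}
# Hypothetical SME-derived relative weights (assumed; they sum to 1).
sme_weights = [0.5, 0.3, 0.2]


def composite(values, weights):
    """Weighted sum of subtest scores."""
    return sum(v * w for v, w in zip(values, weights))


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_order(composites):
    """Candidate names sorted from highest to lowest composite score."""
    return sorted(composites, key=composites.get, reverse=True)


unit = {name: composite(v, [1, 1, 1]) for name, v in scores.items()}
weighted = {name: composite(v, sme_weights) for name, v in scores.items()}

r = pearson(list(unit.values()), list(weighted.values()))
print(f"correlation between composites: {r:.3f}")
print("unit-weight ranking:", rank_order(unit))
print("SME-weight ranking: ", rank_order(weighted))
```

With real data the agreement depends on how strongly the subtests correlate and how extreme the weights are, so this is a sanity check to run on your own ratings rather than a general proof.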
7. What We Know – Test Taker Assumption
Test takers would like tests to have no error
Test takers would like items to have no error
Beyond SMEs, the assumptions that test takers make are important. Test takers tend to make some assumptions that are highly unrealistic. So do civil service commissions, public policy people, and lawyers.
Test takers, civil service commissions, and public policy people would like tests to have no error.
Test takers, civil service commissions, and public policy people would like items to have no error.
We know both of the above are impossible.
8. What We Know
Tests have error
X = T + E
Items have error
X = T + S + E
As psychometricians, we know that tests have error and items have error. Here X is the observed score and T is the true score. Tests have random error (E), and items have specific content error (S, which we assume to be random over other items) and also random error.
So, we tend to have lots of errors or mistakes when we make predictions. Even if our tests were much closer to perfect than they are, even if they were near perfect, we would still have lots of errors.
An interesting question is whether we could have items without error. Yes we could, if you remember your testing or psychometrics class. There are items, Guttman items, that are deterministic and have no error. But it turns out there are two problems with Guttman-type items:
1. It is next to impossible to write such items, especially a whole test of them.
2. They tend to give accurate estimates of ability only in a narrow range. They do not help us discriminate or rank order people. We are usually interested in knowing not only whether someone has passed a test, but also how they rank on that test compared to other people. Who scored highest? Who scored lowest? For answering the whole set of questions across the board, we are better off using items with error.
So, why not just discriminate at one point, the passing point? Because we are interested in information at other points.
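To make the "even near-perfect tests produce lots of errors" point concrete, here is a minimal simulation sketch of X = T + E. It rests on assumptions that are not from the talk: true scores and errors are normally distributed, and the pass/fail cutoff sits at the mean for both true and observed scores. The function name and the reliability values are illustrative only.

```python
import math
import random


def misclassification_rate(reliability, n=100_000, seed=0):
    """Share of simulated examinees whose observed pass/fail status differs
    from their true status, with both cutoffs at the mean (z = 0)."""
    rng = random.Random(seed)
    # With var(T) = 1, reliability = 1 / (1 + var(E)), so var(E) = 1/reliability - 1.
    error_sd = math.sqrt(1.0 / reliability - 1.0)
    wrong = 0
    for _ in range(n):
        true_score = rng.gauss(0.0, 1.0)                   # T
        observed = true_score + rng.gauss(0.0, error_sd)   # X = T + E
        if (true_score >= 0.0) != (observed >= 0.0):
            wrong += 1
    return wrong / n


for rel in (0.80, 0.90, 0.95):
    rate = misclassification_rate(rel)
    print(f"reliability {rel:.2f}: about {rate:.1%} misclassified")
```

Under these assumptions the expected rate is arccos(sqrt(reliability)) / pi, which is roughly 10% at a reliability of .90. A cutoff away from the mean, or non-normal scores, would change the numbers but not the basic conclusion.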
9. What We Know: The Beauty of .70
At SIOP, a discussant criticized the Angoff method for always coming out to .70; however, there is a certain beauty in this.
As another side note, I think we should comment on the .70 level. At SIOP, we had a session on cutoff scores. It was a very good discussion, and a lot of advances are being made on cutoff scores. One of the discussants criticized the Angoff method for always coming out to .70; however, there is a certain beauty in this. In a way, it is a good thing that Angoff always comes out to .70, and it suggests that there is some deeper meaning to all of this.
10. The Beauty of .70
Corresponds to a C or passing in academic circles
Public sector – often used as a cutoff
Item writers – seem to naturally write items that have p values around .70
Best point for discrimination
The .70 level corresponds to a C, or passing, in academic circles. In the public sector it is often used as a cutoff. Item writers seem to naturally write items that have p values around .70. It is also the best point for discrimination: it can be shown from theory and from data that, given the typical multiple-choice test, around .70 is the optimal item difficulty level.
11. Ricci
Ricci v. DeStefano
Before Supreme Court
Brief of I-O Psychologists
Case deals with 2003 New Haven Fire Department Promotional Examination
Nevertheless, there is a certain danger in beauty, including the .70.
As a result of the Ricci case I decided to concentrate more on cutoffs and the Angoff method.
More of a speech and interactive idea session.
12. Criticism of Cutoff
IOS exacerbated the problem of imbalance in its response to another predetermined feature of the NHFD exams – the 70% cutoff score mandated by the City’s civil service rules.
Concession by Mr. Legel that IOS was unable to validate the cutoff score
Contributed to adverse impact of exams
13. SIOP
Great deal of interest in cutoffs and Angoff
So – concentration today on some of the ongoing issues with cutoffs
14. First Type of Judgment
CUTOFFS: ANGOFF
The first type of SME judgment we are going to deal with is the setting of cutoffs. We may use any of several methods, but we tend to use a technique known as Angoff.
15. Standard Setting
Cutoff scores – score to get in or receive some treatment (point at which a person passes or fails a test)
Critical scores – minimum performance
Certification standards – certify for a profession
  Psychologist
  Teacher
In standard setting, we have some specific definitions, as listed above.
In developing a locally validated test, perhaps the most difficult decisions are those regarding the appropriate cutoffs and weighting for each test. Cutoffs may be the most difficult decision. Other than the use of cognitive ability tests, no topic generates as much controversy as the setting of cutoffs. The establishment of appropriate cutoffs is likely to add another step or questionnaire to the job analysis. Although it is possible to set cutoffs based on statistical methods, in many cases the appropriate cutoff score is arrived at using a judgmental method (Cizek & Bunch, 2007), the most common being the Angoff method.
16. Angoff Method
Judgmental Methods Frequently Used
Angoff Question:
What percentage of minimally qualified people will get the item correct?
Advantages
Quick
Reliable
Valid
Easy to Understand
Judgmental methods for setting cutoffs require an additional questionnaire, usually completed by the SMEs. In using the Angoff method, expert judges are asked to provide ratings on individual items. The judges are instructed to imagine a minimally competent person and estimate the percentage of minimally competent persons who would answer each item on the test correctly. Thus, for each test item, judges estimate the passing rate among minimally competent persons. Once all judges have completed their ratings, the ratings for each item are averaged across judges, creating an estimate of item difficulty for the minimally competent person. An overall test difficulty score is then computed as the average of these item difficulty estimates across all test items. This score, expressed as a percentage, is interpreted as the appropriate cutoff score for the test, that is, the percentage of items that must be answered correctly in order to pass. This is actually a simpler process than the description might suggest, and judges seem to be both able and willing to make such difficulty judgments.
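The Angoff arithmetic described above can be sketched in a few lines. The ratings matrix is hypothetical (three judges, four items, numbers invented for illustration), and the function names are placeholders rather than part of any published program.

```python
from statistics import mean

# Hypothetical Angoff ratings: one row per judge, one column per item.
# Each value is the judged proportion of minimally competent persons
# expected to answer that item correctly (invented for illustration).
ratings = [
    [0.80, 0.60, 0.70, 0.90],  # judge 1
    [0.70, 0.50, 0.80, 0.80],  # judge 2
    [0.75, 0.55, 0.75, 0.85],  # judge 3
]


def angoff_item_estimates(ratings):
    """Average the judges' ratings for each item."""
    return [mean(item_ratings) for item_ratings in zip(*ratings)]


def angoff_cutoff(ratings):
    """Average the item estimates to get the cutoff as a proportion correct."""
    return mean(angoff_item_estimates(ratings))


item_estimates = angoff_item_estimates(ratings)
cutoff = angoff_cutoff(ratings)
n_items = len(item_estimates)
print("item estimates:", [round(p, 2) for p in item_estimates])
print(f"cutoff: {cutoff:.1%} of items, or {cutoff * n_items:.1f} of {n_items} items")
```

In practice the raw cutoff is usually rounded to a whole number of items, and how to round (up or down) is itself a judgment call that should be documented.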
18. But Question We Want Answered
Of the people who get the item right (or pass test), what percentage will be minimally competent? (Public - maximize)
Of the people who get the item wrong (or fail test), what percentage will be minimally competent? (Plaintiff - minimize)
The problem is: does Angoff answer the right question, or the question we want answered? There are other questions we may want answered, such as the two above.
Of course, again you can think back to your testing class and ask, don’t the Taylor-Russell tables give me some of the answers? The answer, of course, is yes, if we already know the test cutoff, the performance cutoff, and the validity of the test.
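The Taylor-Russell logic mentioned above can be approximated by simulation when the three inputs are known. The sketch below assumes test scores and job performance are bivariate normal and standardized. The validity of .50 and the cutoffs at the mean are illustrative values, not figures from the talk, and the function name is hypothetical.

```python
import math
import random


def pass_fail_competence(validity, test_cut_z=0.0, perf_cut_z=0.0, n=200_000, seed=0):
    """Simulate standardized test scores and job performance with the given
    correlation (validity). Return the proportion minimally competent among
    those who pass the test and among those who fail it."""
    rng = random.Random(seed)
    residual_sd = math.sqrt(1.0 - validity ** 2)
    passed = passed_competent = failed = failed_competent = 0
    for _ in range(n):
        test_z = rng.gauss(0.0, 1.0)
        perf_z = validity * test_z + residual_sd * rng.gauss(0.0, 1.0)
        competent = perf_z >= perf_cut_z
        if test_z >= test_cut_z:
            passed += 1
            passed_competent += competent
        else:
            failed += 1
            failed_competent += competent
    return passed_competent / passed, failed_competent / failed


among_passers, among_failers = pass_fail_competence(validity=0.50)
print(f"competent among those who pass: {among_passers:.1%}")
print(f"competent among those who fail: {among_failers:.1%}")
```

The first number speaks to the public's question and the second to the plaintiff's. Moving `test_cut_z` up or down shows how the choice of cutoff trades one against the other.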
19. Other Questions
Maximize overall utility from testing (Company)
Is procedure perceived as fair?
Especially by those who fail
20. Research Question?
We know people can rate – What is the probability a minimally competent person will get this question right?
Can they rate?
Of the people who get the item right (or pass test), what percentage will be minimally competent? (Public - maximize)
Of the people who get the item wrong (or fail test), what percentage will be minimally competent? (Plaintiff - minimize)
21. Answer
Research study at University of Akron.
Answer – Yes, we can with approximately equal reliability and accuracy.
Question – How then do we combine these ratings?
22. Bigger Question
Can we combine Angoff with normative data?
23. Program
New Angoff program from Assessment Systems Corporation
Allows you to combine Normative estimates with other data
Normative estimates are interesting in and of themselves in that they represent value judgments
Ignoring normative judgments can get us into trouble
24. The following article appeared in the New York Post.
The BIG 'F' ON EMT ADVANCEMENT EXAM (GINGER ADAMS OTIS, 4.19.2009)
It was a massive medical failure for hundreds of FDNY medics who hoped to get promoted, as a measly eight out of 721 Emergency Medical Service workers passed the most recent lieutenant exam.
The 1.1 percent pass rate for the 2008 test is about 38 percentage points lower than the last time the exam was given, in 2004, when 1,044 medics took the test and 409, or 39 percent, passed, The Post has learned.
The poor results could leave the FDNY short on supervisors.
25. Our Case
Situation:
You are the HR Director and have to set the cutoff on the new Police Sergeant Promotional Test. All promotions are internal and any police officer with more than two years of experience may take the test.
The cutoff will be used to determine who passes or fails the test.
Promotions from those who pass will be made based upon Seniority. There will be no rank ordering.
The test is 100 items. It includes job knowledge questions, such as laws and detective work, and also police leadership and administration items.
You expect that the average score on the test will be around 70%, but that is based upon past experience and you do not have data on expected scores.
There will be at least 200 police officers taking the test.
There are currently no police sergeant openings. However, you expect that during the life of the test, anywhere from 1 to 20 openings may have to be filled, depending upon retirements and other personnel events.
26. Ratings
Think of a minimally competent police sergeant applicant: what should be the minimal score they would receive on this test? (Can be estimated from Angoff.)
Given the type of applicant taking this type of test, what passing rate should be expected; that is, what percentage of the test takers should pass this test?
Given the type of applicant taking this type of test, what is the maximum acceptable failure rate? That is, what is the highest percentage of failures that would be tolerated?
Given the type of applicant taking this type of test, what is the minimum acceptable failure rate? That is, what is the lowest percentage of failures that would be tolerated?
What is the highest cut score on this type of test that you think you could defend? That is, even if everyone who took the test passed, what is the highest or maximum possible cutoff score you think you could possibly set on this type of test?
What is the lowest cut score on this type of test that you think you could defend? That is, even if everyone who took the test failed, what is the lowest or minimum possible cutoff score you think you could possibly set on this type of test?
27. Program Input
29. My Point
We have long recognized that there is a difference between cutoff scores and critical scores, but typically have no easy method for incorporating value judgments and information on likely selection ratios.
30. My Point
Hofstee and Beuk methods allow you to do that in a logical fashion.
Program from Assessment Systems Corporation allows you to do that.
We can collect these judgments and use them to set cutoff scores.
Values of different stakeholders are of interest.
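As a rough illustration of how such judgments can be combined with score data, here is a minimal sketch of the Hofstee compromise as it is commonly described: judges supply the lowest and highest defensible cut scores and the minimum and maximum acceptable failure rates, and the cutoff is taken where the line from (lowest cut, maximum failure rate) to (highest cut, minimum failure rate) meets the observed failure-rate curve. This is not the Assessment Systems Corporation program, the Beuk method is not shown, and the scores and judgment values are invented for the example.

```python
def fail_rate(scores, cut):
    """Proportion of examinees scoring below the cut score."""
    return sum(score < cut for score in scores) / len(scores)


def hofstee_cutoff(scores, k_min, k_max, f_min, f_max):
    """Return the integer cut score between k_min and k_max whose observed
    failure rate lies closest to the Hofstee line, which runs from
    (k_min, f_max) down to (k_max, f_min)."""
    best_cut, best_gap = None, float("inf")
    for cut in range(k_min, k_max + 1):
        line_fail = f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)
        gap = abs(fail_rate(scores, cut) - line_fail)
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical percent-correct scores for ten examinees (invented for illustration).
scores = [55, 60, 62, 65, 68, 70, 72, 75, 80, 90]

# Hypothetical averaged judgments: defensible cut scores between 60 and 80,
# acceptable failure rates between 10% and 60%.
cut = hofstee_cutoff(scores, k_min=60, k_max=80, f_min=0.10, f_max=0.60)
print(f"Hofstee cut score: {cut}, observed failure rate: {fail_rate(scores, cut):.0%}")
```

If the observed failure-rate curve never comes near the line inside the judged range, that is a signal the stakeholders' expectations and the actual score distribution are in conflict, which is worth knowing before the test is scored.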
31. Stakeholders for Safety Forces Test
Mayor, City Manager and/or Safety Director
City Council
Chief of Safety Forces
Civil Service Commission
Non-minority Union
Minority Union
Court Supervision of Process Including Court Orders, Quotas, etc.
Citizen Oversight Committee
Internal Human Resources Staff
Local Media
Federal Agencies
The Public
Add to that
Number of Applicants
Number of Openings
32. Sensitivity Reviews
33. Sensitivity Reviews
A sensitivity review refers to “the process of having a diverse group of professionals review tests to flag material that may unintentionally interact with demographic characteristics of test takers” (Sireci 2004, p. 22).
Begins with ETS in 1980
Very important, but we know very little about them
34. Sensitivity Reviews
Utilizes trained expert judges to “identify and remove invalid aspects of test questions that might hinder people in various groups from performing at levels that allow appropriate inferences about their relevant knowledge and skills” (ETS, 2003, p. 1).
35. Why Are Sensitivity Reviews Important
May be seen as required
Improve our testing
Increased use of CONTEXT
36. What Is Context?
Traditional items overemphasize memory
The definition of murder is:
Solution – Context
Officer Jones is investigating a crime. Mary killed her husband using a knife in the library. She admits she did it but claims she was drunk at the time. Officer Jones should recommend to the prosecutor that Mary be charged with:
37. Advantages of Context
Correlated with test-taking motivation
Organizational attractiveness
Provides realistic job preview
Managers more satisfied
Easier to defend in court
But maybe not before appeals committees
Uses richness of available media
(Shotland et al., 1998)
Shotland et al. (1998) highlight five specific advantages to improving the face validity of a test. First, face validity has been shown to be positively correlated with test-taking motivation, which in turn has been reliably linked to greater test performance. Second, face validity is positively related to organizational attractiveness. A test whose content is transparent and appears job relevant to respondents can act as a signal to the test taker that the employer is not attempting to hide the purpose of the testing instrument in any way. Third, face valid assessments can also serve as realistic job previews in the selection process. A fourth, often overlooked, advantage to using face valid assessment techniques is that managers within the organization typically report greater levels of comfort with and support for using more contextually relevant testing tools (Shotland et al., 1998). Finally, face valid tests are typically less susceptible to legal challenge and are often easier to defend should they be brought to court (Seymour, 1988).
38. Face Validity Paradox
Context leads to
Greater reading
More of a g loading
Potential stereotype threat
Potential cultural bias and insensitivity
Why is the murderer female?
The police officer male?
39. Sensitivity Reviews Seek To
Identify Culturally Inappropriate Content
Identify Cultural Imbalance
Avoid cueing stereotype threat
Identify content (e.g., wording, illustrations, situational contexts) that is sexist, racist, stereotypic, controversial, potentially offensive or upsetting, ethnocentric, or construct irrelevant.
Check that the items included in the test appropriately reflect the diversity of gender, race, and ethnicity represented in our society.
40. Research on Sensitivity Reviews
We know almost nothing about sensitivity reviews
Old days – just find someone in the office from various demographic groups
Today – find culturally diverse SMEs
41. Research
Subjective evaluations of bias do not tend to agree with objective data
Could be a number of reasons for that
42. Current Research
Ann Marie Ryan, Neal Schmitt, James Grand, Juliya Golubovich (Michigan State University)
Candice Young, Dennis Doverspike, Gerald Barrett (University of Akron)
43. What Makes for a Good Reviewer?
Race, Ethnicity, Gender
Racial or Cultural Identity
Test Wiseness
Motivation
Subject Matter Knowledge
Cognitive Ability
Belief About Testing
Training
In What?
44. What Do Reviewers Look For?
45. ETS
Guideline 1. Treat people with respect.
Guideline 2. Minimize the effects of construct-irrelevant knowledge or skills.
Guideline 3. Avoid material that is unnecessarily controversial, inflammatory, offensive, or upsetting.
Guideline 4. Use appropriate terminology to refer to people.
Guideline 5. Avoid stereotypes.
Guideline 6. Represent diversity in depictions of people.
46. Other Variables?
Reading level
SES
47. Other Questions
What makes for effective sensitivity reviews?
Do you have guidelines?
How do you choose reviewers?
Are sensitivity reviews effective?
Should gender/culture/etc. be
Removed?
Balanced?
Overrepresented to combat stereotype threat?
48. Sensitivity Reviews
Need more research
Practical work
Training?
Guidelines and practices?
For non-ETS type organizations
Increased diversification of society will probably make the process even more difficult.
49. Thank You!