
The Long and Winding Road: Researching the Validity of the SAT


Presentation Transcript


  1. The Long and Winding Road: Researching the Validity of the SAT Wayne Camara, Jennifer Kobrin, Krista Mattern, Brian Patterson, & Emily Shaw Ninth Annual Maryland Assessment Conference: The Concept of Validity: Revisions, New Directions & Applications October 9th and 10th, 2008

  2. Outline of Presentation Planning the Journey. Mapping out the research agenda. Targeting the sample of institutions. Making Connections with Institutions. Validity evidence is only as good as the data we collect. Issues and lessons learned in initiating and maintaining contact with institutions to get good-quality data. Detours and Statistical Fun. Cleaning the data. All institutions are not the same. How to aggregate and compare SAT validity coefficients across diverse institutions. “To correct or not to correct?” Restriction of Range. Deciding how to get from Point A to Point B. There are numerous ways to look at the relationship between the SAT, HSGPA, and other variables and college grades, and each may give a different picture. A Bumpy Road. The fairness issue: differential validity and differential prediction.

  3. Planning the Journey • Mapping out the research agenda. Targeting the sample of institutions.

  4. Sampling Plan • The population of colleges: 726 institutions receiving 200 or more SAT score reports in 2005. • The target sample of colleges: a stratified sample of 150 institutions, stratified on control (public/private), region, admission selectivity, and size • Size • Small (750 to 1,999 undergrads) • Medium to Large (2,000 to 7,499) • Large (7,500 to 14,999) • Very large (15,000 or more) • Selectivity • under 50% of applicants admitted • 50 to 75% • over 75% • Control • Public • Private • Region of the Country • Mid-Atlantic • Midwest • New England • South • Southwest • West

  5. Example of our Sampling Plan Guide

  6. Making Connections with Institutions • Validity evidence is only as good as the data we collect. Issues and lessons learned in initiating and maintaining contact with institutions to get good quality data.

  7. Institutions were Recruited Via: • Email invites from CB staff with relationships • Conference exhibit booths: Association for Institutional Research (AIR), National Association of College Admission Counseling (NACAC), CB National Forum and 7 CB Regional Forums, American Educational Research Association (AERA) • Print announcements in CB and AIR publications

  8. Recruitment • Recruitment took place between 2005 and 2007 • In order to participate, institutions had to have at least 250 first-year, first-time students who entered in the Fall of 2006 • Also, at least 75 students with SAT scores were needed to conduct an Admitted Class Evaluation Service (ACES) study. ACES served as the data portal between the institution and the College Board. • Institutions designated a key contact who received a stipend of $2,000 - $2,500 for loading data into ACES (direct costs = $800,000)

  9. ACES • The Admitted Class Evaluation Service (ACES) is a free online service that predicts how admitted students will perform at a college or university generally, and how successful students will be in specific classes. Studies can be requested at http://www.collegeboard.com/highered/apr/aces/aces.html

  10. Required Data for Each Student For matching: SSN, last name, first name, date of birth, gender. Optional, but recommended: college/university-assigned unique ID. Necessary for the validity research: course names for each semester, the number of credits each course is worth, course semester/trimester indication, course grades for each semester, first-year GPA, and whether the student returned to the institution for the Fall of 2007 (submitted before 10/15/07).

  11. Institutional Characteristics

  12. Detours and Statistical Fun • Cleaning the data • A Volkswagen is not a Hummer (or all institutions are not the same)! Necessary to logically aggregate and compare SAT validity coefficients across diverse institutions • “To correct or not to correct?”

  13. Cleaning the Data after ACES Processing Student-level checks to remain in the study: • Student earned enough credit to constitute completion of a full academic year • Student took the SAT after March 2005 (has an SAT-W score) • Student indicated their HSGPA on the SAT Questionnaire (completed when registering for the SAT) • Student had a valid FYGPA Institution-level checks to remain in the study: • Check for institutions with a high proportion of zero FYGPAs (should some be missing or null?) • Grading system makes sense (e.g., an institution submitted a file with no failing grades) • Recoding variables for consistency (e.g., fall semester, fall trimester, or fall quarter = term 1 for placement analyses) Issues: student matching (institution records to CB records on name, date of birth, SSN); loss of students who did not complete the semester (or year) makes persistence difficult to track
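A minimal pandas sketch of how the student-level checks above might be applied. The column names (credits_earned, sat_test_date, hsgpa_questionnaire, fygpa) and the credit threshold are assumptions for illustration; the actual ACES file layout and cutoffs are not specified here.

```python
import pandas as pd

# Hypothetical file and column names; the real ACES layout differs.
students = pd.read_csv("aces_student_file.csv", parse_dates=["sat_test_date"])

MIN_CREDITS = 24  # assumed threshold for "completed a full academic year"

keep = (
    (students["credits_earned"] >= MIN_CREDITS)       # completed a full academic year
    & (students["sat_test_date"] >= "2005-03-01")     # took the new SAT (has a Writing score)
    & students["hsgpa_questionnaire"].notna()         # self-reported HSGPA on the SAT Questionnaire
    & students["fygpa"].notna()                        # valid first-year GPA
)
clean = students[keep].copy()
print(f"Retained {len(clean)} of {len(students)} students")
```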

  14. SAT Validity Study In several instances, individual institutions were contacted to attempt to remedy data issues. After cleaning the data and removing cases with missing data, the final sample included 110 colleges (of the original 114 participating institutions) and 151,316 students (of the original 196,356).

  15. Aggregating and Comparing SAT Validity Coefficients across Diverse Institutions [Figure: boxplots of standardized regression coefficients for institutions in the SAT Validity Study sample]

  16. To account for the variability across institutions, the following procedure was used: Compute separate correlations for each institution; apply a multivariate correction for restriction of range to each set of correlations separately; and compute a set of average correlations, weighted by the size of the institution-specific sample.
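A minimal sketch of the final aggregation step above, assuming the per-institution corrected correlations have already been computed; the numbers are illustrative, not study values.

```python
import numpy as np

def weighted_mean_correlation(corrs, ns):
    """Average institution-level correlations, weighting each by its sample size."""
    corrs, ns = np.asarray(corrs, dtype=float), np.asarray(ns, dtype=float)
    return float(np.sum(ns * corrs) / np.sum(ns))

# Illustrative corrected SAT-FYGPA correlations from three institutions of different sizes.
print(weighted_mean_correlation([0.52, 0.48, 0.57], [1200, 800, 2500]))
```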

  17. So why do we adjust a correlation? If a college admitted all students irrespective of SAT scores, you would find a normal distribution of scores and FYGPA and a higher correlation than you observe after selection. The more selective the college, the less likely it is to admit many students with low SAT scores – and it may have far fewer students with low FYGPAs than in the population.

  18. Restriction of Range The result is that the entering class is restricted (to higher-scoring students), which makes the correlation lower than it is in a representative population. We adjust a raw correlation to account for this restriction and to get an estimate of the true validity of the measure. The same thing occurs anytime we restrict one variable in selection.
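For the single-predictor case, the standard (Thorndike Case 2) adjustment makes the idea concrete. Here r is the correlation observed in the selected (enrolled) group, s is the predictor's standard deviation in that group, and S is its standard deviation in the unrestricted applicant population:

$$ \hat{r} \;=\; \frac{r\,(S/s)}{\sqrt{1 - r^{2} + r^{2}\,(S/s)^{2}}} $$

As a purely illustrative example, with r = .35 in the enrolled class and S/s = 2, the corrected estimate is roughly .60. The Pearson-Lawley procedure described on the following slides generalizes this correction to several predictors at once.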

  19. More on Restriction of Range • Most believe that correcting for RoR is an appropriate technique; however, some people (mistakenly) think you are manipulating the data • Others believe that if the assumptions of the correction cannot be directly verified, corrections should not be applied. • Best practice, if you do correct correlations, is to report both: Standard 1.18 in the Standards for Educational and Psychological Testing (p. 21) states, “When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported.” • Ultimately, the decision to correct should be based on the purpose of the study and the types of interpretations that will be made (compare predictors, explain total variance accounted for in a model, etc.). Reporting both adjusted and unadjusted correlations is normally appropriate in selection.

  20. In the current study: We employed the Pearson-Lawley multivariate correction. The population was defined as the 2006 College-Bound Seniors cohort (any student who graduated from high school in 2006 and took the SAT). We computed the variance-covariance matrix of SAT-M, SAT-CR, SAT-W, and HSGPA scores using students with complete records.
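A minimal numpy sketch of the Pearson-Lawley multivariate correction under the usual assumptions: the SAT sections and HSGPA are the explicitly restricted variables whose covariance matrix is known in the unrestricted population (here, the 2006 College-Bound Seniors cohort), and FYGPA is incidentally restricted. Function and variable names, and the toy values, are illustrative and are not the study's code.

```python
import numpy as np

def pearson_lawley(V_xx, V_xy, V_yy, Sigma_xx):
    """Multivariate (Pearson-Lawley) correction for restriction of range.

    V_xx, V_xy, V_yy : covariance blocks estimated in the restricted (enrolled) sample,
                       x = predictors (SAT-CR, SAT-M, SAT-W, HSGPA), y = criterion (FYGPA)
    Sigma_xx         : predictor covariance matrix in the unrestricted population
    Returns the corrected predictor-criterion correlations.
    """
    W = np.linalg.solve(V_xx, V_xy)                         # V_xx^{-1} V_xy
    Sigma_xy = Sigma_xx @ W                                  # corrected predictor-criterion covariances
    Sigma_yy = V_yy + V_xy.T @ np.linalg.solve(V_xx, (Sigma_xx - V_xx) @ W)  # corrected criterion variance
    return Sigma_xy / np.sqrt(np.outer(np.diag(Sigma_xx), np.diag(Sigma_yy)))

# Toy illustration with two predictors and one criterion (all values made up).
V_xx = np.array([[0.60, 0.30], [0.30, 0.50]])
V_xy = np.array([[0.20], [0.25]])
V_yy = np.array([[0.80]])
Sigma_xx = np.array([[1.00, 0.50], [0.50, 1.00]])
print(pearson_lawley(V_xx, V_xy, V_yy, Sigma_xx))
```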

  21. Descriptive Statistics of the Restricted Sample as compared to the Population

  22. Correlations of Predictors with FYGPA Note. N=151,316. *Correlations corrected for restriction of range, pooled within-institution correlations

  23. Correlations Aggregated by Institutional Characteristics * Correlations corrected for restriction of range, pooled within-institution correlations

  24. Other Possible Corrections that were not Applied in the Current Study Criterion Unreliability (attenuation) – college grades are not perfectly reliable. In order to compare with past results, we did not correct for attenuation; results would have shown even larger correlations. Predictor Unreliability – SAT scores are not perfectly reliable, but they are pretty close (reliability in the .90s for CR & M and the high .80s for W). Since admission decisions are made with imperfect measures, we did not correct for predictor unreliability. Course Difficulty – Students don't all take the same courses, and courses are not all of the same difficulty (see Sackett and Berry, 2008). The placement study will examine whether or not to control for course difficulty.
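For reference, the correction for criterion unreliability that the slide says was deliberately not applied divides the observed correlation by the square root of the criterion's reliability:

$$ \hat{r}_{xy} \;=\; \frac{r_{xy}}{\sqrt{r_{yy}}} $$

With an observed correlation of .53 and an assumed FYGPA reliability of .80 (a purely illustrative figure, not a study value), the disattenuated correlation would be about .59, which is why applying the correction would have shown even larger correlations.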

  25. Deciding How to Get from Point A to Point B • There are numerous ways to look at the relationship between the SAT, HSGPA, and other variables and college grades, and each may give a different picture.

  26. Many ways to Examine and Visually Present the Predictive Validity of the SAT • In addition to bivariate correlations and multiple correlations which indicate the predictive power of an individual measure or multiple measures used in concert, there are other ways to analyze/present the data. • Regression analyses – examination of Beta weights (as opposed to raw regression coefficients) • Including additional predictors* • Incremental validity • Order matters* • Mean level differences by performance bands • Alternative outcomes • Individual course grades rather than FYGPA • Though some of these may be more accessible to laypersons, if used improperly, they may be misleading…

  27. [Figure: scatterplot of FYGPA against SAT-Math with fitted regression line, r = .37] The slope of the regression line shows the expected increase in FYGPA associated with an increase in SAT scores. • More readily understood than a correlation coefficient • When looking at multiple variables, Beta weights answer the question: Which of the independent variables has a greater effect on the dependent variable in multiple regression analysis? • Can look at the effect of additional variables after first taking into account other variables
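A minimal sketch of how standardized beta weights are obtained: z-score the predictors and the criterion, then fit ordinary least squares. The data are simulated, and the column order (standing in for SAT-CR, SAT-M, SAT-W, HSGPA) and coefficient values are purely illustrative.

```python
import numpy as np

def beta_weights(X, y):
    """Standardized regression (beta) weights: OLS on z-scored predictors and criterion."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return betas

# Simulated students: columns stand in for SAT-CR, SAT-M, SAT-W, HSGPA.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.20, 0.25, 0.15, 0.30]) + rng.normal(scale=0.8, size=500)
print(beta_weights(X, y))  # relative sizes show which predictor carries more weight
```

Because the variables are standardized, these weights are directly comparable across predictors, unlike raw regression slopes expressed in score points.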

  28. However, the Results may need to be Interpreted with Caution! “It should be clear now that high multicollinearity may lead not only to serious distortions in the estimations of regression coefficients but also to reversals in their signs. Therefore, the presence of high collinearity poses a serious threat to the interpretation of the regression coefficients as indices of effects” (Pedhazur, 1982, p. 246).

  29. The SAT is Cursed: University of California Study (2001) • Examining UC data, Geiser and Studley (2001) found that SAT II scores and HSGPA together account for 22.2% of the variance in FYGPA in the pooled, 4-year data. • Adding SAT I into the equation improves the prediction by an increment of only 0.1% in the pooled, 4-year data. Support using SAT II scores and HSGPA, not SAT I scores. • However, they fail to mention that similar findings can be seen with the SAT II subject tests. • SAT I scores and HSGPA together account for 20.8% of the variance • Adding SAT II improves the prediction by an increment of 1.5% THE REASON: SAT I and SAT II scores are highly correlated (redundant) – issue of multicollinearity!

  30. Reverse the Curse: New UC Study (2007): • Agranow & Studley (2007) reached different conclusions • Examined the predictive validity of the new SAT for 33,356 students who • Completed the new SAT • Enrolled in a UC campus in the fall of 2006 • Results compared to previous UC study using the old SAT in 2004 • Comparisons based on how well each measure predicted Freshman GPA at UC (based on a model with all three SAT sections and HSGPA entered simultaneously predicting FYGPA) • SAT Critical Reading and Math slightly more predictive in 2006 than in 2004 • SAT Writing slightly more predictive than the other SAT sections • SAT Writing (in 2006) slightly more predictive than Writing Subject Test had been (in 2004) • In 2004 study, High School GPA was slightly more predictive than SAT V+M • In 2006 study, SAT CR+M+W was slightly more predictive than High School GPA

  31. The SAT is a wealth test: University of California Study (2001) • Another conclusion from the Geiser and Studley (2001) study was that after controlling for not only HSGPA and SAT II scores, but also parental education and family income, SAT I scores did not improve the prediction. • Claimed that the predictive power of the SAT I essentially drops to zero when SES is controlled for in a regression analysis. • Conclusion - SAT is a wealth test – even though its incremental validity was already essentially zero before SES variables were added! THE REASON, again: SAT I and SAT II scores are highly correlated (redundant) – issue of multicollinearity! …However, the media had a different take.

  32. Sampling of SAT-Related SES Articles in the Popular Press: “SAT scores tied to income level locally, nationally” (Washington Examiner, August 31, 2006); “Parents' education best SAT predictor” (United Press International, May 4, 2006); “SAT measures money, not minds” (Yale Herald, November 15, 2002)

  33. Disproving the Myths about Testing (often perpetuated by the media) Sackett et al., 2007 Computed the correlation of college grades and SAT scores partialling out SES to determine the degree to which controlling for SES reduced the correlation. Contrary to the assertion of many critics, statistically controlling for SES only slightly reduced the estimated test-grade correlation (0.47 to 0.44) Zwick & Greif Green, 2007 The correlation of SAT scores and SES factors is smaller when computed within high school rather than across high schools. The correlation of HSGPA and SES factors is slightly larger within high schools compared to across high schools. Mattern, Shaw & Williams, 2008 Across high schools, correlations of SAT and SES were about 2.2 times larger than the correlations of high school performance and SES. Within high school and aggregated, the SAT-SES correlations were only 1.4 times larger than the high school performance-SES correlations.
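The calculation behind the Sackett et al. result is a first-order partial correlation, removing SES (z) from the SAT-FYGPA relationship (x, y):

$$ r_{xy\cdot z} \;=\; \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^{2})(1 - r_{yz}^{2})}} $$

As a purely illustrative check (the SES correlations below are made up, not Sackett et al.'s values), starting from r_xy = .47 with r_xz = .25 and r_yz = .20 gives r_xy·z of about .44, a reduction of the size reported on the slide.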

  34. Whoever Sits in the Front Seat Determines the Result - Incremental Validity Example Note. Data from 2008 SAT Validity Study. Correlations corrected for restriction of range, pooled within-institution correlations Here is what the media might say: “The new SAT adds ONLY 0.08 over HSGPA - it is worthless!” “The new writing section adds ONLY 0.02 over SAT-CR & M – It’s not worth the extra time and cost!”

  35. Switching who Sits in the Front Seat – Incremental Validity Example Note. Data from 2008 SAT Validity Study. Correlations corrected for restriction of range, pooled within-institution correlations Here is what the media might say: “The HSGPA adds ONLY 0.09 over new SAT - it is worthless!” “The SAT-CR & M add ONLY 0.02 over new writing section – why didn’t we always have a writing section!?”
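A minimal sketch of the order-of-entry arithmetic behind the two slides above, working directly from a correlation matrix. The matrix is made up for illustration, so the printed increments will not exactly match the study's values.

```python
import numpy as np

def multiple_R(R, predictors, criterion):
    """Multiple correlation of the criterion with a set of predictors, from a correlation matrix R."""
    Rxx = R[np.ix_(predictors, predictors)]
    rxy = R[np.ix_(predictors, [criterion])]
    return float(np.sqrt((rxy.T @ np.linalg.solve(Rxx, rxy))[0, 0]))

# Illustrative correlations among [SAT composite, HSGPA, FYGPA] (made-up values).
R = np.array([[1.00, 0.55, 0.53],
              [0.55, 1.00, 0.54],
              [0.53, 0.54, 1.00]])
SAT, HSGPA, FYGPA = 0, 1, 2

R_both = multiple_R(R, [SAT, HSGPA], FYGPA)
print(f"SAT added after HSGPA:  delta R = {R_both - multiple_R(R, [HSGPA], FYGPA):.2f}")
print(f"HSGPA added after SAT:  delta R = {R_both - multiple_R(R, [SAT], FYGPA):.2f}")
```

Whichever predictor "sits in the front seat" absorbs the shared variance, so the increment attributed to the predictor entered second is always the smaller number.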

  36. Straightforward Approach: Increment of SAT controlling for HSGPA and Academic Intensity Bridgeman, Pollack, & Burton (2004)

  37. Another way to think of a correlation of 0.53: Mean FYGPA by SAT Score Band [Figure: bar chart of mean FYGPA by SAT score band]
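A minimal pandas sketch of how a mean-FYGPA-by-score-band summary like the one on this slide can be produced; the band cut points and the tiny data frame are illustrative only.

```python
import pandas as pd

# Tiny illustrative frame; real data would have one row per student.
df = pd.DataFrame({"sat": [890, 1210, 1530, 1750, 2050, 2310],
                   "fygpa": [2.1, 2.4, 2.7, 2.9, 3.2, 3.5]})

# Bin SAT composite scores (600-2400 scale) into bands and average FYGPA within each band.
bands = pd.cut(df["sat"], bins=[600, 1200, 1500, 1800, 2100, 2400])
print(df.groupby(bands, observed=True)["fygpa"].agg(["mean", "count"]))
```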

  38. Using Course Grades as the Criterion rather than FYGPA • FYGPA is not always a reliable measure, and it is difficult to compare across different college courses and instructors. • Sackett and Berry (2008) examined SAT validity at the individual course level. • Correlation of SAT and a course grade composite = 0.58, compared to 0.51 for FYGPA. • SAT validity is reduced by 19% due to “noise” added as a result of differences in course choice. • HSGPA is not a stronger predictor than the SAT when a composite of individual course grades is used as the criterion measure.

  39. A Bumpy Road • The fairness issue: Standardized Differences, Differential Validity and Differential Prediction

  40. Correlation of SAT scores & HSGPA w/ FYGPA by Race/Ethnicity Previous research has shown tests and grades are slightly less effective in predicting performance of African American students.

  41. Average Overprediction (-) and Underprediction (+) of FYGPA for SAT Scores and HSGPA by Ethnicity Also consistent with past research – the actual FYGPA of underrepresented minorities averages about .1 to .2 points below the GPA predicted from SAT scores. HS grades consistently overpredict grades at a higher rate than tests. Over- and underprediction are consistently reduced when both are used.
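A minimal sketch of how over/underprediction figures like those on this slide are typically computed: fit a prediction equation on the total group, then average the signed residuals (actual minus predicted FYGPA) within each subgroup. The column names and values below are illustrative, not the study's data.

```python
import pandas as pd

def mean_prediction_error(df, actual="fygpa", predicted="predicted_fygpa", group="ethnicity"):
    """Mean signed error by group: negative = overprediction, positive = underprediction."""
    residual = df[actual] - df[predicted]
    return residual.groupby(df[group]).mean()

# Toy frame; predicted_fygpa would come from a regression fit on the total group.
df = pd.DataFrame({
    "ethnicity": ["A", "A", "B", "B"],
    "fygpa": [2.8, 3.1, 2.5, 2.9],
    "predicted_fygpa": [2.9, 3.0, 2.7, 3.0],
})
print(mean_prediction_error(df))
```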

  42. Validity research, in conclusion: • You get out what you put in – quality of data, data matching, institutional collaboration, the criterion problem • It is always easier to argue against something than to propose an alternative (tests vs. grades, tests vs. nothing) • Selection – Using a predictor in selection (SAT, GRE, HS grades) will result in lower validity in proportion to the selectivity applied. If you then compare that validity to a ‘new predictor’ not employed in selection, it is not surprising to see higher correlations that will NOT hold up as operational validities. • For more information on CB research: http://collegeboard.com/research

  43. Appendix Additional Materials Not Presented at Conference

  44. Related Roadblocks • Addressing and disproving criticisms. An equal amount of effort is spent collecting evidence on what the SAT does not do as is spent collecting evidence for what it does do. • Besides the criticisms described earlier (i.e., the SAT is a “wealth test” and provides no information over HSGPA), other criticisms, as well as evidence to the contrary, are presented.

  45. “The SAT is to criticism as a halfback is to a football -- always on the receiving end.” Gose & Selingo (2001). The SAT's Greatest Test: Social, legal, and demographic forces threaten to dethrone the most widely used college-entrance exam. Chronicle of Higher Education website.

  46. SAT, at 3 Hours 45 Minutes, Draws Criticism Over Its Length(New York Times, December 16, 2005) • College Board Study: Investigating the Effect of New SAT Test Length on the Performance of Regular SAT Examinees (Wang, 2006) • Examined the average % of items answered correctly and the average number of items omitted for different sections of the test. • The average % items correct was consistent throughout the entire test, and the results were similar for gender, racial/ethnic, and language groups, and for different levels of ability as measured by total SAT score. • On average, students did not omit a larger number of items on later sections of the test. • Conclusion: any fatigue that students may have felt did not impair their performance.

  47. SAT Essay Test Rewards Length and Ignores Errors (New York Times, May 4, 2005) • College Board Study: It is What You Say and (Sometimes) How You Say It: The Association Between Prompt Characteristics, Response Features, and SAT Essay Scores (Kobrin, Deng, & Shaw, submitted for publication) • A sample of SAT essay responses was coded on a variety of features regarding their length and content, and essay prompts were coded on their linguistic complexity and other characteristics. • The correlation of number of words and essay score was 0.62, which is smaller than that reported in the media.

  48. SAT Coaching Raises Scores, Report Says (New York Times, December 18, 1991) • College Board sponsored study: Effects of Short-Term Coaching on Standardized Writing Tests (Hardison & Sackett, 2006) • Does coaching increase scores on the SAT essay? If so, does that coaching increase scores only on the specific essay, or does it also increase the test-taker’s actual writing ability that the test is intended to measure? • These results suggest that SAT essays may be susceptible to coaching, but score inflation may reflect at least some improvement in overall writing ability.

  49. A Bumpy Road Continued: Fairness Issues

  50. Previous findings… Standardized differences Males outperform females on Math and Critical Reading. African-American and Hispanic students scored significantly lower than the total group on all academic measures Differential Validity SAT and HSGPA are more predictive of FYGPA for females and white students (larger correlations) Differential Prediction SAT and HSGPA tend to underpredict FYGPA for females; however, the magnitude is larger for the SAT SAT and HSGPA tend to overpredict FYGPA for minority students; however, the magnitude is larger for HSGPA
