
Overview of Main Survey Data Analysis and Scaling



  1. Overview of Main Survey Data Analysis and Scaling National Research Coordinators Meeting Madrid, February 2010

  2. Content of presentation • Scaling and analysis of test items • Scaling and analysis of questionnaire items • Data analysis for the reporting of ICCS data

  3. Steps in analysis • Preliminary analysis of first data sets received • Review at JMC data analysis meeting in Hamburg in July 2009 • Analysis of clean and uncleaned data sets from almost all participating countries • Review at PAC meeting in Tallinn (Oct 2009) and JMC data analysis meeting in Hamburg in early December 2009 • Final scaling and analysis with clean data from all 38 countries

  4. Test item analysis • Review of missing data • Analysis of item dimensionality • Review of item statistics (international) • Analysis of differential item functioning by gender • Analysis of item-by-country interaction • Measurement equivalence • Item adjudication

  5. Scaling model • Rasch one-parameter model: P_i(θ_n) = exp(θ_n − δ_i) / (1 + exp(θ_n − δ_i)) • P_i(θ_n) is the probability for person n to score 1 on item i • θ_n is the estimated ability of person n and δ_i is the estimated difficulty of item i
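
As an illustration of the Rasch model above, the response probability can be computed directly from the ability and difficulty parameters; this is a minimal sketch (the function name is illustrative, not part of the ICCS analysis software):

    import math

    def rasch_probability(theta, delta):
        """Probability of scoring 1 on an item under the Rasch one-parameter model,
        given person ability theta and item difficulty delta (both in logits)."""
        return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

    # A person of average ability answering an item of average difficulty: p = 0.5
    print(rasch_probability(0.0, 0.0))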

  6. Probability curves

  7. Partial credit model • For open-ended items (and questionnaire items) with more than two categories the Partial Credit Model was used: P_x(θ_n) = exp[Σ_{k=1..x}(θ_n − δ_i + τ_{ik})] / Σ_{h=0..m_i} exp[Σ_{k=1..h}(θ_n − δ_i + τ_{ik})], where the empty sum for h = 0 is taken as 0 • Here, τ_{ij} denotes an additional step parameter for category j of item i
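
A minimal sketch of the category probabilities under the Partial Credit Model as written above, assuming the parameterisation with an overall item difficulty delta and step parameters tau (names are illustrative):

    import math

    def pcm_probabilities(theta, delta, taus):
        """Category probabilities (categories 0..m) under the Partial Credit Model
        for person ability theta, item difficulty delta and step parameters taus."""
        cumulative, sums = 0.0, [0.0]          # empty sum for category 0
        for tau in taus:
            cumulative += theta - delta + tau
            sums.append(cumulative)
        numerators = [math.exp(s) for s in sums]
        denominator = sum(numerators)
        return [n / denominator for n in numerators]

    # A three-category item: probabilities of scoring 0, 1 and 2
    print(pcm_probabilities(0.5, 0.0, [-0.4, 0.4]))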

  8. Threshold curves

  9. Response probabilities

  10. Missing data issues • Different categories of missing data • Omitted responses • Somewhat higher percentages for open response items • Invalid responses • Generally very low percentages • Not reached responses • Omitted items at end of test booklets • Generally low, in a few countries more considerable

  11. Not reached % by region

  12. Test characteristics • Test items were generally a little easier than the average student abilities (pooled across countries) • Test reliability was 0.84 (similar to CIVED assessment) • Very high latent correlations between possible sub-dimensions • Decision not to pursue sub-scales

  13. Mapping of test items to abilities

  14. Review of item scaling properties • Most items had excellent scaling properties • Weighted mean square item fit • Item-total correlation • Item characteristic curves • Only one test item (CI2HRM2) was omitted from scaling

  15. Item statistics

  16. Item characteristic curves

  17. Scoring reliabilities - 1 • Open-ended items were scored according to international scoring guidelines • Double-scoring of sub-samples • On average, percentages of scorer agreement ranged between 84 and 92 per cent across participating countries

  18. Scoring reliabilities - 2 • Only items accepted where scorer agreement was 70% or more • Data for items where this criterion was not met were not included in scaling • In two countries open-ended items were consistently easier than other items • Omitted from scaling and database
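
A sketch of how percentage scorer agreement for a double-scored open-ended item might be checked against the 70 per cent criterion (hypothetical helper, not the operational procedure):

    def scorer_agreement(scores_a, scores_b):
        """Percentage of double-scored responses given the same code by both scorers."""
        matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
        return 100.0 * matches / len(scores_a)

    # Keep the item in scaling only if agreement meets the 70% criterion
    agreement = scorer_agreement([1, 0, 2, 1, 1], [1, 0, 1, 1, 1])
    item_accepted = agreement >= 70.0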

  19. Gender DIF • DIF estimates reflect the differences between item difficulties for males and females of equal ability • This may cause bias in favour of one group • Generally, only a few items with gender DIF were found

  20. Cross-national measurement equivalence • Occurrence of item-by-country interaction • Items relatively much harder in some countries but much easier in others • In ICCS, national item calibrations were compared with those for the international calibration sample • Standard errors were adjusted for sample design effects and multiple comparisons
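
A minimal sketch of the kind of comparison described above: a national item difficulty is compared with the international calibration and flagged when the discrepancy exceeds a critical multiple of the adjusted standard error (the function name and critical value are illustrative, not the operational ICCS rule):

    def shows_item_by_country_interaction(national_delta, international_delta,
                                          adjusted_se, critical_value=3.0):
        """Flag an item when its national difficulty differs from the international
        calibration by more than critical_value adjusted standard errors
        (SE already adjusted for design effects and multiple comparisons)."""
        return abs(national_delta - international_delta) > critical_value * adjusted_se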

  21. Example for CI2HRM2

  22. Item-by-country interaction • Generally, items tended to behave in a similar way • A number of items showed parameter variance • Sometimes due to translation errors • Often due to other factors (national context, curricula) • Occurrence of some parameter variation across countries • Similar results to those found in other cross-national studies

  23. Item adjudication • Based on results from scaling analysis (item statistics, item curves, item-by-country interaction etc.) • International item adjudication • Omission of CI2HRM2 from scaling • National item adjudication • Re-verification for items with larger discrepancies in item difficulty • Omission of items with translation or scoring issues from national scaling

  24. Calibration of items • Based on international calibration sample with 500 randomly selected students from each of the 36 participating countries that met sampling requirements • ACER ConQuest was used for estimation • Booklet effects adjusted by including booklet as a facet in the scaling model
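
A sketch of how an international calibration sample of this kind could be drawn (illustrative only; the actual selection followed the ICCS sampling procedures):

    import random

    def draw_calibration_sample(students_by_country, n_per_country=500, seed=2010):
        """Randomly select up to n_per_country students from each country for the
        international item calibration."""
        rng = random.Random(seed)
        sample = {}
        for country, students in students_by_country.items():
            sample[country] = rng.sample(students, min(n_per_country, len(students)))
        return sample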

  25. Scaling methodology • Plausible values were generated as student ability estimates • More information at workshop! • Dummy indicators for classroom and all student level variables (international and regional) were included in the conditioning model • Scale scores set to international metric with mean of 500 and SD of 100 for equally weighted countries
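
A sketch of setting plausible values to an international metric with a mean of 500 and a standard deviation of 100 across equally weighted countries (simplified: student weights within countries are ignored here):

    def to_international_metric(scores_by_country, target_mean=500.0, target_sd=100.0):
        """Linearly transform scores so that the pooled distribution of equally
        weighted countries has the target mean and standard deviation."""
        # Equally weighted pooled mean: the average of the country means
        country_means = [sum(v) / len(v) for v in scores_by_country.values()]
        pooled_mean = sum(country_means) / len(country_means)
        # Equally weighted pooled variance: average mean squared deviation per country
        country_msd = [sum((x - pooled_mean) ** 2 for x in v) / len(v)
                       for v in scores_by_country.values()]
        pooled_sd = (sum(country_msd) / len(country_msd)) ** 0.5
        return {country: [target_mean + target_sd * (x - pooled_mean) / pooled_sd
                          for x in v]
                for country, v in scores_by_country.items()}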

  26. Estimation of changes in cognitive knowledge - 1 • 17 test items from CIVED included as intact cluster • 17 countries with comparable data • Three countries with grade 9 in CIVED and additional grade 9 samples in ICCS • Small number of items in some countries had to be discarded due to translation errors or differences between ICCS and CIVED

  27. Estimation of changes in cognitive knowledge - 2 • Comparison of item parameters showed high similarity (correlation of 0.95) • Slight positioning effect due to different test designs • CIVED: One booklet • ICCS: CIVED link cluster in each of the three positions • CIVED items at beginning slightly easier, at end slightly harder than in ICCS

  28. Estimation of changes in cognitive knowledge - 3

  29. Estimation of changes in cognitive knowledge - 4 • Framework broadened since CIVED • Re-scaling CIVED data to equate with ICCS not appropriate • Selection of CIVED items not representative for overall CIVED test • Equating link items with CIVED scale (or sub-scale) also not appropriate • Solution: Establish new comparison scale based only on 17 link items

  30. Estimation of changes in cognitive knowledge - 5 • Concurrent calibration of item parameters based on 34 calibration samples from 17 countries (CIVED and ICCS) • Establishing a metric with a mean of 500 and SD of 100 for the equally weighted 17 CIVED countries • For results in tables, weighted likelihood estimates were used • Usually unbiased for country averages

  31. Questionnaire item analysis • Missing data issues • Item dimensionality and scaling review • Item/scale adjudication • Scaling procedures

  32. Missing data - 1 • On average, about 3 per cent of students have missing scale scores • Only two countries have higher percentages (18 and 12 per cent) • For the teacher survey data, relatively low missing percentages were found (about 2 per cent) • Very low percentages of missing data in the school questionnaire

  33. Missing data - 2 • Concerns about missing data for socio-economic indicators • Highest parental occupation: 5% • Highest parental education: 3% • Books at home: 1% • However, in a few countries higher percentages of missing data were found (up to 15% for parental education)

  34. Analysis of item dimensionality • Exploratory and confirmatory factor analyses showed generally very similar results to those from the field trial • These analyses will be described in detail in the ICCS technical report

  35. Scaling analysis • Scale reliabilities (Cronbach’s alpha) • Values over 0.7 indicate satisfactory internal consistency • Item-total correlations • Useful for detecting possible translation errors • Scaling with the IRT Partial Credit Model • Item fit • Category characteristic curves
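
A minimal sketch of the Cronbach's alpha computation used for the reliability review (complete cases only; item_scores is a list of score vectors, one per item):

    def cronbach_alpha(item_scores):
        """Cronbach's alpha for k items scored for the same n respondents."""
        k = len(item_scores)
        n = len(item_scores[0])

        def variance(values):
            mean = sum(values) / len(values)
            return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

        sum_item_variances = sum(variance(item) for item in item_scores)
        totals = [sum(item[i] for item in item_scores) for i in range(n)]
        return k / (k - 1) * (1.0 - sum_item_variances / variance(totals))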

  36. Item and scale adjudication • Only three scales with median scale reliabilities below 0.7 • Democratic value beliefs, civic participation in community and at school • Adjudication for student, teacher, school and each regional questionnaire • Some items were removed from scale • In some cases, single-item reporting

  37. Scaling procedures - 1 • IRT scaling with Partial Credit Model • So-called weighted likelihood estimates as scale scores • International metric with mean of 50 and a standard deviation of 10

  38. Scaling procedures - 2 • Item parameter calibration with ACER ConQuest • Calibration samples: • 500 students per country • 250 teachers per country • All school data with equal weights for each country • Only data from countries that met sampling requirements (categories 1 or 2) included in calibration

  39. Questionnaire scales • Advantages of IRT scales • Inclusion of students with at least two item responses per scale • Possibility to describe the scale: from the IRT Partial Credit Model it is possible to map scale scores to expected item responses • Item maps will be provided in an appendix to the international report

  40. Example of item map

  41. Data analysis for reporting • Estimation of sampling variance • Estimation of measurement variance • Reporting of differences

  42. Estimation of sampling variance • Data from cluster samples are not simple random samples • Standard formula for estimating sampling error not appropriate • Jackknife repeated replication technique used for ICCS • IDB Analyser, WESVAR or SPSS/SAS macros may be used for applying this methodology
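
A sketch of the jackknife repeated replication variance estimate, assuming the paired-jackknife variant used in IEA studies, where the sampling variance is the sum of squared deviations of the replicate estimates from the full-sample estimate:

    def jrr_standard_error(full_sample_estimate, replicate_estimates):
        """Jackknife repeated replication (JRR) standard error of a statistic."""
        sampling_variance = sum((r - full_sample_estimate) ** 2
                                for r in replicate_estimates)
        return sampling_variance ** 0.5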

  43. Estimation of measurement variance • Using plausible values allows estimating the measurement error • The variation between the five PVs can be used for estimation • IDB Analyser, WESVAR or SPSS macros (ACER replicates module) include features to do this • More information will be provided at the training workshop on Wednesday
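
A sketch of combining sampling and measurement variance over the five plausible values (the standard combination rule; names are illustrative):

    def total_standard_error(pv_estimates, pv_sampling_variances):
        """Standard error combining sampling variance (averaged over plausible values)
        and measurement variance (variance between plausible-value estimates)."""
        m = len(pv_estimates)
        mean_estimate = sum(pv_estimates) / m
        sampling_variance = sum(pv_sampling_variances) / m
        measurement_variance = sum((e - mean_estimate) ** 2
                                   for e in pv_estimates) / (m - 1)
        total_variance = sampling_variance + (1.0 + 1.0 / m) * measurement_variance
        return total_variance ** 0.5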

  44. Reporting of differences - 1 • The following types of significance tests will be reported: • For differences in population estimates between countries • For differences between a country and the international average • For differences in population estimates between subgroups within countries • For differences between population estimates in ICCS and in CIVED (trend estimation)

  45. Reporting of differences - 2 • Adjustment for multiple comparisons with the Dunn-Bonferroni method • Increasing the critical value (at p < .05) from 1.96 to 3.189 • SE for differences between samples (see the sketch below) • Estimation of SE for sub-group differences with JRR
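
The standard error of a difference between two independent samples is obtained by combining the two standard errors; a minimal sketch of the rule referred to on the slide (the Dunn-Bonferroni-adjusted critical value of 3.189 comes from the slide):

    def se_of_difference(se_a, se_b):
        """Standard error of the difference between estimates from two independent samples."""
        return (se_a ** 2 + se_b ** 2) ** 0.5

    # A between-country difference is flagged as significant when
    # abs(estimate_a - estimate_b) > 3.189 * se_of_difference(se_a, se_b)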

  46. Reporting of differences - 3 • For the SE of trend differences it is important to take the equating error into account • The SE for differences between CIVED and ICCS can be computed as shown in the sketch below • The equating error in the international metric is 3.31
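
A sketch of the trend standard error, assuming the equating error is added in quadrature to the ICCS and CIVED standard errors (the value 3.31 is taken from the slide; the exact formula is given in the ICCS technical documentation):

    EQUATING_ERROR = 3.31  # equating error in the international metric

    def se_of_trend_difference(se_iccs, se_cived, equating_error=EQUATING_ERROR):
        """Standard error of an ICCS-CIVED difference, taking the equating error into account."""
        return (se_iccs ** 2 + se_cived ** 2 + equating_error ** 2) ** 0.5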

  47. Multivariate analysis • Multiple regression models were used for the tables in draft Chapter 7 • Bivariate regression • Multiple regression • Multi-level models were used for the analysis in draft Chapter 8 • Students nested within classrooms • Classrooms mostly equivalent to schools

  48. Questions or comments?
