RICCARDO PECCEI Department of Management King’s College London Email: riccardo.peccei@kcl.ac.uk Tel: 0207-848 4094

**Please note that this is work in progress, results are preliminary**

Random and Non-Random Measurement Error in HRM Research: Measuring and Explaining Differences in Management-Employee Representative Responses in WERS2004

The author acknowledges the Department of Trade and Industry, the Economic and Social Research Council, the Advisory, Conciliation and Arbitration Service and the Policy Studies Institute as the originators of the 2004 Workplace Employment Relations Survey data, and the Data Archive at the University of Essex as the distributor of the data. None of these organisations bears any responsibility for the author’s analysis and interpretation of the data.

Rationale of the Study
1. Large-scale surveys like WERS2004 are of unquestionable value.
• One potential problem with these surveys, however, is that they typically rely on the responses of a limited number of key informants in each unit to describe organisational properties of interest (e.g. what HR practices are in place).
• But organisational members’ (e.g. managers’ and employee representatives’) perceptions of organisational properties are often subject to considerable error and distortion. It is not unusual for different respondents to disagree with each other and provide substantially different answers to the same survey items (Payne & Pugh, 1976; Starbuck & Mezias, 1996, 2000).
• Problems of measurement error due to low interrater agreement and reliability of this kind are well documented in the areas of OB and HRM (Gerhart et al., Ostroff & Schmidt, 1993).
• They have also been noted in relation to previous WERS surveys in terms of managers’ and employee representatives’ responses to a range of specific items (Cully & Marginson, 1995; Peccei & Benkhoff, 2001).

Rationale of the Study
4. These problems, in turn, raise serious questions about the reliability and validity of the measures used in analysis and, therefore, of study results more generally. As a result, they make cumulative research progress much more difficult to achieve.
5. These problems are particularly acute in research based on large-scale surveys using a single-respondent design format, i.e. where information about organisational properties of interest (e.g. HR practices) comes from a single key (commonly management) respondent.
• This is the case, for example, with most WERS-based HRM studies (Ramsay et al., 2000).
• It is also true of much of the American survey-based strategic HRM research looking at the link between human resource (HR) practices and firm performance.
• Much of this research uses a single senior management respondent to rate and describe the HR practices of an entire organisation (Huselid, 1995; Delery and Doty, 1996; Huselid et al., 1997).

Rationale of the Study
6. Despite the continued reliance of much HRM research on a single-respondent design format, there is substantial evidence to suggest that single-respondent measures of HR practices have low reliability, i.e. they contain large amounts of rater-related measurement error (Gerhart et al., 2000; Wright et al., 2001). “Single raters provide unreliable measures of HR practices” (Wright et al., 2001: 899).
• In the HRM literature, measurement error due to raters, assessed in terms of interrater reliability, is commonly assumed to be random.
• Some researchers (e.g. Wright et al., 2001) have considered whether error due to raters in the measurement of HR practices may be patterned, rather than purely random (whether interrater reliability may be lower, for example, in large than in small organisations).
• But little systematic research has been done in this area.
• As a result, little is known about the factors that might influence interrater reliability and the extent to which the measurement error that is commonly associated with single-respondent measures of HR practices is random or non-random in nature.

Rationale of the Study
7. This is a key issue since whether the error involved is random or systematic/patterned can make a major difference to the interpretation of observed links between HR practices and various outcomes of interest. In particular, it affects the corrections for attenuation due to measurement error that might be applied to observed regression and correlation coefficients for specific HR practices-outcomes relationships.

Aims of the Study
• The key aim is to contribute to this area of inquiry through a detailed analysis and comparison of management and employee representative responses to a set of 45 matched primary (i.e. non-filtered) questions in a sample of 459 WERS2004 establishments.
• The specific aims are:
  • To identify and describe the extent of interrater reliability, as well as agreement and bias, between management respondents (MR) and employee representative respondents (RR) across the set of matched WERS2004 survey items.
  • To develop a general model of interrater reliability/agreement that is then used to explore the extent to which observed measurement error in the WERS2004 data is random or patterned.
  • To consider the analysis and survey design implications of the results, both in general terms and in terms of WERS2004 in particular.

Conceptualisation of Interrater Reliability, Agreement and Bias
1. There is an extensive statistical, psychometric and psychological literature dealing with issues of validity and reliability of measurement in the social sciences, including methods for the analysis of interrater agreement and reliability.
• Following Agresti (1992), Uebersax (1992) and Bliese (2000), three main components or dimensions of interrater reliability/agreement can be distinguished:
(a) Interrater Reliability (ICC(1) and ICC(2))
• Interrater reliability refers to the degree of association of raters’ ratings, or the relative consistency of responses among raters (i.e. the extent to which raters’ ratings correlate with each other) (Bliese, 2000).
• When each target (e.g. organisational unit) is rated on a particular item/property by two or more raters (e.g. an MR and an RR), interrater reliability is most commonly assessed by means of two major forms of the intraclass correlation coefficient (ICC): the ICC(1) and the ICC(2) (or what Shrout & Fleiss (1979) refer to as the ICC(1,1) and the ICC(1,k) respectively, where k refers to the number of raters).

Conceptualisation of Interrater Reliability, Agreement and Bias
(i) The ICC(1) can be interpreted as a measure of the reliability associated with a single assessment of the group mean.
• ICC(1) is an aggregate level measure based on the ratio of between to within mean squares across sets (pairs) of raters, i.e. it is a function of the extent of both within and between group variation in ratings.
• ICC(1) normally ranges between 0 and 1 and corresponds quite closely to the Pearson correlation coefficient between pairs of raters (0 = low interrater reliability / no association between pairs of raters, 1 = high interrater reliability / association between pairs of raters).
(ii) The ICC(2), on the other hand, provides an estimate of the reliability of the group means, i.e. of the reliability of an aggregate group level measure (Bartko, 1976; Bliese, 2000).
• ICC(2) is related to the ICC(1) but is mainly a function of within group variation in ratings and group size. It increases as a function of group size.
• ICC(2) also varies from 0 to 1 (0 = low reliability of group mean, 1 = high reliability of group mean).
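As an illustration of the two coefficients just described, the following is a minimal sketch of the standard one-way ANOVA estimators (the ICC(1,1) and ICC(1,k) of Shrout & Fleiss, 1979). It is not taken from the study's own analysis code: the function name `icc_one_way` and the toy ratings are hypothetical, and the sketch assumes every target is rated by the same number k of raters (k = 2 for an MR-RR pair).

```python
def icc_one_way(ratings):
    """Return (ICC(1), ICC(2)) from a one-way ANOVA decomposition.

    ratings: list of targets, each a list of k ratings (same k everywhere).
    Assumes at least two targets and some between-target variation.
    """
    n = len(ratings)          # number of targets (e.g. establishments)
    k = len(ratings[0])       # raters per target (e.g. 2: one MR, one RR)
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    target_means = [sum(r) / k for r in ratings]

    # Between-target and within-target mean squares
    ms_between = k * sum((m - grand_mean) ** 2 for m in target_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, target_means) for x in r
    ) / (n * (k - 1))

    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2


# Hypothetical (MR, RR) answers to one yes/no item in six establishments
toy_pairs = [[1, 1], [0, 0], [1, 0], [0, 0], [1, 1], [0, 1]]
icc1, icc2 = icc_one_way(toy_pairs)
print(round(icc1, 3), round(icc2, 3))
```

In this formulation the two coefficients are linked by the Spearman-Brown relation ICC(2) = k·ICC(1) / (1 + (k − 1)·ICC(1)), which is why ICC(2) rises with the number of raters per target.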

Conceptualisation of Interrater Reliability, Agreement and Bias
(b) Interrater Agreement/Consensus
• Interrater agreement proper refers to the extent to which raters make essentially the same ratings.
• The most commonly used measure of within-group interrater agreement in the OB and HRM literature is James et al.’s (1984, 1993) rwg index.
• Here I use the rwg* index (Lindell et al. 1999), which is a variation of the rwg. For dichotomous measures, the rwg* is equivalent to the rwg, as well as to other unit level agreement indexes that have been proposed in the literature (e.g. Burke et al.’s (1999) Average Deviation index).
• The rwg*, like the rwg, is an organisationally specific or unit level coefficient that essentially measures the degree of within-group variance in (or absolute level of disagreement between) raters’ scores. Separate rwg* estimates can be calculated for each pair of raters on each separate target/item that is being rated.
• The rwg* ranges from 0 to 1 (0 = high variance / no consensus in ratings, 1 = low variance / high consensus in ratings).
• At the aggregate level of analysis I also use the average raw MR and RR agreement scores across the set of 459 establishments, as well as Yule’s Q.
• For dichotomous items, Yule’s Q provides an aggregate omnibus measure of the degree of MR and RR agreement adjusted for chance. Yule’s Q can in principle range from -1 to 1; values near 0 indicate low agreement (no better than chance) and values near 1 indicate high agreement.
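The two aggregate agreement measures named above can be sketched for a single dichotomous item as follows. This is an illustrative sketch only, with hypothetical function and variable names and made-up data. It uses the textbook definition of Yule's Q from the 2×2 table of MR by RR answers, Q = (ad − bc) / (ad + bc), where a and d count the establishments in which both raters say yes or both say no, and b and c count the two kinds of disagreement.

```python
def agreement_and_yules_q(mr, rr):
    """Raw proportion of agreement and Yule's Q for one yes/no item.

    mr, rr: equal-length lists of 0/1 answers, one entry per establishment.
    Returns (proportion_agreeing, q); q is None when ad + bc == 0.
    """
    pairs = list(zip(mr, rr))
    a = sum(1 for m, r in pairs if m == 1 and r == 1)  # both say yes
    b = sum(1 for m, r in pairs if m == 1 and r == 0)  # MR yes, RR no
    c = sum(1 for m, r in pairs if m == 0 and r == 1)  # MR no, RR yes
    d = sum(1 for m, r in pairs if m == 0 and r == 0)  # both say no

    proportion_agreeing = (a + d) / len(pairs)
    denominator = a * d + b * c
    q = (a * d - b * c) / denominator if denominator else None
    return proportion_agreeing, q


# Hypothetical answers from eight establishments
mr_answers = [1, 1, 1, 0, 0, 0, 1, 0]
rr_answers = [1, 1, 0, 0, 0, 1, 1, 0]
print(agreement_and_yules_q(mr_answers, rr_answers))  # (0.75, 0.8)
```

Raw percentage agreement can be high simply because an item is almost always answered the same way, which is why a chance-adjusted measure such as Q is reported alongside it.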

Conceptualisation of Interrater Reliability, Agreement and Bias
(c) Interrater Bias
• Rater bias refers to the tendency of given raters to make ratings that are generally higher or lower than those of other raters (Uebersax, 1988).
• This component, therefore, refers not to the absolute level of disagreement between raters (i.e. the absolute difference or variance in raters’ scores) as such, but to the direction of disagreement between them.
• The simplest measure of bias is just the difference in scores between two raters (positive or negative bias).
• Bias measures (i.e. difference scores) can be calculated at either unit or aggregate level. At the aggregate level, t-tests can then be used to assess the degree of difference between the mean scores of the two groups of raters.
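A minimal sketch of the bias measure follows. The slides do not say which form of t-test was used; because each MR is matched with an RR in the same establishment, this sketch assumes a paired t-test on the MR minus RR difference scores. The function name and data are hypothetical, and only the t statistic and degrees of freedom are computed (no p-value).

```python
import math
import statistics


def paired_bias(mr, rr):
    """Mean MR-RR difference score and paired t statistic for one item.

    mr, rr: equal-length lists of scores, one entry per establishment.
    Returns (mean_difference, t_statistic, degrees_of_freedom).
    Assumes the difference scores are not all identical.
    """
    diffs = [m - r for m, r in zip(mr, rr)]   # unit-level bias scores
    n = len(diffs)
    mean_diff = statistics.mean(diffs)        # aggregate bias (sign = direction)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_error, n - 1


# Hypothetical yes/no answers from eight establishments
mr_answers = [1, 1, 1, 0, 1, 1, 0, 1]
rr_answers = [1, 0, 1, 0, 0, 1, 0, 0]
mean_diff, t_stat, df = paired_bias(mr_answers, rr_answers)
print(mean_diff, round(t_stat, 3), df)
```

A positive mean difference here would indicate that management respondents report the practice or event more often than employee representatives do.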

Model of Interrater Agreement/Reliability
1. A number of factors have been suggested to affect the extent to which raters are likely to agree about the particular targets/elements that they are rating, such as the contentiousness of the phenomenon involved (Green & James, 2003).
2. The best known model of rater agreement/consensus is that proposed by Kenny (1991) in the social psychological domain (see also Hoyt & Kerns, 1999).
3. In the present study I draw on this model but adapt and extend it to the HRM domain, linked specifically to WERS2004.

Model of Interrater Agreement/Reliability
• Specifically, MR-RR interrater agreement/reliability in WERS2004 can be expected to be affected by five main factors (see Table 1 for details):
  • The nature of the attributes/items being rated (e.g. objective vs. subjective; HR practices vs. non-HR practices).
  • The nature (complexity, heterogeneity and stability) of the target system/organisation being rated (e.g. size, age and stability of the establishment).
  • The nature of the MR and RR raters, especially in terms of their knowledge and experience of the target system they are rating (e.g. their level of seniority, formal position and length of tenure in the establishment).
  • Relational factors designed to capture common/shared MR and RR experiences (e.g. frequency of contact between MR and RR raters, IiP accreditation).
  • Shared world view and understanding of MR and RR raters (e.g. whether RR are union or non-union representatives, extent of mutual trust between MR and RR).

Table 1 – Model of Interrater Agreement/Reliability in WERS2004

Table 1 continued – Model of Interrater Agreement/Reliability in WERS2004
MR = Management respondent; RR = Employee representative respondent
Note that the effect of the nature of items on interrater agreement/reliability can only be tested at the aggregate level of analysis. In contrast, the effect of the other factors can, in principle, be tested at both aggregate and unit (i.e. establishment) level of analysis.

Sample and Data
1. Establishments = 459 establishments for which both MR and RR data were available on a matched set of 45 questions in WERS2004.
2. Respondents = 918 respondents (459 pairs of MR and RR, one MR-RR pair per establishment).
3. Overall Rating Design = nested design (two raters, one MR and one RR, nested within each target/establishment; see Hoyt, 2000).
4. Questions = 45 non-filtered matched questions from the management and employee representative questionnaires in WERS2004. The majority of questions involved a dichotomous (yes/no) response format. A number of the questions, however, used categorical (non-continuous or non-interval) response scales (e.g. whether management at the establishment did not inform, informed, consulted, or negotiated with employee representatives over a range of specific issues). Response categories on all categorical scales were treated as separate dichotomous (yes/no) items, making for a total of 74 matched dichotomous items for use in the main analysis.
5. Total items used in the analysis = 74 matched dichotomous items based on the 45 dichotomous and categorical matched MR and RR questions.
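The recoding of categorical questions into separate yes/no items described in point 4 can be sketched as simple indicator (dummy) coding. This is an illustration under assumptions, not the study's own recoding script: the function name, the category labels and the choice of which categories to retain as items are all hypothetical.

```python
def dichotomise(response, categories):
    """Turn one categorical answer into one 0/1 item per category.

    response: the category chosen by the respondent.
    categories: the categories to be treated as separate yes/no items.
    """
    return {category: int(response == category) for category in categories}


# Hypothetical labels for the information/consultation/negotiation scale
voice_categories = ["did not inform", "informed", "consulted", "negotiated"]

mr_items = dichotomise("consulted", voice_categories)
rr_items = dichotomise("informed", voice_categories)
print(mr_items)
print(rr_items)

# Item-by-item MR-RR agreement for this one establishment and issue
print({c: int(mr_items[c] == rr_items[c]) for c in voice_categories})
```

One consequence of this coding is that a single disagreement on a categorical question shows up as disagreement on two of the derived dichotomous items and agreement on the rest.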

Sample and Data
4. Type of Items Covered in the Analysis
a. HR Practices (47 items)
(i) Presence of formal ‘due process’ procedures (5 items) (e.g. whether there are formal grievance and disciplinary procedures at the establishment)
(ii) Information-sharing practices (3 items) (i.e. whether management shares information on a range of strategic issues at the establishment)
(iii) Representative voice practices (39 items) (i.e. whether management informs, consults or negotiates with employee representatives on a range of 13 issues)
b. Events / HR Outcomes (25 items)
(i) Occurrence of various forms of industrial conflict at the establishment (10 items)
(ii) Threat of various forms of industrial action at the establishment (8 items)
(iii) Occurrence of various types of changes at the workplace (7 items) (e.g. introduction of new technology)

Sample and Data
c. Attitudes (2 items)
(i) Management more or less favourable to union membership at the establishment (2 items)
(see Table H1 in handout for more detailed listing of items)
5. Dependent Variables (see above)
• Interrater Agreement Measures
  a) % MR & RR agreement (aggregate level only)
  b) Yule’s Q (aggregate level only)
  c) rwg* (aggregate and unit level)
• Interrater Bias
  a) MR-RR difference scores (aggregate level only)
• Interrater Reliability
  a) ICC(1) (aggregate level only)
  b) ICC(2) (aggregate level only)
  c) Pearson correlation (aggregate level only)
6. Independent Variables (see Table 1 above)

Results
A. Aggregate Level Analysis
1. Descriptives (Table H1 in handout and Tables 2a and 2b)
2. Correlations (Table 3)
4. Corrections for Attenuation (Table 4)
5. Regressions for items (Table 5)
6. Bivariate test of model (Table 6)
B. Unit/Establishment Level of Analysis
1. Multivariate test of model (Table 7)
C. Summary of Aggregate and Unit Level of Analysis
1. Test of model: Summary results (Table 8)

Table 2a – Aggregate level analysis: Aggregate measures of interrater agreement and reliability - Mean scores by type of item (N = 74 items across 459 pairs of MR and RR raters in 459 establishments)
MR = Management respondent; RR = Employee representative respondent

Table 2b – Aggregate level analysis: Comparison of mean scores by type of item for aggregate measures of interrater agreement and reliability
ns Difference between means not significant at < .05 level
* Difference between means significant at < .05 level
** Difference between means significant at < .01 level
*** Difference between means significant at < .001 level

Table 3 – Aggregate level analysis: Correlations between aggregate measures of interrater agreement and reliability (N = 74 items) MR= Management respondent RR = Employee representative respondent * p < .05 ** p < .01 *** p < .001

Table 4 – Correlations between selected outcomes and HR practice measures based on MR & RR respondents – Observed uncorrected correlations vs. correlations corrected for attenuation (average interrater error/unreliability scores)
a Corrected correlations in parentheses.
MR = Management respondent; RR = Employee representative respondent
* p < .05 ** p < .01 *** p < .001
Assumptions:
(1) Average (interrater) reliability of HRP measure = .10 (see Table 1a)
(2) Average reliability of objective absence and turnover measures = 1.00
(3) Average reliability of financial, productivity and quality performance measures based on MR subjective assessments = .70
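For readers unfamiliar with the correction, the sketch below applies the classical (Spearman) correction for attenuation, r_corrected = r_observed / sqrt(rel_x × rel_y), using the three reliability assumptions listed under Table 4. Whether this exact formula was used to produce Table 4 is an assumption. The function name is hypothetical, and the observed correlation of .05 is a made-up input, not a result from the study.

```python
import math


def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Classical correction of a correlation for unreliability in x and y."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Reliabilities assumed in the notes to Table 4
REL_HRP = 0.10          # average interrater reliability of HR practice measures
REL_OBJECTIVE = 1.00    # objective absence and turnover measures
REL_SUBJECTIVE = 0.70   # MR subjective performance assessments

r_obs = 0.05  # hypothetical observed correlation, for illustration only
print(round(correct_for_attenuation(r_obs, REL_HRP, REL_OBJECTIVE), 3))   # 0.158
print(round(correct_for_attenuation(r_obs, REL_HRP, REL_SUBJECTIVE), 3))  # 0.189
```

With a predictor reliability as low as .10, the correction multiplies observed correlations by a factor of more than three, so moderately sized observed correlations can produce corrected values above 1; the function returns the raw corrected value and leaves that judgement to the analyst.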

Table 5 – Aggregate level regression analysis: Effect of type of item on aggregate measures of interrater agreement and reliability (N = 74 items) a Standardised beta coefficients MR= Management respondent RR = Employee representative respondent * p < .05 ** p < .01 *** p < .001

Table 6 – Aggregate level analysis: Bivariate test of rater agreement model – Comparison of mean correlations between MR & RR ratings for each independent variable by type of item

Table 6 Continued – Aggregate level analysis: Bivariate test of rater agreement model – Comparison of mean correlations between MR & RR ratings for each independent variable by type of item MR = Management respondent RR = Employee representative respondent * p < .05 ** p < .01 *** p < .001

Table 7 – Establishment level analysis: Test of rater agreement model - Regression results for rwg* by type of item and overall
MR = Management respondent; RR = Employee representative respondent
a Standardised beta coefficients
* p < .05 ** p < .01 *** p < .001

Table 8 – Test of interrater agreement model: Summary of aggregate level bivariate and establishment level multivariate results For detailed results see Table 2 and Table 3. MR = Management respondent RR = Employee representative respondent + = positive observed effect; - = negative observed effect ns p > .05 * p < .05 ** p < .01 *** p < .001

Conclusions
1. Interrater measurement error (i.e. error due to raters) in WERS2004 is not completely random. Rather, it is mildly patterned/predictable (e.g. it is greater for HRP items than for non-HRP items, greater in newer than in older establishments, etc.).
2. This patterning, however, is not very marked.
3. Therefore, treating rater error as random is not likely to significantly distort/affect the corrections for attenuation that might be applied to the WERS2004 data (e.g. to estimated correlation and regression coefficients).
4. In principle, therefore, it is likely to be acceptable to use average or overall ICC(1) and ICC(2) values to correct for attenuation in WERS2004 data/results.

Conclusions
• But it is important to note two points in this context:
  • First, more research is required to determine the extent of generalisability of the interrater reliability estimates (i.e. ICC(1) and ICC(2) values) obtained in the present study. The key question here is whether rater-related measurement error (and rater-related error structures more generally) is (a) rater-pair specific, (b) domain, sub-domain and item specific.
  • Second, even if the present interrater reliability estimates turn out to be generalisable, correcting for attenuation in multiple independent variables (i.e. in complex multivariate models using large numbers of predictors and control variables) is likely to prove extremely difficult, if not impossible, to do in practice.

Conclusions
6. Implications of Results for:
(a) Analysis of WERS2004 data
• Correct for attenuation?
• Do nothing / business as usual?
• Other?
(b) Survey Design
• Increase number of raters/respondents?
• Have mixed designs?
• Other?
• Costs and benefits of different options?
