Agreement Indices in Multi-Level Analysis. Ayala Cohen, Faculty of Industrial Engineering & Management, Technion – Israel Institute of Technology, May 2007
Outline • Introduction (on interrater agreement, IRA) • rWG(J) index of agreement • AD (Absolute Deviation), an alternative measure of agreement -------------------------------- Review and our work: (2001) with Etti Doveh and Uri Eick; (2007) with Etti Doveh and Inbal Shani
INTRODUCTION: Why do we need a measure of agreement? In recent years there has been a growing number of studies based on multi-level data in applied psychology, organizational behavior, and clinical trials. Typical data structures: individuals within groups (two levels); individuals within groups within departments (three levels).
Constructs • Constructs are our building blocks in developing and in testing theory. • Group-level constructs describe the group as a whole and are of three types (Kozlowski & Klein, 2000): • Global, shared, or configural.
Global Constructs • Relatively objective, easily observable, descriptive group characteristics. • Originate and are manifest at the group level. • Examples: • Group function, size, or location. • No meaningful within-group variability. • Measurement is generally straightforward.
Shared Constructs • Group characteristics that are common to group members • Originate in group members’ attitudes, perceptions, cognitions, or behaviors • Which converge as a function of socialization, leadership, shared experience, and interaction. • Within-group variability predicted to be low. • Examples: Group climate, norms.
Configural Group-Level Constructs • Group characteristics that describe the array, pattern, dispersion, or variability within a group. • Originate in group member characteristics (e.g., demographics, behaviors, personality) • But no assumption or prediction of convergence. • Examples: • Diversity, team star or weakest member.
Justifying Aggregation • Why is this essential? • In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability. • If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions. • So, test both: within-group agreement (the construct is supposed to be shared; is it really?) and between-group variability (reliability: groups are expected to differ significantly; do they really?)
Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs. This framework includes the assessment of within-group agreement: assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.
A distinction should be made between interrater reliability (IRR) and interrater agreement (IRA). Many past studies wrongly used the two terms interchangeably in their discussions.
The term interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).
Interrater reliability refers to the degree to which the ratings of different judges are proportional when expressed as deviations from their means.
Interrater reliability (IRR) thus refers to relative consistency and is assessed by correlations; interrater agreement (IRA) refers to absolute consensus in the scores assigned by the raters and is assessed by measures of variability.
Scale of Measurement • Questionnaire with J parallel items on a Likert scale with A categories. E.g., A = 5: 1 = Strongly disagree, 2 = Disagree, 3 = Indifferent, 4 = Agree, 5 = Strongly agree.
Example: k = 3 raters, Likert scale with A = 7 categories, J = 5 items.
Prior to aggregation, we assessed within-unit agreement on … To do so, we used two complementary approaches (Kozlowski & Klein, 2000): a consistency-based approach, computation of the intraclass correlation coefficient ICC(1); and a consensus-based approach (an index of agreement).
How can we assess agreement? • Variability measures, e.g., the variance or the MAD (Mean Absolute Deviation). Problem: what counts as a “small” or “large” value?
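As a toy illustration (not from the original slides), here is a minimal Python sketch computing the variance and mean absolute deviation of a hypothetical set of ratings; the point is that, without a benchmark, the raw numbers do not tell us whether agreement is high or low.

```python
import statistics

# Hypothetical ratings from 9 raters on a 5-point Likert item
ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]

variance = statistics.variance(ratings)  # sample variance, (n - 1) denominator
mean = statistics.mean(ratings)
mad = sum(abs(x - mean) for x in ratings) / len(ratings)  # mean absolute deviation

print(f"variance = {variance:.2f}, MAD = {mad:.2f}")
# Without a reference value (e.g., the variance expected under random responding),
# it is unclear whether these numbers indicate "small" or "large" disagreement.
```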
The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984). J stands for the number of items in the scale.
Examples where rWG(J) was used to assess interrater agreement: • group cohesiveness • group socialization emphasis • transformational and transactional leadership • positive and negative affective group tone • organizational climate
This index compares the observed within-group variance to the variance expected under “random responding”. In the particular case of one item (stimulus), i.e. J = 1, the index is denoted rWG and is equal to
$r_{WG} = 1 - S_x^2 / \sigma_E^2$
where $S_x^2$ is the observed variance of the ratings for the single stimulus and $\sigma_E^2$ is the variance of some “null distribution” corresponding to no agreement.
Problem: A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable. If the null distribution used to define $\sigma_E^2$ fails to properly model a random response, then the interpretability of the index is suspect.
The most natural candidate to represent non-agreement is the uniform (rectangular) distribution, which implies that for an item with A response categories, the proportion of cases in each category equals 1/A.
For a uniform null distribution on an item with A categories: $\sigma_E^2 = (A^2 - 1)/12$.
How do we calculate the sample variance? We have n ratings, and suppose n is “small”.
Example: A = 5, k = 9 raters: 3 3 3 3 5 5 5 5 4. With (n − 1) in the denominator, $S_x^2 = 1.0$.
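A minimal Python sketch (assuming the uniform null from the previous slide) that reproduces this example:

```python
import statistics

def rwg(ratings, a):
    """Single-item rWG with a uniform null: sigma_E^2 = (A^2 - 1) / 12."""
    s2 = statistics.variance(ratings)   # sample variance, (n - 1) denominator
    sigma_e2 = (a ** 2 - 1) / 12.0      # variance of the uniform null distribution
    return 1.0 - s2 / sigma_e2

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]  # the 9 raters from this slide, A = 5
print(rwg(ratings, 5))                 # S_x^2 = 1.0, sigma_E^2 = 2.0 -> rWG = 0.5
```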
James et al. (1984): “The distribution of responses could be non-uniform when no genuine agreement exists among the judges. The systematic biasing of a response distribution due to a common response style within a group of judges should be considered. This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution.”
Slightly Skewed Null: p(1) = .05, p(2) = .15, p(3) = .20, p(4) = .35, p(5) = .25, yielding $\sigma_E^2 = 1.34$. Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998). Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random responses to leadership and attitude questionnaires.
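For illustration, the null variance quoted above can be verified directly; a short Python sketch:

```python
# Variance of the "slightly skewed" null distribution quoted on this slide
categories = [1, 2, 3, 4, 5]
probs = [0.05, 0.15, 0.20, 0.35, 0.25]

mean = sum(c * p for c, p in zip(categories, probs))               # 3.60
var = sum(p * (c - mean) ** 2 for c, p in zip(categories, probs))  # ~1.34
print(round(var, 2))  # 1.34, versus 2.0 for the uniform null with A = 5
```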
James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.
Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested: consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
An additional “problem”: the index can take negative values, when the observed variance is larger than expected under random responding.
Bi-modal distribution (extreme disagreement). Example: A = 5, half the raters answer 1 and half answer 5. The variance is 4, versus a uniform-null variance of 2, so $r_{WG} = 1 - 4/2 = -1$.
What to do when rWG is negative? James et al. (1984) recommended replacing a negative value by zero; this was criticized by Lindell et al. (1999).
For a scale of J items:
$r_{WG(J)} = \dfrac{J\left(1 - \bar{S}_{x_j}^2/\sigma_E^2\right)}{J\left(1 - \bar{S}_{x_j}^2/\sigma_E^2\right) + \bar{S}_{x_j}^2/\sigma_E^2}$
where $\bar{S}_{x_j}^2$ is the average variance over the J items.
For a scale of J items, this is the Spearman-Brown reliability formula applied to the single-item index: with $r = 1 - \bar{S}_{x_j}^2/\sigma_E^2$, $r_{WG(J)} = \dfrac{J\,r}{1 + (J-1)\,r}$.
Example: 3 raters, 7-category Likert scale, 5 items; variances calculated with n in the denominator.
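A hedged sketch of the multi-item computation, following the rWG(J) formula above with a uniform null; the 3 × 5 rating matrix below is made up (the slide's actual data are not shown), and variances use n in the denominator as stated on the slide:

```python
import statistics

def rwg_j(ratings_by_item, a):
    """rWG(J) for a J-item scale with a uniform null (James, Demaree & Wolf, 1984)."""
    j = len(ratings_by_item)
    sigma_e2 = (a ** 2 - 1) / 12.0
    # Average over items of the within-group variance (n in the denominator)
    mean_s2 = statistics.mean(statistics.pvariance(item) for item in ratings_by_item)
    ratio = mean_s2 / sigma_e2
    return j * (1.0 - ratio) / (j * (1.0 - ratio) + ratio)

# Hypothetical data matching the slide's setup: k = 3 raters, A = 7, J = 5 items
items = [[6, 6, 7], [5, 6, 6], [6, 7, 7], [5, 5, 6], [6, 6, 6]]
print(round(rwg_j(items, 7), 3))  # close to 1 for these nearly identical ratings
```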
Since its introduction, the use of rWG(J) has raised several criticisms and debates. It was initially described by James et al. (1984) as a measure of interrater reliability. Schmidt & Hunter (1989) criticized this index, claiming that an index of reliability cannot be defined on a single item.
In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement, not reliability. James, Demaree & Wolf (1993) concurred with this distinction, and it has now been accepted that rWG(J) is a measure of agreement.
Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index $r^*_{WG(J)}$ which is allowed to take negative values (even beyond minus 1):
$r^*_{WG(J)} = 1 - \bar{S}_{x_j}^2/\sigma_E^2$
The modified index r*WG(J) provides corrections to two of the criticisms raised against rWG(J). First, it can take negative values, when the observed agreement is less than hypothesized. Second, unlike rWG(J), it does not include a Spearman-Brown correction and thus does not depend on the number of items (J).
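A companion sketch of the Lindell, Brandt & Whitney variant, under the same assumptions (uniform null, hypothetical data) as the rWG(J) example above:

```python
import statistics

def rwg_j_star(ratings_by_item, a):
    """r*WG(J): average item variance over the null variance, no Spearman-Brown step."""
    sigma_e2 = (a ** 2 - 1) / 12.0
    mean_s2 = statistics.mean(statistics.pvariance(item) for item in ratings_by_item)
    return 1.0 - mean_s2 / sigma_e2   # can go below 0, and even below -1

items = [[6, 6, 7], [5, 6, 6], [6, 7, 7], [5, 5, 6], [6, 6, 6]]  # same made-up data
print(round(rwg_j_star(items, 7), 3))  # lower than rWG(J): no J-based boost
```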
Academy of Management Journal, 2006: “Does CEO Charisma Matter? …”, Agle et al. • CEOs’ 770 team members. “Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistic r*WG(J) …”
Agle et al. (2006): “Overall, the very high interrater agreement justified the combination of individual managers’ responses into a single measure of charisma for each CEO…” ------------------ They display ICC(1) = …, ICC(2) = …, r*WG(J) = … One number?
Ensemble of groups (RMNET): shall we report the median or the mean? … “Observed distributions of rWG(J) are often wildly skewed … medians are the most appropriate summary statistic” …
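A tiny illustrative sketch (with made-up rWG(J) values) of why the median is the preferred ensemble summary when the distribution is skewed:

```python
import statistics

# Made-up rWG(J) values for an ensemble of groups, with a skewed lower tail
rwg_values = [0.94, 0.91, 0.88, 0.86, 0.85, 0.83, 0.60, 0.35, -0.20]

print("median:", statistics.median(rwg_values))          # 0.85
print("mean:  ", round(statistics.mean(rwg_values), 2))  # 0.67, pulled down by the tail
```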
Ehrhart, M.G. (Personnel Psychology, 2004): Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior. Grocery store chain, 3,914 employees in 249 departments.
… “The median rWG values across the 249 departments were: 0.88 for servant leadership, …” WHAT TO CONCLUDE?
Rule of thumb: the practice of viewing rWG values in the .70s and higher as representing acceptable convergence is widespread. For example, Zohar (2000) cited rWG values in the .70s and mid .80s as proof that judgments “were sufficiently homogeneous for within group aggregation”.
Benchmarking rWG Interrater Agreement Indices: Let’s Drop the .70 Rule-Of-Thumb. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, April 2004. R.J. Harvey and E. Hollander.
“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG≥0.70 denotes acceptable agreement”…..