Indexing Consistency Across Multiple Indexers/Taggers
Dietmar Wolfram, Hope A. Olson & Raina Bloom
University of Wisconsin-Milwaukee
SIS Research Forum, October 14, 2008
Indexing Consistency • Indexing key to retrieval • Consistency deemed essential for effective retrieval • People (including indexers) interpret content differently • Typically agreement on core topics, but with wide dispersion • Same is true for newer tagging environments
Previous Research • Long history of consistency research with small number of indexers • Olson & Wolfram (2006) mapped distribution of multiple indexers’ terms (n=33) • Diverged into two paths: • Co-occurrences and syntagmatic relationships (JDoc forthcoming) • Measuring consistency
Measuring Consistency • Medelyan & Witten (2006) proposed a measure based on the cosine similarity measure used in IR, but applied to terms from controlled vocabularies (the cosine building block is sketched below) • Wolfram & Olson (2007) introduced a vector-based Inter-indexer Consistency Density (ICD) measure for multiple indexers (n=64)
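A minimal sketch of the cosine similarity underlying such measures, computed over two indexers' assigned terms; the specific weighting and controlled-vocabulary handling in Medelyan & Witten's measure are not reproduced here, and the term lists are invented placeholders.

```python
# A minimal sketch of cosine similarity between two indexers'
# term assignments; Medelyan & Witten's actual measure adds
# controlled-vocabulary handling not shown here.
from collections import Counter
import math

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of assigned terms."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine(["cats", "pets", "animals"], ["cats", "animals", "zoology"]))
# 0.666... — two shared terms out of three apiece
```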
2006 Study: Distribution of Student-assigned Terms • Distribution of assigned term frequency shows a strong inverse relationship, though not necessarily Zipfian • Similar relationship seen for co-occurrence of assigned terms • 89% of co-occurring term pairs appear once
2007 Study: Vector Technique Tested on Student Indexing Data • t test outcome (assuming unequal variances): t = 0.7288, p = 0.471 at α = .05, i.e., no significant difference (a sketch of this unequal-variances test follows)
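A minimal sketch of an unequal-variances (Welch) t test in SciPy, the kind of test reported above; the group arrays are invented placeholders, not the study's indexing data.

```python
# A minimal sketch of the unequal-variances (Welch) t test.
# The two groups are hypothetical placeholder values, not the
# study's actual data.
from scipy.stats import ttest_ind

group_a = [0.42, 0.51, 0.38, 0.47, 0.55]  # hypothetical values
group_b = [0.44, 0.49, 0.41, 0.52, 0.46]  # hypothetical values

t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.3f}")
```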
Study Purpose • To apply the indexing consistency method developed by Wolfram & Olson (2007) to a large data set • To determine whether vocabulary usage in an emerging area differs significantly from established areas, as measured by inter-tagger consistency across documents in different fields
Measures of Inter-Indexer Consistency • Usually only permit comparisons of two indexers (or a few more) • Hooper (1965): $H(I_1, I_2) = \frac{C}{A + B - C}$ • Rolling (1981): $R(I_1, I_2) = \frac{2C}{A + B}$ • Where A and B are the sizes of I1's and I2's term sets, and C is the number of common terms (both measures are sketched below)
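A minimal sketch of both pairwise measures as set operations, under the definitions above; the example term sets are invented.

```python
# A minimal sketch of the Hooper (1965) and Rolling (1981)
# pairwise consistency measures, treating each indexer's
# assignment as a set of terms.
def hooper(terms_1, terms_2):
    """H = C / (A + B - C): intersection over union."""
    a, b = set(terms_1), set(terms_2)
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def rolling(terms_1, terms_2):
    """R = 2C / (A + B): the Dice coefficient."""
    a, b = set(terms_1), set(terms_2)
    return 2 * len(a & b) / (len(a) + len(b))

i1 = {"indexing", "consistency", "retrieval"}            # hypothetical
i2 = {"indexing", "tagging", "retrieval", "vocabulary"}  # hypothetical
print(hooper(i1, i2))   # 2 / (3 + 4 - 2) = 0.4
print(rolling(i1, i2))  # 2*2 / (3 + 4) ≈ 0.571
```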
Simplifying the Process by Applying IR Modelling • See the CAIS 07 presentation (Wolfram & Olson) for an introduction to the concept • Indexing is central to IR theory and models (e.g., the vector space model) • Usually, the document is the focal point • The same principles can be applied to indexers & taggers, who now serve as the focal point
Defining an Indexer/Tagger Space • Traditional vector space model: documents d1, d2, …, dm represented as term vectors • The same approach can be applied to a multiple indexer environment, where Documents = Indexers (see the sketch below)
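A minimal sketch of the substitution, building an indexer-by-term matrix in place of the usual document-by-term matrix; the tagger names and term lists are invented placeholders.

```python
# A minimal sketch of treating indexers/taggers as the "documents"
# of a vector space model: one row per indexer, one column per
# unique assigned term. All assignments are hypothetical.
import numpy as np

assignments = {
    "I1": ["cats", "pets", "animals"],
    "I2": ["cats", "animals"],
    "I3": ["dogs", "pets"],
}

vocab = sorted({t for terms in assignments.values() for t in terms})
matrix = np.array([[terms.count(t) for t in vocab]
                   for terms in assignments.values()])
print(vocab)    # ['animals', 'cats', 'dogs', 'pets']
print(matrix)   # indexer-by-term frequency matrix
```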
Indexing/Tagging Space: Calculating Distances [figure: indexers I1, I2, I3 plotted around their centroid C, with distances Dist(I1,C), Dist(I2,C), and Dist(I3,C)]
Document Space vs. Indexer / Tagger Space Characteristics • Characteristic of overall space measured as a density using inter-document/tagger distances • Document space • Low density space => easier to distinguish documents => better for retrieval • Indexer/Tagger space • The opposite is desirable • High density space => more similarity & higher consistency
Calculating Inter-Indexer (Tagger) Consistency Density • ICD computed from the distances Dist(Ii, C) between each indexer/tagger vector and the centroid, where m is the number of indexers/taggers (a hedged sketch follows)
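The slide's formula appears to have been an image and did not survive extraction, so the following is only a plausible sketch: it assumes density is the reciprocal of the mean indexer-to-centroid distance, so that tighter clusters (higher consistency) yield higher values. The published Wolfram & Olson (2007) formula may normalize differently.

```python
# A hedged sketch of an ICD-style calculation. ASSUMPTION: density
# here is the reciprocal of the mean Euclidean distance from each
# indexer vector to the centroid; the published formula may differ.
import numpy as np

def icd(matrix):
    """matrix: m x n array, one row per indexer/tagger."""
    centroid = matrix.mean(axis=0)                     # C
    dists = np.linalg.norm(matrix - centroid, axis=1)  # Dist(Ii, C)
    return 1.0 / dists.mean()                          # higher = denser

taggers = np.array([[1, 1, 0, 1],   # hypothetical term vectors
                    [1, 1, 0, 0],
                    [0, 1, 1, 1]], dtype=float)
print(icd(taggers))
```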
Applying the ICD Measure to a Large Dataset • Used tagging data available from CiteULike (www.citeulike.org) • 800,000 tagged documents • 29,000 taggers • Identified scholarly documents that have been tagged by a large number of taggers, which served as the basis for the comparison • Viable documents were categorized into 3 topical areas • Average ICDs for groups of documents compared across the topical areas
Data Characteristics • Less than 0.03% of articles have been tagged by at least 10 people • ~ two-thirds of highly tagged documents represent spam (e.g., links to commercial websites) • 78 viable articles tagged by at least 20 taggers were identified • 3 subject areas were identified • Science • Social Science • Social Software
Potential Challenge for Comparing Outcomes • Densities are influenced by distances in the tagger space • Distances are influenced by the dimensionality of the space • Dimensionality is influenced by number of unique tags and taggers • Therefore, outcomes should first be checked for significant correlations
Relationships between Taggers and Density Outcomes • Pearson's r = 0.033, p = .782 … therefore, it's not an issue (a sketch of the check follows)
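A minimal sketch of this correlation check, testing whether the number of taggers per document correlates with its density outcome; the arrays are invented placeholders, not the CiteULike data.

```python
# A minimal sketch of checking whether tagger counts correlate
# with density outcomes. All values are hypothetical placeholders.
from scipy.stats import pearsonr

n_taggers = [21, 25, 32, 40, 22, 28, 35, 51]
densities = [0.41, 0.39, 0.44, 0.40, 0.43, 0.38, 0.42, 0.40]

r, p = pearsonr(n_taggers, densities)
print(f"r = {r:.3f}, p = {p:.3f}")
```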
Analyzing the Data • Outliers were removed • 74 documents remained • One-way ANOVA (parametric) and its non-parametric equivalent (the Kruskal-Wallis test) used to compare average densities across the three topic areas
ANOVA Outcome [table: parametric one-way ANOVA and non-parametric Kruskal-Wallis test results] (both tests are sketched below)
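A minimal sketch of running both tests in SciPy over per-topic density values; the group names match the slide's subject areas, but all numbers are invented placeholders.

```python
# A minimal sketch of the parametric / non-parametric comparison:
# one-way ANOVA plus the Kruskal-Wallis test over per-topic
# density values. All numbers are hypothetical placeholders.
from scipy.stats import f_oneway, kruskal

science         = [0.41, 0.39, 0.44, 0.40]
social_science  = [0.43, 0.38, 0.42, 0.45]
social_software = [0.40, 0.44, 0.41, 0.39]

f_stat, p_anova = f_oneway(science, social_science, social_software)
h_stat, p_kw = kruskal(science, social_science, social_software)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kw:.3f}")
```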
Discussion • No significant differences in average density outcomes • Therefore, no significant difference in vocabulary usage • Could it be a reflection of the tagger population? • Investigated topic areas are closely related, so differences might not be apparent • Limited to what is being tagged => most items related to social software and allied science & social science areas
Research Limitations • Only takes distances into account, not semantics or contexts • Different sets of terms with similar tagging specificity and exhaustivity patterns will result in similar densities • Method can be computationally more challenging than traditional approaches • But with more taggers this is to be expected
Research Limitations • Findings are only as good as the data • Spam is common • Tagger motivation and intentions for contributing documents and tags may differ
Conclusion • ICD method is viable and usable even on larger datasets • Vocabulary consistency does not appear to be significantly different across the three broad topic areas • Future research will examine further applications of the regularities found