Indexing Consistency Across Multiple Indexers/Taggers
Dietmar Wolfram, Hope A. Olson & Raina Bloom
University of Wisconsin-Milwaukee
SIS Research Forum, October 14, 2008
Indexing Consistency • Indexing key to retrieval • Consistency deemed essential for effective retrieval • People (including indexers) interpret content differently • Typically agreement on core topics, but with wide dispersion • Same is true for newer tagging environments
Previous Research • Long history of consistency research with small number of indexers • Olson & Wolfram (2006) mapped distribution of multiple indexers’ terms (n=33) • Diverged into two paths: • Co-occurrences and syntagmatic relationships (JDoc forthcoming) • Measuring consistency
Measuring Consistency • Medelyan & Witten (2006) proposed a measure based on the cosine similarity measure used in IR, but applied to terms from controlled vocabularies (the cosine building block is sketched below) • Wolfram & Olson (2007) introduced a vector-based Inter-indexer Consistency Density (ICD) measure for multiple indexers (n=64)
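A minimal sketch of the cosine similarity underlying such measures, computed over two indexers' assigned terms; the specific weighting and controlled-vocabulary handling in Medelyan & Witten's measure are not reproduced here, and the term lists are invented placeholders.

```python
# A minimal sketch of cosine similarity between two indexers'
# term assignments; Medelyan & Witten's actual measure adds
# controlled-vocabulary handling not shown here.
from collections import Counter
import math

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of assigned terms."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine(["cats", "pets", "animals"], ["cats", "animals", "zoology"]))
# 0.666... — two shared terms out of three apiece
```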
2006 Study: Distribution of Student-assigned Terms • Distribution of assigned term frequency shows a strong inverse relationship, though not necessarily Zipfian • Similar relationship seen for co-occurrence of assigned terms • 89% of co-occurring term pairs appear once
2007 Study: Vector Technique Tested on Student Indexing Data • t test outcome (assuming unequal variances): t = 0.7288, p = 0.471 at α = .05, i.e., no significant difference (a sketch of this unequal-variances test follows)
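A minimal sketch of an unequal-variances (Welch) t test in SciPy, the kind of test reported above; the group arrays are invented placeholders, not the study's indexing data.

```python
# A minimal sketch of the unequal-variances (Welch) t test.
# The two groups are hypothetical placeholder values, not the
# study's actual data.
from scipy.stats import ttest_ind

group_a = [0.42, 0.51, 0.38, 0.47, 0.55]  # hypothetical values
group_b = [0.44, 0.49, 0.41, 0.52, 0.46]  # hypothetical values

t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.3f}")
```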
Study Purpose • To apply the indexing consistency method developed by Wolfram & Olson (2007) to a large data set • To determine whether vocabulary usage in an emerging area differs significantly from established areas, as measured by inter-tagger consistency across documents in different fields
Measures of Inter-Indexer Consistency • Usually only permit comparisons of two indexers (or a few more) • Hooper (1965): $H(I_1, I_2) = \frac{C}{A + B - C}$ • Rolling (1981): $R(I_1, I_2) = \frac{2C}{A + B}$ • Where A and B are the sizes of I1's and I2's term sets, and C is the number of common terms (both measures are sketched below)
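A minimal sketch of both pairwise measures as set operations, under the definitions above; the example term sets are invented.

```python
# A minimal sketch of the Hooper (1965) and Rolling (1981)
# pairwise consistency measures, treating each indexer's
# assignment as a set of terms.
def hooper(terms_1, terms_2):
    """H = C / (A + B - C): intersection over union."""
    a, b = set(terms_1), set(terms_2)
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def rolling(terms_1, terms_2):
    """R = 2C / (A + B): the Dice coefficient."""
    a, b = set(terms_1), set(terms_2)
    return 2 * len(a & b) / (len(a) + len(b))

i1 = {"indexing", "consistency", "retrieval"}            # hypothetical
i2 = {"indexing", "tagging", "retrieval", "vocabulary"}  # hypothetical
print(hooper(i1, i2))   # 2 / (3 + 4 - 2) = 0.4
print(rolling(i1, i2))  # 2*2 / (3 + 4) ≈ 0.571
```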
Simplifying the Process by Applying IR Modelling • See the CAIS 07 presentation (Wolfram & Olson) for an introduction to the concept • Indexing is central to IR theory and models (e.g., the vector space model) • Usually, the document is the focal point • The same principles can be applied to indexers & taggers, who now serve as the focal point
Defining an Indexer/Tagger Space • Traditional vector space model: documents d1, d2, …, dm represented as term vectors • The same approach can be applied to a multiple indexer environment, where Documents = Indexers (see the sketch below)
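A minimal sketch of the substitution, building an indexer-by-term matrix in place of the usual document-by-term matrix; the tagger names and term lists are invented placeholders.

```python
# A minimal sketch of treating indexers/taggers as the "documents"
# of a vector space model: one row per indexer, one column per
# unique assigned term. All assignments are hypothetical.
import numpy as np

assignments = {
    "I1": ["cats", "pets", "animals"],
    "I2": ["cats", "animals"],
    "I3": ["dogs", "pets"],
}

vocab = sorted({t for terms in assignments.values() for t in terms})
matrix = np.array([[terms.count(t) for t in vocab]
                   for terms in assignments.values()])
print(vocab)    # ['animals', 'cats', 'dogs', 'pets']
print(matrix)   # indexer-by-term frequency matrix
```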
Indexing/Tagging Space: Calculating Distances [figure: indexers I1, I2, I3 plotted around their centroid C, with distances Dist(I1,C), Dist(I2,C), and Dist(I3,C)]
Document Space vs. Indexer / Tagger Space Characteristics • Characteristic of overall space measured as a density using inter-document/tagger distances • Document space • Low density space => easier to distinguish documents => better for retrieval • Indexer/Tagger space • The opposite is desirable • High density space => more similarity & higher consistency
Calculating Inter-Indexer (Tagger) Consistency Density • ICD computed from the distances Dist(Ii, C) between each indexer/tagger vector and the centroid, where m is the number of indexers/taggers (a hedged sketch follows)
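The slide's formula appears to have been an image and did not survive extraction, so the following is only a plausible sketch: it assumes density is the reciprocal of the mean indexer-to-centroid distance, so that tighter clusters (higher consistency) yield higher values. The published Wolfram & Olson (2007) formula may normalize differently.

```python
# A hedged sketch of an ICD-style calculation. ASSUMPTION: density
# here is the reciprocal of the mean Euclidean distance from each
# indexer vector to the centroid; the published formula may differ.
import numpy as np

def icd(matrix):
    """matrix: m x n array, one row per indexer/tagger."""
    centroid = matrix.mean(axis=0)                     # C
    dists = np.linalg.norm(matrix - centroid, axis=1)  # Dist(Ii, C)
    return 1.0 / dists.mean()                          # higher = denser

taggers = np.array([[1, 1, 0, 1],   # hypothetical term vectors
                    [1, 1, 0, 0],
                    [0, 1, 1, 1]], dtype=float)
print(icd(taggers))
```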
Applying the ICD Measure to a Large Dataset • Used tagging data available from CiteULike (www.citeulike.org) • 800,000 tagged documents • 29,000 taggers • Identified scholarly documents that have been tagged by a large number of taggers, which served as the basis for the comparison • Viable documents were categorized into 3 topical areas • Average ICDs for groups of documents compared across the topical areas
Data Characteristics • Less than 0.03% of articles have been tagged by at least 10 people • ~ two-thirds of highly tagged documents represent spam (e.g., links to commercial websites) • 78 viable articles tagged by at least 20 taggers were identified • 3 subject areas were identified • Science • Social Science • Social Software
Potential Challenge for Comparing Outcomes • Densities are influenced by distances in the tagger space • Distances are influenced by the dimensionality of the space • Dimensionality is influenced by number of unique tags and taggers • Therefore, outcomes should first be checked for significant correlations
Relationships between Taggers and Density Outcomes • Pearson's r = 0.033, p = .782 … therefore, it's not an issue (a sketch of the check follows)
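A minimal sketch of this correlation check, testing whether the number of taggers per document correlates with its density outcome; the arrays are invented placeholders, not the CiteULike data.

```python
# A minimal sketch of checking whether tagger counts correlate
# with density outcomes. All values are hypothetical placeholders.
from scipy.stats import pearsonr

n_taggers = [21, 25, 32, 40, 22, 28, 35, 51]
densities = [0.41, 0.39, 0.44, 0.40, 0.43, 0.38, 0.42, 0.40]

r, p = pearsonr(n_taggers, densities)
print(f"r = {r:.3f}, p = {p:.3f}")
```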
Analyzing the Data • Outliers were removed • 74 documents remained • One-way ANOVA (parametric) and its non-parametric equivalent (the Kruskal-Wallis test) used to compare average densities across the three topic areas
ANOVA Outcome [table: parametric one-way ANOVA and non-parametric Kruskal-Wallis test results] (both tests are sketched below)
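A minimal sketch of running both tests in SciPy over per-topic density values; the group names match the slide's subject areas, but all numbers are invented placeholders.

```python
# A minimal sketch of the parametric / non-parametric comparison:
# one-way ANOVA plus the Kruskal-Wallis test over per-topic
# density values. All numbers are hypothetical placeholders.
from scipy.stats import f_oneway, kruskal

science         = [0.41, 0.39, 0.44, 0.40]
social_science  = [0.43, 0.38, 0.42, 0.45]
social_software = [0.40, 0.44, 0.41, 0.39]

f_stat, p_anova = f_oneway(science, social_science, social_software)
h_stat, p_kw = kruskal(science, social_science, social_software)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kw:.3f}")
```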
Discussion • No significant differences in average density outcomes • Therefore, no significant difference in vocabulary usage • Could it be a reflection of the tagger population? • Investigated topic areas are closely related, so differences might not be apparent • Limited to what is being tagged => most items related to social software and allied science & social science areas
Research Limitations • Only takes distances into account, not semantics or contexts • Different sets of terms with similar tagging specificity and exhaustivity patterns will result in similar densities • Method can be computationally more challenging than traditional approaches • But with more taggers this is to be expected
Research Limitations • Findings are only as good as the data • Spam is common • Tagger motivation and intentions for contributing documents and tags may differ
Conclusion • ICD method is viable and usable even on larger datasets • Vocabulary consistency does not appear to be significantly different across the three broad topic areas • Future research will examine further applications of the regularities found