Building Topic Models in a Federated Digital Library Through Selective Document Exclusion
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign
ASIST 2011, New Orleans, LA, October 10, 2011
Supported by IMLS LG-06-07-0020
The Setting: IMLS DCC
[Diagram: data providers (IMLS NLG and LSTA grantees) expose their collections’ metadata via OAI-PMH; the DCC service provider harvests that metadata and builds the DCC services over the aggregation.]
High-Level Research Interest • Improve “access” to data harvested for federated digital libraries by: • Enhancing the representation of documents • Enhancing the representation of document aggregations • Capitalizing on the relationship between aggregations and documents. • PS: By “document” I mean a single metadata record (usually a Dublin Core record).
Motivation for our Work • Most empirical approaches to this type of problem rely on some kind of analysis of term counts. • Unreliable for our data: • Vocabulary mismatch • Poor probability estimates
The Problem: Supporting End-User Experience • Full-text search • Browse by “subject” • Desired: • Improved browsing • Support for high-level aggregation understanding and resource discovery • Approach: Empirically induced “topics” using established methods such as latent Dirichlet allocation (LDA).
Research Question • Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings? • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. • Approach: Identify and remove “weakly topical” documents during model training.
Latent Dirichlet Allocation • Given a corpus of documents, C, and an empirically chosen integer k: • Assume that a generative process involving k latent topics generated the word occurrences in C. • End result, for a given word w, a given document D, and each topic T1 … Tk: • Pr(w | Ti) • Pr(D | Ti) • Pr(Ti)
Latent Dirichlet Allocation • The generative process: • Choose doc length N ~ Poisson(mu). • Choose topic-proportion vector Theta ~ Dir(alpha). • For each word position n in 1:N: • Choose topic zn ~ Multinomial(Theta). • Choose word wn from Pr(wn | zn, Beta).
Latent Dirichlet Allocation • Estimation: calculate the estimates via iterative methods, e.g. MCMC / Gibbs sampling.
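The generative story is easy to make concrete. Below is a minimal numpy sketch of it; k, the vocabulary size, mu, and the prior values are illustrative choices, not the paper's settings.

```python
# A minimal sketch of LDA's generative story; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, vocab_size, mu = 5, 1000, 50                 # topics, vocabulary, mean doc length
alpha = np.full(k, 0.1)                         # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(vocab_size, 0.01), size=k)  # k topic-word distributions

def generate_document():
    n = rng.poisson(mu)                         # choose doc length N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)                # choose topic mixture Theta ~ Dir(alpha)
    doc = []
    for _ in range(n):
        z = rng.choice(k, p=theta)              # choose topic z_n ~ Multinomial(Theta)
        doc.append(rng.choice(vocab_size, p=beta[z]))  # choose word w_n ~ Pr(w | z_n, Beta)
    return doc

corpus = [generate_document() for _ in range(10)]
```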
Proposed Algorithm
[Diagram: Full corpus → exclude weakly topical documents → Reduced corpus → Train the model, yielding Pr(w | T), Pr(D | T), Pr(T) → Inference projects these estimates back onto the Full corpus.]
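For concreteness, here is a sketch of that train-then-infer pipeline using gensim's LdaModel; the library choice, the toy records, and the topic count are assumptions for illustration (the assessment arithmetic later in the deck implies 100 topics), not the paper's implementation.

```python
# Sketch of the pipeline in the diagram, assuming gensim's LdaModel.
# The records below are toy stand-ins for harvested metadata.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

reduced_docs = [["quilt", "textile", "ohio"], ["map", "county", "survey"]]
full_docs = reduced_docs + [["page", "page", "page"]]  # incl. a "stop document"

dictionary = Dictionary(reduced_docs)                  # vocabulary from the reduced corpus
bow_reduced = [dictionary.doc2bow(d) for d in reduced_docs]

# Train only on the reduced corpus (weakly topical documents excluded).
lda = LdaModel(bow_reduced, id2word=dictionary, num_topics=2, random_state=0)

# LDA's inference step then assigns topic mixtures to every record,
# including the "stop documents" that were held out of training.
doc_topics = [lda.get_document_topics(dictionary.doc2bow(d)) for d in full_docs]
```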
Documents’ Topical Strength • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. • Proposal: Improve the induced topic model by removing “weakly topical” documents during training. • After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
Identifying “Stop Documents” • Time at which documents enter a repository is often informative (e.g. bulk uploads). • Score each document against the collection: log Pr(di | MC), where MC is the collection language model and di is the words comprising the ith document.
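A minimal sketch of this scoring step, assuming an add-one-smoothed unigram collection model; the smoothing choice is my assumption, since the slide specifies only log Pr(di | MC).

```python
# Score each record by its log-likelihood under the collection model M_C.
import math
from collections import Counter

def collection_model(docs):
    """Add-one smoothed unigram language model of the whole collection."""
    counts = Counter(w for d in docs for w in d)
    total, vocab = sum(counts.values()), len(counts)
    return lambda w: (counts[w] + 1) / (total + vocab)

def log_likelihood(doc, model):
    return sum(math.log(model(w)) for w in doc)

corpus = [["civil", "war", "letter"], ["civil", "war", "diary"], ["quilt"]]
m_c = collection_model(corpus)
scores = [log_likelihood(d, m_c) for d in corpus]   # one score per record
```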
Identifying “Stop Documents” • Our paper outlines an algorithm for accomplishing this. • Intuition: • Given a document di, decide whether it is part of a “run” of near-identical records. • Remove all records that occur within a run. • The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%); a rough sketch follows below.
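The sketch below captures only that intuition, not the paper's exact algorithm: gaps between consecutive collection-model scores (in harvest order) are standardized, and a gap counts as "near-identical" when it falls below the lower normal tail at confidence tol. The median/MAD standardization is my assumption.

```python
# Rough sketch of run detection over scores ordered by harvest time.
import statistics
from scipy.stats import norm

def find_run_members(scores, tol=0.95):
    """Indices of records whose gap to a neighbor is improbably small."""
    gaps = [abs(a - b) for a, b in zip(scores, scores[1:])]
    med = statistics.median(gaps)
    mad = statistics.median(abs(g - med) for g in gaps)
    sigma = 1.4826 * mad or 1e-9            # normal-consistent scale; avoid /0
    in_run = set()
    for i, g in enumerate(gaps):
        if (g - med) / sigma < -norm.ppf(tol):   # e.g. z < -1.645 at tol = 95%
            in_run.update((i, i + 1))            # both records in the pair
    return sorted(in_run)

# Toy scores; the flat stretch mimics a bulk upload of near-identical records.
scores = [-330.0, -324.9, -320.2, -314.9, -314.89, -314.91, -314.90, -310.0]
print(find_run_members(scores))             # -> [3, 4, 5, 6]
```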
Experimental Assessment • Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora? • Intrusion detection: • Find the 10 most probable words for topic Ti • Replace one of these 10 with a word chosen from the corpus with uniform probability. • Ask human assessors to identify the “intruder” word.
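Constructing one intrusion task is mechanical; here is a sketch, where the function name and the dict-shaped topic input are hypothetical.

```python
# Build one word-intrusion question for a topic.
import random

def make_intrusion_task(topic_word_probs, vocabulary, rng=random.Random(0)):
    """topic_word_probs maps word -> Pr(w | T); vocabulary is the corpus vocab."""
    top10 = [w for w, _ in sorted(topic_word_probs.items(),
                                  key=lambda kv: kv[1], reverse=True)[:10]]
    # draw the intruder uniformly from the rest of the corpus vocabulary
    intruder = rng.choice([w for w in vocabulary if w not in top10])
    shown = list(top10)
    shown[rng.randrange(len(shown))] = intruder   # replace one top word
    rng.shuffle(shown)
    return shown, intruder                        # assessors must spot the intruder
```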
Experimental Assessment • For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models. • i.e. 20 assessors × 2 models × 100 topics = 4,000 assessments. • Let Asi be the percent of workers who correctly found the intruder in the ith topic of the sampled model, and Ari the analogue for the raw model. • Testing the alternative hypothesis Asi > Ari yields p < 0.001.
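The slide does not name the test, so the sketch below assumes a one-sided paired t-test over the per-topic accuracies; the arrays are placeholder data, not the paper's results.

```python
# One-sided paired comparison of per-topic intruder-detection accuracy.
from scipy.stats import ttest_rel

a_sampled = [0.70, 0.65, 0.80, 0.75, 0.60]   # A_s, per topic (placeholder)
a_raw     = [0.55, 0.60, 0.65, 0.70, 0.50]   # A_r, per topic (placeholder)

t_stat, p_value = ttest_rel(a_sampled, a_raw, alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4g}")
```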
Experimental Assessment • For each topic Ti, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
Current & Future Work • Testing breadth of coverage • Assessing the value of induced topics • Topic information for document priors in the language modeling IR framework [next slide] • Massive document expansion for improved language model estimation [under review]
Thank You Miles Efron Peter Organisciak Katrina Fenlon Graduate School of Library & Information Science University of Illinois, Urbana-Champaign ASIST 2011 New Orleans, LA October 10, 2011