Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search Mathias Verbeke, Bettina Berendt, Siegfried Nijssen Dept. Computer Science, KU Leuven
Agenda
• Motivation
  • Diversity
  • Diversity-aware tools
  • (Our) context
• Main part
  • Measures of diversity
  • Tool
• Outlook
Motivation (1): Diversity is ...
• Speaking different languages (etc.) → localisation / internationalisation
• Having different abilities → accessibility
• Liking different things → collaborative filtering
• Structuring the world in different ways → ?
Motivation (2): Diversity-aware applications ...
• Must have a (formal) notion of diversity
• Can follow a
  • “personalization approach” → adapt to the user’s value on the diversity variable(s). Transparently? Is this paternalistic?
  • “customization approach” → show the space of diversity, allow choice / semi-automatic!
(Our) Context
• Diversity and Web usage: language, culture
• Family of tools focussing on interactive sense-making helped by data mining
  • PORPOISE: global and local analysis of news and blogs + their relations
  • STORIES: finding + visualisation of “stories” in news
  • CiteseerCluster: literature search + sense-making
  • Damilicious: CiteseerCluster + re-use/transfer of semantics + diversity
Measuring grouping diversity
• Diversity = 1 – similarity = 1 – normalized mutual information (NMI)
• [Figure: example groupings of the same items, one of them by colour; the examples shown have NMI = 0 and NMI = 0.35]
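The measure on this slide can be sketched in a few lines. The slide does not say how the mutual information is normalised, so the sketch below assumes the geometric mean of the two entropies; the function names and the handling of single-group groupings are likewise illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same documents."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def grouping_diversity(a, b):
    """Diversity = 1 - NMI, with NMI = I(A;B) / sqrt(H(A) * H(B)).

    The sqrt normalisation is an assumption; a[i] and b[i] are the group
    labels that the two groupings assign to the same document i.
    """
    if len(a) != len(b) or not a:
        raise ValueError("groupings must cover the same non-empty document list")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 0.0  # both groupings use a single group: treated as identical
    if h_a == 0.0 or h_b == 0.0:
        return 1.0  # one trivial grouping carries no information about the other
    return 1.0 - mutual_information(a, b) / sqrt(h_a * h_b)


same = grouping_diversity(["x", "x", "y", "y"], ["p", "p", "q", "q"])
unrelated = grouping_diversity(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

Two groupings that differ only in their label names come out with a diversity of (approximately) 0, while two statistically independent groupings come out at 1.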
Measuring user diversity
• “How similarly do two users group documents?”
• For each query q, consider their groupings gr: gdiv(A,B,q) = 1 – NMI between the groupings gr of users A and B for q
• For various queries: aggregate into gdiv(A,B)
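The aggregation step can be illustrated as follows. The slide does not state which aggregate is used, so the mean over the queries both users have worked on is an assumption here. The per-query measure is passed in as a function; `pair_disagreement` is only a stand-in used to keep the example short (it is not the NMI-based measure), and all names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pair_disagreement(g1, g2):
    """Stand-in per-query measure: share of document pairs that one grouping
    puts in the same group and the other one separates."""
    docs = sorted(set(g1) & set(g2))
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    differing = sum((g1[d] == g1[e]) != (g2[d] == g2[e]) for d, e in pairs)
    return differing / len(pairs)


def user_diversity(groupings_a, groupings_b, gdiv_q):
    """Aggregate per-query diversity over the queries both users grouped.

    groupings_a / groupings_b map a query to that user's grouping, itself a
    dict from document ID to group label. Averaging is an assumed aggregate.
    """
    shared = sorted(set(groupings_a) & set(groupings_b))
    if not shared:
        raise ValueError("the two users share no queries")
    return mean(gdiv_q(groupings_a[q], groupings_b[q]) for q in shared)


user_a = {"q1": {"d1": "a", "d2": "a", "d3": "b"}, "q2": {"d1": "x", "d2": "y"}}
user_b = {"q1": {"d1": "k", "d2": "k", "d3": "m"}, "q2": {"d1": "z", "d2": "z"}}
score = user_diversity(user_a, user_b, pair_disagreement)
```

Here the users agree fully on q1 and disagree on the only document pair of q2, so the averaged score is 0.5. Any per-query measure with the same two-argument shape can be plugged in instead of the stand-in.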
... and now: the application domain ... that’s only the 1st step!
Workflow
• 1. Query
• 2. Automatic clustering
• 3. Manual regrouping
• Re-use
  • 4. Learn + present way(s) of grouping
  • 5. Transfer the constructed concepts
Concepts
• Extension
  • the instances in a group
• Intension
  • Ideally: “squares vs. circles”
  • Pragmatically: defined via a classifier
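One way to picture the extension/intension distinction in code is sketched below. The `Concept` class, its field names and the keyword test used as a classifier are all hypothetical; the slides only say that the intension is defined via a classifier.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Concept:
    label: str
    extension: FrozenSet[str]         # document IDs currently in the group
    intension: Callable[[str], bool]  # classifier: does this text belong?


def extend(concept, new_docs):
    """Apply the intension to unseen documents (ID -> text) and return a
    concept whose extension also contains the accepted ones."""
    accepted = {doc_id for doc_id, text in new_docs.items() if concept.intension(text)}
    return Concept(concept.label, concept.extension | accepted, concept.intension)


# Toy intension: a keyword test standing in for a learned classifier.
rfid = Concept("RFID", frozenset({"d1"}), lambda text: "rfid" in text.lower())
grown = extend(rfid, {"d2": "RFID tag privacy", "d3": "Web usage mining"})
```

The extension is plain data that changes as documents arrive, while the intension is the rule that decides membership and can therefore be carried over to new result sets.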
Step 1: Retrieve
• CiteseerX via OAI
• Output: set of
  • document IDs
  • document details
  • their texts
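A minimal sketch of harvesting via OAI-PMH, using only the Python standard library, might look like this. The base URL is a placeholder (the slides do not give the CiteseerX endpoint), the sample response is made up, and the choice of Dublin Core fields is an assumption. Full texts are not part of an `oai_dc` record and would have to be fetched separately.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    A resumption token, when present, replaces the other arguments.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract ID, title and abstract from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "abstract": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


# Made-up response, only to show the expected shape of the XML.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A made-up title</dc:title>
          <dc:description>A made-up abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai2")  # placeholder endpoint
records = parse_records(SAMPLE)
```

In a real harvest the URL would be fetched (for example with `urllib.request.urlopen`) and the request repeated with each returned resumption token until none is left.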
Step 2: Cluster
• “the classic bibliometric solution”
• CiteseerCluster:
  • Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations
  • Clustering algorithm: k-means, hierarchical
• Damilicious: phrases → Lingo
• How to choose the “best”?
  • Experiments: Lingo better than k-means at reconstruction and extension-over-time
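The two citation-based similarity measures named on this slide can be sketched as below. Normalising the overlaps with the Jaccard coefficient and combining the measures with a fixed weight are assumptions made for illustration; the slides do not specify how CiteseerCluster normalises or combines them.

```python
def jaccard(x, y):
    """Overlap of two sets relative to their union (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def bibliographic_coupling(refs, a, b):
    """Two documents are similar if they cite the same works."""
    return jaccard(refs.get(a, set()), refs.get(b, set()))


def co_citation(refs, a, b):
    """Two documents are similar if the same works cite both of them."""
    citing_a = {doc for doc, cited in refs.items() if a in cited}
    citing_b = {doc for doc, cited in refs.items() if b in cited}
    return jaccard(citing_a, citing_b)


def combined(refs, a, b, weight=0.5):
    """Weighted combination of the two measures (the weight is hypothetical)."""
    return weight * bibliographic_coupling(refs, a, b) + (1 - weight) * co_citation(refs, a, b)


# refs maps each document to the set of documents it cites (toy data).
refs = {
    "p1": {"r1", "r2"},
    "p2": {"r2", "r3"},
    "p3": {"p1", "p2"},
    "p4": {"p1"},
}
coupling = bibliographic_coupling(refs, "p1", "p2")  # share r2 out of r1, r2, r3
cocited = co_citation(refs, "p1", "p2")              # p3 cites both, p4 only p1
```

A pairwise similarity of this kind can then be handed to k-means or a hierarchical clusterer, as listed on the slide.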
Steps 4+5: Re-use
• Basic idea:
  • learn a classifier from the final grouping (Lingo phrases)
  • apply the classifier to a new search result → “re-use semantics”
• Whose grouping?
  • One’s own
  • Somebody else’s
• Which search result?
  • “the same” (same query, structuring by somebody else)
  • “more of the same” (same query, later time → more documents)
  • “related” (... measured how? ...)
  • arbitrary
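The learn-then-apply idea can be sketched with a very simple classifier. The one below builds a word-count profile per group and assigns new documents by cosine similarity; this is a stand-in chosen for brevity, not the classifier over Lingo phrases that the slide refers to. Function names and the fallback label are illustrative.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lower-cased alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(u, v):
    """Cosine similarity of two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def learn(grouping, texts):
    """Learn one term profile per group from a final grouping (doc ID -> label)."""
    profiles = {}
    for doc_id, label in grouping.items():
        profiles.setdefault(label, Counter()).update(tokens(texts[doc_id]))
    return profiles


def reuse(profiles, new_texts, fallback="Other topics"):
    """Assign each new document to its most similar group, or to the fallback
    group when it shares no term with any profile."""
    result = {}
    for doc_id, text in new_texts.items():
        vec = Counter(tokens(text))
        best = max(profiles, key=lambda label: cosine(vec, profiles[label]))
        result[doc_id] = best if cosine(vec, profiles[best]) > 0 else fallback
    return result


grouping = {"d1": "RFID", "d2": "Web mining"}
texts = {"d1": "rfid tags and privacy", "d2": "web usage mining of logs"}
new_result = {
    "n1": "privacy of rfid readers",
    "n2": "mining web logs",
    "n3": "quantum chemistry",
}
assigned = reuse(learn(grouping, texts), new_result)
```

The same two functions cover all the cases on the slide: whose grouping is re-used only changes the `grouping` passed to `learn`, and which search result is structured only changes `new_result`.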
Visualising user diversity (1)
Simulated users with different strategies
• U0: did not change anything (“System”)
• U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
• U2: attempted to move everything that did not fit well into the remainder group “Other topics”, and to produce a better fit; 10 regroupings
• U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings
• U4: regrouping by author and institution; 5 regroupings
5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling
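How a diversity matrix turns into a 2-D picture can be sketched with classical (Torgerson) multidimensional scaling. The slides do not say which MDS variant was used, so this choice is an assumption; the eigenvectors are found by plain power iteration to avoid external libraries, and components with non-positive eigenvalues are simply dropped. The 3×3 input is toy data, not the 5×5 matrix from the talk.

```python
from math import sqrt


def classical_mds(dist, dims=2, iterations=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    def times_b(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [1.0 + i for i in range(n)]  # deterministic, non-constant start vector
        eigval = 0.0
        for _ in range(iterations):      # power iteration for the dominant eigenpair
            w = times_b(v)
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:             # nothing left to extract
                eigval = 0.0
                break
            v = [x / norm for x in w]
            eigval = sum(vi * wi for vi, wi in zip(v, times_b(v)))
        if eigval <= 1e-12:
            break                        # remaining coordinates stay at zero
        for i in range(n):
            coords[i][k] = v[i] * sqrt(eigval)
        for i in range(n):               # deflate: remove the component just found
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy distances of three points lying on a line at positions 0, 1 and 3.
toy = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
points = classical_mds(toy)
```

For a matrix of pairwise user diversities, each row of the result is one user's position in the plot; users who group documents similarly end up close together.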
Visualising user diversity (2)
• [Figure: user-diversity plots; panel labels: Data mining, RFID, Web mining; aggregated using gdiv(A,B)]
Evaluating the application
• Clustering only: Does it generate meaningful document groups?
  • yes (tradition in bibliometrics) – but: data?
  • Small expert evaluation of CiteseerCluster
• Clustering & regrouping
  • End-user experiment with CiteseerCluster
  • 5-person formative user study of Damilicious
Summary and (some) open questions
• Damilicious: a tool that helps users in sense-making, exploring diversity, and re-using semantics
• Diversity measures when queries and result sets are different?
• How best to present diversity?
• How to integrate into an environment supporting user and community contexts (e.g., Niederée et al. 2005)?
• Incentives to use the functionalities?
• How to find the best balance between similarity and diversity?
• Which measures of grouping diversity are most meaningful?
  • Extensional?
  • Intensional? Structure-based? Hybrid? (cf. ontology matching)
• Which other sources of user diversity?
Thanks!