Mathias Verbeke, Bettina Berendt, Siegfried Nijssen Dept. Computer Science, KU Leuven

Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search

Agenda • Motivation: Diversity → Diversity-aware tools → (our) Context • Main part: Measures of diversity → Tool • Outlook

Motivation (1): Diversity is ... • Speaking different languages (etc.) → localisation / internationalisation • Having different abilities → accessibility • Liking different things → collaborative filtering • Structuring the world in different ways → ?

Motivation (2): Diversity-aware applications ... • Must have a (formal) notion of diversity • Can follow a • “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic? • “customization approach” → show the space of diversity → allow choice / semi-automatic!

(Our) Context • Diversity and Web usage: language, culture • Family of tools focussing on interactive sense-making helped by data mining • PORPOISE: global and local analysis of news and blogs + their relations • STORIES: finding + visualisation of “stories” in news • CiteseerCluster: literature search + sense-making • Damilicious: CiteseerCluster + re-use/transfer of semantics + diversity

Measuring grouping diversity • Diversity = 1 – similarity = 1 – normalized mutual information (NMI) • [Figure: example groupings (e.g. by colour) with NMI = 0 and NMI = 0.35]
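The slide names the measure but the transcript does not preserve its formula. As a worked definition, one common form of NMI for two groupings A and B of the same n documents is given below. The square-root normalisation is an assumption here; other variants divide by the mean or the maximum of the two entropies, and the slides do not say which one was used.

```latex
% p(a), p(b): fraction of documents in group a of A / group b of B
% p(a,b):     fraction of documents in both a and b
\begin{align*}
H(A)   &= -\sum_{a} p(a)\,\log p(a) \\
I(A;B) &= \sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)} \\
\mathrm{NMI}(A,B) &= \frac{I(A;B)}{\sqrt{H(A)\,H(B)}} \\
\text{diversity}(A,B) &= 1 - \mathrm{NMI}(A,B)
\end{align*}
```

Under this definition two groupings that are identical up to renaming of the groups have diversity 0, and two statistically independent groupings have diversity 1.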

Measuring user diversity • “How similarly do two users group documents?” • For each query q, compare their groupings gr: gdiv(A,B,q) • For various queries: aggregate into gdiv(A,B)
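The two steps on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: groupings are represented as dictionaries from document ID to group label, diversity is taken as 1 minus NMI with square-root normalisation, and the aggregation over queries is a plain mean (the slide only says "aggregate"). The function names are illustrative.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same items.

    Assumption: normalisation by sqrt(H(A) * H(B)).
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0 or h_b == 0:
        # A single-group labeling carries no information; treat two such
        # labelings as identical and anything else as unrelated.
        return 1.0 if h_a == h_b else 0.0
    return mutual_info / sqrt(h_a * h_b)


def gdiv(grouping_a, grouping_b):
    """Per-query diversity: 1 - NMI over the documents both users grouped."""
    shared = sorted(set(grouping_a) & set(grouping_b))
    if not shared:
        raise ValueError("the two groupings share no documents")
    similarity = nmi([grouping_a[d] for d in shared],
                     [grouping_b[d] for d in shared])
    return min(1.0, max(0.0, 1.0 - similarity))  # clamp rounding noise


def gdiv_aggregate(groupings_a, groupings_b):
    """Aggregate diversity: mean of gdiv over shared queries (an assumption)."""
    queries = sorted(set(groupings_a) & set(groupings_b))
    if not queries:
        raise ValueError("the two users share no queries")
    return sum(gdiv(groupings_a[q], groupings_b[q]) for q in queries) / len(queries)
```

Here `groupings_a` and `groupings_b` map a query string to one user's grouping for that query. Restricting the comparison to shared documents is one possible design choice; the slides leave open how differing result sets should be handled.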

... and now: the application domain ... that’s only the 1st step!

Workflow • Query • Automatic clustering • Manual regrouping • Re-use • Learn + present way(s) of grouping • Transfer the constructed concepts

Concepts • Extension • the instances in a group • Intension • Ideally: “squares vs. circles” • Pragmatically: defined via a classifier

Step 1: Retrieve • CiteseerX via OAI • Output: set of • document IDs, • document details • their texts

Step 2: Cluster • “the classic bibliometric solution” • CiteseerCluster: • Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations • Clustering algorithm: k-means, hierarchical • Damilicious: phrases → Lingo • How to choose the “best”? • Experiments: Lingo better than k-means at reconstruction and extension-over-time

Step 3 (a): Re-organise & work on document groups

Step 3 (b): Visualising document groups

Steps 4+5: Re-use • Basic idea: • learn a classifier from the final grouping (Lingo phrases) • apply the classifier to a new search result → “re-use semantics” • Whose grouping? • One’s own • Somebody else’s • Which search result? • “the same” (same query, structuring by somebody else) • “more of the same” (same query, later time → more docs) • “related” (... measured how? ...) • arbitrary
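The basic idea of learning a classifier from a final grouping and applying it to a new search result can be illustrated as follows. This is a hedged sketch only: it uses a small multinomial naive Bayes model over whitespace tokens, which is an assumption made for illustration. The slides say only that a classifier is learned from Lingo phrases, not which classifier or feature representation Damilicious uses. All names and the toy documents are hypothetical.

```python
from collections import Counter
from math import log


def tokenize(text):
    """Lower-case whitespace tokenisation (a stand-in for phrase features)."""
    return text.lower().split()


def train(grouped_docs):
    """Learn per-group word counts from a grouping.

    grouped_docs: dict mapping group label -> list of document texts.
    Empty groups are skipped because they provide no evidence.
    """
    word_counts, doc_counts, vocab = {}, {}, set()
    for group, docs in grouped_docs.items():
        if not docs:
            continue
        word_counts[group] = Counter()
        doc_counts[group] = len(docs)
        for doc in docs:
            tokens = tokenize(doc)
            word_counts[group].update(tokens)
            vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Assign a new document to the most probable learned group."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_group, best_score = None, float("-inf")
    for group, counts in model["word_counts"].items():
        score = log(model["doc_counts"][group] / total_docs)  # group prior
        total_words = sum(counts.values())
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                # Laplace-smoothed word likelihood
                score += log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_group, best_score = group, score
    return best_group


# Hypothetical final grouping produced by one user for one query.
final_grouping = {
    "Web mining": ["web usage mining logs", "mining web click streams"],
    "RFID": ["rfid tag data streams", "rfid sensor tracking"],
}
model = train(final_grouping)
# "Re-use": structure a new search result with the learned groups.
new_result = ["rfid tag tracking", "web mining logs"]
regrouped = {doc: classify(model, doc) for doc in new_result}
```

In this reading, the trained model plays the role of the pragmatic intension of each group, and the same `classify` call covers both one's own grouping and somebody else's, depending on whose `final_grouping` was used for training.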

Visualising user diversity (1) Simulated users with different strategies • U0: did not change anything (“System”) • U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings • U2: attempted to move everything that did not fit well into the remainder group “Other topics”, and to produce a better fit; 10 regroupings • U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings • U4: regrouping by author and institution; 5 regroupings → 5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling
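The last step, turning a matrix of pairwise diversities into a 2-D map of users, can be sketched as below. The slides do not state which multidimensional scaling variant was used; this sketch assumes classical (Torgerson) MDS and finds the leading eigenvectors by power iteration so that it needs only the standard library. The function name and the example matrix are hypothetical, and negative eigenvalues (possible when the diversities are not Euclidean distances) are simply clipped to zero.

```python
import random
from math import sqrt


def classical_mds(dist, dims=2, iters=200, seed=0):
    """Embed n items in `dims` dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant eigenvector of the current B
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to explain in B
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = sqrt(eigval) if eigval > 0 else 0.0   # clip negative eigenvalues
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate: remove the component just extracted
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Hypothetical diversity matrix for three users (not data from the study).
example_gdiv = [
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.6],
    [0.7, 0.6, 0.0],
]
user_positions = classical_mds(example_gdiv)
```

Each row of `user_positions` is an (x, y) point for one user; users with similar groupings land close together, which is the kind of picture the 5×5 matrix for U0 to U4 would feed into.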

Visualising user diversity (2) • aggregated using gdiv(A,B) • [Figure labels: Data mining, RFID, Web mining]

Evaluating the application • Clustering only: Does it generate meaningful document groups? • yes (tradition in bibliometrics) – but: data? • Small expert evaluation of CiteseerCluster • Clustering & regrouping • End-user experiment with CiteseerCluster • 5-person formative user study of Damilicious

Summary and (some) open questions • Damilicious: a tool that helps users in sense-making, exploring diversity, and re-using semantics • Diversity measures when queries and result sets are different? • How best to present diversity? • How to integrate into an environment supporting user and community contexts (e.g., Niederée et al. 2005)? • Incentives to use the functionalities? • How to find the best balance between similarity and diversity? • Which measures of grouping diversity are most meaningful? • Extensional? • Intensional? Structure-based? Hybrid? (cf. ontology matching) • Which other sources of user diversity? Thanks!
