Diversity in search: what, how, and what for?

Bettina Berendt, Dept. of Computer Science, KU Leuven

Thanks to • Sebastian Kolbe-Nusser • Anett Kralisch • Siegfried Nijssen • Ilija Subašić • Mathias Verbeke • Hugo Zaragoza • ...

Diversity in natural language • diverse (s#2), various: distinctly dissimilar or unlike ... • diversity (s#1), ..., variety: noticeable heterogeneity (WordNet) • “the fact that members of a set are different from one another”

Why is diversity interesting for search? • “People like to see a range of different, non-redundant things/views/etc.” • “Different people search differently.” → How? → When / under what conditions? → (What) can we do?

What is diverse? • Documents • The relevance of a document must be determined considering the documents appearing before it (Goffman, 1964) • E.g. Maximal Marginal Relevance, MMR (Carbonell & Goldstein, 1998) • Many further developments, e.g. for images • Presentation choices, e.g. re-ranking or clustering?
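
The MMR idea mentioned on this slide can be sketched as a greedy re-ranking: each next document is chosen by trading off its relevance to the query against its similarity to the documents already selected. This is a minimal sketch, not the original implementation; the sparse term-weight dictionaries, the cosine similarity, the function names, and the toy documents are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_rank(query, docs, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate that maximises
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected)."""
    selected = []
    candidates = list(docs)
    while candidates and len(selected) < k:
        def score(d):
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (cosine(docs[d], docs[s]) for s in selected), default=0.0
            )
            return lam * cosine(docs[d], query) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: d1 and d2 are near-duplicates, d3 covers another sense.
docs = {
    "d1": {"jaguar": 1.0, "car": 1.0},
    "d2": {"jaguar": 1.0, "car": 1.0, "speed": 0.2},
    "d3": {"jaguar": 1.0, "cat": 1.0},
}
query = {"jaguar": 1.0, "car": 0.5}

by_relevance = mmr_rank(query, docs, k=3, lam=1.0)   # relevance only
diversified = mmr_rank(query, docs, k=3, lam=0.5)    # relevance vs. novelty
```

With lam = 1.0 the ranking is by relevance alone, so the near-duplicate d2 follows d1; with lam = 0.5 the less relevant but non-redundant d3 moves ahead of d2.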

What is diverse? • Documents • People • “The term diversity is a form of euphemistic shorthand to describe differences in racial or ethnic classifications, age, gender, religion, philosophy, physical abilities, socioeconomic background, sexual orientation, gender identity, intelligence, mental health, physical health, genetic attributes, behavior, attractiveness, place of origin, cultural values, or political view as well as other identifying features.” http://en.wikipedia.org/wiki/Diversity_(politics)

What is diverse? • Documents • People • Knowledge and its articulations (= documents in a wider sense?!) • “Knowledge and its articulations are strongly influenced by diversity in, e.g., cultural backgrounds, schools of thought, geographical contexts.” • “LivingKnowledge will study the effect of diversity and time on opinions and bias.” • “The goal [is] to improve navigation and search in very large multimodal datasets (e.g., the Web itself).”

How we got here

Why this talk? Towards an integrated understanding of diversity

The impact of linguistic diversity on Web usage and thereby on the Web • Or: Why are non-English languages under-represented on the Web? • A web-analysis approach asking for underlying • cognitive-linguistic • behavioural • attitude factors

A simple expectation of how much content exists in which language

But: Dynamics of content creation, link setting, link following, attitudes, and use • People create less content • People link less to content • People use links less • People think the content is bad ... and use it less → Under-representation!

Underlying data and methods • Database of countries and official languages • Distribution comparisons between • worldwide proportions of native speakers of different languages • worldwide distribution of servers registered by country • crawler analysis of links to a multilingual site S • log analysis assigning each session a native language • log analysis of (user native language) – (S-entry-page language) • Questionnaire/TAM analysis of native and non-native users of S: • usability, ease of use, competence in English, beliefs about availability of content in native language

Some questions • Does one find such dynamics also in search engines? • What factors stop or reverse such language-marginalisation trends? • Critical mass? • Laws? • Volunteers? • Did / can Web 2.0/3.0 change this? • (When) is it better to work without pre-defined labels for users?

Part 2: An approach that ...

Motivation (1): Diversity of people is ... • Speaking different languages (etc.) → localisation / internationalisation • Having different abilities → accessibility • Liking different things → collaborative filtering • Structuring the world in different ways → ?

Motivation (2): Diversity-aware applications ... • Must have a (formal) notion of diversity • Can follow a • “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic? • “customization approach” → show the space of diversity → allow choice / raise awareness / semi-automatic!

Measuring grouping diversity • Diversity = 1 – similarity = 1 – normalized mutual information (NMI) • [Figure: example groupings, e.g. by colour, with NMI = 0 and NMI = 0.35]

Measuring user diversity • “How similarly do two users group documents?” • For each query q, compare their groupings gr, giving gdiv(A,B,q) • For various queries: aggregate, giving gdiv(A,B)
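
The measure described on these two slides (diversity = 1 – NMI per query, then aggregated over queries) could be computed roughly as follows. This is a sketch under stated assumptions: the slides do not say which NMI normalisation or which aggregate was used, so the arithmetic-mean normalisation 2·I(A;B) / (H(A) + H(B)), the mean over shared queries, and all function and variable names are illustrative choices only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(g1, g2):
    """Normalised mutual information of two groupings of the same documents.
    g1, g2: dicts mapping document ID -> group label."""
    if not g1 or g1.keys() != g2.keys():
        raise ValueError("groupings must cover the same, non-empty document set")
    docs = list(g1)
    a = [g1[d] for d in docs]
    b = [g2[d] for d in docs]
    n = len(docs)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    h = entropy(a) + entropy(b)
    # Assumption: two trivial single-group groupings count as identical.
    return 2 * mi / h if h > 0 else 1.0


def gdiv_query(g1, g2):
    """Grouping diversity of two users for one query: 1 - NMI."""
    return 1.0 - nmi(g1, g2)


def gdiv(groupings_a, groupings_b):
    """Aggregate over the queries both users grouped.
    Assumption: the aggregate is the arithmetic mean."""
    shared = set(groupings_a) & set(groupings_b)
    if not shared:
        raise ValueError("no shared queries")
    return sum(gdiv_query(groupings_a[q], groupings_b[q]) for q in shared) / len(shared)


# Hypothetical example: four documents grouped by colour vs. by shape.
by_colour = {1: "red", 2: "red", 3: "blue", 4: "blue"}
by_shape = {1: "square", 2: "circle", 3: "square", 4: "circle"}

user_a = {"q1": by_colour, "q2": by_colour}
user_b = {"q1": by_shape, "q2": by_colour}
```

Here the colour and shape groupings are statistically independent, so their NMI is 0 and their diversity is 1; identical groupings give diversity 0 regardless of how the groups are labelled.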

... and now: the application domain ... that’s only the 1st step!

Workflow • Query • Automatic clustering • Manual regrouping • Re-use • Learn + present way(s) of grouping • Transfer the constructed concepts

Concepts • Extension • the instances in a group • Intension • Ideally: “squares vs. circles” • Pragmatically: defined via a classifier

Step 1: Retrieve • CiteSeerX via OAI • Output: a set of • document IDs • document details • their texts

Step 2: Cluster • “the classic bibliometric solution” • CiteseerCluster: • Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations • Clustering algorithm: k-means, hierarchical • Damilicious: phrases → Lingo • How to choose the “best”? • Experiments: Lingo better than k-means at reconstruction and extension-over-time
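
Two of the similarity measures named on this slide have simple count-based definitions, sketched below. The data layout (a dict from document ID to the set of IDs it cites), the function names, and the toy citation data are assumptions for illustration; how CiteseerCluster normalises or combines these counts is not stated here.

```python
def bibliographic_coupling(refs, a, b):
    """Number of references that documents a and b have in common."""
    return len(refs.get(a, set()) & refs.get(b, set()))


def co_citation(refs, a, b):
    """Number of documents whose reference lists contain both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)


# Hypothetical citation data: document ID -> set of cited document IDs.
refs = {
    "p1": {"x", "y"},
    "p2": {"y", "z"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2", "x"},
}
```

In this toy data p1 and p2 share one reference (y) and are cited together by two documents (p3 and p4).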

Step 3 (a): Re-organise & work on document groups

Step 3 (b): Visualising document groups

Steps 4+5: Re-use • Basic idea: • learn a classifier from the final grouping (Lingo phrases) • apply the classifier to a new search result → “re-use semantics” • Whose grouping? • One’s own • Somebody else’s • Which search result? • “the same” (same query, structuring by somebody else) • “more of the same” (same query, later time → more documents) • “related” (... measured how? ...) • arbitrary
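
The basic idea of learning from a final grouping and applying it to a new result set can be illustrated with a very simple stand-in classifier. This is a sketch only: the slide does not specify the classifier, so the nearest-centroid rule over word counts used here, the whitespace tokenisation, the function names, and the example texts are all assumptions (the tool itself works with Lingo phrases). Documents that match no group fall back to a remainder group, here called “Other topics”.

```python
from collections import Counter
from math import sqrt


def tokens(text):
    """Naive tokenisation: lower-case and split on whitespace."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity of two term-count mappings."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def learn(grouping, texts):
    """Learn one term-count profile per group.
    grouping: group label -> list of document IDs; texts: document ID -> text."""
    return {
        label: sum((Counter(tokens(texts[d])) for d in ids), Counter())
        for label, ids in grouping.items()
    }


def regroup(model, new_texts, fallback="Other topics"):
    """Assign each new document to the most similar learned group,
    or to the fallback group if it shares no terms with any group."""
    result = {}
    for doc_id, text in new_texts.items():
        vec = Counter(tokens(text))
        best = max(model, key=lambda g: cosine(model[g], vec))
        result[doc_id] = best if cosine(model[best], vec) > 0 else fallback
    return result


# Hypothetical final grouping produced by one user for an earlier query.
texts = {
    "a": "frequent itemset mining",
    "b": "association rule mining",
    "c": "rfid tag privacy",
    "d": "rfid reader protocol",
}
grouping = {"Pattern mining": ["a", "b"], "RFID": ["c", "d"]}

# A new search result to be structured with the learned groups.
new_texts = {
    "e": "mining frequent patterns",
    "f": "rfid privacy attacks",
    "g": "quantum chemistry",
}
model = learn(grouping, texts)
assigned = regroup(model, new_texts)
```

The same `model` could come from one's own grouping or from somebody else's, which is the distinction the slide draws under “Whose grouping?”.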

Visualising user diversity (1) • Simulated users with different strategies • U0: did not change anything (“System”) • U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings • U2: attempted to move everything that did not fit well into the remainder group “Other topics”, and to produce a better fit; 10 regroupings • U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings • U4: regrouping by author and institution; 5 regroupings → 5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling

Visualising user diversity (2) • [Figure: panels labelled “Data mining”, “RFID”, “Web mining”, and “aggregated using gdiv(A,B)”]

Evaluating the application • Clustering only: Does it generate meaningful document groups? • yes (tradition in bibliometrics) – but: data? • Small expert evaluation of CiteseerCluster • Clustering & regrouping • End-user experiment with CiteseerCluster • 5-person formative user study of Damilicious

The Damilicious tool: Summary and (some) open questions • A tool that helps users in sense-making, exploring diversity, and re-using semantics • Diversity measures when queries and result sets are different? • How best to present diversity? • How to integrate into an environment supporting user and community contexts? • Incentives to use the functionalities? • How to find the best balance between similarity and diversity? • Which measures of grouping diversity are most meaningful? • Extensional? • Intensional? Structure-based? Hybrid? (cf. ontology matching) • Which other sources of user diversity? • Diversity and relevance: can we learn from user-dependent relevance judgements?

Some lessons learned (or questions raised?) • We need to embrace diversity. • We need to take into account • the diversity of documents / knowledge • the diversity of people • the diversity of diversity. • We need to be clear about what we mean. • We need to ask whether / when “striving for diversity” is in itself A Good Thing. • We need to ask whether / when “raising awareness of diversity” is in itself A Good Thing. • Thanks!
