Diversity in search: what, how, and what for?

This talk explores the concept of diversity in search and its importance. It discusses the impact of linguistic diversity on web usage and looks at factors that contribute to language marginalization trends in search engines. The talk also examines the need for diversity-aware applications and methods for measuring grouping diversity.

Presentation Transcript


  1. Diversity in search: what, how, and what for? Bettina Berendt Dept. Computer Science, KU Leuven

  2. Thanks to • Sebastian Kolbe-Nusser • Anett Kralisch • Siegfried Nijssen • Ilija Subašić • Mathias Verbeke • Hugo Zaragoza • ...

  3. Diversity in natural language • diverse (s#2), various: distinctly dissimilar or unlike ... • diversity (s#1), ..., variety: noticeable heterogeneity (WordNet) • “the fact that members of a set are different from one another”

  4. Why is diversity interesting for search? • “People like to see a range of different, non-redundant things/views/etc.” • “Different people search differently.” → How? → When / under what conditions? → (What) can we do?

  5. What is diverse? • Documents • the relevance of a document must be determined considering the documents appearing before it (Goffman, 1964) • E.g. MMR (Carbonell & Goldstein, 1998) • Many further developments, e.g. for images • Presentation choices, e.g. re-ranking or clustering?
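A minimal sketch of the MMR idea named on this slide: each next result is chosen by trading its relevance off against its similarity to the results already selected. The function name `mmr_rerank`, the trade-off weight `lam`, the Jaccard similarity, and the toy documents are illustrative assumptions, not taken from the talk.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets (an assumed similarity measure)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    relevance:  dict mapping doc id -> relevance score
    similarity: function (doc_a, doc_b) -> similarity in [0, 1]
    lam:        1.0 ranks by relevance only; lower values favour novelty
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(d):
            # Penalise a candidate by its closest already-selected document.
            max_sim = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * max_sim
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy data: "a" and "b" are near-duplicates, "c" covers another sense.
terms = {"a": {"jaguar", "car"}, "b": {"jaguar", "car", "speed"}, "c": {"jaguar", "cat"}}
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
sim = lambda x, y: jaccard(terms[x], terms[y])

print(mmr_rerank(relevance, sim, k=3, lam=1.0))  # pure relevance order
print(mmr_rerank(relevance, sim, k=3, lam=0.5))  # the near-duplicate "b" is demoted
```

With `lam=0.5` the less relevant but dissimilar document "c" moves ahead of the near-duplicate "b", which is the re-ranking effect the slide refers to.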

  6. What is diverse? • Documents • People • “The term diversity is a form of euphemistic shorthand to describe differences in racial or ethnic classifications, age, gender, religion, philosophy, physical abilities, socioeconomic background, sexual orientation, gender identity, intelligence, mental health, physical health, genetic attributes, behavior, attractiveness, place of origin, cultural values, or political view as well as other identifying features.” http://en.wikipedia.org/wiki/Diversity_(politics)

  7. What is diverse? • Documents • People • Knowledge and its articulations (= documents in a wider sense?!) • “Knowledge and its articulations are strongly influenced by diversity in, e.g., cultural backgrounds, schools of thought, geographical contexts.” • “LivingKnowledge will study the effect of diversity and time on opinions and bias.” • “The goal [is] to improve navigation and search in very large multimodal datasets (e.g., the Web itself).”

  8. How we got here

  9. How we got here

  10. How we got here

  11. How we got here

  12. Why this talk?

  13. Why this talk? Towards an integrated understanding of diversity

  14. The impact of linguistic diversity on Web usage and thereby on the Web Or: • Why are non-English languages under-represented on the Web? • A web-analysis approach asking for underlying • cognitive-linguistic • behavioural • attitude factors

  15. A simple expectation of how much content exists in which language

  16. But: Dynamics of content creation, link setting, link following, attitudes, and use

  17. But: Dynamics of content creation, link setting, link following, attitudes, and use • People create less content • People link less to content • People use links less • People think the content is bad ... and use it less

  18. But: Dynamics of content creation, link setting, link following, attitudes, and use → Under-representation!

  19. Underlying data and methods • Database of countries and official languages • Distribution comparisons between • worldwide proportions of native speakers of different languages • worldwide distribution of servers registered by country • crawler analysis of links to a multilingual site S • log analysis assigning each session a native language • log analysis of (user native language) – (S-entry-page language) • Questionnaire/TAM analysis of native and non-native users of S: • usability, ease of use, competence in English, beliefs about availability of content in native language

  20. Some questions • Does one find such dynamics also in search engines? • What factors stop or reverse such language-marginalisation trends? • Critical mass? • Laws? • Volunteers? • Did / can Web 2.0/3.0 change this? • (When) is it better to work without pre-defined labels for users?

  21. → Part 2: An approach that ... • Does one find such dynamics also in search engines? • What factors stop or reverse such language-marginalisation trends? • Critical mass? • Laws? • Volunteers? • Did / can Web 2.0/3.0 change this? • (When) is it better to work without pre-defined labels for users?

  22. Motivation (1): Diversity of people is ... • Speaking different languages (etc.) → localisation / internationalisation • Having different abilities → accessibility • Liking different things → collaborative filtering • Structuring the world in different ways → ?

  23. Motivation (2): Diversity-aware applications ... • Must have a (formal) notion of diversity • Can follow a • “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic? • “customization approach” → show the space of diversity → allow choice / raise awareness / semi-automatic!

  24. Measuring grouping diversity • Diversity = 1 - similarity = 1 - normalized mutual information (NMI) • [Figure: example groupings of the same items, e.g. by colour, with NMI = 0 and NMI = 0.35]
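The measure on this slide can be sketched as follows, treating a grouping as a list of group labels, one per document. The slide does not say which NMI normalisation is used; dividing by the geometric mean of the two entropies is one common variant and is an assumption here, as are the function names and the handling of single-group groupings.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(g1, g2):
    """Normalized mutual information between two groupings of the same documents.

    g1[i] and g2[i] are the group labels given to document i.
    Normalisation by sqrt(H(g1) * H(g2)) is an assumed variant.
    """
    n = len(g1)
    joint = Counter(zip(g1, g2))
    c1, c2 = Counter(g1), Counter(g2)
    mi = sum((c / n) * log(c * n / (c1[a] * c2[b])) for (a, b), c in joint.items())
    h1, h2 = entropy(g1), entropy(g2)
    if h1 == 0 or h2 == 0:
        # Assumed convention: two one-group groupings count as identical.
        return 1.0 if h1 == h2 else 0.0
    return mi / sqrt(h1 * h2)

def grouping_diversity(g1, g2):
    """Diversity = 1 - similarity = 1 - NMI."""
    return 1.0 - nmi(g1, g2)

# Same partition under different label names: diversity close to 0.
print(grouping_diversity(["x", "x", "y", "y"], [1, 1, 2, 2]))
# Statistically independent partitions: diversity 1.
print(grouping_diversity([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI compares partitions rather than label names, two users who form the same groups but name them differently have a diversity of zero.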

  25. Measuring user diversity • “How similarly do two users group documents?” • For each query q, consider their groupings gr: • For various queries: aggregate
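The transcript does not preserve the formulas on this slide, so the following is only a sketch of the two steps it names: compute a per-query diversity between two users' groupings, then aggregate over queries. Taking the mean over the queries both users answered is an assumption, and `pair_disagreement` is a simple stand-in for whatever per-query measure is actually used.

```python
from itertools import combinations
from statistics import mean

def pair_disagreement(g1, g2):
    """Stand-in per-query diversity: the fraction of document pairs that one
    grouping puts together and the other separates."""
    pairs = list(combinations(range(len(g1)), 2))
    if not pairs:
        return 0.0
    disagree = sum((g1[i] == g1[j]) != (g2[i] == g2[j]) for i, j in pairs)
    return disagree / len(pairs)

def aggregate_gdiv(groupings_a, groupings_b, gdiv_q=pair_disagreement):
    """Aggregate per-query diversities into one user-to-user diversity.

    groupings_a, groupings_b: dict mapping query -> list of group labels,
    aligned on the same documents per query. Averaging is an assumption.
    """
    shared = sorted(set(groupings_a) & set(groupings_b))
    if not shared:
        raise ValueError("the two users share no queries")
    return mean(gdiv_q(groupings_a[q], groupings_b[q]) for q in shared)

# Hypothetical groupings by two users for two queries.
user_a = {"q1": [0, 0, 1, 1], "q2": [0, 1, 1]}
user_b = {"q1": [0, 0, 1, 1], "q2": [0, 0, 1]}
print(aggregate_gdiv(user_a, user_b))  # agree on q1, partly disagree on q2
```

Passing the per-query measure as a parameter keeps the aggregation independent of how grouping diversity itself is defined.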

  26. ... and now: the application domain ... that‘s only the 1st step!

  27. Workflow • Query • Automatic clustering • Manual regrouping • Re-use • Learn + present way(s) of grouping • Transfer the constructed concepts

  28. Concepts • Extension • the instances in a group • Intension • Ideally: “squares vs. circles“ • Pragmatically: defined via a classifier

  29. Step 1: Retrieve • CiteseerX via OAI • Output: set of • document IDs, • document details • their texts

  30. Step 2: Cluster • “the classic bibliometric solution” • CiteseerCluster: • Similarity measure: co-citation, bibliometric coupling, word or LSA similarity, combinations • Clustering algorithm: k-means, hierarchical • Damilicious: phrases → Lingo • How to choose the “best”? • Experiments: Lingo better than k-means at reconstruction and extension-over-time
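Two of the citation-based similarity measures listed on this slide can be sketched from their standard definitions: co-citation counts the documents that cite both papers, and coupling (usually called bibliographic coupling) counts the references the two papers share. The `refs` structure and the paper ids are hypothetical, and how CiteseerCluster actually normalises or combines these counts is not stated in the transcript.

```python
def bibliographic_coupling(refs, a, b):
    """Number of references that papers a and b have in common."""
    return len(refs.get(a, set()) & refs.get(b, set()))

def co_citation(refs, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)

# Hypothetical citation data: paper id -> set of ids it cites.
refs = {
    "p1": {"x", "y"},
    "p2": {"y", "z"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2", "x"},
}

print(bibliographic_coupling(refs, "p1", "p2"))  # both cite "y"
print(co_citation(refs, "p1", "p2"))             # cited together by p3 and p4
```

Either count can serve as the pairwise similarity fed to a clustering algorithm such as k-means or hierarchical clustering.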

  31. Step 3 (a): Re-organise & work on document groups

  32. Step 3 (b): Visualising document groups

  33. Steps 4+5: Re-use • Basic idea: • learn a classifier from the final grouping (Lingo phrases) • apply the classifier to a new search result → “re-use semantics” • Whose grouping? • One’s own • Somebody else’s • Which search result? • “the same” (same query, structuring by somebody else) • “more of the same” (same query, later time → more docs) • “related” (... measured how? ...) • arbitrary
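The basic idea of re-use can be illustrated with a deliberately simple classifier: each group of the final grouping is summarised by its term counts, and a document from a new search result is assigned to the group whose vocabulary it overlaps most. This is a sketch under stated assumptions, not the Damilicious classifier: the talk uses Lingo phrases, whereas this example uses single words, and `learn_groups`, `classify`, and the sample texts are hypothetical.

```python
from collections import Counter

def learn_groups(grouping):
    """Learn one term-count profile per group.

    grouping: dict mapping group label -> list of document texts.
    """
    return {label: Counter(w for text in texts for w in text.lower().split())
            for label, texts in grouping.items()}

def classify(model, text, fallback="Other topics"):
    """Assign a new document to the best-matching learned group.

    The score is the share of a group's term mass matched by the document;
    documents matching no group go to the fallback group.
    """
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) / (sum(counts.values()) or 1)
              for label, counts in model.items()}
    if not scores:
        return fallback
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else fallback

# Hypothetical final grouping produced by one user.
grouping = {
    "web mining": ["web usage mining of server logs", "mining web link structure"],
    "rfid": ["rfid tag data streams", "cleaning rfid sensor data"],
}
model = learn_groups(grouping)

# Apply the learned grouping to documents from a new search result.
for doc in ["link analysis on the web", "rfid data cleaning", "quantum chemistry"]:
    print(doc, "->", classify(model, doc))
```

The learned `model` can come from one's own grouping or from somebody else's, which corresponds to the "Whose grouping?" choice on the slide.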

  34. Visualising user diversity (1) Simulated users with different strategies • U0: did not change anything (“System”) • U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings • U2: attempted to move everything that did not fit well into the remainder group “Other topics”, & better fit; 10 regroupings • U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings • U4: regrouping by author and institution; 5 regroupings → 5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling

  35. Visualising user diversity (2) • aggregated using gdiv(A,B) • [Figure panels: Data mining, RFID, Web mining]

  36. Evaluating the application • Clustering only: Does it generate meaningful document groups? • yes (tradition in bibliometrics) – but: data? • Small expert evaluation of CiteseerCluster • Clustering & regrouping • End-user experiment with CiteseerCluster • 5-person formative user study of Damilicious

  37. The Damilicious tool: Summary and (some) open questions • A tool that helps users in sense-making, exploring diversity, and re-using semantics • Diversity measures when queries and result sets are different? • How best to present diversity? • How to integrate into an environment supporting user and community contexts? • Incentives to use the functionalities? • How to find the best balance between similarity and diversity? • Which measures of grouping diversity are most meaningful? • Extensional? • Intensional? Structure-based? Hybrid? (cf. ontology matching) • Which other sources of user diversity? • Diversity and relevance: can we learn from user-dependent relevance judgements?

  38. Some lessons learned (or questions raised?) • We need to embrace diversity. • We need to take into account • The diversity of documents / knowledge • The diversity of people • The diversity of diversity. • We need to be clear about what we mean. • We need to ask whether / when “striving for diversity” is in itself A Good Thing. • We need to ask whether / when “raising awareness of diversity” is in itself A Good Thing. Thanks!
