
Title

Title. Bernhard Rieder Université de Paris VIII - Vincennes Saint-Denis Laboratoire Paragraphe Democratizing Search Concepts and Challenges Deep Search World-Information Institute 8 / 11 / 2008. Search engine basics rehashed


Presentation Transcript


  1. Title: Democratizing Search. Concepts and Challenges. Bernhard Rieder, Université de Paris VIII - Vincennes Saint-Denis, Laboratoire Paragraphe. Deep Search, World-Information Institute, 8 / 11 / 2008

  2. Search engine basics rehashed Search engines have emerged as the dominant pathways into the depths of the Web for 1.5 billion Internet users. After email, search is the second most frequent activity online. Search engines play an important role in shaping which sites are visible on the Web and which sites are not. I - Search engine basics

  3. The problem with search • "Search is broken." [ Jimmy Wales 2008 ] • The most common points of critique: • Crawling and ranking are not transparent ( black box ) • Might favor "commercial" sites • Smaller sites have little visibility • Susceptible to manipulation ( SEO ) • Results are "read only" • Google as monopoly gatekeeper • A "quick fix" is quite improbable. I - The problem with search

  4. Search as a strange object • Web search is a phenomenon that is not easy to categorize and conceptualize. • The Web is an information space unlike any other • Search can be done using different techniques • It is part of a variety of practices • What is the closest antecedent? The library catalogue? Mass media? Guidebooks? Domain experts? I - Search as a strange object

  5. Search is not search!? Web search is part of a larger shift from information scarcity to abundance. "The task is not to design information-distributing systems but intelligent information-filtering systems." [ H. Simon 1969 ] Search engines are not systems of classification, they are machines that make judgments on the importance of pieces of information relative to a query. I - Searching or Filtering?

  6. I - SCREEN: Yahoo 1997

  7. I - SCREEN: AltaVista 1996

  8. The search pipeline A search engine includes several distinct stages: Crawler → Index → Search & Rank → GUI I - The search pipeline
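
The four stages can be sketched in a few lines. This is a minimal illustration, assuming a tiny in-memory "Web" in place of real crawling; the function names and example URLs are hypothetical and not from the talk, and alphabetical sorting merely stands in for a real ranking step.

```python
from collections import defaultdict

def crawl(pages):
    """Crawler stand-in: yield (url, text) pairs from an in-memory dict."""
    yield from pages.items()

def build_index(documents):
    """Index: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in documents:
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Search & Rank: return URLs containing every query term.

    Sorting alphabetically is only a placeholder for ranking.
    """
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)

def show(results):
    """GUI stand-in: print one result per line."""
    for url in results:
        print(url)

# Hypothetical example pages.
pages = {
    "http://a.example": "a house on the hill",
    "http://b.example": "a house by the sea",
}
index = build_index(crawl(pages))
show(search(index, "house hill"))
```

Each function maps to one box of the pipeline, which is why each stage can carry its own costs and engineering problems independently of the others.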

  9. Some basic ranking principles ( content ranking ) • query: "house", rank by: number of occurrences • query: "house AND hill", rank by: closeness ( "there is a house on the hill" ranks above "from my house I can see a beautiful hill" ) • query: "house", rank by: location in document ( "<title>house</title>" ranks above "<p>house</p>" ) • query: "house", rank by: URL ( "http://www.house.com" ranks above "http://www.villa.com" ) I - Some basic ranking principles
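
Two of these content-ranking signals, occurrence count and closeness, can be sketched as follows. This is an illustration only: the tokenizer and function names are assumptions, and real engines combine many more signals.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def occurrence_score(term, text):
    """Number of times the term occurs in the text (higher is better)."""
    return tokens(text).count(term.lower())

def proximity_score(term_a, term_b, text):
    """Smallest word distance between the two terms (lower is better).

    Returns None if either term is missing from the text.
    """
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

docs = [
    "from my house I can see a beautiful hill",
    "there is a house on the hill",
]
# Rank by closeness of "house" and "hill": the smaller distance comes first.
ranked = sorted(docs, key=lambda d: proximity_score("house", "hill", d))
print(ranked[0])
```

With the two example sentences from the slide, the words are three positions apart in the first sentence and six in the second, so the first sentence ranks higher on closeness.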

  10. The dominant paradigm: recursive link analysis I - Link analysis
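
The slide does not name a specific algorithm. As an illustration, the best-known example of recursive link analysis is PageRank, where a page's score depends on the scores of the pages linking to it. The sketch below is a simplified power iteration; the damping factor of 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: three pages link to "hub", which links back to "a".
toy_graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(toy_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The recursion is visible in the result: page "a" has a single incoming link but outranks "b" and "c" because that link comes from the highly ranked hub.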

  11. The Web as scale-free network II - The Web as scale-free network

  12. Link analysis and the logic of the hit • Link analysis projects the hypertext graph as a hierarchical list that strongly favors hubs and networks of hubs. • Growth principle: "preferential attachment" • "cumulative advantage" • "the rich get richer" • "logic of the hit" • "We will have to realize that hierarchies fulfill a semantic function and that semantic systems are hierarchic by principle." [ Hartmut Winkler 1997 ] II - Link analysis and the logic of the hit
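
The growth principle can be simulated directly. The sketch below assumes the simplest form of preferential attachment, in which each new page creates one link to an existing page chosen with probability proportional to that page's current number of links (a Barabási-Albert style model; the model choice, seed, and network size are assumptions, not taken from the slides).

```python
import random

def grow_network(n_pages, seed=0):
    """Grow a network by preferential attachment and return link counts."""
    if n_pages < 2:
        raise ValueError("need at least two pages")
    rng = random.Random(seed)
    degree = [1, 1]      # start with two pages joined by one link
    endpoints = [0, 1]   # each page appears once per link it takes part in
    for new_page in range(2, n_pages):
        # Picking uniformly from the endpoint list selects a page with
        # probability proportional to its current number of links.
        target = rng.choice(endpoints)
        degree[target] += 1
        degree.append(1)
        endpoints.extend([target, new_page])
    return degree

degree = grow_network(2000)
print("best-connected page:", max(degree), "links")
print("pages with a single link:", degree.count(1))
```

Runs of this kind typically leave most pages with a single link while a few early pages accumulate many, which is the "rich get richer" pattern that link-based ranking then amplifies.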

  13. "So what’s our straightforward definition of the ideal search engine? Your best friend with instant access to all the world’s facts and a photographic memory of everything you’ve seen and know. That search engine could tailor answers to you based on your preferences, your existing knowledge and the best available information." - Marissa Mayer, Google VP II - CITATION: Best friend

  14. Current guiding principles • The two dominant guiding principles currently are: • popularity ( the logic of the hit ) • convenience ( personalization ) II - Current guiding principles

  15. Where can we look for alternative principles? • Web search is a new phenomenon; it can nonetheless be compared to adjacent domains. • Libraries and documentation ( freedom of access ) • Media and journalism ( neutrality, plurality ) • Cultural policy - "exception culturelle" ( diversity ) • Community organization ( participation ) • Liberal democracy ( transparency, accountability ) II - Where to look for alternative principles?

  16. III - CITATION: Democracy! "Democracy! Bah! When I hear that word I reach for my feather Boa!" - Allen Ginsberg

  17. Democracy as community "The second big element of Web 2.0 is democracy. We now have several examples to prove that amateurs can surpass professionals, when they have the right kind of system to channel their efforts. [ … ] Another place democracy seems to win is in deciding what counts as news. I never look at any news site now except Reddit." [ Paul Graham 2005 ] "Democratizing search" would mean letting users rank results. The community decides which information is best ( markers: votes, clicks, pageviews, etc. ). III - Two concepts of democracy: community

  18. III - CITATION: Wales on bias "The idea that all 'selection' is equally 'biased' is fallacious. We intuitively understand this when we talk about other forms of writing or journalism; we need to understand it for *this* form of journalism as well." - Jimmy Wales

  19. Wikia Search • Wikia Search tries to apply the Wikipedia principle to ranking search results, following the NPOV principle. • All technology is open source • Crawling is distributed using GRUB • Currently in an experimental stage • Wikia Search follows a series of explicit principles: • Transparency • Community • Quality • Privacy III - Wikia Search

  20. III - SCREEN: Wikia Search Abortion

  21. III - SCREEN: Wikia Search McCain

  22. Democracy as society Large-scale collective governance based on bureaucratic institutions limited by checks and balances. "Democratizing search" could mean adapting search to the requirements of liberal democracy. Web search would serve the goal of informing citizens on the different courses of action. III - Two concepts of democracy: society

  23. What should we strive for? • Reforming the search landscape is a normative project that would produce winners and losers. • Transparency => Plurality of opinion • Community => Society • Quality • Privacy • The goal would be having a variety of high-quality search applications that deliver different sets of results. III - What should we strive for?

  24. Democratizing search: main challenges • Entering the search market has become difficult. • Cost of infrastructure / datacenter • Difficulty finding quantifiable markers for ranking • Changing user habits / software defaults • Every part of the search pipeline has specific costs and specific engineering challenges. Very fast end-user performance requires sophisticated load balancing and an elaborate datacenter architecture. III - Democratizing search: main challenges

  25. Democratizing search: overview • User side: education • Provider side: antitrust measures, financial aid • Interaction between user and service: interface / algorithm additions, search APIs, search sandbox III - Overview

  26. III - CITATION: Mind of god "The perfect search engine would be like the mind of God." - Sergey Brin

  27. User side: education • Information access is driven by informational practices as much as by technology itself. User education can include: • General information on search engines and how they work • Using a search engine to its full potential • Learning about alternatives to the dominant player • Understanding that linking is not an innocent practice • General informational ecology • These points could easily be included in teaching curricula. III - A: User side

  28. III - A: SCREEN: Cheat sheet

  29. comScore European Search Properties March 08 III - B: SCREEN: Monopoly

  30. Provider side: antitrust measures Ownership is commonly an issue in the world of media. Google is politically quite active. But how to split up http://google.com? III - B: Antitrust measures

  31. Provider side: financial aid A series of countries grant direct or indirect subsidies to newspapers. France taxes cinema tickets and redistributes the money to level the playing field. Countries can offer targeted R&D grants ( e.g. Quaero ). There could be public search engines or a public datacenter infrastructure. III - B: Financial aid

  32. Provider side: building a public infrastructure ( pipeline diagram: Crawler, Index, Search & Rank, GUI ) III - B: A public infrastructure

  33. III - B: SCREEN: exalead

  34. Between user and provider: interaction possibilities ( pipeline diagram: Crawler, Index, Search & Rank, GUI ) III - C: Empower the user through interaction

  35. III - C: SCREEN: exalead

  36. III - C: SCREEN: exalead

  37. III - C: SCREEN: msn sliders

  38. III - C: SCREEN: clusty

  39. Between user and provider: better Web APIs Search APIs allow external applications to download a limited number of results ( Google ~8, Yahoo BOSS 50, Live API 50 ). With larger result sets, effective reranking or more powerful user interaction would be possible. III - C: Opening the results
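
What client-side reranking could look like once a larger result set is available can be sketched as follows. No real search API is called here; the result format (a list of dicts with a "url" key) and the diversity rule are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

def rerank_for_diversity(results):
    """Interleave results by host so that no single site dominates the top.

    The engine's original order is preserved within each host.
    """
    by_host = {}
    for result in results:
        host = urlparse(result["url"]).netloc
        by_host.setdefault(host, []).append(result)
    queues = list(by_host.values())
    reranked = []
    while queues:
        # Take the next-best result from every host in turn.
        for queue in queues:
            reranked.append(queue.pop(0))
        queues = [queue for queue in queues if queue]
    return reranked

# Hypothetical result list, as an external application might receive it.
results = [
    {"url": "http://a.example/1"},
    {"url": "http://a.example/2"},
    {"url": "http://a.example/3"},
    {"url": "http://b.example/1"},
    {"url": "http://c.example/1"},
]
for result in rerank_for_diversity(results):
    print(result["url"])
```

With only a handful of results per query there is little for such a function to work with, which is why the size limits of current APIs matter.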

  40. Between user and provider: the search sandbox ( pipeline diagram: Crawler, Index, Search & Rank, GUI ) III - C: Opening the index

  41. Between user and provider: the search sandbox • A search sandbox would have the following elements: • Run on corporate infrastructure • A safe execution environment for untrusted code • A limited set of API calls to access the index • Users and institutions could propose alternative ranking methods • Quota rules for processing time • This might allow an ecosystem of search methods to develop in a situation that is both technically and economically viable. III - C: Opening the index
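
Three of these elements (a limited index API, user-proposed ranking methods, and a processing-time quota) can be sketched together. Everything here is hypothetical: the class and function names are invented for illustration, the index is a toy dictionary, and plain Python offers no safe execution environment for untrusted code, so real isolation would need a separate mechanism.

```python
import time

class QuotaExceeded(Exception):
    """Raised when a ranking function uses up its processing-time budget."""

class IndexView:
    """A deliberately small, read-only API over a toy in-memory index."""

    def __init__(self, postings):
        self._postings = postings  # term -> {doc_id: term frequency}

    def candidates(self, term):
        return list(self._postings.get(term, {}))

    def term_frequency(self, term, doc_id):
        return self._postings.get(term, {}).get(doc_id, 0)

def run_ranker(ranker, index, term, budget_seconds=0.05):
    """Score candidates with a user-supplied ranker under a time quota.

    The quota is only checked between calls to the ranker, so this is a
    cooperative limit, not true preemption.
    """
    start = time.perf_counter()
    scored = []
    for doc_id in index.candidates(term):
        if time.perf_counter() - start > budget_seconds:
            raise QuotaExceeded("time budget used up")
        scored.append((ranker(index, term, doc_id), doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

def by_frequency(index, term, doc_id):
    """Example of a proposed ranking method: plain term frequency."""
    return index.term_frequency(term, doc_id)

index = IndexView({"house": {"d1": 1, "d2": 3}})
print(run_ranker(by_frequency, index, "house"))
```

The ranker only ever sees the `IndexView` object, so the provider decides which calls are exposed, while users and institutions decide how the returned values are combined into a ranking.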

  42. Conclusions We will have to put humans "back into the loop" and render search configurations hybrid and more complex. In order to open up the search landscape and get closer to the goal of plurality, we will have to combine all three levels. We need more large-scale empirical data on search habits and the consequences of ranking. Without a better conceptual grasp of search engines, regulatory efforts are highly improbable. Conclusions

  43. Thank you for your attention! bernhard.rieder@univ-paris8.fr http://bernhard.rieder.fr http://thepoliticsofsystems.net The End
