Search Engines A Test Report Wolfgang Dalitz Zuse Institute Berlin (ZIB). 16th International Congress of the Austrian Mathematical Society (ÖMG) Annual Meeting of the German Mathematical Society (DMV) MATHEMATIK 2005 KLAGENFURT September 18 – 23, 2005. Contents. History and Motivation
Search Engines: A Test Report • Wolfgang Dalitz, Zuse Institute Berlin (ZIB) • 16th International Congress of the Austrian Mathematical Society (ÖMG), Annual Meeting of the German Mathematical Society (DMV) • MATHEMATIK 2005, KLAGENFURT, September 18 – 23, 2005
Contents • History and Motivation • Test Scenario • Search Engines • Results • Outlook
Math-Net • Concept for a distributed service for mathematics • Service by and for a community … but … • „Give and take“ does not work properly today
Community Driven Services • The concept of cooperative, open, and public domain-oriented services has some boundaries: • Manpower and resources • No scientific merits • People do not consider this an important service • There is not sufficient backing by your (scientific) environment
Nevertheless • Math-Net has been a successful project for a long time, at least in Germany: • Personal infrastructure • Combination of decentralized and central components has been working for a long time with small resources • Spin-off for other networks • Internationalization
Own Services? „We do have Google!“
tagesspiegel Apr 23, 2005 „We were able to fully benefit from the growth of online advertising." Eric Schmidt Investors were able … to realize a capital gain of nearly 160 percent. (Issue price of $ 85, now: $ 216, forecast: $ 270)
tagesspiegel Jul 23, 2005
www.heute.de Sep 16, 2005
c't 9/2005, Apr 18, 2005 • Manipulation attempts to upgrade ranking • "… AltaVista was loaded with keyword-packed spam to such an extent that it was hardly of any use any longer at the end of 1997 – a problem that AltaVista was never able to totally overcome since, in 1998, another venture appeared on the scene that rapidly advanced to become the number one: Google." • "Link farms" are one example of attempts to influence Google's ranking
What do we learn from this? • Search engines are important tools to find relevant information. • To run a "good" search engine is a billion dollar business. • There are many attempts to manipulate important search engines.
Fundamental • Science must be independent of services that • are mainly driven by commercial interests • do not produce verifiable results
Completeness? • many (60-80%) HTML pages at ZIB could not be traced in Google, AlltheWeb, …
Paradigms • Science has to determine what tools are necessary and important. • Science has to run and control certain techniques and services that are needed for scientific work. • There must be a (financial and organizational) framework to ensure the implementation of these activities.
Google's new idea • 15 million books scanned and full-text indexed • 150-200 million USD • period: 10 years (source: dw-world.de, Apr 27, 2005) • university libraries • Michigan: 7 million • Stanford: 8 million • Harvard: 40,000 • Oxford: published before 1900 • New York Public Library • selected older titles
… solution will come soon … The Bibliothèque Nationale de France (BNF) called for a European "counter-attack" against the project. President Jacques Chirac intends to recommend to the EU a project to digitize the works of the great European libraries. (source: heise.de/newsticker) "This action is directed against nobody, but it would be 'of fundamental importance' for a multicultural society," said Mr. Renaud Donnedieu de Vabres, minister of culture.
The European project will be an alternative to Google's online library (dw-world.de) • 1 million Euro each year (source: ARTE-News, May 2, 2005) • necessary budget: "400 million Euro in the next 3 years" (said Mr. Jean-Noël Jeanneney, president of the Bibliothèque Nationale de France, Paris, to the Frankfurter Rundschau, Sep 7, 2005)
Self-made job • Target: run a search engine • with small resources for techniques and manpower • with techniques controlled by the people involved • "better than Google" in the mathematics domain • Environment • open domain • community driven • topic oriented • locally operated • in the long run: community based service
How do search engines work? • Phase I: get all relevant objects (spider, crawler, gatherer) • Phase II: compile an index (summarizer, indexer) • Phase III: generate ("good") results (ranking)
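The three phases can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for a real web site, and the score used in phase III (summed term counts) is a deliberately naive choice, not the ranking of any of the systems tested here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory "site": URL -> (text, outgoing links). Not real data.
PAGES = {
    "/index.html": ("math net search engines", ["/a.html", "/b.html"]),
    "/a.html": ("search engines for mathematics mathematics", ["/index.html"]),
    "/b.html": ("preprints and software", []),
    "/orphan.html": ("mathematics page nobody links to", []),
}

def gather(start):
    """Phase I: follow links from a start URL and collect reachable pages."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        queue.extend(PAGES[url][1])
    return seen

def build_index(urls):
    """Phase II: build an inverted index, term -> {url: term count}."""
    index = defaultdict(Counter)
    for url in urls:
        for term in re.findall(r"\w+", PAGES[url][0].lower()):
            index[term][url] += 1
    return index

def search(index, query):
    """Phase III: rank pages by summed term counts (a naive score)."""
    scores = Counter()
    for term in re.findall(r"\w+", query.lower()):
        scores.update(index.get(term, {}))
    return [url for url, _ in scores.most_common()]

urls = gather("/index.html")
index = build_index(urls)
results = search(index, "mathematics")
```

Note that `/orphan.html` never reaches the index because no gathered page links to it: whatever phase I misses cannot be found in phase III.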
Candidates and strategies • Complete systems (phases I, II, III) • harvest (gatherer, broker, glimpse) • swish-e (spider.pl and indexer) • nutch (lucene) • Partial systems • phase I: wget and w3mir • phase II: lucene • phase III: ??
What else? • htdig • estraier • Perfect Search • PHPdig • TSEP • namazu • see test reports in different computer magazines (c't, ix, LinuxMagazin)
Test scenario: local copies from two different sites (Site II is larger by a factor of about 10) • Site I: www.mathematik-21.de, 7371 files (2293 HTML, 1160 Images, 140 Text, 81 PDF, 19 PS; rest: tmp, harvest) • Site II: www.zib.de, 70126 files (17981 HTML, 17147 Images, 2024 PDF, 991 PS, 140 Text; rest: test)
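As a quick check on the figures for the two test sites, the following sketch computes the size of the unclassified "rest", the share of HTML files, and the scale factor between the sites. The numbers are copied from the slide; the dictionary layout and the function names are illustrative assumptions.

```python
# File counts as given for the two test sites.
site_1 = {"files": 7371, "HTML": 2293, "Images": 1160, "Text": 140, "PDF": 81, "PS": 19}
site_2 = {"files": 70126, "HTML": 17981, "Images": 17147, "PDF": 2024, "PS": 991, "Text": 140}

def rest(site):
    """Files not covered by any listed type (the 'rest' on the slide)."""
    return site["files"] - sum(v for k, v in site.items() if k != "files")

def html_share(site):
    """Fraction of all files that are HTML."""
    return site["HTML"] / site["files"]

# Ratio of total file counts, Site II to Site I.
scale = site_2["files"] / site_1["files"]

print(f"rest: {rest(site_1)} vs {rest(site_2)}")
print(f"HTML share: {html_share(site_1):.2f} vs {html_share(site_2):.2f}")
print(f"scale factor: {scale:.1f}")
```

The ratio of total file counts comes out at about 9.5, which matches the "factor 10" quoted for the test scenario.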
Completeness (phase I) Site I
Explanations • There are • views from the inside (filesystem) • symbolic links • views from the outside (webserver) • People used non-conforming HTML • The resulting link lists differ
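The gap between the inside and the outside view can be made concrete with a small sketch. It is not the procedure used in the test: it assumes a flat directory of HTML files with relative links, builds a throwaway three-page site in a temporary directory, and uses hypothetical helper names (`inside_view`, `outside_view`).

```python
import os
import tempfile
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def inside_view(root):
    """Inside view: all HTML files found by walking the filesystem."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".html"):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.add(rel.replace(os.sep, "/"))
    return found

def outside_view(root, start):
    """Outside view: HTML files reachable by following links from a start page."""
    seen, todo = set(), [start]
    while todo:
        page = todo.pop()
        path = os.path.join(root, page)
        if page in seen or not os.path.isfile(path):
            continue
        seen.add(page)
        parser = LinkParser()
        with open(path, encoding="utf-8") as fh:
            parser.feed(fh.read())
        todo.extend(parser.links)
    return seen

with tempfile.TemporaryDirectory() as root:
    # Hypothetical miniature site: hidden.html is linked from nowhere.
    pages = {
        "index.html": '<a href="a.html">A</a>',
        "a.html": '<a href="index.html">home</a>',
        "hidden.html": "<p>no page links here</p>",
    }
    for name, body in pages.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
            fh.write(body)
    inside = inside_view(root)
    outside = outside_view(root, "index.html")

missed = inside - outside                  # files a crawler never sees
completeness = len(outside) / len(inside)  # fraction of files reached
```

A file that exists on disk but is linked from nowhere shows up only in the inside view, so the two lists differ even on a well-formed site; symbolic links and broken HTML widen the gap further.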
c't 9/2005 Apr 18, 2005 study: only 3.9 % of German Web sites conform to the standard "… 96.1 % of the checked Web pages included illegal code"
Completeness (phase I) Site II
Indexing (phase II) • harvest/glimpse • fast • has to be tuned (summarized) • spider.pl/swish-e • very fast • nutch/lucene • very fast • incremental index
Ranking (phase III) ??? Is that what characterizes a „good“ search engine? Are there objective criteria?
(first) Results • To run search engines implies that • you have time and resources • you control each task • this is nearly a full-time job • To run search engines for a community • is really a project • is not a one-man job • requires many resources
Our proposals • harvest • is satisfactory if all tasks are controlled and corrected in detail • is our favorite for decentralized work • (wget) nutch/lucene, swish-e • run without problems on smaller sites • but we have no experience with how they behave on really big sites (1 TByte of data)
URLs www.suma-ev.de