Clustering of search engine results by Google. Wouter.Mettrop@cwi.nl CWI , Amsterdam, The Netherlands Paul.Nieuwenhuysen@vub.ac.be Vrije Universiteit Brussel , and Universiteit Antwerpen , Belgium Hanneke Smulders Infomare Consultancy , The Netherlands
Clustering of search engine results by Google • Wouter.Mettrop@cwi.nl CWI, Amsterdam, The Netherlands • Paul.Nieuwenhuysen@vub.ac.be Vrije Universiteit Brussel, and Universiteit Antwerpen, Belgium • Hanneke Smulders Infomare Consultancy, The Netherlands • Presented at Internet Librarian International 2004 in London, England, October 2004
Abstract - Summary - Overview Our experimental and quantitative investigation has shed some light on the phenomenon that the Google search engine omits WWW documents from the ranked list of search results that it provides, when the documents are “very similar”. Google offers the possibility to "repeat the search with the omitted results included", on the last page with search results. All this can be considered as an additional service offered by the system to the users. However, our investigation revealed that pages are also clustered, omitted and thus hidden to some extent, even when they can be substantially different in meaning for a human reader. The system does not distinguish authentic pages from copies or more importantly from copies that were modified on purpose. Furthermore, Google selects different WWW documents over time to represent the cluster of very similar documents. A practical consequence of this system is that a search for information may lead a user to rely on the information that is presented in a WWW document that represents a cluster of documents, but that is not necessarily the most appropriate or authentic document.
Contents - summary - structure - overview of this paper • Introduction: Google omits documents from search results • Hypothesis & problem statement • Experimental procedure • Results / findings • Discussion: This may be important • Conclusion of our investigation and recommendation
Introduction: very similar documents and Google • In response to a search query, the Google Web search engine delivers a ranked list of entries that refer to documents on the WWW. • Google “omits some entries” in the case that these are “very similar” and offers the possibility to "repeat the search with the omitted results included", on the last page with search results. • All this can be considered as an additional service offered by the system to the users, because they do not have to waste time by inspecting very similar entries.
Introduction: other services offered by Google • The omission of very similar entries should not be confused with other services / tricks performed by Google, such as • offering a link to “cached” documents • offering a link to “similar pages” that are associated with an entry; in fact this leads not to similar but to very different documents on different servers, which are related to the entry and may be useful according to Google • clustering and hiding entries from the results because they are all located on the same server computer, even though they may not be similar in content
Hypothesis and problem statement: competition for visibility • Our hypothesis was that the Google computer system cannot determine which entry is the best one to serve as the representative of a cluster of entries (at least not in all cases), because this depends • on the one hand, on the aims of the user • on the other hand, on the variations in the documents • This is analogous to the problem of how to rank entries in the presentation of search results.
Screen shot of a Google Web search in a case with “omitted entries”
Screen shot of a Google Web search with “omitted entries included”
Experimental procedure: Test documents for the WWW (1) • To understand the clustering better, we performed experiments with very similar test documents. • A unique test document was put on the Internet, together with specific content in several metatags. This guaranteed that our test documents ranked highly in the results of our searches. • The test documents differ from each other in title tag, body text and filename.
Experimental procedure: Test documents for the WWW (2) • We keep 18 different samples of the test document on the Internet, on 8 server computers. • Documents were submitted at the end of 2002 (9 documents) and at the end of 2003 (10 documents). • Our test documents do NOT change over time.
Experimental procedure: Example of our test document on the WWW
Experimental procedure: Searching for the test documents • We submitted 55 queries simultaneously every 30 minutes in the period March – June 2004. • The total number of queries submitted was 274,615. • In 99% of the test results, Google put all retrieved test documents in one cluster. This cluster was always represented by one of the test documents.
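As a rough plausibility check on the figures above, the following sketch relates the total number of queries to the sampling schedule. It assumes, purely for illustration, that every round contained all 55 queries and that rounds ran back to back at exactly 30-minute intervals; the constant names are ours, and the real run may well have had gaps.

```python
# Figures taken from the slide; the uninterrupted-schedule assumption is ours.
QUERIES_PER_ROUND = 55
ROUND_INTERVAL_HOURS = 0.5
TOTAL_QUERIES = 274615

# How many complete rounds of 55 queries fit in the total?
rounds, remainder = divmod(TOTAL_QUERIES, QUERIES_PER_ROUND)

# Duration those rounds would span if run back to back.
hours = rounds * ROUND_INTERVAL_HOURS
days = hours / 24

print(f"{rounds} rounds, remainder {remainder}, about {days:.1f} days")
```

Under these assumptions the total corresponds to a whole number of rounds spanning roughly 104 days, which fits inside the 122 days from March through June.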
Experimental results: Changes of representative document • We found that Google chose a different test document to represent the cluster of similar test documents over time.
Experimental results: Changes of representative document • Definition • A representative period is a period that lasted more than 2 hours, in which there is no change in the representative document of the cluster with test documents for one of our queries. • Findings • The number of representative periods per query was 11 or 12. • The length of representative periods per query was between 1 day and 27 days.
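The definition of a representative period can be sketched in code. This is a minimal illustration, not the procedure actually used in the investigation: the function name, the `(timestamp, document id)` observation format, and the choice to measure a period from its first to its last observation are all assumptions, and the sample data is invented for the example.

```python
from datetime import datetime, timedelta

def representative_periods(observations, min_length=timedelta(hours=2)):
    """Return (doc_id, first_seen, last_seen) for each unbroken run of the
    same representative document that lasted longer than min_length.

    observations: list of (timestamp, doc_id) pairs for ONE query, sorted
    by time. The length of a run is measured from its first to its last
    observation (an assumption made for this sketch).
    """
    periods = []
    current = run_start = run_end = None
    for timestamp, doc_id in observations:
        if doc_id != current:
            # The representative changed: close the previous run, if any.
            if current is not None and run_end - run_start > min_length:
                periods.append((current, run_start, run_end))
            current, run_start = doc_id, timestamp
        run_end = timestamp
    # Close the final run.
    if current is not None and run_end - run_start > min_length:
        periods.append((current, run_start, run_end))
    return periods

# Invented sample: one observation every 30 minutes, hypothetical doc ids.
start = datetime(2004, 3, 1)
step = timedelta(minutes=30)
labels = ["doc-a"] * 6 + ["doc-b"] * 2 + ["doc-a"] * 10
sample = [(start + i * step, label) for i, label in enumerate(labels)]

periods = representative_periods(sample)
for doc_id, first_seen, last_seen in periods:
    print(doc_id, first_seen, last_seen, last_seen - first_seen)
```

In this sample the short appearance of `doc-b` is too brief to count as a representative period, but it still splits the observations for `doc-a` into two separate periods. Whether such a brief interruption should split a period is a design choice the definition leaves open.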
Experimental results: Observations • In 99% of the results of our test, an old test document (submitted in 2002) was the representative. Does Google aim at authenticity? • The selection of the representative document of a cluster depends not only on the documents in the cluster, but also on the query that is used to retrieve the cluster.
Experimental results: Test documents retrieved • All test documents: 18 • Test documents found by searching for content (not URL): 15 • Test documents used as representative: 9 • Test documents found by searching for URL: 15
Discussion: The importance of similar documents • Real, authentic documents on their original server computer have to compete with “very similar” versions (that can be substantially different), which are made available by others on other servers. In reality documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.
Discussion: The importance of similar documents • The omission of very similar entries may complicate scientometric / bibliometric studies, that is, quantitative studies of the numbers of documents retrieved. • Are documents on their original server pushed out of Google search results by very similar competing documents on one or several other servers?
Conclusion and recommendation • In general, realize that Google Web search omits very similar entries from search results. • In particular, take this into account when it is important to find • the oldest, authentic, master version of a document; • the newest, most recent version of a document; • versions of a document with comments, corrections… • in general: variations of documents