Fumbles in the dark: measuring the Invisible Web
Colin Reddy & Paul Wouters
Networked Research and Digital Information – Nerdi
NIWI-KNAW - www.nerdi.knaw.nl
Initial Questions
• What is the Invisible Web?
• Why is it there?
• How does it affect Web indicators?
(Some) Initial Problems
• The dynamic nature of the Web
• Establishing how search engine policies affect the material they find
• Invisibility is not simply related to file type
Definition of Invisibility
Information can be called invisible in a certain search context (of a specific search technology) if:
• that information is not part of the results of the search, and
• that information does meet the criteria of relevance as formulated in the search, and
• that information would in principle be retrievable if an observer knew its exact location on the Web.
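The three criteria above can be sketched as a boolean predicate. This is a minimal illustration, not part of the study's method; the function and parameter names (`search_results`, `is_relevant`, `is_retrievable`) are assumptions chosen to mirror the three conditions.

```python
def is_invisible(info, search_results, is_relevant, is_retrievable):
    """Return True if `info` is invisible in the context of one search.

    search_results : collection of items the search actually returned
    is_relevant    : predicate for the relevance criteria of the search
    is_retrievable : predicate for "reachable if the exact Web location
                     were known to an observer"
    (All names are illustrative, not from the original study.)
    """
    return (
        info not in search_results   # not part of the search results
        and is_relevant(info)        # meets the formulated relevance criteria
        and is_retrievable(info)     # retrievable in principle at a known URL
    )
```

Note that all three conditions must hold: a page that is merely absent from the results, but irrelevant or unreachable, does not count as invisible under this definition.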
New Question
What characteristics of information on the Web might increase or decrease the probability that it will be invisible to a specified set of search engines at a particular point in time?
Factors involved
• the number of in-links to the page containing the information
• the depth at which the information is located within a subdomain
• the file extension and the MIME type of the file containing the information
• the metatags with which the Web page is marked
More factors
• the updating frequency of the Web site or page
• the accessibility of the information
• the format of the URL at which the information is located, and lastly
• the total of these “visibility characteristics” of the in-linking pages.
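The visibility characteristics listed on these two slides can be gathered into a single per-page record. The sketch below is a hypothetical data structure, not the study's actual instrument; every field name is an assumption chosen to match one listed factor.

```python
from dataclasses import dataclass, field

@dataclass
class VisibilityProfile:
    """One Web page's visibility characteristics.

    Field names are illustrative; each corresponds to one factor
    from the slides above.
    """
    in_links: int                 # number of in-links to the page
    depth: int                    # depth within its subdomain (0 = root)
    file_extension: str           # e.g. ".html", ".pdf"
    mime_type: str                # e.g. "text/html"
    metatags: list = field(default_factory=list)   # metatags on the page
    update_frequency_days: float = 0.0             # updating frequency
    accessible: bool = True       # e.g. not behind a login or form
    dynamic_url: bool = False     # URL format suggests dynamic content
```

Such a record would let each factor be tested separately for its association with a page turning out invisible to the chosen set of engines.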
The experiment – Aims
• To quantify the amount of information residing on websites that is ‘invisible’
• To establish the characteristics of the information that determine the probability of it being ‘invisible’
The experiment – Method
• Establish the contents of a website independently of a search engine, and then compare this “control map” with the results returned from search engines.
The experiment
• Microsoft Search Analyst used to map the entire contents of a Web site.
• At the same moment in time (important to minimise the effect of changes in website content over time), search engines and indexes were used to provide the same information.
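The comparison step described above amounts to a set difference: URLs in the control map that no engine returned are the invisible ones. A minimal sketch, assuming both the control map and the engine's index can be represented as sets of URLs (the function name and interface are illustrative, not the study's actual tooling):

```python
def invisible_fraction(control_map, engine_index):
    """Compare a site's full contents (the "control map") with the URLs
    a search engine indexed.

    control_map  : set of all URLs found by the independent site crawl
    engine_index : set of URLs for this site returned by the engine
    Returns the set of invisible URLs and their share of the site.
    """
    invisible = set(control_map) - set(engine_index)
    fraction = len(invisible) / len(control_map) if control_map else 0.0
    return invisible, fraction
```

For example, if the control map holds four pages and the engine indexed two of them, the invisible fraction is 0.5. Running the comparison against each engine separately (and against their union) would show how much each service misses.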
Services used
Search engines: Google, Fast, AltaVista, Inktomi
Indexes: Yahoo, Open Directory, LookSmart, Galaxy
Sample
• Web sites of 99 plant genetics institutes in the European Research Area
• Science Citation Index used to find articles, from which a list of institutes was compiled
• Search engines used to find the institutes’ URLs