The Invisible Web

Introducing the Invisible Web

The Invisible Web is a term that refers to the vast amount of information on the web that is not indexed by search engines. Source: Online Media Glossary: http://www.onlinemedia.co.in/glossary Invisible Web – Definition

Technology has advanced over the years, and search engines have overcome several barriers to indexing certain kinds of pages. The following types of pages used to be skipped by search engines but no longer are:
• Pages in non-HTML formats (PDF, Word, Excel, PowerPoint), now converted into HTML.
• Script-based pages, whose URLs contain a ? or other script coding.
• Pages generated dynamically by other types of database software (e.g., Active Server Pages, Cold Fusion). These can be indexed if there is a stable URL somewhere that search engine crawlers can find.
Source: UC Berkeley Library: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html#Why2 Invisible Web – Definition
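
As a rough illustration of what a "script-based" URL looks like, the sketch below checks whether a URL carries a query string (the part after the ?). The function name and the example.org addresses are made up for this example; real crawlers use far more elaborate rules.

```python
from urllib.parse import urlsplit


def looks_script_based(url: str) -> bool:
    """Return True if the URL has a non-empty query string after '?'."""
    return bool(urlsplit(url).query)


# Hypothetical addresses, used only for illustration.
examples = [
    "http://www.example.org/guide.html",
    "http://www.example.org/search.asp?topic=presidents",
]

for url in examples:
    kind = "script-based" if looks_script_based(url) else "static"
    print(f"{url} -> {kind}")
```

The first address is reported as static and the second as script-based.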

Some of the web is still invisible:
• Specialized database content, such as the TexShare databases.
• Dynamic pages of little value beyond the first view, such as searches for a specific area of interest to one individual.
• Pages excluded by their owners.
Source: UC Berkeley Library: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html#Why2 Invisible Web – Definition
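
One common way owners exclude pages is a robots.txt file, which well-behaved crawlers consult before indexing. The slide does not name a mechanism, so treating robots.txt as the example is an assumption here. The sketch below uses Python's standard urllib.robotparser on a made-up rule set; the helper name and example.org addresses are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(crawler_may_fetch(ROBOTS_TXT, "http://www.example.org/index.html"))
print(crawler_may_fetch(ROBOTS_TXT, "http://www.example.org/private/report.html"))
```

The first page is allowed and the second is not, so a search engine that honors these rules would leave the second page out of its index.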

“In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes of data and 550 billion individual documents. Estimates – based on extrapolations from a study done at University of California, Berkeley – show that the deep Web consists of about 91,000 terabytes. By contrast, the surface Web (which is easily reached by search engines) is only about 167 terabytes. The Library of Congress contains about 11 terabytes.” Source: Wikipedia: http://en.wikipedia.org/wiki/Deep_web Invisible Web – Size
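
To put the quoted figures side by side, the short sketch below divides one estimate by another. The numbers come straight from the quotation above; the variable and function names are only illustrative, and the results are as rough as the estimates themselves.

```python
# Estimates quoted above, in terabytes.
deep_web_tb = 91_000
surface_web_tb = 167
library_of_congress_tb = 11


def times_larger(bigger: float, smaller: float) -> float:
    """Return how many times larger `bigger` is than `smaller`."""
    return bigger / smaller


print(f"Deep Web vs. surface Web: about {times_larger(deep_web_tb, surface_web_tb):.0f}x")
print(f"Surface Web vs. Library of Congress: about {times_larger(surface_web_tb, library_of_congress_tb):.0f}x")
```

By these estimates the deep Web is roughly 545 times the size of the surface Web, and the surface Web is roughly 15 times the size of the Library of Congress collection.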

It can be difficult to determine the exact size of the Invisible Web. UC Berkeley Library raises four issues that make it hard to measure:
• Which sites replicate some of their content in static pages (a hybrid of visible and invisible in some combination)?
• Which replicate it all (visible in search engines if you construct a search matching terms in the page)?
• Which databases replicate none of their dynamically generated pages in links and must be searched directly (totally invisible)?
• Search engines can change their policies on what they exclude and include.
Source: UC Berkeley Library: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html#Why2 Invisible Web – Size

Specialized database content is not indexed by the search engines. However, the companies that produce the content usually have indexed websites that promote their specialized databases. To discover these databases, search Google, Yahoo, or any other search engine for databases specializing in the topic of interest. Examples:
• Presidents databases
• News databases
• Literature databases
Invisible Web – Finding It

Subject directories are another great source for specialized databases. Examples:
• Yahoo Directory
• Google Directory
• Librarians' Internet Index
Invisible Web – Finding It

Some companies are beginning to realize the value of searching hidden content. Google Scholar is a great example of a site providing content that is otherwise unavailable through general search engines. Google Scholar provides access to hundreds of licensed articles, magazines, references, news archives, and other research resources. Invisible Web – Finding It

Discussion
• What are some unique specialized databases you have discovered?
• What are some ways you see the invisible web being used by librarians?
• Is there a subject area you would like to search this morning?
Invisible Web – Group Activity

Adam Wright awright@ntrls.org http://www.ntrls.org. Invisible Web
