Mining the Deep Web Michael Hunter, Reference Librarian, Hobart and William Smith Colleges For Western New York Library Resources Council Member Libraries' Staff Sponsored by the Western New York Library Resources Council
For today . . . • From Web to Deep Web • Search Services: Genres and Differences • The Topography of the Internet • Mining the Deep Web: Techniques and Tips • Hands-on Session • Evaluating Deep Web Resources • Using Proprietary Software
Web to Deep Web • 1991 – Gopher • Menu-based, text-only interface • You had to KNOW the sites • 1992 – Veronica • Menus of menus • Difficult to access
Web to Deep Web • 1991 – Hyper-Text Markup Language • Linkage capability leads you to related information elsewhere • “Classic” Web Site • Relatively stable content of static, separate documents or files • Typically no larger than 1,000 documents, navigated via static directory structures
Web to Deep Web • 1994 – Lycos launched • First crawler-based search engine, with a database of 54,000 html documents (CMU) • Growth of html documents unprecedented and unanticipated • April 2000: “The Web is doubling in size every 8 months” (FAST)
Web to Deep Web • 1996 – Three phenomena pivotal for the development of the Deep Web: • HTML-based database technology introduced • Bluestone’s Sapphire/Web, Oracle • Commercialization of the Web • Growth in home PC users and e-commerce • Web servers adapted to embrace “dynamic” serving of data • Microsoft’s ASP, PHP on Unix and others
Web to Deep Web • 1998 – Deep Web comes of age • Larger sites redesigned with a database orientation rather than a static directory structure • U.S. Bureau of the Census • Securities and Exchange Commission • Patent and Trademark Office
Search Services: Genres and Differences • Exclusively crawler-created • Search engines • Meta search engines • Human created and/or influenced • Directories • Specialized search engines • Subject metasites • Deep Web gateway sites
[Diagram: crawlers (CR) harvesting pages from web servers (WS) into a search engine's database]
[Diagram: individual users querying the search engine's database]
Search Services: Exclusively Crawler Created • Database compiled through automated, link-dependent crawling and site submission • Unable to access: • Dynamically-created pages • Proprietary, non-html filetypes • Multimedia • Software • Password-protected sites • Sites prohibiting crawlers (robots.txt exclusion; see the sketch below)
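To make the last bullet concrete, here is a minimal sketch, assuming Python 3's standard library, of how a crawler consults a site's robots.txt before downloading a page; the site, path and crawler name are hypothetical examples.

```python
# Minimal sketch: honoring robots.txt exclusion with Python's standard library.
# The site, path and crawler name are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()  # download and parse the site's robots.txt

url = "http://www.example.com/reports/annual.html"
if robots.can_fetch("ExampleCrawler", url):
    print("robots.txt permits crawling", url)
else:
    print("robots.txt excludes crawlers from", url)  # page stays out of the index
```

A site that wants to stay out of search engines entirely serves a robots.txt with a blanket Disallow rule, and a well-behaved crawler never fetches its pages.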
Dynamically-created Web pages • Created at the moment of the query using the most recent version of the database. • Database-driven • Require interaction • Amazon.com • What titles are available? At what price? • Are there recent reviews? What about shipping? • Used widely in e-commerce, news, statistical and other time-sensitive sites.
Dynamically-created Web pages • Why can’t crawlers download them? • Technically they can interact, within the limits of their programming capability • But doing so is very costly and time-consuming for general search services
Dynamically-created Web pages • How can a crawler detect a dynamically-created page? • From any of the following in the URL: ? , % , $ , = , ASP , PHP , CFM and others
proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309
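A minimal sketch, in Python for illustration, of the detection rule just described: flag a URL as dynamically created if it contains one of the indicator characters or server-script extensions. The ProQuest address above trips the test on its ? and = characters alone.

```python
# Minimal sketch of the URL test above: a page is likely dynamically created
# if its address contains query characters or a server-script extension.
from urllib.parse import urlparse

DYNAMIC_CHARS = ("?", "%", "$", "=")
SCRIPT_EXTENSIONS = (".asp", ".php", ".cfm")  # "and others"

def looks_dynamic(url: str) -> bool:
    if any(ch in url for ch in DYNAMIC_CHARS):
        return True
    return urlparse(url.lower()).path.endswith(SCRIPT_EXTENSIONS)

print(looks_dynamic("http://proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1"))  # True
print(looks_dynamic("http://www.example.com/static/page.html"))                   # False
```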
Proprietary Filetypes • PDF • Spreadsheets • Word-processed documents • Google does it! Why can’t you?
Google’s Deep Web Components: Non-html filetypes (1.75%) • Adobe Portable Document Format (pdf) • Adobe PostScript (ps) • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk…) • Lotus WordPro (lwp) • MacWrite (mw) • Microsoft Excel (xls) • Microsoft PowerPoint (ppt) • Microsoft Word (doc) • Microsoft Works (wks, wps, wdb) • Microsoft Write (wri) • Rich Text Format (rtf) • Text (ans, txt) • SEARCH SYNTAX: “california power shortage” filetype:pdf (query-building sketch below)
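For readers who want to try the syntax against other formats, a small sketch that prints one filetype: query per extension; the topic string is the slide's own example.

```python
# Small sketch: build a Google "filetype:" query for each non-html format.
topic = '"california power shortage"'  # the slide's example topic
for ext in ("pdf", "ps", "xls", "ppt", "doc", "rtf", "txt"):
    print(f"{topic} filetype:{ext}")
```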
Google Non-html Filetypes: Warning! • FOR NON-HTML FILES • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file • INSTEAD, click the “View as HTML” option; no applications will be opened and no risk of virus or worm • NOTE: Titles for non-html files are frequently not descriptive of content
Search Services: Human created or influenced • Directories – general and specialized • Specialized search engines • Subject metasites or gateways • Deep Web gateways
Search Services: Human created or influenced • Content of sites is examined and categorized, or crawling is human-focused and refined • CAN include sites with dynamically created pages • CAN be limited to database-driven sites (Deep Web) • CAN include non-html files • NOTE: Some specialized search engines may include little human influence, e.g. Search.edu
The Topography of the Internet, or The Layers of the Web • Mapping the web is challenging • Unregulated in nature • Influences from all over the globe • Fulfills many purposes, from personal to commercial • Changes rapidly and unexpectedly • Divisions and terminology are inherently ambiguous, e.g. “Deep” vs. “Invisible” Web
May I suggest a nautical metaphor, perhaps the ocean? • SURFACE WEB • SHALLOW WEB • OPAQUE WEB • DEEP WEB
Surface Web • Static html documents • Crawler-accessible
Shallow Web • Static html documents loaded on servers that use ColdFusion, Lotus Domino or other similar software • A different URL for the same page is created each time it is served (see the sketch below) • Crawlers skip these to avoid multiple copies of the same page in their database • Technically human-accessible via directories, Deep Web gateways or links from other sites
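A minimal sketch of the duplicate problem behind the second bullet: the same page served twice under session-specific URLs. The session parameter names here are hypothetical; stripping them shows the two addresses are really one document, which is exactly the comparison a crawler would rather avoid making at scale.

```python
# Minimal sketch: the same page served under two session-specific URLs.
# Stripping the (hypothetical) session parameters reveals one document.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

SESSION_KEYS = {"sessionid", "sid"}  # hypothetical session parameter names

def strip_session(url: str) -> str:
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SESSION_KEYS]
    return urlunparse(parts._replace(query=urlencode(kept)))

first_visit = "http://www.example.com/page.cfm?sessionid=A1B2&doc=42"
second_visit = "http://www.example.com/page.cfm?sessionid=Z9Y8&doc=42"
print(strip_session(first_visit) == strip_session(second_visit))  # True: one page, two URLs
```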
Opaque Web • Static html documents • Technically crawler accessible • 2 types: • Downloaded and indexed by crawler • Not downloaded or indexed by crawler
Opaque Web • Downloaded and indexed by crawler • Buried in search results you never look at • A casualty of “relevance” ranking • Not downloaded or indexed by crawler, due to programmed download limits • Document buried deep in the site • Part of a large document that did not get downloaded (typical crawl per page is 110 KB or less; see the sketch below) • Document added since the last crawler visit (even the best revisit on average every 2 weeks, depending on the amount of change at a site)
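A minimal sketch of such a per-page download limit, assuming the 110 KB figure above; the URL is hypothetical. Anything past the limit is simply never read, so text deep inside a long document stays unindexed.

```python
# Minimal sketch: a crawler that reads at most ~110 KB of any one page.
# The URL is a hypothetical example.
from urllib.request import urlopen

PER_PAGE_LIMIT = 110 * 1024  # bytes; the "110 KB or less" figure above

with urlopen("http://www.example.com/very-long-report.html") as response:
    head = response.read(PER_PAGE_LIMIT)  # content beyond this is never indexed

print(f"Downloaded {len(head)} bytes; the rest of the document stays invisible")
```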
Opaque Web • Access to the Opaque Web • Specialized search engines • General and specialized directories • Subject metasites • These services typically index more thoroughly and more often than large, general search engines
Deep Web: Two Categories • Technically inaccessible to crawlers • Technically accessible to crawlers
Deep Web • Technically inaccessible to crawlers • Dynamically created pages • Databases • Non-textual files • Password protected sites • Sites prohibiting crawlers
Deep Web • Technically accessible to crawlers • Textual files in non-html formats (Google does it!) • Pages excluded from crawler by editorial policy or bias
How large is the Deep Web? • White paper by Michael K. Bergman (BrightPlanet, 2000), published in the Journal of Electronic Publishing in 2001 • http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp • Currently a scarcity of unbiased research, owing to the Deep Web’s fluid nature, dynamic content and multiple points of access
How large is the Deep Web? Bergman Study • Over 150,000 databases • Over 95% publicly available • Perhaps 500 times larger than the Surface Web • Growth rate currently greater than that of the Surface Web
What’s in the Deep Web? • Information likely to be stored in a database • People, address, phone number locators • Patents • Laws • Dictionary definitions • Items for sale or auction • Technical reports • Other specialized data
What’s in the Deep Web? • Information that is new and dynamically changing • News • Job postings • Travel schedules and prices • Financial data • Library catalogs and databases • Topical coverage is extremely varied.
Mining the Deep Web: A world different from search engines . . . • Hunter’s Maxim for Searching the Deep Web: Plan first to locate the category of information you want, then browse. • Don’t be too specific in your searches. Cast a wide net. • Brush up on your Gopher-type search skills (if you were searching the ‘Net back then). We’ve become accustomed to search engines’ free-text searching. This is a different world.
Basic Strategies for Mining the Deep Web • Using directories, general and specialized • Using general search engines • Using specialized (subject-focused) search engines • Using subject metasites (link-oriented) • Using Deep Web gateway sites (database-oriented) NOTE: Many sites contain elements of all of the above, in varying degrees and combinations
Using directories • Yahoo! > “web directories” > 840 category matches • Yahoo! > database > 22 categories and 7,423 site matches • Google Directory > link collections > 493,000 • Databases may also be found under general subject categories • Also use research directories such as Infomine, LII (Librarians’ Index to the Internet), the WWW Virtual Library (WWWVL) and others
Using general search engines • Combine subject terms with one or more of these possibilities: • directory • crawler • search engine • database • webring or web ring • link collection • blog
Using general search engines • Google (11/4/02) “toxic chemicals database” > 45 “punk rock search engine” > 77 “science fiction webring” > 97 (web rings are cooperative subject metasites, maintained by experts or aficionados) • Remember, when using a search engine you must match words on the page.
Using specialized (subject-focused) search engines • AKA • Limited-area engines • Targeted search engines • Expert search services • Vertical Portals • Vortals
Using specialized (subject-focused) search engines • Non-html textual files • http://searchpdf.adobe.com/ • Google • Non-textual files • Image and MP3 search engines • Media search at Google et al. • Software • Blogs • Blogdex http://blogdex.media.mit.edu/
Web logs or blogs • Online personal journals • Postings are often centered on a particular topic or issue and may contain links to recent, relevant information • Frequently updated • Differ from newsgroups in that they are generally written by a single author
Web logs or blogs • How do you search them? • Blogdex http://blogdex.media.mit.edu • Open Directory http://dmoz.org Computers / Internet / On the Web / Weblogs • Are they part of the Deep Web? • Yes and No