
The Invisible Web - finding things that are hard to find -


Presentation Transcript


  1. The Invisible Web - finding things that are hard to find - Tefko Saracevic, PhD, Rutgers University http://www.scils.rutgers.edu/~tefko (also contains a list of sites relevant to the topic and this presentation) © Tefko Saracevic, Rutgers University

  2. What is the “Invisible Web”? • Materials that general search engines cannot or WILL not include in their collections of web pages (indexes) • Materials you cannot find through general search engines • Contains a vast amount of information • much of it authoritative and of high quality • much of it specialized

  3. Why do search engines miss things? • Size: the Web is huge, engines cannot cover it all • Economics: associated costs are high • also pay per crawl & rank • Technical: still limited capabilities • Spam: eliminating the bad also loses the good • Restrictions: some sites do not let crawlers in • Deep structure: some sites are complex
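The "Restrictions" point can be illustrated with a robots.txt file, one widely used way for a site to tell crawlers which paths to stay out of. The slide does not name a mechanism, so treating robots.txt as the example is an assumption, and the rules, crawler name and URLs below are all hypothetical. A minimal sketch using Python's standard `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def allowed_paths(robots_txt, user_agent, urls):
    """Return {url: True/False} saying whether user_agent may fetch each URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    sample_urls = [
        "http://example.org/index.html",
        "http://example.org/private/report.html",
        "http://example.org/search?q=web",
    ]
    for url, ok in allowed_paths(ROBOTS_TXT, "ExampleBot", sample_urls).items():
        print("crawl" if ok else "skip ", url)
```

Pages under a disallowed path never reach the engine's index, which is one way they end up in the invisible web.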

  4. Web size - who knows? • Web Characterization Project - OCLC • provides statistics about the web • 1998: 2.8 million, 2002: 9.04 million web sites (IP addresses) • In 2002: 35% public, 29% private, 36% provisional sites • Public sites (2002): • 55% US, 7% German, 6% Japanese, 3% each French, Spanish, 2% each Italian, Dutch, Chinese, 1% each Korean, Russian, Polish, Portuguese • Adult sites (2002): 3.3% • IP address volatility - all sites (disappearance pattern): • 13% of the 2002 sites were also present in 1998; 51% were also present in 2001
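To put the OCLC figures in perspective, the short calculation below derives two numbers the slide implies but does not state: the average yearly growth in web sites between 1998 and 2002, and the number of public sites in 2002. The growth figure assumes steady compound growth over the four years, which is a simplification and not something the slide claims.

```python
# Figures quoted on the slide (OCLC Web Characterization Project).
SITES_1998_MILLIONS = 2.8
SITES_2002_MILLIONS = 9.04
PUBLIC_SHARE_2002 = 0.35


def annual_growth_rate(start, end, years):
    """Compound annual growth rate, assuming steady year-over-year growth."""
    return (end / start) ** (1 / years) - 1


def public_sites_millions(total_millions, public_share):
    """Number of public sites (in millions) given the total and the public share."""
    return total_millions * public_share


if __name__ == "__main__":
    rate = annual_growth_rate(SITES_1998_MILLIONS, SITES_2002_MILLIONS, 2002 - 1998)
    public = public_sites_millions(SITES_2002_MILLIONS, PUBLIC_SHARE_2002)
    print(f"Implied average growth: {rate:.1%} per year")
    print(f"Public sites in 2002: {public:.2f} million")
```

Under that assumption the number of sites grew by roughly a third each year, and only about 3.16 million of the 9.04 million sites in 2002 were public.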

  5. How do search engines work? • Crawlers, spiders: go out to find • new & changed sites; periodic, not for each query • Databases, caches: • gather content; could be submitted, bought • Indexing: creating appropriate entries • various, mostly proprietary algorithms • Retrieval engine: searching on the basis of a query • Interface: gathers query, displays results • results could be ordered by payment
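The indexing and retrieval steps above can be sketched with a toy inverted index. Real engines use proprietary and far more elaborate algorithms, as the slide notes, so this is only an illustration of the idea. The page texts, URLs and function names are hypothetical stand-ins for content a crawler would have gathered.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexing: map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Retrieval: return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical crawled pages (stand-ins for what a crawler would gather).
PAGES = {
    "http://example.org/a": "The invisible web holds specialized databases",
    "http://example.org/b": "Search engines crawl the visible web",
    "http://example.org/c": "Library catalogs and subject databases",
}

if __name__ == "__main__":
    index = build_index(PAGES)
    print(sorted(search(index, "web databases")))
```

The query is answered from the index alone, which is why a page the crawler never reached cannot appear in the results, however relevant it is.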

  6. Search engines differ • Substantial differences among search engines on each aspect • Information about search engines: • Search Engine Watch • ratings, news, statistics, charts • Search Engine Showdown • run by a librarian, news links, ratings • Extreme Searcher • update of a popular book

  7. Search engine coverage • No engine covers more than 16% of the WWW • Hard to discern & compare coverage • Many national search engines - each with its own coverage • Many topical search engines - each with its own coverage • Many comprehensive sources independent of search engines

  8. Specialized sources • Meta search engines • Specialized engines & catalogs • Domain (subject) engines & catalogs • Reference sources • Libraries as web sources • Virtual libraries • Subject databases • Societies, organizations

  9. Meta search engines • Search engines that cover search engines: • CDNET Search.com • meta engine of meta engines • Dogpile - results from a number of search engines • Surfwax - gives statistics and text sources • Search Engines Worldwide • 174 countries, over 1300 engines • Search Engine Guide – categorized by topic
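A meta search engine has to combine ranked result lists coming from several underlying engines. The slides do not say how any of the listed services do this, so the sketch below shows just one simple possibility: round-robin interleaving with duplicate removal. The engine result lists are invented for illustration.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked lists round-robin, keeping the first copy of each URL."""
    merged = []
    seen = set()
    # zip_longest pads shorter lists with None so every rank position is visited.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from two underlying engines.
ENGINE_A = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
ENGINE_B = ["http://example.org/2", "http://example.org/4"]

if __name__ == "__main__":
    for rank, url in enumerate(merge_results([ENGINE_A, ENGINE_B]), start=1):
        print(rank, url)
```

Because the engines' indexes only partly overlap, the merged list can reach pages that any single engine misses, which is the main appeal of meta search.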

  10. meta engines … (cont.) • Vivisimo • clusters results; innovative • Complete Planet • over 100,000 databases & search engines • Invisible Web • resources and individual questions • Webbrain • results in tree structure – fun to use

  11. Domain engines & catalogs • Cover general & specific areas • Open Directory Project – large edited catalog of the web – global, run by volunteers • Nat. Acad. of Sciences of Belarus – interesting WWW sites about science • BUBL LINK – selected Internet resources covering all academic subject areas – UK • Profusion – search in categories

  12. domain engines … • Exist in many domains & subjects – rich! • Psychcrawler – Amer. Psychological Association • web index for psychology • Entrez PubMed – Nat. Library of Medicine • CiteSeer – NEC Research Center • scientific literature, citation index – free • Think Quest – an international organization • education resources, programs

  13. domain engines … • KIRKE - Katalog der Internetressourcen für die Klassische Philologie aus Erlangen (catalog of Internet resources for classical philology, from Erlangen) • a variety of resources • Perseus Digital Library – Tufts University • covers antiquity to Renaissance • Sch of Slavonic & East European Studies, University College London • includes country resources, e.g. Croatia • U Mich Documents Center • official documents from all over the world

  14. Reference services • Reference services - several models • Q&A, directories, email answers etc. • Martindale’s Reference Desk • comprehensive, amazing; also a health desk • Ask Jeeves! • most popular, commercial • Ask ERIC • education questions – email answers • Information Please • almanac-type questions

  15. reference … • Digital reference - new service area for libraries • QuestionPoint – L of Congress & OCLC • project for a global reference network • Virtual Reference Desk – L of Congress • compilation of web reference sites • LiveRef - maintained at Iowa State U • a registry of real-time digital reference services

  16. Libraries as web sources • Academic libraries providing open collections & services; models vary • Rutgers libraries - big long term effort • University of California, Berkeley • a most elaborate effort together with Sun Corporation • Bibliothèque Nationale de France • includes virtual exhibitions, among others

  17. Virtual libraries on the Web • Libraries emerging only on the Web • Virtual Library • Switzerland, US, UK & other countries – ‘oldest virtual library on the Web’ • Internet Public Library – Michigan • also a long-term effort • Librarians Index to the Internet • very popular and comprehensive

  18. virtual libraries … • Academic Info Digital Library • many links to digital collections & resources in various subjects • Gabriel • Gateway to European National Libraries • Museum of online museums • a delight

  19. Subject databases • Many subject specific sites • rich & often unique coverage & services • different approaches & requirements • Examples in health related domains: • WebMDHealth – news, medical information • Rxlist - The Internet Drug Index • Mayo Clinic HealthOasis – health advice

  20. Societies, organizations • Great many rich sources for searching • differences in requirements, depth, richness • Examples from a variety of organizations: • Assoc. for Computing Machinery • Digital Library; subscription or registration • US State Department • about the U.S. & other countries • Genealogy – Church of Latter-day Saints • most comprehensive historical list of records

  21. Language barriers on the Web • English still the major language • but declining, now slightly over 50% • Multilingual retrieval search engines • Euroseek • searches in a number of languages • All the Web • results in 45 languages

  22. Language barriers: translations • A number of translation sites • machine aided – i.e. plug in terms, phrases, sentences in one language & review them in the other, but effectiveness??? • Free Translations • from/to English & 8 other languages • Babel Fish • from/to English and 9 languages, translates URLs • Travlang • great for travelers, but annoying commercials

  23. Web news; keeping up • What is going on on the Web? • Some major sources of news and evaluations: • Free Pint – newsletter, articles, links • Internet Resources Newsletter – UK based • ResearchBuzz – daily updates; many aspects • About.com Web Search – tools, Web Search Forum • Resource Shelf – newsletter with archive

  24. keeping up … • Book: Chris Sherman & Gary Price (2001). The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Information Today • Site: Invisible Web • provides up-to-date information

  25. Evaluations, ratings • Many sources evaluate web sites: • The Scout Report • librarians’ BIBLE! Annotations. Comprehensive. • Medical Library Assoc. – ten most useful sites; • MLA user guide for health information, recommendations • Web 100 – commercial, user ratings, news • Evaluating web pages – UC Berkeley • tutorial and guide

  26. Archiving the web • Internet Archive – a large undertaking • includes web archive & lots more, publicly available & free • 10 billion web pages archived from 1996 to a few months ago • Wayback Machine – search to look at old versions of web pages • But there is more, e.g.: • Million Book Project • International Children’s Digital Library

  27. Needed for Web searching • Knowledge & competencies on • variety of web sources & their organization • search engines • web search strategies • search dynamics, feedback • Keeping up & up & up • constant updates, changes, innovations • many domain/subject specific

  28. Needed for Web searching by professionals • Knowledge of SOURCES in area of interest • search engines not enough • not too helpful in finding these other sources; structure hard to discern • Evaluation of sources • a key professional skill! • standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability

  29. Needed competencies … • Knowledge of users & use • Knowledge of searching • Use of technology • Adaptability, flexibility • Integration with other resources • Teaching others • Constant learning & update • keeping up, keeping up, keeping up

  30. But now really: How to do it? [diagram with labels: information, WWW]

  31. Web is still a mystery!

  32. hvala thank you ďakujem vám danke merci grazie gracias

  33. P.S. a few weird sites… • SelectSmart.com • all kinds of quizzes for you • James Dean official web site • Deaducated • Dead Librarians’ Society • Livejournal • blogs & authoring tools

  34. Sources • About.com Web Search http://websearch.about.com • Academic Info Digital Library http://www.academicinfo.net/digital.html • All the Web http://www.alltheweb.com/ • Ask ERIC http://www.askeric.org/Qa/ • Ask Jeeves! http://www.ask.com/ • Assoc. for Computing Machinery http://www.acm.org/ • Babel Fish http://babelfish.altavista.com/tr • Bibliothèque Nationale de France http://www.bnf.fr/ • BUBL LINK http://bubl.ac.uk/link/ • CDNET Search.com http://www.search.com/ • CiteSeer http://citeseer.nj.nec.com/ • Complete Planet http://completeplanet.com • Deaducated http://www.geocities.com/deadlibrarians/ • Dogpile http://www.dogpile.com/ • Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/ • Extreme Searcher http://www.extremesearcher.com/ • Free Pint http://www.freepint.com/

  35. sources … • Free Translations http://www.freetranslations.com • Gabriel http://www.kb.nl/gabriel/ • Genealogy http://www.familysearch.org/ • Information Please http://www.infoplease.com/ • International Children’s Digital Library http://www.icdlbooks.org/ • Internet Archive http://www.archive.org/ • Internet Public Library, Michigan http://www.ipl.org/ • Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/ • Invisible Web http://invisibleweb.com • James Dean http://www.jamesdean.com/ • KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html • Librarians Index to the Internet http://lii.org/ • Livejournal http://www.livejournal.com/ • LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm • Martindale’s Reference Desk http://www-sci.lib.uci.edu/~martindale/Ref.html • Mayo Clinic http://www.mayohealth.org/

  36. sources … • Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html • Medical Library Assoc. user guide for health information http://www.mlanet.org/resources/userguide.html • Medscape http://www.medscape.com/ • Million Book Project http://www.archive.org/texts/millionbooks.php • Museum of online museums http://www.coudal.com/moom.php • Nat Acad Sciences, Belarus http://www.ac.by/science/index.html • OCLC Web Characterization Project http://wcp.oclc.org/ • Open Directory Project http://dmoz.org • Perseus Digital Library http://www.perseus.tufts.edu/ • Profusion http://www.profusion.com/ • Psychcrawler http://www.psychcrawler.com/ • QuestionPoint http://www.questionpoint.org/ • ResearchBuzz http://www.researchbuzz.com/index.shtml • Resource Shelf http://resourceshelf.blogspot.com/ • Rutgers Libraries http://www.libraries.rutgers.edu/ • RxList http://www.rxlist.com/

  37. sources … • Sch of Slavonic & East European Studies http://www.ssees.ac.uk/dirctory.htm • Search Engine Guide http://www.searchengineguide.com/ • Search Engine Showdown http://searchengineshowdown.com/ • Search Engine Watch http://searchenginewatch.com/ • Search Engines Worldwide http://www.twics.com/~takakuwa/search/search.html • SelectSmart.com http://www.selectsmart.com/home.html • Surfwax http://www.surfwax.com/ • The Scout Report http://scout.cs.wisc.edu/ • Think Quest http://www.thinkquest.org/ • Travlang http://www.travlang.com • U California Berkeley http://sunsite.berkeley.edu/ • U Mich Documents Center http://www.lib.umich.edu/govdocs/ • US State Department http://www.state.gov/ • Virtual Library http://vlib.org • Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html • Vivisimo http://vivisimo.com • Web 100 http://www.web100.com • Webbrain http://www.webbrain.com/html/default_win.html • WebMD http://my.webmd.com/webmd_today/home/default
