The Invisible Web - finding things that are hard to find - Tefko Saracevic, PhD, Rutgers University http://www.scils.rutgers.edu/~tefko (also contains a list of sites relevant to the topic and this presentation) © Tefko Saracevic, Rutgers University
What is the “Invisible Web”? • Materials that general search engines cannot or WILL not include in their collection of web pages (indexes) • Materials you cannot find through general search engines • Contains a vast amount of information • much of it authoritative and of high quality • much of it specialized
Why do search engines miss things? • Size: the Web is huge, engines cannot cover it all • Economics: associated costs are high • also pay per crawl & rank • Technical: still limited capabilities • Spam: eliminating the bad also loses the good • Restrictions: some sites do not let crawlers in • Deep structure: some sites are complex
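A minimal sketch of the "Restrictions" point above, assuming the site uses a robots.txt file, which is one common way a site tells crawlers to stay out of certain paths. It uses Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleBot, and the example.org URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Pages under /private/ are closed to crawlers, so a general search engine
# that honors the file never indexes them.
for page in ("http://example.org/index.html",
             "http://example.org/private/report.html"):
    print(page, "->", "crawl" if allowed(ROBOTS_TXT, "ExampleBot", page) else "skip")
```

Pages excluded this way can still be reached by a person who knows the address, which is one reason such material ends up in the invisible Web.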
Web size - who knows? • Web Characterization Project - OCLC • provides statistics about the web • 1998: 2.8 million; 2002: 9.04 million web sites (IP addresses) • In 2002: 35% public, 29% private, 36% provisional sites • Public sites (2002): • 55% US, 7% German, 6% Japanese, 3% each French, Spanish, 2% each Italian, Dutch, Chinese, 1% each Korean, Russian, Polish, Portuguese • Adult sites (2002): 3.3% • IP address volatility - all sites (disappearance pattern): • 13% of sites in 2002 were also in 1998; 51% in 2001
How do search engines work? • Crawlers, spiders: go out to find • new & changed sites; periodic, not for each query • Databases, caches: • gather content; could be submitted, bought • Indexing: creating appropriate entries • various, mostly proprietary algorithms • Retrieval engine: searching on the basis of a query • Interface: gathers query, displays results • results could be ordered by payment
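The indexing and retrieval steps above can be sketched as a toy inverted index. This is only an illustration under stated assumptions: real engines use far more elaborate, mostly proprietary algorithms, and the page texts, URLs, and function names here are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for content a crawler has already gathered.
PAGES = {
    "http://example.org/a": "Invisible web resources for librarians",
    "http://example.org/b": "Search engines index the visible web",
    "http://example.org/c": "Librarians evaluate web resources",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexing: map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Retrieval: return the sorted URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(PAGES)
print(search(index, "web resources"))
```

The index is built once, ahead of time, which matches the slide's point that crawling is periodic and not done for each query. A page the crawler never gathered simply has no entries and cannot be retrieved.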
Search engines differ • Substantial differences among search engines on each aspect • Information about search engines: • Search Engine Watch • ratings, news, statistics, charts • Search Engine Showdown • run by a librarian, news links, ratings • Extreme Searcher • update of a popular book
Search engine coverage • No engine covers more than 16% of the WWW • Hard to discern & compare coverage • Many national search engines - own coverage • Many topical search engines – own coverage • Many comprehensive sources independent of search engines
Specialized sources • Meta search engines • Specialized engines & catalogs • Domain (subject) engines & catalogs • Reference sources • Libraries as web sources • Virtual libraries • Subject databases • Societies, organizations
Meta search engines • Search engines that cover search engines: • CNET Search.com • meta engine of meta engines • Dogpile - results from a number of search engines • Surfwax - gives statistics and text sources • Search Engines Worldwide • 174 countries, over 1300 engines • Search Engine Guide – categorized by topic
meta engines … (cont.) • Vivisimo • clusters results; innovative • Complete Planet • over 100,000 databases & search engines • Invisible Web • resources and individual questions • Webbrain • results in tree structure – fun to use
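A rough sketch of what a meta search engine has to do with the result lists it gets back from several engines: merge them and remove duplicates. The scoring used here, summing 1/rank across engines, is an assumption chosen for simplicity and is not the method of any engine named above. The engine result lists and URLs are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list.

    Each URL scores 1/rank in every list where it appears; the scores are
    summed, so URLs returned near the top by several engines rise first.
    Ties are broken alphabetically to keep the output deterministic.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked results from two underlying engines.
engine_a = ["http://example.org/x", "http://example.org/y", "http://example.org/z"]
engine_b = ["http://example.org/y", "http://example.org/x", "http://example.org/w"]

for url in merge_results([engine_a, engine_b]):
    print(url)
```

A meta engine can only return what its underlying engines have indexed, so it widens coverage across engines without reaching material that none of them include.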
Domain engines & catalogs • Cover general & specific areas • Open Directory Project – large edited catalog of the web – global, run by volunteers • Nat. Acad. of Sciences of Belarus - interesting WWW sites about science • BUBL LINK - selected Internet resources covering all academic subject areas – UK • Profusion – search in categories
domain engines … • Exist in many domains & subjects – rich! • Psychcrawler - American Psychological Association • web index for psychology • Entrez PubMed – National Library of Medicine • CiteSeer - NEC Research Center • scientific literature, citations index - free • Think Quest – an international organization • education resources, programs
domain engines … • KIRKE - Katalog der Internetressourcen für die Klassische Philologie aus Erlangen • a variety of resources • Perseus Digital Library - Tufts University • covers antiquity to renaissance • Sch of Slavonic & East European Studies, University College London • includes country resources, e.g. Croatia • U Mich Documents Center • official documents from all over the world
Reference services • Reference services - several models • Q&A, directories, email answers etc. • Martindale’s Reference Desk • comprehensive, amazing; also a health desk • Ask Jeeves! • most popular, commercial • Ask ERIC • education questions - email answers • Information Please • almanac type questions
reference … • Digital reference - new service area for libraries • QuestionPoint - Library of Congress & OCLC • project for a global reference network • Virtual Reference Desk – Library of Congress • compilation of web reference sites • LiveRef - maintained at Iowa State U • a registry of real time digital reference services
Libraries as web sources • Academic libraries providing open collections & services; models vary • Rutgers libraries - big long term effort • University of California, Berkeley • a most elaborate effort together with Sun Corporation • Bibliothèque Nationale de France • includes virtual exhibitions, among others
Virtual libraries on the Web • Libraries emerging only on the Web • Virtual Library – • Switzerland, US, UK & other countries – ‘oldest virtual library on the Web’ • Internet Public Library - Michigan • also a long term effort • Librarians’ Index to the Internet • very popular and comprehensive
virtual libraries … • Academic Info Digital Library • many links to digital collections & resources in various subjects • Gabriel • Gateway to European National Libraries • Museum of online museums • a delight
Subject databases • Many subject specific sites • rich & often unique coverage & services • different approaches & requirements • Examples in health related domains: • WebMD Health – news, medical information • RxList - The Internet Drug Index • Mayo Clinic HealthOasis – health advice
Societies, organizations • Great many rich sources for searching • differences in requirements, depth, richness • Examples from a variety of organizations: • Assoc. for Computing Machinery • Digital Library; subscription or registration • US State Department • about the U.S. & other countries • Genealogy – Church of Jesus Christ of Latter-day Saints • most comprehensive historical list of records
Language barriers on the Web • English still the major language • but declining, now slightly over 50% • Multilingual retrieval search engines • Euroseek • searches in a number of languages • All the Web • results in 45 languages
Language barriers: translations • A number of translation sites • machine aided – i.e. plug in terms, phrases, sentences in one language & review them in the other, but how effective? • Free Translations • from and to English & 8 other languages • Babel Fish • from and to English and 9 languages, translates URLs • Travlang • great for travelers, but annoying commercials
Web news; keeping up • What is going on on the Web? Some major sources of news and evaluations: • Free Pint – newsletter, articles, links • Internet Resources Newsletter – UK based • ResearchBuzz – daily updates; many aspects • About.com Web Search – tools, Web Search Forum • Resource Shelf – newsletter with archive
keeping up … • Book: Chris Sherman & Gary Price (2001). The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Information Today • Site: Invisible Web • provides up to date information
Evaluations, ratings • Many sources evaluate web sites: • The Scout Report – • librarians’ BIBLE! Annotations. Comprehensive. • Medical Library Assoc. – ten most useful sites; • MLA user guide for health information, recommendations • Web 100 – commercial, user ratings, news • Evaluating web pages - UC Berkeley • tutorial and guide
Archiving the web • Internet Archive – a large undertaking • includes web archive & lots more publicly available & free • 10 billion web pages archived from 1996 to a few months ago • Wayback Machine – search to look at old versions of web pages • But there is more, e.g.: • Million Book Project • International Children’s Digital Library
Needed for Web searching • Knowledge & competencies on • variety of web sources & their organization • search engines • web search strategies • search dynamics, feedback • Keeping up & up & up • constant updates, changes, innovations • many domain/subject specific
Needed for Web searching by professionals • Knowledge of SOURCES in area of interest • search engines not enough • not too helpful in finding these other sources; structure hard to discern • Evaluation of sources • a key professional skill! • standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
Needed competencies … • Knowledge of users & use • Knowledge of searching • Use of technology • Adaptability, flexibility • Integration with other resources • Teaching others • Constant learning & update • keeping up, keeping up, keeping up
But now really: How to do it? [Figure labels: information, WWW]
Web is still a mystery!
hvala thank you ďakujem vám danke merci grazie gracias
P.S. a few weird sites… • SelectSmart.com • all kinds of quizzes for you • James Dean official web site • Deaducated • Dead Librarians’ Society • Livejournal • blogs & authoring tools
Sources • About.com Web Search http://websearch.about.com • Academic Info Digital Library http://www.academicinfo.net/digital.html • All the Web http://www.alltheweb.com/ • Ask ERIC http://www.askeric.org/Qa/ • Ask Jeeves! http://www.ask.com/ • Assoc. for Computing Machinery http://www.acm.org/ • Babel Fish http://babelfish.altavista.com/tr • Bibliothèque Nationale de France http://www.bnf.fr/ • BUBL LINK http://bubl.ac.uk/link/ • CNET Search.com http://www.search.com/ • CiteSeer http://citeseer.nj.nec.com/ • Complete Planet http://completeplanet.com • Deaducated http://www.geocities.com/deadlibrarians/ • Dogpile http://www.dogpile.com/ • Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/ • Extreme Searcher http://www.extremesearcher.com/ • Free Pint http://www.freepint.com/
sources … • Free Translations http://www.freetranslations.com • Gabriel http://www.kb.nl/gabriel/ • Genealogy http://www.familysearch.org/ • Information Please http://www.infoplease.com/ • International Children’s Digital Library http://www.icdlbooks.org/ • Internet Archive http://www.archive.org/ • Internet Public Library, Michigan http://www.ipl.org/ • Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/ • Invisible Web http://invisibleweb.com • James Dean http://www.jamesdean.com/ • KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html • Librarians’ Index to the Internet http://lii.org/ • Live Journal http://www.livejournal.com/ • LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm • Martindale’s Reference Desk http://www-sci.lib.uci.edu/~martindale/Ref.html • Mayo Clinic http://www.mayohealth.org/
sources … • Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html • Medical Library Assoc. user guide for health information http://www.mlanet.org/resources/userguide.html • Medscape http://www.medscape.com/ • Million Book Project http://www.archive.org/texts/millionbooks.php • Museum of online museums http://www.coudal.com/moom.php • Nat Acad Sciences, Belarus http://www.ac.by/science/index.html • OCLC Web Characterization Project http://wcp.oclc.org/ • Open Directory Project http://dmoz.org • Perseus Digital Library http://www.perseus.tufts.edu/ • Profusion http://www.profusion.com/ • Psychcrawler http://www.psychcrawler.com/ • QuestionPoint http://www.questionpoint.org/ • ResearchBuzz http://www.researchbuzz.com/index.shtml • Resource Shelf http://resourceshelf.blogspot.com/ • Rutgers Libraries http://www.libraries.rutgers.edu/ • RxList http://www.rxlist.com/
sources … • Sch of Slavonic & East European Studies http://www.ssees.ac.uk/dirctory.htm • Search Engine Guide http://www.searchengineguide.com/ • Search Engine Showdown http://searchengineshowdown.com/ • Search Engine Watch http://searchenginewatch.com/ • Search Engines Worldwide http://www.twics.com/~takakuwa/search/search.html • SelectSmart.com http://www.selectsmart.com/home.html • Surfwax http://www.surfwax.com/ • The Scout Report http://scout.cs.wisc.edu/ • Think Quest http://www.thinkquest.org/ • Travlang http://www.travlang.com • U California Berkeley http://sunsite.berkeley.edu/ • U Mich Documents Center http://www.lib.umich.edu/govdocs/ • US State Department http://www.state.gov/ • Virtual Library http://vlib.org • Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html • Vivisimo http://vivisimo.com • Web 100 http://www.web100.com • Webbrain http://www.webbrain.com/html/default_win.html • WebMD http://my.webmd.com/webmd_today/home/default