Introduction to Digital Libraries: Distributed Searching
Web Search
• Distributed Data: documents spread over millions of different web servers.
• Volatile Data: many documents change or disappear rapidly (e.g. dead links).
• Unstructured and Redundant Data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
• Quality of Data: no editorial control, false information, poor-quality writing, typos, etc.
Dead End Searches
• Sometimes pages become either temporarily or permanently inactive.
• [Screenshots: Internet Explorer and Netscape "inactive page" messages.]
• 43 million web servers
• 167 Terabytes of data
The Web (Corpus) by the Numbers (2)
• 1 Kilobyte = a very short story ("Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his crown and Jill came tumbling after.")
• 1 Megabyte = a short book
• 1 Gigabyte = 20 meters of shelved books
• 1 Terabyte = an academic research library
• 35 Terabytes of text on the surface Web? That is 35 academic research libraries (with some 20,000 meters of shelved books each!)
Overlap Among 3 Major Search Engines
http://missingpieces.dogpile.com/whitepaper.pdf
http://comparesearchengines.dogpile.com/OverlapAnalysis.pdf
Metasearch, or…
• parallel search
• federated search
• broadcast search
• cross-database search
The Hidden Web of content databases is estimated to be thousands of times larger than the Open Web.
The (non) Metasearch
Search 1:
- Find search engine
- Logon
- Compose search
- Run search
- Study results
- Refine results
- Find document
- Get document
Search 2:
- repeat all of the above steps with a second search engine
Search 3:
- …and so on, for each additional source
The Metasearch
Search 1:
- Find metasearch engine
- Logon
- Compose search
- Select sources
- Run search (the metasearch engine runs Searches 1a, 1b, 1c against the individual sources)
- Study results
- Refine results
- Find document
- Get document
Spiders (Robots/Bots/Crawlers) • A program that automatically fetches Web pages. • Spiders are used to feed pages to search engines. • It's called a spider because it crawls over the Web. Another term for these programs is webcrawler.
Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URLs from which to begin the crawl.
• Follow all links on these pages recursively to find additional pages.
• Index/process each novel page in an inverted index as it is encountered.
• May allow users to directly submit pages to be indexed (and crawled from).
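The steps above can be sketched in a few lines. This is a minimal toy, not a production spider: the in-memory `PAGES` dictionary stands in for real HTTP fetching and link extraction, and the page texts and URLs are invented for illustration.

```python
from collections import deque

# A tiny in-memory "web": page URL -> (text, outgoing links).
# This stands in for real HTTP fetching and HTML link extraction.
PAGES = {
    "http://a.example/": ("digital libraries", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("distributed searching", ["http://a.example/"]),
    "http://c.example/": ("web search", []),
}

def crawl(roots):
    """Follow links recursively from the root URLs, indexing each novel page."""
    inverted_index = {}          # term -> set of URLs containing it
    seen = set(roots)
    frontier = deque(roots)
    while frontier:
        url = frontier.popleft()
        text, links = PAGES.get(url, ("", []))
        for term in text.split():        # index the page as it is encountered
            inverted_index.setdefault(term, set()).add(url)
        for link in links:               # enqueue only novel links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return inverted_index

index = crawl(["http://a.example/"])
```

The `seen` set is what keeps the recursive link-following from looping forever on cyclic links (note page b links back to page a).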
Parts of the Internet Search Tool
• The Spider searches the WWW and adds sites to the Database.
• The Database is kept filled and updated by the Spider.
• The Interface views the selected Database items.
Search Strategies Breadth-first Search
Search Strategies (cont) Depth-first Search
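The two strategies differ only in how the crawl frontier is consumed: breadth-first pops from the front (FIFO), depth-first from the back (LIFO). A small sketch over a toy link graph (the graph itself is an illustrative assumption):

```python
def traverse(graph, root, strategy="breadth"):
    """Visit pages in breadth-first (FIFO queue) or depth-first (LIFO stack) order."""
    frontier = [root]
    seen = {root}
    order = []
    while frontier:
        # The only difference between the two strategies is this one line.
        node = frontier.pop(0) if strategy == "breadth" else frontier.pop()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
```

Breadth-first visits all pages one link away before going deeper; depth-first follows a chain of links as far as it can before backtracking.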
Search Pollution
Search for Subcategories
Metasearching
Three functions of a metasearcher:
• choosing the sources to query (the source-metadata problem)
• dispatching the query to those sources (the query-language problem)
• merging the query results (the rank-merging problem)
Source-Metadata Problem
• How do you choose which sources to query?
• Manual selection: applicable only for small numbers of sources.
• Automatic selection: how does the metasearcher "know" the nature of the various sources? What if there are thousands of different sources?
• Approaches:
- extract enough of each source's publicly available information and guess
- have the source explicitly export a description of itself
Query-Language Problem
Different remote sources use different search engines with different syntaxes:
• Boolean vs. vector models
• even if all sources support Boolean queries, the syntax may still differ
• different field names for fielded searching
• stemming vs. no stemming
• different stop lists
Rank-Merging Problem
• Each source ranks its results, but the ranking is valid only locally.
• There is no global scale to rank against, so ranked results cannot be "shuffled" together meaningfully.
• Issues:
- proprietary ranking algorithms (AltaVista, Infoseek, etc.)
- even if the algorithms are known, or even homogeneous, the collection a document comes from affects its ranking
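Because local scores are incomparable, metasearchers often fall back on per-source score normalization before interleaving. A minimal sketch using min-max normalization; the document IDs, scores, and the choice of min-max scaling are all illustrative assumptions, not a method from the slides:

```python
def normalize(results):
    """Rescale one source's local scores onto [0, 1] via min-max normalization."""
    scores = [score for _doc, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:                          # degenerate case: all scores equal
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (s - lo) / (hi - lo)) for doc, s in results]

def merge(*sources):
    """Normalize each source's ranked list, then sort the union by score."""
    merged = []
    for results in sources:
        merged.extend(normalize(results))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

source_a = [("a1", 950.0), ("a2", 400.0)]   # engine A: arbitrary local scale
source_b = [("b1", 0.9), ("b2", 0.1)]       # engine B: a very different scale
ranking = merge(source_a, source_b)
```

Note this only papers over the scale mismatch; as the slide says, it does nothing about the fact that different collections (and unknown proprietary algorithms) make even normalized scores incomparable.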
Document Classification
Test document: "planning language proof intelligence"
Classes: (AI) (Programming) (HCI), with subcategories such as Planning, Semantics, Garb.Coll., Multimedia, GUI, ML
Training data (terms per class):
• ML: learning, intelligence, algorithm, reinforcement, network, …
• Planning: planning, temporal, reasoning, plan, language, …
• Semantics: programming, semantics, language, proof, …
• Garb.Coll.: garbage, collection, memory, optimization, region, …
(Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb.Coll.)
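One standard technique for this setup (not named on the slide, so treat it as an illustrative choice) is multinomial naive Bayes with add-one smoothing. The sketch below uses three of the slide's classes with their toy training terms and uniform class priors:

```python
import math
from collections import Counter

# Toy training data mirroring the slide: class -> training terms.
TRAIN = {
    "Planning":  "planning temporal reasoning plan language".split(),
    "Semantics": "programming semantics language proof".split(),
    "ML":        "learning intelligence algorithm reinforcement network".split(),
}

def classify(query_terms, train=TRAIN):
    """Multinomial naive Bayes with add-one smoothing and uniform priors."""
    vocab = {t for terms in train.values() for t in terms}
    best_class, best_score = None, -math.inf
    for label, terms in train.items():
        counts = Counter(terms)
        total = len(terms)
        # Log-probability of the query under this class's term distribution.
        score = sum(
            math.log((counts[t] + 1) / (total + len(vocab)))
            for t in query_terms
        )
        if score > best_score:
            best_class, best_score = label, score
    return best_class

label = classify("planning language proof intelligence".split())
```

The test document deliberately mixes terms from several classes; Semantics wins here because it matches two query terms ("language", "proof") from a smaller training set, which illustrates why these toy counts are so sensitive.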
Clustering of Text • Cluster documents on basis of terms they contain • Cluster documents on basis of co-occurring citations • Cluster terms on basis of documents they occur in
Methods (1)
• Manual classification
- used by Yahoo!, Looksmart, about.com, ODP, Medline
- very accurate when the job is done by experts
- consistent when the problem size and team are small
- difficult and expensive to scale
• Automatic document classification
- hand-coded rule-based systems, used by spam filters, Reuters, CIA, Verity, …
- e.g., assign a category if the document contains a given Boolean combination of words
- commercial systems have complex query languages (everything in ordinary query languages, plus accumulators)
Cross-Language IR • Accepting questions in one language, retrieving information in a variety of other languages • “questions”: web-style, or full-text narratives • “information”: web-sites, news, articles, speech, … • Why is it useful? • infeasible to translate collection into every language • feasible to translate selected docs from the ranked list • bilingual users prefer to see the original language • convenient to retrieve in both languages with a single query • CL-IR provides useful insights for conventional IR
The General Problem
• Find documents written in any language
• using queries expressed in a single language
Top Ten Languages on the Web
[Chart: internetworldstats.com, March 2011]
Supply Side: Internet Hosts
Guess: what will be the most widely used language on the Web in 2010?
[Source: Network Wizards Jan 99 Internet Domain Survey]
[Charts: Web Pages; Global Internet Users]
Search Technology
[Diagram: a Chinese query passes through language identification; Chinese documents get Chinese feature assignment and monolingual matching (results 1: 0.72, 2: 0.48), while English documents get English feature assignment and cross-language matching (results 3: 0.91, 4: 0.57, 5: 0.36).]
Language Identification
• Can be specified using metadata (included in HTTP and HTML).
• Can be determined using word-scale features: which dictionary gets the most hits?
• Can be determined using subword features.
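The "which dictionary gets the most hits?" heuristic can be sketched in a few lines. The tiny wordlists below are illustrative assumptions only; a real identifier would use large dictionaries or character n-gram models:

```python
# Tiny per-language dictionaries; real systems use much larger wordlists
# or character n-gram models. These sets are illustrative only.
DICTIONARIES = {
    "english": {"the", "and", "of", "library", "search"},
    "french":  {"le", "et", "de", "la", "recherche"},
    "german":  {"der", "und", "die", "das", "suche"},
}

def identify_language(text):
    """Pick the language whose dictionary gets the most hits on the text."""
    words = text.lower().split()
    hits = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in DICTIONARIES.items()
    }
    return max(hits, key=hits.get)

lang = identify_language("the search engine and the library")
```

This word-scale approach fails on very short texts, which is one reason the slide also lists metadata (HTTP/HTML language tags) and subword features as alternatives.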
Dealing with Morphology
• Consider Arabic: root + patterns + suffixes + prefixes = word (e.g., ktb + CiCaC = kitab)
• All verbs and nouns are derived from fewer than 2000 roots, but roots are too abstract for information retrieval:
ktb → kitab (book), kitabi (my book), alkitab (the book), kitabuki (your book, f.), kataba (to write), kitabuka (your book, m.), maktab (office), kitabuhu (his book), maktaba (library, bookstore), …
• Want stem = root + pattern + derivational affixes? No standard stemmers are available, only morphological (root) analyzers.
Citation Indexing
• Premise: authors already use citations; use these to navigate the corpus.
• Previously, subject indexes (and later, title indexes) were hand-crafted to organize the literature.
• Using citations, the literature organizes itself.
• Each paper contains ~15 citations: some fields more (e.g., biochemistry), some less (e.g., mathematics).
Full-Text Reference Linking
Who Is Who: Extended Services