This article explores the use of collaborative search and intelligent crawlers in traditional information retrieval systems. It discusses the challenges and solutions in representing information needs, indexing documents, and formulating queries. The importance of ranking and factors influencing it are also analyzed. Additionally, the indexing of web pages and the role of crawlers in discovering and downloading documents from the web are examined.
Collaborative Search Zheng Zhen
• Traditional IR • Web search • Crawlers: parallel crawler, intelligent crawler • Collaborative Search • References
Traditional IR System [Diagram components: User; Acquisition (documents, objects); Problem (information need); Representation (question); Representation (indexing, ...); Database of indexed documents; Query (search formulation); Matching (searching); Retrieved objects; Feedback]
Classic Information Retrieval • Homogeneous documents • Well categorized • ‘Small’ well-controlled collection • Closed, static environment • Controlled collection growth
Web Search • Web: - open, dynamic environment - vast uncontrolled collection of PAGES • Web page: - heterogeneous: various formats, languages … - content may change over time! • Importance of LINKS • Existing Search Facilities: • Generic: Yahoo, Ask Jeeves, Google, etc. • Specialized: Pluribus, Collaborative Spider
Common operations • Indexing - identifies potential index terms in documents • Query processing - form keywords • Search - access indexed file • Ranking
Ranking • Ranking is important • Factors which influence rank • Term location or frequency • Proximity to query terms • Date of publication • Length • Popularity • Heuristics: proper nouns may have higher weights • WWW: popularity from link analysis (e.g. Google)
The Web: indexing • Web pages are heterogeneous documents • Contain both text information and meta information • External meta information can be inferred • Must be processed before the pertinence can be established
Indexing WWW documents • Web pages require preprocessing to get a uniform data structure - Normalizes the document stream to a predefined format - Breaks the document stream into desired retrievable units - Isolates and metatags subdocument pieces [Diagram: page 1 ... page n from Web 1 ... Web n pass through preprocessing into a uniform format]
Computing weights • Assign weight to each descriptor for document & add to index • Weights are based on: • term frequency within the document (tf) • Global term frequency within the corpus • This will be a problem when using parallel independent agents to do indexing
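A minimal sketch of such a weighting, assuming a tf-idf style formula (the slide names only the two ingredients; the exact formula `tf * log(N / df)` and the function name `term_weights` are illustrative assumptions):

```python
import math
from collections import Counter

def term_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight tf * log(N / df) is a common
    tf-idf variant, chosen here only for illustration.
    """
    n_docs = len(docs)
    # Global statistic: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = [["web", "crawler", "web"], ["web", "search"], ["crawler", "agent"]]
w = term_weights(docs)
```

Note that `doc_freq` needs the whole corpus, which is why independent agents indexing in parallel would have to merge their counts before the weights are final.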
IR on Web [Diagram components: Query; Query Processor; Search & match; Indexed files; Page ranking; Responses; Browse; Document Processor; Web Crawlers; Web pages]
Web: Document discovery • Corpus is very large • Dynamic • Open • Documents must be discovered • …. use Web crawler
Web Crawler • What is a Crawler? [Diagram: init (initial URLs); get next URL (scheduled URLs); get page from the Web (visited URLs); extract URLs (web pages)]
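The loop in the diagram can be sketched as follows. This is a hedged illustration, not a production crawler: `fetch_links` and the in-memory `TOY_WEB` are hypothetical stand-ins for downloading a page and extracting its URLs, and breadth-first order is an assumption.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    scheduled = deque(seeds)   # "scheduled urls"
    seen = set(seeds)          # everything ever scheduled, to avoid repeats
    visited = []               # "visited urls"
    while scheduled and len(visited) < max_pages:
        url = scheduled.popleft()      # get next url
        links = fetch_links(url)       # get page, then extract urls
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                scheduled.append(link)
    return visited

# Toy in-memory "web" (hypothetical URLs) standing in for HTTP downloads.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```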
Parallel Crawler Advantages: • Faster • Imperative for large-scale crawling • Can be run on cheaper machines • Network load dispersion • Network load reduction [Diagram: Crawler1, Crawler2, ..., CrawlerN download from the Web into a shared set of downloaded Web pages] *Parallel Crawlers by Cho, Junghoo et al., University of California, WWW2002, Honolulu, Hawaii, USA
Evaluation Metrics • Overlap = 1 - (# of unique pages downloaded / total # of pages downloaded by the team of crawlers) • Coverage = # of pages downloaded by the parallel crawler / total # of reachable pages • Communication overhead = # of exchanged messages / # of page downloads
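A small sketch of the three metrics as defined on this slide. The function name, the argument layout and the sample numbers are illustrative assumptions; coverage is computed over unique pages, which is one reading of "pages downloaded by the parallel crawler".

```python
def crawl_metrics(downloads_per_crawler, reachable_pages, messages_exchanged):
    """Return (overlap, coverage, communication overhead)."""
    all_downloads = [url for d in downloads_per_crawler for url in d]
    unique = set(all_downloads)
    overlap = 1 - len(unique) / len(all_downloads)
    coverage = len(unique) / len(reachable_pages)
    overhead = messages_exchanged / len(all_downloads)
    return overlap, coverage, overhead

# Two crawlers; page "g" was downloaded twice and "d", "e" were never reached.
overlap, coverage, overhead = crawl_metrics(
    [["a", "b", "c", "g"], ["f", "g", "h", "i"]],
    reachable_pages=list("abcdefghi"),
    messages_exchanged=2,
)
```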
Assignment of search areas • Partitioning the Web • Address division: .net, .ca , UdeM.ca • Topic • Static assignment ( see next page) • Dynamic assignment (see multi-agent collaborative search)
Partition function Multitude of ways to partition the web • Site-hashing: based on the hash value of the site name of a URL • URL-hashing: based on the hash value of the entire URL • Hierarchical: partition the web hierarchically based on the URLs of the pages Partitioning will come up again with agents!
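A minimal sketch of the two hash-based partition functions. The helper names and the choice of MD5 (used only because it is stable across runs, not for security) are assumptions; the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(text, n_crawlers):
    """Map a string to a crawler index in [0, n_crawlers)."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers

def site_hash(url, n_crawlers):
    """Site-hashing: all pages of one site go to the same crawler."""
    return _bucket(urlsplit(url).hostname, n_crawlers)

def url_hash(url, n_crawlers):
    """URL-hashing: pages of one site are spread over the crawlers."""
    return _bucket(url, n_crawlers)

same_site = (site_hash("http://www.abc.com/a.html", 4),
             site_hash("http://www.abc.com/b/c.html", 4))
```

Site-hashing keeps intra-site links inside one partition, so fewer URLs need to cross between crawlers than with URL-hashing.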
Crawling modes (Examples)* • Firewall mode • Cross-over mode • Exchange mode [Figure: link graph over pages a to i, split between Site1 (Crawler1) and Site2 (Crawler2)] *Parallel Crawlers by Cho, Junghoo et al., University of California, Los Angeles, WWW2002, Honolulu, Hawaii, USA
Firewall mode: download within partitions • Crawler1: a->b, a->c • Crawler2: f->g, g->h, g->i [Figure: Site1 (Crawler1) and Site2 (Crawler2), pages a to i] d and e are overlooked!
Cross-over mode: download between partitions • Crawler1: a->b, a->c; a->g, g->h, h->d, d->e, g->i • Crawler2: f->g, g->h, g->i; h->d, d->e [Figure: Site1 (Crawler1) and Site2 (Crawler2), pages a to i] Duplication of work!
Exchange mode: download within partitions, exchange info • Crawler1: a->b, a->c; then sends g to Crawler2 • Crawler2: f->g, g->h, g->i; then sends d to Crawler1 [Figure: Site1 (Crawler1) and Site2 (Crawler2), pages a to i] Requires communication
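The three modes can be compared on the example graph with a short simulation. The link list and the page-to-site assignment below are inferred from the crawl traces on these slides, and cross-over mode is simplified to "follow every link"; treat both as assumptions.

```python
from collections import deque

# Links inferred from the traces: a->b, a->c, a->g, f->g, g->h, g->i, h->d, d->e.
LINKS = {"a": ["b", "c", "g"], "b": [], "c": [], "d": ["e"], "e": [],
         "f": ["g"], "g": ["h", "i"], "h": ["d"], "i": []}
# Site1 (Crawler1) owns a..e, Site2 (Crawler2) owns f..i.
OWNER = {page: (1 if page in "abcde" else 2) for page in LINKS}

def crawl(seed, follow):
    """BFS from seed, following only links whose target passes follow()."""
    seen, queue = {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        for target in LINKS[page]:
            if target not in seen and follow(target):
                seen.add(target)
                queue.append(target)
    return seen

def firewall(seed, me):
    """Never leave the crawler's own partition."""
    return crawl(seed, lambda target: OWNER[target] == me)

def cross_over(seed):
    """Follow every link, including those into the other partition."""
    return crawl(seed, lambda target: True)

def exchange(seeds):
    """Stay in the own partition, but forward foreign URLs to their owner."""
    seen = {c: {s} for c, s in seeds.items()}
    queues = {c: deque([s]) for c, s in seeds.items()}
    messages = 0
    while any(queues.values()):
        for me, queue in queues.items():
            while queue:
                page = queue.popleft()
                for target in LINKS[page]:
                    owner = OWNER[target]
                    if owner != me:
                        messages += 1          # one message per forwarded URL
                    if target not in seen[owner]:
                        seen[owner].add(target)
                        queues[owner].append(target)
    return seen, messages

fw1, fw2 = firewall("a", 1), firewall("f", 2)
co1, co2 = cross_over("a"), cross_over("f")
ex_pages, ex_messages = exchange({1: "a", 2: "f"})
```

In this run firewall mode misses d and e, cross-over mode downloads d, e, g, h and i twice, and exchange mode covers every page once at the cost of two messages (g to Crawler2, d to Crawler1).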
Minimizing communication in Exchange Mode • Batch communication • Allow replication 1) Because links to pages follow a Zipf distribution (... 20-80 factor) 2) Replicate some popular URLs at each crawler [Figure: Zipf distribution of incoming links per page]
Evaluating quality • We want important pages • Quality measure: |Pages ∩ Top_k| / |Top_k| • Pages: the k pages downloaded • Top_k: the top k most important pages* *Indication of importance: backlink count
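A short sketch of the quality measure, taking the backlink count as the importance score as the footnote suggests. The function name and the sample counts are hypothetical.

```python
def crawl_quality(downloaded, backlink_count, k):
    """Fraction of the k most-linked pages that the crawl actually fetched."""
    ranked = sorted(backlink_count, key=backlink_count.get, reverse=True)
    top_k = set(ranked[:k])
    return len(set(downloaded) & top_k) / len(top_k)

# Hypothetical backlink counts for four pages.
backlinks = {"a": 10, "b": 7, "c": 5, "d": 1}
quality = crawl_quality(["a", "d"], backlinks, k=2)
```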
Comparison [2] From experiments [2]: 1) firewall mode: fewer than 4 parallel crawlers, lower quality 2) exchange mode: small network traffic, maximizes quality 3) replicating between 10,000 – 100,000 (sic) popular URLs reduces communication overhead by 40%
Intelligent crawling* • Indiscriminate crawlers (e.g. for Google) • Any new page is good • Topic-oriented crawlers • E.g. calls for tenders • We just want new pages on a topic of interest • Intelligent crawler * Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal, et al., IBM T.J. Watson Research Center, WWW10, Hong Kong, 2001
Focused Crawling • Which node to explore next? • Depth-first? Breadth-first? • Best-first! But what is best? • Focused crawling is best; how to establish focus? -- Linkage locality -- Sibling locality [Figure: linked pages labelled topic X and topic Y, with an unlabelled page (?) among them]
Focused Crawling • Objective: given a specific query, find: -- Good sources of content (authorities) ... many links TO -- Good sources of links (hubs) ... many links FROM [Figure: authorities and hubs] • Given an arbitrary query, can we auto-focus? -- learning capability -- learning model
Learning Model • Analyze links from pages on the search periphery • Learning how to pick good links to follow [Figure legend: visited web page, page to visit, hyperlink; nodes 1, 2, 3, 4 and C]
Learning Model • Clues based on - content - URL tokens - linkage info - sibling structure • Different needs require different learning - the crawler needs to learn during the crawl - reuse learned information • The crawler should be intelligent
Intelligent Crawling • Priority list of URLs to be explored (Plist) • User defined predicate to compute interest of page (= processed query) • KB: knowledge base
Intelligent Crawling
Algorithm Intelligent-Crawler();
Begin
  Priority-List (PList) = {Starting Seeds};
  While not (termination) do
  begin
    Reorder URLs on PList using KB;
    Drop unimportant items from PList;
    W <= pop the first element on PList;
    Fetch the Web page W;
    Parse W and add all the outlinks in W to PList;
    If W satisfies the user-defined predicate, then store W;
    Update KB using content and link information for W;
  end
End
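A runnable sketch of this loop, under stated assumptions: the KB here learns only from alphabetic URL tokens, candidates are scored by the best ratio between a token's success rate and the overall success rate, and `fetch`, `TOY_WEB` and all URLs are hypothetical stand-ins for real downloads. It is one possible reading of the algorithm, not the authors' implementation.

```python
import re
from collections import Counter

def intelligent_crawl(seeds, fetch, predicate, max_pages=50, max_plist=1000):
    """fetch(url) -> (text, outlinks); predicate(text) -> bool."""
    plist, seen, stored = list(seeds), set(seeds), []
    n_crawled = n_satisfied = 0
    token_crawled = Counter()    # KB: crawled pages whose URL has the token
    token_satisfied = Counter()  # KB: ... and that satisfied the predicate

    def tokens(url):
        return set(re.findall(r"[a-z]+", url.lower()))

    def score(url):
        # Best ratio (token success rate / overall success rate); 1.0 if unknown.
        if n_satisfied == 0:
            return 1.0
        p_c = n_satisfied / n_crawled
        ratios = [token_satisfied[t] / token_crawled[t] / p_c
                  for t in tokens(url) if token_crawled[t]]
        return max(ratios, default=1.0)

    while plist and n_crawled < max_pages:
        plist.sort(key=score, reverse=True)  # reorder PList using KB
        del plist[max_plist:]                # drop unimportant items
        url = plist.pop(0)                   # W <= first element on PList
        text, outlinks = fetch(url)          # fetch and parse W
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                plist.append(link)
        ok = predicate(text)
        if ok:
            stored.append(url)               # W satisfies the predicate
        n_crawled += 1                       # update KB
        n_satisfied += ok
        for t in tokens(url):
            token_crawled[t] += 1
            token_satisfied[t] += ok
    return stored

TOY_WEB = {
    "seed": ("portal", ["eshop/1", "news/1"]),
    "eshop/1": ("online malls", ["eshop/2", "news/2"]),
    "news/1": ("weather", []),
    "eshop/2": ("online malls", []),
    "news/2": ("sports", []),
}
stored = intelligent_crawl(["seed"], TOY_WEB.get,
                           lambda text: "online malls" in text, max_pages=3)
```

With a budget of three pages the crawler visits `seed`, `eshop/1` and then `eshop/2`, because the first hit teaches it that the `eshop` token is promising; a plain breadth-first crawl would have spent its third download on `news/1`.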
Intelligent Crawler During the crawling process, we can accumulate information such as: • number of URLs crawled, N1 • number of URLs crawled which satisfy the predicate, N2 • number of pages satisfying the predicate in which word i occurs, N3 • number of pages with a keyword in the URL which satisfy (or do not satisfy) the predicate …. • How to create a KB? A later example will illustrate URL based learning
Intelligent Crawler Example: User is interested in ‘online malls’ BUT only 0.1% of web pages contain ‘online malls’ HOWEVER if the word ‘eshop’ is in the URL then the probability that the page contains ‘online malls’ = 5% Thus we should add to the KB the fact that ‘eshop’ in the URL is a useful criterion for choosing pages to explore.
Formal view*
C: the event that a crawled web page satisfies the given predicate
P(C): probability of event C, P(C) = N2 / N1
E: a fact that we know about a candidate URL
Knowledge of the event E may increase the probability of C: P(C|E) = P(C ∩ E) / P(E)
Interest ratio for the event C given the event E: IR(C,E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) * P(E))
The values of P(C ∩ E) and P(E) can be calculated during the crawl
* from: Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal, et al.
Mall example Example: • 0.1% of web pages contain ‘online malls’ and so satisfy the predicate (P(C)) • if the word ‘eshop’ occurs in the URL (E) then the probability of satisfying the predicate (P(C|E)) increases to 5% • So interest ratio IR(C,E) = P(C|E) / P(C) = 5% / 0.1% = 50
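The same arithmetic from crawl counters, as a small sketch. The counts are hypothetical, chosen only so that they reproduce the 0.1% and 5% figures of the example.

```python
def interest_ratio(n_crawled, n_satisfying, n_with_fact, n_with_fact_satisfying):
    """IR(C,E) = P(C and E) / (P(C) * P(E)), estimated from crawl counts."""
    p_c = n_satisfying / n_crawled
    p_e = n_with_fact / n_crawled
    p_c_and_e = n_with_fact_satisfying / n_crawled
    return p_c_and_e / (p_c * p_e)

# 100,000 pages crawled, 100 satisfy the predicate (0.1%);
# 1,000 URLs contain 'eshop', and 50 of those satisfy it (5%).
ir = interest_ratio(100_000, 100, 1_000, 50)
```

A ratio above 1 means the fact E makes a candidate more promising than average; a ratio below 1 would argue for pushing it down the priority list.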
Collaborative Search • 3 ways to search for information: browsing, querying and filtering • Collaborative types [10]: collaborative browsing, mediated searching, collaborative information filtering, collaborative agents, collaborative reuse of results
Collaborative Search • What do we mean by collaboration? • Human - computer - human • Human - computer • Computer agent - computer agent
Collaborative Search • Man - machine Collaborative browsing --- Ariadne system[23] Collaborative reuse of results --- Pluribus[21] (2000) Collaborative information filtering --- Collaborative filtering[25] Mediated searching --- DIAMS [22] (2000) • Machine - machine ( … Collaborative agents ) meta-search engines: Meta Crawler, Mamma, Metagopher, Copernic topic-oriented collaborative crawler [11] (2002) Collaborative spider [16] (2002) UbiCrawler[5] (2003) Collaborator [19] (under development)
Existing systems: meta-search engines • Meta Crawler, Mamma, Metagopher, Copernic • pass the query to other search engines • collect the results from other search engines • combine the results for the user
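The pass, collect, combine cycle can be sketched as below. The engines are hypothetical callables returning ranked URL lists, and merging by summed reciprocal rank is an assumption made for illustration; the slide does not say how these systems combine results.

```python
from collections import defaultdict

def meta_search(query, engines, top_n=10):
    """Send the query to every engine and merge the ranked result lists."""
    scores = defaultdict(float)
    for engine in engines:                       # pass the query on
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank            # collect, weighting by rank
    # Combine: best total score first, URL as a deterministic tie-breaker.
    return sorted(scores, key=lambda url: (-scores[url], url))[:top_n]

# Hypothetical underlying engines with fixed answers.
def engine_a(query):
    return ["u1", "u2", "u3"]

def engine_b(query):
    return ["u2", "u4"]

merged = meta_search("online malls", [engine_a, engine_b])
```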
Topic-oriented collaborative crawlers [11] (2002) • Each crawler is given a specific topic • It knows the topics of its colleagues • It sends URLs of pages it doesn’t care about to the crawler responsible for the topic Problems: • static predefined topic categories • static assignment partition function • controller assigns sites to each crawler
Collaborative spiders [16] (2002) • JATLite (Java Agent Template Lite), uses KQML • User agents + ONE scheduler agent, Collaborator agent (as a mediator) • search, content mining, post-retrieval analysis • system group user sharing information
UbiCrawler [5] (2003) • consistent hashing partition function • buckets are agents, keys are hosts • failure detector --- only synchronous component • each agent keeps track of the visited URLs in a hash table • pure Java application, RMI based, multi-thread agent
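A minimal sketch of a consistent hashing partition function with agents as buckets and hosts as keys. This shows the general technique only; the class name, the use of MD5, the number of replicas and the host names are assumptions and say nothing about UbiCrawler's actual code.

```python
import bisect
import hashlib

def _h(text):
    """Stable hash of a string onto the ring (MD5 used for stability only)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, agents, replicas=100):
        self.replicas = replicas
        self._ring = []              # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Each agent owns several points so that load spreads evenly.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        # First point clockwise from the host's hash, wrapping around.
        i = bisect.bisect(self._ring, (_h(host), ""))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHash(["agent_1", "agent_2", "agent_3"])
hosts = [f"www.site{i}.com" for i in range(200)]
before = {h: ring.agent_for(h) for h in hosts}
ring.remove("agent_2")               # e.g. the failure detector reports a crash
after = {h: ring.agent_for(h) for h in hosts}
moved = [h for h in hosts if before[h] != after[h]]
```

Only hosts that belonged to the removed agent change owner, which is what makes this partition function suitable for a dynamic set of agents.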
Collaborator [19] (under development) • a shared workspace framework for virtual teams • 3 tier architecture, J2EE + Agent (BlueJADE): client tier, middle tier, enterprise information systems tier • personal agents, session management agents • desktop or wireless device • Jade, FIPA
Conclusion Current collaborative search: - collaborative - dynamic - adaptive exploring - intelligent - decentralized Trend: agents
Multi-agent collaborative search: challenges? [Diagram: a query reaches agent_1, agent_2, ..., agent_n; each agent has a DataStore and accesses the Web]
Challenges • Dynamic partition? - dynamically assigning the web domain to agents • Load balancing? - each cache stores roughly the same # of pages • Content look up? - an agent can easily locate the storage that stores particular content Solution: Web Cache & Consistent Hashing
Web Caching • Content (URL -> content) • For download efficiency • Indexing information (Keyword -> URL) • Search efficiency
Browser caching 1. For efficiency 2. Each client has its own cache [Diagram: clients, each with its own cache, in front of www.abc.com]
Proxy caches 1. Each cache stores a subset of all pages 2. Each client knows several caches [Diagram: clients, domain caches, www.abc.com]
Agent’s web cache communication [Diagram: users, agents, one web cache per agent, the Web]