Artificial Intelligence 15-381: Web Spidering & HW1 Preparation
Jaime Carbonell, jgc@cs.cmu.edu
22 January 2002

Today's Agenda
• Finish A*, B*, Macro-operators
• Web Spidering as Search
• How to acquire the web in a box
• Graph theory essentials
• Algorithms for web spidering
• Some practical issues
Search Engines on the Web
Revisiting the Total IR Scheme
1. Acquire the collection, i.e., all the documents [Off-line process]
2. Create an inverted index (IR lecture, later) [Off-line process]
3. Match queries to documents (IR lecture) [On-line process, the actual retrieval]
4. Present the results to the user [On-line process: display, summarize, ...]
Acquiring a Document Collection
Document Collections and Sources
• Fixed, pre-existing document collection, e.g., the classical philosophy works
• Pre-existing collection with periodic updates, e.g., the MEDLINE biomedical collection
• Streaming data with temporal decay, e.g., the Wall Street financial news feed
• Distributed proprietary document collections
• Distributed, linked, publicly-accessible documents, e.g., the Web
Properties of Graphs I (1)
Definitions:
• Graph: a set of nodes n and a set of edges (binary links) v between the nodes.
• Directed graph: a graph where every edge has a pre-specified direction.
Properties of Graphs I (2)
• Connected graph: a graph where, for every pair of nodes, there exists a sequence of edges starting at one node and ending at the other.
• The web graph: the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.
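A concrete way to hold such a graph in memory is an adjacency list: each node (URL) maps to the list of URLs it links to. A minimal Python sketch with made-up placeholder URLs (not from the lecture):

    # Toy directed web graph as an adjacency list.
    web_graph = {
        "http://a.example": ["http://b.example", "http://c.example"],
        "http://b.example": ["http://c.example"],
        "http://c.example": ["http://a.example"],   # a link back to the start forms a loop
    }

    nodes = set(web_graph)                                                      # n = {all web pages}
    edges = {(src, dst) for src, links in web_graph.items() for dst in links}   # v = directed links
    print(len(nodes), "nodes,", len(edges), "edges")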
Properties of Graphs I (3)
• Tree: a connected graph without any loops and with a unique path between any two nodes.
• Spanning tree of a graph G: a tree constructed by including all the nodes n of G and a subset of the edges v such that G remains connected, but all loops are eliminated.
Properties of Graphs I (4)
• Forest: a set of trees (without inter-tree links).
• k-Spanning forest: given a graph G with k connected subgraphs, the set of k trees, each of which spans a different connected subgraph.
Web Graph
[Figure: web pages are the nodes; HTML references (<href …>) are the directed links between them.]
More Properties of Graphs
Theorem 1: For every connected graph G, there exists a spanning tree.
Proof: Depth-first search starting at any node in G builds the spanning tree.
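The proof idea is easy to render in code: a depth-first search that records the edge used to discover each node. A sketch in Python (the function name dfs_spanning_tree and the optional visited argument are ours; the argument is there so the forest sketch below can reuse it):

    def dfs_spanning_tree(graph, root, visited=None):
        """Return the set of tree edges found by depth-first search from root."""
        visited = set() if visited is None else visited
        visited.add(root)
        tree_edges = set()
        stack = [root]
        while stack:
            node = stack.pop()
            for neighbor in graph.get(node, []):
                if neighbor not in visited:           # first time this node is reached
                    visited.add(neighbor)
                    tree_edges.add((node, neighbor))  # keep the edge that discovered it
                    stack.append(neighbor)
        return tree_edges

Every reachable node is discovered exactly once, so the recorded edges connect all reachable nodes and contain no loops, i.e., they form a spanning tree of that subgraph (for example, dfs_spanning_tree(web_graph, "http://a.example") on the toy graph above).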
Properties of Graphs
Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) defines a k-spanning forest.
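The proof translates into a loop that restarts the DFS sketch above once per undiscovered subgraph. Note that this sketch follows the directed links as given (reachability, the way a spider sees the graph); for the undirected notion of connectedness one would add reverse edges first:

    def spanning_forest(graph):
        """One spanning tree per subgraph, found by restarting DFS from unvisited nodes."""
        visited = set()
        forest = []
        for root in graph:                    # consider every node as a potential new root
            if root not in visited:
                forest.append(dfs_spanning_tree(graph, root, visited))
        return forest                         # k trees for k subgraphs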
Properties of Web Graphs
Additional Observations
• The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
• If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."
Graph-Search Algorithms I

PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

What is wrong with the above algorithm?
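Before answering, here is a rough Python rendering of SPIDER1. The helpers fetch_page and extract_urls are illustrative stand-ins for look-up and link extraction (no error handling, politeness delays, or robots.txt), and the later sketches reuse them:

    import re
    from urllib.request import urlopen

    def fetch_page(url):
        """Download a page's HTML (illustrative only: no error handling or politeness)."""
        return urlopen(url).read().decode("utf-8", errors="replace")

    def extract_urls(page):
        """Pull absolute http(s) links out of href attributes with a crude regex."""
        return re.findall(r'href="(https?://[^"]+)"', page)

    def spider1(root):
        """SPIDER1 sketch: a direct transcription of the pseudocode above."""
        stack = [root]
        collection = {}                   # URL -> page contents
        while stack:
            url = stack.pop()
            page = fetch_page(url)        # look-up(URLcurr)
            collection[url] = page        # STORE(<URLcurr, PAGE>, COLLECTION)
            for link in extract_urls(page):
                stack.append(link)        # push every out-link, even ones already fetched
        return collection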
Depth-first Search
[Figure: a small tree whose nodes are numbered 1-7; the numbers give the order in which the nodes are visited.]
Graph-Search Algorithms II (1)
SPIDER1 is Incorrect
• What about loops in the web graph? => Algorithm will not halt
• What about convergent DAG structures? => Pages will be replicated in the collection => Inefficiently large index => Duplicates to annoy the user
Graph-Search Algorithms II (2)
SPIDER1 is Incomplete
• The web graph has k connected subgraphs.
• SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.
A Correct Spidering Algorithm

PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
      | Do URLcurr := pop(STACK)
      | Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

(The "|" marks the lines added relative to SPIDER1.)
A More Efficient Correct Algorithm

PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
    While STACK is not empty,
      | Do URLcurr := pop(STACK)
      | Until URLcurr is not in VISITED
      | insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
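A rough Python equivalent of SPIDER3, reusing the fetch_page / extract_urls helpers from the SPIDER1 sketch. The Do ... Until pop loop becomes a skip-and-continue check, and the optional shared-state arguments are an addition of ours so the SPIDER4 sketch below can reuse this function:

    def spider3(root, visited=None, collection=None):
        """SPIDER3 sketch: a VISITED set ensures each URL is fetched at most once."""
        visited = set() if visited is None else visited        # the VISITED hash table
        collection = {} if collection is None else collection  # the COLLECTION of URL-page pairs
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:            # plays the role of the Do ... Until pop loop
                continue
            visited.add(url)              # insert-hash(URLcurr, VISITED)
            page = fetch_page(url)        # look-up(URLcurr)
            collection[url] = page        # STORE(<URLcurr, PAGE>, COLLECTION)
            for link in extract_urls(page):
                stack.append(link)
        return collection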
Graph-Search Algorithms V
A More Complete Correct Algorithm

PROCEDURE SPIDER4(G, {SEEDS})
  | Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
  | For every ROOT in SEEDS
  |     Initialize STACK <stack data structure>
  |     Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
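In Python terms, SPIDER4 is just an outer loop over seed URLs that shares one VISITED set and one COLLECTION across all the crawls; a sketch building on the spider3 function above (the seed list itself would come from the engine's seed database):

    def spider4(seeds):
        """SPIDER4 sketch: shared VISITED and COLLECTION across all seed crawls."""
        visited = set()
        collection = {}
        for root in seeds:                # restart the crawl from each (diverse) seed
            spider3(root, visited, collection)
        return collection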
Completeness Observations
Completeness is not guaranteed
• In the web graph G with its k connected subgraphs, we do not know k.
• It is impossible to guarantee that each connected subgraph is sampled.
• Better: more seeds, and more diverse seeds.
Completeness Observations
Search Engine Practice
• Wish to maximize the subset of the web that is indexed.
• Maintain a (secret) set of diverse seeds (grow this set opportunistically, e.g., when X complains that his/her page is not indexed).
• Register new web sites on demand => new registrations are seed candidates.
To Spider or not to Spider? (1)
User Perceptions
• Most annoying: Engine finds nothing (too small an index, but not an issue since 1997 or so).
• Somewhat annoying: Obsolete links => Refresh collection by deleting dead links (OK if the index is slightly smaller) => Done every 1-2 weeks in the best engines
• Mildly annoying: Failure to find a new site => Re-spider the entire web => Done every 2-4 weeks in the best engines
To Spider or not to Spider? (2)
Cost of Spidering
• Semi-parallel algorithmic decomposition
• Spider can (and does) run on hundreds of servers simultaneously
• Very high network connectivity (e.g., a T3 line)
• Servers can migrate from spidering to query processing depending on time-of-day load
• Running a full web spider takes days, even with hundreds of dedicated servers
Current Status of Web Spiders
Historical Notes
• WebCrawler: first documented spider
• Lycos: first large-scale spider
• Top honors for most web pages spidered: first Lycos, then Alta Vista, then Google...
Current Status of Web Spiders
Enhanced Spidering
• In-link counts to pages can be established during spidering.
• Hint: in SPIDER4, store <URL, COUNT> pairs in the VISITED hash table.
• In-link counts are the basis for Google's PageRank method.
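A minimal sketch of that hint in Python, again with the fetch_page / extract_urls helpers from the SPIDER1 sketch. The lecture's hint folds the count into the VISITED table itself; for clarity this version keeps a separate in-link counter beside the visited set:

    from collections import defaultdict

    def spider4_with_inlinks(seeds):
        """Multi-seed spider that also tallies in-link counts per URL."""
        collection = {}
        visited = set()
        inlink_count = defaultdict(int)        # URL -> number of references pointing at it
        for root in seeds:
            stack = [root]
            while stack:
                url = stack.pop()
                if url in visited:
                    continue
                visited.add(url)
                page = fetch_page(url)
                collection[url] = page
                for link in extract_urls(page):
                    inlink_count[link] += 1    # every observed reference bumps the target's count
                    stack.append(link)
        return collection, inlink_count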
Current Status of Web Spiders
Unsolved Problems
• Most spidering re-traverses a stable web graph => want on-demand re-spidering when changes occur
• Completeness or near-completeness is still a major issue
• Cannot spider Java-triggered or local-DB-stored information