Organizing the Web • The Web is big. Really big. • Over 3 billion pages, just in the indexable Web • The Web is dynamic • Problems: • How to store a database of links? • How to crawl the web? • How to recommend pages that match a query?
Architecture of a Search Engine • 1. A web crawler gathers a snapshot of the Web • 2. The gathered pages are indexed for easy retrieval • 3. A user submits a search query • 4. The search engine ranks pages that match the query and returns an ordered list
Indexing the Web • Once a crawl has collected pages, the full text is compressed and stored in a repository • Each URL is mapped to a unique ID • A document index is created • For each document, it contains a pointer into the repository, status, checksum, and pointers to the URL & title • A hit list is created for each word in the lexicon • Occurrences of a word in a particular document, including position, font, capitalization, “plain or fancy” • Fancy: occurs in a title, tag, or URL
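To make the hit-list idea concrete, here is a minimal Python sketch of the record described above; the field names and types are illustrative, not the actual layout (the original Google system packed each hit into compact bit fields).

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in one document."""
    doc_id: int         # which document the word appeared in
    position: int       # word offset within that document
    capitalized: bool   # was this occurrence capitalized?
    fancy: bool         # True if it occurred in a title, tag, or URL

# The lexicon assigns each word a wordID; hit_lists maps wordID -> occurrences.
hit_lists: dict[int, list[Hit]] = {}

def record_hit(word_id: int, hit: Hit) -> None:
    hit_lists.setdefault(word_id, []).append(hit)
```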
Indexing the Web • Each word in the hit list has a wordID • A forward index is created • 64 barrels; each contains a range of wordIDs • If a document contains words for a particular barrel, the docID is added, along with a list of wordIDs and hit lists • Maps documents to words • Wrinkle: can use TF-IDF to map only “significant” keywords • Term Frequency × Inverse Document Frequency
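The TF-IDF weighting can be sketched in a few lines; this uses one common variant of the formula (the exact smoothing and log base vary between systems):

```python
import math
from collections import Counter

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Weight of `term` in `doc`: frequent in this document but rare
    across the corpus means a significant keyword."""
    tf = Counter(doc)[term] / len(doc)            # term frequency
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / (1 + df))        # +1 guards against df == 0
    return tf * idf
```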
Indexing the Web • An inverted index is created • The forward index is sorted according to word • For every valid wordID in the lexicon, create a pointer to the appropriate barrel • Points to a list of docIDs and hit lists • Maps keywords to URLs • Some wrinkles: • Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding • Semantic similarity • Words with similar meanings share an index entry • Issue: trading coverage (number of hits) for precision (how closely hits match the request)
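A toy version of the forward/inverted pair, with stemming and case folding assumed to have happened upstream (barrels and hit lists are omitted for brevity):

```python
from collections import defaultdict

def build_indexes(docs: dict[int, list[str]]):
    """docs maps docID -> list of normalized tokens."""
    forward = {doc_id: set(tokens) for doc_id, tokens in docs.items()}
    inverted = defaultdict(set)                  # word -> set of docIDs
    for doc_id, words in forward.items():
        for w in words:
            inverted[w].add(doc_id)
    return forward, inverted

fwd, inv = build_indexes({1: ["web", "crawler"], 2: ["web", "index"]})
print(sorted(inv["web"]))    # [1, 2] -- the inverted index maps word -> docs
```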
Indexing Issues • Indexing techniques were designed for static collections • How to deal with pages that change? • Periodic crawls, rebuild the index • Crawl different pages at different frequencies • Records need a way to be “purged” • A hash of each page is stored to detect changes • The text of a link to a page can help label that page • Helps eliminate the addition of spurious keywords
Indexing Issues • Availability and speed • Most search engines will cache the page being referenced. • Multiple search terms • OR: separate searches concatenated • AND: intersection of searches computed. • Regular expressions not typically handled. • Parsing • Must be able to handle malformed HTML, partial documents
Ranking • The primary challenge of a search engine is to return results that match a user’s needs. • A word will potentially map to millions of documents • How to order them?
PageRank • Google uses PageRank to determine relevance • Based on the “quality” of a page’s inward links • A simplified version: • Let N_v be the number of outward links of page v • R(p) = c · Σ_{v ∈ inlinks(p)} R(v) / N_v • c is a normalizing factor
PageRank • Average the PageRanks of the pages that point to a given page, each divided by its out-degree • Let p be a page, with pages T_1, …, T_n linking to p, and let C(T_i) be the out-degree of T_i • PR(p) = (1 − d) + d · Σ_{i=1..n} PR(T_i) / C(T_i) • d is a “damping” factor • PR “propagates” through the graph • Defined recursively, but can be computed iteratively • Repeat until no PR changes by more than some delta
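A minimal iterative implementation of the formula above; the uniform initial scores and the dict-of-lists graph representation are simplifying assumptions of this sketch:

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             tol: float = 1e-6) -> dict[str, float]:
    """links maps each page to the pages it links to.
    Iterates PR(p) = (1-d) + d * sum(PR(T)/C(T)) until scores settle."""
    pages = set(links) | {q for out in links.values() for q in out}
    pr = {p: 1.0 for p in pages}
    while True:
        new = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                    for q in links if p in links[q])
               for p in pages}
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```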
PageRank • Intuition: A page is useful if many popular sites link to it. • Justification: • Imagine a random surfer who keeps clicking through links • 1 − d is the probability that she abandons the current chain and starts over at a random page • Pros: difficult to game the system • Cons: reinforces a “rich get richer” web structure in which highly popular sites keep growing in popularity
HITS • HITS is also commonly used for document ranking. • Gives each page a hub score and an authority score • A good authority is pointed to by many good hubs. • A good hub points to many good authorities. • Users want good authorities.
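The mutual-reinforcement loop can be sketched as follows; the fixed iteration count and L2 normalization are common implementation choices, not mandated by the slide:

```python
def hits(links: dict[str, list[str]], iters: int = 50):
    """Compute hub and authority scores for a link graph."""
    pages = set(links) | {q for out in links.values() for q in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # A good authority is pointed to by many good hubs...
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # ...and a good hub points to many good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (auth, hub):       # L2-normalize so scores stay bounded
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```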
Hubs and Authorities • Common community structure • Hubs • Many outward links • Lists of resources • Authorities • Many inward links • Provide resources, content
Hubs and Authorities • [Diagram: hub pages on one side linking to authority pages on the other] • Link-structure estimates suggest over 100,000 Web communities • Often not categorized by portals
Issues with Ranking Algorithms • Spurious keywords and META tags • Pages that mutually reinforce each other • Inflates the “authority” measure • Link similarity vs. content similarity • Topic drift • Many hubs link to more than one topic
Crawling the Web • How to collect Web data in the first place? • Spiders are used to crawl the web and collect pages • A page is downloaded and its outward links are found • Each outward link is then downloaded in turn • Exceptions: • Links behind CGI interfaces • Pages excluded by the Robot Exclusion Standard
Crawling the Web • We may want to be a bit smarter about selecting documents to crawl • Web is too big • Building a special-purpose search engine • Indexing a particular site • Choosing where to go first is a hard problem.
Crawling the Web • Basic algorithm (a runnable sketch follows): • Let Q be a queue, and S be a starting node • Enqueue(Q, S) • While (notEmpty(Q)) • W = Dequeue(Q) • V_1, …, V_n = outwardLinks(W) <- this set is called the frontier • Enqueue(Q, V_1, …, V_n) • The Enqueue function is the tricky part.
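A runnable version of the basic algorithm; fetch_links is a stand-in for downloading a page and extracting its outward links:

```python
from collections import deque

def crawl(start: str, fetch_links, max_pages: int = 1000) -> set[str]:
    """Breadth-first crawl from `start`. fetch_links(url) is assumed to
    download the page and return its outward links (the frontier)."""
    q = deque([start])
    seen = {start}                        # never enqueue a URL twice
    while q and len(seen) < max_pages:
        w = q.popleft()
        for v in fetch_links(w):
            if v not in seen:             # real crawlers also check robots.txt here
                seen.add(v)
                q.append(v)
    return seen
```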
Crawling the Web • Best-first • Sorts the queue according to cosine similarity • Sim(S, V) = Σ_{w ∈ S ∩ V} f_{wS} · f_{wV} / √(Σ_{w ∈ S} f_{wS}² · Σ_{w ∈ V} f_{wV}²) • f_{wD} is the frequency of word w in document D • This is the cosine of the angle between the two term-frequency vectors • Expand the documents most similar to the starting document.
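The similarity measure in Python, with a note on how best-first uses it; treating documents as bags of words is an assumption of this sketch:

```python
import math
from collections import Counter

def cosine_sim(s: list[str], v: list[str]) -> float:
    """Cosine similarity of two documents' term-frequency vectors."""
    fs, fv = Counter(s), Counter(v)
    num = sum(fs[w] * fv[w] for w in fs.keys() & fv.keys())
    den = math.sqrt(sum(c * c for c in fs.values()) *
                    sum(c * c for c in fv.values()))
    return num / den if den else 0.0

# A best-first crawler keeps the frontier sorted by similarity to the start page:
# frontier.sort(key=lambda doc: cosine_sim(start_doc, doc), reverse=True)
```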
Crawling the Web • PageRank can also be used to guide a crawl. • PageRank was designed to model a random walk through a web graph. • Select pages probabilistically based on their PageRank • One issue: PageRank must be recomputed frequently. • Leads to a crawl of the most “valuable” sites.
Web structure • Structure is important for: • Predicting traffic patterns • Who will visit a site? • Where will visitors arrive from? • How many visitors can you expect? • Estimating coverage • Is a site likely to be indexed?
Core • Compact • Short paths between sites • “Small world” phenomenon • Average path lengths are small relative to the size of the graph • The number of inward and outward links follows a power law • Mechanism: preferential attachment (see the sketch below) • As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
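A toy simulation of preferential attachment; the “+1” smoothing (so brand-new pages can ever be chosen at all) is an assumption of this sketch:

```python
import random

def preferential_attachment(n: int) -> list[int]:
    """Grow a graph one node at a time; each arrival links to an existing
    node chosen with probability proportional to in-degree + 1.
    Returns the final in-degree of every node."""
    in_deg = [0]
    lottery = [0]                  # node v appears in_deg[v] + 1 times here
    for new in range(1, n):
        target = random.choice(lottery)
        in_deg[target] += 1
        lottery.append(target)     # target's odds rise with its in-degree
        in_deg.append(0)
        lottery.append(new)        # the newcomer enters with one ticket
    return in_deg

degs = preferential_attachment(10_000)
# sorted(degs)[-5:] shows the heavy tail: a few nodes attract most links
```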
Power Laws and Small Worlds • Power laws occur everywhere in nature • Distribution of site sizes, city sizes, incomes, word frequencies, business sizes, earthquake magnitudes, spread of disease • Networks that grow by preferential attachment tend to evolve power-law degree distributions • Small-world phenomenon • “Neighborhoods” are joined by common members • Hubs serve to connect neighborhoods • Linkage is closer than one might expect • Application: construction of networks and protocols that produce maximal flow/efficiency
Local Structure • More diverse than a single power law • Pages with similar topics self-organize into communities • Short average path length • High link density • Webrings • Converse question: does a high link density imply the existence of a community? • Can this be used to study the emergence and growth of web communities?
Web Communities • Alternate definition: • Each member has more links to community members than to non-members • An extension of a clique • Can be discovered with network-flow algorithms (sketch below) • Can be used to discover new “categories” • Helps people interested in a topic find each other • Focused crawling, filtering, recommender systems
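A sketch of the network-flow approach, in the spirit of flow-based community algorithms (e.g. Flake et al.); the unit capacities, the virtual source, and the single `outside` node standing in for the rest of the Web are all simplifications, and networkx is assumed to be available:

```python
import networkx as nx

def web_community(G: nx.DiGraph, seeds: list[str], outside: str) -> set[str]:
    """Return the pages on the seed side of a minimum cut separating
    the seed pages from `outside` (a node for 'the rest of the Web')."""
    H = G.copy()
    nx.set_edge_attributes(H, 1, "capacity")          # unit capacity per link
    for s in seeds:
        H.add_edge("SRC", s, capacity=float("inf"))   # tie seeds to a virtual source
    _, (seed_side, _) = nx.minimum_cut(H, "SRC", outside)
    return seed_side - {"SRC"}
```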