Search engine note
Search Signals
• “Heuristics” that allow search results to be sorted
  • Word based: frequency, position, …
  • HTML based: emphasis, header
  • URI based: server name, URL
  • Page based: not dependent on the search term, but on the features of the page
    • PageRank is the most important
• Search results are a combination of these
Anchor text
• Other pages, images, documents, etc. are linked via “anchors”
  • E.g. <a …>, <img …>, etc.
• Text in or around the anchor describes the linked page
  • <a href="http://www.cowabduction.com"> UFOs are stealing our cows! </a>
• These words are indexed under the LINKED page, not the page that contains the link
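A minimal sketch of the anchor-text idea, using only Python's standard `html.parser`. The class name `AnchorTextCollector` and the helper `collect_anchor_words` are hypothetical names chosen for illustration. The sketch only looks at text inside `<a>` tags and ignores surrounding text and `<img>` tags, which a real crawler would likely also consider.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the words inside each <a> tag, keyed by the LINKED url."""

    def __init__(self):
        super().__init__()
        self.anchor_words = defaultdict(list)  # target url -> words
        self._current_href = None              # href of the <a> we are inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

    def handle_data(self, data):
        # Only text that appears between <a ...> and </a> is recorded.
        if self._current_href:
            words = re.findall(r"[a-z0-9]+", data.lower())
            self.anchor_words[self._current_href].extend(words)


def collect_anchor_words(html):
    """Return {linked url: [anchor words]} for one page of HTML."""
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.anchor_words)


page = '<a href="http://www.cowabduction.com"> UFOs are stealing our cows! </a>'
print(collect_anchor_words(page))
```

The words are stored under the target URL, so a query for "cows" could match the linked page even if that page never uses the word itself.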
Search “algorithm”
• Single or multi-word
• For every word in the query
  • Find the pages the word occurs on, then compute
    • Group 1: pages with all of those words (intersection)
    • Group 2: pages with any of those words (union)
• For every page in the returned set
  • Sort by the formula
    • k1 * signal1 + k2 * signal2 + … + kn * signaln
    • (having the k’s sum to 1 is computationally convenient)
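The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `search`, the signal names, and the weights are all hypothetical, and the inverted index is assumed to be a plain dict mapping each word to the set of pages it occurs on. One design choice that goes beyond the slide: Group 2 is returned without the pages already in Group 1, so no page is listed twice.

```python
def search(query_words, inverted_index, signals, weights):
    """Return (group1, group2), each sorted by weighted signal score.

    inverted_index: word -> set of pages containing that word (assumed shape)
    signals:        page -> {signal name: value}               (assumed shape)
    weights:        signal name -> k                           (the k's)
    """
    page_sets = [inverted_index.get(word, set()) for word in query_words]
    if not page_sets:
        return [], []

    all_words = set.intersection(*page_sets)  # Group 1: every query word
    any_word = set.union(*page_sets)          # Group 2: at least one word

    def score(page):
        # k1 * signal1 + k2 * signal2 + ... + kn * signaln
        return sum(k * signals[page].get(name, 0.0) for name, k in weights.items())

    def ranked(pages):
        # Highest score first; ties are broken by page name for stable output.
        return sorted(pages, key=lambda page: (-score(page), page))

    return ranked(all_words), ranked(any_word - all_words)


inverted_index = {"ufos": {"a", "b"}, "cows": {"a", "c"}}
signals = {
    "a": {"frequency": 0.5, "pagerank": 0.2},
    "b": {"frequency": 0.9, "pagerank": 0.1},
    "c": {"frequency": 0.1, "pagerank": 0.9},
}
weights = {"frequency": 0.75, "pagerank": 0.25}  # the k's sum to 1

print(search(["ufos", "cows"], inverted_index, signals, weights))
```

If every signal is scaled to the range 0 to 1 and the k's sum to 1, the combined score also stays within 0 to 1, which is one reason the normalization is convenient.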
Indexes
• Search index
  • For every page, what words occur on that page
  • Plus “features” of each word occurrence (location, HTML, etc.)
• Inverted (reverse) index
  • For every word, what pages it occurs on
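A small sketch of the two indexes, assuming pages are given as a dict of URL to plain text. The function names are hypothetical, and word position is used as the only stored "feature"; HTML features such as emphasis or headers are left out for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(pages):
    """Search index: page -> {word: [positions of that word on the page]}."""
    forward = {}
    for url, text in pages.items():
        positions = defaultdict(list)
        for position, word in enumerate(tokenize(text)):
            positions[word].append(position)
        forward[url] = dict(positions)
    return forward


def build_inverted_index(forward):
    """Inverted index: word -> set of pages the word occurs on."""
    inverted = defaultdict(set)
    for url, words in forward.items():
        for word in words:
            inverted[word].add(url)
    return dict(inverted)


pages = {"a": "Cows and UFOs", "b": "UFOs, UFOs everywhere"}
forward = build_forward_index(pages)
inverted = build_inverted_index(forward)
print(forward["b"])
print(sorted(inverted["ufos"]))
```

The inverted index is the one consulted at query time, since it answers "which pages contain this word" with a single lookup.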
Summary
• http://www.youtube.com/watch?v=fnSJBpB_OKQ