Explore the HITS method for ranking web pages by hub and authority values. Learn preprocessing, vector calculation, and validation techniques.
HITS: Hypertext Induced Topic Selection
Gyozo Gidofalvi, Uppsala Database Laboratory
Idea
• Given a set of web pages that are all concerned with the same topic
• we want to find the most interesting pages by examining the internal link structure in the set
• we want to find the pages that are most likely to guide us to interesting pages
Foundation
• Identify Hubs and Authorities
• Definition is mutually recursive:
  • A good hub points to good authorities
  • A good authority is pointed to by good hubs
• The hub value of a site is the sum of the authority values of the sites that the site points to.
• The authority value of a site is the sum of the hub values of the sites that point to the site.
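The two definitions above can be written compactly as a pair of sums. The notation is an assumption introduced here, not taken from the slides: h(p) is the hub value of site p, a(p) is its authority value, and p → q means that p links to q.

```latex
h(p) = \sum_{q \,:\, p \to q} a(q),
\qquad
a(p) = \sum_{q \,:\, q \to p} h(q)
```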
Pseudo-code
1. Find a set of pages about a given subject
  • You may use an existing search engine (such as Google)
  • In the assignment, you are provided a bunch of pages with links
2. Preprocess the link structure
3. Initialize hub and authority vectors
4. Normalize the vectors to length 1
5. Calculate the new authority vector based on the link structure and the hub vector
6. Calculate the new hub vector based on the link structure and the authority vector
7. If the new values of the hub and authority vectors are similar enough to the old ones we are done, otherwise repeat from step 4
8. Sort the vectors and find the top authorities and hubs
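The iteration in the pseudo-code could be sketched in Python roughly as follows. This is one possible reading, not the reference solution for the assignment. The names `hits`, `normalize`, `tol`, and `max_iter`, the dictionary input format, the initial value of 1.0, and the choice to feed the freshly computed authority vector into the hub update are all assumptions made for illustration. Normalizing right after each update has the same effect as normalizing at the top of the next pass.

```python
import math


def normalize(scores):
    """Scale a {page: value} dict to Euclidean length 1 (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    if length == 0.0:
        return dict(scores)
    return {page: v / length for page, v in scores.items()}


def hits(links, tol=1e-9, max_iter=1000):
    """links maps each page to the set of pages it points to (hypothetical input format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Initialize (assumption: every page starts at 1.0) and normalize to length 1.
    hub = normalize({p: 1.0 for p in pages})
    auth = normalize({p: 1.0 for p in pages})

    for _ in range(max_iter):
        # Authority of a page = sum of the hub values of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        new_auth = normalize(new_auth)

        # Hub of a page = sum of the authority values of the pages it links to.
        # Assumption: the freshly computed authority vector is used here.
        new_hub = normalize(
            {p: sum(new_auth[dst] for dst in links.get(p, ())) for p in pages}
        )

        # Stop once no entry of either vector moved by more than tol.
        change = max(
            max(abs(new_auth[p] - auth[p]), abs(new_hub[p] - hub[p])) for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break

    return hub, auth


# Hypothetical three-page graph (not the test case from the slides): p and q both link to r.
example_links = {"p": {"r"}, "q": {"r"}, "r": set()}
hub, auth = hits(example_links)
top_authorities = sorted(auth, key=auth.get, reverse=True)  # best authority first
top_hubs = sorted(hub, key=hub.get, reverse=True)           # best hub first
```

In this small graph, r is the only page that receives links, so it ends up as the top authority, while p and q share the same hub value.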
Calculating the hub and authority vectors
• First we initialize the hub and authority vectors to some value.
  • What initial values are appropriate?
  • Does it matter what we initialize to?
• Next, we calculate the new hub and authority vectors using the formulas for the hub and authority values.
  • Does it matter in which order these calculations happen?
  • Do we need to normalize the vectors in each iteration?
  • How do we know when to stop?
Preprocessing
• Preprocessing will improve the accuracy of the results
• Several different links may point to the same page:
  • http://www.it.uu.se
  • http://www.it.uu.se/index.html
  • www.it.uu.se
• Remove site-internal links, as these can make a site seem more important than it really is.
• Remove links to sites for which we do not know the link structure.
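A minimal sketch of these three preprocessing steps, using only the Python standard library. The rules are assumptions and not the directions from the lab course web page: a missing scheme is read as http, the scheme itself is then ignored, a trailing `index.html` or slash is dropped, and a "site" is identified by its host name. The function names `canonical`, `site`, and `preprocess` are hypothetical, and `example.org` and `unknown.example` are placeholder hosts.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Map spelling variants of one address to a single key (assumed rules)."""
    if "://" not in url:
        url = "http://" + url  # assumption: a missing scheme means http
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # keep the slash, drop the file name
    return host + path.rstrip("/")


def site(key):
    """Assumption: the site of a canonical key is its host name."""
    return key.split("/", 1)[0]


def preprocess(raw_links, crawled_urls):
    """raw_links: (source_url, target_url) pairs; crawled_urls: pages whose links we know."""
    known_sites = {site(canonical(u)) for u in crawled_urls}
    edges = set()  # a set also merges duplicate links between the same two pages
    for src, dst in raw_links:
        s, d = canonical(src), canonical(dst)
        if site(s) == site(d):
            continue  # site-internal link
        if site(d) not in known_sites:
            continue  # target site has unknown link structure
        edges.add((s, d))
    return edges


raw = [
    ("http://www.it.uu.se/a.html", "http://www.it.uu.se/index.html"),  # internal
    ("http://www.it.uu.se", "http://example.org/x"),                   # kept
    ("http://www.it.uu.se", "http://unknown.example/y"),               # unknown site
    ("www.it.uu.se/index.html", "http://example.org/x"),               # duplicate of the kept link
]
edges = preprocess(raw, ["http://www.it.uu.se", "http://example.org/x"])
```

With these rules the three spellings on the slide collapse to the same key, and only one link survives in `edges`.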
The assignment
• You will mine four different link structures for four different queries.
• We have done the web crawling and some of the preprocessing for you!
  • Input files are on the lab course web page
• However, you must
  • Do some preprocessing yourselves
    • Directions for pre-processing are on the lab course web page
  • Validate your implementation
    • Think of how to verify your solution
    • Your validation does not have to be fancy, not even automated
    • At least, implement the test case on the following slide, and see what output it gives you.
    • Make sure that the test case output is reasonable
Example (test case)
• Rank the pages according to hub and authority value in this link structure:
[Figure: link structure over the pages a, b, c, d]