The Homepage Finder
Aditya Ramani, Arpit Jain, Kashif Manzoor, Omid Fatemieh
Agenda
• Introduction
  • Is there really a problem?
  • Is this problem worth solving?
• Architectural Overview
  • The block diagram
• Architectural Detail
  • The explanation and justification
• Results
  • Comparison with other search engines
  • The value added
• Demo
  • Screenshots
  • Live demo
• Q&A
The HomePage Finder
Introduction
• Drawbacks of the traditional keyword search
  • Too generic
  • Tries to satisfy everyone’s (even conflicting) wishes
  • Returns too many results
  • Requires the user to filter out unwanted results through:
    • Successive refinement of the query
    • Picking out the desired result manually
Finding a homepage – not a trivial task!
• Manual intervention:
  • Find the homepage of Bill Clinton?
Fine tune the query
Further refinement
But does it work all the time?
No, it does not!
Why Find Someone’s Homepage?
• What is the office phone number of Dr. Anhai Doan?
• What is the email address of Dr. Jiawei Han?
• Profile of Dr. Kevin Chang
• A short biography of Arpit Jain
• What does Omid Fatemieh look like (arranged marriage FAQ?)
• This can be regarded as a classification problem: given a URL, determine whether the page belongs to the class “homepage”, the class “conference page”, or neither.
Architecture of Homepage Finder
[Block diagram; labeled components: Conference Query Rewording, Is Conference?, Conference Heuristic, Homepage Query Rewording, Results Preprocessor, Heuristic 0 through Heuristic 5, Rank Applier, Results Consolidator, Results Composer]
Homepage Characteristics
• Acts as a portal for user information
• Often:
  • Contains the person’s name in the title.
  • Contains some variation of the person’s name in the URL.
  • Is the stem of several other pages related to that person.
  • Has the words home, page, site, web, web site in its contents.
  • Contains the person’s contact information, publications, pictures, profile, resume etc.
  • Has a URL that ends with ‘/’, ‘index.html’ or ‘default.htm’.
Exceptions to Normal Rules
• Personnel directory web sites.
• Dynamic web sites that may take in the person’s name as a GET parameter (http://www.uiuc.edu/users/Name?Kashif+Manzoor).
• Unusual web page URLs (e.g. sfatemi2, kashman).
• Appearance of the name on popular web sites (e.g. Amazon reviews, or a researcher profile page at DBLP).
Conference Page Heuristics
• Conference query rewording:
  • Q1: Raw query (e.g. SIGMOD)
  • Q2: intitle query (e.g. intitle:SIGMOD conference OR symposium)
  • Q3: allintitle query (e.g. allintitle:SIGMOD conference OR symposium)
• Does the keyword represent a conference?
  • Q3 must be zero for a person (unless the person’s name is “conference”)
  • The Q1/Q2 ratio should be higher than a threshold (0.4)
• Selecting the conference URL
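The conference test above can be sketched roughly as follows. This is a minimal sketch, not the authors' code: it assumes that "Q1", "Q2" and "Q3" in the test refer to the number of results each query returns, and the function names, the hit-count parameters and the handling of a zero Q2 count are all hypothetical.

```python
def conference_queries(keyword: str) -> dict[str, str]:
    """Build the three reworded queries listed on the slide."""
    suffix = "conference OR symposium"
    return {
        "Q1": keyword,
        "Q2": f"intitle:{keyword} {suffix}",
        "Q3": f"allintitle:{keyword} {suffix}",
    }


def looks_like_conference(q1_hits: int, q2_hits: int, q3_hits: int,
                          threshold: float = 0.4) -> bool:
    """Apply the two slide tests to (assumed) result counts.

    Assumption: a keyword is treated as a conference only if Q3 is
    non-zero AND the Q1/Q2 ratio exceeds the threshold.
    """
    if q3_hits == 0:
        return False  # slide: Q3 is zero for a person
    if q2_hits == 0:
        return False  # assumption: ratio undefined, treat as "not a conference"
    return q1_hits / q2_hits > threshold


if __name__ == "__main__":
    for label, text in conference_queries("SIGMOD").items():
        print(label, text)
```

How the counts are obtained from the search engine is left out on purpose, since the slides do not describe that step.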
Homepage Heuristics
• Query rewording:
  • Q1: Raw query (e.g. Anhai Doan) [relative weight=1]
  • Q2: intitle query (e.g. intitle:Anhai Doan) [relative weight=3]
  • Q3: allintitle query (e.g. allintitle:Anhai Doan home OR homepage OR webpage OR "web site" OR personal OR page) [relative weight=2]
• Targeting specific words [relative weight=0]:
  • Q4: resume/CV (e.g. Chengxiang Zhai resume OR "CV" OR "Curriculum Vitae")
  • Q5: Chengxiang Zhai publications
  • Q6: Chengxiang Zhai contact
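As an illustration, the six reworded queries and their relative weights could be generated like this. The query strings and weights come from the slide; the `WeightedQuery` type and the function name are hypothetical, and this is only a sketch of the rewording step.

```python
from typing import NamedTuple


class WeightedQuery(NamedTuple):
    label: str
    text: str
    weight: int  # relative weight from the slide


HOME_WORDS = 'home OR homepage OR webpage OR "web site" OR personal OR page'


def homepage_queries(name: str) -> list[WeightedQuery]:
    """Return Q1..Q6 for a person's name, with the slide's weights."""
    return [
        WeightedQuery("Q1", name, 1),
        WeightedQuery("Q2", f"intitle:{name}", 3),
        WeightedQuery("Q3", f"allintitle:{name} {HOME_WORDS}", 2),
        # Q4..Q6 target specific words and carry relative weight 0.
        WeightedQuery("Q4", f'{name} resume OR "CV" OR "Curriculum Vitae"', 0),
        WeightedQuery("Q5", f"{name} publications", 0),
        WeightedQuery("Q6", f"{name} contact", 0),
    ]


if __name__ == "__main__":
    for q in homepage_queries("Anhai Doan"):
        print(q.label, q.weight, q.text)
```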
Homepage Heuristics (cont’d)
• URL Preprocessor:
  • CGI URLs (identified by a ‘?’ in the URL) are usually not homepages; assign them a very low rank
  • Assign the query’s relative weights
• Heuristics:
  • Does the URL contain variations of the user name (e.g. http://anhai.cs.uiuc.edu/home/)? [relative weight=2]
  • Does the URL contain words like home, user, homepage, people… (e.g. http://www.cs.uiuc.edu/homes/sfatemi2/)? [relative weight=1]
  • Does the URL end with ‘/’, ‘index.*’ or ‘default.*’? [relative weight=1]
  • URL stem matching [relative weight=number of occurrences]
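A rough sketch of the CGI check and the first three URL heuristics follows. Several details are assumptions, since the slides do not spell them out: the set of name variations, the numeric value used for "a very low rank", and the function names are all hypothetical. Stem matching is left out of this sketch.

```python
import re
from urllib.parse import urlparse

HOME_WORDS = ("home", "user", "homepage", "people")
CGI_PENALTY = -100  # assumption: stands in for "a very low rank"


def name_variations(full_name: str) -> set[str]:
    """A few hypothetical variations; the slides do not list the real ones."""
    parts = full_name.lower().split()
    if not parts:
        return set()
    first, last = parts[0], parts[-1]
    return {first, last, first + last, first[0] + last}


def url_score(url: str, full_name: str) -> int:
    """Sum the relative weights of the URL heuristics that fire."""
    if "?" in url:
        return CGI_PENALTY  # CGI URLs are usually not homepages
    lowered = url.lower()
    path = urlparse(lowered).path
    score = 0
    if any(v in lowered for v in name_variations(full_name)):
        score += 2  # URL contains a variation of the user name
    if any(w in lowered for w in HOME_WORDS):
        score += 1  # URL contains a word like home, user, people
    if path.endswith("/") or re.search(r"/(index|default)\.[^/]*$", path):
        score += 1  # URL ends with '/', 'index.*' or 'default.*'
    return score


if __name__ == "__main__":
    print(url_score("http://anhai.cs.uiuc.edu/home/", "Anhai Doan"))
    print(url_score("http://www.uiuc.edu/users/Name?Kashif+Manzoor", "Kashif Manzoor"))
```

The second slide example (`.../homes/sfatemi2/`) scores only on the "home" word and the trailing slash here, which illustrates the "unusual user name" exception noted earlier in the deck.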
Justification of Chosen Heuristics
• Intuitively, the concept of bagging was adopted first: each heuristic was treated as a classifier (voting yes if the URL is a homepage, no otherwise). Based on the experimental results and the error rates, bagging was upgraded to boosting, whereby the vote of each heuristic was assigned a weight based on how well it performed during bagging. This is a loose, intuitive implementation of the AdaBoost algorithm (Freund and Schapire, 1997).
• Content parsing does not improve accuracy significantly and severely impacts response time (“Dynamic Reference Sifting: A Case Study in the Homepage Domain”, in Proceedings of the Sixth International World Wide Web Conference, pp. 189-200, 1997).
• The majority of web pages follow these simple standards, which makes the heuristics pragmatic and powerful.
• They are configurable and can be extended easily.
• The heuristics do not reinvent the wheel (we only do what Google hasn’t done already).
• They do not incorporate Google rank at all.
• Their overhead is minimal.
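For reference, the textbook AdaBoost rule that the first bullet alludes to weights each classifier by the log-odds of its error rate and accepts when the weighted "yes" votes reach half of the total weight. The sketch below shows that rule only. It is not the procedure used for Homepage Finder, whose weights were set by hand, and the heuristic names and error rates in the usage example are made up for illustration.

```python
import math


def vote_weight(error_rate: float) -> float:
    """AdaBoost-style weight: log((1 - e) / e) for error rate e in (0, 1)."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    return math.log((1.0 - error_rate) / error_rate)


def weighted_vote(votes: dict[str, bool], weights: dict[str, float]) -> bool:
    """Accept if the weighted 'yes' votes reach half of the total weight."""
    yes = sum(weights[name] for name, vote in votes.items() if vote)
    total = sum(weights[name] for name in votes)
    return yes >= total / 2


if __name__ == "__main__":
    # Hypothetical error rates, for illustration only.
    errors = {"name_in_url": 0.2, "url_pattern": 0.4}
    weights = {name: vote_weight(e) for name, e in errors.items()}
    print(weighted_vote({"name_in_url": True, "url_pattern": False}, weights))
```

A classifier that is right half of the time gets weight zero under this rule, so it has no influence on the vote.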
Limitations
• Dependent on Google
• Only processes the top ten Google results
• Aditya Ramani’s homepage could not be found. Why?
• Performs 6 Google queries to find the homepage (potentially at least 6 times slower than Google).
• Does not perform web page content analysis:
  • Chengxiang Zhai’s homepage at CMU is ranked higher than his UIUC homepage
• Amazon and other websites sometimes throw the heuristics off track.
• User names may not appear in a predictable format in the URL
• Users’ web pages may not be developed according to standard practices (e.g. the user name may not appear in the title…)
Limitations (cont’d)
• The accuracy of conference homepage results is affected by:
  • the lack of sufficient heuristics
  • the acronyms that conferences use
  • It can be improved by allowing the user to specify whether the keyword is a conference title or a person’s name.
• Experimental results
  • Not too exhaustive, due to the manual tagging required and the limitations of Google’s free API.
  • No homepage portal web site was available to use as test data.
  • The AdaBoost algorithm was not systematically applied to the heuristic results; instead, the weights were adjusted manually based on the error, following the general guidelines of the algorithm.
  • Heuristic contributions should have been investigated further, similar to how Doan et al. [2001] compare the learners.
Related Work and References
• J. Shakes, M. Langheinrich, and O. Etzioni. Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the 6th World Wide Web Conference, 1997.
  • Develops a system called Ahoy that applies three external systems, in addition to an internal extensible repository of heuristics, to retrieve the possible homepage URLs. It then applies heuristics to tune and rank the final results.
• A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and contact information from email and the web. In Proceedings of CEAS-1, 2004.
  • Presents a much more comprehensive solution, only part of which is retrieving the user’s homepage. It extracts homepage hints from the email text and then attempts to find the user’s homepage, in order to extract community and contact information where possible.
• R. Bekkerman and A. McCallum. Disambiguating Web Appearances of People in a Social Network. In Proceedings of World Wide Web Conference, 2005.
  • A follow-up to the previous paper. Given several web pages belonging to people named “John Smith”, keep only the pages that relate to the “John Smith” who is related to you and strike out all pages related to other “John Smiths”. The paper uses the following approach: “Analyze the community-links to identify the John smith that is part of the community to which you belong, ignore all others”
Results
• HPF outperformed the other search engines in these tests.
• Although HPF uses Google as its first step, it pays no attention to Google rank; its better performance than Google is therefore a credit to its own heuristics.
• Google and Yahoo usually have the same homepage in their top 10 results, but if it is not the first result we report this as an error.
• The above data was calculated on ‘famous’ and ‘not so famous’ categories: 50% of the data was based on famous people (professors, actors etc.) and the other 50% on not so famous people (undergraduate students, graduate students, our friends etc.).
Experimental Results
• Manually ran several queries on three commercial search engines.
• Keywords were chosen according to the following distribution:
  • 50% famous or semi-famous persons (e.g. researchers, actors etc.)
  • 50% not so famous persons (e.g. personal friends, graduate and undergraduate students)
• A result is considered correct if the person’s homepage is the first result of the search engine.
• Error is calculated by treating the ranked results as ordinal data, where each homepage result has a rank $R_{if} \in \{1, 2, \dots, M_f\}$, and then normalizing the rank onto [0, 1]: $Z_{if} = \frac{R_{if} - 1}{M_f - 1}$, where $R_{if}$ is the rank assigned by search engine $i$ to the correct homepage and $M_f$ is the worst possible rank (taken as 10 in our case).
• The mean error is calculated over all the test data to give the final overall percentage error of a search engine.
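The error measure above is simple enough to write out directly. In this minimal sketch the function names are hypothetical and the ranks in the usage example are illustrative, not taken from the experiments.

```python
def rank_error(rank: int, worst_rank: int = 10) -> float:
    """Normalized rank error: (R - 1) / (M - 1), in [0, 1]."""
    if not 1 <= rank <= worst_rank:
        raise ValueError("rank must lie between 1 and worst_rank")
    return (rank - 1) / (worst_rank - 1)


def mean_error_percent(ranks: list[int], worst_rank: int = 10) -> float:
    """Mean normalized error over all test queries, as a percentage."""
    if not ranks:
        raise ValueError("at least one rank is required")
    total = sum(rank_error(r, worst_rank) for r in ranks)
    return 100.0 * total / len(ranks)


if __name__ == "__main__":
    # Illustrative ranks of the correct homepage for four queries.
    print(mean_error_percent([1, 1, 4, 10]))
```

A homepage ranked first contributes zero error and one ranked tenth contributes the maximum; how a homepage missing from the top ten was scored is not stated in the slides, so this sketch rejects such input rather than guess.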
Heuristic Effectiveness
• H1: user name in URL heuristic
• H2: URL pattern heuristic
• H3: specific words in URL heuristic
• H4: stem matching heuristic
Conclusions
• Heuristic weights were finalized based on these results.
• The stem matching heuristic was improved by using the number of stem matching occurrences. This improved the accuracy quite a bit (the average contribution was 4.3 for correct positives and 1.9 for false positives).
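One plausible reading of "stem matching occurrence" is sketched below: a candidate URL earns one point for every other result URL that lives under the same directory. This interpretation, the function names and the example URLs are assumptions; the slides do not define the stem precisely.

```python
def stem_of(url: str) -> str:
    """Directory part of a URL, e.g. 'http://a.edu/~x/cv.html' -> 'http://a.edu/~x/'."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    last_slash = url.rfind("/")
    if last_slash < start:
        return url + "/"  # bare host with no path
    return url[: last_slash + 1]


def stem_occurrences(candidate: str, result_urls: list[str]) -> int:
    """Count the other result URLs that start with the candidate's stem."""
    stem = stem_of(candidate)
    return sum(1 for u in result_urls if u != candidate and u.startswith(stem))


if __name__ == "__main__":
    results = [
        "http://a.edu/~x/",
        "http://a.edu/~x/pubs.html",
        "http://a.edu/~x/cv.pdf",
        "http://b.com/x",
    ]
    print(stem_occurrences("http://a.edu/~x/", results))
```

Under this reading the count is used directly as the heuristic's relative weight, matching the "number of occurrences" weight given for URL stem matching.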
Demo
Google ranked it: 5; we ranked it: 2
Google ranked it: 2; we ranked it: 3
Google rank: 3; our rank: 1
Querying with “Arpit Jain” does not bring this result into the top 5 at Google
Thank you