UIUC People Finder
Info • University of Illinois at Urbana-Champaign • Advanced Database Management Systems (CS511) • Instructor: ChengXiang Zhai • Sena Lee (senalee2@uiuc.edu) • Heewon Jung (hjung20@uiuc.edu) • Seung Pyo Lee (slee232@uiuc.edu) • Ricardo Redder (rredder2@uiuc.edu) • John Laipple (laipple@uiuc.edu)
Agenda • Problem • Motivation • Common problem • Definition • Challenges • Solution • Implementation • Retrieval • Interpretation • Decision • Demo • Future work
Motivation • For a given person: • The information about the person stored in relational databases is very limited, e.g. name, age, address. • There is a lot of information about him or her on the Internet, e.g. web pages, papers, blogs, pictures. • Use the best of both worlds
Common problem ChengXiang Zhai Search
Phonebook ChengXiang Zhai Search
Entity retrieval • Given: • a set of entities E • a relational table where each tuple describes some aspects of an entity • a set of documents • A user who is interested in an entity e_i poses a query Q and expects the tuple that represents e_i, together with the documents associated with e_i.
Our example • Query = keywords (usually name) • Table = Phonebook • Documents = Results from search engines
Challenges • Semantic problem • It is different from finding a document that is mathematically similar to the query • It is subjective: the final target is in the user's mind and is not expressed by a function
Solving • Use the information from the relational database to improve the document search • The information from the phonebook is reliable and very accurate • Search engines are more generic; a simple search for a name might not be useful
Our example again ChengXiang Zhai Search
Sequence • User types a query • User clicks the Search button • Application searches the Phonebook • Application retrieves the information from the Phonebook • Application queries the search engines, using the information retrieved from the Phonebook
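The sequence above can be sketched as a small pipeline. This is only an illustration, not the project's actual code: the in-memory `PHONEBOOK`, the fictional "Jane Doe" record, and the function names are all assumptions, and the search engine is passed in as a plain function so the sketch runs without network access.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-in for the Phonebook table (fictional data).
PHONEBOOK = {
    "jane doe": {
        "name": "Jane Doe",
        "department": "Computer Science",
        "email": "jdoe@example.edu",
    },
}


def lookup_phonebook(query: str) -> Optional[dict]:
    """Return the Phonebook record whose name matches the query, if any."""
    return PHONEBOOK.get(query.strip().lower())


def build_web_query(record: dict) -> str:
    """Refine the web query with reliable Phonebook fields (assumed strategy)."""
    return f'"{record["name"]}" {record["department"]}'


def people_finder(query: str, search: Callable[[str], list]) -> dict:
    """Look up the Phonebook first, then search the web with the enriched query."""
    record = lookup_phonebook(query)
    # Fall back to the raw query when the person is not in the Phonebook.
    web_query = build_web_query(record) if record else query
    return {"record": record, "documents": search(web_query)}
```

Passing the search step in as a function keeps the Phonebook lookup independent of any particular search engine.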
Implementing the idea • How to retrieve the information and documents from web? • How to interpret the results? • How to decide whether a given document relates to the entity or not?
Web-sites as functions • Search engines • User types the text • Clicks on the button • Reads the results • Clicks on the results • UIUC People Finder • Application sends the text to the search engine (user steps 1, 2) • Stores the results (user steps 3, 4)
Using exposed HTTP interface • Search engines • Use GET or POST methods to receive information • Send the results in HTML • Application • Converts the query to a GET or POST request and sends it • Reads the HTML
Wrappers • Receive the text • Build the appropriate URL • Connect to the URL • Read the response • Flow: Query text → Wrapper → HTML • Example: http://www.google.com/search?hl=en&q=chengxiang+zhai&btnG=Google+Search
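A minimal wrapper along these lines might look as follows. The function names and the `User-Agent` header are assumptions, and only the `hl` and `q` parameters from the example URL are reproduced. Search engines may block or rate-limit automated requests, so treat this as a sketch of the idea rather than a working scraper.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(query: str, base: str = "http://www.google.com/search") -> str:
    """Build the GET URL for a query; spaces are encoded as '+'."""
    params = {"hl": "en", "q": query}
    return f"{base}?{urlencode(params)}"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Connect to the URL and return the response body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, `build_search_url("chengxiang zhai")` yields `http://www.google.com/search?hl=en&q=chengxiang+zhai`, which matches the slide's example apart from the `btnG` parameter.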
How do we interpret? • Visual language • Different styles, different meanings • Underline → links • Useful information → center
Extraction from HTML • HTML is Tag based < > • Different styles • <font size =…> • <h2> • <bgcolor =…> • Links • <a href = …> • Center • <body>
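As an illustration of tag-based extraction, the sketch below uses Python's standard `html.parser` module to collect link targets (`<a href=…>`) and `<h2>` heading text. The class and function names are hypothetical, and the slides do not say which parser the project used.

```python
from html.parser import HTMLParser


class LinkAndHeadingExtractor(HTMLParser):
    """Collect href values from <a> tags and the text inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h2":
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


def extract(html: str):
    """Return (links, headings) found in an HTML string."""
    parser = LinkAndHeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.headings
```

The same pattern extends to other style tags such as `<font>` by adding branches in `handle_starttag`.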
How to decide whether a given document relates to the entity or not?
How do we decide? • Look for related information • Context • Names • Other information
Application • Search for keywords found in the Phonebook. • Search for the name • Search for the department • Search for the address • etc. • Rank the pages • Name +100 points • Department +50 points • Email +250 points
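The point scheme above can be sketched as a simple scoring function. The weights come from the slide. The matching rule is an assumption: a case-insensitive substring test, with each field counted at most once per page. The record and page texts in the test data are fictional.

```python
# Weights taken from the slide: name +100, department +50, email +250.
WEIGHTS = {"name": 100, "department": 50, "email": 250}


def score_page(text: str, record: dict) -> int:
    """Add a field's weight when its Phonebook value appears in the page text."""
    lowered = text.lower()
    score = 0
    for field, weight in WEIGHTS.items():
        value = record.get(field)
        if value and value.lower() in lowered:
            score += weight
    return score


def rank_pages(pages: dict, record: dict) -> list:
    """Return (score, url) pairs sorted from highest to lowest score."""
    scored = [(score_page(text, record), url) for url, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The email carries the largest weight, presumably because it identifies a person far more precisely than a name, which may be shared by several people.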
Problem • Performance • Problem: Search engines return thousands or millions of results • Solution: Limit the number of retrieved web pages • Problem: Even with a limit on the number of analyzed web pages, many pages are accessed • Solution: Cache
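Both solutions can be combined in a small helper: cap how many result pages are fetched and keep each fetched page in a cache. This is a sketch under assumptions. The class name, the `max_pages` default, and the in-memory dictionary cache are illustrative, and the slides do not describe how the project's cache was stored.

```python
from typing import Callable


class CachedFetcher:
    """Fetch at most max_pages pages per query and reuse pages already seen."""

    def __init__(self, fetch: Callable[[str], str], max_pages: int = 10):
        self._fetch = fetch          # function that downloads one URL
        self._max_pages = max_pages  # limit on pages analyzed per query
        self._cache = {}             # url -> page content

    def get(self, url: str) -> str:
        """Return the page for url, downloading it only on a cache miss."""
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]

    def get_top(self, urls: list) -> dict:
        """Fetch only the first max_pages URLs from a result list."""
        return {url: self.get(url) for url in urls[: self._max_pages]}
```

Repeated queries for the same person then hit the network only for pages that were not seen before.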
Final architecture • [Diagram. Labels: Query text; Searchers; Google, Yahoo, and the WWW (online); Phonebook; cache (offline); outputs: Information, Picture, Documents]
Future work • Extend to other domains • MySpace, ACM, Papers, Blogs, etc… • Automatic link extraction • Better ranking function • User feedback • Owner feedback