Design of a Click-tracking Network for Full-text Search Engine
Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu
Outline
• Introduction
• Objective
• Project diagram
• Web Crawling
• Indexing schema
• Ranking strategies
  • PageRank Algorithms
  • Neural Network
  • Content-Based Ranking
• Software and References
Introduction
• Full-text Search Engine
  • search on keywords
  • rank results
• What is in a Search Engine?
  • Crawling
  • Indexing
  • Ranking results of query
Objective
• Design a full-text search engine
• Rank search results in different ways
Project Diagram
(figure; components shown: Website, Crawling, Text & URLs, Indexing, Database, Query Function, Content-Based Ranking, Click-Tracking Network, PageRank Algorithms, Ranked results)
Web Crawling
Main page: http://en.wikipedia.org/wiki/Machine_learning
Depth 1: crawl all the URL links on the main page
http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning ……
Depth 2: crawl all the URL links found at depth 1
http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain ……
# Implemented with the Python urllib2 module and the BeautifulSoup API
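The depth-limited crawl described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the group's implementation: it uses Python 3 with the standard-library `html.parser` instead of urllib2 and BeautifulSoup, the `fetch` callable and the `example.org` pages are hypothetical stand-ins so the sketch runs without network access, and stripping the `#fragment` from each link is an assumption of this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; returns {url: depth at which it was first seen}.

    `fetch` is a hypothetical callable mapping a URL to its HTML text
    (or None on failure).
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part (assumption).
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Tiny in-memory "web" standing in for real pages.
PAGES = {
    "http://example.org/a": '<a href="/b">b</a> <a href="/c#top">c</a>',
    "http://example.org/b": '<a href="/d">d</a>',
    "http://example.org/c": "",
    "http://example.org/d": '<a href="/e">e</a>',
}
result = crawl("http://example.org/a", PAGES.get, max_depth=2)
print(result)
```

Pages found at depth 2 are recorded but their own links are not followed, which matches the two-level crawl on the slide.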
(figure: the main-page URL links to depth-1 URLs, which in turn link to depth-2 URLs)
Schema for Basic Index
# Implemented with SQLite
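The schema figure itself is not reproduced in this transcript. As a rough sketch only, a basic full-text index in SQLite could look like the following; the table and column names are assumptions borrowed from the layout used in Toby Segaran's book, which the slides list as a reference, and may differ from what the group actually built.

```python
import sqlite3


def create_index_tables(con):
    """Create a hypothetical basic index schema (names are assumptions)."""
    con.executescript("""
        CREATE TABLE urllist(url TEXT);
        CREATE TABLE wordlist(word TEXT);
        -- one row per occurrence of a word in a page
        CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER);
        -- one row per hyperlink between two crawled pages
        CREATE TABLE link(fromid INTEGER, toid INTEGER);
        -- words used in the text of each link
        CREATE TABLE linkwords(wordid INTEGER, linkid INTEGER);
        CREATE INDEX wordidx ON wordlist(word);
        CREATE INDEX urlidx ON urllist(url);
        CREATE INDEX wordurlidx ON wordlocation(wordid);
    """)
    con.commit()


con = sqlite3.connect(":memory:")
create_index_tables(con)
tables = sorted(
    row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```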
Results for Multiple-word Query
(figure labels: word combination, word location, same url_id, query function)
Note: the url_ids returned are not ranked.
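One way to find pages containing every query word is to join the word-location table with itself, once per word, on the same url_id. The sketch below assumes hypothetical `wordlist` and `wordlocation` tables and made-up sample rows; it returns unranked rows of the form (url_id, location of word 1, location of word 2, ...).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE wordlist(word TEXT);
    CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER);
""")
# Sample data (made up): the words receive rowids 1 and 2.
con.executemany("INSERT INTO wordlist(word) VALUES (?)", [("machine",), ("learning",)])
con.executemany("INSERT INTO wordlocation VALUES (?, ?, ?)", [
    (1, 1, 0), (1, 2, 1),              # url 1 contains both words
    (2, 1, 5),                         # url 2 contains only "machine"
    (3, 1, 2), (3, 2, 9), (3, 2, 10),  # url 3 contains "learning" twice
])


def match_rows(con, query):
    """Return unranked (urlid, loc_0, loc_1, ...) rows for pages with all words."""
    words = query.lower().split()
    if not words:
        return []
    word_ids = []
    for word in words:
        row = con.execute("SELECT rowid FROM wordlist WHERE word = ?", (word,)).fetchone()
        if row is None:
            return []  # a word that was never indexed cannot match any page
        word_ids.append(row[0])
    fields, tables, clauses = ["w0.urlid"], [], []
    for i in range(len(word_ids)):
        fields.append(f"w{i}.location")
        tables.append(f"wordlocation w{i}")
        if i > 0:
            clauses.append(f"w{i - 1}.urlid = w{i}.urlid")  # same url_id
        clauses.append(f"w{i}.wordid = ?")
    sql = (f"SELECT {', '.join(fields)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(clauses)}")
    return con.execute(sql, word_ids).fetchall()


rows = match_rows(con, "machine learning")
print(sorted(rows))
```

Each row is one combination of word locations on a single page, so a page in which a word occurs several times appears in several rows.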
PageRank Algorithm
• Developed by Larry Page at Stanford University in 1996.
• Measures how important a page is.
• The importance of a page is calculated from all the other pages that link to it.
Source: http://www.rasch.org/rmt/rmt232a.htm
How to Calculate PR
PR(A) = (1 - d) + d * ( PR(B)/L(B) + …… + PR(D)/L(D) + …… )
• d: damping factor, 0 < d < 1, set to 0.85.
• PR(B), ……, PR(D), ……: PageRank value of each web page linking to page A.
• L(B), ……, L(D), ……: the number of links going out of page B, ……, D, ……
Example
PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2 )
      = 0.15 + 0.85 * 0.5
      = 0.575
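The arithmetic in this example can be checked with a few lines of Python. The function name `pagerank_update` is a hypothetical helper introduced for this sketch.

```python
def pagerank_update(inbound, d=0.85):
    """Compute PR for one page.

    inbound: list of (PR value, number of outgoing links) pairs,
    one pair per page that links to the page being scored.
    """
    return (1 - d) + d * sum(pr / links for pr, links in inbound)


# Pages B, C and D from the example: (PR, outgoing links).
pr_a = pagerank_update([(0.5, 4), (0.7, 4), (0.2, 1)])
print(round(pr_a, 3))
```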
How to Update the PR Value
If the PR values are not known to begin with, assign an initial PR value to every page, then update the values repeatedly (20 iterations).
Source: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
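The iterative update can be sketched as follows. This is an illustration under stated assumptions: the link graph is a plain dictionary invented for the example, the initial value of 1.0 is an assumption, and all pages are updated from the previous iteration's values (the group's code may instead have updated values in place).

```python
def pagerank(links, d=0.85, iterations=20, initial=1.0):
    """Iteratively compute PR for every page.

    links: {page: [pages it links to]} (hypothetical in-memory link graph).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # For each page, collect the pages that link to it.
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)
    out_count = {page: len(set(links.get(page, []))) for page in pages}

    pr = {page: initial for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            total = sum(pr[src] / out_count[src] for src in inbound[page])
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr


ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

A page with no inbound links, such as C here, settles at the minimum value 1 - d.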
Results for PageRank
(figure: PageRank values)
Neural Network
Why?
• A neural network can make reasonable guesses about results for queries it has never seen before.
Click-tracking
• The weights are updated based on which search results the user clicked.
Neural Network
• Step 1: Setting Up the Database
• Step 2: Feeding Forward Activation
• Step 3: Training with Backpropagation
How does the neural network work?
(figure legend: solid line = strong connection; bold text = active node)
Step 1: Setting Up the ANN Database
• Create a table for the hidden layer (red box)
• Create two tables for the connections (green boxes)
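The three tables could be set up roughly as below. The table names (`hiddennode`, `wordhidden`, `hiddenurl`), the column names, and the default strengths returned for missing connections are assumptions taken from the approach in Segaran's book, not details confirmed by the slides.

```python
import sqlite3


def create_ann_tables(con):
    """One table for hidden nodes, two for the connection strengths."""
    con.executescript("""
        CREATE TABLE hiddennode(create_key TEXT);
        CREATE TABLE wordhidden(fromid INTEGER, toid INTEGER, strength REAL);
        CREATE TABLE hiddenurl(fromid INTEGER, toid INTEGER, strength REAL);
    """)


def get_strength(con, fromid, toid, layer):
    """Read a connection strength; layer 0 = word->hidden, 1 = hidden->URL."""
    table = "wordhidden" if layer == 0 else "hiddenurl"
    row = con.execute(
        f"SELECT strength FROM {table} WHERE fromid = ? AND toid = ?",
        (fromid, toid),
    ).fetchone()
    if row is None:
        # Assumed defaults for connections that do not exist yet.
        return -0.2 if layer == 0 else 0.0
    return row[0]


def set_strength(con, fromid, toid, layer, strength):
    """Insert or update a connection strength."""
    table = "wordhidden" if layer == 0 else "hiddenurl"
    row = con.execute(
        f"SELECT rowid FROM {table} WHERE fromid = ? AND toid = ?",
        (fromid, toid),
    ).fetchone()
    if row is None:
        con.execute(
            f"INSERT INTO {table}(fromid, toid, strength) VALUES (?, ?, ?)",
            (fromid, toid, strength),
        )
    else:
        con.execute(f"UPDATE {table} SET strength = ? WHERE rowid = ?", (strength, row[0]))


con = sqlite3.connect(":memory:")
create_ann_tables(con)
set_strength(con, 101, 1, 0, 0.5)  # word 101 -> hidden node 1
print(get_strength(con, 101, 1, 0), get_strength(con, 1, 201, 1))
```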
Step 2: Feeding Forward Activation
• Objective: activate the ANN.
• Take words as inputs
• Activate the links in the network
• Give an output for each URL
• Activation function: hyperbolic tangent (figure x-axis: total input to the node)
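A feed-forward pass with the hyperbolic tangent activation might look like this. The weights are held in plain nested lists rather than read from the database, and the example network size and weight values are made up for illustration.

```python
import math


def feed_forward(word_inputs, w_in, w_out):
    """Propagate word activations through one hidden layer to the URL outputs.

    w_in[i][j]:  strength from word i to hidden node j
    w_out[j][k]: strength from hidden node j to URL k
    """
    n_words = len(word_inputs)
    n_hidden = len(w_out)
    n_urls = len(w_out[0])
    hidden = [
        math.tanh(sum(word_inputs[i] * w_in[i][j] for i in range(n_words)))
        for j in range(n_hidden)
    ]
    outputs = [
        math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hidden)))
        for k in range(n_urls)
    ]
    return hidden, outputs


# Two query words, one hidden node, three candidate URLs (made-up weights).
hidden, outputs = feed_forward([1.0, 1.0], [[0.5], [0.5]], [[0.1, 0.1, 0.1]])
print(outputs)
```

Before any training, all URLs receive the same output because their connection strengths are identical.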
Step 3: Training with Backpropagation
• Train the network every time someone performs a search and chooses one of the links
• The same algorithm covered in class
• Learning rate = 0.5
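A single backpropagation step for a click could be sketched as follows: the clicked URL gets target 1, all others target 0, and both weight layers are adjusted with learning rate 0.5. The in-memory weight lists, the starting weights, and the choice of 30 repetitions are assumptions for illustration.

```python
import math


def dtanh(y):
    """Slope of tanh expressed in terms of the node's output y."""
    return 1.0 - y * y


def train_step(inputs, w_in, w_out, targets, rate=0.5):
    """One backpropagation update; returns the outputs from before the update."""
    n_in, n_hid, n_out = len(inputs), len(w_out), len(targets)

    # Forward pass.
    hidden = [math.tanh(sum(inputs[i] * w_in[i][j] for i in range(n_in)))
              for j in range(n_hid)]
    outputs = [math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hid)))
               for k in range(n_out)]

    # Error terms for the output layer, then for the hidden layer.
    out_delta = [dtanh(outputs[k]) * (targets[k] - outputs[k]) for k in range(n_out)]
    hid_delta = [dtanh(hidden[j]) * sum(out_delta[k] * w_out[j][k] for k in range(n_out))
                 for j in range(n_hid)]

    # Weight updates (hidden deltas were computed from the old output weights).
    for j in range(n_hid):
        for k in range(n_out):
            w_out[j][k] += rate * out_delta[k] * hidden[j]
    for i in range(n_in):
        for j in range(n_hid):
            w_in[i][j] += rate * hid_delta[j] * inputs[i]
    return outputs


# Two query words, one hidden node, three URLs; the user clicked URL index 1.
w_in = [[0.5], [0.5]]
w_out = [[0.1, 0.1, 0.1]]
first = train_step([1.0, 1.0], w_in, w_out, [0.0, 1.0, 0.0])
for _ in range(30):
    last = train_step([1.0, 1.0], w_in, w_out, [0.0, 1.0, 0.0])
print(first, last)
```

After repeated training on the same click, the output for the clicked URL rises while the others fall toward zero.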
Results for Neural Network
• Step 1: (table of hidden-node connections; columns: from ID, to ID, strength)
• Step 2: (figure labels: input, URL, relevance of URL)
• Step 3: Training with one query
Results for Neural Network (contd.)
• Step 3: Training with more queries
Content-Based Ranking
Basic idea: calculate a score based only on the query and the content of the page
• Word frequency
• Document location
• Word distance
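The three metrics can be sketched as scoring functions over query-match rows of the form (url_id, location of word 1, location of word 2, ...). The row format, the normalization to the range 0 to 1, and the equal weighting are assumptions of this sketch; the sample rows are made up.

```python
def normalize(scores, small_is_better=False):
    """Scale scores to the range 0..1, where 1 is always the best."""
    tiny = 1e-5  # avoids division by zero
    if small_is_better:
        best = max(tiny, min(scores.values()))
        return {u: best / max(tiny, s) for u, s in scores.items()}
    best = max(tiny, max(scores.values()))
    return {u: s / best for u, s in scores.items()}


def frequency_score(rows):
    """Word frequency: pages with more matching rows score higher."""
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return normalize(counts)


def location_score(rows):
    """Document location: words appearing earlier in the page score higher."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        best[row[0]] = min(total, best.get(row[0], total))
    return normalize(best, small_is_better=True)


def distance_score(rows):
    """Word distance: query words appearing closer together score higher."""
    best = {}
    for row in rows:
        locs = row[1:]
        dist = sum(abs(locs[i] - locs[i - 1]) for i in range(1, len(locs)))
        best[row[0]] = min(dist, best.get(row[0], dist))
    return normalize(best, small_is_better=True)


def rank(rows, weights=(1.0, 1.0, 1.0)):
    """Combine the three scores and return url_ids from best to worst."""
    if not rows:
        return []
    parts = [frequency_score(rows), location_score(rows), distance_score(rows)]
    totals = {}
    for weight, scores in zip(weights, parts):
        for url_id, score in scores.items():
            totals[url_id] = totals.get(url_id, 0.0) + weight * score
    return sorted(totals, key=totals.get, reverse=True)


rows = [(1, 0, 1), (3, 2, 9), (3, 2, 10)]
ranking = rank(rows)
print(ranking)
```

Changing the `weights` tuple shifts the balance between the three metrics, for example to favor word frequency over word distance.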
Software and References
• Ubuntu 11.04
• Python 2.7.3
• SQLite
• Programming Collective Intelligence, Toby Segaran
• SQLite Tutorial, ZetCode
• Dive into Python, Mark Pilgrim