Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro
Introduction • Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries. • Search engines create their databases by probing web servers on a per-URL basis with a little help from the web servers. Alessandro Barilari
Main Problem • There is no standard for facilitating the push of updates from servers to search engines: • It takes up to six months for a new page to be indexed by popular web search engines; • The data indexed by the search engines is often stale.
Solution… • Having web servers help keep search engines fresh is a favorable situation for web sites, search engines and users.
…and its problems • The number of updates per second is very large. • Must balance between: • The number of interactions between web sites and search engines, and • The freshness of the search engines.
Page rank impact • Pages which are popular will have higher page ranks: • Use popularity in addition to age and freshness to compute the mismatch between a web site and a search engine
Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues
Some definitions • Update: an update u to a file f is a modification to f that has been flushed to the disk; • Propagation of an update: an update is said to be propagated when the web site has informed the search engine about the update. A search engine may or may not retrieve that update; • Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. The web site informs the search engine about all the updates in U(t).
Some definitions (2) • Weight of a file: given a content file f, its weight (non-negative) denotes the importance of the file; the weights are chosen such that: • Last_modification_time(u,t): the last time before t when the file f(u) was updated.
The Cost Model • Components: • Communication cost; • Opportunity cost: represents the staleness of the search engine data compared to the data on the web server. • CPU cost is ignored.
Opportunity cost (OC) • Given an unpropagated update u to a content file f, the opportunity cost for update u at time t is: OC(u,t) = weight(f(u)) × (t − last_modification_time(u,t)) • Definition for meta-update propagation:
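The opportunity cost of a single update can be sketched in Python as below. This is only an illustration: it reads the factor in front of the time difference as the weight of the updated file, and it assumes that the cost of a meta-update is the plain sum over the unpropagated set U(t), which the slide text does not spell out. The `Update` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Update:
    weight: float         # weight of the updated file f(u), non-negative
    last_modified: float  # last_modification_time(u, t)


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t) = weight(f(u)) * (t - last_modification_time(u, t))."""
    return u.weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated: Iterable[Update], t: float) -> float:
    # Assumption: the meta-update cost is the sum of OC(u, t) over U(t).
    return sum(opportunity_cost(u, t) for u in unpropagated)


pending = [
    Update(weight=0.5, last_modified=10.0),
    Update(weight=0.25, last_modified=30.0),
]
print(opportunity_cost(pending[0], 50.0))     # 0.5 * 40 = 20.0
print(total_opportunity_cost(pending, 50.0))  # 20.0 + 0.25 * 20 = 25.0
```

The longer an important file stays unpropagated, the faster its cost grows, which is what pushes the web site to notify the search engine.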
Communication cost (CC) • size_{f(u)}(t): the size of file f(u) at time t.
Potential Communication cost (PCC) • Represents the communication cost which would need to be incurred in case update u were to be propagated after time t:
The Cost Function • Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:
Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues
FreshFlow Algorithm • As soon as OC_tot equals PCC_tot at some time t, the web server informs the search engine about all the unpropagated updates.
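The trigger rule above can be sketched as a small polling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the slides do not give the PCC formula in text, so the code assumes the potential communication cost is linear in file size (`COST_PER_KB` is a made-up constant), and it assumes both totals are sums over the pending updates. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PendingUpdate:
    url: str
    weight: float         # weight of the file, non-negative
    last_modified: float  # last_modification_time(u, t)
    size_kb: float        # current size of the file


COST_PER_KB = 1.0  # hypothetical conversion from file size to cost units


def oc_total(pending, t):
    # Total opportunity cost of all unpropagated updates at time t.
    return sum(u.weight * (t - u.last_modified) for u in pending)


def pcc_total(pending):
    # Assumption: potential communication cost grows linearly with file size.
    return sum(COST_PER_KB * u.size_kb for u in pending)


def should_propagate(pending, t):
    # Polling at discrete times can step past exact equality, so test with >=.
    return bool(pending) and oc_total(pending, t) >= pcc_total(pending)


def first_propagation_time(pending, times):
    """Return the first polled time at which the meta-update is sent, else None."""
    for t in times:
        if should_propagate(pending, t):
            return t
    return None


pending = [
    PendingUpdate("/index.html", weight=0.5, last_modified=0.0, size_kb=4.0),
    PendingUpdate("/news.html", weight=0.5, last_modified=2.0, size_kb=2.0),
]
print(first_propagation_time(pending, range(2, 20)))  # 7
```

In this example the total opportunity cost is t − 1 and the assumed total PCC is 6, so the notification fires at t = 7. Larger files delay the notification, while heavier weights bring it forward.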
Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues
Analysis • The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV)
Analysis (2) • Lemma (1): OC(u,t) is monotonically non-decreasing; • Lemma (2): consider an update u to a file f that FF transmits but ADV does not. Then OC_ADV(u,t) ≥ OC_FF(u,t); • Lemma (3): if the update is transmitted by the adversary (ADV), then CC_ADV(u,t) ≥ CC_FF(u,t).
Theorem • FF is 2-competitive: Cost_FF(u,t) ≤ 2 × Cost_ADV(u,t)
Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues
Practical issues • There are multiple search engines: • Synchronization effect: pushing the updates to all of them would put pressure on the last-hop link to the web server; • Search engine load: some search engines might refuse to receive updates.
The middleman approach • Each web server contacts only one middleman for sending its updates; • The middleman could itself be a group of middlemen.
Benefits • The middleman can solve some additional issues: • Verifying the trustworthiness of web servers; • Restricting the rate at which updates get transmitted to search engines.
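The two benefits above can be illustrated with a toy middleman. The slides do not say how trust is verified or how the rate is restricted, so this sketch assumes a simple allow-list of trusted sites and a fixed cap on notices forwarded per tick. The `Middleman` class, its methods and the site and engine names are all hypothetical.

```python
from collections import deque


class Middleman:
    def __init__(self, trusted_sites, max_per_tick):
        self.trusted_sites = set(trusted_sites)  # assumed allow-list
        self.max_per_tick = max_per_tick         # assumed rate cap
        self.queue = deque()

    def receive(self, site, urls):
        """Accept an update notice from a web server; reject untrusted sites."""
        if site not in self.trusted_sites:
            return False
        self.queue.extend((site, url) for url in urls)
        return True

    def tick(self, search_engines):
        """Forward at most max_per_tick queued notices to every search engine."""
        count = min(self.max_per_tick, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        return {engine: list(batch) for engine in search_engines}


m = Middleman(trusted_sites={"example.org"}, max_per_tick=2)
m.receive("example.org", ["/a", "/b", "/c"])  # queued
m.receive("spam.example", ["/x"])             # rejected, site not trusted
print(m.tick(["engine-1", "engine-2"]))       # both engines get /a and /b
print(m.tick(["engine-1", "engine-2"]))       # both engines get /c
```

Because the web server talks only to the middleman, the fan-out to several search engines no longer loads the server's last-hop link.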
Limitations • The algorithm has not been used in practice; • The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.
Conclusions • The FreshFlow algorithm is a solution that improves how search engine data is kept up to date, while maintaining a high level of efficiency and performance; • The authors are planning to implement the algorithm in a real system (and have a future publication!)