FeedEx: Collaborative Exchange of News Feeds. Seung Jun, Mustaque Ahamad Georgia Institute of Technology WWW 2006. Outline. One line comment Motivation/Problem Approach Analysis of feed publishing Challenges Experiments Critique. One line comment.
FeedEx: Collaborative Exchange of News Feeds Seung Jun, Mustaque Ahamad Georgia Institute of Technology WWW 2006
Outline • One line comment • Motivation/Problem • Approach • Analysis of feed publishing • Challenges • Experiments • Critique
One line comment • Disseminate web feeds in a distributed (P2P) manner to increase the scalability of web servers • RSS reveals visitors to content providers • RSS decouples the fetch operation from reading • [Figure: traditional method vs. P2P method of delivering RSS to clients A and B]
Motivation & Problem • RSS/Atom feeds have become increasingly popular • Published by most traditional media and blogs • Feeding mechanism • Publisher (e.g. nyt.com, http://nyt.com/../feed.xml): updates the feed page as contents are added • RSS reader: polls the server (HTTP request / HTTP response) to check for updates • Problem: scalability
Approach • P2P overlay + gossip-based protocol • P2P: resources grow scalably with service demand • Gossip: scalable and robust to joins and leaves • Feature of this overlay • No need to guarantee delivery or delay • Challenges • Overlay construction • Fetching interval determination • Data dissemination • Free riding prevention • Content searching ?
Analysis of Feed Publishing • Methodology • 245 popular feeds monitored for 10 days • Popular feeds identified from Gmail’s web clips and Bloglines • Feeds fetched every 2 minutes • Measured • Publishing rate • Entry count in a feed • Entry lifetime
Publishing Rate by Rank • Large differences between publishers • Partly follows a Zipf distribution
Entry Count • Does a high publishing rate mean a higher entry count? No • Entry lifetimes are short • Entries can be lost with infrequent requests
Publishing Rate by Time • 4 types of publishing patterns
Challenges – Overlay Construction (1/2) – • Goal: Minimize network management overhead • Join • Contact a well-known host OR previous neighbors • Share subscription set info • Update subscription set info to the network • Leave • Soft-state • Update subscription set periodically • [Figure labels: gateway, neighbor list, subscription set]
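The soft-state leave handling above can be sketched as a neighbor table whose entries disappear unless they are refreshed by periodic subscription-set updates. This is only an illustration: the class name `NeighborTable`, the `ttl_seconds` timeout, and the explicit `now` argument are assumptions, not details given in the slides.

```python
import time


class NeighborTable:
    """Soft-state neighbor list: entries vanish unless they are refreshed."""

    def __init__(self, ttl_seconds=300.0):
        # ttl_seconds is a hypothetical timeout; the slides give no value.
        self.ttl = ttl_seconds
        self._entries = {}  # neighbor id -> (subscription set, last refresh time)

    def refresh(self, neighbor, subscriptions, now=None):
        """Record a periodic subscription-set update from a neighbor."""
        now = time.time() if now is None else now
        self._entries[neighbor] = (frozenset(subscriptions), now)

    def expire(self, now=None):
        """Drop neighbors whose last update is older than the TTL."""
        now = time.time() if now is None else now
        stale = sorted(
            n for n, (_, seen) in self._entries.items() if now - seen > self.ttl
        )
        for n in stale:
            del self._entries[n]
        return stale

    def neighbors(self):
        """Return the ids of neighbors currently considered alive."""
        return sorted(self._entries)
```

With this shape a node that leaves needs to send no explicit message: it simply stops refreshing and is dropped at the next `expire` call.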
Challenges – Overlay Construction (2/2) – • Neighbor selection • Many neighbors may incur overhead • Need to adapt to my resource status • Select neighbors that are “useful” to me • Those whose subscription set is similar to mine • [Figure: nodes A and B; 1 direct, 1 one-hop, 1 two-hop]
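One way to picture "similar subscription set" is to rank candidate neighbors by set overlap. The slides do not say which similarity measure is used, so the Jaccard index below is an assumption, as are the function names and the `max_neighbors` cap standing in for "my resource status".

```python
def jaccard(a, b):
    """Jaccard similarity of two subscription sets (an assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_neighbors(my_subs, candidates, max_neighbors):
    """Pick up to max_neighbors candidates with the most similar subscriptions.

    candidates maps a node id to that node's advertised subscription set.
    Ties are broken by node id so the result is deterministic.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: (-jaccard(my_subs, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:max_neighbors]]
```

A node with few resources would call `select_neighbors` with a small `max_neighbors`; a better-provisioned node could afford a larger value.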
Challenges – Fetching interval determination – • Adaptive Fetching • Problem: Few hints about the publishing rate or entry lifetime • Frequent polling: overloads servers, consumes clients’ network bandwidth • Lazy polling: increases delay or misses entries • Adaptive Algorithm • Intuition: Frequent fetching yields few new entries • Freshness rate: fraction of new entries in the fetched document • If freshness rate < target freshness: halve the fetching rate • If freshness rate > target freshness: double the fetching rate • [Figure: entries in a feed (HANI): Report 1, Report 2, Report 3, …; fetch]
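The adaptive algorithm above can be sketched as follows. The halving and doubling rule and the definition of freshness rate come from the slide; the function names, the use of an interval in seconds (the inverse of the fetching rate), and the `min_interval`/`max_interval` bounds are assumptions added for illustration.

```python
def freshness_rate(fetched_ids, seen_ids):
    """Fraction of entries in the fetched document that were not seen before."""
    fetched = list(fetched_ids)
    if not fetched:
        return 0.0
    new = [entry for entry in fetched if entry not in seen_ids]
    return len(new) / len(fetched)


def next_interval(interval, freshness, target,
                  min_interval=60.0, max_interval=3600.0):
    """Adapt the fetching interval (seconds) to the observed freshness rate.

    Halving the fetching rate means doubling the interval, and vice versa.
    The min/max bounds are hypothetical; the slides do not mention them.
    """
    if freshness < target:
        interval *= 2   # too few new entries: halve the fetching rate
    elif freshness > target:
        interval /= 2   # many new entries: double the fetching rate
    return max(min_interval, min(max_interval, interval))
```

For example, a fetch that returns four entries of which one is new has a freshness rate of 0.25; with a target of 0.5, a 600-second interval becomes 1200 seconds.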
Challenges – Data dissemination – • Goal: Minimize bandwidth consumption • Limit the boundary of delivery • Forward only to matching neighbors (subscription set, hop_count) to reduce forwarding overhead • Reduce the unit of delivery • Unit of delivery: entry bundle • A set of new entries (old entries filtered out) to reduce redundant content delivery • Check before forwarding • Exchange the ID of an entry bundle (ID: SHA-1 digest of the bundle) • If the bundle is undelivered, deliver it • [Figure: max subset hops = 1; feed HANI; fetch]
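The check-before-forwarding step can be sketched as below. The procedure names `check_did` and `put_entries` appear on the communication-cost slide, but their signatures here, the `Peer` class, and the way a bundle is serialized before hashing are assumptions made for illustration.

```python
import hashlib


def bundle_id(entries):
    """SHA-1 digest of an entry bundle.

    The length-prefixed UTF-8 serialization is an assumption; the slides only
    say that the ID is a SHA-1 digest of the bundle.
    """
    digest = hashlib.sha1()
    for entry in entries:
        data = entry.encode("utf-8")
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()


class Peer:
    """Receiving side: remembers which bundle IDs were already delivered."""

    def __init__(self):
        self.delivered = set()
        self.entries = []

    def check_did(self, did):
        """Return True if the bundle with this ID was already delivered."""
        return did in self.delivered

    def put_entries(self, did, entries):
        """Accept a bundle and record its ID."""
        self.delivered.add(did)
        self.entries.extend(entries)


def forward(bundle, neighbor):
    """Send the bundle only if the neighbor has not received it yet."""
    did = bundle_id(bundle)
    if neighbor.check_did(did):
        return False  # already delivered: only the small ID was exchanged
    neighbor.put_entries(did, bundle)
    return True
```

The point of the design is that the cheap ID check precedes the larger payload, so a bundle that reaches a node over several gossip paths is transferred only once.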
Challenges – Free riding prevention – • Nodes may manifest selfish behavior • Only receive, without forwarding • Lie about their subscription set to become a preferred neighbor • Solution: Provide a neighbor evaluation method • Contribution metric • Credits nodes that forward feeds that I subscribe to, or that my near neighbors subscribe to • Level of contribution: direct subscription, 1-hop subscription, 2-hop subscription, … • cm_{i,j} += w_f − h_f • Cut out unhelpful neighbors: I helped it, but it doesn’t help me • d_{i,j} = cm_{i,j} − cm_{j,i} • Feature • Uses local information only • Easy to implement and enforce the mechanism
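A rough sketch of the neighbor evaluation follows. It rests on one reading of the slide's formulas: cm for the pair (i, j) is taken as the contribution of node i to node j, each forwarded entry adds a weight w minus a hop distance h, and d is the difference between what I gave and what I received. That reading, the class and function names, and the cut-off `threshold` are all assumptions.

```python
class ContributionLedger:
    """Tracks pairwise contributions using local observations only."""

    def __init__(self):
        self.cm = {}  # (i, j) -> accumulated contribution of node i to node j

    def credit(self, i, j, w, h):
        """Credit i for forwarding an entry to j.

        w is the weight of the feed and h the hop distance of the nearest
        subscriber (0 for a direct subscription); both are assumed meanings.
        """
        self.cm[(i, j)] = self.cm.get((i, j), 0.0) + (w - h)

    def deficit(self, i, j):
        """How much more i contributed to j than j contributed to i."""
        return self.cm.get((i, j), 0.0) - self.cm.get((j, i), 0.0)


def unhelpful_neighbors(ledger, me, neighbors, threshold):
    """Neighbors I helped much more than they helped me (hypothetical cut-off)."""
    return [n for n in neighbors if ledger.deficit(me, n) > threshold]
```

Because every value in the ledger comes from traffic the node itself sent or received, no global reputation service is needed, which matches the slide's "uses local information only".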
Challenges – Entry searching – • Overlay as a distributed storage • Iterative searching • Strong points: searching latency, query traffic • Recursive searching (flooding) • Strong points: low overhead for the requester, caching for popular queries, can be reflected in neighbor evaluation ?
Benefits of FeedEx • Scalability • Archivability • Storage of entries • Controllability • Compared to web-based readers, e.g. the fetch interval • Filtering and recommendation • Share opinions on entries (e.g. voting) • Feed recommendation • Privacy • Users can fetch documents for others • Anonymizes actual users
Architecture of FeedEx • Prototype: Python • Networking: Twisted • Protocol: XML-RPC • Interoperability, fast prototyping • Entry storage: SQLite (lightweight RDB) • RSS parser: feedparser.org
Experimental Setup • Two modes • Stand-alone mode (SLN) • FeedEx mode (XCH) • Metrics • Time lag • Missing entries • Communication cost • Experiments • Used 189 PlanetLab nodes • Ran for 22 hours on a weekday • Primary factor: 6 fetching intervals • Each node subscribes to 20 out of 70 feeds
Results: Time Lag • Average time lag • Average of node averages • Without applying the adaptive fetching algorithm • Regardless of the fetching interval, contents are delivered quickly • [Figure annotation: 15.8 times]
Results: Missing Entries • Rate of missing entries • # of entries in a node / # of entries in a reference node • Low missing rate • Despite problems in the network (DNS errors or routing errors) • Sometimes better than the reference node
Results: Communication Cost • Two most frequently called procedures: check_did, put_entries • check_did call: single IP packet • put_entries: 2 calls / minute, delivering 2.67 entries / call • Low communication cost
Critique • Strong points • Made a new problem from the old domain of “web caching” • Free from delay / failure of nodes • Draws out possible benefits/extensions • Simple! • Practically deployable • Tried to find a mechanism that is good for both servers and clients
Critique • Weak points • Overload due to RSS feed delivery? • Only a small text file is delivered • Should have considered podcasting (multimedia RSS) • Will the clients donate their resources? • Is “short delay” a strong incentive? • Is “low bandwidth consumption” a strong incentive? • Will people's subscription sets really overlap a lot? • Not effective for SPs providing diverse RSS feeds • e.g. Naver blog, egloos.. • Is it really robust to frequent leaves and joins? • Lack of server-side evaluation • Server load & network resources • Delivering critical data (e.g. timely news) using RSS?
Entry Lifetime • Generally CNN, • Publishers have policies (probably)
New idea • Topic based feed pub/sub system • Why should we register the address of a feed? • Need to find addresses providing contents I want • A feed may contain contents that I don’t want • [Figure labels: topic of interest (maybe tags?); topic based feed pub/sub (P2P based); contents related to the topic; feeds; web content providers]
New idea • Topic based feeding services have already been launched • Baebo • Creates new feeds by keywords from the Amazon, Yahoo, and eBay feeds • Say4 • Extracts entries containing sentences from the Bible from the BBC feed • But a centralized server runs the service • Limitation in the number of input feeds • Hard to add input feeds dynamically compared to a P2P approach