A Short Walk in the Blogistan David de-Vilder Jonny Foss Kevin Gundooa
Introduction • Blogistan is a word that is used to describe the collection of blogs, or blogspace. • The paper’s contribution: • How emerging interests and patterns can be extracted from tracking a collection of blogs over time • The size and nature of the blogistan based on a recent collection of blogs • Inferences and observations on identifying blogs, spamming problems in blogs and how blogs are accessed
Defining a Blog • Basically large queues, with additions appearing at the top of the page and older material moving down the queue. • Adds many-to-many interaction to the internet: rather than one journalist writing for many readers, many writers now talk to many readers. • Fastest growing part of the WWW in the past 2 years: every day 100,000 new blogs are created, with 1.3 million posts every day [BBC News, 8th November 2006]
Blogs vs. Webpages • Blogs are often a single page site • Blogs are generally authored by a single person • Navigation through blogs is typically easier – cross-links and backtracking • Active blogs generally updated more frequently than traditional webpages • “Blogroll” structure enables easy reading of newly added information (HTTP Range request)
Data Gathering/Filtering • The authors used their own URL-examination techniques to produce a seed collection of 10,000 blogs of varying popularity • Each blog was visited 5 times a day for over a month (August – September 2003). • Meta-information (via HEAD) and body (via GET) were retrieved for each blog URL. • Duplicate (and inactive/automatic) blogs were removed from the set of 10,000. • This produced a set of 8679 working blogs
Analysis - Emergence Patterns • URLs that are first referenced by a blog after the study’s measurement start time are termed new URLs; all other URLs are referred to as old URLs. • Interesting URLs are identified by the number of references made to them, known as their multiplicity • Removing duplicates has a significant impact on the “interest ranking” of the top new URLs • Complications with blog-hosting sites and duplicate removal
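The new/old split and the multiplicity count on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name, the (timestamp, blog, URL) tuple layout, and the choice to count each referencing blog only once are all assumptions.

```python
from collections import defaultdict


def multiplicity_of_new_urls(observations, start_time):
    """Return {url: multiplicity} for URLs first seen after start_time.

    observations: iterable of (timestamp, blog_url, referenced_url).
    Assumption: multiplicity counts distinct referencing blogs, so a link
    repeated on the same blog is counted once.
    """
    first_seen = {}
    referrers = defaultdict(set)
    for timestamp, blog, url in observations:
        # Track the earliest time each URL was referenced by any blog.
        if url not in first_seen or timestamp < first_seen[url]:
            first_seen[url] = timestamp
        referrers[url].add(blog)
    # A URL is "new" only if its first reference came after the start time.
    return {
        url: len(blogs)
        for url, blogs in referrers.items()
        if first_seen[url] > start_time
    }
```

Sorting the returned dictionary by value in descending order would give an "interest ranking" of new URLs in the sense used on the slide.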
Distribution of Multiplicity of all URLs and of all new URLs • Traditional static hyperlink analysis is unsuitable for analysing blogs • The number of references for new URLs is considerably lower than for old URLs at the time they are useful to mine
Lifespan Distributions • [Figure: distribution of lifespans of new and old URLs] • New URL references are much more short-lived than old URL references • 25% of new URL references and 80% of old URL references lasted longer than 20 days
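One way to compute figures like those above from repeated crawls is sketched below. It assumes, hypothetically, that a reference's lifespan is the gap between the first and last crawl day on which a given blog linked to a given URL; the slides do not spell out the exact definition, and the function names are illustrative.

```python
def reference_lifespans(observations):
    """Map each (blog, url) reference to its lifespan in days.

    observations: iterable of (day, blog_url, referenced_url), one entry
    per crawl in which the reference was present on the blog.
    """
    first, last = {}, {}
    for day, blog, url in observations:
        key = (blog, url)
        first[key] = min(day, first.get(key, day))
        last[key] = max(day, last.get(key, day))
    return {key: last[key] - first[key] for key in first}


def fraction_longer_than(lifespans, days):
    """Fraction of references whose lifespan exceeds `days`."""
    if not lifespans:
        return 0.0
    longer = sum(1 for span in lifespans.values() if span > days)
    return longer / len(lifespans)
```

Running `fraction_longer_than` separately on the new-URL and old-URL references, with `days=20`, would yield the two percentages being compared.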
The Blogistan • Estimated size: 1-4 million • More than a third of blog space not actively changed • Each blog page was given a unique key to reduce duplicates • Number of distinct blog URLs reduced by 40%
Blog Domain Attributes • Around 180,000 unique domains and 14,000 second-level domains • Large gap between first and second level domains partly explained by blog-hosting sites • DNS lookup of domain names produced around 12,000 unique IP addresses • Surprisingly few IP addresses in this large area
HTTP Protocol Issues • Only the newest entries need to be downloaded; these are always at the top of the page because of the ‘blogroll’ structure. • Can check whether a page has been modified using the Last-Modified header of the HTTP HEAD response. • Partial download can be achieved using the HTTP Range request header • Unfortunately not all web servers support Range requests (only 40% in the test)
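The HEAD check and partial download described on this slide might look like the sketch below, using only Python's standard library. The helper names are hypothetical and the requests are only constructed, not sent. A server that honours a Range request answers with status 206 (Partial Content); one that ignores it typically returns the full page with status 200.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request


def head_request(url: str) -> Request:
    """Build a HEAD request, which fetches headers without the body."""
    return Request(url, method="HEAD")


def range_request(url: str, num_bytes: int) -> Request:
    """Build a GET request asking only for the first num_bytes bytes."""
    return Request(url, headers={"Range": f"bytes=0-{num_bytes - 1}"})


def modified_since(last_modified: Optional[str], previous_visit: datetime) -> bool:
    """Compare a Last-Modified header value with the previous visit time.

    Assumption: a missing header is treated as "possibly modified".
    previous_visit should be timezone-aware (e.g. UTC).
    """
    if last_modified is None:
        return True
    return parsedate_to_datetime(last_modified) > previous_visit


def range_was_honoured(status_code: int) -> bool:
    """206 Partial Content means the server served only the requested range."""
    return status_code == 206
```

Either request object can be passed to `urllib.request.urlopen` to perform the actual fetch; the response's `status` and `headers` then feed the two checking helpers.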
Inferences from Analysis • These are only preliminary observations based on the study and do not necessarily hold true. • Identifying a Blog • Three sets of URLs used to test the hypotheses • Popular websites have more references than blogs • Blogs have more unique references than less popular websites • Blogs have more self references than webpages
Inferences from Analysis • Anti-Spam Measures • Replicating pages and inserting links to attract undue attention • Blogs inadvertently provide free space for spammers (Referrer field) • Many blogs allow others to place comments, another target for spamming • Spammers can place any links of their choice - boosts ranking on search engines • Potential Solution: Automated distinguishing of spam & genuine references • Server Logs and Popular Blogs • Web logs of two very popular, anonymous blogs • Most popular request was to top-level blog URL • A third of all external references were from a single site – news aggregator site • Search engine crawler tested on sites - no distinction between blogs & non-blogs • Only two blogs tested, extensive testing needed to obtain meaningful results
Applications & Further Work • The methodology used to identify emerging interests can be generalised into an approach for mining evolving interconnection networks. • This can have applications beyond the Blogistan. • By cancelling out repeated patterns, it is possible to identify new ones. • An example application in a different realm is the study of ISP-level netflow data over time. • This could enable identification of bot attacks on hosts, detection of new worms and prediction of flash crowds.
Referenced Papers Rate of Change and Other Metrics: a Live Study of the WWW by F. Douglis et al. • Investigated utility of Web Server Caches • Mainly focused on large companies' websites since there were not many large community websites in 1997 • When pages change frequently, there's little point in a web server cache...
Rate of Change... • Dynamic pages • True user interaction, e.g. Amazon, eBay etc. • Semi-Dynamic pages • Frequently updated pages such as blogs • Static pages • Simple HTML pages – rarely updated
Rate of Change varies with... • Content-Type • e.g. Is it HTML, .doc, .jpg • Top Level domain • e.g. http://www.warwick.ac.uk will probably change more than: http://www2.warwick.ac.uk/insite/newsandevents/notices/xmascard/
Web Server Caches • Almost useless for true dynamic pages • Very useful for static pages • Limited use for blogs • The Rate of Change... paper concludes that web server developers should consider whether individual pages should be cached
Referenced Papers A Large-Scale Study of the Evolution of Web Pages by Fetterly et al. (2003) • Monitored 150,836,209 pages over 11 weeks. • Took an MD5 checksum of page contents, together with feature vectors, to monitor whether or not a page had changed recently
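Checksum-based change detection of the kind mentioned above can be illustrated with a short sketch. It covers only the MD5 part (the feature vectors are left out), and the function names and dictionary layout are assumptions rather than anything taken from the paper.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """MD5 hex digest of a page body; identical bodies give identical digests."""
    return hashlib.md5(content).hexdigest()


def changed_pages(previous_digests, current_bodies):
    """Return the sorted URLs whose content differs from the previous crawl.

    previous_digests: {url: digest} stored from the last crawl.
    current_bodies:   {url: bytes} fetched in this crawl.
    Pages not seen before are reported as changed.
    """
    return sorted(
        url
        for url, body in current_bodies.items()
        if previous_digests.get(url) != page_digest(body)
    )
```

A checksum only says whether any byte changed; it cannot say how much changed, which is presumably why it was paired with feature vectors.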
Referenced Papers A Large-Scale Study of the Evolution of Web Pages by Fetterly et al. (2003) Concluded: • Larger pages change more often than smaller pages! • Something which the Rate of Change... paper explicitly said didn't happen! (6 years ago)
Referenced Papers On the Bursty Evolution of Blogspace by R. Kumar et al. (Proc. WWW 2003) • Studied 750,000 links between 25,000 blogs • Introduced new tools which: • Created time graphs based on when the links between blogs appear, and how blogs interact with each other.
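A time graph in the loose sense used on this slide can be approximated as a list of links that each carry the time at which they appeared. The sketch below is an illustrative assumption about the data structure, not the tools from the paper; the function names and the (source, target, time) layout are hypothetical.

```python
def snapshot(timed_links, t):
    """Set of (source, target) links that had appeared by time t.

    timed_links: iterable of (source_blog, target_blog, time_appeared).
    """
    return {(src, dst) for src, dst, when in timed_links if when <= t}


def links_appearing(timed_links, t0, t1):
    """Links that first appear in the half-open interval (t0, t1]."""
    return snapshot(timed_links, t1) - snapshot(timed_links, t0)
```

Comparing the size of `links_appearing` across successive intervals gives a simple view of how bursty link creation is over time.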
Recent Developments Characterising the Splogosphere (WWE 06) • Splogs (spam blogs) are now inundating blog search engines • Their purpose is to host ads or raise the PageRank of target sites • A system that detects splogs with up to 90% accuracy is presented • Aim is to facilitate development of effective techniques to weed out splogs from the blogosphere
Recent Developments Latent Weblog Communities (WWE 06) • Latent Weblog Community (LBC) concept proposed by Kazunari Ishida (Tokyo University) • "Weak Pair" algorithm to find connected clusters of blog posts • Some success in identifying "whimsical links" and multiple blogs by the same author • Aim is to organise and categorise these groups into a catalog (as a search engine alternative)
Recent Developments Categorising Key Bloggers (WWE 05) • 3 key blogger types identified by Shinsuke Nakajima (NAIST) • Topic-finders, Agitators and Summarisers • Aim is to use influential bloggers to complement mainstream websites and television. • 500k blogs with 10m entries being tracked
Recent Developments • 37.3M blogs tracked by Technorati • Blogosphere is multilingual and deeply international • English had fallen to less than a third of all blog posts by April 2006 • Japanese and Chinese language blogging has grown significantly • [Figure: Blog Language Spread (Technorati Analysis, Apr 06)]
Thank you Any Questions?