Exploring in the Weblog Space by Detecting Informative and Affective Articles. Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu (Shanghai Jiao-Tong University); Qiang Yang (Hong Kong University of Science and Technology). WWW2007.
Introduction
• Unique characteristics of blogs
  • Mainly maintained by individuals, so the content is generally personal
  • The link structures between blogs generally form localized communities
• Ongoing research on blogs
  • Content-based analysis
  • Evolution of blog communities
  • Tools that help users retrieve, organize and analyze blogs
Introduction – Genres in Blog Content
• Affective
  • Online diaries in which people publicly share their daily lives and express their feelings, thoughts or emotions
• Informative
  • Topic-oriented articles; the topic may relate to a hobby or to the author's profession or business
Introduction – the Problem and the Approach
• The problem
  • Separating informative articles from affective articles in blogs
• The approach
  • Treating the problem as binary classification
• Challenges
  • The definitions of informative and affective articles
  • The training corpus for both categories
  • The machine learning algorithm
Introduction – Studies in the Weblog Space
• Emotion and topic classification of blog articles
  • Filtering out informative articles improves the effectiveness of emotion classification
• Blog search
  • An intent-driven blog search engine is proposed that re-ranks search results according to their informativeness scores
• Automatic detection of high-quality blogs
  • The quality of a blog is measured by its percentage of informative articles
Definition of Informative and Affective Articles
• A survey was conducted among users who regularly take part in blog activities
• Contents of informative articles include:
  • News similar to that on traditional news websites
  • Technical descriptions, e.g. programming techniques
  • Commonsense knowledge
  • Objective comments on world events
• Contents of affective articles include:
  • Diaries about personal affairs
  • Descriptions of personal feelings or emotions
Algorithms
• Classification algorithms
  • Naïve Bayes Classifier (NB)
  • Support Vector Machine (SVM)
  • Rocchio Classifier
• Feature selection algorithms
  • Information Gain (IG)
  • χ² statistic (CHI)
Classification Algorithm – Naïve Bayes Classifier • Laplace smoothing is applied to overcome the zero-frequency problem
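The formula on this slide did not survive extraction, so the following is only a minimal sketch of a multinomial Naïve Bayes classifier with Laplace (add-one) smoothing. The function names, the `alpha` parameter and the token-list document representation are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit log-priors and Laplace-smoothed log-likelihoods (hypothetical helper)."""
    vocab = sorted({term for doc in docs for term in doc})
    log_prior, log_likelihood = {}, {}
    for category in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == category]
        log_prior[category] = math.log(len(class_docs) / len(docs))
        counts = Counter(term for doc in class_docs for term in doc)
        # Adding alpha to every count means a term never seen in this
        # category still gets a non-zero probability (zero-frequency problem).
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[category] = {
            term: math.log((counts[term] + alpha) / denom) for term in vocab
        }
    return log_prior, log_likelihood

def predict_naive_bayes(model, doc):
    """Return the category with the highest posterior log-score."""
    log_prior, log_likelihood = model

    def score(category):
        # Terms outside the training vocabulary are ignored.
        known = (t for t in doc if t in log_likelihood[category])
        return log_prior[category] + sum(log_likelihood[category][t] for t in known)

    return max(log_prior, key=score)
```

Without the smoothing term, a single word that never occurred in a category would drive that category's probability to zero no matter how well the rest of the article matched.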
Classification Algorithm – Rocchio Classifier
• Category-profile-based classifier: each category c_j is represented by a profile vector computed from its documents, where |c_j| is the number of documents in category c_j and each document is a vector of terms weighted by TF-IDF
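Because the profile formula is missing from the slide, the sketch below uses the plain centroid variant of Rocchio: each profile is the average of the length-normalized TF-IDF vectors in the category, and a new article is assigned to the profile with the highest cosine similarity. The centroid form, the smoothed IDF and all function names are assumptions; the original formula may include further terms.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Return unit-length TF-IDF vectors (as dicts) and the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}  # +1 is an assumed smoothing
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors, idf

def category_profiles(vectors, labels):
    """Average the document vectors of each category: sum divided by |c_j|."""
    sizes = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, category in zip(vectors, labels):
        for term, weight in vec.items():
            sums[category][term] += weight
    return {c: {t: w / sizes[c] for t, w in s.items()} for c, s in sums.items()}

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(doc, idf, profiles):
    """Assign the document to the most similar category profile."""
    tf = Counter(term for term in doc if term in idf)
    vec = {term: tf[term] * idf[term] for term in tf}
    return max(profiles, key=lambda c: cosine(vec, profiles[c]))
```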
Feature Selection Algorithms
• Information Gain (IG)
• χ² statistic (CHI)
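The scoring formulas for both measures are missing from the slide. As a rough illustration, the sketch below computes them in their common textbook form from term presence/absence counts; the function names and the document-as-token-list representation are assumptions, and the authors' exact variants may differ.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

def information_gain(docs, labels, term):
    """Reduction in category entropy from knowing whether `term` occurs."""
    with_term = Counter(y for doc, y in zip(docs, labels) if term in doc)
    without_term = Counter(y for doc, y in zip(docs, labels) if term not in doc)
    n = len(docs)
    conditional = (
        sum(with_term.values()) / n * entropy(with_term.values())
        + sum(without_term.values()) / n * entropy(without_term.values())
    )
    return entropy(Counter(labels).values()) - conditional

def chi_square(docs, labels, term, category):
    """Chi-square statistic of the 2x2 term/category contingency table."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        present = term in doc
        if present and y == category:
            a += 1      # term present, in category
        elif present:
            b += 1      # term present, other category
        elif y == category:
            c += 1      # term absent, in category
        else:
            d += 1      # term absent, other category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Terms would then be ranked by either score and only the top-scoring ones kept as features.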
Experiment Data
• 5,000 articles crawled from MSN Space
  • 3,547 were labeled as affective and 1,109 as informative; the rest were filtered out because of encoding problems
• 2,200 articles from the Sohu.com Directory, used as informative articles
  • News, commonsense knowledge or objective comments on 22 different topics
Table 1. Statistics of Data Set
Experiment – Comparing Classification Algorithms
Table 2. Performance of the three classification algorithms
Comparing Feature Selection Algorithms
Table 3. Performance on different feature sets
Representative Features
Table 4. Top 20 representative features of each category
Study on Emotion and Topic Classification
• Assumption: informative articles do not express personal emotions
• Extracting affective articles helps build a corpus of purely emotional articles
Figure 1. Two-step approach for topic and emotion classification
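Figure 1 itself is not available here, so the following is only one plausible reading of the two-step approach: first separate affective from informative articles, then send each group to its own classifier. Routing informative articles to a topic classifier is an assumption based on the figure caption, and the three classifier arguments are hypothetical callables supplied by the caller.

```python
def two_step_classify(articles, is_affective, classify_emotion, classify_topic):
    """Route each article through the genre filter, then a genre-specific classifier.

    is_affective, classify_emotion and classify_topic are hypothetical
    stand-ins for trained models; they are not defined in the slides.
    """
    results = []
    for article in articles:
        if is_affective(article):
            # Step 2a: only affective articles reach the emotion classifier.
            results.append((article, "emotion", classify_emotion(article)))
        else:
            # Step 2b (assumed): informative articles go to topic classification.
            results.append((article, "topic", classify_topic(article)))
    return results
```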
Experiment on Emotion Classification
• Data
  • Training: 2,494 blog articles manually labeled with one of two emotion tendencies, positive or negative
  • Testing: 1,303 articles from 75 blogs in MSN Space
Table 5. Data set used for emotion classification
Experiment Result on Emotion Classification
• I-Approach: the information-affectiveness classification is applied before the binary emotion classifier
• II-Approach: the binary emotion classifier is applied directly, without that step
Table 6. Comparison results for two emotion classification approaches
Study on Intent-driven Weblog Search Engine
• Blog search is currently at the same state as general Web search
• Intent-driven search (re-ranking): $S_{mixed} = \lambda \cdot S_{if} + (1 - |\lambda|) \cdot S_{origin}$, where $S_{if}$ is a confidence value between -1 (strong affective intent) and 1 (strong informative intent), and $S_{origin}$ is the original relevance score
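The re-ranking formula can be sketched as follows. The slide does not state the range of λ; the sketch assumes λ lies in [-1, 1], which the |λ| term suggests, with negative values favoring affective results and positive values favoring informative ones. The function names and the tuple layout of a search result are illustrative only.

```python
def mixed_score(s_if, s_origin, lam):
    """S_mixed = lam * S_if + (1 - |lam|) * S_origin, with lam assumed in [-1, 1]."""
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [-1, 1]")
    return lam * s_if + (1.0 - abs(lam)) * s_origin

def rerank(results, lam):
    """Sort (doc_id, s_origin, s_if) tuples by descending mixed score."""
    return sorted(
        results,
        key=lambda r: mixed_score(s_if=r[2], s_origin=r[1], lam=lam),
        reverse=True,
    )
```

With λ = 0 the original ranking is preserved; as |λ| grows, the original relevance score counts for less and the informative/affective confidence for more.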
Analysis for the Distribution of Two Genres of Articles
Figure 2. Distribution of informative articles and affective articles on 99,059 blog articles
Detecting High-quality Blogs
Figure 3. Distribution of blogs with different levels of quality on 6,319 blogs
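The quality measure described earlier in the talk, the percentage of informative articles in a blog, can be sketched in a few lines. The label strings and the function name are assumptions, and the thresholds behind the quality levels in Figure 3 are not given in the slides, so none are used here.

```python
def blog_quality(article_labels):
    """Fraction of a blog's articles that were classified as informative."""
    if not article_labels:
        return 0.0  # assumed convention for a blog with no articles
    informative = sum(1 for label in article_labels if label == "informative")
    return informative / len(article_labels)
```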
Conclusion and Future Work
• The task of separating informative and affective articles is addressed and treated as a binary classification task.
• Applications of this information-affectiveness classification are studied, including emotion classification, intent-driven blog search and high-quality blog detection.
• Future work: 1) building a much larger data set using semi-supervised learning techniques; 2) applying the existing approach to data in other languages