Trustworthiness assessment (on web pages) Task 3.3 Planet Data - Madrid
Introduction • The number of available data sources keeps increasing at a fast pace • Sensors embedded in mobile phones, websites, blogs, … • Data becomes more valuable when combined from different sources • What about the trustworthiness of this aggregated data? • Unknown data sources • No standard way to evaluate trustworthiness • Subjectivity of the data consumer • Strong economic incentives to lie • An interesting case: the WWW • Web credibility assessment
What is the problem with web credibility? • Non-credible websites make up a significant share of the web • Credibility seen as an aggregation of objective and subjective components (Fogg) • Credibility = trustworthiness AND expertise • Web users can be naïve or lazy and won’t try to verify information • Focus on domains where expertise is hard for average users to evaluate • Medical treatments • Trading operations • Ideological assertions • Economic and political interests are at stake
Background • Trustworthiness components in the context of web credibility: • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results. • Accuracy: referential importance • Authority: social reputation • Objectivity: content typicality • Currency: update frequency • Coverage: coverage of topic • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research. • Credentials • Advertisements • Design
Credibility assessment as a classification problem • Use historical evaluations to assess the credibility of new pages • A machine learning approach • Binary classification • Users evaluate pages as credible or non-credible • Content-based features • Extracted programmatically from web pages • Training set and test set • Leave-one-out cross-validation • Tested by category
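The leave-one-out protocol named on this slide can be sketched as follows. This is a minimal illustration, not the project's implementation: the page dictionaries, the `majority_class` placeholder classifier, and the toy labels are all hypothetical, and any classifier with the same call signature could be plugged in.

```python
from typing import Callable, List, Sequence

Label = int  # +1 = credible, -1 = non-credible


def leave_one_out(
    pages: Sequence[dict],
    labels: Sequence[Label],
    predict: Callable[[Sequence[dict], Sequence[Label], dict], Label],
) -> List[Label]:
    """Predict each page's label using only the other pages as training data."""
    predictions = []
    for i, held_out in enumerate(pages):
        train_pages = [p for j, p in enumerate(pages) if j != i]
        train_labels = [y for j, y in enumerate(labels) if j != i]
        predictions.append(predict(train_pages, train_labels, held_out))
    return predictions


def majority_class(train_pages, train_labels, page) -> Label:
    """Hypothetical placeholder classifier: predicts the majority training label."""
    return 1 if sum(train_labels) >= 0 else -1


# Toy data (made up): five pages from one category, three rated credible.
pages = [{"category": "health"} for _ in range(5)]
labels = [1, 1, 1, -1, -1]
loo_predictions = leave_one_out(pages, labels, majority_class)
```

To test by category, as the slide suggests, the same function could be called once per category on the pages belonging to it.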
Feature selection • Categories • Act as a filter: only pages from the same category are tested for similarity • Keywords and entities in the document • Reflect the topic of the web page at a finer grain • Sentiment analysis • Computed at the word level • Used in conjunction with keywords & entities • Part of speech • Extra feature reflecting the overall structure of the web page • Number of ads displayed (in progress) • Ads distract users from their activity, and the page loses credibility • Complexity of the CSS files (not included yet) • Pages with no structure tend to lose credibility • PageRank • Google’s metric, which includes a credibility measure
Experimental setup • Two machine learning algorithms • kNN item-item algorithm • Compute a similarity between pages • Take only the most similar pages into account • C4.5 decision tree • Has good performance in general • However, not suitable for multivalued features (keywords, entities) • Defined as a baseline • Microsoft corpus • 1000 pages evaluated for credibility by experts and regular users • Divided into 5 topics • Top 40 pages retrieved by search engines for 5 queries • Rescaled from Likert scale [0;5] to binary scale {-1;1}
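The rescaling from the Likert scale to the binary scale could look like the sketch below. The slides do not say where the cut point lies, so the `threshold` value of 3.0 is purely an assumption made for illustration, as is the function name.

```python
def likert_to_binary(rating: float, threshold: float = 3.0) -> int:
    """Map a Likert rating in [0, 5] to +1 (credible) or -1 (non-credible).

    The threshold is an assumed cut point; the source does not specify one.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must lie in [0, 5]")
    return 1 if rating >= threshold else -1


# Toy ratings (made up) covering both sides of the assumed threshold.
binary_labels = [likert_to_binary(r) for r in [0, 2, 3, 5]]
```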
Content-based rating • kNN item-item algorithm • Based on similarity between pages rated by the user • Aggregated similarities • Based on the similarity of page features • Cosine similarity for single-valued features (POS, PageRank, …) • Jaccard similarity for multivalued features (keywords, entities) • Only positive similarities are taken into account • Predicted rating of user u for page i, where s_{i,j} is the similarity between pages i and j:
\[ r_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} \left( s_{i,j} \times r_{u,j} \right)}{\sum_{j \in \mathrm{similarItems}} s_{i,j}} \]
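The item-item prediction on this slide, a similarity-weighted average of the user's ratings over the most similar pages, can be sketched as below. Several details are assumptions rather than the authors' choices: the page layout (`numeric` and `keywords` fields), the unweighted mean used to aggregate the cosine and Jaccard similarities, the value of `k`, and the fallback of 0.0 when no neighbour has positive similarity.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity for single-valued (numeric) features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity for multivalued features (keywords, entities)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def page_similarity(p: dict, q: dict) -> float:
    """Aggregate similarity; the unweighted mean is an assumption."""
    return 0.5 * cosine_similarity(p["numeric"], q["numeric"]) + \
        0.5 * jaccard_similarity(p["keywords"], q["keywords"])


def predict_rating(target: dict, rated: List[Tuple[dict, int]], k: int = 2) -> float:
    """Similarity-weighted average of the user's ratings on the k nearest pages."""
    scored = [(page_similarity(target, page), rating) for page, rating in rated]
    # Keep only positive similarities, then the k most similar pages.
    neighbours = sorted((s for s in scored if s[0] > 0),
                        key=lambda s: s[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0  # assumed fallback: no usable neighbour
    return sum(sim * rating for sim, rating in neighbours) / total


# Toy pages (made up): two similar pages and one unrelated page.
target = {"numeric": [1.0, 0.0], "keywords": {"vaccine", "dose"}}
rated = [
    ({"numeric": [1.0, 0.0], "keywords": {"vaccine", "dose"}}, 1),
    ({"numeric": [0.0, 1.0], "keywords": {"stocks"}}, -1),
    ({"numeric": [1.0, 0.0], "keywords": {"vaccine"}}, -1),
]
prediction = predict_rating(target, rated)
```

A positive prediction would be read as "credible" and a negative one as "non-credible" on the binary scale.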
Evaluation Preliminary results
Results • Mixed results • Precision ~ 0.7, recall ~ 0.8 • Credibility cannot yet be predicted accurately • Results are biased by the distribution of ratings across classes
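Precision and recall for the credible class can be computed as in the sketch below, which also shows how class imbalance can inflate both figures. The data is a made-up toy example, not the experiment's output: with 7 credible and 3 non-credible pages, a classifier that always answers "credible" already reaches a precision of 0.7.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy illustration (made up): imbalanced labels, trivial "always credible" classifier.
y_true = [1] * 7 + [-1] * 3
y_pred = [1] * 10
precision, recall = precision_recall(y_true, y_pred)
```

Reporting the same two measures for the non-credible class (`positive=-1`) makes this kind of bias visible.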
Results • Tests on keywords + entities + sentiment • Similar results (Precision ~ 0.7, Recall ~ 0.8)
Results • Tests on all features (POS + keywords + entities + sentiment) • Similar results (Precision ~ 0.7 and Recall ~ 0.8) • Mixed results among classes
Future work • Semantic distances • Pages seen as a set of concepts • Definition of a distance between two sets in the concept space • Similarity using a path distance in a concept hierarchy • Social referrals • Use evaluations from other people • Weights based on their trustworthiness • Estimate page credibility based on beta reputation • Combine reputation with classification approaches to obtain an aggregated metric • To get a better estimate of credibility than either component alone
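One way to read the beta-reputation idea is sketched below: positive and negative evaluations are accumulated as evidence, each weighted by the evaluator's trustworthiness, and the page score is the expected value of the resulting Beta distribution, (r + 1) / (r + s + 2). The weighting scheme and the input format are assumptions; the slides do not specify them.

```python
from typing import List, Tuple


def beta_reputation(evaluations: List[Tuple[int, float]]) -> float:
    """Expected credibility in [0, 1] from weighted +1 / -1 evaluations.

    Each evaluation is (vote, weight), where weight is the evaluator's
    assumed trustworthiness in [0, 1].
    """
    r = sum(w for vote, w in evaluations if vote > 0)  # positive evidence
    s = sum(w for vote, w in evaluations if vote < 0)  # negative evidence
    return (r + 1) / (r + s + 2)


# Toy evaluations (made up): two positive votes, one negative vote.
score = beta_reputation([(1, 1.0), (1, 0.5), (-1, 0.5)])
no_evidence = beta_reputation([])
```

With no evaluations the score is 0.5, a neutral prior, and it moves towards 0 or 1 as weighted evidence accumulates.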
Conclusion • Project based on content-based aspects • Results are promising, although there is room for improvement • Accuracy of the prediction • Time complexity of the implementation • Several features remain unimplemented • Local extraction of features • Integration of new page features • Semantic aspect of web pages