
Trustworthiness assessment (on web pages)

Presentation Transcript


  1. Trustworthiness assessment (on web pages) Task 3.3 Planet Data - Madrid

  2. Introduction • The number of available data sources keeps increasing at a fast pace • Sensors embedded in mobile phones, websites, blogs, … • Data becomes more valuable when combined from different sources • What about the trustworthiness of this aggregated data? • Unknown data sources • No standard way to evaluate trustworthiness • Subjectivity of the data consumer • Strong economic incentive to lie • Interesting case of the WWW • Web credibility assessment

  3. What is the problem of web credibility? • Non-credible websites represent a significant percentage of the web • Credibility seen as an aggregation of objective and subjective components (Fogg) • Credibility = trustworthiness AND expertise • Web users can be naïve or lazy and won’t try to verify information • Focus on domains where expertise is hard to evaluate for lay users • Medical treatments • Trading operations • Ideological assertions • Economic / political interests are at stake

  4. Background • Trustworthiness components in the context of web credibility: • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results. • Accuracy: referential importance • Authority: social reputation • Objectivity: content typicality • Currency: update frequency • Coverage: coverage of topic • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research. • Credentials • Advertisements • Design

  5. Credibility assessment as a classification problem • Use historical information on evaluations for future credibility assessment • A machine learning approach • Binary classification • Users evaluate pages as credible or non-credible • Content-based features • Extracted programmatically from web pages • Training set and test set • Leave-one-out cross-validation • Tested by category
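
Leave-one-out cross-validation, as named on the slide, can be sketched as follows. This is a minimal illustration and not the project's code: the one-dimensional feature values, the labels, and the `nearest_neighbour` stand-in classifier are all hypothetical. In the setup described here, such a loop would presumably be run separately for each category.

```python
def leave_one_out(samples, labels, train_and_predict):
    """Hold out each sample once, train on the rest, and collect predictions."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def nearest_neighbour(train_x, train_y, test_x):
    """Toy stand-in classifier: copy the label of the closest training value."""
    closest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - test_x))
    return train_y[closest]


# Hypothetical one-dimensional feature values and credibility labels.
pages = [0.0, 0.1, 1.0, 1.1]
labels = [-1, -1, 1, 1]
predicted = leave_one_out(pages, labels, nearest_neighbour)
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
```

Each page is predicted by a model that never saw it during training, which is why the technique suits small corpora where a fixed test split would waste labelled data.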

  6. Feature selection • Categories • Act as a filter: only pages from the same category are tested for similarity • Keywords and entities in the document • Reflect the topic of the web page at a finer grain • Sentiment analysis • Computed at the word level • Used in conjunction with keywords & entities • Part of speech • Extra feature reflecting the overall structure of the web page • Number of ads displayed (in progress) • Ads distract users from their activity and the page loses credibility • Complexity of the CSS files (not included yet) • Pages with no structure tend to lose credibility • PageRank • Google’s metric, which includes a credibility measure

  7. Experimental setup • Two machine learning algorithms • kNN item-item algorithm • Computes a similarity between pages • Takes only the most similar pages into account • C4.5 decision tree • Has good performance in general • However, not suitable for multivalued features (keywords, entities) • Defined as a baseline • Microsoft corpus • 1000 pages evaluated for credibility by experts and regular users • Divided into 5 topics • Top 40 pages retrieved by search engines for 5 queries • Ratings rescaled from Likert scale [0;5] to binary scale {-1;1}
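
The rescaling from the Likert scale to the binary scale could look like the sketch below. The slides do not state the cut-off, so the `threshold` parameter and its default of 2.5 (the midpoint of [0;5]) are assumptions, as is the function name.

```python
def likert_to_binary(rating, threshold=2.5):
    """Map a Likert rating in [0, 5] to -1 (non-credible) or 1 (credible).

    The threshold is an assumption; the source does not specify it.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must lie in [0, 5]")
    return 1 if rating > threshold else -1
```

For instance, `likert_to_binary(4)` gives 1 and `likert_to_binary(1)` gives -1. Where the cut-off sits directly affects the balance between the two classes.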

  8. Content-based rating • kNN item-item algorithm • Based on similarity between pages rated by the user • Predicted rating of page i for user u: $r_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} (s_{i,j} \times r_{u,j})}{\sum_{j \in \mathrm{similarItems}} s_{i,j}}$, where $s_{i,j}$ is the similarity between pages i and j and $r_{u,j}$ is the user's rating of page j • Aggregated similarities • Based on the similarity of page features • Cosine similarity for monovalued features (POS, PageRank, …) • Jaccard similarity for multivalued features (keywords, entities) • Only positive similarities are taken into account
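
A minimal sketch of this content-based rating, under stated assumptions: each page is a dict with a hypothetical `numeric` vector (monovalued features) and a `keywords` set (multivalued features), and the two similarities are aggregated with an unweighted mean. The slides say similarities are aggregated but not how, so that choice, the neighbourhood size `k`, and all names are illustrative. The prediction is the similarity-weighted average of the user's ratings over the most similar pages.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def page_similarity(p, q):
    """Aggregate per-feature similarities with an unweighted mean (an assumption)."""
    cos = cosine_similarity(p["numeric"], q["numeric"])
    jac = jaccard_similarity(p["keywords"], q["keywords"])
    return (cos + jac) / 2


def predict_rating(target, rated_pages, k=5):
    """Similarity-weighted average of the ratings of the k most similar pages.

    rated_pages is a list of (page, rating) pairs with rating in {-1, 1}.
    Pages with non-positive similarity are ignored.
    """
    scored = [(page_similarity(target, page), rating) for page, rating in rated_pages]
    positive = sorted((sr for sr in scored if sr[0] > 0),
                      key=lambda sr: sr[0], reverse=True)[:k]
    total = sum(s for s, _ in positive)
    if total == 0:
        return 0.0  # no similar page: no evidence either way
    return sum(s * r for s, r in positive) / total
```

The result lies in [-1, 1]; taking its sign gives the binary credible / non-credible decision.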

  9. Evaluation • Preliminary results

  10. Results • Mixed results • Precision ~ 0.7, recall ~ 0.8 • Impossible to predict credibility accurately • Biased by the distribution of ratings over classes
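
To illustrate why the class distribution can bias such figures, the sketch below computes precision and recall for the "credible" class and applies them to a trivial baseline that labels every page credible. The labels are toy data chosen for illustration, not the project's ratings.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 6 credible pages and 4 non-credible pages.
y_true = [1] * 6 + [-1] * 4
always_credible = [1] * len(y_true)
baseline_precision, baseline_recall = precision_recall(y_true, always_credible)
```

On an imbalanced set, this baseline reaches perfect recall and a precision equal to the share of credible pages without learning anything, so reported scores are best read alongside the class proportions and per-class results.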

  11. Results • Tests on keywords + entities + sentiment • Similar results (Precision ~ 0.7, Recall ~ 0.8)

  12. Results • Tests on all features (POS + keywords + entities + sentiment) • Similar results (Precision ~ 0.7, Recall ~ 0.8) • Mixed results among classes

  13. Future work • Semantic distances • Pages seen as sets of concepts • Definition of a distance between two sets in the concept space • Similarity using a path distance in a concept hierarchy • Social referrals • Use evaluations from other people • Weights based on their trustworthiness • Estimate page credibility based on beta reputation • Combine reputation with classification approaches into an aggregated metric • To get a better estimate of credibility than either component alone
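
One way the social-referral idea might be sketched is with the usual beta reputation score, the expected value of a Beta(r + 1, s + 1) distribution, where r and s are the amounts of positive and negative feedback. Weighting each evaluation by the evaluator's trustworthiness, and the function name itself, are assumptions made for illustration; the slides list this only as future work.

```python
def beta_reputation(evaluations):
    """Expected credibility from weighted positive and negative evaluations.

    evaluations is an iterable of (is_credible, weight) pairs, where weight
    reflects the evaluator's trustworthiness (a hypothetical input).
    """
    evaluations = list(evaluations)
    r = sum(w for credible, w in evaluations if credible)
    s = sum(w for credible, w in evaluations if not credible)
    return (r + 1) / (r + s + 2)
```

With no evaluations the score is 0.5, a neutral prior, and it moves towards 0 or 1 as weighted evidence accumulates. Such a score could then be combined with the classifier output into the aggregated metric the slide mentions.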

  14. Conclusion • Project based on content-based aspects • Results promising although room for improvement • Accuracy of the prediction • Time complexity of the implementation • Several features remain unimplemented • Local extraction of features • Integration of new page features • Semantic aspect of web pages
