Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features

Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. B. Thomas Adler, Luca de Alfaro, Andrew G. West; Raga Sowmya Tummalapenta. Introduction. Wikipedia. Benefits. Problems. Vandalism. Methods to detect vandalism. Bots. Statistics and machine learning.

  1. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features • B. Thomas Adler, Luca de Alfaro, Andrew G. West • Raga Sowmya Tummalapenta

  2. Introduction • Wikipedia • Benefits • Problems • Vandalism • Methods to detect Vandalism • Bots • Statistics and Machine Learning

  3. Architecture • PAN-WVC-10 corpus • A corpus for the evaluation of automatic vandalism detectors for Wikipedia. • Feature extraction followed by data-trained classification. • Features can be obtained from • The revision itself • The comparison of the revision against another revision (i.e., a diff) • Information derived from previous or subsequent revisions. • Evaluation: 10-fold cross-validation.
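
The 10-fold cross-validation named on this slide can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are assumptions, not details taken from the paper or the PAN-WVC-10 tooling.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Each revision lands in the test fold exactly once across the 10 rounds.
splits = list(k_fold_indices(25, k=10))
```

In each round a classifier would be trained on the revisions in `train_idx` and scored on those in `test_idx`, with the per-fold scores then averaged.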

  4. Vandalism Detection Problem • Immediate Vandalism • Occurring in the most recent revision of an article • Can use only the information available at the time a revision is committed. • Historical Vandalism • Occurring in any revision, including past ones • Can use any feature.

  5. Classes • Division of features into classes • Complexity • Difficulty of generalization • Classes • Metadata (M) • Text (T) • Reputation (R) • Language (L)

  6. Proposed Approach • Integration of three leading approaches to Wikipedia vandalism detection • Mola-Velasco system (NLP) • WikiTrust system (reputation) • STiki system (metadata) • Non-overlapping sets of features
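
Because the three systems contribute non-overlapping feature sets, combining them amounts to concatenating their per-revision features. The sketch below assumes each system's output is a plain dictionary; the function name `combine_features`, the feature names, and the values are hypothetical.

```python
def combine_features(*feature_sets):
    """Merge per-system feature dicts, refusing silently overwritten names."""
    combined = {}
    for features in feature_sets:
        overlap = combined.keys() & features.keys()
        if overlap:
            raise ValueError(f"overlapping feature names: {sorted(overlap)}")
        combined.update(features)
    return combined

# Hypothetical feature values for a single revision.
nlp_features = {"uppercase_ratio": 0.8}
reputation_features = {"user_reputation": 0.1}
metadata_features = {"comment_length": 0}

row = combine_features(nlp_features, reputation_features, metadata_features)
```

The overlap check is a design choice for the sketch: it makes a name collision between systems fail loudly rather than letting one system's value replace another's.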

  7. Metadata • Properties of a revision that are immediately available • Identity of the editor • Timestamp of the edit • Minimal computational complexity • Examples that expose unexpected similarities in vandal behavior • Time since the article was last edited • Local time of day and day of week • Revision comment length
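
The example metadata features listed on this slide can be computed from a revision's timestamp and comment alone. In the sketch below, the function name, the dictionary keys, and the `utc_offset_hours` input (for instance, an offset guessed from the editor's location) are assumptions made for illustration, not the STiki implementation.

```python
from datetime import datetime, timedelta, timezone

def metadata_features(edit_time, previous_edit_time, utc_offset_hours, comment):
    """Compute simple metadata features for one revision."""
    # Shift the UTC timestamp to the editor's assumed local time.
    local = edit_time + timedelta(hours=utc_offset_hours)
    return {
        "seconds_since_last_edit": (edit_time - previous_edit_time).total_seconds(),
        "local_hour": local.hour,
        "local_weekday": local.weekday(),  # Monday == 0
        "comment_length": len(comment),
    }

features = metadata_features(
    edit_time=datetime(2010, 5, 3, 23, 30, tzinfo=timezone.utc),
    previous_edit_time=datetime(2010, 5, 3, 23, 0, tzinfo=timezone.utc),
    utc_offset_hours=2,
    comment="fixed typo",
)
```

Note that the local-time shift can change the day of week as well as the hour, which is why both are derived from the shifted timestamp.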

  8. Text • Language-independent features derived from analysis of the edit content. • Examples: • Uppercase ratio and digit ratio • Average and minimum edit quality
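
As an illustration of the language-independent text features, the sketch below computes an uppercase ratio and a digit ratio over the text an edit inserts. The exact definitions (uppercase letters over all letters, digits over all characters) are assumptions; the original systems may normalize differently.

```python
def text_features(inserted_text):
    """Compute character-level ratios over the text added by an edit."""
    letters = [ch for ch in inserted_text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    digits = sum(1 for ch in inserted_text if ch.isdigit())
    return {
        # Share of letters that are uppercase; 0.0 when there are no letters.
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
        # Share of all characters that are digits; 0.0 for empty text.
        "digit_ratio": digits / len(inserted_text) if inserted_text else 0.0,
    }

shouting = text_features("HELLO 123")
```

Neither ratio depends on a vocabulary or grammar, which is what makes features of this kind portable across language editions.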

  9. Language • Similar to text features • Features require expert knowledge about the language. • Examples • Pronoun frequency • Bad words
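
Language features differ from text features in that they need language-specific word lists. The sketch below shows the idea for English; the two word lists are small placeholders chosen for illustration, and the tokenization and normalization are assumptions rather than the Mola-Velasco system's definitions.

```python
import re

# Placeholder English word lists (assumed, deliberately tiny).
PRONOUNS = {"i", "me", "my", "you", "your", "we", "us", "our"}
BAD_WORDS = {"stupid", "idiot", "sucks"}

def language_features(inserted_text):
    """Compute word-list frequencies over the text added by an edit."""
    words = re.findall(r"[a-z']+", inserted_text.lower())
    total = len(words)

    def frequency(vocabulary):
        # Fraction of words found in the given list; 0.0 for empty text.
        return sum(1 for w in words if w in vocabulary) / total if total else 0.0

    return {
        "pronoun_frequency": frequency(PRONOUNS),
        "bad_word_frequency": frequency(BAD_WORDS),
    }

scores = language_features("You are stupid and I know it")
```

Supporting another language would mean supplying new word lists, which is the expert knowledge the slide refers to.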

  10. Reputation • Requires extensive historical processing of Wikipedia to produce a feature value. • Examples • User reputation • Country reputation
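
To show why reputation features depend on historical processing, the toy sketch below derives a per-country score from a log of past edits. It is not the WikiTrust or STiki computation: the function name, the input format, and the use of a raw vandalism rate as the score are all assumptions made for illustration.

```python
from collections import defaultdict

def country_vandalism_rate(history):
    """Map each country to the share of its past edits labeled as vandalism.

    `history` is an iterable of (country_code, was_vandalism) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for country, was_vandalism in history:
        totals[country] += 1
        flagged[country] += int(was_vandalism)
    return {country: flagged[country] / totals[country] for country in totals}

# Hypothetical edit log with made-up country codes.
rates = country_vandalism_rate([("AA", True), ("AA", False), ("BB", False)])
```

Even this simplified version must scan every earlier edit before it can score a new one, in contrast to the metadata features, which need only the revision itself.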

  11. Comprehensive list of features organized by class • Features in the “!Z” (not zero-delay) class are those that are only appropriate for historical vandalism detection
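
Restricting the feature list to what each task may use can be sketched as a simple filter on the "!Z" flag. The feature names below are illustrative, and marking only `next_comment_revert` as "!Z" is an assumption based on its description elsewhere in the slides, not a copy of the paper's table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    name: str
    cls: str          # "M", "T", "L", or "R"
    zero_delay: bool  # False marks a "!Z" (historical-only) feature

FEATURES = [
    Feature("time_since_last_edit", "M", True),
    Feature("uppercase_ratio", "T", True),
    Feature("pronoun_frequency", "L", True),
    Feature("user_reputation", "R", True),
    Feature("next_comment_revert", "L", False),
]

def select_features(features, task, classes=("M", "T", "L", "R")):
    """Return feature names usable for the task, limited to the given classes."""
    if task not in ("immediate", "historical"):
        raise ValueError(f"unknown task: {task!r}")
    return [
        f.name
        for f in features
        if f.cls in classes and (task == "historical" or f.zero_delay)
    ]

immediate = select_features(FEATURES, "immediate")
historical = select_features(FEATURES, "historical")
```

The `classes` argument allows the same filter to produce class combinations such as M+T for either task.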

  12. Experimental Results • Results are presented in terms of area under the curve (AUC) for two curves • Precision-recall (PR) curve • Receiver operating characteristic (ROC) curve • AUC-ROC is often reported for binary classification problems. • AUC-PR offers a more discriminating look into the performance of various feature combinations.
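
Both metrics can be computed from classifier scores and true labels. The sketch below computes AUC-ROC as the probability that a vandalism revision outscores a regular one (ties count half), and approximates AUC-PR with average precision. Using average precision is an assumption; the evaluation in the paper may integrate the PR curve differently.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits

# Hypothetical labels (1 = vandalism) and classifier scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
```

When vandalism is rare, a large pool of easy negatives keeps AUC-ROC high, while precision drops quickly with each false positive; this is one reason the PR view separates feature combinations more sharply.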

  13. Observations • The Language (L) set improves in performance because of the next comment revert feature. • Both Metadata (M) and Text (T) show impressive gains in going from the immediate task to the historical task. • The predictive power of [M+T] and [M+T+R] is nearly identical in the historical setting, so Reputation is useful for immediate detection but less useful for historical detection.

  14. Conclusion • Although previous work on Wikipedia vandalism detection uses features from multiple categories, each system has focused predominantly on a single category. • This paper combines the features of three previous systems, each representing a unique dimension in feature selection. • The results outperform the winning system of the PAN 2010 competition (62% vs 82% AUC).
