ZOT! To Wikipedia Vandalism
PAN-10 @ CLEF 2010 Shared Task #2
Rebecca Maessen and James White

Data Set
• The low number of vandalism edits in the PAN@CLEF data gave little information on the patterns that distinguish vandalism from regular edits
• Added 5,276 manually classified vandalism edits from the research of West et al.
• Combined data set: 20,283 edits
• 31% “ill-intentioned” edits, compared to 6% before
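As a rough consistency check on the data set figures, the arithmetic can be sketched as follows. It assumes that the "6% before" refers to the PAN@CLEF edits alone (the combined total minus the added edits) and that all 5,276 added edits count as ill-intentioned; the variable names are illustrative. Because 6% is a rounded figure, the result is only approximate and lands at roughly 30 to 31%.

```python
# Figures taken from the slide.
combined_total = 20_283      # edits in the combined data set
added_vandalism = 5_276      # manually classified vandalism edits added
original_share = 0.06        # rounded share of ill-intentioned edits before

# Assumption: the original PAN@CLEF set is the combined set minus the additions.
original_total = combined_total - added_vandalism

# Approximate count of ill-intentioned edits in the original set.
original_vandalism = original_share * original_total

# Share of ill-intentioned edits after combining both sources.
combined_share = (original_vandalism + added_vandalism) / combined_total

print(f"original edits: {original_total}")
print(f"combined ill-intentioned share: {combined_share:.1%}")
```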
Algorithms
• Logistic regression on word vector and W-J48 decision trees on metadata features
• Logistic regression and W-J48 decision trees after combining the features
• Ensemble methods: Bagging, Boosting and Random Forest
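To make the first item concrete, here is a minimal sketch of logistic regression on a bag-of-words vector, in plain Python. The slides do not say how the word vectors were built or which toolkit was used, so the tokenizer, learning rate, epoch count, and the toy edits below are all assumptions made up for illustration, not the authors' setup or data.

```python
import math
from collections import Counter

# Toy, made-up edits (1 = vandalism, 0 = regular); NOT from the PAN or West et al. data.
TRAIN = [
    ("you are stupid lol", 1),
    ("this page is stupid", 1),
    ("lol lol poop", 1),
    ("added citation for population figure", 0),
    ("fixed typo in history section", 0),
    ("updated population figure with citation", 0),
]

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumed, deliberately simple choice)."""
    return text.lower().split()

def build_vocab(texts):
    """Map every token seen in training to a vector index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Turn a text into a word-count vector; unseen words are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(text, vocab, w, b):
    """Estimated probability that an edit is vandalism."""
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

vocab = build_vocab(text for text, _ in TRAIN)
X = [vectorize(text, vocab) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
w, b = train_logreg(X, y)

print(predict_proba("stupid lol", vocab, w, b))
print(predict_proba("fixed citation", vocab, w, b))
```

Combining the features, as in the second item, would amount to appending metadata values to the vector returned by `vectorize` before training; which metadata features were used is not stated on the slide.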
Conclusion
• Results were better than expected beforehand
• Achieved an f-measure just as good as the results from previous work
• Decision trees are a favorable approach to the Wikipedia vandalism problem
• Top-down feature analysis and statistical information from word vectors are both relevant to classifying vandalism
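For reference, the f-measure mentioned above is usually computed as the harmonic mean of precision and recall on the vandalism class. The sketch below assumes the balanced F1 variant, which the slide does not specify; the function name and the example labels are illustrative only.

```python
def f_measure(y_true, y_pred, positive=1):
    """Balanced F-measure (F1) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 1 = vandalism, 0 = regular edit.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(f_measure(y_true, y_pred))
```

Here two of the three true vandalism edits are caught and two of the three flagged edits are correct, so precision and recall are both 2/3 and the f-measure is 2/3 as well.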