Finding High-Quality Content in Social Media chenwq 2011/11/26
Authors Eugene Agichtein, Emory University • Research: Intelligent Information Access Lab (IRLab) • News: our team won the "Best Paper" award at SIGIR 2011.
Abstract • Since the early 2000s, user-generated content has become popular on the web. The quality of user-generated content varies drastically, from excellent to abuse and spam. • Goal: automatically separate high-quality content from the rest • Graph-based framework • Combine the different sources of evidence in a classification formulation
Contents 1 Related work 2 CONTENT QUALITY ANALYSIS 3 MODELING CONTENT QUALITY 4 EXPERIMENT & Conclusion
Related work • Link analysis in social media • Propagating reputation • Question/answering portals and forums • Expert finding • Text analysis for content quality • Implicit feedback for ranking
Related work • Link analysis in social media • G = (V, E) • V corresponds to the users of a question/answer system • There is a directed edge e = (u, v) ∈ E from a user u ∈ V to a user v ∈ V if user u has answered at least one question of user v • G’ = (V, E’) • PageRank, ExpertiseRank, HITS
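The graph definition above can be made concrete with a small sketch. It builds G from (answerer, asker) pairs and runs a plain power-iteration PageRank over it. The function names, the damping factor 0.85, the iteration count, and the sample users are illustrative assumptions, not values taken from the slides.

```python
def build_answer_graph(answer_pairs):
    """Build G = (V, E): an edge u -> v means u answered a question of v.

    answer_pairs is an iterable of (answerer, asker) tuples (hypothetical input format).
    Returns a dict mapping every user to the set of users they point to.
    """
    graph = {}
    for answerer, asker in answer_pairs:
        graph.setdefault(answerer, set()).add(asker)
        graph.setdefault(asker, set())  # askers with no answers still belong to V
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; the mass of dangling nodes is spread uniformly."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in graph[u]:
                new_rank[v] += damping * rank[u] / len(graph[u])
        rank = new_rank
    return rank


pairs = [("alice", "carol"), ("bob", "carol"), ("carol", "alice")]
ranks = pagerank(build_answer_graph(pairs))
```

With edges oriented as on the slide, score flows from answerers to askers. To credit answerers instead, one would run the same routine on the graph with every edge reversed.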
CONTENT QUALITY ANALYSIS ——Intrinsic content quality • As a baseline, we use textual features only: all word n-grams up to length 5 that appear in the collection more than 3 times are used as features
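A minimal sketch of the n-gram baseline described above, assuming whitespace tokenization and lowercasing (neither is specified on the slide). "More than 3 times" is read as a collection count of at least 4; the function names are hypothetical.

```python
from collections import Counter


def word_ngrams(text, max_n=5):
    """Yield all word n-grams of length 1..max_n from a whitespace-tokenized text."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(documents, max_n=5, min_count=4):
    """Keep the n-grams whose total count over the collection is at least min_count."""
    counts = Counter()
    for doc in documents:
        counts.update(word_ngrams(doc, max_n))
    return sorted(gram for gram, c in counts.items() if c >= min_count)


def featurize(text, vocabulary, max_n=5):
    """Map a text to a count vector aligned with the vocabulary order."""
    counts = Counter(word_ngrams(text, max_n))
    return [counts[gram] for gram in vocabulary]


docs = ["good answer"] * 4 + ["bad answer"] * 2
vocab = build_vocabulary(docs)
vector = featurize("good answer good", vocab)
```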
CONTENT QUALITY ANALYSIS ——Intrinsic content quality • Punctuation and typos: punctuation, capitalization, spacing density, character-level entropy, spelling mistakes, out-of-vocabulary words • Syntactic and semantic: average number of syllables per word, entropy of word lengths, readability measures • Grammaticality: part-of-speech sequences, formality score, distance between its (trigram) language model and several given language models
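A few of the surface features listed above can be sketched with the standard library alone. The exact definitions used by the authors are not given on the slide, so the formulas below (Shannon entropy in bits, densities as fractions of all characters) are assumptions chosen for illustration.

```python
import math
import string
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def surface_features(text):
    """Compute a handful of punctuation/typo-style features for one text."""
    words = text.split()
    n_chars = len(text)

    def density(predicate):
        # Fraction of characters satisfying predicate; 0.0 for empty text.
        return sum(1 for ch in text if predicate(ch)) / n_chars if n_chars else 0.0

    return {
        "char_entropy": shannon_entropy(text),
        "word_length_entropy": shannon_entropy(len(w) for w in words),
        "punctuation_density": density(lambda ch: ch in string.punctuation),
        "capitalization_density": density(str.isupper),
        "spacing_density": density(str.isspace),
    }


features = surface_features("WOW!!! great   answer")
```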
CONTENT QUALITY ANALYSIS ——User relationships • Item-user graph: a user u is linked to a question q by an answer • User-user graph: an edge from u to v means that u has answered a question from user v
CONTENT QUALITY ANALYSIS——Usage statistics • The number of clicks on some item • The dwell time on some item
CONTENT QUALITY ANALYSIS ——classification framework • We cast the problem of quality ranking as a binary classification problem • Support vector machines • Log-linear classifiers • Stochastic gradient boosted trees • Our goal is to discover interesting, well-formulated, and factually accurate content
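As an illustration of the binary classification formulation, here is a small log-linear (logistic regression) classifier trained with stochastic gradient descent. It is a sketch of only one of the three classifier families named above, not the authors' implementation. The two toy features (a length score and a typo rate), the learning rate, and the epoch count are made-up assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_log_linear(X, y, epochs=200, lr=0.1):
    """Fit weights w and bias b by SGD on the logistic loss (labels are 0 or 1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the loss with respect to the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_quality(w, b, x):
    """Estimated probability that item x is high quality."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical features: [normalized length, typo rate]; label 1 = high quality.
X = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_log_linear(X, y)
```

Ranking items by `predict_quality` turns the binary classifier into a quality ranker.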
MODELING CONTENT QUALITY ——user relationships • Our dataset, viewed as a graph as illustrated in Figure 1
MODELING CONTENT QUALITY ——user relationships • The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2
MODELING CONTENT QUALITY ——user relationships • the unique characteristics of the community question/answering domain
MODELING CONTENT QUALITY ——user relationships • Question subtree • Q: features from the question being answered • QU: features from the asker of the question being answered • QA: features from the other answers to the same question
MODELING CONTENT QUALITY ——user relationships • User subtree • UA: features from the answers of the user • UQ: features from the questions of the user • UV: features from the votes of the user • UQA: features from answers received to the user’s questions • U: other user-based features
MODELING CONTENT QUALITY ——user relationships • Question features
MODELING CONTENT QUALITY ——user relationships • Implicit user-user relations • G = (V, E) • E = E_a ∪ E_b ∪ E_v ∪ E_s ∪ E_+ ∪ E_− • G_x = (V, E_x) • h_x: the vector of hub scores on the vertices V • a_x: the vector of authority scores • p_x: the vector of PageRank scores • p'_x: the vector of PageRank scores in the transposed graph
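The hub and authority vectors for one edge-type graph G_x can be sketched with the standard HITS iteration, and the transposed graph used for p'_x is a simple edge reversal. The dict-of-sets graph representation, the iteration count, and the function names are assumptions made for this example.

```python
import math


def transpose(graph):
    """Return the graph with every edge u -> v reversed to v -> u."""
    reversed_graph = {v: set() for v in graph}
    for u, successors in graph.items():
        for v in successors:
            reversed_graph[v].add(u)
    return reversed_graph


def _normalize(scores):
    """Scale a score dict to unit Euclidean norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {v: s / norm for v, s in scores.items()}


def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a graph given as node -> successors.

    Every node must appear as a key, even if it has no outgoing edges.
    """
    hub = {v: 1.0 for v in graph}
    auth = {v: 1.0 for v in graph}
    for _ in range(iterations):
        # Authority of v: sum of hub scores of nodes pointing to v.
        auth = {v: 0.0 for v in graph}
        for u, successors in graph.items():
            for v in successors:
                auth[v] += hub[u]
        auth = _normalize(auth)
        # Hub of u: sum of authority scores of nodes u points to.
        hub = _normalize({u: sum(auth[v] for v in graph[u]) for u in graph})
    return hub, auth


g_x = {"u1": {"v"}, "u2": {"v"}, "v": set()}
h_x, a_x = hits(g_x)
g_x_transposed = transpose(g_x)
```

Running the same routines once per edge type yields one set of score vectors per G_x, which can then be attached to each user as features.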
MODELING CONTENT QUALITY ——user relationships • Implicit user-user relations
MODELING CONTENT QUALITY ——user relationships • Content features for QA • Goal: identify the most salient features for the specific tasks of question or answer quality classification • The KL-divergence between the language models of the two texts • Their non-stopword overlap • The ratio between their lengths
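The three question/answer pair features above can be sketched as follows. Several choices are assumptions, since the slide does not specify them: unigram language models with add-one smoothing over the joint vocabulary, KL measured from the question model to the answer model, the length ratio taken as answer length over question length, and a tiny hypothetical stopword list.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "to", "in"}


def unigram_model(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats; both distributions must share the same support."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def qa_pair_features(question, answer):
    """Compute KL-divergence, non-stopword overlap, and length ratio for a QA pair."""
    q_tokens = question.lower().split()
    a_tokens = answer.lower().split()
    vocabulary = set(q_tokens) | set(a_tokens)
    q_model = unigram_model(q_tokens, vocabulary)
    a_model = unigram_model(a_tokens, vocabulary)
    q_content = set(q_tokens) - STOPWORDS
    a_content = set(a_tokens) - STOPWORDS
    return {
        "kl_divergence": kl_divergence(q_model, a_model),
        "non_stopword_overlap": len(q_content & a_content),
        "length_ratio": len(a_tokens) / len(q_tokens) if q_tokens else 0.0,
    }


features = qa_pair_features("what is python", "python is a language")
```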
MODELING CONTENT QUALITY ——user relationships • Usage features for QA • number of item views (clicks) • Metadata of question • how long ago the question was posted • derived statistics • the expected number of views for a given category • the deviation from the expected number of views • other second-order statistics • the click frequency
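The derived usage statistics above can be illustrated with a short sketch. The record fields (`category`, `views`, `age_days`) are hypothetical, the expected number of views is taken to be the per-category mean, and click frequency is assumed to mean views per day since posting; the slide does not define either precisely.

```python
from collections import defaultdict


def usage_features(items):
    """Return one feature dict per item, in input order.

    Each item is a dict with keys 'category', 'views', and 'age_days'.
    """
    # First pass: accumulate total views and item count per category.
    totals = defaultdict(lambda: [0.0, 0])
    for item in items:
        totals[item["category"]][0] += item["views"]
        totals[item["category"]][1] += 1
    expected = {cat: views / count for cat, (views, count) in totals.items()}

    # Second pass: compare each item against its category expectation.
    features = []
    for item in items:
        exp = expected[item["category"]]
        features.append({
            "views": item["views"],
            "expected_views": exp,
            "deviation": item["views"] - exp,
            # Items posted today are treated as one day old to avoid division by zero.
            "click_frequency": item["views"] / max(item["age_days"], 1),
        })
    return features


items = [
    {"category": "pets", "views": 10, "age_days": 2},
    {"category": "pets", "views": 30, "age_days": 10},
    {"category": "cars", "views": 5, "age_days": 0},
]
stats = usage_features(items)
```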
Experiment & Conclusions ——EXPERIMENTAL SETTING • Dataset • Edges induced from the whole dataset.
MODELING CONTENT QUALITY ——EXPERIMENTAL SETTING • Dataset statistics