This study explores cultural representations on Wikipedia to uncover cross-cultural stereotypes and misunderstandings. It introduces a computational approach for mining relations between cultural communities, focusing on cultural understanding, similarity, and affinity through language.
Mining the Social Web
Jwan Alhussein, jwan@live.com
Mining cross-cultural relations from Wikipedia: A study of 31 European food cultures
• Authors: Paul Laufer, Claudia Wagner, Fabian Flöck, Markus Strohmaier
Introduction
Wikipedia represents one of the primary sources of knowledge about foreign cultures. Uncovering diverging representations of cultures provides an important insight, since such divergences may foster the formation of cross-cultural stereotypes, misunderstandings and potentially even conflict.
The Wikipedia article on "French cuisine" found in the Romanian-language edition might surprise a French national when translated into her mother tongue. Unlike the French "original", it has no extended mention of French wines and only a very short paragraph on croissants and pastries. On the other hand, it features a section on foie gras and lamb dishes that is more extensive than the coverage in the French-language edition.
Jwan Alhussein, 12 July 2016
Problem
• Mining the relations between cultures as expressed on Wikipedia.
Approach and contributions
• We introduce a computational approach for mining and assessing relations between cultural communities on Wikipedia along three dimensions:
  • cultural understanding
  • cultural similarity
  • cultural affinity
• We do so by exploring the communities' descriptions of, and interest in, cultural practices.
2. RELATED WORK
Previous research has acknowledged that interesting differences exist between the language editions of Wikipedia. Unlike previous work, we exploit the collectively generated descriptions of cultures in different language versions of Wikipedia, since each language community may perceive and document its own and other cultures through its particular cultural lens. Countries with a lower Human Development Index, such as Russia or Poland, show less interest in editing and maintaining Wikipedia than more developed countries such as Denmark or Germany.
3. METHODS & DATASETS
We use language as a proxy for cultural communities, since language is closely linked to both national and cultural boundaries.
3.1 Cross-Cultural Relation Mining
• Cultural Similarity
• Cultural Understanding
• Cultural Affinity and Bias
Cultural Relations
• Similarity
• Understanding
• Affinity
Cultural Similarity
• Italian cuisine: Pasta, Parmigiano, Pizza, Tortano
• German cuisine: Riesling, Sausage, Sauerkraut, Wheat Beer
• The two concept sets are compared using Jaccard similarity
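As a rough illustration of the Jaccard similarity named on this slide, the sketch below compares two sets of cuisine concepts. The grouping of the slide's concepts by cuisine is an assumption, the third set is invented purely to show a non-zero score, and the function name is hypothetical; this is not the authors' code.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of concepts."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Concept sets read off the slide; the grouping by cuisine is an assumption.
italian = {"Pasta", "Parmigiano", "Pizza", "Tortano"}
german = {"Riesling", "Sausage", "Sauerkraut", "Wheat Beer"}

# Hypothetical set, invented only to demonstrate a partial overlap.
hypothetical = {"Pasta", "Pizza", "Sausage", "Riesling"}

print(jaccard_similarity(italian, german))        # the two sets share no concept
print(jaccard_similarity(italian, hypothetical))  # 2 shared concepts out of 6 distinct
```

A score of 0 means the two descriptions share no concept and a score of 1 means they are identical, so higher values indicate more similar food cultures under this measure.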
Cultural Similarity between Neighbors
Cultural Understanding: understanding the Italian food culture
• Figure labels: Wikipedia edition, used concepts, "native" definition, understanding
• Example understanding scores shown: 2 / 5 and 0 / 6
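The slide does not spell out how the understanding score is computed. One plausible reading of the 2 / 5 and 0 / 6 values is the fraction of concepts a foreign edition uses that also appear in the "native" definition. The sketch below assumes exactly that; the function name and both example editions are hypothetical and chosen only to reproduce those two ratios.

```python
def understanding(used_concepts, native_concepts):
    """Share of an edition's used concepts that also occur in the native definition.

    Assumption: the denominator is the number of concepts the foreign edition uses.
    """
    used = set(used_concepts)
    native = set(native_concepts)
    if not used:
        return 0.0  # an edition that uses no concepts shows no understanding
    return len(used & native) / len(used)


# "Native" Italian definition, using concept names that appear in the slides.
native_italian = {"Pasta", "Parmigiano", "Pizza", "Tortano"}

# Hypothetical foreign editions, constructed to give 2/5 and 0/6.
edition_a = {"Pasta", "Pizza", "Riesling", "Sauerkraut", "Wheat Beer"}
edition_b = {"Riesling", "Sausage", "Sauerkraut", "Wheat Beer", "Croissant", "Foie gras"}

print(understanding(edition_a, native_italian))  # 2 of 5 used concepts are native
print(understanding(edition_b, native_italian))  # 0 of 6 used concepts are native
```

Unlike Jaccard similarity, this measure is asymmetric: it asks how well one community's description matches the native one, not the other way around.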
What may explain Cultural Understanding? (example shown: Germany)
• Create for each country a list of countries ranked by where most of its immigrants come from.
• Create for each country a list of countries ranked by how similar their values and beliefs are according to the European Social Survey (ESS).
  • ESS is a biennial 30-country survey of attitudes, beliefs and behavior.
Cultural Affinity
• Data: view statistics of cuisine pages in different language editions
• Question: how much more attention than we would expect does language community A pay to the culture of community B?
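The "more attention than we would expect" question can be sketched as a ratio of observed to expected view share. The expectation model below (a cuisine's share of views summed over all editions) is an assumption, since the slide does not define it, and the view counts and function name are hypothetical.

```python
def affinity(views, edition, culture):
    """Observed share of an edition's cuisine views divided by the expected share.

    Assumption: the expected share is the culture's share of views over all editions.
    """
    edition_total = sum(views[edition].values())
    observed = views[edition][culture] / edition_total

    grand_total = sum(sum(per_culture.values()) for per_culture in views.values())
    culture_total = sum(per_culture.get(culture, 0) for per_culture in views.values())
    expected = culture_total / grand_total

    return observed / expected


# Hypothetical view counts: views[language edition][cuisine page].
views = {
    "de": {"Italian": 600, "French": 300, "German": 100},
    "fr": {"Italian": 200, "French": 700, "German": 100},
}

print(affinity(views, "de", "Italian"))  # above 1: more attention than expected
print(affinity(views, "fr", "Italian"))  # below 1: less attention than expected
```

Under this reading, a value above 1 indicates affinity and a value below 1 indicates relative disinterest.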
Self-Focus & Regional Bias
Summary
• Affinities between language communities are present in Wikipedia and drive the attention process
• Cultural understanding can to some extent be explained by migration
• Cultural similarities inferred from Wikipedia are fairly plausible (CrowdFlower)
• Relation between similarity, understanding and affinity?
  • Understanding and affinity: -0.35
  • Similarity and affinity: 0.27
  • Similarity and understanding: 0.19
Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter
• Authors: Marco Pennacchiotti, Ana-Maria Popescu
Classification task
• The starting point is to fill in missing user attributes by classifying each user with respect to the missing attribute.
  • For example, most users do not explicitly state their political views.
• There are various methods for solving the user classification problem.
• What do we have in the social media domain?
  • User attributes
    • Users have many attributes, such as age, gender, etc.
    • Based on these attributes, a classifier can be trained.
  • Social network
    • Users have friends that they follow.
• How can the classification task be defined so that it combines these two types of information, user attributes and social network structure?
Machine learning model
• A novel architecture combining user-centric information and social network information
  • User-centric information consists of the users' attributes, called features hereafter
  • Social network information is the information about the users' friends
• Main contribution of the paper
  • Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm
  • Train the GBDT on the given labeled input data
  • Label the users with the resulting classifier
  • Then apply the same classifier model to the users' friends and label the friends as well
  • Lastly, update each user's label based on her friends' labels using an update formula
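The last step, updating a user's label from her friends' labels, can be sketched as follows. The slides do not give the update formula, so this assumes a simple weighted average of the user's own classifier score and the mean score of her friends. The weight `alpha`, the function name and all scores are hypothetical, and the scores stand in for GBDT outputs rather than being produced by a trained model.

```python
def update_scores(user_scores, friend_scores, friends_of, alpha=0.5):
    """Blend each user's classifier score with the mean score of her friends.

    Assumption: a convex combination with weight alpha; not the paper's exact formula.
    """
    updated = {}
    for user, score in user_scores.items():
        neighbours = [
            friend_scores[f] for f in friends_of.get(user, []) if f in friend_scores
        ]
        if neighbours:
            friend_mean = sum(neighbours) / len(neighbours)
            updated[user] = alpha * score + (1 - alpha) * friend_mean
        else:
            updated[user] = score  # no labeled friends: keep the classifier score
    return updated


# Hypothetical classifier scores (probability of the positive class).
user_scores = {"u1": 0.6, "u2": 0.4, "u3": 0.9}
friend_scores = {"f1": 0.9, "f2": 0.7, "f3": 0.1}
friends_of = {"u1": ["f1", "f2"], "u2": ["f3"]}

updated = update_scores(user_scores, friend_scores, friends_of)
labels = {user: score >= 0.5 for user, score in updated.items()}
print(updated)
print(labels)
```

With these numbers, u1 is pulled further toward the positive class by like-minded friends, u2 is pulled toward the negative class, and u3 keeps her original score because she has no labeled friends.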
User-Centric Information
• User-centric information is represented as features. The feature set is very large and mainly comprises four parts.
• Profile features (PROF)
  • User name, use of an avatar picture, date of account creation, etc.
• Tweeting behavior features (BEHAV)
  • Average number of tweets per day, number of replies, etc.
• Linguistic content features
  • The richest feature set, made up of five sub-feature sets
  • Uses Latent Dirichlet Allocation (LDA) as the language model
  • Prototypical words (LING-WORD):
    • Proto words are words that are emblematic of the users in a class
    • Found probabilistically from the data
    • First partition the users into n classes, then find the most frequent words for each class and take the k most used words per class
  • Prototypical hashtags (LING-HASH):
    • Hashtags (#) denote topics
    • Same technique as for proto words
  • Generic LDA (LING-GLDA):
    • Topics are extracted with the LDA model, and each user is represented as a distribution over topics
    • LDA is trained on the full set of users
  • Domain-specific LDA (LING-DLDA):
    • Same as generic LDA, but trained on a task-specific training set, such as users who are all Democrats or Republicans
  • Sentiment words (LING-SENT):
    • A small, manually collected set of terms, e.g. Ronald Reagan: good or bad?
    • The OpinionFinder tool gives the sentiment as positive, negative, or neutral
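The proto-word procedure described on this slide (partition users by class, count words per class, keep the top k) can be sketched as below. This follows the slide's frequency-based description only; any probabilistic scoring used in the paper is not reproduced, and the function name, class names and tweets are hypothetical.

```python
from collections import Counter


def prototypical_words(tweets_by_user, label_of, k=2):
    """Return the k most frequent words per class, ties broken alphabetically."""
    counts = {}
    for user, tweets in tweets_by_user.items():
        counter = counts.setdefault(label_of[user], Counter())
        for tweet in tweets:
            counter.update(tweet.lower().split())  # naive whitespace tokenization

    top_words = {}
    for label, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        top_words[label] = [word for word, _ in ranked[:k]]
    return top_words


# Hypothetical toy data: two users, one per class.
tweets_by_user = {
    "u1": ["coffee latte coffee", "latte coffee"],
    "u2": ["tea tea scone", "tea scone"],
}
label_of = {"u1": "class_a", "u2": "class_b"}

print(prototypical_words(tweets_by_user, label_of, k=2))
```

On real tweets, raw frequency would mostly surface stop words, so some filtering or a score that contrasts classes would be needed; the sketch leaves that out.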
User-Centric Information
• Social network features
  • A combination of two different kinds of features
  • Friend accounts (SOC-FRIE):
    • Indicates which friend accounts are shared by users of each label, e.g. Democrats and Republicans
  • Prototypical replied (SOC-REP) and retweeted (SOC-RET) users:
    • Find the most frequently mentioned (@) and retweeted (RT) users for each user label
• That is all for the user-centric information: a very large feature set indeed.
Experimental Evaluation
• Three binary classification tasks:
  • Detecting political affiliation
    • Democrat or Republican
    • 5169 Democrats and 5169 Republicans
    • 1.2 million friends
  • Ethnicity
    • African American or not
    • 3000 African Americans and 3000 non-African Americans
    • 508K friends
  • Following a business
    • Following Starbucks or not
    • 5000 Starbucks followers and 5000 non-followers
    • 981K friends
Experimental Results, Political Affiliation Task
• The combined HYBRID model achieves its best result among the three tasks here; however, the increase over the ML model alone is not significant.
• Social network features are very successful. This is because users with a particular political view tend to be friends with users who hold similar views. Supporting this, the graph-based label update is also very successful on its own.
Overall Comments
• #1 The ML method is mostly good enough, and the update part of the architecture does not bring a significant improvement. If the task allows users to form a community, the update function works; otherwise, it may even hurt the standalone ML system, as in the ethnicity case.
• #2 Linguistic features are always reliable.
Review
• The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized.
• First of all, the classifier only performs binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification with binary classifiers is time-consuming and weakens the claim about scalability.
• As said, the novel architecture is an attractive idea; however, the results show that the label update does not work well. Why? The authors give no appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
• There are too many features, and the reasons why these features were selected are not given.
• Moreover, applying the same ML model to the users and their friends replicates the information. Obviously connected users will have some common and some different attributes, so what is the point?
• The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model, because we should superpose different information types instead of using one to compensate for the other. Thinking in terms of a vector space makes the difference clear: updating means spanning the same vector again, whereas superposing means using both vectors concurrently. For example, the proto words could have been extracted using the network in some way.
Review
• The authors mention Gradient Boosted Decision Trees (GBDT) but say nothing about this classification algorithm; an explanation of GBDT is expected, at least in principle. The same holds for the Latent Dirichlet Allocation (LDA) language model. It is the first time I have heard of this language model, and they say nothing about it beyond the fact that it is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics?
• There is no data analysis, which is a crucial gap in the paper: everything is data! They only give the number of users used in training, but what about the test set? The development set? Any other statistics about the data? Moreover, they use a different number of samples for each task. The label update is much less successful for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but only 508K, less than half as many, for the ethnicity task. Hence the cross-task comparisons are not reliable.
• The experiments are not done in a structured way. The authors simply run the experiments and show the results, without any useful commentary. Besides, they do not explain why they chose these experiments. For example, I would want to see results for feature subsets: since individual features mostly give very good results on their own, some subset may improve the overall HYBRID result.
Thanks for your attention!