Peoples’ Interests in Social Networks Group Members: 07005029 – Abhinav Gokari 07005030 – Sudheer Kumar 07d05004 – Ignatius Pereira 07d05019 – Praveen Dhanala Under the guidance of Prof. Pushpak Bhattacharyya
Outline • Motivation • Social Networks and Homophily • Experiments • Statistical Methods Used • Advantages of Analyzing Social Networks for Homophily • Conclusion • References
Motivation • Social networks are an ongoing phenomenon • Orkut, Facebook, Twitter, etc. • Almost one tenth of the world’s population uses Facebook • There is great scope for innovation and development in social computing, which deals with creating social contexts through the use of software and technology. • Interesting problems arise, such as: • Social network analysis • Target marketing and improving e-commerce • Friendship (or relationship) suggestions
Social Networks • A social network is a social structure made up of individuals (or organizations) called "nodes", which are tied (connected) by one or more specific types of interdependency such as friendship or kinship. • Online social networks are attribute-independent networks. • In online social networks, a relationship between two individuals is mutually self-defined and binary. • Friendship is not functional and the reasons for it can be subtle, e.g. an offline/online meeting, a common workplace, or pure visual interest.
Homophily • It is the tendency of individuals to associate or bond with others who have a similar set of interests or attributes ("birds of a feather flock together"). • People choose friends who share common interests and characteristics. • The principle of homophily is one of the most general and least contested theoretical principles in sociology.
Homophily (contd.) • A social system is homophilous if contacts are more similar to one another than to strangers in terms of their individual attributes and behavior. • If homophily is a robust aspect of human behavior, it can be used to deduce a particular person’s attributes from his/her friends’ attributes in an online social network. • We shall now examine the following experiments to observe homophily in social networks.
Expt. based on the paper by Agarwal et al. • A “Travel Site” is chosen with mutually self-declared friendships • A data set of 181 nodes with 1214 friendship links is selected, along with additional information such as the users’ attributes and characteristics • Each user can select their hobbies from a list of 26 predefined hobbies • Each user also has additional characteristics such as “language spoken” and “country they live in” - Based on “Predicting Interests of People on Online Social Networks” by Agarwal et al., 2009.
Contd. • The information gathered on the website was: • Friends’ Network: This is the mutually self-declared friends network matrix. • Hobbies: Members declare their hobbies by clicking on boxes next to a list of 26 possible hobbies. • Languages Spoken: There is a list of 139 languages from which members select a maximum of three languages they speak. • Age group: The age group is given as a range, for example under 20, 20-25, 26-30, etc., from which the user chooses one. There are a total of 12 ranges.
F - This is the 181 by 181 friends network matrix. If person p1 has a friend p2 then F[p1, p2] will be 1, otherwise it will be 0.
H - This is the hobbies matrix, 181 by 26: 181 for the number of people and 26 for the number of different hobbies a person may have. For example, if p50 has three hobbies - Acting, Dancing and Theatre - then H[p50, Acting], H[p50, Dancing] and H[p50, Theatre] will be 1 and all the other cells in row H[p50] will be 0.
L - This is the matrix of languages spoken by people. It has the same form as H, the only difference being that the columns are the different languages a person can speak. This matrix is therefore 181 by 139.
P - This is the matrix of places visited by people. It has the same form as H and L, the only difference being that the columns are the different places visited by a person. This matrix is therefore 181 by 263.
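As a minimal sketch of the encoding described above, the following builds a friends matrix F and a hobbies matrix H as nested lists. The three people, the four hobbies, the friendships, and the helper names are all made up for illustration; the real data set has 181 people and 26 hobbies. The same indicator helper would serve for L and P.

```python
# Toy data: hypothetical people, hobbies and friendships, not taken from the study.
people = ["p1", "p2", "p3"]
hobbies = ["Acting", "Dancing", "Theatre", "Cooking"]
friendships = [("p1", "p2"), ("p2", "p3")]
declared = {"p1": ["Acting", "Dancing"], "p2": ["Theatre"], "p3": []}


def friends_matrix(people, friendships):
    """F[i][j] is 1 if person i and person j are friends, otherwise 0."""
    index = {p: i for i, p in enumerate(people)}
    F = [[0] * len(people) for _ in people]
    for a, b in friendships:
        F[index[a]][index[b]] = 1
        F[index[b]][index[a]] = 1  # friendships are mutual, so F is symmetric
    return F


def indicator_matrix(people, items, declared):
    """One row per person, one column per item; 1 where the person declared the item."""
    column = {item: j for j, item in enumerate(items)}
    M = [[0] * len(items) for _ in people]
    for i, person in enumerate(people):
        for item in declared.get(person, []):
            M[i][column[item]] = 1
    return M


F = friends_matrix(people, friendships)
H = indicator_matrix(people, hobbies, declared)
```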
The hypothesis of the experiment is that there is a correlation between mutual self-declared friendship links in online social networks and attributes listed in the profiles of said friends, presumably because of homophily. • The GFHF algorithm is extremely sensitive to the correctness of the weight matrices. Thus, GFHF allows us to test our hypothesis.
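To make the role of the weight matrix concrete, here is a small sketch of graph-based label propagation. It assumes that GFHF refers to the Gaussian Fields and Harmonic Functions method, in which each unlabeled node takes the weighted average of its neighbours' scores while labeled nodes stay fixed. The function name, the iteration count, and the four-node path graph are illustrative assumptions, not the authors' implementation.

```python
def harmonic_labels(W, labels, iters=200):
    """Iteratively compute harmonic scores on a weighted graph.

    W      -- symmetric weight matrix as a list of lists
    labels -- dict mapping labeled node index to 0.0 or 1.0
    Returns one score in [0, 1] per node; labeled nodes keep their label.
    """
    n = len(W)
    f = [labels.get(i, 0.5) for i in range(n)]  # 0.5 = no information yet
    for _ in range(iters):
        updated = f[:]
        for i in range(n):
            if i in labels:
                continue  # labeled nodes are clamped
            degree = sum(W[i])
            if degree > 0:
                # weighted average of the neighbours' current scores
                updated[i] = sum(W[i][j] * f[j] for j in range(n)) / degree
        f = updated
    return f


# Hypothetical path graph 0 - 1 - 2 - 3 with the two end nodes labeled.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = harmonic_labels(W, {0: 1.0, 3: 0.0})
```

Because every prediction is an average over neighbours, replacing the true friendship weights with random ones changes the output directly, which is why the method is a useful probe for the hypothesis.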
GFHF Example Based on the paper by Amir Saffari et al., 2010
Support Vector Machines • A classifier derived from statistical learning theory by Vapnik et al. in 1992 • Currently, SVMs are widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, regression analysis, etc. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
Linear classifier • The goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. • A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
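The decision rule of a linear classifier can be sketched in a few lines. The weights, bias, and feature vectors below are arbitrary illustrative values, and the function name is hypothetical.

```python
def linear_classify(weights, bias, features):
    """Return +1 or -1 depending on the sign of the linear combination w.x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Hypothetical two-feature example.
weights = [2.0, -1.0]
bias = -0.5
first = linear_classify(weights, bias, [1.0, 0.5])   # score = 2.0 - 0.5 - 0.5 = 1.0
second = linear_classify(weights, bias, [0.0, 1.0])  # score = -1.0 - 0.5 = -1.5
```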
SVM in test • MATLAB Support Vector Machine Toolbox • The toolbox provides routines for support vector classification and support vector regression. • A GUI is included which allows the visualisation of simple classification and regression problems. (The MATLAB optimisation toolbox, or an alternative quadratic programming routine is required.) http://www.isis.ecs.soton.ac.uk/isystems/kernel/
Support Vector Machine run on sample data. http://users.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf
Why SVMs • Experiment by Thorsten Joachims on text categorization with support vector machines. • Text categorization is the classification of documents into a fixed number of predefined categories, where each document can be in one, multiple, or no category at all. • SVMs are well suited to categorization tasks with many features. • SVMs are robust and don’t require parameter tuning. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998.
Why SVMs (contd.) • SVMs are based on the Structural Risk Minimization principle. • The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error, i.e. the probability that h will make an error on an unseen, randomly selected test sample. • SVMs are universal learners • Their ability to learn is independent of the dimensionality of the feature space. • With the use of a simple kernel function, they can be used to learn polynomial classifiers.
SVMs v/s MLPs • Experiment by Barabino et al. on Support Vector Machines v/s Multi-Layer Perceptrons in particle identification in physics. • SVMs are based on minimization of structural risk, whereas MLPs are based on minimization of empirical risk. • Findings: • 1) SVMs and MLPs show very similar performance • 2) SVMs work well in the case of large training sets drawn from input spaces of small dimension M. Barabino et al., Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification, 1999.
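The remark about kernels can be illustrated with a degree-2 polynomial kernel. The sketch below checks, on one pair of made-up 2-D points, that the kernel value equals an ordinary dot product in an explicit quadratic feature space, which is what lets a linear method learn a polynomial classifier. Function names and sample points are illustrative assumptions.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, degree=2, coef0=0.0):
    """Polynomial kernel (x.y + coef0) ** degree."""
    return (dot(x, y) + coef0) ** degree


def quadratic_features(x):
    """Explicit feature map matching poly_kernel with degree=2, coef0=0 in 2-D."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


# Hypothetical sample points.
x = [1.0, 2.0]
y = [3.0, -1.0]
kernel_value = poly_kernel(x, y)
explicit_value = dot(quadratic_features(x), quadratic_features(y))
```

The kernel never builds the quadratic features, so the cost stays proportional to the original dimension even though the classifier is polynomial in the inputs.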
Back to the Expt. • To accept or reject our research hypothesis, we consider the prediction capability of GFHF using two weight matrices: • Randomly generated binary weight matrix (Gr) • Self-declared friends network (Gf) • To incorporate the effect of other attributes, a Support Vector Machine (SVM) is used along with GFHF • Two feature sets are used with the SVMs: • The set with only personal characteristics (Sc) • The set with all the hobbies except the one being predicted (St)
Contd. • GFHF is run 30 times, each time for a random configuration of ni labeled data points, where ni ∈ N = {10, 30, 50, 70, 90} • These predictions are calculated for all 26 hobbies under consideration • Therefore, for each weight matrix, Gr and Gf, we get a corresponding 26 x 5 x 30 matrix, where 26 is the number of hobbies, 5 is the number of different training set sizes and 30 is the number of trials.
Explanation of Results • The table shows the accuracy of running GFHF with the random matrix (Gr) and with the friends matrix (Gf) for 26 hobbies and across 3 different training set sizes (numbers of labeled data points) • The numbers are averages over the 30 trials with the same configuration. • The second-to-last column shows the average difference in accuracy between Gf and Gr across all training set sizes, and the last column shows the difference in accuracy between St and Gf, again as an average across all training set sizes.
Contd. • The results show that in most of the cases Gf performs significantly better than Gr which implies that the underlying friends network is in fact important for prediction. • For some hobbies, the difference in the performance of Gf and Gr is extremely high. These are precisely the hobbies that over 50% of the people in the network have.
Contd. • There are quite a few hobbies for which the friends network does not provide any useful information. • We see that the friends network does not consistently help over a random network if the hobby has a relative incidence of 41% or less. • At 47% and above, the friends network consistently outperforms the random network.
Contd. • The results corresponding to Sc and St are also similar. • In general, St performs better than Sc, which performs better than Gf • From this table we also observe that as we increase the data, prediction accuracy increases for the SVM
Expt. Based on the paper by Akshay Patil • The data was gathered from a large online social networking site • The data is essentially in the form of a huge network of interconnected nodes, with nodes representing actual people or users and the ties between them denoting relationships in the social network. • Each node also stores information about the individual user. This information makes up the node or user profile, and is essentially a list of attribute: value pairs. - Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009
Definitions • For a node u, the other nodes are divided into two classes: • Class of Near Nodes N(u) – nodes within a 2-hop radius • Class of Far Nodes F(u) – all nodes other than Near nodes • We associate a t-bit vector with every pair of nodes (one bit per attribute), placing a ‘1’ at the i-th position if the two nodes match on attribute Ai, or a ‘0’ if they do not match.
Contd. • Now, for each attribute Ai in the network, we define a 2 × 2 contingency matrix as shown in Table 3.1, where, • C00: Pairs of nodes in FS not matching on Ai. • C01: Pairs of nodes in FS matching on Ai. • C10 : Pairs of nodes in NS not matching on Ai. • C11: Pairs of nodes in NS matching on Ai. • |Cij| = kij
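The match vector and the contingency counts kij can be sketched as follows. The four user profiles, the attribute names, and the near/far pair lists are invented for illustration, and the layout k[class][match] (class 0 = far, 1 = near) is an assumed reading of the Cij definitions above.

```python
# Hypothetical profiles and pair classes; not data from the study.
profiles = {
    "a": {"city": "X", "age": "20s"},
    "b": {"city": "X", "age": "30s"},
    "c": {"city": "Y", "age": "20s"},
    "d": {"city": "Y", "age": "30s"},
}
near_pairs = [("a", "b"), ("c", "d")]
far_pairs = [("a", "c"), ("a", "d"), ("b", "c")]


def match_vector(profile_u, profile_v, attributes):
    """Bit i is 1 if both profiles have the same value for attributes[i]."""
    return [1 if profile_u[a] == profile_v[a] else 0 for a in attributes]


def contingency(profiles, near_pairs, far_pairs, attribute):
    """Return k with k[i][j] = number of pairs in class i (0 far, 1 near)
    that do not match (j = 0) or match (j = 1) on the attribute."""
    k = [[0, 0], [0, 0]]
    for cls, pairs in ((0, far_pairs), (1, near_pairs)):
        for u, v in pairs:
            match = 1 if profiles[u][attribute] == profiles[v][attribute] else 0
            k[cls][match] += 1
    return k


bits = match_vector(profiles["a"], profiles["b"], ["city", "age"])
k_city = contingency(profiles, near_pairs, far_pairs, "city")
```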
χ² (chi-square) Measure • The statistical measure we use to detect homophily is the χ² (chi-square) measure • The χ² measure aggregates the deviation of observed values from the expected values under the independence hypothesis. • The independence hypothesis in our case can be stated as follows: • “An attribute plays no role in classification of a node into a particular class Cij”
where Ai refers to a particular attribute, C refers to the classes defined, klm refers to the number of users in class l having value m for attribute Ai, and n refers to the total number of users. • The larger the χ² value, the lower the belief in the independence hypothesis, and hence the larger the role played by the particular attribute in relationship formation.
We can rewrite the formula for the χ² measure in known terms using the probabilities of each class/attribute and the independence hypothesis as follows: • In this way, we calculate the χ² value associated with each attribute in the network.
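The formula images from the slides are not reproduced in this transcript. As a hedged sketch, the code below assumes the standard Pearson chi-square statistic, where the expected count for cell (l, m) under independence is (row total × column total) / n and the statistic sums (observed − expected)² / expected over all cells. The counts in the example are hypothetical.

```python
def chi_square(k):
    """Pearson chi-square statistic for a contingency table k (list of rows).

    k[l][m] is the observed count for class l and attribute value m.
    """
    n = sum(sum(row) for row in k)
    row_totals = [sum(row) for row in k]
    col_totals = [sum(row[m] for row in k) for m in range(len(k[0]))]
    statistic = 0.0
    for l, row in enumerate(k):
        for m, observed in enumerate(row):
            # expected count if class and attribute value were independent
            expected = row_totals[l] * col_totals[m] / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are classes (far, near), columns are (no match, match).
independent = chi_square([[10, 10], [10, 10]])
associated = chi_square([[20, 10], [10, 20]])
```

In the second table every expected count is 15, so each of the four cells contributes 25/15 and the statistic is 20/3; in the first table observed and expected counts coincide and the statistic is zero.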
The Odds Ratio • The χ² measure assesses how statistically unlikely it is that there is no association between similarity on an attribute and the probability of a social relationship. • What the χ² measure cannot tell us is whether the association is positive or negative. • Yet we need such a directional measure to test the principle of homophily, which predicts a positive relationship. • A negative relationship would imply negative homophily, a tendency for individuals to associate not with similar others but with different others.
We therefore also compute the odds ratio for each attribute • The odds ratio is simply the odds that two similar individuals are connected divided by the odds that two dissimilar individuals are connected. • The odds ratio for an attribute can be defined as follows,
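The defining formula is not reproduced in this transcript. One plausible reading, stated here as an assumption, uses the contingency counts kij introduced earlier and treats membership in the near class as "connected": the odds for matching pairs are k11 / k01, the odds for non-matching pairs are k10 / k00, and their ratio is (k11 × k00) / (k01 × k10). A value above 1 would then indicate positive homophily on that attribute. The counts below are hypothetical.

```python
def odds_ratio(k):
    """Odds ratio from a 2 x 2 table k[class][match].

    class: 0 = far, 1 = near (assumed to stand in for 'connected')
    match: 0 = pair does not match on the attribute, 1 = pair matches
    """
    k00, k01 = k[0]
    k10, k11 = k[1]
    if k01 == 0 or k10 == 0:
        raise ValueError("odds ratio is undefined when k01 or k10 is zero")
    return (k11 * k00) / (k01 * k10)


# Hypothetical counts: matching pairs are near more often than non-matching pairs.
ratio = odds_ratio([[20, 10], [10, 20]])
```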
Explanation of Results • Trends that are visible from the online social network results are as follows, • Geographical location is the strongest factor affecting how relationships shape up in a social network. • The results also indicate that relationships are more likely to develop between individuals belonging to the same age group. • Religious affiliation and ethnicity are also dominant factors in relationship formation, as demonstrated by attributes like religion and languages spoken by individuals. • Likings, hobbies etc. are less likely to influence how ties are made in a social network. Relationships are less likely to be formed between individuals who for example enjoy the same movies or music, read the same books etc.
Advantages of analyzing Homophily • Some offline friendships may be absent in online social communities, and can thus be detected. Friends may not know that they are members of the same online community. This is especially true for young online communities or users new to the system. Facebook has a ‘people you may know’ feature, which suggests connecting with people who are possibly friends. • Advantages in target marketing and e-commerce are straightforward. For example, Orkut shows ads on our profile that are based on our profile information. Information spread in social networks is being used in diverse fields such as marketing campaigns.
Contd. • Link prediction may also turn out to be useful for suggesting links that are likely to develop in the future, thus steering the evolution of a social community. • In the case of large organizations or companies, there is often an official hierarchy for collaboration and interaction. Methods for link prediction could be effectively used to uncover beneficial interactions or collaborations that have not yet been fully utilized, which would otherwise be hidden by this official hierarchy
Conclusion • It has been widely observed that social networks exhibit homophily • We have seen how to detect homophily and some important applications of this phenomenon • More research has to be done on different kinds of sample data to analyze homophily more accurately and exploit it
References • Apoorv Agarwal, Owen Rambow, Nandini Bhardwaj. Predicting Interests of People on Online Social Networks. In the Proceedings of IEEE CSE 09, 12th IEEE International Conference on Computational Science and Engineering, IEEE Computer Society Press, Vancouver, Canada, 2009. • Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009. • Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 2001.
References (Contd.) • V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995. • Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998. • M. Barabino et al. Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification, 1999. • Amir Saffari, Christian Leistner, Horst Bischof. Semi-supervised Learning in Vision. CVPR San Francisco, 2010. • http://www.dtreg.com/svm.htm • Wikipedia