Social networks analysis seminar, introductory lecture #2. Danny Hendler and Yehonatan Cohen. Advanced Topics in on-line Social Networks Analysis
Seminar schedule • 5/3/14: Introductory lecture #1 • 10/3/14: Papers list published, students send their 3 preferences • 12/3/14: Introductory lecture #2 • 14/3/14: All students' preferences must be received • 19/3/14: No seminar (Purim!) • 26/3/14: Student talks start • 11 weeks of student talks • Semester ends
Talk outline • Nodes centrality • Degree • Closeness • Betweenness • Machine-learning
Nodes centrality • Name the most central/significant node: [Figure: example graph with nodes numbered 1 to 13]
Nodes centrality • Name the most central/significant node: [Figure: the same 13 nodes in a different layout]
Nodes centrality • What makes a node central? • Number of connections • It is central if it disconnects the graph • High number of paths passing through the node • Proximity to all other nodes • Central node is the one whose neighbors are central • …
Nodes centrality: Applications • Detection of the most popular actor in a network: spamming / advertising • Network vulnerability: health care / epidemics • Clustering similar structural positions: recommendation systems • …
Nodes centrality: Degree • In this lecture we will define the connectivity degree of a node as the number of its neighbors. • Alternative definitions are possible that take into account: • The strength of the node's connections • The direction of the node's connections • Etc.
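The degree definition above can be sketched in a few lines of Python. The graph below is a made-up example, not the graph drawn on the slides, and the names `graph`, `degree`, `degrees` and `most_central` are illustrative assumptions.

```python
# Hypothetical undirected graph stored as an adjacency mapping
# (node -> set of neighbors). It is NOT the graph from the slides.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def degree(graph, node):
    """Connectivity degree: the number of neighbors of `node`."""
    return len(graph[node])


# Degree of every node, and the node with the highest degree.
degrees = {node: degree(graph, node) for node in graph}
most_central = max(degrees, key=degrees.get)

print(degrees)
print(most_central)
```

A weighted or directed variant would replace `len(graph[node])` with a sum of edge weights, or with separate in-degree and out-degree counts.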
Nodes centrality: Degree • Name the most central/significant node: [Figure: example graph with nodes numbered 1 to 9]
Nodes centrality: Degree [Figure: example graph with nodes numbered 1 to 13]
Nodes centrality: Closeness (Reach) • Vertices that are connected to a node are directly reachable from it. • Vertices connected to the node's neighbors are still reachable, although it is harder to reach them. • The measure depends on the distance in hops from the node to each other vertex. • It also depends on the reach attenuation factor. • With no attenuation, all vertices in the node's connected component are equally reachable from it.
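A minimal sketch of an attenuated reach score follows. The exact formula did not survive in the slide text, so the form used here is an assumption: each vertex at hop distance d contributes `alpha ** d`, where `alpha` is the attenuation factor, and `alpha = 1` means no attenuation. The names `hop_distances`, `reach`, `alpha` and the example graph `path` are hypothetical.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: hop distance from `source` to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def reach(graph, node, alpha):
    """Assumed form of reach: sum of alpha ** distance over all other reachable vertices."""
    dist = hop_distances(graph, node)
    return sum(alpha ** d for other, d in dist.items() if other != node)


# Hypothetical path graph a - b - c - d (not the graph from the slides).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

print(reach(path, "a", 1.0))  # no attenuation: every other vertex counts fully
print(reach(path, "b", 0.5))  # attenuated: farther vertices count less
```

With `alpha = 1.0` the score is just the number of other vertices in the connected component, which matches the "no attenuation" case described above.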
Nodes centrality: Closeness (Reach) [Figure: example graph with nodes numbered 1 to 13; label: Reach attenuation factor]
Nodes centrality: Betweenness • Measures the extent to which a node lies between all others in a network. • Betweenness is used to estimate the control a node may have over the communication flows in a network. • It is computed from two counts for each pair of other nodes: the number of shortest paths between the pair, and the number of those shortest paths that pass through the node.
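The two shortest-path counts can be combined as in the sketch below, which assumes the common convention of summing, over every unordered pair of other nodes, the fraction of their shortest paths that pass through the node. The slide's own notation was lost, so this convention, the undirected unweighted graph, and the names `bfs_counts`, `betweenness`, `sigma`, `star` and `cycle` are all assumptions.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (dist, sigma): hop distances and shortest-path counts from `source`."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                sigma[neighbor] = 0
                queue.append(neighbor)
            # Every shortest path to `current` extends to `neighbor`
            # when `neighbor` is exactly one hop farther away.
            if dist[neighbor] == dist[current] + 1:
                sigma[neighbor] += sigma[current]
    return dist, sigma


def betweenness(graph, node):
    """Sum over unordered pairs (s, t) of the share of shortest s-t paths through `node`."""
    info = {s: bfs_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[node]
    total = 0.0
    others = [n for n in graph if n != node]
    for s, t in combinations(others, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or node not in dist_s:
            continue  # s cannot reach t, or cannot reach the node
        # The node lies on a shortest s-t path iff the distances add up.
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return total


# Hypothetical graphs (not from the slides): a star and a 4-cycle.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

print(betweenness(star, "c"))
print(betweenness(cycle, "b"))
```

This brute-force version runs one breadth-first search per node and is only meant for small graphs.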
Nodes centrality: Betweenness [Figure: example graph with nodes numbered 1 to 13; label: Reach attenuation factor]
Talk outline • Nodes centrality • Machine Learning • The learning process • Classification • Evaluation
Machine Learning • Herbert Alexander Simon: “Learning is any process by which a system improves performance from experience.” • “Machine Learning is concerned with computer programs that automatically improve their performance through experience.” [Photo: Herbert Simon, Turing Award 1975, Nobel Prize in Economics 1978]
Machine Learning • Learning = Improving with experience at some task • Improve over task T, • With respect to performance measure, P • Based on experience, E.
Machine Learning • Example: Spam Filtering • T: Identify spam emails • P: • % of spam emails that were filtered • % of ham (non-spam) emails that were incorrectly filtered out • E: a database of emails that were labelled by users, i.e. feedback on emails: • “Move to Spam”, “Move to Inbox”
Machine Learning Applications?
Machine Learning: The learning process [Diagram: Model Learning, Model Testing]
Machine Learning: The learning process [Diagram: Email Server, Model Learning, Model Testing] ● Content of the email ● Number of recipients ● Size of message ● Number of attachments ● Number of "re's" in the subject line ● …
Machine Learning: The learning process • From e-mails to feature vectors: • Textual-Based Content Features: • Email is tokenized • Each token is a feature • Meta-Features: • Number of recipients • Size of message
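As a rough illustration of turning an email into a feature vector, the sketch below tokenizes the body into binary token features and adds two meta-features. The tokenization rule, the vocabulary, the feature names and the function `email_to_features` are hypothetical choices, not the lecture's own pipeline.

```python
import re


def email_to_features(body, recipients, vocabulary):
    """Map one email to a feature dictionary.

    Content features: one binary feature per vocabulary token (1 if present).
    Meta-features: number of recipients and message size in characters.
    """
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    tokens = set(re.findall(r"[a-z0-9']+", body.lower()))
    features = {f"has_{word}": int(word in tokens) for word in vocabulary}
    features["num_recipients"] = len(recipients)
    features["size_chars"] = len(body)
    return features


# Hypothetical email and vocabulary.
example = email_to_features(
    "Win a FREE prize now",
    ["a@example.com", "b@example.com"],
    ["free", "meeting"],
)
print(example)
```

Other meta-features, such as the number of attachments, would be added to the dictionary in the same way.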
Machine Learning: The learning process [Table: instances as rows; binary vocabulary attributes and a target attribute as columns]
Machine Learning: The learning process [Table: instances as rows; input attributes (nominal, ordinal, numeric) and a target attribute as columns]
Machine Learning: Model learning [Diagram: Learner, Classifier]
Machine Learning: Model testing [Diagram: Database, Training Set, Learner]
Machine Learning: Decision trees • Training Data [Table: columns of type categorical, categorical, continuous, and class]
Machine Learning: Decision trees • Model: Decision Tree • Splitting Attributes: Refund, MarSt, TaxInc • Refund = Yes: NO • Refund = No: split on MarSt • MarSt = Married: NO • MarSt = Single, Divorced: split on TaxInc • TaxInc < 80K: NO • TaxInc > 80K: YES
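The finished tree from the slides above (Refund, then MarSt, then TaxInc) can be written as nested conditions. The function name `classify`, its argument names, and the income unit are assumptions. The slides do not say which branch an income of exactly 80K takes, so this sketch arbitrarily sends it to NO.

```python
def classify(refund, marital_status, taxable_income):
    """Walk the slide's decision tree and return the predicted class ("YES" or "NO")."""
    if refund == "Yes":
        return "NO"
    # Refund == "No": split on marital status.
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K falls on the "NO" side; the slides leave it open.
    if taxable_income > 80_000:
        return "YES"
    return "NO"


print(classify("No", "Single", 90_000))
print(classify("Yes", "Divorced", 120_000))
```

A learner would derive such a tree from the training data. The function only shows how the resulting model is applied to one instance.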
Machine Learning: Classification • Binary classification • (Instances, Class labels): (x1, y1), (x2, y2), ..., (xn, yn) • Each yi takes a value in {1, -1} • Classifier: provides class prediction Ŷ for an instance • Outcomes for a prediction: [Table: true class vs. predicted class]
Machine Learning: Classification • P(Ŷ = Y): accuracy • P(Ŷ = 1 | Y = 1): true positive rate • P(Ŷ = 1 | Y = -1): false positive rate • P(Y = 1 | Ŷ = 1): precision [Table: true class vs. predicted class]
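The four quantities above can be estimated from labelled predictions by counting outcomes. This is a minimal sketch with labels in {1, -1}. The function names `ratio` and `classification_rates` and the sample labels are hypothetical, and a rate is reported as `None` when its denominator is zero.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_rates(y_true, y_pred):
    """Empirical accuracy, true positive rate, false positive rate and precision."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)    # true positives
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)   # false positives
    tn = sum(1 for y, p in pairs if y == -1 and p == -1)  # true negatives
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)   # false negatives
    return {
        "accuracy": ratio(tp + tn, len(pairs)),      # P(Ŷ = Y)
        "true_positive_rate": ratio(tp, tp + fn),    # P(Ŷ = 1 | Y = 1)
        "false_positive_rate": ratio(fp, fp + tn),   # P(Ŷ = 1 | Y = -1)
        "precision": ratio(tp, tp + fp),             # P(Y = 1 | Ŷ = 1)
    }


# Hypothetical true labels and predictions.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
rates = classification_rates(y_true, y_pred)
print(rates)
```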
Machine Learning: Classification • Consider a diagnostic test for a disease • The test has 2 possible outcomes: • ‘positive’ = suggesting presence of disease • ‘negative’ • An individual can test either positive or negative for the disease
Machine Learning: Classification [Figure: distributions of test results for individuals without the disease and individuals with the disease]
Machine Learning: Classification [Figure: test-result axis with one region labelled Call these patients “negative” and one labelled Call these patients “positive”]
Machine Learning: Classification • True Positives [Figure: same plot, with and without the disease, True Positives region highlighted]
Machine Learning: Classification • False Positives [Figure: same plot, with and without the disease, False Positives region highlighted]
Machine Learning: Classification • True Negatives [Figure: same plot, with and without the disease, True Negatives region highlighted]
Machine Learning: Classification • False Negatives [Figure: same plot, with and without the disease, False Negatives region highlighted]
Machine Learning: Cross-Validation • What if we don’t have enough data to set aside a test dataset? • Cross-Validation: • Each data point is used both as training and test data. • Basic idea: • Fit the model on 90% of the data; test on the other 10%. • Now do this on a different 90/10 split. • Cycle through all 10 cases. • 10 “folds” is a common rule of thumb.
Machine Learning: Cross-Validation • Divide data into 10 equal pieces P1…P10. • Fit 10 models, each on 90% of the data. • Each data point is treated as an out-of-sample data point by exactly one of the models.
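The 10-fold scheme above can be sketched as an index generator. The name `k_fold_indices` is hypothetical, the pieces are formed by striding without shuffling (an assumption; real data is usually shuffled first), and when the number of points is not divisible by k the pieces differ in size by at most one.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each index is in exactly one test list."""
    indices = list(range(n))
    # Piece i takes every k-th index starting at i, giving near-equal pieces.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 20 hypothetical data points, 10 folds: each model trains on 18 and tests on 2.
splits = list(k_fold_indices(20, k=10))
for train, test in splits:
    print(len(train), test)
```

Fitting one model per `(train, test)` pair and averaging the test scores gives the cross-validated estimate.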