EVENT IDENTIFICATION IN SOCIAL MEDIA
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
Social Media Sites Host Many “Event” Documents
• “Event” = something that occurs at a certain time in a certain place [Yang et al. ’99]
• Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade
• Smaller events, without traditional news coverage: local food drive, street fair
• …
Photo-sharing: Flickr; video-sharing: YouTube; social networking: Facebook
[Figure: social media documents for the “All Points West” festival, Liberty State Park, New Jersey, 8/8/08]
Identifying Events and Associated Social Media Documents
• Applications
  • Event search and browsing
  • Local search
  • …
• General approach: group similar documents via clustering. Each cluster corresponds to one event and its associated social media documents.
Event Identification: Challenges
• Uneven data quality
  • Missing, short, uninformative text
  • … but revealing structured context is available: tags, date/time, geo-coordinates
• Scalability
  • Dynamic data stream of event information
• Unknown number of events
  • A required input for many clustering algorithms
  • Difficult to estimate
Clustering Social Media Documents
• Social media document representation
• Social media document similarity
• Social media document clustering
  • Clustering task: definition
  • Ensemble algorithm: combining multiple clustering results
• Preliminary evaluation
Social Media Document Representation
• Document features: Title, Description, Tags, Date/Time, Location, All-Text
Social Media Document Similarity
• Text: tf-idf weights, cosine similarity
• Time: proximity in minutes
• Location: geo-coordinate proximity
[Figure: per-feature similarity measures: Title, Description, Tags, Date/Time-Keywords, Location-Keywords, Date/Time-Proximity, Location-Proximity, All-Text]
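The three similarity types on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the idf smoothing, the linear decay for time and location, the one-year time window, and the `max_km` cutoff are all assumptions, since the slides only name the measures.

```python
import math
from collections import Counter

MINUTES_PER_YEAR = 365 * 24 * 60  # assumed time window, not taken from the slides


def tfidf_vectors(token_lists):
    """Return one sparse tf-idf vector (dict) per tokenized document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        # "+ 1.0" keeps terms that occur in every document from vanishing (assumed smoothing)
        vectors.append({
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_similarity(minutes_a, minutes_b, window=MINUTES_PER_YEAR):
    """Linear decay from 1.0 (same minute) to 0.0 (a full window or more apart)."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window)


def location_similarity(point_a, point_b, max_km=50.0):
    """Linear decay on haversine distance; points are (lat, lon) in degrees.

    max_km is a hypothetical cutoff chosen for illustration.
    """
    lat_a, lon_a, lat_b, lon_b = map(math.radians, (*point_a, *point_b))
    half_chord = (math.sin((lat_b - lat_a) / 2) ** 2
                  + math.cos(lat_a) * math.cos(lat_b)
                  * math.sin((lon_b - lon_a) / 2) ** 2)
    distance_km = 2 * 6371.0 * math.asin(math.sqrt(min(1.0, half_chord)))
    return max(0.0, 1.0 - distance_km / max_km)
```

Each function returns a score in [0, 1], so scores from different features can be compared or combined on a common scale.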
Social Media Document Clustering Framework
[Figure: social media documents → document feature representation → event clusters]
Clustering: Ensemble Algorithm
• Clusterers C_title, C_tags, C_time, …, with weights W_title, W_tags, W_time, … learned in a training step
• Consensus function f(C, W): combine ensemble similarities
• Output: ensemble clustering solution
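One way to read the consensus function f(C, W), following the experimental setup's description of it as a weighted average of the clusterers' binary predictions. The sketch below is illustrative only: the function name, the clusterer assignments, and the weight values are made up.

```python
def consensus_similarity(doc_a, doc_b, clusterings, weights):
    """Weighted average of binary 'same cluster' votes, one vote per clusterer.

    clusterings: {clusterer name: {document id: cluster id}}
    weights:     {clusterer name: weight learned in training}
    """
    total_weight = sum(weights[name] for name in clusterings)
    agreeing_weight = sum(
        weights[name]
        for name, assignment in clusterings.items()
        if assignment[doc_a] == assignment[doc_b]
    )
    return agreeing_weight / total_weight


# Hypothetical cluster assignments and weights for three documents
clusterings = {
    "title": {"d1": 0, "d2": 0, "d3": 1},
    "tags":  {"d1": 0, "d2": 0, "d3": 0},
    "time":  {"d1": 0, "d2": 1, "d3": 1},
}
weights = {"title": 0.5, "tags": 0.9, "time": 0.6}

score = consensus_similarity("d1", "d2", clusterings, weights)  # title and tags agree
```

Here `d1` and `d2` share a cluster under the title and tags clusterers but not under time, so the score is (0.5 + 0.9) / 2.0 = 0.7. The resulting pairwise scores can then feed a final clustering step.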
Clustering: Measuring Quality
• Homogeneous clusters ✔
• Complete clusters ✔
• Metric: Normalized Mutual Information (NMI): shared information between clustering solution and “ground truth”
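A small sketch of how NMI can be computed from two label lists. The slides do not say which normalization is used; this version assumes the arithmetic-mean form, 2·I(C; E) / (H(C) + H(E)), which is one common choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(predicted, truth):
    """Mutual information between two label assignments over the same documents."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    pred_counts, truth_counts = Counter(predicted), Counter(truth)
    return sum(
        (c / n) * math.log(c * n / (pred_counts[p] * truth_counts[t]))
        for (p, t), c in joint.items()
    )


def nmi(predicted, truth):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    h_pred, h_truth = entropy(predicted), entropy(truth)
    if h_pred + h_truth == 0:
        return 1.0  # both assignments put everything in one cluster
    return 2 * mutual_information(predicted, truth) / (h_pred + h_truth)
```

For example, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0 because cluster ids are arbitrary, while `nmi([0, 1, 2, 3], [0, 0, 1, 1])` is about 0.667: every cluster is homogeneous, but each event is split across two clusters, so completeness suffers.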
Experimental Setup
• Data: >270K Flickr photos
  • Event labels from Yahoo!’s “upcoming” event database
  • Split into 3 parts for training/validation/testing
• Clusterers: single-pass algorithm with centroid similarity
• Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
• Consensus function: weighted average of clusterers’ binary predictions
• Final prediction step: single-pass clustering algorithm
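A minimal sketch of a single-pass clusterer with centroid similarity, as named in the setup above. Assumptions: documents are sparse term-weight dicts, similarity is cosine, and the threshold value is hypothetical. Note that the number of clusters is not an input; it falls out of the threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold):
    """Assign each document, in arrival order, to the most similar cluster.

    A document joins the best-matching cluster if the similarity to its
    centroid reaches `threshold`; otherwise it starts a new cluster.
    """
    cluster_sums = []  # per-cluster sum of member vectors
    labels = []
    for vec in vectors:
        best_cluster, best_sim = None, threshold
        for idx, total in enumerate(cluster_sums):
            # Cosine is scale-invariant, so the summed vector gives the same
            # similarity as the mean centroid.
            sim = cosine_similarity(vec, total)
            if sim >= best_sim:
                best_cluster, best_sim = idx, sim
        if best_cluster is None:
            cluster_sums.append(dict(vec))
            labels.append(len(cluster_sums) - 1)
        else:
            for term, weight in vec.items():
                cluster_sums[best_cluster][term] = (
                    cluster_sums[best_cluster].get(term, 0.0) + weight
                )
            labels.append(best_cluster)
    return labels


# Toy documents (made up); 0.5 is a hypothetical threshold
docs = [
    {"music": 1.0, "festival": 1.0},
    {"music": 1.0, "festival": 1.0, "nj": 1.0},
    {"food": 1.0, "drive": 1.0},
    {"food": 1.0},
]
labels = single_pass_cluster(docs, threshold=0.5)
```

Each document is seen once and compared only against the current centroids, which suits a dynamic stream where the number of events is unknown in advance.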
Preliminary Evaluation Results
• Individual clusterer performance
  • Highest NMI: Tags, All-Text
  • Lowest NMI: Description, Title
• Ensemble performance, compared against all individual clusterers
  • Highest overall performance in terms of NMI
  • More homogeneous clusters: each event is spread over fewer clusters
Details in paper
Future Work: Alternative Choices
• Document similarity metric
  • Ensemble approach
    • Weight assignment
    • Choice of clusterers
  • Train a classifier to predict document similarity
    • Features correspond to similarity scores
      • All-text, title, tags, time, location, etc.
      • Numeric values in [0,1]
    • State-of-the-art classifiers: SVM, Logistic Regression, …
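The classifier idea can be sketched as follows: each document pair becomes a vector of per-feature similarity scores in [0, 1], labeled 1 if both documents belong to the same event. This is only an illustration under stated assumptions. It uses a plain gradient-descent logistic regression written from scratch (standing in for a library classifier), and the training pairs are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def same_event_probability(weights, bias, x):
    """Predicted probability that a document pair belongs to the same event."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy, made-up pairs: [all-text, tags, time] similarity scores in [0, 1]
pair_features = [
    [0.9, 0.8, 1.0], [0.7, 0.9, 0.9], [0.8, 0.6, 0.95],  # same event
    [0.1, 0.2, 0.3], [0.2, 0.1, 0.1], [0.3, 0.0, 0.5],   # different events
]
pair_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(pair_features, pair_labels)
```

The predicted probability would play the role of the learned document similarity, replacing the hand-weighted combination of the ensemble approach.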
Future Work: Alternative Choices
• Final clustering step
  • Apply graph partitioning algorithms (requires estimating the number of clusters)
• Evaluation metrics: beyond NMI
• Datasets
  • Flickr, LastFM, YouTube
• Exploit social network connections
Conclusions
• Identified events and their corresponding social media documents
• Proposed a clustering solution
  • Leveraged different representations of social media documents
  • Employed various social media similarity metrics
  • Developed a weighted ensemble clustering approach
• Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs