Using Digg to Save Lives • Or, Using Digg to Find Interesting Events • Presented by: Luis Zaman, Amir Khakpour, and John Felix
Explanation • Digg is a social web-media discovery tool based on user-submitted content. • 1 or 2 submissions a minute • Half-life of “interest” is about a day • Digg aggregates “interesting” content. • But how do we find interesting events and know their themes?
Motivation • The collaborative nature of social media can scour the WWW very thoroughly. • But this generates A LOT of data (you’ll see). • It would be cool to find emergencies or critical situations based on this collaborative media. • Digg's Apple topic seems like a pretty good starting point.
Preprocessing • Digg API • REST API • http://services.digg.com/stories/topic/apple?count=10 • XML response • <?xml version="1.0" encoding="utf-8" ?><users timestamp="1176998598" total="1" offset="0" count="1"> <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" /></users> • Limitations • 100 results per request • 1 hour of time series data • Can’t send requests too fast, or else.
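A minimal sketch of how the paging and parsing described on this slide might look. The `offset` query parameter, the one-second pause, and the helper names `page_urls`, `parse_response`, and `fetch_all` are assumptions for illustration. Only the endpoint, the 100-result limit, and the response attributes come from the slide.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://services.digg.com/stories/topic/apple"  # endpoint shown on the slide
MAX_COUNT = 100      # per-request limit stated on the slide
DELAY_SECONDS = 1.0  # hypothetical pause so requests are not sent too fast


def page_urls(total, count=MAX_COUNT):
    """Build one URL per page; `offset` as a request parameter is an assumption."""
    count = min(count, MAX_COUNT)
    return [BASE_URL + "?" + urlencode({"count": count, "offset": offset})
            for offset in range(0, total, count)]


def parse_response(xml_bytes):
    """Return the paging attributes of the root element and one dict per child."""
    root = ET.fromstring(xml_bytes)
    meta = {key: int(root.attrib[key])
            for key in ("timestamp", "total", "offset", "count")}
    items = [dict(child.attrib) for child in root]
    return meta, items


def fetch_all(total):
    """Download every page slowly. Not called here: it needs a reachable service."""
    items = []
    for url in page_urls(total):
        with urllib.request.urlopen(url) as response:
            items.extend(parse_response(response.read())[1])
        time.sleep(DELAY_SECONDS)
    return items


# The sample response from the slide, shortened by dropping the icon attribute.
SAMPLE = (b'<?xml version="1.0" encoding="utf-8" ?>'
          b'<users timestamp="1176998598" total="1" offset="0" count="1">'
          b'<user name="sbwms" registered="1135702996" profileviews="3104" />'
          b'</users>')
meta, items = parse_response(SAMPLE)
print(meta["total"], items[0]["name"])
```

Keeping URL construction and parsing separate from the network call means both can be checked offline against a saved response.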
Preprocessing • Time Series • Each digg is an event (only 100 at a time) • Rows • Each story’s digg count • Columns • Every hour (2,207 of them from August 08 – November 08) • Clustering • Rows • Each story that was dugg at any point in the time series • Columns • The words in the title and description of the story
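The time-series matrix described above (one row per story, one column per hour) could be built roughly as follows. This is a sketch under assumptions: each digg arrives as a hypothetical `(story_id, unix_timestamp)` pair, and `hourly_counts` is an illustrative name, not part of the original pipeline.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_counts(diggs, start, n_hours):
    """Bin digg events into hourly columns.

    diggs   -- iterable of (story_id, unix_timestamp) pairs (assumed input shape)
    start   -- unix timestamp of the first hour in the series
    n_hours -- number of hourly columns (2,207 in the slides)
    """
    rows = defaultdict(lambda: [0] * n_hours)
    for story_id, timestamp in diggs:
        hour = (timestamp - start) // SECONDS_PER_HOUR
        if 0 <= hour < n_hours:  # drop events outside the observed window
            rows[story_id][hour] += 1
    return dict(rows)


events = [("a", 0), ("a", 3599), ("a", 3600), ("b", 7200), ("b", 999999)]
matrix = hourly_counts(events, start=0, n_hours=3)
print(matrix)
```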
Preprocessing - Challenges • SLOW • Really Dirty Data • Different Formats of Data • REALLY SLOW
Introduction to Document Clustering • Challenges of clustering text documents, unlike structured data: • Volume • Dimensionality • Sparsity • Complex semantics • In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM) • Huge sparse matrix, so we store only the non-zero values • Text documents are converted to a matrix $A_{m,n}$, where, for m documents and n total words (or phrases), each element $x_{i,j}$ is the frequency of the jth term in the ith document.
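A small sketch of the sparse term-frequency representation described above, storing only non-zero entries. The tokenizer (lowercase alphanumeric runs) and the name `build_vsm` are assumptions. The slides do not say how words were split or whether stop words were removed.

```python
import re
from collections import Counter


def build_vsm(docs):
    """Return (vocab, rows): vocab maps term -> column j, rows[i] maps j -> frequency."""
    vocab = {}
    rows = []
    for text in docs:
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        row = {}
        for term, freq in counts.items():
            j = vocab.setdefault(term, len(vocab))  # assign the next free column
            row[j] = freq                           # only non-zero values are stored
        rows.append(row)
    return vocab, rows


docs = ["Apple releases new iPhone", "new Apple store, new city"]
vocab, rows = build_vsm(docs)
nonzero = sum(len(row) for row in rows)
density = nonzero / (len(rows) * len(vocab))
print(len(vocab), nonzero, density)
```

Each row is a dictionary from column index to count, so memory grows with the number of non-zero values rather than with m × n.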
Clustering • Dataset • Number of stories (m): 25470 • Total number of unique words (n): 55557 • Nonzero values: 469323 (about 0.0332% of the m × n entries) • Clustering using the CLUTO software • Using k-means and bisecting k-means • Calculating centroids and SSE • A C++ program is run on “black”
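To illustrate the "centroids and SSE" step, here is a sketch that works directly on sparse rows (dictionaries from column index to value). It is not the C++ program mentioned on the slide. The function names and the use of squared Euclidean distance are assumptions.

```python
from collections import defaultdict


def centroids(rows, labels):
    """Mean vector of each cluster, kept sparse as {column: value}."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        sizes[label] += 1
        for j, value in row.items():
            sums[label][j] += value
    return {label: {j: s / sizes[label] for j, s in sums[label].items()}
            for label in sizes}


def sq_dist(row, center):
    """Squared Euclidean distance between two sparse vectors."""
    columns = set(row) | set(center)
    return sum((row.get(j, 0.0) - center.get(j, 0.0)) ** 2 for j in columns)


def sse(rows, labels):
    """Sum of squared distances from every row to its cluster centroid."""
    centers = centroids(rows, labels)
    return sum(sq_dist(row, centers[label]) for row, label in zip(rows, labels))


rows = [{0: 1.0}, {0: 3.0}, {1: 2.0}]
print(sse(rows, [0, 0, 1]), sse(rows, [0, 0, 0]))
```

A lower SSE for the two-cluster labeling than for the single-cluster one is the kind of comparison used to judge a clustering.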
Document Clustering by Optimizing Criterion Functions • According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it: • Internal Criterion Functions (I) • Maximize the internal similarity function • External Criterion Functions (E) • Minimize the external similarity function • Hybrid Criterion Functions (H) • Maximize a combination of the two
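For reference, one commonly cited set of such functions from Zhao and Karypis is sketched below. Whether these are the exact forms used in this project is an assumption. Here $S_r$ is the $r$-th of $k$ clusters, $n_r$ its size, $C_r$ its centroid, and $C$ the centroid of the whole collection.

```latex
% Internal: maximize the similarity of each document to its own cluster centroid
I_2 = \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r)

% External: minimize the similarity of each cluster centroid to the collection centroid
E_1 = \sum_{r=1}^{k} n_r \cos(C_r, C)

% Hybrid: maximize the ratio of the internal to the external function
H_2 = \frac{I_2}{E_1}
```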
Experiments • SSE for I (K-Means vs Bisecting K-Means)
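For readers unfamiliar with the second method in this comparison, the following is a minimal sketch of bisecting k-means on small dense vectors. It is not CLUTO's implementation: the deterministic seeding (first point plus the point farthest from it) and the rule of always splitting the cluster with the largest SSE are assumptions chosen to keep the example reproducible.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_sse(points):
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, iters=20):
    """Split one cluster in two with a few Lloyd iterations."""
    c0 = points[0]                                  # assumed seeding rule
    c1 = max(points, key=lambda p: sq_dist(p, c0))  # farthest point from c0
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        new0, new1 = mean(left), mean(right)
        if new0 == c0 and new1 == c1:               # assignments have converged
            break
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)
        if cluster_sse(worst) == 0:
            break                                   # nothing left to split
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters


points = [[0.0], [1.0], [10.0], [11.0], [50.0], [51.0]]
clusters = bisecting_kmeans(points, 3)
total_sse = sum(cluster_sse(c) for c in clusters)
print(sorted(clusters), total_sse)
```

Plain k-means would instead place all k centroids at once and refine them together; the per-cluster SSE used here is the same quantity the experiment compares.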
Visualization • What we used • jQuery • JavaScript library for DOM manipulation and Ajax requests • PHP/MySQL • Scripting language and database backend • Google Visualization API • Time Series Graph • Zoomable • Timepedia Chronoscope • Clickable
Conclusions • Success? • Of course we think so • Future Work • Save lives? • Better clustering • Cleaner data • More data • Make it scalable, and dynamic • On-line and on the fly?