Using Digg to Save Lives

Using Digg to Save Lives
Or, Using Digg to Find Interesting Events
Presented by: Luis Zaman, Amir Khakpour, and John Felix

Outline

Explanation
• Digg is a social web-media discovery tool based on user-submitted content.
  • 1 or 2 submissions a minute
  • Half-life of “interest” is about a day
• Digg aggregates “interesting” content.
• But how do we find interesting Events and know their Themes?

Motivation
• The collaborative nature of Social Media can scour the WWW very thoroughly.
• But this generates A LOT of data (you’ll see).
• It would be cool to find emergencies or critical situations based on this collaborative media.
• Apple seems like a pretty good starting point.

Approach

Preprocessing
• Digg API
  • REST API
    • http://services.digg.com/stories/topic/apple?count=10
  • XML response
    • <?xml version="1.0" encoding="utf-8" ?>
      <users timestamp="1176998598" total="1" offset="0" count="1">
        <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" />
      </users>
• Limitations
  • 100 results per request
  • 1 hour of time series data
  • Can’t go fast, or else.
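As a rough illustration of the request/response cycle on this slide, the sketch below builds a request URL and parses the sample XML using only the Python standard library. It makes no network call. The `offset` query parameter and the helper names `build_url` and `parse_users` are assumptions for illustration, not part of the original presentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the slide; the service itself may no longer respond.
BASE_URL = "http://services.digg.com/stories/topic/apple"

def build_url(count=10, offset=0):
    """Build one page request. Passing `offset` as a query parameter is an assumption."""
    return BASE_URL + "?" + urlencode({"count": count, "offset": offset})

# Sample response from the slide, with the stray closing tag removed.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8" ?>
<users timestamp="1176998598" total="1" offset="0" count="1">
  <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png"
        registered="1135702996" profileviews="3104" />
</users>"""

def parse_users(xml_bytes):
    """Return (total, list of user dicts) from a <users> response."""
    root = ET.fromstring(xml_bytes)
    users = []
    for node in root.findall("user"):
        users.append({
            "name": node.get("name"),
            "registered": int(node.get("registered")),
            "profileviews": int(node.get("profileviews")),
        })
    return int(root.get("total")), users

total, users = parse_users(SAMPLE_XML)
print(build_url(count=100))
print(total, users)
```

With at most 100 results per request, a full crawl would presumably page through results by increasing `offset` and pausing between calls to stay under the rate limit.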

Preprocessing
• Time Series
  • Each digg is the event (only 100 at a time)
  • Rows
    • Each story’s digg count
  • Columns
    • Every hour (2,207 of them from August 08 – November 08)
• Clustering
  • Rows
    • Each story that was digged at any point in the time series
  • Columns
    • The words in the title and description of this story
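The time-series layout above (one row per story, one column per hour) could be built roughly as follows. This is a minimal sketch that assumes each digg event arrives as a `(story_id, unix_timestamp)` pair and that the cells hold per-hour counts rather than running totals; neither detail is stated on the slide, and `hourly_counts` is a hypothetical helper.

```python
from collections import defaultdict

def hourly_counts(events, start, n_hours):
    """Bucket digg events into hourly bins.

    events: iterable of (story_id, unix_timestamp) pairs (assumed format)
    start: unix timestamp of the first hour bin
    Returns {story_id: [digg count for each hour]}.
    """
    rows = defaultdict(lambda: [0] * n_hours)
    for story_id, ts in events:
        hour = (ts - start) // 3600
        if 0 <= hour < n_hours:  # ignore events outside the window
            rows[story_id][hour] += 1
    return dict(rows)

events = [("a", 0), ("a", 10), ("a", 3600), ("b", 7300), ("c", -5)]
matrix = hourly_counts(events, start=0, n_hours=3)
print(matrix)
```

For the deck's data, `n_hours` would be 2,207.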

Preprocessing - Challenges
• SLOW
• Really Dirty Data
• Different Formats of Data
• REALLY SLOW

Introduction to Document Clustering
• Unlike structured data, text documents pose several clustering challenges:
  • Volume
  • Dimensionality
  • Sparsity
  • Complex semantics
• In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM)
  • Huge sparse matrix, so we store only the non-zero values
• Text documents are converted to a matrix $A_{m,n}$ for $m$ documents and a total of $n$ words (or phrases); each element $x_{i,j}$ represents the frequency of the $j$th term in the $i$th document.
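To make the sparse term-frequency representation concrete, here is a small sketch that stores each document as a `{term_index: frequency}` dictionary, so only non-zero entries take up memory. The whitespace tokenizer and the names `build_vsm` and `density` are simplifying assumptions; the original pipeline's tokenization is not described.

```python
from collections import Counter

def build_vsm(docs):
    """Return (rows, vocab): rows[i] maps term index j to the frequency x_ij."""
    vocab = {}
    rows = []
    for text in docs:
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        row = {}
        for term, freq in counts.items():
            j = vocab.setdefault(term, len(vocab))  # assign next free column
            row[j] = freq
        rows.append(row)
    return rows, vocab

def density(rows, n_terms):
    """Fraction of the m x n matrix that is non-zero."""
    nnz = sum(len(r) for r in rows)
    return nnz / (len(rows) * n_terms)

docs = ["apple releases new iphone", "new apple store opens", "apple apple apple"]
rows, vocab = build_vsm(docs)
print(rows)
print(density(rows, len(vocab)))
```

On a real corpus the density drops sharply as the vocabulary grows, which is why storing only the non-zero values matters.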

Clustering
• Dataset
  • Number of stories (m): 25470
  • Total number of unique words (n): 55557
  • Nonzero values: 469323 (0.03214%)
• Clustering using the CLUTO software
  • Using k-means and bisecting k-means
• Calculating centroids and SSE
  • A C++ program is run on “black”
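The centroid and SSE (sum of squared errors) calculation mentioned above can be sketched in a few lines. This uses squared Euclidean distance on dense toy vectors purely for illustration; it is not the presenters' C++ program, and the function names are assumptions.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sse(points, labels):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        mu = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in members)
    return total

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(sse(points, [0, 0, 1, 1]))  # two tight clusters
print(sse(points, [0, 0, 0, 0]))  # everything in one cluster
```

A lower SSE means points sit closer to their centroids, so the two-cluster labelling scores far better than the single-cluster one.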

Document Clustering by Optimizing Criterion Functions
• According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it:
  • Internal Criterion Functions (I)
    • Maximizing the internal similarity function
  • External Criterion Functions (E)
    • Minimizing the external similarity function
  • Hybrid Criterion Functions (H)
    • Maximizing
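The slide does not say which specific functions were used. As a hedged illustration, the sketch below computes three criterion functions as they are commonly stated in Zhao and Karypis's work: I2 (sum of the norms of each cluster's composite vector), E1 (each cluster's similarity to the whole collection, weighted by cluster size), and the hybrid H2 = I2 / E1. Choosing these three, and the function names, are assumptions. Documents are assumed to be non-zero vectors, scaled here to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def composite(vectors):
    """Component-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vectors)]

def criteria(docs, labels):
    """Return (I2, E1, H2) for a given cluster labelling."""
    docs = [unit(d) for d in docs]
    D = composite(docs)  # composite vector of the whole collection
    clusters = {}
    for d, c in zip(docs, labels):
        clusters.setdefault(c, []).append(d)
    i2 = e1 = 0.0
    for members in clusters.values():
        Dr = composite(members)
        norm = math.sqrt(dot(Dr, Dr))
        i2 += norm                              # internal: maximize
        e1 += len(members) * dot(Dr, D) / norm  # external: minimize
    return i2, e1, i2 / e1

docs = [[1, 0], [2, 0], [0, 1], [0, 3]]
good = criteria(docs, [0, 0, 1, 1])
bad = criteria(docs, [0, 1, 0, 1])
print(good)
print(bad)
```

On this toy data, the labelling that groups similar documents together has the higher I2 and the lower E1, which is the behaviour the internal and external criteria are meant to reward.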

Experiments • SSE for I (K-Means vs Bisecting K-Means)
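A comparison like this one can be sketched on toy data. The code below is a minimal pure-Python version of k-means (Lloyd's algorithm) and bisecting k-means, using Euclidean distance and a deterministic farthest-first initialization. All of these are simplifying assumptions; CLUTO's implementations and the presenters' actual experimental setup differ.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def init_centroids(points, k):
    """Farthest-first initialization (deterministic, for illustration only)."""
    chosen = [list(points[0])]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(dist2(p, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen

def kmeans(points, k, iters=20):
    """Lloyd's algorithm; returns the non-empty clusters as lists of points."""
    centroids = init_centroids(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[idx].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def cluster_sse(cluster):
    mu = mean(cluster)
    return sum(dist2(p, mu) for p in cluster)

def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE into two."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)
        if cluster_sse(worst) == 0:  # nothing left to split
            break
        parts = kmeans(worst, 2)
        if len(parts) < 2:           # split failed; stop rather than loop
            break
        clusters.remove(worst)
        clusters.extend(parts)
    return clusters

def total_sse(clusters):
    return sum(cluster_sse(c) for c in clusters)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
flat = kmeans(points, 3)
bisected = bisecting_kmeans(points, 3)
print(total_sse(flat), total_sse(bisected))
```

On well-separated toy blobs both methods recover the same clusters. On real, noisy data their SSE values would generally differ, which is what an experiment of this kind measures.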

Visualization
• What we used
  • jQuery
    • JavaScript library for DOM manipulation and Ajax requests
  • PHP/MySQL
    • Scripting language and database backend
  • Google Visualization API
    • Time Series Graph
    • Zoomable
  • Timepedia Chronoscope
    • Clickable

Conclusions
• Success?
  • Of course we think so
• Future Work
  • Save lives?
  • Better clustering
  • Cleaner data
  • More data
  • Make it scalable, and dynamic
  • On-line and on the fly?
