On Frequent Chatters Mining
Claudio Lucchese
1st HPC Lab Workshop
1st HPC Workshop - Claudio Lucchese
Frequent Patterns Mining
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010.
How many patterns do you see in the following dataset?
Frequent Patterns Mining
Frequent Patterns Mining
• Usually rows and columns are not in a “good-looking” order
State of the art
• Most recent approaches try to discover the top-k patterns that optimize different cost functions:
  • Minimize noise (“holes”), or
  • Minimize MDL:
    • encoding(Patterns) + encoding(Data|Patterns)
  • Maximize Information Ratio:
    • Number of bits of information w.r.t. the Maximum Entropy Model built on the basis of the row and column marginal distributions
  • Minimize the length of the patterns and the amount of noise (our approach)
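The last cost function above can be illustrated with a minimal sketch. It assumes that a pattern is a pair (set of rows, set of columns), that its length is the number of rows plus the number of columns, and that noise is the number of cells where the data disagrees with the union of the patterns. The function name `pattern_cost` and these exact definitions are illustrative assumptions, not necessarily the formulation used in the paper.

```python
def pattern_cost(data, patterns):
    """Return (pattern_length, noise) for a binary matrix and a pattern set.

    data:     list of rows, each a list of 0/1 values
    patterns: list of (rows, cols) pairs, each a set of indices
    Assumed definitions (hypothetical, for illustration only):
      length = sum over patterns of (#rows + #cols)
      noise  = number of cells where data differs from the cells
               covered by at least one pattern
    """
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0

    # Mark every cell covered by some pattern.
    covered = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols in patterns:
        for r in rows:
            for c in cols:
                covered[r][c] = 1

    length = sum(len(rows) + len(cols) for rows, cols in patterns)
    noise = sum(
        data[r][c] != covered[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
    )
    return length, noise


# Two 3x3 blocks, the first with one "hole" at row 1, column 1.
data = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
patterns = [({0, 1, 2}, {0, 1, 2}), ({3, 4, 5}, {3, 4, 5})]

print(pattern_cost(data, patterns))  # (12, 1): total cost 13
print(pattern_cost(data, []))        # (0, 17): every 1 counts as noise
```

Under these assumptions the two-pattern description (cost 13) is cheaper than describing all 17 ones as noise, which is the trade-off the cost function is meant to capture.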
Evaluation
• Unsupervised:
  • Measure how well the proposed algorithm optimizes the proposed cost function
  • What is the best cost function?
• We are investigating supervised measures:
  • Unsupervised extraction: extract patterns from a classification/clustering dataset without using class/cluster label information
  • Supervised evaluation: measure how well the patterns can predict/match the classes/clusters
• Preliminary result:
  • Fancy cost functions might not be the best ones
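One way to make the supervised evaluation concrete is a purity score: patterns are extracted without labels, and afterwards each pattern is scored by the share of its rows that belong to its majority class. This is only a sketch of one possible measure. The slides do not say which measures are being investigated, and the names `purity` and `weighted_purity` are hypothetical.

```python
from collections import Counter


def purity(pattern_rows, labels):
    """Fraction of a pattern's rows that belong to its majority class."""
    if not pattern_rows:
        return 0.0
    counts = Counter(labels[r] for r in pattern_rows)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(pattern_rows)


def weighted_purity(patterns, labels):
    """Average purity over patterns, weighted by pattern size (in rows)."""
    total = sum(len(rows) for rows in patterns)
    if total == 0:
        return 0.0
    return sum(len(rows) * purity(rows, labels) for rows in patterns) / total


# Labels are used only here, after the (unsupervised) extraction.
labels = ["a", "a", "a", "b", "b", "b"]
patterns = [{0, 1, 2}, {2, 3, 4, 5}]  # row sets of two extracted patterns

print(purity(patterns[0], labels))  # 1.0
print(purity(patterns[1], labels))  # 0.75
print(weighted_purity(patterns, labels))
```

A score close to 1 means the extracted patterns line up with the hidden classes, which allows different cost functions to be compared on the same labeled dataset.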
Information Overload in News
Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.
Can we exploit Twitter?
• Timeliness
• Personalization
[Figure: number of mentions of “Osama Bin Laden”]
News Get Old Soon
• 90% of the clicks happen within 2 days of publication
• Only a few occur early!
T.Rex (Twitter-based news recommendation system)
• Builds a user model from Twitter
  • Signals from user-generated content, social neighbors, and popularity across Twitter and news
  • Entity-based representation (overcomes vocabulary mismatch)
• Learns a personalized news ranking function:
  • Pick candidates from a pool of related or popular fresh news, rank them, and present the top-k to the user
Recommendation Model
• The ranking function is user- and time-dependent
• Social model + Content model + Popularity model
  • The popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting)
  • The content model measures the relatedness of the bag-of-entities representations of a user’s tweet stream and of a news article
  • The social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network
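The three components above can be sketched as follows. Everything specific here is an assumption made for illustration: the half-life parameterization of the forgetting, the dot product used as relatedness, the precomputed `neighbor_weights` standing in for truncated PageRank values, and the linear combination with a `weights` triple. The names `DecayingCounter`, `relatedness`, and `score` are hypothetical.

```python
import math


class DecayingCounter:
    """Per-entity mention counts with exponential forgetting (assumed form)."""

    def __init__(self, half_life):
        self.decay = math.log(2) / half_life  # a count halves every half_life
        self.value = {}
        self.last = {}

    def get(self, entity, t):
        """Decayed count of `entity` at time t (0.0 if never seen)."""
        if entity not in self.last:
            return 0.0
        elapsed = t - self.last[entity]
        return self.value[entity] * math.exp(-self.decay * elapsed)

    def add(self, entity, t, amount=1.0):
        """Record a mention at time t on top of the decayed old count."""
        self.value[entity] = self.get(entity, t) + amount
        self.last[entity] = t


def relatedness(bag_a, bag_b):
    """Dot product of two bag-of-entities dicts (entity -> weight)."""
    return sum(w * bag_b.get(e, 0.0) for e, w in bag_a.items())


def score(user_bag, neighbor_bags, neighbor_weights, article_bag,
          popularity, now, weights=(1.0, 1.0, 1.0)):
    """Assumed linear mix: social + content + popularity."""
    content = relatedness(user_bag, article_bag)
    # neighbor_weights is a stand-in for truncated PageRank values.
    social = sum(
        neighbor_weights[v] * relatedness(bag, article_bag)
        for v, bag in neighbor_bags.items()
    )
    pop = sum(w * popularity.get(e, now) for e, w in article_bag.items())
    w_social, w_content, w_pop = weights
    return w_social * social + w_content * content + w_pop * pop


popularity = DecayingCounter(half_life=1.0)
popularity.add("EU", t=0.0)

s = score(
    user_bag={"Monti": 1.0},
    neighbor_bags={"v": {"EU": 2.0}},
    neighbor_weights={"v": 0.5},
    article_bag={"Monti": 1.0, "EU": 1.0},
    popularity=popularity,
    now=1.0,
)
print(s)  # content 1.0 + social 1.0 + popularity about 0.5
```

Each component only needs counters that are updated as tweets and articles arrive, which is consistent with a streaming design.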
System Overview
• Designed to be streaming and lightweight (just counting)
• User model is updated continuously
Learning the Weights
• Learning-to-rank approach with SVM
• Each time the user clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news)
• Prune the number of constraints for scalability:
  • only news published in the last 2 days
  • only the top-k news for each ranking component
• Can optionally include additional features for news articles:
  • click count, age, etc. (T.Rex+)
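The constraint generation and pruning described above might look like the sketch below. The data layout (a dict with `published` and `scores` fields), the function name `preference_pairs`, and time measured in days are assumptions made for illustration. The difference vectors are the usual pairwise transform that a linear ranking SVM can be trained on; the training step itself is omitted.

```python
def preference_pairs(clicked_id, news, now, k, max_age_days=2.0):
    """Build pruned (clicked > non-clicked) preference constraints.

    news: dict mapping news id -> {"published": time in days,
                                   "scores": tuple of component scores}
          (a hypothetical layout, one score per ranking component)
    Returns a list of (other_id, difference_vector) pairs, where the
    difference is scores(clicked) - scores(other).
    """
    # Pruning 1: keep only news published in the last max_age_days.
    fresh = {
        i: n for i, n in news.items()
        if now - n["published"] <= max_age_days
    }

    # Pruning 2: keep only the top-k news for each ranking component.
    n_components = len(news[clicked_id]["scores"])
    kept = set()
    for c in range(n_components):
        ranked = sorted(fresh, key=lambda i: fresh[i]["scores"][c],
                        reverse=True)
        kept.update(ranked[:k])
    kept.discard(clicked_id)  # the clicked item is not its own competitor

    clicked = news[clicked_id]["scores"]
    return [
        (other, tuple(a - b for a, b in zip(clicked, fresh[other]["scores"])))
        for other in sorted(kept)
    ]


news = {
    "a": {"published": 9.5, "scores": (0.9, 0.1, 0.2)},
    "b": {"published": 9.0, "scores": (0.2, 0.8, 0.1)},
    "c": {"published": 9.8, "scores": (0.1, 0.2, 0.9)},
    "d": {"published": 5.0, "scores": (1.0, 1.0, 1.0)},  # too old
    "e": {"published": 9.9, "scores": (0.05, 0.05, 0.05)},  # never top-1
}

for other, diff in preference_pairs("a", news, now=10.0, k=1):
    print("a >", other, diff)
```

In this toy run only "b" and "c" yield constraints: "d" is older than two days and "e" never reaches the top-k of any component, so the number of pairs per click stays bounded by k times the number of components.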
Predicting Clicked News
• User-generated content is a very good predictor, albeit very sparse
• Click count is a strong baseline but does not help T.Rex+
Predicting Clicked Entities
Future work (?)
• Explain a set of news by showing how the main topics interacted with each other over time.
Future work (?)
• Example: European sovereign-debt crisis
[Figure labels: Fiscal Compact, EuroBond, Berlusconi, Obama, New Italian government, Monti, Merkel, Loan, EU, France, Greece; axis: time]
Future work (?)
• Applications:
  • Given the news the user is currently reading, provide an explanation of the related facts that precede that news
  • Given a query, provide an explanation of the documents related to that query
  • Given a set of topics, explain their relations over time
  • Browse a collection of news by changing the topics of interest, the time window, and the granularity
Future work (?)
• A topic is a named entity relevant over time
• An interaction is a cluster of news related to some event and relevant in a small time window
• It might be important to cover the given time window, but recent events might be more interesting
Future work (?)
• Given a maximum number of main topics and interactions, maximize:
  • Topic coverage and diversity
  • Event time coverage
  • Cluster similarity
  • Main topic connectivity
Future work (?)
• It is different from news clustering:
  • Even if you had a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user
• There is some interesting related work aimed at finding chains of news; we are more interested in topic evolution
Thank you!