Research Updates He Xiangnan (PhD student) 11/2/2012
Research Topic • General topic: Leveraging UGC in Web 2.0 to improve IR-related tasks • Current task: • Leveraging user comments to enable popularity-aware ranking of items in Web 2.0
Popularity-aware ranking • Based on the current state of items, rank items to reflect their popularity in the future. • Motivation for popularity-aware ranking: • Unequal distribution of popularity • Large temporal dynamics of popularity • A ranking that forecasts items' future popularity will improve the user experience, especially for time-sensitive queries. Examples:
Examples (I) • Searched "nba" on YouTube on the night of 6/22/2012 (the NBA Finals game was that morning) • None of the top results are about the Miami Heat's championship
Examples (II) • Searched "nobel prize China" on 10/12/2012 via a Google search restricted to the YouTube domain • None of the top results are about Mo Yan's Nobel Prize in Literature
Challenges • Intuitive way: • Use items' view histories, treat them as time series, and predict future values • Difficulties: • View histories are difficult to obtain and maintain (expensive) • Traditional time-series prediction approaches tend to fail when items experience bursts • My proposal: • Leverage the user comments
Observation in YouTube • The comment history is highly correlated with the view history
Pre-Analysis (I) • YouTube dataset (14,509 videos from ten queries) • Pearson correlation between comment history and view history: more than 80% of videos have a correlation above 0.5 • Conclusion: the comment history is highly correlated with the view history!
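As a rough illustration of how such a per-video correlation could be computed (a sketch only, assuming each video's comment and view histories are available as aligned daily-count series; the original analysis pipeline is not shown in the slides):

from scipy.stats import pearsonr

def comment_view_correlation(comment_counts, view_counts):
    # Pearson correlation between two aligned daily-count series of one video
    r, _p_value = pearsonr(comment_counts, view_counts)
    return r

# Hypothetical usage over a collection of (comment_series, view_series) pairs:
# correlations = [comment_view_correlation(c, v) for c, v in videos]
# frac_high = sum(r > 0.5 for r in correlations) / len(correlations)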
Pre-Analysis (II) • Have shown the tight correlation between comments and views • A natural question: how well do past comments reflect future comments? • Autocorrelation of a series: • Measures the correlation of a time series with itself at different distances apart (lags) • Autocorrelation@lag k is the correlation of the series {x_1, ..., x_{n-k}} and {x_{k+1}, ..., x_n}
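A minimal sketch of Autocorrelation@lag k following the definition above (assuming the series is a plain list or array of per-day comment counts):

import numpy as np

def autocorrelation(series, k):
    # Correlation of {x_1, ..., x_{n-k}} with {x_{k+1}, ..., x_n}
    x = np.asarray(series, dtype=float)
    n = len(x)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(series)")
    return np.corrcoef(x[: n - k], x[k:])[0, 1]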
Results of Autocorrelation of Comment Series • The comment series exhibit a short-term correlation (r_1 is large and r_k decreases quickly with k) • Conclusion: recent comments carry most of the predictive signal about the future, and this predictive ability decreases with time.
Intuitions • The more comments an item has, the more popular it is. • Each comment contributes to the item's popularity (or importance). • Different users' commenting behaviors have different influence on an item's popularity. • Social interfaces in Web 2.0 systems: • The more active a user is, the more influence the user has. • The more popular the items a user comments on, the more influential the user is.
User-Item Temporal Bipartite Graph Model • The edge weight (a decay function of time) • The weight matrix of the graph
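The exact formulas from this slide are not reproduced in the transcript; the sketch below assumes an exponential decay of each comment's age and a simple user-by-item weight matrix, which may differ from the actual definitions:

import numpy as np

def edge_weight(t_now, t_comment, decay_rate=0.1):
    # Assumed form: a user->item edge's weight decays exponentially with the comment's age
    return np.exp(-decay_rate * (t_now - t_comment))

def build_weight_matrix(comments, n_users, n_items, t_now, decay_rate=0.1):
    # comments: iterable of (user_id, item_id, timestamp) triples
    W = np.zeros((n_users, n_items))
    for user_id, item_id, t_comment in comments:
        W[user_id, item_id] += edge_weight(t_now, t_comment, decay_rate)
    return W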
Random Walk Process (I) • Transition matrix • The natural iterative process (HITS-style) • Problem: if the graph is sparse and disconnected, the process can get trapped in local optima.
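A sketch of the HITS-style iteration on the bipartite weight matrix W (my reconstruction under the weight-matrix assumption above, not necessarily the exact update from the slide):

import numpy as np

def hits_iterate(W, n_iter=50, tol=1e-8):
    # Alternately propagate scores between users (rows of W) and items (columns of W)
    n_users, n_items = W.shape
    user_score = np.ones(n_users) / n_users
    item_score = np.ones(n_items) / n_items
    for _ in range(n_iter):
        new_item = W.T @ user_score                # items gather weight from their commenters
        new_user = W @ item_score                  # users gather weight from commented items
        new_item /= max(new_item.sum(), 1e-12)     # L1-normalize to keep scores bounded
        new_user /= max(new_user.sum(), 1e-12)
        converged = np.abs(new_item - item_score).sum() < tol
        user_score, item_score = new_user, new_item
        if converged:
            break
    return user_score, item_score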
Random Walk Process (II) • Add smoothing to avoid the local-optima case • The process on the bipartite graph can be converted into a random walk on a homogeneous-node graph, and it converges (proof omitted).
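One possible way to realize the smoothing (an assumption on my part, in the spirit of PageRank's teleportation term; alpha is a hypothetical mixing parameter not given on the slide):

def smoothed_step(W, user_score, item_score, alpha=0.85):
    # Mix the propagated scores with a uniform prior so sparse or disconnected parts still receive mass
    n_users, n_items = W.shape
    new_item = alpha * (W.T @ user_score) + (1 - alpha) / n_items
    new_user = alpha * (W @ item_score) + (1 - alpha) / n_users
    return new_user / new_user.sum(), new_item / new_item.sum()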
Experiments • Crawled 3 datasets (~20k items) to give a comprehensive evaluation of the performance on general Web 2.0 systems. • The full experiments are not finished yet; here I show an experimental result on Last.fm.
Preparation • 2 time points: • 2012.10.19 (t0) • 2012.10.22 (t1) • Ground truth is the #views gained between t0 and t1 • Compared methods: • Comm_Oracle: #comments received between t0 and t1 • VC: View Count on day t0 • CCP: Comment Count in the Past 3 days before t0 • Sum_Tscore: the sum of all comments' contributions (all users have the same weight) • TPR: my approach
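A hedged sketch of the two simplest baselines above (assuming per-item daily counts indexed by integer day; the Sum_Tscore and TPR scoring code is omitted):

def vc_score(views_by_day, t0):
    # VC: view count on day t0
    return views_by_day.get(t0, 0)

def ccp_score(comments_by_day, t0, window=3):
    # CCP: total comment count over the `window` days up to and including t0 (assumed interpretation)
    return sum(comments_by_day.get(t0 - d, 0) for d in range(window))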
Split by popularity • Preparation: • Sort all items by #views (large -> small) • Split the 17,000 items into 5 folds of equal size • Evaluate each fold • Report the average performance over all folds
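A small sketch of this fold split (assuming item ids and their view counts are available as parallel arrays):

import numpy as np

def split_into_folds(item_ids, view_counts, n_folds=5):
    # Sort items by view count, most-viewed first, then cut into equal-sized folds
    order = np.argsort(view_counts)[::-1]
    sorted_ids = np.asarray(item_ids)[order]
    return np.array_split(sorted_ids, n_folds)   # fold 1 holds the most popular items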
Average Performance of the Split Folds • Observation: for the 1st fold, VC is the best; for folds 2-5, TPR is the best. • Possible reason: in the Last.fm dataset, previously extremely popular artists, such as The Beatles and Muse, still attract many visits without attracting many new comments.
To do... • Finish the experiments on the other datasets. • Refine the approach for different types of data, e.g.: • For extremely popular but old items, use a personalized vector to introduce a bias.
Questions && Suggestions? Thanks!