Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model
Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos
PAKDD, 15-17 April 2013, Gold Coast, Australia
Questions we answer
• Patterns: If Bob executes task x n_x times, how many times does he execute task y?
• Modeling: Which 2-d distribution fits 2-d clouds of points?
[Figure: 2-d cloud of points over two task counts; example point ‘Smith’ (100 calls, 700 sms)]
Danai Koutra (CMU)
Let’s peek... at our contributions
• Patterns:
  • power laws between competing tasks
  • log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical Use: spot outliers; what-if scenarios
[Figure: ln(tweets) vs. ln(comments)]
Roadmap
• Data
• Observed Patterns
• Related Work
• Proposed Distribution
• Goodness of Fit
• Conclusions
Data 1: Tencent Weibo
• micro-blogging website in China
• 2.2 million users
• Tasks extracted:
  • Tweets
  • Retweets
  • Comments
  • Mentions
  • Followees
Data 2: Phonecall Dataset
• phone-call records
• 3.1 million users
• Tasks extracted:
  • Calls
  • Messages
  • Voice friends
  • SMS friends
  • Total minutes of phonecalls
Roadmap
• Data
• Observed Patterns
  • Super Linear Relative Frequency
  • Log-logistic Marginals
• Proposed Distribution
• Goodness of Fit
• Conclusions
Pattern 1 - SuRF: Super Linear Relative Frequency (1)
• Intuition: 2x tweets, 16x retweets
• Logarithmic Binning Fit [Akoglu’10]:
  • 15 log buckets
  • E[Y|X=x] per bucket
  • linear regression on conditional means
[Figure: ln(tweets) vs. ln(retweets); slope 0.23; example point ‘Smith’ (1100 retweets, 7 tweets)]
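The logarithmic binning fit named on this slide (15 log buckets, E[Y|X=x] per bucket, linear regression on the conditional means) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `log_binning_fit`, the use of each bucket's log-midpoint as its x value, and the synthetic data are all assumptions.

```python
import math


def log_binning_fit(xs, ys, n_buckets=15):
    """Return (slope, intercept) of ln(E[Y|X]) against ln(X) over log-spaced buckets."""
    # Logarithms need strictly positive values, so drop everything else.
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0 and y > 0]
    if not pairs:
        raise ValueError("need at least one positive (x, y) pair")
    lo = math.log(min(x for x, _ in pairs))
    hi = math.log(max(x for x, _ in pairs))
    width = (hi - lo) / n_buckets
    if width == 0:
        raise ValueError("all x values are identical")

    # Collect the y values that fall into each logarithmic bucket of x.
    buckets = [[] for _ in range(n_buckets)]
    for x, y in pairs:
        i = min(int((math.log(x) - lo) / width), n_buckets - 1)
        buckets[i].append(y)

    # One point per non-empty bucket: (log-midpoint of the bucket, ln of the mean y).
    # Using the log-midpoint as the bucket's x value is an assumption of this sketch.
    pts = [
        (lo + (i + 0.5) * width, math.log(sum(b) / len(b)))
        for i, b in enumerate(buckets)
        if b
    ]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty buckets")

    # Ordinary least squares on the conditional means.
    n = len(pts)
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic check (made-up data): y = 3 * x^2 on a log-spaced grid,
# so the fitted slope should come out close to 2.
xs = [math.exp(k / 200) for k in range(1500)]
ys = [3 * x ** 2 for x in xs]
slope, intercept = log_binning_fit(xs, ys)
print(round(slope, 2))
```

Fitting on per-bucket means rather than on raw points keeps the dense low-count region from dominating the regression, which is presumably why the slide fits the conditional means.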
Pattern 1 - SuRF (2)
• Intuition: 2x tweets, 4x comments
[Figure: ln(tweets) vs. ln(comments); slope 0.304]
Pattern 1 - SuRF (3)
• Intuition: 2x tweets, 4x mentions
[Figure: ln(tweets) vs. ln(mentions); slope 0.33]
Pattern 1 - SuRF (4)
• Intuition: 2x followees, 16x retweets
[Figure: ln(followees) vs. ln(retweets); slope 0.25]
Pattern 1 - SuRF (5)
• Intuition: super-linearity; more calls, even more minutes
[Figure: ln(total_mins) vs. ln(calls_no); slope 1.18]
Pattern 1 - SuRF (6)
• Intuition: 2x friends, 3x phonecalls
[Figure: ln(voice_friends) vs. ln(calls_no); slope 0.79; annotation: Telemarketers?]
Pattern 1 - SuRF (7)
• Intuition: 2x friends, 5x sms
[Figure: ln(sms_friends) vs. ln(sms_no); slope 0.21]
Contributions revisited (1)
• Patterns:
  • power laws between competing tasks
  • log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical Use: spot outliers; what-if scenarios
[Figure: ln(tweets) vs. ln(comments)]
Pattern 2: log-logistic marginals (1)
• The marginal PDF is NOT a power law
[Figure: marginal PDF, ln(frequency) vs. ln(retweets)]
Pattern 2: log-logistic marginals (2)
• The marginal PDF is NOT a power law
[Figure: marginal PDF, ln(frequency) vs. ln(comments)]
Pattern 2: log-logistic marginals (3)
• The marginal PDF is a power law
• How to capture both?
[Figure: marginal PDF, ln(frequency) vs. ln(mentions)]
Contributions revisited (2)
• Patterns: We observe
  • power law relationships between competing tasks
  • log-logistic distributions for many tasks
• Modeling: We propose the Almond-DG distribution for fitting 2-d real-world datasets
• Practical Use: spot outliers; what-if scenarios
Roadmap
• Data
• Observed Patterns
• Proposed Distribution
  • Problem Definition
  • Almond-DG
  • Background: copulas
• Goodness of Fit
• Conclusions
Problem definition
• Given: a cloud of points
• Find: a 2-d PDF, f(x,y), that captures
  (a) the marginals
  (b) the dependency
Solutions in the Literature?
• Multivariate Logistic [Malik & Abraham, 1973]
• Multivariate Pareto Distribution [Mardia, 1962]
• Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks
• BUT none of them captures both the marginals AND the dependency / correlation!
STEP 1: How to model the marginal distributions?
• A: Log-logistic!
• Q: Why?
• A: Because it
  • mimics Pareto
  • captures the top concavity
  • matches reality
[Figure: marginal PDF, ln(frequency) vs. ln(retweets)]
Reminder: Log-logistic (1) [Background]
• CDF: $F(x; \alpha, \beta) = \dfrac{1}{1 + (x/\alpha)^{-\beta}}$, with $x \ge 0$ and $\alpha, \beta > 0$
• Intuition: the longer you have survived the disease, the longer you can expect to keep surviving
• Not memoryless (✗)
• 2 parameters: scale (α) and shape (β)
Reminder: Log-logistic (2) [Background]
• On log-log scales, the PDF looks like a hyperbola
• By truncating the top concavity, we get a power law
• α = scale parameter = median; β = shape parameter
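The two properties on these slides, that the scale parameter is the median and that the PDF bends into a power law away from the top concavity, can be checked numerically. The sketch below is illustrative only: the function names are made up, the PDF is simply the derivative of the CDF given above, and the parameter values are borrowed from the estimation slide as an example.

```python
import math


def loglogistic_cdf(x, alpha, beta):
    """F(x; alpha, beta) = 1 / (1 + (x / alpha) ** (-beta)) for x > 0, else 0."""
    if x <= 0:
        return 0.0
    return 1.0 / (1.0 + (x / alpha) ** (-beta))


def loglogistic_pdf(x, alpha, beta):
    """Derivative of the CDF above with respect to x (x > 0)."""
    z = (x / alpha) ** beta
    return (beta / x) * z / (1.0 + z) ** 2


def loglog_slope(x, alpha, beta, eps=1e-4):
    """Numerical slope of ln(PDF) against ln(x) at the point x (central difference)."""
    hi = math.log(loglogistic_pdf(x * math.exp(eps), alpha, beta))
    lo = math.log(loglogistic_pdf(x * math.exp(-eps), alpha, beta))
    return (hi - lo) / (2 * eps)


alpha, beta = 2.07, 1.27  # example values only

# The scale parameter is the median: half of the mass lies below alpha.
print(loglogistic_cdf(alpha, alpha, beta))

# Far in the tail the log-log slope approaches -(beta + 1), i.e. a power law.
print(loglog_slope(1e6, alpha, beta), -(beta + 1))
```

The second print compares the numerical tail slope with -(β + 1), which follows from the PDF behaving like a constant times x^(-β-1) for large x.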
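Parameter Estimation: Log-logistic (3) [Background]
• The log-odds plot is linear: with odds = Prob(X ≤ x) / Prob(X > x), the log-logistic CDF gives $\ln(\text{odds}) = \beta \ln x - \beta \ln \alpha$
[Figure: -ln(odds) vs. ln(mentions), real data against theory; α = 2.07, β = 1.27]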
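Because the log-logistic odds Prob(X ≤ x) / Prob(X > x) equal (x/α)^β, ln(odds) is a straight line in ln(x) with slope β and intercept -β ln α, so both parameters can be read off a least-squares line. The following is a rough sketch under that relation; the function names, the (i + 0.5)/n empirical CDF, and the synthetic sample are assumptions, and real count data with many ties would need more care.

```python
import math


def loglogistic_quantile(u, alpha, beta):
    """Inverse CDF: the x with F(x; alpha, beta) = u, for 0 < u < 1."""
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)


def fit_loglogistic_by_log_odds(sample):
    """Estimate (alpha, beta) by least squares on ln(odds) against ln(x)."""
    xs = sorted(x for x in sample if x > 0)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two positive observations")

    # Empirical CDF at the i-th smallest value; the +0.5 keeps it inside (0, 1).
    pts = []
    for i, x in enumerate(xs):
        f = (i + 0.5) / n
        pts.append((math.log(x), math.log(f / (1.0 - f))))

    # Ordinary least squares: ln(odds) = beta * ln(x) - beta * ln(alpha).
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    if sxx == 0:
        raise ValueError("all observations are identical")
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    beta = sxy / sxx
    intercept = my - beta * mx
    alpha = math.exp(-intercept / beta)
    return alpha, beta


# Synthetic sample taken at evenly spaced quantiles (made-up data, not the talk's).
n = 1000
sample = [loglogistic_quantile((i + 0.5) / n, 2.07, 1.27) for i in range(n)]
alpha_hat, beta_hat = fit_loglogistic_by_log_odds(sample)
print(round(alpha_hat, 3), round(beta_hat, 3))
```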
Problem definition
• Given: a cloud of points
• Find: a 2-d PDF, f(x,y), that captures
  (a) the marginals ✔
  (b) the dependency
STEP 2a: How to model the dependency?
• A: we borrow an idea from survival models, financial risk management, and decision analysis
• COPULAS!
Copulas in a nutshell [Background]
• Model dependence between r.v.’s (e.g., X = # of one task, Y = # of another)
• Create a multivariate distribution s.t.:
  • the marginals are preserved
  • the correlation (+, -, none) is captured
STEP 2b: Which copula?
• A: among the many copulas
  • Gaussian
  • Clayton
  • Frank
  • Joe
  • Independence
  • Gumbel
  • …
• Archimedean family (slide callout): explicit formula; 1 parameter
Applications of Gumbel’s copula [Background]
Modeling of:
• the dependence between loss and lawyers’ fees in order to calculate reinsurance premiums
• the rainfall frequency as a joint distribution of volume, peak, duration, etc.
• …
Gumbel’s copula: Example 1 [Background]
• Uniform marginals
• No dependence
Gumbel’s copula: Example 2 [Background]
• Skewed marginals
• No correlation
Gumbel’s copula: Example 3 [Background]
• Skewed marginals
• ρ = 0.7
Problem definition
• Given: a cloud of points
• Find: a 2-d PDF, f(x,y), that captures
  (a) the marginals ✔
  (b) the dependency ✔
Proposed Continuous Distribution: Almond
• $\theta = (1 - \rho)^{-1}$ captures the dependence
• ρ = Spearman’s coefficient
[Figure: example plots for ρ = 0, 0.4, 0.7 with αx = αy = 1, βx = βy = 1, and for ρ = 0, 0.2, 0.7 with αx = 6.5, αy = 2.1, βx = 1.6, βy = 1.27]
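The closed-form expression on this slide did not survive in the text, so the sketch below is an assumption pieced together from the preceding steps: log-logistic marginals (Step 1) joined by Gumbel's copula (Step 2), with θ = (1 - ρ)^(-1) as stated here. The Gumbel copula formula itself is the standard one; the function names and the restriction 0 ≤ ρ < 1 are choices made for this illustration, not the authors' definition.

```python
import math


def loglogistic_cdf(x, alpha, beta):
    """Log-logistic marginal CDF, 1 / (1 + (x / alpha) ** (-beta)) for x > 0."""
    if x <= 0:
        return 0.0
    return 1.0 / (1.0 + (x / alpha) ** (-beta))


def gumbel_copula(u, v, theta):
    """Gumbel copula C(u, v) = exp(-[(-ln u)^theta + (-ln v)^theta]^(1/theta)), theta >= 1."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    # max(0.0, ...) guards against a negative zero when u or v equals 1.
    a = max(0.0, -math.log(u)) ** theta
    b = max(0.0, -math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))


def almond_cdf(x, y, alpha_x, beta_x, alpha_y, beta_y, rho):
    """Joint CDF sketch: Gumbel copula over two log-logistic marginals (assumed form)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch assumes 0 <= rho < 1")
    theta = 1.0 / (1.0 - rho)  # dependence parameter as given on the slide
    u = loglogistic_cdf(x, alpha_x, beta_x)
    v = loglogistic_cdf(y, alpha_y, beta_y)
    return gumbel_copula(u, v, theta)


# With rho = 0 (theta = 1) the copula reduces to independence, C(u, v) = u * v.
print(almond_cdf(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0))

# With rho = 0.7 the joint probability at the two medians exceeds the independent one.
print(almond_cdf(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7))
```

Setting one argument of the copula to 1 returns the other argument unchanged, which is the sense in which the marginals are preserved.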
Proposed Discrete Distribution: Almond-DG
1. We discretize the values of Almond: (floor(X), floor(Y)).
2. We truncate them, i.e., keep only the pairs with X >= 1 and Y >= 1.
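The two steps above amount to a floor followed by a filter. A minimal sketch, assuming the continuous Almond samples are already available as (x, y) pairs; the function name and the input values are hypothetical.

```python
import math


def almond_dg_discretize(points):
    """Floor each continuous (x, y) pair, then keep pairs with x >= 1 and y >= 1."""
    floored = [(math.floor(x), math.floor(y)) for x, y in points]
    return [(x, y) for x, y in floored if x >= 1 and y >= 1]


# Made-up continuous samples; pairs with a coordinate below 1 are dropped.
samples = [(0.4, 3.2), (2.9, 1.0), (5.5, 0.99), (1.0, 1.7)]
print(almond_dg_discretize(samples))
```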
Contributions revisited (3)
• Patterns: We observe
  • power laws between competing tasks
  • log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical Use: spot outliers; what-if scenarios