Studying users behavior in chat rooms DANSS January 25, 2004 Michael Rochkind
Agenda • Motivation • Project goals • What was done • Results • Conclusions
Motivation • Simulations of interactive end users are needed to evaluate algorithms and system designs (e.g., algorithms for estimating multicast group size) • Real data is difficult to obtain, for both technical and administrative reasons • Most researchers use a trace collected from the audio multicast of IETF conference talks in 1996
Problems with the trace • An entire research field is based on a single trace • The trace is quite old (from 1996) • It was collected from one specific type of service (audio conference). The exact nature of the users is unknown, and their behavior is not necessarily the same as in other applications. • It is impossible to validate the data or to collect new data • Relatively little member activity • The percentage of spurious joins/leaves is very high
Statistical analysis of the trace • Different researchers obtained different statistical models for various parameters. • Ammar and Almeroth (the original trace creators) obtained an exponential model for most parameters and a Zipf distribution for long-session stay time. • Aluf, Altman, and Nain recently obtained, from the same long trace, lognormal distributions for both inter-arrival times and stay times. For the short multicast session they obtained Weibull distributions for both inter-arrival and stay times. • A uniform spatial distribution of users was assumed
Project goals • To find a publicly available system that reasonably approximates multicast user behavior. • To develop data-retrieval tools that anyone can run at any time. • To analyze the collected data
Parameters of interest • Inter-arrival time • Session duration (on-time) • Number of logged in users (group size) • Users’ activity (messages, bytes) • Geographical distribution of users • Lifespan of multicast event (for short events) • Comparison with the “famous trace”
First try - message boards (Yahoo) • It is difficult to define a user session: many users send just one message. • Only active users (writers) can be seen • A lot of information is missing (about 50%) • Activity peaks when notable events happen
Chat rooms • The model is similar to a multicast group • Users explicitly join the room and leave it • Join/leave time and stay time are well-defined. • Every message sent to the room is received by all room members
IRC - Internet Relay Chat protocol • Runs over TCP/IP • Text-based teleconferencing • Client-server model • Can run in a distributed fashion • Five big networks with many tens of thousands of users and thousands of channels (rooms)
IRC Servers • Form the backbone of an IRC network • Connected together without cycles (in the form of a spanning tree) • Handle client connections • Each server knows about all other servers and all clients. [Figure: example network with servers S1–S6 and clients C1–C4]
IRC clients • An IRC client is anything connected to an IRC server that is not another IRC server. • Any TCP-enabled device can be an IRC client • Clients are distinguished by a unique nickname • Each IRC server has the following information about each IRC client: • Nickname • Real name of the host where the client is running • Username of the client on that host • IRC server to which the client is connected
IRC Channels • Equivalent to the term “chat room” • A named group of one or more users, all of whom receive messages addressed to that channel. • Created when the first user joins the channel • Ceases to exist when the last user leaves it • In case of a network split, the channel on each side contains only the clients connected to servers on that side. After the network reconnects, the two parts of the channel are merged again.
IRC network example [Figure: servers S1–S6 and clients C1–C4]
IRC message sending [Figure: a message path through servers S1–S6 between clients C1–C4]
IRC – new member joins a channel • Channel X with members C1, C2, C3 • Client C4 joins channel X [Figure: (1) C4 sends "Join X" to its server; (2) the server answers with "names c1, c2, c3"; "join c4" is forwarded along the links between servers S1–S6 and delivered to C1, C2, C3]
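The propagation of "join c4" over a cycle-free server network can be sketched as below. This is a minimal illustration, not the IRC implementation: the links in `tree` are a hypothetical topology (the slide's figure did not survive extraction), and `broadcast` is an illustrative name. Each server forwards the message to every neighbor except the one it came from; because the graph has no cycles, every server receives the message exactly once.

```python
from collections import deque

def broadcast(tree, origin):
    """Return the (sender, receiver) hops made when `origin` floods a
    message over an acyclic server graph given as an adjacency dict."""
    hops = []
    queue = deque([(origin, None)])  # (server, neighbor it heard from)
    while queue:
        node, came_from = queue.popleft()
        for neighbor in sorted(tree[node]):
            if neighbor != came_from:  # never send back where it came from
                hops.append((node, neighbor))
                queue.append((neighbor, node))
    return hops

# Hypothetical spanning tree over the six servers named on the slide.
tree = {
    "S1": {"S2", "S5"},
    "S2": {"S1", "S3", "S4"},
    "S3": {"S2", "S6"},
    "S4": {"S2"},
    "S5": {"S1"},
    "S6": {"S3"},
}

# Assume C4 is attached to S6, so S6 originates the "join c4" message.
hops = broadcast(tree, "S6")
for sender, receiver in hops:
    print(f"{sender} -> {receiver}: join c4")
```

With n servers the message crosses n - 1 links, one per tree edge, which is why a tree needs no duplicate suppression.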
IRC Channel Monitoring • Monitoring client written in Perl, running under cron • We randomly chose 3 channels from among all channels with more than 100 users: #israel, #canada, #bosnia • Channel activity data was collected over a period of about 6 weeks.
Log file format • <time> START • <time> EXIT • <time> JOIN <nickname> <country> • <time> PART|QUIT|KICK <nickname> • <time> PUBLIC <nick> <size> <country> • <time> NICK <old nick> <new nick> • <time> NAMES <list of nicks>
1053586971 START
1053587032 JOIN wponiw IL
1053587032 NAMES wponiw Teo_ i-NA mr_shark ^_kNibAL_ kaye_22 Old-Man^ CHA_555 klent Leila19f [Dan] kalanko1 Manifa21f jennider1 eu_sunt mangko18 hot^guy holly20f sad_beaut swimgirl ghazde ^^swt_guy pseudonym bing_23 topgirl23 sexYica creatza sergio9 ZaRa glance cookie^^ aileen` Ugly-GirL AFNAN EclipseM laurra-f garden cai applej SHUNSY fatcock kikelph mhaelee16 aGaTa Ercko lonebabe shellaine juulia priti2 HuntI2ess
1053587032 NAMES gienah Amanda^^ Jamali lishat18 cute_ashf jhen Horbit Sana18 AloneMan3 Errikka ext-ex Maysmile ynet02 poem_37M ann3 jelle love_less dreeve18 indai` adze LiWeiYi TokyoBoy blossom dummee man__ marichu earp danone jackdaw ^faraz^ ANGELA25 boby27 leah_ jossie shyrgil jade-17 kian arnulpo ally16 FiNG Carmina42 bangd sohail Janine33 anne--- joyce22 LUIE_M Travioli corn HOMBREJ2 sexybabes spyk2000 ^barbi3^
1053587032 NAMES tumbleWED Gaby3 chynna^^ babyTH lenjie jherome Certified dj_france jane36 micay shah goerge24 bluediamo master_po Jypsy bassma Bobson^^ Fil24f dimple2 _THERE_ AloneGirL Naked_f shark_nyk morena23 Danniel_m Arwen_ ofw_park jimbern m40usa restie @PacZzZzZz blackstud davis He11razor +MultiMind mater Fearless Adnan_pk Er`mya Helena BrainDead CStrixAW` wooden birkof Cute_Girl Lisa_-- Megaframe barbara-
1053587032 NAMES Simple Loren23 Diana27 Cozzo NateDogg legendh Angel19 Mariah19 fedfed SUNSEEKER PRONET7 bestofmi D0gGi3` +Don_Juan MrNylons teapot SkiPerZ +Br0Th4 Linu|tech ShowerMia JenJen Mariahhh optimist @X
1053587032 JOIN D-A-D-I IN
1053587045 JOIN sydneyguy AU
1053587047 PUBLIC Certified 17 US
1053587053 PUBLIC Certified 13 US
1053587059 JOIN Mckay28 MT
1053587063 NICK CHA_555 ^zHTe
1053587068 PUBLIC Certified 31 US
1053587076 PART ^zHTe
1053587080 JOIN villain PH
1053587082 JOIN cryn PH
1053587095 JOIN static}x{ US
1053587098 PUBLIC Certified 31 US
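A log in this format could be read with a small parser along the following lines. This is a sketch under assumptions, not the project's Perl tooling: `Event`, `ARITY`, and `parse_line` are illustrative names, fields are assumed to be separated by whitespace, and the leading `@` and `+` on names in NAMES records are assumed to be the usual IRC operator and voice prefixes, which the sketch strips.

```python
from typing import NamedTuple

class Event(NamedTuple):
    time: int    # timestamp as written in the log
    kind: str    # START, EXIT, JOIN, PART, QUIT, KICK, PUBLIC, NICK, NAMES
    args: tuple  # remaining fields, in log order

# Number of fields expected after the record type (from the format slide).
ARITY = {"START": 0, "EXIT": 0, "JOIN": 2, "PART": 1, "QUIT": 1,
         "KICK": 1, "PUBLIC": 3, "NICK": 2}

def parse_line(line):
    """Parse one log record into an Event; raise ValueError if malformed."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"too few fields: {line!r}")
    time, kind, args = int(fields[0]), fields[1], tuple(fields[2:])
    if kind == "NAMES":
        # Variable-length nick list; drop assumed status prefixes.
        return Event(time, kind, tuple(n.lstrip("@+") for n in args))
    if kind not in ARITY:
        raise ValueError(f"unknown record type: {kind!r}")
    if len(args) != ARITY[kind]:
        raise ValueError(f"wrong field count for {kind}: {line!r}")
    return Event(time, kind, args)

sample = [
    "1053587032 JOIN D-A-D-I IN",
    "1053587047 PUBLIC Certified 17 US",
    "1053587063 NICK CHA_555 ^zHTe",
    "1053587076 PART ^zHTe",
]
events = [parse_line(line) for line in sample]
for event in events:
    print(event)
```

The sample lines are taken from the log excerpt above; a stricter parser might also convert the PUBLIC size field to an integer.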
Inter-Arrival distribution – #bosnia [Figure: two plots of occurrences vs. time (in sec)]
Inter-Arrival distribution – #israel [Figure: two plots of occurrences vs. time (in sec)]
Inter-Arrival distribution – #canada [Figure: two plots of occurrences vs. time (in sec)]
Inter-Arrival distribution • The distribution looks similar for all three channels • The distribution is heavy-tailed for two main reasons: • Network splits add zero values (during reconnection) and large values (during the split) • Periods of low activity add to the tail (more relevant for channels with a non-uniform geographical distribution, like #bosnia)
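For concreteness, inter-arrival times are the gaps between consecutive JOIN timestamps. The short sketch below (the function name is illustrative) applies this to the JOIN records of the log excerpt shown earlier; two joins logged in the same second produce the kind of zero value discussed above.

```python
def inter_arrival_times(join_times):
    """Gaps in seconds between consecutive joins."""
    ordered = sorted(join_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

# JOIN timestamps from the log excerpt.
join_times = [1053587032, 1053587032, 1053587045, 1053587059,
              1053587080, 1053587082, 1053587095]

gaps = inter_arrival_times(join_times)
print(gaps)  # [0, 13, 14, 21, 2, 13]
```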
Inter-arrival time fits • The LogNormal distribution is the best fit in almost all cases • The only exception is #israel, where InvGauss is the best fit under the A-D and K-S tests • The Exponential distribution is very far from optimal [Tables: fit results for #israel, #canada, #bosnia]
The audio trace – inter-arrival fits • The inter-arrival time distribution is similar to that of the IRC channels • LogNormal/InvGauss
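To illustrate what comparing fits by the K-S test involves, the sketch below fits a lognormal and an exponential model by maximum likelihood and computes the Kolmogorov-Smirnov distance of each from the data. It is not the fitting tool used in the study; all function names are illustrative, the data is synthetic, and zero inter-arrival times are assumed to have been dropped beforehand because the logarithm of zero is undefined.

```python
import math
import random

def lognormal_mle(samples):
    """MLE of (mu, sigma): mean and standard deviation of log(x)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def exponential_mle(samples):
    """MLE of the exponential rate: 1 / sample mean."""
    return len(samples) / sum(samples)

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x)

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Synthetic positive "inter-arrival times" (NOT data from the study).
rng = random.Random(0)
data = [rng.lognormvariate(3.0, 1.5) for _ in range(2000)]

mu, sigma = lognormal_mle(data)
rate = exponential_mle(data)
d_lognormal = ks_statistic(data, lambda x: lognormal_cdf(x, mu, sigma))
d_exponential = ks_statistic(data, lambda x: exponential_cdf(x, rate))
print(f"K-S distance: lognormal={d_lognormal:.3f}, exponential={d_exponential:.3f}")
```

A smaller K-S distance means a closer fit. On heavy-tailed data such as this synthetic sample, the exponential model lands much farther from the empirical distribution than the lognormal one.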
Session duration distribution - #israel [Figure: plots of occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
Session duration distribution - #canada [Figure: plots of occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
Session duration distribution - #bosnia [Figure: plots of occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
Session duration distribution • Very heavy tail for two reasons: • Many users spent a lot of time in the channel • Robots
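Session durations can be derived from the log by pairing each JOIN with the matching PART, QUIT, or KICK. The sketch below is one possible way to do it, under stated assumptions: events are `(time, kind, args)` tuples in time order, NICK records are followed so a renamed user keeps the original join time, users who were already present at START (listed only in NAMES) are skipped because their join time is unknown, and sessions still open at START or EXIT are discarded. The function name is illustrative.

```python
def session_durations(events):
    """Return the durations in seconds of sessions with a known join and leave."""
    joined = {}      # current nickname -> join time
    durations = []
    for time, kind, args in events:
        if kind == "JOIN":
            joined[args[0]] = time
        elif kind == "NICK":
            old, new = args
            if old in joined:              # carry the join time to the new nick
                joined[new] = joined.pop(old)
        elif kind in ("PART", "QUIT", "KICK"):
            start = joined.pop(args[0], None)
            if start is not None:          # ignore users with unknown join time
                durations.append(time - start)
        elif kind in ("START", "EXIT"):
            joined.clear()                 # monitor restart: open sessions are lost
    return durations

# Hypothetical events in the log's vocabulary.
events = [
    (0, "START", ()),
    (10, "JOIN", ("alice", "IL")),
    (20, "NICK", ("alice", "alice_away")),
    (30, "JOIN", ("bob", "CA")),
    (40, "PART", ("carol",)),       # carol was present before START
    (70, "QUIT", ("alice_away",)),
]
durations = session_durations(events)
print(durations)  # [60]
```

Discarding sessions without a known start biases the sample against very long stays, so the heavy tail described above would, if anything, be underestimated by this sketch.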
Session duration fits • The BetaGeneral distribution gives the best fit under the Chi-Square and K-S tests whenever we limit the data samples • LogNormal is always in second place (and is the best fit under the A-D test) • When we do not limit the data samples, LogNormal is the best fit. • The Exponential distribution is very far from optimal [Tables: fit results for #israel, #canada, #bosnia]
The audio trace – session duration fits • The session duration distribution is not similar: it has an extremely heavy tail. • The 90th percentile is similar to that of the IRC channels [Figure: occurrences vs. time (in sec)]
The audio trace – session durations • Long sessions are similar to those in the IRC channels • The phenomenon of short sessions is unique to the audio trace; it has no analog in the IRC channels [Figures: long sessions (> 1 min); short sessions (< 1 min)]
Main affecting factors • Network failures (splits) • Robots and long staying users • Geographical distribution of users
IRC network splits • Any IRC server failure or link failure causes a split. • To a channel member, a split looks like a mass departure of users, and a reconnection looks like a mass join. • Splits contribute a large number of zeros to the inter-arrival times (about 2 percent of joins come in groups) • Splits decrease session durations • Most splits last up to 20 minutes
Short (temporary) splits • Heuristic: find a group of quits followed by a group of joins by the same users. • Finds only some of the failures
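One way such a heuristic could be coded is sketched below. The slide gives no thresholds, so every parameter is an assumption: `burst_window` (quits at most this many seconds apart form one group), `min_users` (smallest group that counts), and `max_gap` (how long to wait for the rejoin, set to 1200 s because most splits reportedly last up to 20 minutes). Events are assumed to be `(time, kind, nick)` tuples in time order.

```python
def find_splits(events, burst_window=5, min_users=3, max_gap=1200):
    """Return (split_start, first_rejoin, nicks) for each detected split."""
    quits = [(t, nick) for t, kind, nick in events if kind == "QUIT"]
    joins = [(t, nick) for t, kind, nick in events if kind == "JOIN"]

    # Group QUITs that follow each other closely into bursts.
    bursts, current = [], []
    for t, nick in quits:
        if current and t - current[-1][0] > burst_window:
            bursts.append(current)
            current = []
        current.append((t, nick))
    if current:
        bursts.append(current)

    splits = []
    for burst in bursts:
        gone = {nick for _, nick in burst}
        if len(gone) < min_users:
            continue
        start, end = burst[0][0], burst[-1][0]
        rejoined = {}   # nick -> time of first rejoin after the burst
        for t, nick in joins:
            if nick in gone and end <= t <= end + max_gap:
                rejoined.setdefault(nick, t)
        if len(rejoined) >= min_users:
            splits.append((start, min(rejoined.values()), sorted(rejoined)))
    return splits

# Hypothetical events: three users drop together and return 300 s later.
events = [
    (100, "QUIT", "a"), (100, "QUIT", "b"), (101, "QUIT", "c"),
    (150, "JOIN", "d"),
    (400, "JOIN", "a"), (400, "JOIN", "b"), (401, "JOIN", "c"),
    (900, "QUIT", "e"),
]
splits = find_splits(events)
print(splits)  # [(100, 400, ['a', 'b', 'c'])]
```

Users who change nicknames or never return are missed, which is consistent with the remark that the heuristic finds only some of the failures.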
Split durations [Figure: occurrences vs. duration (sec)]
Robots • We define a robot as any client that is logged in for more than 8 hours a day on average. • Add a constant to the number of logged-in users • Add a heavy tail to session durations • Do not affect inter-arrival and join statistics
Distribution of the number of logged-in robots [Figure: occurrences vs. number of bots]
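The robot definition translates directly into a threshold on average daily login time. The sketch below assumes sessions are available as `(nick, start, end)` tuples in seconds and that the length of the observation period in days is known; the function name is illustrative, and nickname changes are ignored for simplicity.

```python
def find_robots(sessions, observation_days, threshold_hours=8.0):
    """Return nicks logged in for more than `threshold_hours` per day on average."""
    total_seconds = {}
    for nick, start, end in sessions:
        total_seconds[nick] = total_seconds.get(nick, 0) + (end - start)
    limit = threshold_hours * 3600
    return {nick for nick, seconds in total_seconds.items()
            if seconds / observation_days > limit}

HOUR = 3600
# Hypothetical sessions over a 2-day observation period.
sessions = [
    ("logbot", 0, 10 * HOUR),                # 10 h on day 1
    ("logbot", 24 * HOUR, 34 * HOUR),        # 10 h on day 2
    ("visitor", 5 * HOUR, 8 * HOUR),         # 3 h in total
    ("regular", 0, 16 * HOUR),               # exactly 8 h/day on average
]
robots = find_robots(sessions, observation_days=2)
print(robots)  # {'logbot'}
```

A client averaging exactly 8 hours is not counted, since the definition says "more than".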
User traffic (Israel) [Figures: joins per hour vs. hour of day; channel size vs. hour of day]
User traffic (bosnia) [Figures: joins per hour vs. hour of day; channel size vs. hour of day]
User traffic (canada) [Figures: joins per hour vs. hour of day; channel size vs. hour of day]
User traffic as a function of time of day – observations • The function is very stable across different days • The shape of the graph is mainly determined by the geographical distribution of users • Has a great influence on the distributions of other parameters, such as the number of online users and the number of joins per hour.
Joins per hour distribution - #israel [Figure: occurrences vs. joins in hour]
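A joins-per-hour profile of this kind can be computed by bucketing JOIN timestamps by hour of day. The sketch below assumes the log timestamps are Unix epoch seconds (the values in the log excerpt are consistent with that) and buckets them in UTC; whether the original plots used UTC or a local time zone is not stated, and the function name is illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

def joins_by_hour(join_times):
    """Return a 24-element list: number of joins in each hour of day (UTC)."""
    counts = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in join_times
    )
    return [counts.get(hour, 0) for hour in range(24)]

# JOIN timestamps from the log excerpt.
join_times = [1053587032, 1053587032, 1053587045, 1053587059,
              1053587080, 1053587082, 1053587095]

profile = joins_by_hour(join_times)
for hour, count in enumerate(profile):
    if count:
        print(f"{hour:02d}:00 UTC  {count} joins")
```

Dividing each bucket by the number of observed days would give the average joins per hour for that time of day.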