Things about Trace Analysis Wei-jen Hsu In class presentation for CIS6930 wjhsu@ufl.edu (Advisor: Ahmed Helmy)
Objective • More background knowledge related to trace-based study • Details about the trace format – an intro for one of the assignments • Share the experience in trace analysis
Why trace analysis? • Traces provide the “realism” of how the system works • Verification of an established system • Diagnosis of system operation (identify faults) • Identifying design flaws • Large-scale properties (e.g., self-similar traffic) • Understand how a new system works • Provide domain knowledge for analysis work • Verifying an idea
Typical Work Flow for Trace Analysis • Build the system • Identify point(s) of trace collection and the methodology used • Obtain the data • Clean-up and sanity check • Analyze the data and post processing • Explain the results • Apply the results to further study or modify the existing system
WLAN Traces Study • Started around 2000 • WLAN was new, and people wanted to understand how it was used (usage study) • Surveys vs. traces • Work by Tang and Baker (’00) and Kotz and Essien (’02) are pioneering examples • Statistics of usage (# of users, amount of traffic, etc.)
WLAN Traces Study • Mobility-related • MIT work (home location, prevalence, and persistence) • UCSD (PDA users) • WLAN mobility model (INFOCOM05, T-model, T++-model) • Other user properties • Handoff • Pause time distribution
Trace Format • For association • Usually with format (Node_id, start_time, location, end_time) • But with various ways to get you there…. • Syslog: Event-based • SNMP: Polling • USC raw trace • Wireless association (time start/stop switch-port MAC) • DHCP log (time MAC IP) • Traffic log
Trace Format Example
• USC wireless association trace (Time Start/Stop Switch_IP Switch_port MAC_of_node)
Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae
Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1
Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a
• USC DHCP trace (Time IP_of_node MAC_of_node)
Jan 27 00:21:19 207.151.229.50 0:18:f3:10:ea:4c
Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92
Jan 27 00:21:20 207.151.229.50 0:18:f3:10:ea:4c
• USC traffic trace (Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)
0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
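As a rough illustration of how the Start/Stop association records could be turned into the (Node_id, start_time, location, end_time) format, here is a minimal Python sketch. The function names, the `year` parameter (the trace lines carry no year), the pairing of Start and Stop records by MAC address, and the use of (Switch_IP, Switch_port) from the Start record as the location are all assumptions made for illustration, not the USC processing tools.

```python
from datetime import datetime

def parse_association_line(line, year=2005):
    # Fields: weekday month day HH:MM:SS Start|Stop switch_ip switch_port mac
    # The year is NOT in the trace; it is an assumed parameter.
    _, month, day, clock, event, switch_ip, port, mac = line.split()
    ts = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return ts, event, (switch_ip, int(port)), mac.lower()

def build_sessions(lines, year=2005):
    """Pair Start/Stop events per MAC into (node, start, location, end) tuples."""
    open_sessions = {}  # mac -> (start_time, location)
    sessions = []
    for line in lines:
        if not line.strip():
            continue
        ts, event, location, mac = parse_association_line(line, year)
        if event == "Start":
            # A repeated Start without a Stop overwrites the open session.
            open_sessions[mac] = (ts, location)
        elif event == "Stop" and mac in open_sessions:
            start, start_location = open_sessions.pop(mac)
            sessions.append((mac, start, start_location, ts))
        # Stop events without a matching Start are silently dropped.
    return sessions
```

Unmatched records are exactly the kind of exception worth counting during clean-up and sanity checks rather than ignoring.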
Work with the Trace • An exercise: “Does the Encounter-Relationship graph change with respect to time?” • From WLAN traces, we find “encounters” to measure inter-node relationships • Note: Is this a good assumption?
Encounter distribution • How many other nodes does a node encounter? • Not many for WLAN users: on average only 2%~7% of the population • [Figure: Prob. (unique encounter fraction > x)]
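The encounter count per node could be computed along the following lines. This is a sketch under an assumed definition: two nodes "encounter" each other when their stays at the same location overlap in time. The session tuples follow the (Node_id, start_time, location, end_time) format, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def find_encounters(sessions):
    """sessions: list of (node, start, location, end); returns node -> set of nodes met."""
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))
    encountered = defaultdict(set)
    for visits in by_location.values():
        # Quadratic in the number of visits per location; fine for a sketch.
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(visits, 2):
            # Assumed definition: overlapping stays at the same location.
            if a != b and a_start < b_end and b_start < a_end:
                encountered[a].add(b)
                encountered[b].add(a)
    return encountered

def unique_encounter_fraction(sessions):
    """Fraction of the other nodes in the trace that each node ever encounters."""
    nodes = {s[0] for s in sessions}
    encountered = find_encounters(sessions)
    others = len(nodes) - 1
    return {n: (len(encountered[n]) / others if others else 0.0) for n in nodes}
```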
Encounter-Relationship graph • Imagine a link connecting each pair of nodes that ever encounter each other. What does the graph look like? • [Figure labels: loner; group of good friends; cliques with random links to join them] • But is the ER graph a connected graph? What are its properties?
Encounter-Relationship graph • To our surprise, ER graphs are connected! • In most cases the disconnected ratio (DR) reaches close to its final value in less than 1 day • [Figure: Disconnected Ratio (%)]
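One way to measure how connected an ER graph is can be sketched as follows. The slide does not define the disconnected ratio, so this sketch assumes it is the percentage of node pairs that have no path between them; the function name and the union-find approach are illustrative choices.

```python
def disconnected_ratio(nodes, edges):
    """Percentage of node pairs with no connecting path (assumed definition of DR)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    # Count the size of each connected component.
    sizes = {}
    for n in parent:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1

    total = len(parent)
    all_pairs = total * (total - 1) // 2
    connected_pairs = sum(s * (s - 1) // 2 for s in sizes.values())
    return 100.0 * (all_pairs - connected_pairs) / all_pairs if all_pairs else 0.0
```

Calling this on the encounter edges accumulated up to successive points in time would give a curve of DR versus elapsed trace time.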
Encounter-Relationship graph • What are the graph properties of the relationship graphs? • Regular graph: high path length, high clustering • Random graph: low path length, low clustering • Small-world graph: high clustering like a regular graph, low path length like a random graph
Encounter-Relationship graph • Relationship graphs are small-world graphs • High clustering coefficient (CC), low average path length (PL) • [Figure: Normalized CC and PL]
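The two small-world metrics can be computed directly from an adjacency map, as in the sketch below. It uses the standard local clustering coefficient and breadth-first search; the normalization against regular and random graphs shown in the figure is not included, and treating nodes of degree below 2 as having coefficient 0 is an assumption of this sketch.

```python
from collections import deque

def average_clustering(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops, sortable node ids)."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumed: coefficient 0 when fewer than 2 neighbours
        # Count links that exist among the neighbours of this node.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0

def average_path_length(adj):
    """Mean hop count over all ordered node pairs that are connected by a path."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```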
Work with the Trace • An exercise: “Does the Encounter-Relationship graph change with respect to time?” • Chop the trace into multiple segments • Analyze the average clustering coefficient and average path length of the resultant graph • How to deal with changing population? • Does the encounter duration matter?
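Chopping the trace into segments might look like the following sketch. It assumes numeric timestamps (e.g., seconds since the start of the trace) and fixed-length segments, and it clips sessions that span a segment boundary; the function name and parameters are illustrative. It does not answer the open questions on changing population or encounter duration.

```python
def split_sessions(sessions, trace_start, segment_length):
    """Group (node, start, location, end) sessions by fixed-length time segment.

    Returns a dict: segment index -> list of sessions clipped to that segment.
    """
    segments = {}
    for node, start, location, end in sessions:
        first = int((start - trace_start) // segment_length)
        last = int((end - trace_start) // segment_length)
        for idx in range(first, last + 1):
            seg_start = trace_start + idx * segment_length
            seg_end = seg_start + segment_length
            clipped_start = max(start, seg_start)
            clipped_end = min(end, seg_end)
            if clipped_start < clipped_end:  # skip empty slivers at boundaries
                segments.setdefault(idx, []).append(
                    (node, clipped_start, location, clipped_end))
    return segments
```

Each segment's sessions can then be used to build one graph per segment and compare the metrics across segments.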
Work with the Trace • Ask questions! What to look for from the trace? • Its importance • Its implication • Its potential usage • Its alternative solutions • Apply new techniques to look into the data • Find/Create interesting data sets
Lessons Learned • You need a lot of patience and care • Exceptions in the data • Flaws in your assumptions • You need a lot of hard-drive space too! • You need good questions • For each question there are multiple ways to come up with an answer • New questions require new data sets and tools • You need to read a lot of papers
More Potential Directions • Mobility modeling/prediction • Data mining and clustering • Behavior-aware service/advertisements • Behavior-aware routing • Caveat: Over-generalization from WLAN to futuristic networks (such as DTN)? • Re-examine assumptions in earlier work
Related Skills • General programming (C/C++) • Perl/shell script/awk • Matrix manipulation (MATLAB) • Statistics software (R) • http://www.r-project.org/ • Clustering/Machine learning • Principal component analysis/ Singular value decomposition • http://www.cs.cmu.edu/~elaw/papers/pca.pdf • Data mining? Database analysis?
Good Online Resources • MobiLib http://nile.cise.ufl.edu/MobiLib • Links to various traces, USC trace and some processing tools download • CRAWDAD http://crawdad.cs.dartmouth.edu/ • Various traces download, related papers
References • [Stanford] D. Tang and M. Baker, “Analysis of a Local-area Wireless Network” • [Stanford2] D. Tang and M. Baker, “Analysis of a Metropolitan-area Wireless Network” • [Dartmouth] D. Kotz and K. Essien, “Analysis of a Campus-wide Wireless Network” • [Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, “The Changing Usage of a Mature Campus-wide Wireless Network” • [MIT/IBM] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network”
References • [UCSD] M. McNett and G. Voelker, “Access and Mobility of Wireless PDA Users” • [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, “Characterizing Flows in Large Wireless Data Networks” • [USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, “Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus” • [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, “Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks”
References • [Dartmouth] M. Kim and D. Kotz, “Methodology for Classifying Mobile Users and Access Points” • [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating location predictors with extensive Wi-Fi mobility data” • [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, “Characterizing User Behavior and Network Performance in a Public Wireless LAN” • [INFOCOM05] C. Tuduce and T. Gross, “A Mobility Model Based on WLAN Traces and its Validation” • [T++-model] D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, “Model T++: an empirical joint space-time registration model” • [T-model] R. Jain, D. Lelescu, and M. Balakrishnan, “Model T: an empirical model for user registration patterns in a campus wireless LAN”
Mobility Observations from WLANs • Skewed location visiting preferences: nodes spend 95% of time at their top 5 preferred locations (heavily visited “preferred spots”) • Periodical re-appearance: nodes show up repeatedly at the same location after integer multiples of days (periodical “daily/weekly schedules”)
Mobility Observations from WLANs • Problems of simple random models (random walk, random waypoint, random direction) • No preferred locations in the spatial domain (uniform nodal distribution across space) • No structure in the time domain (homogeneous behavior across time) • Nodes behave statistically identically to one another • Benefit: mathematical tractability • Can we improve realism without sacrificing mathematical tractability?
Time-variant Community Model • Skewed location visiting preferences • Create “communities” to be the preferred destinations • Each node can have its own community • Periodical re-appearance • Create structure in time: periods • Nodes move with different parameters in different periods • Repetitive structure • [Figure labels: 75%, 25%]
Time-variant Community Model • Major trends of mobility characteristics preserved (extensions later) • In addition, mathematical tractability is retained • [Figure labels: Prob. of re-appearance; Avg. fraction of online time; Time gap (days)]
Introduction • Widespread WLAN deployments create large-scale infrastructures. • Large numbers of users lead to large-scale management and design issues. • We need methods to quantify, summarize, and compare long-run trends (on the order of months) of individual user associations • Usage model / association model • Personalized services • Behavior-aware ads / monetization • Behavior-aware routing protocols
Questions • Q1. How to quantify user association consistency? • (Challenge) What is a proper representation of user association, and how do we measure consistency? • Q2. How do we summarize long-run user association patterns? • (Challenge) How to utilize existing data reduction techniques? • Q3. How to group users with similar association patterns? • (Challenge) How to quantify the similarity of user association patterns? • How to reduce computational complexity? • Contribution: generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.
Representation of User Association Patterns • We summarize a user's association in each day by a single vector. • For a given day d, the association vector of user i is an n-element vector a = {a_j : the fraction of online time user i spends at AP_j on day d}. • The elements of the vector sum to 1. • Use the zero vector for off-line users. • The elements quantify the relative importance (or attraction) of each AP to the user. • Example: (library, 1:30PM-2:30PM), (office, 10AM-12PM), (class, 6PM-8PM) gives the association vector (library, office, class) = (0.2, 0.4, 0.4)
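The daily association vector can be computed as in this small sketch, which reproduces the library/office/class example. The function name and the input layout (a list of (ap, start, end) stays with times in hours) are assumptions for illustration.

```python
def association_vector(daily_sessions, ap_list):
    """daily_sessions: list of (ap, start, end) for one user on one day, times in hours.

    Returns the fraction of online time spent at each AP, in ap_list order.
    Off-line users (no online time) get the zero vector.
    """
    time_at = {ap: 0.0 for ap in ap_list}
    for ap, start, end in daily_sessions:
        time_at[ap] += end - start
    online = sum(time_at.values())
    if online == 0:
        return [0.0] * len(ap_list)
    return [time_at[ap] / online for ap in ap_list]

# The example from the slide: 1 h at the library, 2 h at the office, 2 h in class.
day = [("library", 13.5, 14.5), ("office", 10.0, 12.0), ("class", 18.0, 20.0)]
vector = association_vector(day, ["library", "office", "class"])
```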
Q1. User Association Consistency • User i is consistent if their daily association vectors can be grouped into a few clusters (e.g., fewer than 10% of the number of days). • Evaluation: use hierarchical clustering with the Manhattan (L1) distance measure • The distance between two vectors is at most 2, since the elements of each vector are non-negative and sum to at most 1.
Q1. User Association Consistency • Hierarchical Clustering • Start: each vector is a single-member cluster. • Recursion: the two closest clusters are merged. • End: stop when all remaining clusters are farther apart than a threshold.
Q1. User Association Consistency • [Figure: Distribution of number of clusters under cut-off threshold 0.9] • 80% of users show at most 9 clusters of “behavior modes” during the 94-day trace • *Complete link: distance between clusters = distance between the furthest components in the considered clusters • Observation: many users are multimodal, but with far fewer association modes than the total number of days in the trace period.
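The Q1 procedure (hierarchical clustering with Manhattan distance, complete link, and a distance cut-off) can be sketched in plain Python as follows. This is a simple cubic-time illustration with assumed function names, not the implementation used for the results.

```python
def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def complete_link_clusters(vectors, threshold):
    """Agglomerative clustering; returns clusters as lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]  # start: one cluster per vector

    def cluster_distance(c1, c2):
        # Complete link: distance between the furthest members of the two clusters.
        return max(manhattan(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # all remaining clusters are farther apart than the threshold
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the two closest clusters
        del clusters[y]
    return clusters
```

With the 0.9 cut-off from the slide, a user could then be called consistent when `len(complete_link_clusters(daily_vectors, 0.9))` is below 10% of the number of days.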
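Q2. Summarizing user associations • Association matrix: concatenate the user's daily association vectors for all days into a matrix X. • To summarize, perform SVD and store the top-k singular values/vectors. • What value of k do we need for a good representation of the matrix? • Captured matrix power = $\sum_{i=1}^{k} \sigma_i^2 / \sum_{i=1}^{\mathrm{rank}(X)} \sigma_i^2$, where $\sigma_i$ is the i-th singular value of X. • How large is the reconstruction error? • Relative error in matrix norm: $\|X - X_k\|_p / \|X\|_p$, where $X_k$ is the rank-k approximation built from the top-k singular values/vectors.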
Q2. Summarizing user associations • Only the top 6 singular vectors are needed to capture at least 90% of the power for more than 95% of association matrices • Reconstruction error of the low-rank approximation is low (5 singular vectors give error < 0.05) • Observation: although users are multi-modal, a few major modes dominate their behavior
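The SVD summary can be sketched with NumPy as below. The sketch assumes the association matrix has one row per day and one column per AP, takes captured power as the share of squared singular values in the top k, and uses the Frobenius norm for the reconstruction error; the norm, the orientation, and the function name are assumptions, since the slides do not fix them.

```python
import numpy as np

def summarize_associations(X, k):
    """Return (top-k singular values, top-k right singular vectors, power, error).

    X: association matrix, assumed shape (days, APs), not all zeros.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Share of total squared singular values captured by the top k.
    power = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    # Rank-k reconstruction from the top-k singular triplets.
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Relative reconstruction error in the Frobenius norm (assumed choice of norm).
    error = float(np.linalg.norm(X - X_k) / np.linalg.norm(X))
    return s[:k], Vt[:k, :], power, error
```

Storing only the k singular values and vectors per user is what lets the later comparison read summaries instead of raw data.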
Q3. Similarity Metrics between Users • Naive method to compare similarity between users i and j: • Intuition: if for every daily association vector of i there is a similar association vector of j, then (i, j) have similar behavior. • Pick the association vector a_{i,d} of user i on day d. • Find the association vector of user j, denoted a_{j,d'}, that is nearest to a_{i,d}. • Average |a_{j,d'} - a_{i,d}| over all days d. • Drawback: expensive • O(nd^2) for each pair • Lots of file reads for large datasets, since the raw data must be read • Need a faster method that reads summaries
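The naive pairwise comparison can be sketched as follows, which also makes the cost visible: every day of user i is compared against every day of user j, and each comparison touches all n vector elements. Manhattan distance is assumed for |a_{j,d'} - a_{i,d}|, and the function names are illustrative.

```python
def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def naive_distance(days_i, days_j):
    """Average distance from each daily vector of user i to its nearest vector of user j.

    Note: as written this is not symmetric in i and j.
    """
    total = 0.0
    for a in days_i:                                      # d vectors of user i
        total += min(manhattan(a, b) for b in days_j)     # d vectors of user j, n elements each
    return total / len(days_i)
```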
Q3. Similarity Metrics between Users • Compare the similarity of the singular vectors obtained from SVD. • Similarity between users U and V, Sim(U,V), is determined by weighted inner products of their singular vectors. • w_i = proportion of power captured by singular vector i • D(U,V) = 1 - Sim(U,V) • Are the 2 metrics similar? • 0.911 correlation coefficient for the studied users.
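The summary-based metric might be sketched as below. The exact formula is not on the slide, so this sketch assumes Sim(U,V) is the sum over all pairs of singular vectors of w_{u_i} · w_{v_j} · |u_i · v_j|, with each weight being the vector's share of squared singular values; the function names and the days-by-APs matrix orientation are also assumptions.

```python
import numpy as np

def svd_summary(X, k):
    """Top-k right singular vectors of X (assumed shape: days x APs) and their power weights."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    weights = s ** 2 / np.sum(s ** 2)  # proportion of power per singular vector
    return Vt[:k, :], weights[:k]

def similarity(summary_u, summary_v):
    """Assumed form: sum_i sum_j w_ui * w_vj * |u_i . v_j|."""
    vecs_u, w_u = summary_u
    vecs_v, w_v = summary_v
    inner = np.abs(vecs_u @ vecs_v.T)  # |u_i . v_j| for every pair (i, j)
    return float(w_u @ inner @ w_v)

def distance(summary_u, summary_v):
    """D(U,V) = 1 - Sim(U,V), as on the slide."""
    return 1.0 - similarity(summary_u, summary_v)
```

Only the stored summaries are needed per pair, so the comparison no longer depends on the number of days in the trace.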
Q3. Similarity Metrics between Users • Are we able to get clusters with similar users? • Compare the PDF/CDF for inter- and intra-cluster users (Example: 200 clusters).
Q3. Similarity Metrics between Users • Take users in the same cluster, concatenate their association matrices, perform SVD, and find the power captured by the top-k singular vectors. • Also take random users, concatenate the eigenvectors, and do the same. • There is a clear distinction between the 2 clustering methods. • *Straightforward = similarity decided by pair-wise comparison of association vectors • *Feature-based = similarity decided by singular vectors
Q3. Similarity Metrics between Users • For all clusters, use a scatter plot to show the power captured by top-4 eigenvectors. (distance-based cluster vs random cluster)