Things about Trace Analysis

Wei-jen Hsu • In-class presentation for CIS6930 • wjhsu@ufl.edu (Advisor: Ahmed Helmy)

Objective • More background knowledge related to trace-based study • Details about the trace format – an intro for one of the assignments • Share the experience in trace analysis

Why trace analysis? • Traces provide the “realism” of how the system works • Verification of an established system • Diagnosis of system operation (identify faults) • Identifying design flaws • Large-scale properties (e.g. self-similar traffic) • Understand how a new system works • Provide domain knowledge for analysis work • Verifying an idea

Typical Work Flow for Trace Analysis • Build the system • Identify point(s) of trace collection and the methodology used • Obtain the data • Clean-up and sanity check • Analyze the data and post processing • Explain the results • Apply the results to further study or modify the existing system

WLAN Traces Study • It started around 2000 • WLAN was new, and people wanted to understand how it was used (usage study) • Surveys vs. traces • Work by Tang and Baker (’00) and Kotz and Essien (’02) are pioneering examples • Statistics of usage (# of users, amount of traffic, etc.)

WLAN Traces Study • Mobility-related • MIT work (home location, prevalence, and persistence) • UCSD (PDA users) • WLAN mobility model (INFOCOM05, T-model, T++-model) • Other user properties • Handoff • Pause time distribution

Trace Format • For association • Usually with format (Node_id, start_time, location, end_time) • But with various ways to get you there…. • Syslog: Event-based • SNMP: Polling • USC raw trace • Wireless association (time start/stop switch-port MAC) • DHCP log (time MAC IP) • Traffic log

Trace Format Example
• USC wireless association trace (Time Start/Stop Switch_IP Switch_port MAC_of_node)
    Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae
    Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1
    Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a
• USC DHCP trace (Time IP_of_node MAC_of_node)
    Jan 27 00:21:19 207.151.229.50 0:18:f3:10:ea:4c
    Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92
    Jan 27 00:21:20 207.151.229.50 0:18:f3:10:ea:4c
• USC traffic trace (Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)
    0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
    0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
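As an illustration of going from the raw event-based trace to the (Node_id, start_time, location, end_time) format, here is a minimal Python sketch. It assumes that a Stop event closes the most recent Start with the same MAC and the same (switch IP, switch port) pair, and that the year is supplied by the caller because the trace lines carry none. The function names, the year 2005, and the second sample line are made up for illustration.

```python
from datetime import datetime

def parse_association_line(line, year):
    """Split one association record into (time, event, location, mac)."""
    _weekday, month, day, clock, event, switch_ip, port, mac = line.split()
    # The trace has no year field, so the caller has to provide one (assumption).
    when = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return when, event, (switch_ip, port), mac

def build_sessions(lines, year):
    """Pair Start/Stop events into (mac, start_time, location, end_time) tuples."""
    open_sessions = {}  # (mac, location) -> start time of the pending session
    sessions = []
    for line in lines:
        when, event, location, mac = parse_association_line(line, year)
        key = (mac, location)
        if event == "Start":
            open_sessions[key] = when
        elif event == "Stop" and key in open_sessions:
            sessions.append((mac, open_sessions.pop(key), location, when))
        # A Stop without a matching Start is silently dropped in this sketch.
    return sessions

SAMPLE = [
    "Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae",
    # Hypothetical Stop record, added only to complete the session.
    "Mon Oct 10 01:17:00 Stop 172.16.8.245 31005 0:30:65:f9:c0:ae",
]

for session in build_sessions(SAMPLE, year=2005):
    print(session)
```

Unmatched Start or Stop events are exactly the kind of exception that the clean-up and sanity-check step has to count and explain before any analysis.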

Work with the Trace • An exercise: “Does the Encounter-Relationship graph change with respect to time?” • From WLAN traces, we find “encounters” to measure inter-node relationships • Note: Is this a good assumption?
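One way to extract encounters can be sketched as follows. The slide does not define an encounter, so this sketch assumes that two nodes encounter each other when they are associated with the same location during overlapping time intervals. Session tuples follow the (Node_id, start_time, location, end_time) format, with plain numbers standing in for timestamps; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_encounters(sessions):
    """Return (node_a, node_b, location, overlap_start, overlap_end) tuples."""
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))

    encounters = []
    for location, visits in by_location.items():
        # Compare every pair of visits at the same location.
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            overlap_start, overlap_end = max(s1, s2), min(e1, e2)
            if a != b and overlap_start < overlap_end:
                encounters.append((a, b, location, overlap_start, overlap_end))
    return encounters

sessions = [
    ("A", 0, "AP1", 10),
    ("B", 5, "AP1", 15),
    ("C", 20, "AP1", 30),
    ("A", 12, "AP2", 18),
    ("C", 25, "AP2", 28),
]
print(find_encounters(sessions))
```

The pairwise comparison is quadratic in the number of visits per location; for a large trace, sorting visits by start time and sweeping would be the usual refinement.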

Encounter distribution • How many other nodes does a node encounter? • Not many for WLAN users: on average, only 2%~7% of the population • [Figure: Prob. (unique encounter fraction > x)]
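The quantity on this slide can be computed roughly as in the sketch below. It assumes the unique encounter fraction of a node is the number of distinct nodes it ever encountered divided by the number of other nodes in the population. The function names and sample data are hypothetical.

```python
from collections import defaultdict

def unique_encounter_fraction(encounter_pairs, population):
    """Map each node to the fraction of the other nodes it has encountered."""
    partners = defaultdict(set)
    for a, b in encounter_pairs:
        partners[a].add(b)
        partners[b].add(a)
    others = len(population) - 1
    return {node: len(partners[node]) / others for node in population}

def ccdf(values, x):
    """Empirical Prob(value > x)."""
    values = list(values)
    return sum(v > x for v in values) / len(values)

population = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("A", "C"), ("A", "B")]  # repeated encounters count once
fractions = unique_encounter_fraction(pairs, population)
print(fractions)
print(ccdf(fractions.values(), 0.2))
```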

Encounter-Relationship graph • Imagine that there is a link connecting each pair of nodes that have ever encountered each other … What does the graph look like? • But is the ER graph a connected graph? What are its properties? • [Figure labels: loner; group of good friends; cliques with random links to join them]

Encounter-Relationship graph • To our surprise, ER graphs are connected! • In most cases the disconnected ratio (DR) reaches close to its final value in less than 1 day. • [Figure: Disconnected Ratio (%)]

Encounter-Relationship graph • What are the graph properties of the relationship graphs? • Regular graph: high path length, high clustering • Random graph: low path length, low clustering • Small-world graph: high clustering like a regular graph, low path length like a random graph

Encounter-Relationship graph • Relationship graphs are small-world graphs • High clustering coefficient, low avg. path length • [Figure: normalized clustering coefficient (CC) and path length (PL)]
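The two metrics can be computed on an undirected graph with a short, dependency-free sketch like the one below. It uses the usual definitions: a node's clustering coefficient is the fraction of its neighbor pairs that are themselves linked, and path length is the hop count found by breadth-first search, averaged here over reachable ordered pairs only. It does not perform the normalization against regular and random graphs shown in the figure, and the edge list is hypothetical.

```python
from collections import deque

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def average_clustering(adj):
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than 2 neighbors contribute 0
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical ER graph: a triangle A-B-C with D attached to C.
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(average_clustering(graph), average_path_length(graph))
```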

Work with the Trace • An exercise: “Does the Encounter-Relationship graph change with respect to time?” • Chop the trace into multiple segments • Analyze the average clustering coefficient and average path length of the resulting graphs • How to deal with a changing population? • Does the encounter duration matter?
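The chopping step could look like the sketch below, which assigns each encounter to a fixed-length time segment and collects one edge set per segment. Fixed-length segments, the function name, and the sample data (times in days) are assumptions; the slide leaves the segmentation open.

```python
from collections import defaultdict

def graphs_per_segment(encounters, segment_length):
    """encounters: (node_a, node_b, time) triples.
    Returns {segment_index: set of undirected edges}."""
    segments = defaultdict(set)
    for a, b, t in encounters:
        segments[int(t // segment_length)].add(frozenset((a, b)))
    return dict(segments)

encounters = [("A", "B", 1), ("A", "B", 3), ("B", "C", 9), ("A", "C", 15)]
for index, edges in sorted(graphs_per_segment(encounters, 7).items()):
    nodes = set().union(*edges)
    print(index, len(nodes), "nodes", len(edges), "edges")
```

Printing the node count per segment makes the changing-population question visible: each segment's graph is built only from the nodes that were active in it, so metrics from different segments are not computed over the same node set.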

Work with the Trace • Ask questions! What to look for from the trace? • Its importance • Its implication • Its potential usage • Its alternative solutions • Apply new techniques to look into the data • Find/Create interesting data sets

Lessons Learned • You need a lot of patience and care • Exceptions in the data • Flaws in your assumption • You need a lot of hard-drive space too! • You need good questions • For each question there are multiple ways to come up with an answer • New questions require new data sets and tools • You need to read a lot of papers

More Potential Direction • Mobility modeling/prediction • Data mining and clustering • Behavior-aware service/advertisements • Behavior-aware routing • Caveat: Over-generalization from WLAN to futuristic networks (such as DTN)? • Re-examine assumptions in earlier work

Related Skills • General programming (C/C++) • Perl/shell script/awk • Matrix manipulation (MATLAB) • Statistics software (R) • http://www.r-project.org/ • Clustering/Machine learning • Principal component analysis/ Singular value decomposition • http://www.cs.cmu.edu/~elaw/papers/pca.pdf • Data mining? Database analysis?

Good Online Resources • MobiLib http://nile.cise.ufl.edu/MobiLib • Links to various traces, USC trace and some processing tools download • CRAWDAD http://crawdad.cs.dartmouth.edu/ • Various traces download, related papers

References • [Stanford] D. Tang and M. Baker, “Analysis of a Local-area Wireless Network” • [Stanford2] D. Tang and M. Baker, “Analysis of a Metropolitan-area Wireless Network” • [Dartmouth] D. Kotz and K. Essien, “Analysis of a Campus-wide Wireless Network” • [Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, “The Changing Usage of a Mature Campus-wide Wireless Network” • [MIT/IBM] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network”

References • [UCSD] M. McNett and G. Voelker, “Access and Mobility of Wireless PDA Users” • [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, “Characterizing Flows in Large Wireless Data Networks” • [USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, “Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus” • [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, “Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks”

References • [Dartmouth] M. Kim and D. Kotz, “Methodology for Classifying Mobile Users and Access Points” • [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating Location Predictors with Extensive Wi-Fi Mobility Data” • [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, “Characterizing User Behavior and Network Performance in a Public Wireless LAN” • [INFOCOM05] C. Tuduce and T. Gross, “A Mobility Model Based on WLAN Traces and its Validation” • [T++-model] D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, “Model T++: An Empirical Joint Space-Time Registration Model” • [T-model] R. Jain, D. Lelescu, and M. Balakrishnan, “Model T: An Empirical Model for User Registration Patterns in a Campus Wireless LAN”

More on Mobility Modeling

Mobility Observations from WLANs • Skewed location visiting preferences: nodes spend 95% of their time at their top 5 preferred locations (heavily visited “preferred spots”) • Periodical re-appearance: nodes show up repeatedly at the same location after integer multiples of days (periodical “daily/weekly schedules”)

Mobility Observations from WLANs • Problems of simple random models (random walk, random waypoint, random direction) • No preferred locations in spatial domain (uniform nodal distribution across space) • No structure in time domain (homogeneous behavior across time) • Nodes behave statistically identically to one another • Benefit: Math analysis tractability • Can we improve realism and not sacrifice math tractability?

Time-variant Community Model • Skewed location visiting preferences • Create “communities” to be the preferred destinations • Each node can have its own community • Periodical re-appearance • Create structure in time: periods • Nodes move with different parameters in different periods • Repetitive structure • [Figure labels: 75%, 25%]

Time-variant Community Model • Major trends of mobility characteristics preserved (extensions later) • In addition, mathematical tractability is retained • [Figure axis labels: Prob. of re-appearance; Avg. fraction of online time; Time gap (days)]

More on Matrix-based Analysis

Introduction • Widespread WLAN deployments create large-scale infrastructures. • Large numbers of users lead to large-scale management and design issues. • We need methods to quantify, summarize, and compare long-run trends (on the order of months) of individual user associations • Usage model / association model • Personalized services • Behavior-aware ads / monetization • Behavior-aware routing protocols

Questions • Q1. How to quantify user association consistency? • (Challenge) What is a proper representation of user association, and how do we measure consistency? • Q2. How do we summarize long-run user association patterns? • (Challenge) How to utilize existing data reduction techniques? • Q3. How to group users with similar association patterns? • (Challenge) How to quantify the similarity of user association patterns? • How to reduce computational complexity? • Contribution: Generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.

Representation of User Association Patterns • We represent the summary of a user's association in each day by a single vector. • For a given day d, the association vector of user i is an n-element vector a = {a_j : the fraction of online time user i spends at AP_j on day d}. • The elements of a vector sum to 1. • Use the zero vector for off-line users. • The elements of the vector quantify the relative importance (or attraction) of each AP to the user. • Example: (office, 10AM-12PM), (library, 1:30PM-2:30PM), (class, 6PM-8PM) gives the association vector (library, office, class) = (0.2, 0.4, 0.4)
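The definition can be checked against the slide's own example with a short sketch. The function name and the session format (AP, start, end), with times in hours, are assumptions made for illustration.

```python
def association_vector(sessions, ap_order):
    """sessions: (ap, start, end) tuples for one user on one day.
    Returns the fraction of online time spent at each AP, in ap_order."""
    online = {ap: 0.0 for ap in ap_order}
    for ap, start, end in sessions:
        online[ap] += end - start
    total = sum(online.values())
    if total == 0:
        return [0.0] * len(ap_order)  # zero vector for an off-line user
    return [online[ap] / total for ap in ap_order]

# The example from the slide, with times expressed in hours of the day.
day = [("office", 10.0, 12.0), ("library", 13.5, 14.5), ("class", 18.0, 20.0)]
print(association_vector(day, ["library", "office", "class"]))
```

The user is online for 5 hours in total, 1 at the library and 2 each at the office and in class, which gives the (0.2, 0.4, 0.4) vector shown on the slide.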

Q1. User Association Consistency • User i is consistent if its daily association vectors can be grouped into a few clusters (e.g., fewer than 10% of the number of days). • Evaluation: use hierarchical clustering with the Manhattan distance measure (L1) • The distance between two vectors is at most 2, because the elements of each vector are non-negative and sum to at most 1.

Q1. User Association Consistency • Hierarchical Clustering • Start: Each vector is a single-member cluster. • Recursion: The two closest clusters are merged. • End: Stop when all remaining clusters are farther apart than a threshold
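The three steps can be sketched in plain Python as below. The sketch uses Manhattan distance between vectors and complete link between clusters (the distance between the furthest members), which is the linkage named on the results slide; the threshold 0.9 is the cut-off quoted there. The function names and the sample vectors are hypothetical, and the quadratic search for the closest pair is kept for readability rather than speed.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def complete_link(c1, c2):
    """Cluster distance = distance between the two furthest members."""
    return max(manhattan(u, v) for u in c1 for v in c2)

def cluster_vectors(vectors, threshold):
    clusters = [[v] for v in vectors]  # start: one cluster per vector
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # end: remaining clusters are too far apart
            break
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

days = [
    [0.2, 0.4, 0.4],
    [0.25, 0.35, 0.4],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 0.0],  # an off-line day
]
print(len(cluster_vectors(days, threshold=0.9)))
```

In this sample the two mixed days merge, the two days spent mostly at the first AP merge, and the off-line day stays alone because its distance to any online day is 1, above the cut-off.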

Q1. User Association Consistency • [Figure: distribution of the number of clusters under cut-off threshold 0.9] • 80% of users show at most 9 clusters of “behavior modes” during the 94-day trace • *Complete link: distance between clusters = distance between the furthest components in the considered clusters • Observation: many users are multimodal, but with far fewer association modes than the total number of days in the trace period.

Q2. Summarizing user associations • Association matrix: concatenate the user's daily association vectors for all days into a matrix. • To summarize, perform SVD and store the top-k singular values/vectors. • What value of k do we have to use for a good representation of the matrix? • Captured matrix power = [formula not preserved] • How much is the reconstruction error? • Matrix norms ||X - X_k||_p / ||X||_p, where [definition not preserved]

Q2. Summarizing user associations • Only the top 6 singular vectors are needed to capture at least 90% of the power for more than 95% of association matrices • The reconstruction error of the low-rank approximation is low (5 singular vectors give error < 0.05) • Observation: although users are multi-modal, a few major modes dominate their behavior
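A small NumPy sketch of the summarization step follows. The captured-power formula did not survive in this transcript, so the sketch assumes the common definition: the sum of the squared top-k singular values divided by the sum of all squared singular values. The reconstruction error is the relative norm ||X - X_k|| / ||X||, computed here with the Frobenius norm as one possible choice of matrix norm. The function names and the sample matrix are hypothetical.

```python
import numpy as np

def captured_power(X):
    """Fraction of power captured by the top-k singular values, for k = 1, 2, ...
    Assumed definition: cumulative sum of sigma_i^2 over the total."""
    s = np.linalg.svd(X, compute_uv=False)
    power = s ** 2
    return np.cumsum(power) / power.sum()

def rank_k_approximation(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def relative_error(X, k, norm="fro"):
    """||X - X_k|| / ||X|| for the chosen matrix norm."""
    return np.linalg.norm(X - rank_k_approximation(X, k), norm) / np.linalg.norm(X, norm)

# Hypothetical association matrix: one row per day, one column per AP.
X = np.array([
    [0.2, 0.4, 0.4],
    [0.2, 0.4, 0.4],
    [0.2, 0.4, 0.4],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
print(captured_power(X))
print(relative_error(X, 1), relative_error(X, 2))
```

The sample user has only two distinct daily patterns, so two singular vectors reproduce the matrix up to rounding. An all-zero matrix (a user who is never online) would need to be filtered out before calling these functions, since the total power is then zero.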


Q3. Similarity Metrics between Users • Naive method to compare similarity between users i and j: • Intuition: if for every daily association vector of i there is a similar association vector of j, then (i, j) have similar behavior. • For user i, pick its association vector a_i,d on day d. • Find the association vector of user j, denoted a_j,d', that is nearest to a_i,d. • Average |a_j,d' - a_i,d| over all days d. • Drawback: expensive • O(nd^2) for each pair • Many file reads of raw data for a large dataset • Need a faster method that reads summaries
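The naive method can be written down directly, as in the sketch below. It assumes Manhattan distance for |a_j,d' - a_i,d|, consistent with the L1 measure used for clustering; the function names and sample data are hypothetical.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def naive_distance(days_i, days_j):
    """Average, over the days of user i, of the distance to the nearest
    daily association vector of user j. Costs O(n * d^2) for n APs and d days."""
    total = 0.0
    for a in days_i:
        total += min(manhattan(a, b) for b in days_j)
    return total / len(days_i)

user_i = [[1.0, 0.0], [0.0, 1.0]]
user_j = [[1.0, 0.0]]
print(naive_distance(user_i, user_j))  # 1.0
print(naive_distance(user_j, user_i))  # 0.0
```

As the two printed values show, the measure is not symmetric when written this way: every day of user j has a match in user i, but not the other way around. Averaging the two directions is one way to symmetrize it.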

Q3. Similarity Metrics between Users • Compare the similarity of the singular vectors obtained from SVD. • Similarity between users is determined by weighted inner products of their singular vectors. • w_i = proportion of power captured by singular vector i • D(U,V) = 1 - Sim(U,V) • Are the 2 metrics similar? • 0.911 correlation coefficient for studied users.
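The exact Sim(U,V) formula is not preserved in this transcript. One plausible form of a weighted inner product, assumed here, is Sim(U,V) = sum over i and j of w_ui * w_vj * |U_i . V_j|, where U_i and V_j are the top singular vectors of the two users and the weights are their proportions of captured power. The NumPy sketch below implements that assumed form; the function names, the choice k=2, and the sample matrices are hypothetical.

```python
import numpy as np

def top_modes(X, k):
    """Rows of X are daily association vectors.
    Returns the top-k right singular vectors and their power proportions."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    power = s ** 2
    return Vt[:k], power[:k] / power.sum()

def similarity(X_u, X_v, k=2):
    """Assumed form: sum_i sum_j w_ui * w_vj * |U_i . V_j|."""
    U, w_u = top_modes(X_u, k)
    V, w_v = top_modes(X_v, k)
    # The absolute value removes the arbitrary sign of singular vectors.
    return float(w_u @ np.abs(U @ V.T) @ w_v)

def distance(X_u, X_v, k=2):
    return 1.0 - similarity(X_u, X_v, k)

always_ap0 = np.array([[1.0, 0.0, 0.0]] * 3)
always_ap2 = np.array([[0.0, 0.0, 1.0]] * 3)
print(similarity(always_ap0, always_ap0))
print(similarity(always_ap0, always_ap2))
```

Only the stored summaries (k vectors and k weights per user) are needed for each comparison, which is what makes this cheaper than re-reading the raw daily vectors for every pair.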

Q3. Similarity Metrics between Users • Are we able to get clusters with similar users? • Compare the PDF/CDF for inter- and intra- cluster users (Example: 200 clusters).

Q3. Similarity Metrics between Users • Take users in the same cluster, concatenate their association matrices, perform SVD, and find the power captured by the top k eigenvectors. • Also take random users, concatenate the eigenvectors, and do the same. • There is a clear distinction between the 2 clustering methods. • *Straight-forward = similarity decided based on pair-wise comparison of association vectors • *Feature-based = similarity decided based on singular vectors

Q3. Similarity Metrics between Users • For all clusters, use a scatter plot to show the power captured by top-4 eigenvectors. (distance-based cluster vs random cluster)
