User Behavior Analysis in Wi-Fi network Anna Rosenberg Supervisor: Orly Avner
Overview • The goal of this project: • to analyze a Wi-Fi network’s APs • to model the wireless clients using the network • The contributions of this project: • analysis of Access Points • the use of k-means and g-means algorithms for clustering the network’s users
Previous Work • "Modeling client arrivals at access points in wireless campus-wide networks" (Maria Papadopouli, Haipeng Shen, Manolis Spanakis) • models the arrival processes of clients at APs as time-varying Poisson processes with different arrival-rate functions • analyzes the traffic load characteristics (e.g., bytes, number of packets, associations, distinct clients, type of clients) • clusters the APs based on their visit arrivals and on the building type
Previous Work • "Characterizing user behavior and network performance in a public wireless LAN", in Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 2002 (Anand Balachandran, Geoffrey Voelker, Paramvir Bahl, and Venkat Rangan) • Their overall analysis of user behavior shows that: • Users are evenly distributed across all APs, and user arrivals are correlated in time and space • User arrivals into the network can be modeled as a two-state Markov-Modulated Poisson Process (MMPP) • There is an implicit correlation between session duration and average data rate: longer sessions typically have very low data requirements, and most of the sessions with a high average data rate are very short
Previous Work • "Modeling users' mobility among Wi-Fi access points" (Minkyong Kim, David Kotz) • Network messages were collected on the Dartmouth campus • Models user movements between APs • Clusters the APs based on their peak hour
Data • Router (Sniffer) • Packets: • MAC address of the access points • MAC address of the user • Source/Destination IP addresses • Size of the packet • The time it was received
IEEE 802.11 Architecture • Cells (each called a Basic Service Set, or BSS) • Base station (called an Access Point, or AP for short) • Access Points are connected through a backbone (called the Distribution System, or DS) • The examined network: 16 APs
Arrival Rate at APs • AP1, AP8: • Active from midday till the evening • Active only in the evening
Users • 3273 users • The transmission rate:
Coherence with the time of lectures and breaks • Users are active during the breaks and not active during the lectures that last 50-55 minutes.
Visit duration • How to define a visit? We chose 30 minutes as a maximal inter-arrival time between two packets that can be considered as packets of one visit.
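The 30-minute rule can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_visits`, the use of timestamps in seconds, and the sample values are assumptions, not taken from the project.

```python
def split_into_visits(timestamps, max_gap_s=30 * 60):
    """Group packet timestamps (in seconds) into visits.

    A packet joins the current visit when it arrives at most
    max_gap_s seconds after the previous packet; otherwise it
    starts a new visit.
    """
    visits = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][-1] <= max_gap_s:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits


# Hypothetical packet arrival times of one user, in seconds.
packet_times = [0, 600, 1500, 4000, 4100, 9000]
visits = split_into_visits(packet_times)
# Visit duration = time between the first and last packet of the visit.
durations = [v[-1] - v[0] for v in visits]
print(visits)
print(durations)
```

With this definition a visit that consists of a single packet has duration 0, so a real analysis may want to treat such visits separately.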
Features • The average characteristics: • Average visit duration • Average inter-arrival times between the visits • Average traffic • Number of visits • Total number of days in the system • The std of inter-arrival times • The std of traffic • The std of visit duration
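As a rough sketch, the per-user features could be computed from a list of visits as follows. The tuple layout `(start_s, end_s, bytes)`, the helper names, and the sample numbers are hypothetical. The inter-arrival time is assumed to be measured from the start of one visit to the start of the next, and "total number of days in the system" is left out because its exact definition is not given.

```python
from statistics import mean, stdev


def spread(values):
    """Sample standard deviation, or 0.0 when fewer than two values exist."""
    return stdev(values) if len(values) > 1 else 0.0


def user_features(visits):
    """visits: list of (start_s, end_s, bytes) tuples sorted by start time."""
    durations = [end - start for start, end, _ in visits]
    traffic = [size for _, _, size in visits]
    # Assumption: inter-arrival time = start of a visit minus start of the previous one.
    gaps = [b[0] - a[0] for a, b in zip(visits, visits[1:])]
    return {
        "n_visits": len(visits),
        "avg_duration": mean(durations),
        "avg_inter_arrival": mean(gaps) if gaps else 0.0,
        "avg_traffic": mean(traffic),
        "std_duration": spread(durations),
        "std_inter_arrival": spread(gaps),
        "std_traffic": spread(traffic),
    }


# Hypothetical visits of one user: (start, end, bytes transferred).
example_visits = [(0, 600, 1000), (3600, 4200, 3000), (10800, 11000, 2000)]
features = user_features(example_visits)
print(features)
```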
Features • No typical clusters can be found among the network's users: • Av. inter-visit times vs. Number of visits • Av. inter-visit times vs. Av. visit duration
Features • Av. traffic vs. Av. visit duration • Av. traffic vs. Number of visits
Clustering • Unsupervised learning problem • Finding a structure in a collection of unlabeled data • Collection of objects which are “similar” • Distance measure
K-Means Clustering • Features: • Average visit duration • Average inter-visit times • Average traffic per packet • Maximal distance between visits • Minimal distance between visits
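For readers unfamiliar with k-means, here is a minimal sketch of the standard Lloyd iteration with Euclidean distance. It is not the project's implementation: the initialization (first k points), the function name, and the two-dimensional toy points are assumptions chosen to keep the example short and deterministic.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical feature vectors, e.g. (avg visit duration, avg inter-visit time).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the distance is Euclidean, features on very different scales (seconds vs. bytes) would dominate the assignment step unless they are rescaled first.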
Results of K-Means Clustering • K=2 • Av. inter-visit times vs. Av. traffic per packet • Av. visit duration vs. Av. inter-visit times • Max. distance between visits vs. Min. distance between visits
Results of K-Means Clustering • K=3 • Av. inter-visit times vs. Av. traffic per packet • Av. visit duration vs. Av. inter-visit times • Max. distance between visits vs. Min. distance between visits
Results of K-Means Clustering • K=4 • Av. inter-visit times vs. Av. traffic per packet • Av. visit duration vs. Av. inter-visit times • Max. distance between visits vs. Min. distance between visits
K-Means Clustering: conclusion • The k-means clustering algorithm, based on average characteristics of the network's users, does not produce any isolated clusters. We therefore conclude that clustering on average characteristics does not separate the network's users well. • Possible reasons for the unsuccessful clustering: • The feature set doesn't provide enough information about the system • Not enough samples • The use of Euclidean distance
G-Means Clustering Algorithm • The right number k of clusters to use is often not obvious • G-means chooses k automatically, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution • α, the standard statistical significance level, is the desired probability of incorrectly splitting a cluster
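The Gaussianity check at the heart of G-means can be sketched as below. The slides do not name the test; this sketch assumes the Anderson-Darling statistic used in the original G-means paper, and the critical value 1.8692 for α = 0.0001 is likewise an assumption recalled from that paper. Only the one-dimensional test is shown: the surrounding loop that splits a cluster in two and projects its points onto the line between the two new centers is omitted, and all names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def adjusted_anderson_darling(values):
    """Anderson-Darling statistic for normality with estimated mean and std."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    std_normal = NormalDist()
    # Standardize, map through the normal CDF, and sort.
    z = sorted(std_normal.cdf((v - mu) / sigma) for v in values)
    eps = 1e-12  # guard against log(0) for extreme values
    z = [min(max(p, eps), 1 - eps) for p in z]
    s = sum(
        (2 * i + 1) * (math.log(z[i]) + math.log(1 - z[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    # Small-sample correction for the case where mean and std are estimated.
    return a2 * (1 + 4 / n - 25 / n ** 2)


CRITICAL_VALUE = 1.8692  # assumed critical value for alpha = 0.0001


def looks_gaussian(values, critical=CRITICAL_VALUE):
    """True means: do not split this cluster any further."""
    return adjusted_anderson_darling(values) <= critical


# Deterministic toy data: normal quantiles, and a clearly bimodal mixture.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
bimodal = [q - 5 for q in gaussian_like[::2]] + [q + 5 for q in gaussian_like[::2]]
print(looks_gaussian(gaussian_like), looks_gaussian(bimodal))
```

A smaller α corresponds to a larger critical value, so fewer clusters are split and the final number of clusters drops.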
G-means • A different feature set provides more data points • Each point consists of the following components: • The visit duration • The time between this visit and the previous visit • The number of packets that were sent during the visit • The average amount of data that was accessed during the visit • The data components are normalized to get proper results even with a simple Euclidean distance metric • The 50 users with the most visits: 3457 points • Users with more than 10 visits: 572 users, 15105 points
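The slides do not say which normalization was used. One common choice, assumed here purely for illustration, is a per-component z-score, so that each of the four components contributes on a comparable scale to the Euclidean distance. The function name and the sample visit points are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Rescale every column to zero mean and unit (population) std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has std 0; divide by 1.0 instead to avoid errors.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical visit points:
# (duration s, time since previous visit s, packets, avg data per packet).
visit_points = [
    [600, 3600, 40, 1200.0],
    [60, 86400, 5, 300.0],
    [1800, 7200, 200, 900.0],
]
normalized = zscore_columns(visit_points)
for row in normalized:
    print([round(v, 3) for v in row])
```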
G-means • The dependence of the number of clusters on α:
G-means results • 70 clusters • α = 0.0001 • 58 visits, 8788 packets, 30 clusters; the most common clusters: 11, 20, 29 and 35 • 59 visits, 28777 packets, 31 clusters; the most common cluster: 30
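Evaluation • Purity: purity(Ω, C) = (1/N) × Σ_k max_j |ω_k ∩ c_j| • Ω = {ω_1, …, ω_K} - the set of clusters • C = {c_1, …, c_J} - the set of classes • N - total number of samples • Example: the majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ◊, 3 (cluster 3). With N = 17 samples in total, purity is (1/17)×(5+4+3)≈0.71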
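The purity computation from the example can be reproduced with a short Python sketch. Each cluster is given as a list of class labels; the slide only states the majority counts (5, 4, 3) and the total of 17 samples, so the minority members filled in below are an assumption made to reach that total.

```python
from collections import Counter


def purity(clusters):
    """clusters: list of lists of class labels, one inner list per cluster."""
    total = sum(len(cluster) for cluster in clusters)
    # For each cluster, count the members of its majority class.
    majority = sum(Counter(cluster).most_common(1)[0][1] for cluster in clusters)
    return majority / total


# Majority counts follow the example; the minority members are assumed.
clusters = [
    ["x"] * 5 + ["o"],          # cluster 1: majority x, 5
    ["o"] * 4 + ["x", "◊"],     # cluster 2: majority o, 4
    ["◊"] * 3 + ["x"] * 2,      # cluster 3: majority ◊, 3
]
score = purity(clusters)
print(round(score, 2))
```

Purity grows toward 1 as the number of clusters increases, which is why it should be read together with the number of clusters produced for each α.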
Evaluation • The dependence of the purity on α
Evaluation • New evaluation measure: the extent to which each user can be represented by one typical cluster • E = (1/N) × Σ_i (m_i / n_i) • N - total number of users • m_i - number of samples contained in the most common cluster of user i • n_i - total number of samples of user i • Example: There are 3 users: x, o and ◊. The number of samples contained in the most common cluster and the total number of samples for the three users are: 5, 8 (user x); 4, 5 (user o); and 3, 4 (user ◊). E=(1/3)×(5/8 + 4/5 + 3/4)≈0.725
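The measure E from the example can be checked with a small Python sketch. The input format (a dictionary from user to the cluster id of each of that user's samples), the function name, and the specific cluster ids are hypothetical; only the counts 5 of 8, 4 of 5 and 3 of 4 come from the example.

```python
from collections import Counter


def typical_cluster_score(assignments):
    """assignments: dict mapping user -> list of cluster ids, one per sample."""
    ratios = []
    for cluster_ids in assignments.values():
        # Size of the user's most common cluster over the user's sample count.
        most_common_count = Counter(cluster_ids).most_common(1)[0][1]
        ratios.append(most_common_count / len(cluster_ids))
    return sum(ratios) / len(ratios)


# Cluster ids are made up; only the counts match the example.
assignments = {
    "x": [1] * 5 + [2] * 2 + [3],  # 5 of 8 samples in the most common cluster
    "o": [2] * 4 + [1],            # 4 of 5
    "◊": [3] * 3 + [2],            # 3 of 4
}
E = typical_cluster_score(assignments)
print(round(E, 3))
```

E equals 1 only when every user's samples fall into a single cluster, so low values indicate that users are spread over many clusters.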
Evaluation • The dependence of the evaluation measure E on α
G-Means Clustering: conclusion • The g-means clustering algorithm, based on points consisting of the 4 per-visit characteristics described earlier, cannot represent each user by one typical cluster. We therefore conclude that this algorithm does not cluster the network's users well. • Possible reasons for the unsuccessful clustering: • The feature set doesn't provide enough information about the system • Not enough samples • The use of Euclidean distance
Conclusions • The Access Points' arrival rate is coherent with the times of lectures and breaks. • The APs show low activity during the lectures and high activity during the breaks. • The k-means clustering algorithm, based on average characteristics of the network's users, does not produce any isolated clusters. We therefore conclude that clustering on average characteristics does not separate the network's users well. • The g-means clustering algorithm, based on points consisting of the 4 per-visit characteristics described earlier, cannot represent each user by one typical cluster. We therefore conclude that this algorithm does not cluster the network's users well.
Future work • Select another subset of features • Use another clustering algorithm • Try to collect more data samples