Privacy-Preserving Dynamic Learning of Tor Network Traffic

Adhirath Kapoor 931526192

BACKGROUND

Anonymity-Providing Communication System

How Tor Works?

MOTIVATION

Current Research
• Current research focuses only on the single file model.
• This model is limited to a single file download and fails to capture the following:
  - Tor efficacy
  - Tor protocol dynamics
  - Website structural dependencies (embedded objects)
  - Tor network characteristics

Research Question
• How can we generate more accurate traffic flows for use in Tor experimentation tools and research?

SOLUTION

Contribution/Solution
• Measuring Tor
• Learning Tor Traffic
• Building Tor Traffic
• Evaluating Tor Traffic

Measuring Tor

Measuring Tor
• Measurement is done with PrivCount, an open-source, privacy-preserving distributed measurement system.
• PrivCount has three components:
  - Tally Server
  - Share Keeper
  - Data Collector
• PrivCount offers the following security properties:
  - Forward secrecy
  - Measurement aggregation
  - Private measurement results
• The primary goal of measuring Tor was to improve on the accuracy of previously reported PrivCount measurement results.
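The split into data collectors, share keepers and a tally server can be illustrated with additive secret sharing, one common way to get the "measurement aggregation" property. The sketch below assumes that scheme; the modulus, function names and sample counts are hypothetical, any noise added for private results is omitted, and it is not PrivCount's actual code.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical modulus chosen for this sketch


def blind_counter(true_count, num_share_keepers):
    """Data collector: hide a count behind one random share per share keeper."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_share_keepers)]
    blinded = (true_count + sum(shares)) % MODULUS
    return blinded, shares


def tally(blinded_values, share_sums):
    """Tally server: add the blinded counters, then remove the share keepers' sums."""
    return (sum(blinded_values) - sum(share_sums)) % MODULUS


# Three data collectors (relays) and two share keepers; counts are made up.
counts = [12, 7, 30]
blinded_values = []
keeper_sums = [0, 0]
for count in counts:
    blinded, shares = blind_counter(count, num_share_keepers=2)
    blinded_values.append(blinded)
    for i, share in enumerate(shares):
        # Each share keeper only ever sees random shares, never a raw count.
        keeper_sums[i] = (keeper_sums[i] + share) % MODULUS

aggregate = tally(blinded_values, keeper_sums)
print(aggregate == sum(counts))
```

In this arrangement the tally server learns only the network-wide total, and no single party sees an individual relay's count.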

Measuring Tor (continued)
• Entry relays can observe clients and circuits, and hence the distribution of circuits per client.
• Exit relays can observe streams.

Measurement Process
• Two phases:
  - General Measurement
  - Traffic Model Measurement
• General Measurement targets accuracy and gives insight into the most important traffic classes.
• Traffic Model Measurement focuses on measuring the stream and packet hidden Markov traffic models.
  - Uses exit-based observations (why?)

Observations (contd.)
• Active circuit: a circuit on which eight or more cells were sent.
• Active client: a client with at least one active circuit.
• 55 to 59 percent of active circuits carry only one or two streams, and about 21 to 25 percent carry 3 to 6 streams.
• Another 9 to 11 percent of circuits carry 7 to 14 streams, and about 4 to 11 percent carry 15 or more streams.
• This suggests that Tor Browser users may experience the web differently than non-Tor users.
• A single circuit per client is the most common, followed by two, three, or four circuits per client.
• Most Tor users use only a handful of circuits in an average 10-minute period.
• Many Tor Browser users browse the web only lightly.
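The two definitions above can be written as a small classification rule. This is a minimal sketch: the record layout and client identifiers are hypothetical, used only for illustration, and a privacy-preserving system would not keep such identifiers in the clear.

```python
from collections import defaultdict

ACTIVE_CELL_THRESHOLD = 8  # from the slide: eight or more cells


def is_active_circuit(cell_count):
    """A circuit is active if eight or more cells were sent on it."""
    return cell_count >= ACTIVE_CELL_THRESHOLD


def active_circuits_per_client(circuits):
    """Map each active client to its number of active circuits.

    `circuits` is an iterable of (client_id, cell_count) pairs (hypothetical layout).
    Clients with no active circuit are not active and are left out.
    """
    per_client = defaultdict(int)
    for client_id, cell_count in circuits:
        if is_active_circuit(cell_count):
            per_client[client_id] += 1
    return dict(per_client)


# Made-up sample: client "c" never reaches the threshold, so it is not active.
sample = [("a", 3), ("a", 12), ("b", 8), ("b", 40), ("c", 1)]
result = active_circuits_per_client(sample)
print(result)
```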

Learning Tor Traffic

Learning Tor Traffic
• In this case, exits are reliable (why?)
• They can observe:
  - Stream model events
  - Packet model events
  - Both stream and packet models
• So PrivCount’s observation of exit relays is essential for learning both HMMs of Tor traffic.

HMM
• A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobservable (i.e. hidden) states.
• A Markov process is a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
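In symbols, the two assumptions can be stated as follows. The notation is introduced here only for illustration and is not taken from the slides: s_t is the hidden state at step t, o_t the observation, π the initial distribution, a the transition probabilities and b the emission probabilities.

```latex
% Markov property: the next state depends only on the current state
P(s_t \mid s_{t-1}, s_{t-2}, \dots, s_1) = P(s_t \mid s_{t-1}) = a_{s_{t-1} s_t}

% Hidden states: each observation depends only on the state that emitted it
P(o_t \mid s_1, \dots, s_t, o_1, \dots, o_{t-1}) = P(o_t \mid s_t) = b_{s_t}(o_t)

% Joint probability of a state path and its observations
P(o_{1:T}, s_{1:T}) = \pi_{s_1} \, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} \, b_{s_t}(o_t)
```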

Measurement Process
• Step 1: Initialize the Markov model.
• Step 2: Observe Tor traffic and use PrivCount to count.
• Step 3: Update the model.
• Measure the HMM path with PrivCount:
  - Observe inter-stream delays
  - Find the most likely HMM path (Viterbi)
  - Count HMM frequencies
  - Update HMM probabilities using a weight parameter

Observations
For both the stream arrival model and the packet model:
  - Later models generally fit new data better than the earliest models.
  - Potentially anomalous measurement periods can cause some iterations to produce inferior models, which then continue to improve in following iterations.
  - Using the results of these measurements, the authors chose the best-performing stream and packet models as the basis for Shadow experiments.
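One iteration of this loop can be sketched as: find the most likely path with Viterbi, count transition frequencies along it, then blend them into the model. Everything concrete here is an assumption for illustration: the two states, the discretized "short"/"long" delays, the probabilities, and the linear blend `(1 - weight) * old + weight * observed` standing in for the weight parameter. The sketch also works on raw observations, whereas the measured system only sees privately aggregated counts.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path (log-space; assumes no zero probabilities)."""
    # best[t][s] = (log-prob of the best path ending in s at step t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for o in obs[1:]:
        prev, row = best[-1], {}
        for s in states:
            score, arg = max((prev[r][0] + math.log(trans_p[r][s]), r) for r in states)
            row[s] = (score + math.log(emit_p[s][o]), arg)
        best.append(row)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


def transition_frequencies(path, states):
    """Relative frequency of each transition along the path (None if a state is unseen)."""
    counts = {r: {s: 0 for s in states} for r in states}
    for r, s in zip(path, path[1:]):
        counts[r][s] += 1
    freqs = {}
    for r in states:
        total = sum(counts[r].values())
        freqs[r] = {s: counts[r][s] / total for s in states} if total else None
    return freqs


def update_transitions(trans_p, freqs, weight):
    """Assumed update rule: weighted blend of old probabilities and observed frequencies."""
    return {
        r: dict(row) if freqs[r] is None
        else {s: (1 - weight) * row[s] + weight * freqs[r][s] for s in row}
        for r, row in trans_p.items()
    }


# Hypothetical two-state model over discretized inter-stream delays.
states = ["burst", "idle"]
start_p = {"burst": 0.6, "idle": 0.4}
trans_p = {"burst": {"burst": 0.7, "idle": 0.3}, "idle": {"burst": 0.4, "idle": 0.6}}
emit_p = {"burst": {"short": 0.9, "long": 0.1}, "idle": {"short": 0.2, "long": 0.8}}
obs = ["short", "short", "long", "long", "short"]

path = viterbi(obs, states, start_p, trans_p, emit_p)
updated = update_transitions(trans_p, transition_frequencies(path, states), weight=0.1)
print(path)
print(updated)
```

A small weight keeps the model stable across measurement rounds; a larger one lets it follow recent traffic more closely.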

Building Tor Traffic

Traffic Generator & Models
• TGen (traffic generator)
  - Based on action dependency graphs.
  - Used for collecting performance benchmarks.
• Three TGen configs are created:
  - Protocol Model
  - PrivCount Model
  - Single File Model
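The idea of an action dependency graph can be illustrated generically: each action lists the actions that must finish before it runs, and a valid run order respects those edges. The sketch below is not TGen's configuration format or API; the action names are hypothetical and model a page fetch followed by its embedded objects.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical actions: each key maps to the set of actions it depends on.
dependencies = {
    "start": set(),
    "fetch_page": {"start"},
    "fetch_image": {"fetch_page"},   # embedded object, needs the page first
    "fetch_script": {"fetch_page"},  # embedded object, needs the page first
    "pause": {"fetch_image", "fetch_script"},
    "end": {"pause"},
}

# static_order() yields the actions so that every dependency precedes its dependents.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two embedded-object fetches have no edge between them, so either may come first; only their position relative to `fetch_page` and `pause` is fixed.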

Evaluating Tor Traffic

Evaluation
• Shadow is used to evaluate the whole four-step process.
• At this stage, all three TGen models are run in Shadow.
• The datasets Shadow usually relies on are about 5 years old, so RIPE Atlas data was used instead.

Results at 22.19.00
• The PrivCount model requires more computational and memory resources to run in Shadow than the single file and protocol models.
• Generator times are negligible compared with the time to send and receive network traffic.
• The generator produced fewer than 165 streams per circuit across all generated circuits.

CRITICISM

Critique
• In the first stage, the authors do not say which traffic classes turned out to be the most important in the general measurement phase.
• An HMM is a generative, probabilistic model and can represent only a small fraction of the distributions over the space of possible sequences.
• Because of their Markovian nature, HMMs do not take into account the sequence of states leading into any given state.
• All the algorithms used with HMMs are expensive in terms of memory and computation time.
• Shadow imitates system behaviour and network processes in order to run experiments on a single machine, which makes it heavily subject to delays.
• Middle relays are useful because they can observe differences between types of traffic, but they were not included in this research.

What could be a possible solution?
• Machine learning can be an answer.
• We can have a machine learning process with a learning phase followed by a prediction phase.
• This model has two advantages over the one proposed in the paper:
  - It can detect the type of service being used.
  - It can detect whether the user is actually using the network.
