Privacy-Preserving Dynamic Learning of Tor Network Traffic

Presented by Matthew Taylor. Original authors: Rob Jansen, Matthew Traudt, and Nicholas Hopper

Introduction

Background Knowledge - Tor • Tor stands for The Onion Router • An anonymous communication protocol • Uses a network of volunteer relays to encrypt traffic and route it through randomly chosen relays, hiding it from online surveillance • Each relay removes one layer of encryption and passes the message on to the next relay; no single relay knows both the original sender and the destination • 6,000 volunteer relays as of 2018 • 100 Gbit/s of traffic from over two million daily users

Background Knowledge - Terminology • Circuit - a path through the Tor network • Stream - a TCP connection carried over a circuit • Relay - a router/node in the Tor network

Motivation • Current Tor experimentation tools work, but their underlying models oversimplify Tor traffic • They currently model single-file (320 KiB) and bulk (5 MiB) downloads but neglect the inner workings of Tor, e.g. they do not consider circuits • The authors believe a more comprehensive model of Tor traffic can be created

Problem Gain a more comprehensive understanding of Tor traffic, without compromising the privacy of its users, in order to generate more accurate traffic in private Tor networks and simulations.

Solution

Measuring Tor • Uses the open-source tool PrivCount • PrivCount uses three types of nodes: one tally server, one or more share keepers, and one or more data collectors • Data collectors count events locally and add noise and blinding values, so an individual counter is unreadable on its own • Share keepers receive the blinding values and sum them • The tally server configures the other nodes, receives their results, and removes the blinding values; at the end it holds only the global noisy count of events • The PrivCount deployment used 1 tally server, 3 share keepers, and 17 data collectors • Minimum privacy standards were met
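The blinding idea on this slide can be illustrated with a small additive secret-sharing sketch. This is not PrivCount's actual code or API: the modulus `Q`, the function name `run_round`, the Gaussian noise, and its `sigma` are all assumptions chosen for illustration.

```python
import random

Q = 2**61 - 1  # modulus for blinded counters (assumed for this sketch)

def run_round(true_counts, num_share_keepers=3, sigma=5.0, seed=0):
    """Simulate one measurement round; true_counts holds one count per data collector."""
    rng = random.Random(seed)
    sk_sums = [0] * num_share_keepers  # each share keeper's running sum of shares
    blinded = []                       # what each data collector reports
    noisy_total = 0                    # kept only to check the result below

    for count in true_counts:
        noise = round(rng.gauss(0, sigma))  # hypothetical noise distribution
        noisy_total += count + noise
        # One random blinding share per share keeper.
        shares = [rng.randrange(Q) for _ in range(num_share_keepers)]
        blinded.append((count + noise + sum(shares)) % Q)
        for i, share in enumerate(shares):
            sk_sums[i] = (sk_sums[i] + share) % Q

    # Tally server: add the blinded counters, subtract the share keepers' sums.
    total = (sum(blinded) - sum(sk_sums)) % Q
    if total > Q // 2:  # map back to a signed value
        total -= Q
    return total, noisy_total, blinded

total, noisy_total, blinded = run_round([120, 45, 300])
print(total == noisy_total)  # the blinding cancels; only the noisy sum remains
```

Each value in `blinded` looks random on its own, and the per-collector counts are never revealed; only the sum over all collectors, including noise, is recovered.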

Measuring Tor Statistics Tor measurements can be taken at entry relays and exit relays: • Entry relays can observe circuits and clients but no stream metadata; this information helps in producing accurate Tor client models • Exit relays can observe streams, and therefore their metadata, but not circuit or client data

Tor Statistics Findings • 47% of clients are inactive at a given time • Only 6.5% of traffic was outbound to the wider Tor network • 73.5% of incoming traffic was to ports 80 and 443 • 55-59% of active circuits contain only one or two streams • 4-11% carry 15 or more streams • 70-80% of streams receive less than 16 KiB inbound • 75-85% of streams send less than 1 KiB outbound

Modelling Tor • Need to generate a model for the creation of streams and for the traffic on each stream • Modelled using a hidden Markov model (HMM) - a Markov model whose states are hidden, so only the outputs emitted by the states can be observed • Collect data to make assumptions about the form of the model, e.g. the number of states, the connectivity between states, and the form of the observation distributions
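To make the HMM idea concrete, here is a toy sketch that samples a traffic-like sequence from a two-state model. The state names, observation labels, and all probabilities are invented for illustration and are not taken from the paper.

```python
import random

# Hypothetical two-state model; every number here is an assumption.
TRANSITIONS = {
    "active": {"active": 0.7, "idle": 0.3},
    "idle": {"active": 0.4, "idle": 0.6},
}
EMISSIONS = {
    "active": {"send": 0.3, "receive": 0.7},
    "idle": {"pause": 1.0},
}

def pick(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def sample_traffic(n, seed=0):
    """Return (hidden states, observations) for n steps of the toy HMM."""
    rng = random.Random(seed)
    state = "active"
    hidden, observed = [], []
    for _ in range(n):
        hidden.append(state)
        observed.append(pick(EMISSIONS[state], rng))  # what an observer sees
        state = pick(TRANSITIONS[state], rng)         # hidden state change
    return hidden, observed

hidden, observed = sample_traffic(10)
print(observed)
```

A measurement point would see only `observed`; the `hidden` list is what training has to infer.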

Training the Model • An HMM is usually trained on a single dataset • To preserve privacy this could not be done, as the data has to change every 10 minutes • Instead, PrivCount supplies fresh data for every training iteration, in the hope that the model converges
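The effect of training on fresh, noisy data in every iteration can be pictured with a small simulation. This is only a sketch under stated assumptions, not the authors' training procedure: `TRUE_PROBS`, the number of events per round, and the Gaussian noise are hypothetical stand-ins for what a private measurement round might return.

```python
import random

TRUE_PROBS = [0.6, 0.3, 0.1]  # hypothetical real distribution, never seen directly

def measure_round(rng, n_events=2000, sigma=30.0):
    """One round: count freshly sampled events, then add noise to each counter."""
    counts = [0] * len(TRUE_PROBS)
    for idx in rng.choices(range(len(TRUE_PROBS)), weights=TRUE_PROBS, k=n_events):
        counts[idx] += 1
    # Noisy counters can go negative, so clamp at zero before normalizing.
    return [max(0.0, c + rng.gauss(0, sigma)) for c in counts]

def estimate(counts):
    """Turn noisy counts into a probability estimate."""
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]

def train(rounds=5, seed=0):
    """Re-estimate the probabilities from a new noisy measurement each round."""
    rng = random.Random(seed)
    return [estimate(measure_round(rng)) for _ in range(rounds)]

for i, probs in enumerate(train()):
    print(i, [round(p, 3) for p in probs])
```

Because both the sampled events and the noise change every round, the estimates move from round to round instead of settling on one fixed value.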

Evaluating Models • Single file model - repeatedly downloads a file and sometimes pauses to simulate traffic; widely used but overly simplified • Protocol model - a web model that generates realistic traffic from HTTP archives, and a bulk model that uses a dataset of torrents to simulate real torrent traffic • PrivCount model - the model generated from all the data collected previously

Evaluation • Used the TGen traffic generation tool to generate traffic, Tor to route it, and the Shadow simulator to test all models • PrivCount was used again to measure the results

Results • The cumulative percentage distance from ground truth across all nine single counters was 703% for the single file model, 1001% for the protocol model, and 408% for the PrivCount model, meaning the PrivCount model matched Tor most closely but was still considerably off • The cumulative percentage distance from ground truth across the six histogram counters was 150% for the single file model, 56% for the protocol model, and 95% for the PrivCount model, so the protocol model was closest on these counters • The PrivCount model was more computationally expensive
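The slides do not define the metric, so the sketch below assumes that "cumulative percentage distance" means the sum, over all counters, of the absolute difference from the ground-truth value expressed as a percentage of that value. The function name and the example numbers are hypothetical.

```python
def cumulative_percentage_distance(truth, model):
    """Sum of per-counter |model - truth| / truth, expressed in percent."""
    if len(truth) != len(model):
        raise ValueError("truth and model must have the same number of counters")
    return sum(abs(m - t) / t * 100.0 for t, m in zip(truth, model))

# Made-up counters: the model is 50% off on each of the two counters.
truth = [100, 200]
model = [150, 100]
print(cumulative_percentage_distance(truth, model))  # 100.0
```

Under this reading, a total above 100% does not mean a single counter is wildly off; it can simply be several moderate errors added together.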

Criticism

What was Good • All things considered, a good paper overall • Took care to preserve user privacy • Managed to produce an improved model

What was not so Good • No direct measurements could be taken, so as not to compromise anonymity • The HMM was trained on multiple datasets via PrivCount instead of one, and as a result it did not converge but oscillated • The choice of an HMM: an HMM is one of the simplest Bayesian models, so perhaps Tor traffic should have been modelled with a more complex model • Resorted to simulations instead of real experiments
