Uncovering Functional Networks in Internet Traffic Mark Meiss September 25, 2006
Who am I? Mark Meiss • Ph.D. candidate in Computer Science • Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly • Researcher at the Advanced Network Management Laboratory (ANML) • http://anml.iu.edu/
What’s the agenda? The subject of today’s story: • Finding a way to improve security without compromising user privacy • A case study in applied network science This work is done with Filippo Menczer and Alessandro Vespignani.
What do people do online? There’s what we imagine… • surfing • sending email • playing games
What do people do online? And there’s what is actually happening… • file sharing • worms & viruses • porn
Not just a value judgment These applications all affect the health of a data network. There are legal problems, yes; but also… • Crowding out other applications. • (Napster was once over 70% of all IUB traffic) • Compromised computers are used to launch further attacks. • “Common nuisances” are on the ’Net as well.
The bottom line Network administrators need to be able to identify what applications are being used on the network. …but this can be very difficult.
A crash course in data networks We’ll use a running example: • Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.
Quick summary • Each network conversation is identified by four pieces of information • Client address and port number • Server address and port number • The server uses a well-known port number • The client uses an ephemeral port number
So why is it hard to identify applications? • Well-known ports are a convention, not a rule • Web, e-mail, etc. do have ports assigned by the IANA • BitTorrent, Gnutella, Napster, etc. do not • Client and server ports share the same namespace • In practice… • Any application can use any pair of port numbers • Our focus: discovering what application is running on a port with no assigned use.
The conventional solution Let’s look inside all of those packets!
Another problem • Packet inspection doesn’t scale • Modern high-speed networks run at 10 gigabits per second or faster (that’s one full DVD every few seconds) • General-purpose computers can’t even copy that data in real time
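As a rough check on the "one full DVD every few seconds" figure, the arithmetic can be sketched as follows. The 4.7 GB single-layer DVD capacity is an assumption; the slide does not say which kind of DVD is meant.

```python
# Assumption: a single-layer DVD holds about 4.7 GB (decimal gigabytes).
DVD_BYTES = 4.7e9
# From the slide: the link runs at 10 gigabits per second.
LINK_BITS_PER_SECOND = 10e9


def seconds_to_transfer(num_bytes: float, bits_per_second: float) -> float:
    """Time needed to move num_bytes over a link running at full rate."""
    return num_bytes * 8 / bits_per_second


seconds_per_dvd = seconds_to_transfer(DVD_BYTES, LINK_BITS_PER_SECOND)
# Bytes an inspecting machine would have to copy every second.
bytes_per_second = LINK_BITS_PER_SECOND / 8

print(f"{seconds_per_dvd:.2f} s per DVD")
print(f"{bytes_per_second / 1e9:.2f} GB/s to copy and inspect")
```

Under that assumption, a saturated link delivers a DVD's worth of data in under four seconds, which is the load a packet inspector would have to sustain.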
Introducing the “flow” • We can summarize Buddy’s Web surfing as two flows: • 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes) • 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
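A minimal sketch of how individual packets might be rolled up into the two flows above. The `FlowKey` type, the `summarize` helper, and the individual packet sizes are hypothetical; the sizes are invented so that the totals match the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The four values that identify one direction of a conversation."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int


def summarize(packets):
    """Sum the bytes of all packets that share the same directed key."""
    flows = defaultdict(int)
    for src_addr, src_port, dst_addr, dst_port, size in packets:
        flows[FlowKey(src_addr, src_port, dst_addr, dst_port)] += size
    return dict(flows)


# Hypothetical packets: one request, then a response split over 44 packets.
request = ("192.168.65.33", 13029, "10.99.205.122", 80, 456)
full_response = ("10.99.205.122", 80, "192.168.65.33", 13029, 1460)
last_response = ("10.99.205.122", 80, "192.168.65.33", 13029, 431)
packets = [request] + [full_response] * 43 + [last_response]

flows = summarize(packets)
for key, total in flows.items():
    print(f"{key.src_addr}:{key.src_port} to "
          f"{key.dst_addr}:{key.dst_port} ({total:,} bytes)")
```

Each direction gets its own record, which is why one web request shows up as two flows.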
Where do flows come from? • Architectural features of Internet routers allow them to export flow data • Routers can’t summarize all the data • Packets are sampled to construct the flows • Typical sampling rate is around 1:100
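To illustrate what 1:100 sampling implies for byte counts, here is a small sketch that assumes simple every-Nth-packet sampling and scales the sampled total back up. Real routers may sample differently, and the helper names and traffic mix are hypothetical.

```python
SAMPLING_INTERVAL = 100  # keep 1 packet out of every 100, as on the slide


def sample_every_nth(packet_sizes, n):
    """Keep every n-th packet, starting with the first one."""
    return packet_sizes[::n]


def estimate_bytes(sampled_sizes, n):
    """Scale the sampled byte count up to estimate the true total."""
    return sum(sampled_sizes) * n


# Hypothetical traffic: 10,000 large packets followed by 10,000 small ones.
packet_sizes = [1500] * 10_000 + [40] * 10_000

sampled = sample_every_nth(packet_sizes, SAMPLING_INTERVAL)
estimate = estimate_bytes(sampled, SAMPLING_INTERVAL)
actual = sum(packet_sizes)

print(f"sampled {len(sampled)} of {len(packet_sizes)} packets")
print(f"estimated {estimate:,} bytes, actual {actual:,} bytes")
```

The estimate is exact here only because the toy traffic is perfectly regular; with real traffic it is an approximation.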
What can you do with a flow? • Usual answer: • Treat a flow as a record in a relational database • Who talked to port 1337? • What proportion of our traffic is on port 80? • Who is scanning for vulnerable systems? • Which hosts are infected with this worm? • These are useful and valid questions.
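The relational view can be sketched with an in-memory SQLite table. The table layout, column names, and the third row are assumptions made for illustration; the first two rows reuse Buddy's flows from the running example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows (src_addr TEXT, src_port INTEGER, "
    "dst_addr TEXT, dst_port INTEGER, bytes INTEGER)"
)
rows = [
    ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
    ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
    ("192.168.65.40", 40000, "10.99.205.200", 1337, 1200),  # hypothetical
]
conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?, ?)", rows)

# Who talked to port 1337?
talkers = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT src_addr FROM flows "
        "WHERE dst_port = 1337 ORDER BY src_addr"
    )
]

# What proportion of our traffic (by bytes) is on port 80?
web_share = conn.execute(
    "SELECT 1.0 * SUM(CASE WHEN src_port = 80 OR dst_port = 80 "
    "THEN bytes ELSE 0 END) / SUM(bytes) FROM flows"
).fetchone()[0]

print(talkers)
print(f"{web_share:.1%} of bytes involve port 80")
```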
What can you do with a flow? • Our approach: • Treat a flow as a directed, weighted edge • The resulting network describes user behavior • Hold that thought for now…
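A minimal sketch of the network view, assuming that nodes are hosts and that the weight of an edge is the total number of bytes sent from one host to another. The function names and the second flow from Buddy's machine are hypothetical.

```python
from collections import defaultdict


def build_digraph(flows):
    """Return {source: {destination: total_bytes}} from (src, dst, bytes)."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, nbytes in flows:
        graph[src][dst] += nbytes  # repeated flows add to the edge weight
    return {src: dict(targets) for src, targets in graph.items()}


def out_strength(graph, node):
    """Total weight on all edges leaving a node."""
    return sum(graph.get(node, {}).values())


flows = [
    ("192.168.65.33", "10.99.205.122", 456),
    ("10.99.205.122", "192.168.65.33", 63211),
    ("192.168.65.33", "10.99.205.122", 544),  # hypothetical second request
]

graph = build_digraph(flows)
print(graph["192.168.65.33"])
print(out_strength(graph, "192.168.65.33"))
```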
The Internet2/Abilene network • TCP/IP network connecting research and educational institutions in the U.S. • Over 200 universities and corporate research labs • Also provides transit service between Pacific Rim and European networks
Why study Abilene? • Wide-area network that includes both domestic and international traffic • Heterogeneous user base including hundreds of thousands of undergraduates • High capacity network (10-Gbps fiber-optic links) that has never been congested • Research partnership gives access to (anonymized) traffic data unavailable from commercial networks
Flow collection Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.
Data dimensions • Observed Abilene on April 14, 2005 • About 200 terabytes of data exchanged • This is roughly 25,000 DVDs of information • 600 million flow records • Almost 28 gigabytes on disk • 15 million unique hosts involved
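A few ratios derived from these figures help put them in scale. This is only a back-of-the-envelope sketch, and it assumes decimal units (1 terabyte = 10^12 bytes, 1 gigabyte = 10^9 bytes), which the slide does not specify.

```python
# Figures from the slide, read with decimal unit prefixes (an assumption).
TOTAL_BYTES = 200e12      # about 200 terabytes exchanged
DVD_COUNT = 25_000        # roughly 25,000 DVDs
FLOW_RECORDS = 600e6      # 600 million flow records
DISK_BYTES = 28e9         # almost 28 gigabytes on disk
UNIQUE_HOSTS = 15e6       # 15 million unique hosts

bytes_per_dvd = TOTAL_BYTES / DVD_COUNT
bytes_per_record = DISK_BYTES / FLOW_RECORDS
records_per_host = FLOW_RECORDS / UNIQUE_HOSTS

print(f"{bytes_per_dvd / 1e9:.1f} GB per DVD")
print(f"{bytes_per_record:.1f} bytes on disk per flow record")
print(f"{records_per_host:.0f} flow records per host on average")
```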
Multiple digraphs • Port 80 (Web) • Port 6346 (Gnutella) • Port 19101 (???) • Port 25 (Mail)
Application correlation • Consider the out-strength of a client in the networks for ports p and q:
Application correlation • Build a pair of vectors from the distribution of strength values:
Application correlation • Examine the cosine similarity of the vectors: • When σ = 0, applications p and q are never used together. • When σ = 1, applications p and q are always used together, and to the same extent.
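The correlation steps can be sketched as follows. The sketch assumes one plausible reading of the vectors: the vector for a port has one component per client, equal to that client's out-strength (bytes sent) in that port's network. The function names and the toy flows are hypothetical.

```python
import math
from collections import defaultdict


def client_strengths(flows):
    """Map port -> {client: bytes sent by that client on that port}."""
    strengths = defaultdict(lambda: defaultdict(int))
    for client, port, nbytes in flows:
        strengths[port][client] += nbytes
    return strengths


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a port with no traffic is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical (client, server port, bytes) flows.
flows = [
    ("alice", 80, 300), ("alice", 25, 300),
    ("bob", 80, 400), ("bob", 25, 400),
    ("carol", 6346, 900),
]

strengths = client_strengths(flows)
sigma_web_mail = cosine_similarity(strengths[80], strengths[25])
sigma_web_gnutella = cosine_similarity(strengths[80], strengths[6346])
print(sigma_web_mail, sigma_web_gnutella)
```

In this toy data the same clients use ports 80 and 25 to the same extent, so σ is 1, while no client uses both port 80 and port 6346, so σ is 0.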
Clustering applications • We now have σ(p, q) for every pair of ports • Convert these similarities into distances: • If σ = 0, then d is large; if σ = 1, then d = 0 • Now apply Ward’s hierarchical clustering algorithm
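A sketch of the clustering step follows. The conversion d = 1/σ − 1 is an assumption: it is one mapping that matches the stated endpoints (d = 0 at σ = 1, d large as σ approaches 0), with σ floored at a small value to keep distances finite. Ward's criterion is implemented through the standard Lance-Williams update on squared distances; the similarity values are invented.

```python
from itertools import combinations


def similarity_to_distance(sigma, floor=1e-6):
    """Assumed mapping: d = 1/sigma - 1, with sigma floored to stay finite."""
    return 1.0 / max(sigma, floor) - 1.0


def ward_linkage(labels, distance):
    """Agglomerative clustering with Ward's criterion.

    Returns the merges in order as (members_a, members_b, merge_distance).
    """
    clusters = {i: (label,) for i, label in enumerate(labels)}
    # Squared distances between active clusters, keyed by (low id, high id).
    d2 = {(i, j): distance(labels[i], labels[j]) ** 2
          for i, j in combinations(range(len(labels)), 2)}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        (i, j), best = min(d2.items(), key=lambda item: item[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        new_d2 = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2[tuple(sorted((i, k)))]
            djk = d2[tuple(sorted((j, k)))]
            # Lance-Williams update for Ward's method (squared distances).
            new_d2[(k, next_id)] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * best
            ) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j], best ** 0.5))
        d2 = {p: v for p, v in d2.items() if i not in p and j not in p}
        d2.update(new_d2)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical similarities; unlisted pairs default to 0.01.
sigma = {frozenset((80, 25)): 0.8, frozenset((6346, 19101)): 0.5}
ports = [80, 25, 6346, 19101]
merges = ward_linkage(
    ports,
    lambda a, b: similarity_to_distance(sigma.get(frozenset((a, b)), 0.01)),
)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

With these invented values, ports 80 and 25 merge first, ports 6346 and 19101 merge second, and the two groups join last.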
Classifying unknown applications • To classify an unknown application, see what known applications it clusters with • Our classification experiment • Take 16 unknown ports • Guess function based on similarity data • Validate or invalidate guesses based on external evidence
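As a simplified stand-in for reading guesses off the cluster tree, the sketch below just ranks known applications by their similarity to an unknown port. This nearest-neighbor shortcut is not the clustering procedure itself, and the function name and similarity values are hypothetical.

```python
def closest_known(unknown_port, similarity, known_apps, top=2):
    """Return the known applications most similar to an unknown port."""
    scored = [
        (similarity.get((unknown_port, app), 0.0), app)
        for app in known_apps
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [app for score, app in scored[:top] if score > 0]


# Invented similarity values between port 388 and known applications.
similarity = {
    (388, "FTP"): 0.7,
    (388, "Hotline"): 0.6,
    (388, "Web"): 0.05,
}
known_apps = ["Web", "Mail", "FTP", "Hotline"]

neighbors = closest_known(388, similarity, known_apps)
print(neighbors)
```

A guess about the unknown port's function would then be based on what its nearest neighbors have in common.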
Example #1 • Port 388 is coupled with FTP and Hotline • FTP is a file transfer application • Hotline is an early file-sharing application • Our guess: traditional file transfer application • Actual identity: Unidata/LDM • Used for moving large meteorological data sets
Example #2 • Port 19101 is coupled with instant messaging and P2P applications • Our guess: a P2P application that relies on individual contact for file transfers • Actual identity: Clubbox • Korean file-sharing program • Users trade large files on virtual hard drives
Overall results • For our 16 guesses: • 8 were unambiguously correct • 6 were partially correct • These turned out to be trojans and malware • We learned that IRC + P2P = evil afoot • 2 could not be confirmed or disproven • Ports were in transient use during data collection
Implications • We can identify the type of an application without examining a single packet! • Scalable • Preserves user privacy • Difficult to do with relational view of flow data