Research experiences with publicly available anonymized data. John McHugh RedJack, LLC and University of North Carolina Predict Disclosure Control Workshop February 2010
A Tale of two datasets • LBL Anonymized packet header data. • Papers describing data • The Devil and Packet Trace Anonymization • A First Look at Modern Enterprise Traffic • I used it to investigate lossy compression of traces • Wireless pkt header traces from Dartmouth, CRAWDAD • Poorly documented, badly anonymized • Use agreement precludes attacking anonymization • I have used it for numerous class projects
Issues • To what extent does the anonymization interfere with use for research? • Conflicts with collection process • Collection problems may or may not be orthogonal to anonymization. • Anonymization may make resolution of collection problems difficult or impossible • Inadvertent attacks on anonymization • Conflicts with understanding • All addresses are not created equal • Over-anonymization may invite “harmless” attacks that lead to potentially harmful results
The LBNL data • Fragmentary observations (2 of 20+ router ports in rotation) • Spread over 5 days in Oct. 2004 - Jan. 2005 • Anonymization carefully constructed to counter attacks known at the time. • Scan data treated separately with different anonymization • Scan data is atypical - mostly ping / ping response • LBNL uses TRW to block SYN scans at its border
The question • For NetFlow data, scan records often constitute as many as 90% of the total. • Can we store these differently without compromising the long-term utility of the data? • Lossless compression - perfect reconstruction • Lossy compression - “similar” reconstruction • From the perspective of a year later, do the precise details of a scan that targets mostly unoccupied addresses matter? • I think not.
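The slides do not describe the compression scheme itself, so the following is only a minimal sketch of what a lossy scan summary might look like. It assumes a hypothetical flow tuple of (source IP, destination IP, destination port, start time) and collapses all flows from one scanner to one /24 and port into a single record; per-flow timing and pacing are discarded, which is the kind of loss the slide describes as "similar" reconstruction.

```python
import ipaddress

def summarize_scans(flows, prefix_len=24):
    """Collapse scan flows into one summary per (scanner, target prefix, port).

    flows: iterable of (sip, dip, dport, stime) tuples (hypothetical layout,
    not the SiLK record format). Returns a dict mapping
    (sip, target_prefix, dport) to a dict with first/last time, flow count,
    and number of distinct targets. Exact per-flow times are lost.
    """
    summaries = {}
    for sip, dip, dport, stime in flows:
        # strict=False masks the host bits, giving the enclosing prefix
        net = str(ipaddress.ip_network(f"{dip}/{prefix_len}", strict=False))
        key = (sip, net, dport)
        s = summaries.get(key)
        if s is None:
            summaries[key] = {"first": stime, "last": stime,
                              "flows": 1, "targets": {dip}}
        else:
            s["first"] = min(s["first"], stime)
            s["last"] = max(s["last"], stime)
            s["flows"] += 1
            s["targets"].add(dip)
    # Replace the target sets with counts so the summary stays small
    for s in summaries.values():
        s["targets"] = len(s["targets"])
    return summaries
```

A scan of a few hundred addresses in one /24 then costs one record rather than hundreds, at the price of losing the order and spacing of individual probes.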
The role of anonymization • For this question, anonymization is almost, but not quite, irrelevant. • We would like to be sure that we don’t throw away data relevant to a successful attack (next slides) • The LBNL anonymization precludes this as we cannot follow a responding victim in other activities • More importantly, the blocking of most scans at the border prevents most large SYN scan attacks or makes them much more difficult to recognize.
Scan and infection • Upper figure • Scanner targets /24 • Density of line is volume • 168.192.20.163 responds • Lower figure • Scan was 3306 TCP • MySQL password guess • Victim very active on this port for several weeks • Destination is scanner’s IP address • Also active on 139 UDP during this period, again with scanner.
Results / Conclusion • Compression of scans in LBNL data could reduce the volume of the scan records by 90% to 95% • Some loss of precise time and pacing information • Loss of serious scans at border limits benefit in archiving internally collected data • LBNL data not really suitable for study of typical scans due to preemptive measures. Collection needed outside scan filter for this, and even so, the preemptive measures may bias knowledgeable scanners to search elsewhere.
Dartmouth CRAWDAD wireless data • 18 sniffers spread over campus Nov 03 - Feb 04 • Packet headers cut after ports for TCP/UDP IP other • Prefix preserving anonymization of all addresses • All IP addresses, Platform portion of MAC addresses • 160GB+ of gzipped packet header files • Converted to “degenerate” SiLK Netflow • 1 flow / pkt • Hourly hierarchy year/month/day/<hourly sensor files> • Coded MAC addresses into flow record • Possibly good set for class projects - late 03 worms - but occasional time reversals - a few at ~3500 seconds
Problems - IP addresses • Almost no useful documentation of collection • Either no records kept or fear of breaching IRB • Much learned from detailed examination of packets / flows • IP addresses given by DHCP with fairly short leases • Question need for IP anonymization w/o DHCP logs • Can find 0.0.0.0 and 169.254.x.x ranges easily - others? • Tracking IP / MAC relationship gives strange results • Too many short transitions • Wanted to assign constant pseudo IP / platform • Is DHCP global or per access point? • Could get no answer, so tried to figure it out
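One way to look for the "too many short transitions" is to turn per-packet (time, MAC, IP) observations into binding intervals and flag the short ones. This is a sketch under assumptions, not the analysis actually run on the CRAWDAD data: the tuple layout, the function names, and the 300-second threshold are all hypothetical.

```python
def binding_intervals(observations):
    """Build (mac, ip, start, end) intervals from time-sorted observations.

    observations: iterable of (time, mac, ip) tuples, sorted by time
    (hypothetical layout). A new interval starts whenever a MAC is seen
    with a different IP than in its previous observation.
    """
    current = {}   # mac -> [ip, start, end] for the open interval
    finished = []
    for t, mac, ip in observations:
        cur = current.get(mac)
        if cur is not None and cur[0] == ip:
            cur[2] = t                      # same binding, extend it
        else:
            if cur is not None:
                finished.append((mac, cur[0], cur[1], cur[2]))
            current[mac] = [ip, t, t]       # open a new binding
    for mac, (ip, start, end) in current.items():
        finished.append((mac, ip, start, end))
    return finished

def short_bindings(intervals, min_duration=300.0):
    """Return the intervals that lasted less than min_duration seconds."""
    return [iv for iv in intervals if iv[3] - iv[2] < min_duration]
```

Bindings much shorter than a plausible DHCP lease are the suspicious ones; with unsynchronized sniffer clocks, some of them may be artifacts of ordering rather than real address changes.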
Problems - Time • Occasional time reversals / gaps up to almost 1 hour • Did not show up in all sniffers • Thought might find same packet in multiple sniffers • direct case same MACs & IPs in two+ sniffers • indirect case same IP, but w->gw gw->w MACs • Situation much worse than expected (next slide) • Sniffers apparently ran w/o ntp for first 1.5 months • This could explain IP address inconsistencies
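Reversals and gaps of this kind can be located with a single pass over one sniffer's packet timestamps. The sketch below assumes timestamps in seconds, in capture order; the function name and the 60-second gap threshold are illustrative choices, not values from the original analysis.

```python
def find_time_anomalies(timestamps, max_gap=60.0):
    """Flag backward steps and long gaps in a capture-ordered timestamp list.

    Returns a list of (index, kind, delta) where kind is "reversal" when the
    clock steps backward and "gap" when it jumps forward by more than
    max_gap seconds. delta is the difference from the previous timestamp.
    """
    anomalies = []
    prev = None
    for i, t in enumerate(timestamps):
        if prev is not None:
            delta = t - prev
            if delta < 0:
                anomalies.append((i, "reversal", delta))
            elif delta > max_gap:
                anomalies.append((i, "gap", delta))
        prev = t
    return anomalies
```

Running this per sniffer and comparing where the anomalies fall gives a first indication of which clocks drifted or were reset and when.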
Attempt 1 - find Dartmouth at LBL • Dartmouth is 129.170/16, LBNL has 128.3.0.0 • in 10000001/8 and 10000000/8 (common 1000000x) • Anon Dartmouth is 190.84/16 in 10111110/8 • common 1011111x - LBNL in 10111111/8 or 191/8 • Found some SYNs to 191/8, so asked Vern what he had. • With {s,d}port and from 129.170/16 could match • Only had dport, no sport, 1 pair from same address • Unable to find match - Why?
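The bit arithmetic on this slide can be checked mechanically. The sketch assumes strictly prefix-preserving anonymization, meaning two addresses share exactly k leading bits before anonymization if and only if they share exactly k leading bits afterward; the function names are illustrative. Given one known real/anonymized pair (129 maps to 190), it derives what can be said about the anonymized form of another first octet (128).

```python
def common_prefix_bits(a, b, width=8):
    """Number of leading bits shared by two width-bit integers."""
    return width - (a ^ b).bit_length()

def infer_anon_bits(real_known, anon_known, real_target, width=8):
    """Infer the anonymized bit pattern of real_target.

    Uses one known (real, anonymized) pair and assumes prefix-preserving
    anonymization. Bits that cannot be determined are returned as 'x'.
    """
    k = common_prefix_bits(real_known, real_target, width)
    anon_bits = format(anon_known, f"0{width}b")
    out = anon_bits[:k]                 # the shared prefix is preserved
    if k < width:
        # The first differing bit must also differ after anonymization
        out += "1" if anon_bits[k] == "0" else "0"
        out += "x" * (width - k - 1)    # later bits are unconstrained
    return out

# Dartmouth 129 -> 190 is known; where does LBNL's 128 land?
pattern = infer_anon_bits(129, 190, 128)
print(pattern, int(pattern, 2))
```

Because 129 and 128 share seven leading bits, the anonymized first octet of LBNL is fully determined here. For a network sharing fewer bits with Dartmouth, the pattern would contain trailing `x` positions and identify a larger candidate block instead.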
Attempt 2 - backscatter telescope • Asked kc if the backscatter had 129.170/16 2/11-13/12 • As it happens, there is a limited amount for Nov 8-11 • This yielded one port match with 4 packets • There were also 4 unmatched packets • Dartmouth IP address was not active at the time. • Access point drops packets if address is not associated. • No outgoing so source is spoofed • Other data did not match with wireless based on sPort / dPort. Assume either not wireless or inactive • Match rates 1 sniffer clock for 1 interval
So why should we care? • The internet is an open universe • It is constantly being probed by deliberate and accidental events. • It is constantly being observed at many locations • With a little luck and a lot of patience it may be possible to unravel the best efforts at anonymization, especially for specific targets. • Cryptography provides wholesale, not retail protection.
So why should we care? • Collecting data is very hard • Dartmouth did not document the collection well. • They did not look closely at what they had collected and apparently performed little or no general analyses before releasing the data. • Admittedly, most of the research done with CRAWDAD data addresses mobility questions, but the clock problem affects that as well. • Collection problems led to search for external interactions which could breach the anonymization, but ...
So why should we care? • Presence of scanning worms has potential to completely undo anonymization. • Several students have examined data for worm signs as term projects. Most worms active during 03/11 - 04/02 are there. • We have tried to honor use agreement and have not looked at scan patterns in detail, but • Anyone scanned by a wireless address at Dartmouth during this collection has a piece of the puzzle.
Was the data useful? • The LBNL data is useful for its intended purpose. • It was marginally useful for my analysis, but the limitation is the scan blocking, not anonymization. • The collection mode further limits research targeted at platform characterization over time. • The way LBNL operates limits the presence of interesting security events. Most do not happen. • A more complete enterprise data set for a longer duration would be very useful, but would probably endanger the anonymity as more external interactions became traceable.
Was the data useful? • My hope for CRAWDAD data was to create a clean data set that could be used for pedagogical purposes. • There are too many collection-related problems remaining to declare victory • I have worked at it off and on for about 4 years. • Even with its problems, the data has been useful for a network analysis course. 1 MS thesis, several conference pubs, and about 5 gainfully employed former students • We have carefully and successfully stepped around the anonymization requirement for the most part.
Conclusions • Both the CRAWDAD and LBNL data sets have utility beyond the purpose of collection. • The anonymization and collection practices limit the utility by closing off whole areas of interest, i.e. scan interaction for LBNL, ICMP for CRAWDAD • Documenting collection and ensuring data soundness are orthogonal to anonymization, but they have the ability to interact in interesting ways. • These interactions may limit the effectiveness of anonymization.