What Data Do We Need and Why Do We Need It?
Jim Pepin, Chief Technology Officer, University of Southern California
Network Data: Research Depends on It
• Solutions depend on understanding the problem…
• Advances in many areas depend on analysis of real data
  • Network Management: traffic engineering, network design
  • Network Control: improving routing protocols
  • High Performance: better transport protocols
  • Security: tracking/stopping DoS and worm attacks
• Over 30% of papers in a top networking conference (SIGCOMM '04) depended on data collected by others
• Most common providers:
  • ISPs (e.g., ATT, Sprint, I2)
  • Service providers (e.g., Akamai)
  • Individual campuses (e.g., UNC, UOregon, USC – some campuses give data only to local researchers)
Network Data: More than Just Packet Traces
• Some data are more sensitive than others
  • Dynamic routing information: routing protocol advertisements
  • Static design information: router configuration files, peering arrangements, policies
  • Operational events: alarms, trouble tickets (very few sources of this important info!)
  • Traffic logs: netflow records, packet header traces
  • Application data: URLs, p2p filenames, DNS queries
• Tension: how much correlation to permit?
  • Data that can be correlated across multiple sites are the most valuable for measuring network-wide events, e.g., worms
  • Privacy techniques anonymize and blur identity (a sketch of one such technique follows below)
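The slide mentions anonymization only in passing, so here is a minimal sketch of the idea behind prefix-preserving address scrambling, in the spirit of tools such as Crypto-PAn. This is not LANDER's or PREDICT's actual code; the key and addresses are illustrative. Each output bit is derived from a keyed function of the real prefix, so addresses sharing a k-bit prefix still share an anonymized k-bit prefix, preserving topology for research while hiding identity.

```python
import hmac, hashlib, ipaddress

SECRET_KEY = b"provider-held secret"  # hypothetical; kept by the data provider

def scramble_ip(addr: str) -> str:
    """Prefix-preserving-style scramble: each anonymized bit depends only
    on the preceding real bits, so two addresses sharing a k-bit prefix
    still share a k-bit prefix after scrambling."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = []
    for i in range(32):
        prefix = bits[:i]
        digest = hmac.new(SECRET_KEY, prefix.encode(), hashlib.sha256).digest()
        flip = digest[0] & 1              # pseudorandom bit derived from the prefix
        out.append(str(int(bits[i]) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))

if __name__ == "__main__":
    # Two addresses in the same /24 map into the same anonymized /24.
    print(scramble_ip("128.125.10.17"))
    print(scramble_ip("128.125.10.200"))
```

Because the mapping is keyed and one-way, researchers can still study subnet-level structure without learning real host identities, which is exactly the trade-off between utility and privacy the slide describes.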
Example of a Data Provider
• DHS PREDICT
  • DHS support for network research
  • Not for operational use by DHS
  • Major players
  • Peer-review ground rules
  • Generic sources for legitimate research
• LANDER Project
  • Example of a PREDICT supplier
  • Joint project of the USC-ISI networking division and the USC/ISD Center for High Performance Computing and Communications
  • USC-HPCC manages the WAN for USC/CIT/JPL
  • ISI provides networking research background
  • HPCC provides data storage and computational resources
  • We work together on ground rules and MOUs
  • LANDER funds collection systems, support staff, and disk/tape space
What Is Hard and Easy
• LANDER ground rules
  • Scrambled headers are the primary product today
  • Requires an MOU with the researcher
  • No collection of data payloads (see the sketch after this list)
  • Working on a very strict MOU for very limited use of non-scrambled header data, for very select uses, in a very controlled environment
  • Build a collection management system integrated with other PREDICT sites
• How we do this
  • Very close cooperation between ISI, ISD, and university legal
  • MOUs will be very clear and understandable for the researcher
  • USC can reject any application
  • USC will review any publication based on unscrambled headers, and all work processing those headers will be done inside HPCC
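As a concrete illustration of these ground rules (not LANDER's actual tooling), here is a minimal Python sketch, assuming the third-party scapy library is installed: it reads a raw trace, strips every payload byte, scrambles source and destination addresses with a provider-held key, and writes out headers-only data of the kind that could be released under an MOU. The key, file names, and scramble_ip helper are all hypothetical.

```python
import hmac, hashlib, ipaddress
from scapy.all import rdpcap, wrpcap, IP, TCP, UDP

SECRET_KEY = b"provider-held secret"  # hypothetical; never shipped with the data

def scramble_ip(addr: str) -> str:
    # Simple keyed one-way mapping (not prefix-preserving; see the
    # earlier sketch for a prefix-preserving variant).
    digest = hmac.new(SECRET_KEY, addr.encode(), hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(int.from_bytes(digest[:4], "big")))

def sanitize(in_pcap: str, out_pcap: str) -> None:
    packets = rdpcap(in_pcap)
    for pkt in packets:
        if not pkt.haslayer(IP):
            continue
        # Ground rule 1: no collection of data payloads -- headers only.
        if pkt.haslayer(TCP):
            pkt[TCP].remove_payload()
            del pkt[TCP].chksum                  # stale; recomputed on write
        elif pkt.haslayer(UDP):
            pkt[UDP].remove_payload()
            del pkt[UDP].chksum, pkt[UDP].len    # stale; recomputed on write
        # Ground rule 2: scrambled headers are the product.
        pkt[IP].src = scramble_ip(pkt[IP].src)
        pkt[IP].dst = scramble_ip(pkt[IP].dst)
        del pkt[IP].len, pkt[IP].chksum          # lengths changed; recompute
    wrpcap(out_pcap, packets)

if __name__ == "__main__":
    sanitize("raw_capture.pcap", "scrambled_headers.pcap")  # illustrative names
```

In a real deployment the unscrambled input would never leave the controlled environment (HPCC, in LANDER's case); only the sanitized output file would be shared with researchers under the MOU.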
Why Would We Do This?
• The Internet needs to be studied and engineered
  • What is the modern equivalent of Bell Labs for the phone system?
  • How did we get to where we are today? Cooperation between researchers and operators.
• We can't allow ourselves a complete bunker mentality
  • We need to be selective in what we provide, but where there is demonstrated need, provide what is needed, consistent with policy
  • If we don't do this, no one will
• The risks can be managed if we take the time and effort to work with campus management (legal, CIOs, etc.) to mitigate them
  • Researchers can be brought into these discussions if they are framed correctly
• If we don't study how the network works, our ability to manage it will degrade to zero over time