This paper explores the challenges and solutions associated with creating data repositories for network management research and promoting data sharing. It discusses the barriers to entry in network management research, the sensitivity and security concerns surrounding data sharing, and positive examples of data sharing in the field. The paper also addresses the need for community-wide efforts, such as developing guidelines for the IRB process and creating shared repositories. It emphasizes the importance of active involvement from industry and operators in data collection and sharing.
“Creating Data Repositories..” Sanjay Rao ECE Dept, Purdue University
Group Members • Dave Maltz • Rebecca Isaacs • Ratul Mahajan • Yin Zhang • Aditya Akella • David Kotz • Charles DiFatta • …..
Motivation • Network Management Research: • Barrier to entry is high • Data/insights from operators/industry critical • Examples: • Failure characterization of enterprise network • VLAN characterization and use • Configuration Management
What happens today..? • End-user centric measurement studies • Network “black-box”: no operator involvement • Real need: “white-box” • Campus Networks • Difficulties in bootstrapping relationships with operators • Enterprise/Operator Network • Sprint or AT&T (Microsoft with end-user) • Limited pool of researchers • Data across multiple enterprises?? • Trends over many years ??
Bottom line • Need a data repository • Contributors from operators, researchers, industry • Accessible to all researchers • Facilitate research much like PlanetLab • Vital to have “critical mass” of researchers on Network Management • Research along high-impact real problems
Data Sharing: what inhibits it? • Sensitivity of data • Security Issues (firewall policies, network structure) • Privacy Issues (records of individual activity) • Proprietary nature of data • E.g. how many calls got, mobility models • Possible to have others use it? • “Secret weapon” for research • Competition vs. collaboration • Inertia/too much effort
Solutions • Carrots/sticks to promote data sharing • “Must release data” to publish • IMC: best paper award only to work releasing data • Technical ways to address concerns with sharing
Positive Example • Example: • HSARPA “PREDICT”: make research on network security possible. • Firewalls and IDS network security data
Research: Anonymization • Hiding provider, hiding individual information • Need framework to reason about it • What trade-offs do you make? • What risks are posed? • How to expose trade-offs in a way we can appreciate? • Anonymization very domain specific • E.g. configuration file Vs. packet trace • Are there common themes? • Other Models: • NDA-based • “Give me a question” -> “return answer” • “Exploratory” nature of research
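One way to make the anonymization trade-offs above concrete is a minimal Python sketch that replaces IPv4 addresses in a configuration line or trace record with keyed pseudonyms. The function names `pseudonymize_ip` and `anonymize_line`, the key handling, and the choice of HMAC-SHA256 are assumptions made for illustration, not a scheme proposed in the slides.

```python
import hashlib
import hmac
import ipaddress
import re

# Loose pattern for dotted-quad candidates; validity is checked separately.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(addr: str, key: bytes) -> str:
    """Map an IPv4 address to a consistent pseudonym under a secret key."""
    packed = ipaddress.IPv4Address(addr).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    # Use the first 4 bytes of the keyed hash as the replacement address.
    return str(ipaddress.IPv4Address(digest[:4]))


def anonymize_line(line: str, key: bytes) -> str:
    """Replace every valid IPv4 address in a text line with its pseudonym."""

    def _replace(match: re.Match) -> str:
        try:
            return pseudonymize_ip(match.group(0), key)
        except ipaddress.AddressValueError:
            # Not a real address (e.g. an octet above 255): leave untouched.
            return match.group(0)

    return IPV4_RE.sub(_replace, line)


if __name__ == "__main__":
    secret = b"hypothetical-site-secret"  # illustrative key only
    print(anonymize_line("permit ip host 192.0.2.1 host 198.51.100.7", secret))
```

The same address always maps to the same pseudonym, so flows and references stay linkable, but subnet structure is lost: two addresses in the same prefix generally land in unrelated prefixes. Prefix-preserving schemes such as Crypto-PAn make the opposite choice, keeping more structure for analysis at the cost of exposing more about the network. That is the kind of trade-off a common framework would need to express.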
Community effort: Cooperate on IRB • Social Sciences: • Lots of experience with IRB • Networking: • Lack of clear guidelines on IRB process • Admins feel happier if IRB can “sanction” things • As community: • Must appreciate need/process for IRB • Develop guidelines for IRB process • Share IRB documents
Creating shareable data • 75% of time spent figuring out how to use data • Researcher needs vary • Different forms of data • Historical vs. Streaming • Dated? Trending? • Assumptions made/gaps in data • “timing info crucial at sub-RTT level”? • Sharing hard, many idiosyncrasies • Data collection infrastructure, annotate
User Diagnostics • One-on-one: exact data provided • Create shared repository(ies) • What data do most users want? • Is that 20% of stuff most critical to provide? • Data Collection Tools • Meta-data part of problem • Create data in standard formats • “Observatory”: • How to discover, describe, explain data • Access policy, use policy
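As a rough illustration of the meta-data and "Observatory" points above, the following Python sketch shows what a machine-readable description of a contributed dataset might look like. Every field name, the `DatasetRecord` class, and the example values are hypothetical; the slides do not define a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical metadata describing one contributed dataset."""

    name: str
    data_type: str           # e.g. "router configuration", "packet trace"
    collection_method: str   # how and where the data was gathered
    time_span: str           # period covered by the data
    anonymization: str       # what was hidden and how
    access_policy: str       # who may obtain the data
    use_policy: str          # what the data may be used for
    known_gaps: List[str] = field(default_factory=list)  # assumptions, holes


def to_json(record: DatasetRecord) -> str:
    """Serialize a record to JSON so it can be published in a catalog."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def find_by_keyword(records: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return records whose name, type, or known gaps mention the keyword."""
    needle = keyword.lower()
    return [
        r for r in records
        if needle in r.name.lower()
        or needle in r.data_type.lower()
        or any(needle in gap.lower() for gap in r.known_gaps)
    ]


if __name__ == "__main__":
    example = DatasetRecord(
        name="campus-vlan-configs",
        data_type="router configuration",
        collection_method="nightly snapshot of device configs",
        time_span="unspecified",
        anonymization="IP addresses and hostnames pseudonymized",
        access_policy="registered researchers",
        use_policy="research only, no re-identification",
        known_gaps=["timestamps are per-day, not sub-RTT"],
    )
    print(to_json(example))
```

Recording known gaps and assumptions next to the data is one way to reduce the time researchers spend working out how a dataset can be used.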
Other • Streaming Data: Online vs. Offline • Scalable collection: • What to collect? Over how long? • Compression techniques • Fine-grained: overhead; coarse-grained: information loss • What does it take to build this infrastructure? • Get all types of data as painlessly as possible • Massage, orchestrate data to fit researcher needs • Simple APIs to get data out – fast analysis tools • Federated Access • Data Management: lifecycle of data
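The fine-grained versus coarse-grained trade-off above can be sketched in a few lines of Python. The sample format (timestamp in seconds, byte count) and the `aggregate` helper are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(samples: Iterable[Tuple[float, int]], window_s: int) -> Dict[int, int]:
    """Sum byte counts into fixed windows keyed by the window start time."""
    bins: Dict[int, int] = defaultdict(int)
    for timestamp, nbytes in samples:
        window_start = int(timestamp // window_s) * window_s
        bins[window_start] += nbytes
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    # Hypothetical measurements: (seconds since start, bytes observed).
    samples = [(0.1, 100), (0.9, 50), (1.2, 200), (5.0, 10)]
    print(aggregate(samples, 1))  # finer bins: more records to store
    print(aggregate(samples, 5))  # coarser bins: burst timing is lost
```

With 1-second windows the burst in the second interval is still visible. With 5-second windows the same data collapses into two totals, which costs less to store but can no longer answer questions about short-timescale behavior.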
Action Items • Community-Wide Efforts: • Initiate efforts to create data repository • How to manage? Who contributes? Who arbitrates? • How much storage? Lifecycle: how long to store data? • Create IRB guidelines for networking data • Research: • Anonymization • Usage diagnostics -> what to collect, release: widely applicable • Data Collection Tools, metadata information • Industry, operators must be as actively involved as possible