“Creating Data Repositories..”

This paper explores the challenges and solutions associated with creating data repositories for network management research and promoting data sharing. It discusses the barriers to entry in network management research, the sensitivity and security concerns surrounding data sharing, and positive examples of data sharing in the field. The paper also addresses the need for community-wide efforts, such as developing guidelines for the IRB process and creating shared repositories. It emphasizes the importance of active involvement from industry and operators in data collection and sharing.

Presentation Transcript


  1. “Creating Data Repositories..” Sanjay Rao ECE Dept, Purdue University

  2. Group Members • Dave Maltz • Rebecca Isaacs • Ratul Mahajan • Yin Zhang • Aditya Akella • David Kotz • Charles DiFatta • …..

  3. Motivation • Network Management Research: • Barrier to entry is high • Data/insights from operators/industry critical • Examples: • Failure characterization of enterprise network • VLAN characterization and use • Configuration Management

  4. What happens today..? • End-user centric measurement studies • Network “black-box”: no operator involvement • Real need: “white-box” • Campus Networks • Difficulties in bootstrapping relationships with operators • Enterprise/Operator Network • Sprint or AT&T (Microsoft with end-user) • Limited pool of researchers • Data across multiple enterprises?? • Trends over many years ??

  5. Bottom line • Need a data repository • Contributors from operators, researchers, industry • Accessible to all researchers • Facilitate research much like PlanetLab • Vital to have “critical mass” of researchers on Network Management • Research along high-impact real problems

  6. Data Sharing: what inhibits it? • Sensitivity of data • Security Issues (firewall policies, network structure) • Privacy Issues (records of individual activity) • Proprietary nature of data • E.g. how many calls got, mobility models • Possible to have others use it? • “Secret weapon” for research • Competition Vs. collaboration • Inertia/ too much effort

  7. Solutions • Carrots/sticks to promote data sharing • “Must release data” to publish • IMC: best paper award only to work releasing data. • Technical ways to address concerns with sharing

  8. Positive Example • Example: • HSARPA “PREDICT”: make research on network security possible. • Firewalls and IDS network security data

  9. Research: Anonymization • Hiding provider, hiding individual information • Need framework to reason about it • What trade-offs do you make? • What risks are posed? • How to expose trade-offs in a way we can appreciate? • Anonymization very domain specific • E.g. configuration file Vs. packet trace • Are there common themes? • Other Models: • NDA-based • “Give me a question” -> “return answer” • “Exploratory” nature of research
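
One way to make the trade-off question on this slide concrete is a minimal sketch of keyed-hash pseudonymization for IP addresses. This is a swapped-in illustration, not a method named in the slides: the function name `pseudonymize_ip` and the choice of HMAC-SHA256 are assumptions. The mapping is consistent (the same address always yields the same pseudonym under one key), which keeps per-host analysis possible, but it discards prefix structure, so subnet-level questions can no longer be answered.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(addr: str, key: bytes) -> str:
    """Map an IPv4 address to a stable pseudonymous IPv4 address.

    Hypothetical helper for illustration: uses a keyed hash (HMAC-SHA256)
    and keeps the first 4 bytes of the digest as the pseudonym.
    """
    ip = ipaddress.IPv4Address(addr)  # raises ValueError on malformed input
    digest = hmac.new(key, ip.packed, hashlib.sha256).digest()
    # Truncating to 32 bits means distinct inputs can occasionally collide.
    return str(ipaddress.IPv4Address(digest[:4]))


if __name__ == "__main__":
    secret = b"held by the data owner only"  # assumption: key never released
    for original in ("10.0.0.1", "10.0.0.2", "10.0.0.1"):
        print(original, "->", pseudonymize_ip(original, secret))
```

A configuration file would need a different treatment (for example, rewriting addresses consistently inside ACLs and routing statements), which is the domain-specific point the slide raises.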

  10. Community effort: Cooperate on IRB • Social Sciences: • Lots of experience with IRB • Networking: • Lack of clear guidelines on IRB process • Admins feel happier if IRB can “sanction” things • As community: • Must appreciate need/process for IRB • Develop guidelines for IRB process • Share IRB documents

  11. Creating shareable data • 75% of time spent figuring out how to use data • Researcher needs vary • Different forms of data • Historical vs. Streaming • Dated? Trending? • Assumptions made/gaps in data • “timing info crucial at sub-RTT level”? • Sharing hard, many idiosyncrasies • Data collection infrastructure, annotate

  12. User Diagnostics • One-on-one: exact data provided • Create shared repository(ies) • What data do most users want? • Is that 20% of stuff most critical to provide? • Data Collection Tools • Meta-data part of problem • Create data in standard formats • “Observatory”: • How to discover, describe, explain data • Access policy, use policy
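
As a rough illustration of what "how to discover, describe, explain data" could look like in practice, the sketch below defines a metadata record serialized to JSON. Every field name here (`kind`, `known_gaps`, `access_policy`, and so on) and the helper `find_by_kind` are hypothetical, chosen only to mirror the bullets on slides 11 and 12; the slides do not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata describing one contributed dataset."""
    name: str
    kind: str                 # e.g. "config" or "packet-trace"
    collected_from: str       # ISO 8601 date, start of collection
    collected_to: str         # ISO 8601 date, end of collection
    anonymization: str        # how sensitive fields were treated
    access_policy: str        # who may obtain the data
    use_policy: str           # what the data may be used for
    known_gaps: list = field(default_factory=list)  # assumptions, missing data


def to_json(record: DatasetRecord) -> str:
    """Serialize a record in a stable, human-readable form."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def find_by_kind(records, kind):
    """Return the records whose data kind matches, a minimal discovery query."""
    return [r for r in records if r.kind == kind]


if __name__ == "__main__":
    example = DatasetRecord(
        name="campus-router-configs",      # made-up example values
        kind="config",
        collected_from="2007-01-01",
        collected_to="2007-06-30",
        anonymization="addresses pseudonymized",
        access_policy="NDA required",
        use_policy="research only",
        known_gaps=["no timing information"],
    )
    print(to_json(example))
```

Recording gaps and assumptions next to the data is one way to reduce the time researchers spend working out how a dataset can be used.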

  13. Other • Streaming Data: Online vs. Offline • Scalable collection: • What to collect? Over how long? • Compression techniques • Fine-grained: overhead, coarse-grained: information loss • What does it take to build this infrastructure? • Get all types of data as painlessly as possible • Massage, orchestrate data to fit researcher needs • Simple APIs to get data out – fast analysis tools • Federated Access • Data Management - Lifecycle of data
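
The fine-grained versus coarse-grained trade-off can be seen in a small worked example. The sketch below averages consecutive samples into coarser bins; the function name `downsample_mean` and the sample values are made up for illustration. Storing one value instead of four cuts storage by 4x, but a short burst disappears into the average.

```python
def downsample_mean(samples, factor):
    """Average consecutive groups of `factor` samples into one value each.

    A trailing partial group is averaged over the samples it contains.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    coarse = []
    for start in range(0, len(samples), factor):
        group = samples[start:start + factor]
        coarse.append(sum(group) / len(group))
    return coarse


if __name__ == "__main__":
    # Hypothetical per-second link utilization (%): one brief burst to 97.
    fine = [1, 1, 1, 97]
    coarse = downsample_mean(fine, 4)
    print("fine-grained peak:", max(fine))      # 97
    print("coarse-grained values:", coarse)     # [25.0]
```

Whether that loss matters depends on the research question, which is why the slide pairs this trade-off with "What to collect? Over how long?".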

  14. Action Items • Community-Wide Efforts: • Initiate efforts to create data repository • How to manage? Who contributes? Who arbitrates? • How much storage? Lifecycle - How long to store data? • Create IRB guidelines for networking data • Research: • Anonymization • Usage diagnostics -> what to collect, release: widely applicable • Data Collection Tools, metadata information • Industry, operators must be as actively involved as possible
