φ: public health for the internet
joe hellerstein, intel research & uc berkeley
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
vision 1: shift network security from medicine to public health
• security tools focused on “medicine”
  • vaccines for viruses
  • improving the world one patient at a time
• weakness/opportunity in the “public health” arena
  • public health: population-focused, community-oriented
  • epidemiology: incidence, distribution, and control in a population
• φ: a new approach
  • enable population-wide measurement
  • engage end users: education and prevention
  • understand risky behaviors, at-risk populations
a center for disease control? [staniford/paxson/weaver 2002]
• am I being targeted?
• is this remote host a “bad guy”?
• is there a new type of activity?
• is there global-scale activity?
• who owns the center? what do they control?
• this will be unpopular at best
  • electronic privacy for individuals
    • the internet as “a broadly surveilled police state”? (dan geer, former cto of @Stake)
  • provider disincentives
    • transparency = maintenance cost
• and hardly ubiquitous
  • can monitor the chokepoints (ISPs)
  • but inside intranets?
    • e.g. corporate IT
    • e.g. berkeley dorms
    • e.g. grassroots WiFi agglomerations?
energizing the end-users
• endpoints are ubiquitous
  • internet, intranet, hotspot
  • toward a uniform architecture
• end-users will help
  • populist appeal to home users is timely
  • enterprise IT can dictate endpoint software
  • differentiating incentives for endpoint vendors
• the connection: peer-to-peer technology
  • harnessed to the good!
  • ease of use
  • built-in scaling
  • decentralization of trust and liability
• p2p technology is ripe. a noble app here with significant uptake?
vision 2: shared network monitoring
• endpoint monitoring becoming a trend
  • NETI@Home (GA Tech)
  • DIMES (TAU)
  • ForNet (Polytechnic)
  • DShield
  • DOMINO (Wisconsin)
• we share the vision!
• but all facing key challenges in getting uptake
  • what’s in it for the community members?
  • disincentives: privacy & security risks
a communal approach
• enable multiple efforts with a single distributed infrastructure
  • extensible endpoint “sensors” and visualizations
  • shared engine connecting them up
• a group bands together on the hard systems and crypto
  • cost-effective data processing and analysis
  • verifiable data and processing
  • distributed resource limiting
  • toolkit of privacy-preserving, distributed dataflow components
• a theme: dissemination is as important as collection
  • attract end-users with visible community information
  • enable real-time swapping across research teams
  • there may be much more here (see next vision!)
• intel research is prepared to invest in this community
  • as we did with planetlab
vision 3: the network oracle
• imagine that you knew everything about the internet, at every moment
  • network maps
  • link loading
  • point-to-point latency and bandwidth
  • event detections (e.g., from firewalls)
  • naming (DNS, ASes, etc.)
  • end-system software configuration information
  • router configurations and routing tables
• how would this change things?
  • the design of protocols
  • the design of networked applications
  • network and system management (performance and security)
  • the economy (and policy) of network clients and ISPs
  • etc.
a dirty (not-so) secret
• we’re sneaking up on the oracle already
  • overlays are a subversive attempt to wrest control from ISPs
  • overlays compute and disseminate measurements
• measurement and functionality appetite growing
  • everybody’s favorite planetlab exercise: all-pairs ping
  • detour routing a la RON
  • custom routing a la i3/ROSE
• but this is not being done systematically
  • every overlay does its own thing, opaquely
  • granularity of aggregation in time and space not well explored
  • measurement & dissemination often secondary/implicit
  • algorithmic/architectural choices abound, little exploration
• and the brass ring remains…
wrapping up: 3 visions
• multiple rationales to pursue this agenda
• commonalities
  • many networked sensors
  • many computational agents for data processing
  • many destinations for result dissemination
  • decentralized infrastructure:
    • organic scaling
    • no centralized maintenance
    • no single unified repository of raw data (privacy ramifications)
• differences (invariably!)
  • desired data granularities, in time and space
  • “reach” of querying and dissemination
  • sensitivity to privacy issues
• goal: a shared infrastructure
  • shared effort to develop and extend it, seeded by intel research
  • shared bootstrap deployment (planetlab and beyond)
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
pier: p2p information exchange & retrieval
• a wide-area distributed dataflow engine
  • designed to scale to thousands or millions of nodes
  • outfitted with “streaming” relational operators, recursive graph queries
  • fully extensible dataflow graphs, SQL-like interface for convenience
• built on distributed hash table (DHT) overlays
  • a put()/get() hashtable interface for the internet
  • content-based routing, soft-state semantics
  • pier is DHT-agnostic (CAN, Chord, Bamboo)
• a very different design point than DB2, Oracle, etc.
  • scale = # machines, not necessarily # bytes
  • relaxed consistency a requirement (not really a dataBASE at all)
  • organic scaling
  • data lives in its natural habitat
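The DHT substrate the slide describes can be pictured with a toy single-process model of the put()/get() interface with content-based routing (keys hashed to owners) and soft-state semantics (entries expire unless refreshed). This is a minimal sketch for intuition only; the class and method names are hypothetical, not PIER's actual API.

```python
import hashlib
import time

class SoftStateDHT:
    """Toy single-process model of a DHT put()/get() interface.
    Entries are soft state: they expire after a TTL unless re-put."""

    def __init__(self):
        self.store = {}  # key hash -> (value, expiry timestamp)

    def _hash(self, key):
        # content-based routing: a key's hash determines its owner
        return hashlib.sha1(key.encode()).hexdigest()

    def put(self, key, value, ttl=30.0):
        self.store[self._hash(key)] = (value, time.time() + ttl)

    def get(self, key):
        entry = self.store.get(self._hash(key))
        if entry is None:
            return None
        value, expiry = entry
        if time.time() > expiry:
            # soft state: stale entries simply vanish,
            # rather than requiring explicit deletion
            del self.store[self._hash(key)]
            return None
        return value

dht = SoftStateDHT()
dht.put("attacker:10.0.0.1", {"reports": 42}, ttl=60.0)
print(dht.get("attacker:10.0.0.1"))  # {'reports': 42}
```

The soft-state design choice matters at scale: when a node crashes, its stale entries age out on their own, so no global cleanup protocol is needed.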
initial pier applications
• φ intrusion app
  • real-time snort aggregation from ~300 planetlab nodes
  • identification of top-10 attackers (validating DOMINO)
  • real-time joins: “who are my attackers attacking?”
  • plausible end-user visualizations
• transitive closures and other graph algorithms
  • distributed gnutella crawler
  • distributed web crawler
  • shortest-paths queries (distance-vector routing)
• improved filesharing for rare items
  • deployed as hybrid gnutella ultrapeer on 50 planetlab nodes
  • intercepts gnutella queries, identifies “rare items” and publishes them
  • 18% decrease in number of unnecessarily empty query results
  • 66% possible with better “rare item” identification
• upshot: reasons to believe the generality is real
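The "top-10 attackers" query above is at heart a distributed GROUP BY plus top-k: each node ships a partial count of alerts per source address, and the partials are merged into a global ranking. A minimal sketch of that merge step, with illustrative data (the report format is an assumption, not PIER's schema):

```python
from collections import Counter
import heapq

def top_k_attackers(node_reports, k=10):
    """Merge per-node partial aggregates (source IP -> alert count)
    into the global top-k attacker list."""
    total = Counter()
    for partial in node_reports:   # one partial aggregate per node
        total.update(partial)
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

# simulated partial counts from two monitoring nodes' snort logs
node_a = {"10.0.0.1": 120, "10.0.0.9": 3}
node_b = {"10.0.0.1": 80, "192.168.1.5": 40}
print(top_k_attackers([node_a, node_b], k=2))
# [('10.0.0.1', 200), ('192.168.1.5', 40)]
```

Because counts are decomposable (sums of sums), the merge can run hierarchically in the overlay rather than at a single collection point, which is what makes the aggregation cheap at planetlab scale.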
pier in the φ context
• goal is for pier to serve as an information plane
  • gather data from “sensors”
  • perform basic filtering, aggregation, combination
    • though aggregation can be rather fancy (e.g. wavelet encoding)
  • disseminate the right “cooked” data to the right people
• and do so in a “trusted” way
  • privacy and security
  • manageability
• but … only a piece of the puzzle
  • active probing
  • mapping
  • backbone monitors
  • network forensics, tomography
  • honeypots
  • etc.
• we won’t do all of this ourselves!
  • gathering playmates
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
challenges
[diagram slide. general challenges: declarative queries, security, privacy, quality of service, query optimization, multi-query optimization, catalogs, persistent storage, recursion on graphs. efficiency challenges: query dissemination, replication, soft-state, quality of service, net-embedded functions, resilience, route flapping. layers: query plan over overlay network over physical network.]
current limitations of pier
• query per client
  • no systematic sharing of computation/results across queries
• locality control forfeited to the DHT
  • difficult to express local gossiping rules
• queries, not triggers
  • alerts currently supported via polling
• loose query semantics
  • network dynamics and timing make guarantees hard
• active monitoring
  • we can do it, but it’s not systematic
• security/privacy
  • we’re attacking many of these now
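The "alerts via polling" workaround named above amounts to re-running a snapshot query on a timer and diffing successive results; rows seen for the first time become alerts. A toy sketch under that assumption (run_query is a hypothetical stand-in for issuing a query, not a PIER call):

```python
import time

def poll_for_alerts(run_query, interval=5.0, rounds=3):
    """Emulate triggers by polling: re-run a snapshot query every
    `interval` seconds and alert on rows not seen before."""
    seen = set()
    alerts = []
    for _ in range(rounds):
        for row in run_query():
            if row not in seen:    # first sighting -> alert
                seen.add(row)
                alerts.append(row)
        time.sleep(interval)
    return alerts

# simulated successive snapshots (e.g. current top attacker IPs)
snapshots = iter([["10.0.0.1"], ["10.0.0.1", "10.0.0.9"], ["10.0.0.9"]])
print(poll_for_alerts(lambda: next(snapshots), interval=0.0))
# ['10.0.0.1', '10.0.0.9']
```

The sketch also shows why polling is unsatisfying: detection latency is bounded below by the polling interval, and every round pays the full query cost even when nothing changed, which is the motivation for native triggers.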
so, is pier the “right” infrastructure?
• not today
• though many of the decisions seem sound
  • level of indirection between task specification and execution
  • non-hierarchical model provides flexibility and simplicity
    • vs. domain hierarchy (a la IP naming)
    • vs. data hierarchies (a la XML)
  • extensible aggregation + relational operators covers a lot of territory
    • monitoring
    • routing
potential synergies
• design of shared info plane
  • scenarios & requirements
  • architectural brickbats
  • built-in components
  • complementary components
    • and requirements for integration
• understanding the opportunity
  • what if the network oracle existed?
• fostering the community
  • leveraging each other’s efforts to get mindshare
  • resources
  • if the intel genie granted you a wish…
    • (think about building/leveraging community)
A Note on Structured Data on Networks
• Industrial Revolution for Information
• Mechanized data generation
  • Sensing the physical world
  • Monitoring software, networks, machines
  • Tracking objects, processes, behaviors
• Uniformity of products
• Mass Transport of Data and Computation
  • Data generators and consumers spread over the Internet and the Planet
  • Happening at both extremes
• Compare to hand-generation of text