
Network Embedded Systems Technology (NEST) Workshop on Extreme Scaling
Arlington, VA, November 12, 2003
Dr. Vijay Raghavan, Defense Advanced Research Projects Agency, Information Exploitation Office


Presentation Transcript


  1. Network Embedded Systems Technology (NEST) Workshop on Extreme Scaling, Arlington, VA. Dr. Vijay Raghavan, Defense Advanced Research Projects Agency, Information Exploitation Office. November 12, 2003

  2. Meeting Purpose
  • To discuss the NEST plan to create a robust 10,000 node autonomous ad hoc sensor network in an operationally relevant environment for demonstrating the capabilities in the monitoring and protection of long linear structures, by the end of FY ’04
  • Analysis:
  • To discuss the NEST plan → evaluate the platform, identify technological challenges, identify crucial tasks, formulate metrics, and create a schedule
  • Robust 10,000 node autonomous ad hoc sensor network → not a toy demonstration; two orders of magnitude larger than before
  • Operationally relevant environment → not a laboratory demonstration, not a simulation; it will be in a real system
  • Demonstrating the capabilities → okay, not a real fieldable system, but having all the capabilities for the mission (TRL 4.5)
  • Monitoring and protection → the network detects, tracks, and classifies intruders
  • Long linear structures → pipelines, borders, …
  • End of FY ’04 → okay, we can extend this a little, but …

  3. Deliverables of the meeting • Initial specification of hardware • Identification of tasks to be performed • Schedule of activities for FY ’04

  4. Agenda

  5. Extreme Scaling 2004: Securing a Large Distributed Perimeter. Norman Whitaker, NEST Extreme Scaling Workshop, November 12, 2003. [Photo: Caño Limón Pipeline, Colombia, 7/9/2003]

  6. Scaling [diagram: N sensors, exfiltration, information flow ~ N log N] • Unit cost(N) • Information flow(N) • Unit memory(N) • Network bandwidth(N)
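
A back-of-the-envelope sketch (an illustrative model, not from the slides): if each of the N sensor reports travels on the order of log N hops to reach an exfiltration point, total message traffic grows as N log N rather than N.

```python
import math

def exfiltration_load(n_sensors: int, reports_per_node: float = 1.0) -> float:
    """Total message-hops per reporting interval, assuming each of the N reports
    travels O(log N) hops to an exfiltration node (illustrative model only)."""
    return n_sensors * reports_per_node * math.log2(n_sensors)

for n in (100, 1_000, 10_000):
    print(f"N={n:>6,}: ~{exfiltration_load(n):>9,.0f} message-hops vs {n:>6,} for O(N)")
```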

  7. Iraq / Syria Border, October 2003 [map: ~300 km border between Iraq and Syria near Al Qaim, Iraq]

  8. “Kansas Pipeline” – Canonical 10,000 Node Problem
  • Laydown: 400 nodes/km², sensor nodes with 50 m range, both sides of a 25 km dirt road (500 m on each side), exfiltration node (PDA); diagram shows 1% of the total laydown
  • Schedule: 3 months – 1000 nodes; 6 months – 3000 nodes; 9 months – 6000 nodes; 12 months – full system demonstration
  • Specs: dismounts (armed and unarmed); vehicles; 2 second latency; < $150/node; Pfa < 1 alarm/day; Pd = 0.95 at walking speed
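
A quick sanity check on the canonical laydown (a sketch assuming the 400 nodes/km² reading and a 25 km x 1 km corridor, i.e., 500 m on each side of the road):

```python
corridor_length_km = 25.0
corridor_width_km = 1.0        # 500 m on each side of the dirt road (assumed reading)
density_per_km2 = 400          # assumed reading of the "400 nodes/km^2" figure label

total_nodes = corridor_length_km * corridor_width_km * density_per_km2
grid_spacing_m = 1000.0 / density_per_km2 ** 0.5   # spacing implied by a square grid

print(f"total nodes: {total_nodes:,.0f}")                       # -> 10,000
print(f"implied spacing: {grid_spacing_m:.0f} m (vs. 50 m sensor range)")
```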

  9. The Big Three Problems (MITRE)
  • Sensor range: require a magnetometer. The sensor range needs to be ~1.3x the node spacing, in this case ~65 m. Magnetometer ranges of 65 m are beyond the state of the art; Earth’s magnetic “noise” exceeds the signal by ~10x.
  • Over-the-air loading: this is a complex service: reliable multicast (i.e., with local retry); may need a group service. Very strong motivation towards incremental loading. Some applications may need to run while others load, which implies dynamic linking and requires OS-style security.

  10. The Big Three Problems (MITRE), continued
  • Sentry service (power): the required “on” subset of the network will be too large, and the alternative drives up the required sensor range. Extending system life by raising node density is inferior to just using a bigger battery. Need temporal sampling, power savings of ~100x, and an almost-always-off communication model; maybe comm. deadlines and comm. scheduling.
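
A rough duty-cycle sketch of why ~100x power savings pushes toward an almost-always-off model. The current draws and battery capacity below are assumptions, chosen only to be roughly in the range of early-2000s mote hardware:

```python
ACTIVE_MA = 20.0      # radio on, CPU active (assumed)
SLEEP_MA = 0.02       # deep sleep (assumed)
BATTERY_MAH = 2000.0  # battery capacity (assumed)

def lifetime_days(duty_cycle: float) -> float:
    """Battery lifetime in days for a given fraction of time spent awake."""
    avg_ma = duty_cycle * ACTIVE_MA + (1 - duty_cycle) * SLEEP_MA
    return BATTERY_MAH / avg_ma / 24.0

for dc in (1.0, 0.10, 0.01):
    print(f"duty cycle {dc:>4.0%}: ~{lifetime_days(dc):5.0f} days")
```

Going from always-on to a ~1% duty cycle buys roughly two orders of magnitude of lifetime, the same ~100x the slide calls for.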

  11. Feedback from 9-22-03 Conference Call
  1. Timeline is too aggressive (June 04: 100; December 04: 1,000; June 05: 10,000)
  2. Sensor mix: (a) magnetics unlikely to detect weapons at ranges > 1 m; (b) PIR unlikely to be a cure-all; (c) consider McKuen radars
  3. Comms range seen as problematic
  4. Need 4x more PDAs – Pfa / lifetime / latency very aggressive

  12. Feedback from 9-22-03 Conference Call (cont.)
  5. Rock-solid over-the-air loading capability needed to demonstrate scalability – and this is not currently in hand; architecture work needed
  6. Power is a huge problem for a demonstration: always-on operation is needed to achieve Pfa; debugging and software loads are battery intensive; lifetimes are currently far too short
  7. Solid packaging needed (“one touch only”)
  8. Need a separate contractor to actually camp in the desert and deploy sensors
  9. Next-tier PDA work needed, as well as the highest tier (GUI)

  13. Feedback from 9-22-03 Conference Call (cont.)
  10. Dynamic tagging seen as an alternate approach that demonstrates scalability without the problems of a static sensor laydown
  11. Rush to production seen as a costly mistake. A disciplined, step-by-step process with formalized acceptance criteria at each stage was recommended to avoid a failure.

  14. Systems that Scale Well 1. Self-contained sensors 2. Self-contained static sensor clusters 3. Sensor clusters with handoff interactions

  15. Systems that Scale Well (cont.) 4. Sensor clusters with dynamic assignments 5. Robust systems that adjust to sensor dropouts, etc. 6. Systems which maintain only local state

  16. Systems That Do Not Scale Well 1. Any system with a non-zero Pfa per sensor node 2. Any non-robust system that requires multiple touches per sensor 3. Any system with performance that can be degraded by a malfunctioning, poorly placed, or imperfect node 4. Any system that creates a disproportionate response to stimuli 5. Any system that does not plug into umbrella information systems
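
To see why item 1 does not scale (a sketch, not from the slides): per-node false alarms add up across the network, so even a tiny per-node rate overwhelms a system-level budget of Pfa < 1 alarm/day at N = 10,000.

```python
def network_false_alarms_per_day(p_node: float, n_nodes: int) -> float:
    """Expected network-wide false alarms per day, assuming each node
    independently false-alarms with probability p_node per day (illustrative)."""
    return p_node * n_nodes

for p in (1e-2, 1e-3, 1e-4):
    rate = network_false_alarms_per_day(p, 10_000)
    print(f"per-node rate {p:g}/day -> ~{rate:g} alarms/day network-wide")
```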

  17. Conclusion • For most applications sensor nets are inherently scalable, but their implementations are often not • The path to practical scalable systems lies through careful extensible architecture work and bulletproof hardware/software design • It may be possible to demonstrate (but not fully test) a scalable system with fewer than 10,000 nodes

  18. Lessons from the UCB Wireless OEP Demo and Issues in Scaling Sensor Webs. Cory Sharp, Robert Szewczyk, Massimo Franceschetti, Prabal Dutta, David Culler, Shankar Sastry. NEST Extreme Scaling Workshop, Washington, DC, 11/12/2003

  19. Lessons from UCB Demo • Scaling Issues we set out to tackle • Things that proved very valuable • Things that proved problematic • Things we wished we had • Monitoring and Restart from the Beginning • Scalable Network Reprogramming • Foundations of Scale • Recommendations

  20. Scaling Issues We Designed For

  21. Key Pragmatic Issues
  • Solve the specific problem, within time and $ budget, reliably
  • Time, Time, Time, Time
  • Time to design, prototype, and test what you think is a working hardware & software solution (not at scale)
  • Time to manufacture a large number of nodes (only get one shot, so be conservative)
  • Time to test-fix-test-integrate-test-fix-… (don’t see scaling issues until late in the process; ~1000 person-hours in the last two weeks)
  • Time to assemble, reassemble, … (human scaling limits)

  22. Hardware
  • Driven by the specific application plan, not generic/experimental like Mica
  • Usage factors: outside, robots running over it, in sun, possibly wet, …
  • Specific sensing modalities: board designs, geometric orientation, …
  • Mechanical design is central (co-design): application => package => boards
  • Foresee the needs of the development, debug, and deployment process: field UI, reset; assembly, re-assembly, recharging process
  • Minimize exposure
  • Lift the radio off the ground

  23. PEG Mechanical Design [diagram labels: reflector, exposed components, O-ring seal, watertight compartment, ultrasound, Mica2Dot, mag sense, power, battery, collision absorption, good thermal characteristics]

  24. Software
  • NEST service architecture as a beginning
  • Greater number => simpler function
  • Complete modularity of each subsystem: node positioning; sensing and report; mobile-to-mobile routing; pursuer hear and navigate
  • Not just a source concept, but full interfaces to allow testing each in the field without the others to blame
  • Delay binding wherever possible
  • Complete parameterization: thresholds, update rates, ranges, …
  • Management services
  • Assume irregular, lossy connectivity

  25. Algorithmic Simplicity by Decomposition
  • Lowest-tier nodes only performed entity detection: local processing, threshold detection, announce
  • Simple leader election
  • Simple aggregation for position estimation; little calibration
  • Simple mobile-to-mobile reliable routing: solid tree rooted at landmark
  • Complexity consolidated in the higher tier: disambiguation, trajectory estimation, navigation, noise-adaptive control loop
  • Minimalism pays; nice to have solid theory to support doing very simple things
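
A minimal sketch of the lowest-tier pattern described above: local thresholding, announcing detections, electing the strongest-reading node as leader, and a simple weighted-centroid position estimate. This is an illustrative reconstruction with assumed names and threshold, not the UCB code.

```python
from dataclasses import dataclass

THRESHOLD = 100.0  # assumed sensor units

@dataclass
class Detection:
    node_id: int
    x: float
    y: float
    magnitude: float

def detect(node_id, x, y, reading):
    """Lowest tier: local threshold detection, then announce (return) a detection."""
    return Detection(node_id, x, y, reading) if reading > THRESHOLD else None

def elect_leader(detections):
    """Simple leader election: the node with the strongest reading leads."""
    return max(detections, key=lambda d: d.magnitude)

def estimate_position(detections):
    """Simple aggregation: magnitude-weighted centroid of the detecting nodes."""
    total = sum(d.magnitude for d in detections)
    x = sum(d.x * d.magnitude for d in detections) / total
    y = sum(d.y * d.magnitude for d in detections) / total
    return x, y

reports = [d for d in (detect(1, 0, 0, 150), detect(2, 10, 0, 120), detect(3, 20, 0, 40)) if d]
print(elect_leader(reports).node_id, estimate_position(reports))
```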

  26. Steps that proved invaluable

  27. Management Services
  • Ident: ping, version, compile time
  • Command line scripting: check that all report in; check config values
  • Config: set/get each potential parameter; automated behaviors on update; save/restore to flash; human override
  • Reset
  • Sleep / Wakeup
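
A minimal sketch of the kind of Ident/Config command handling the slide lists; the names, parameters, and defaults are hypothetical, not the actual NEST service interface.

```python
import time

NODE_ID = 17                      # hypothetical node identity
VERSION = "demo-1.0"
COMPILE_TIME = "2003-11-01 12:00"
config = {"threshold": 100.0, "report_period_s": 2.0, "radio_range_m": 50.0}

def handle_command(cmd: str, arg: str = "", value: str = ""):
    """Dispatch a management command: ping, version, get, set, or reset."""
    if cmd == "ping":
        return f"node {NODE_ID} alive at {time.time():.0f}"
    if cmd == "version":
        return f"{VERSION} (compiled {COMPILE_TIME})"
    if cmd == "get":
        return f"{arg}={config[arg]}"
    if cmd == "set":
        config[arg] = type(config[arg])(value)   # keep the parameter's original type
        return f"{arg} set to {config[arg]}"
    if cmd == "reset":
        return "resetting (a real node would reboot here)"
    return f"unknown command: {cmd}"

print(handle_command("ping"))
print(handle_command("set", "threshold", "120"))
```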

  28. Simple and Solid • Once a node is assembled, if it responds to Ping, it should work • Reliability of quick test was key • Complete testing of each major component at near-scale in isolation • Clean stimulus, observable response • Combining was trivial • Powerful receiver to see what was going on in much of network

  29. Multiple Fall Backs
  • Localization: fix the position based on network address; transform of address; explicit set; ranging
  • Routing: single hop to base; 802.11 back channel to pursuer; evader-pursuer back channel; landmark routing; excellent signal-strength-based slow tree build
  • Sensing
  • Used most of them in testing; kept them in the code; reduced stress along the way

  30. Things that Proved Problematic

  31. Hardware
  • Biggest risk factor: getting boards done and software working in time; budget allowed only one production shot; scale was limited by the slowest production step
  • Antenna: internal coiled antenna was dismal => external whip; proved to be time consuming and problematic in assembly; weak mount point (10% loss)
  • Recharging required disassembly (see above)
  • Power conditioning introduced noise issues & leakage
  • Unforeseen interactions: radio + magnetometer; used the wrong kind of wire (duh!)
  • No self-guided assembly; modularity, while it has many virtues, has a significant cost
  • Too hard to see LEDs; no physical reset
  • Multiple antennas on the Pioneers interfered; differential GPS black magic
  • TIME, TIME, TIME: you don’t know what is wrong until late in the development

  32. Software
  • Network programming fell apart for a large app on many nodes; dictated the development cycle; Config was a life saver
  • Got solid single hop just after it was needed; the current design has deep shortcomings (below)
  • Software knowledge bottleneck: nice abstractions that too few people knew how to use
  • Underestimated the amount of glue code between the tiers
  • Wished we had: better regression suites; logged everything; much better operational deployment protocol and bookkeeping

  33. Towards the Extreme Scaling Effort

  34. Monitoring and Management of the Network
  • Determine what went wrong and take action; extremely simple and reliable; ensure liveness, preserve access to network nodes, help with fault diagnosis
  • Network monitoring: enforce minimum and maximum transmission rates; verify minimum reception rate
  • Node health monitoring: beyond battery voltage – sensor data monitoring; failure of a sensor is an indicator of node health; detectable failures impacting lifetime – GDI humidity and clock skew experience
  • Program integrity checks: stack overflow
  • Watchdog / deadman timer: require attention from different parts of the system, reset if abandoned; address many different time scales; test low-level system components (timers and the task queue) to ensure basic system liveness; first use: min reception rate
  • Partial system shutdown: flash/EEPROM writing under low-voltage conditions, broken sensors, etc.
  • Fault handling: error logging and/or reporting
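
A minimal sketch of the watchdog / deadman-timer idea above: several parts of the system must check in within their deadlines, otherwise the node resets itself. The subsystem names and periods are assumptions for illustration.

```python
import time

class Watchdog:
    """Deadman timer requiring attention from several subsystems on different time scales."""
    def __init__(self, deadlines_s):
        self.deadlines = deadlines_s                      # subsystem -> max allowed silence
        self.last_seen = {k: time.time() for k in deadlines_s}

    def pet(self, subsystem):
        """Called by a subsystem to prove it is still alive."""
        self.last_seen[subsystem] = time.time()

    def expired(self):
        """Return the subsystems that have missed their check-in deadline."""
        now = time.time()
        return [s for s, d in self.deadlines.items() if now - self.last_seen[s] > d]

wd = Watchdog({"timer": 5.0, "task_queue": 5.0, "radio_rx": 600.0, "sensor": 3600.0})
wd.pet("timer")
if wd.expired():
    print("missed check-ins:", wd.expired(), "-> reset node")
```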

  35. Scalable Network Programming
  • Out-of-core algorithms, epidemic algorithms
  • Basic primitive – reliable page transfer: both broadcast and point-to-point transfer; fragment and transmit packets per page, vector ack, selective fix-up, snoop, consistency checking; constraints: fit in RAM, match the flash transfer size
  • Dissemination algorithms: fast push, pipeline a series of pages in space; epidemic maintenance; eliminate the “wrong image” problem
  • New version: propagate page diffs; potentially, multiple levels of pages to organize code and metadata
  • System image: version number + vector of page hashes; organize the image to reduce the number of pages that change; linearization of functions, holes
  • Blue-sky ideas: default TinyOS network reprogramming kernel as the system restore checkpoint
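
A small sketch of the "version number + vector of page hashes" idea: a node compares an advertised hash vector against its own image and fetches only the pages that differ. The page size and hash choice are assumptions; this is not the actual TinyOS network reprogramming code.

```python
import hashlib

PAGE_SIZE = 1024  # bytes per page; chosen here to match a flash transfer unit (assumption)

def page_hashes(image: bytes):
    """Split an image into fixed-size pages and hash each one."""
    pages = [image[i:i + PAGE_SIZE] for i in range(0, len(image), PAGE_SIZE)]
    return [hashlib.sha1(p).hexdigest() for p in pages]

def pages_to_fetch(local_hashes, advertised_hashes):
    """Indices of pages that differ from the advertised image (or are missing locally)."""
    return [i for i, h in enumerate(advertised_hashes)
            if i >= len(local_hashes) or local_hashes[i] != h]

old = b"A" * 4096
new = b"A" * 2048 + b"B" * 1024 + b"A" * 1024               # a single page changed
print(pages_to_fetch(page_hashes(old), page_hashes(new)))   # -> [2]
```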

  36. Conceptual Issues in Scaling

  37. Theory: percolation theory, random graphs, distributed computing, distributed control, channel physics, distributed sampling, network information theory, network coding • Practice: connectivity, routing, storage, failures, packet loss, malicious behavior, remote operation

  38. Example: Percolation theory • Beyond the disc abstraction • Random connection model • Small world networks • Effect of interference • What is connectivity? [plot: Prob(correct reception)]

  39. Connectivity in Practice [plot: clear region, transitional region, effective region; the effective region determines node spacing]

  40. Beyond the disc abstraction [plots of connection probability vs. distance d: continuum percolation (disc of radius 2r) and a new random-connection model]
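
A small Monte Carlo sketch contrasting the hard-disc connection rule with a softer random-connection function (the exponential falloff used below is an assumed stand-in, not the model from the slides). It estimates the fraction of nodes in the largest connected component under each rule.

```python
import math, random

def largest_component_fraction(n, connect_prob):
    """Place n nodes uniformly in the unit square, connect pairs with the given
    distance-dependent probability, and return the largest component's share."""
    pts = [(random.random(), random.random()) for _ in range(n)]
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < connect_prob(math.dist(pts[i], pts[j])):
                adj[i].append(j)
                adj[j].append(i)
    seen, best = [False] * n, 0
    for s in range(n):
        if seen[s]:
            continue
        stack, size, seen[s] = [s], 0, True
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        best = max(best, size)
    return best / n

R = 0.08
disc = lambda d: 1.0 if d <= R else 0.0       # disc abstraction
soft = lambda d: math.exp(-(d / R) ** 2)      # assumed soft connection function
print("disc:", largest_component_fraction(200, disc),
      "soft:", largest_component_fraction(200, soft))
```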

  41. What can we prove? “Squishing and squashing,” “shifting and squeezing”: (sporadic) long links help the connectivity process. CNP = average number of connections per node needed for percolation.

  42. What about non-circular shapes? Is the disc the hardest shape to connect overall (in terms of CNP)?

  43. What about routing? • Routing with occasional long links • Navigation in the small world • Connection with social science!

  44. Small World Networks, Watts and Strogatz (1998) [figure: regular, small world, and random graphs]. Can we use the few long links for routing?

  45. Routing in a small world: each node has only local information of the network connectivity
  • Kleinberg (2000): nodes on the grid; fixed number of contacts; probability scales with distance
  • New model (2003): nodes at random on the plane; random number of contacts in a given region; density scales with distance
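
A compact sketch of greedy routing on a small-world grid in the general spirit of Kleinberg's construction; the grid size and the inverse-square long-range contact distribution below are assumptions for illustration, not either model's exact definition.

```python
import random

N = 20  # grid side; nodes live at (i, j) on an N x N lattice (illustrative size)

def dist(a, b):
    """Manhattan (lattice) distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def long_range_contact(u):
    """Pick one long-range contact with probability ~ 1/d^2 (harmonic-style)."""
    others = [(i, j) for i in range(N) for j in range(N) if (i, j) != u]
    weights = [dist(u, v) ** -2 for v in others]
    return random.choices(others, weights=weights, k=1)[0]

contacts = {(i, j): long_range_contact((i, j)) for i in range(N) for j in range(N)}

def greedy_route(src, dst):
    """Forward to whichever neighbor (grid or long-range) is closest to dst; count hops."""
    hops, cur = 0, src
    while cur != dst:
        i, j = cur
        nbrs = [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i + di < N and 0 <= j + dj < N]
        nbrs.append(contacts[cur])
        cur = min(nbrs, key=lambda v: dist(v, dst))
        hops += 1
    return hops

print("hops from corner to corner:", greedy_route((0, 0), (N - 1, N - 1)))
```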

  46. Long, Good Links Valuable

  47. Theorem about Routing in a Small World. Connections of z are Poisson points of a given density; ε-delivery occurs when the message is delivered within ε of the target. [diagram: source S, target T, distance d, radius ε]

  48. Bottom line: build networks where the number of neighbors scales as 1/x² to obtain efficient routing. [diagram: source S, target T, distance d, radius ε]
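
For background, the standard Kleinberg-style statement behind this bottom line, written out as a hedged reference (the slides' exact theorem may differ): on a two-dimensional lattice, choosing each long-range contact with probability proportional to the inverse square of the distance lets greedy routing reach the target in O(log² n) expected hops.

```latex
% Harmonic (inverse-square) contact distribution on a 2-D lattice and the
% resulting greedy-routing bound (standard Kleinberg-style form):
\Pr[u \to v] \;=\; \frac{d(u,v)^{-2}}{\sum_{w \neq u} d(u,w)^{-2}},
\qquad
\mathbb{E}[\text{greedy-routing hops}] \;=\; O\!\left(\log^{2} n\right).
```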

  49. Red, Yellow, Green Lights: The Systems View • Define the application in levels • Minimal demo • Intermediate demo • Ideal demo • Technical challenges required to reach a level determine the severity of the issue
