Case Study: Debugging Multicast Problems from an Applications Perspective. Steven Senger, Ph.D. Dept. of Computer Science University of Wisconsin - La Crosse. HAVnet Project. Parvati Dev, PI, Stanford SUMMIT National Library of Medicine, NGI & SII programs since 1999.
Case Study: Debugging Multicast Problems from an Applications Perspective Steven Senger, Ph.D. Dept. of Computer Science University of Wisconsin - La Crosse
HAVnet Project • Parvati Dev, PI, Stanford SUMMIT • National Library of Medicine, NGI & SII programs since 1999. • Applications of high-performance networks to anatomical and surgical education. • http://havnet.stanford.edu • http://visu.uwlax.edu
Other Apps and Components • Information Channels • Multicast based announcement/discovery mechanism. • Supports other app requirements such as logging. • Access Grid
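A multicast-based announcement/discovery mechanism of the kind listed above can be sketched as follows. This is only an illustration, not the HAVnet implementation: the group address, port, JSON message layout, and all function names are assumptions made for the example.

```python
import json
import socket

GROUP = "239.255.0.1"  # hypothetical administratively scoped group (assumption)
PORT = 5007            # hypothetical port (assumption)


def encode_announcement(service, host, port):
    """Serialize one service announcement as UTF-8 JSON (assumed format)."""
    msg = {"service": service, "host": host, "port": port}
    return json.dumps(msg).encode("utf-8")


def decode_announcement(payload):
    """Parse an announcement datagram back into a dict."""
    return json.loads(payload.decode("utf-8"))


def send_announcement(payload, group=GROUP, port=PORT, ttl=1):
    """Send one announcement datagram to the multicast group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # TTL 1 keeps the datagram on the local subnet; raise it to cross routers.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))


def open_listener(group=GROUP, port=PORT):
    """Return a UDP socket that has joined the group on the default interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: 4-byte group address followed by 4-byte local interface address.
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

A listener would call `open_listener()`, then `decode_announcement(sock.recvfrom(65535)[0])` for each datagram. Other uses such as logging could subscribe to the same group.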
Potholes Along the Way • Stanford / CENIC • Multicast setup delay • WiscNet • Conflict between sender and receiver • Michigan / Merit • Multicast setup delay • Inbound flow stops after 209 secs
Stanford / CENIC … • Longstanding problem (observed in '01). • Large delays (~15 min) in multicast setup. • Stanford / La Crosse / NLM • Significant delays except for La Crosse / NLM • Originally thought to be at Stanford border and RP. • '04 hardware/IOS upgrades at Stanford. • Situation improved.
Stanford / CENIC … • Only Michigan to Stanford delayed, ~6 mins. • Oct '04: phone calls among Stanford, CENIC, vendor support, and La Crosse. Escalated through 3 layers of vendor support. • Test/debug every couple of weeks through March '05. • Identified as an MSDP propagation delay related to encapsulating/decapsulating data received by MSDP.
Stanford / CENIC • Delay occurred at each CENIC router. • At some point problem had been internally found and resolved by vendor. • Solution: upgrade OS on CENIC routers.
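From the application side, the setup delay described in these slides is the time between joining a group and receiving the first packet. A minimal sketch of how that could be measured is shown here. The function names and the default timeout are assumptions for illustration, not the tooling used in the project.

```python
import socket
import time


def setup_delay(join_time, arrival_times):
    """Seconds from the join to the first packet at or after it, or None."""
    after = [t for t in arrival_times if t >= join_time]
    return min(after) - join_time if after else None


def measure_setup_delay(group, port, timeout=1200.0):
    """Join `group` and wait for the first datagram.

    Returns elapsed seconds, or None if nothing arrives within `timeout`.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.settimeout(timeout)
        start = time.monotonic()
        # Joining triggers the IGMP report; the clock starts just before it.
        mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        try:
            sock.recvfrom(65535)
        except socket.timeout:
            return None
        return time.monotonic() - start
```

`setup_delay` works on logged timestamps, so the same calculation can be applied to packet captures taken at several sites. `measure_setup_delay` needs a live sender on the group.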
La Crosse / WiscNet … • First observed spring '05 using AccessGrid. • La Crosse sender and Stanford receiver OK. • Starting a La Crosse receiver breaks the flow. • WiscNet identified problem router. • Vendor support engaged. • Discovered that an rpd restart is sufficient to fix. • Recurs every 2 months.
La Crosse / WiscNet … • When failing • Upstream interface on router gets set to unreasonable value. • Sender continues to send data in encapsulated PIM-register messages. • Router never sends register-stop messages.
La Crosse / WiscNet • Problem has survived router chassis upgrade. • No solution as yet.
U. Michigan / Merit … • Discovered after CENIC problem solved. • Small delay in setup for Michigan to Stanford. • Varies between 0 and 60 sec. • Similar behavior for Milwaukee to Stanford. • Does not appear to be in CENIC?
U. Michigan / Merit … • Presence of other receivers seems to change the setup delay. • Merit engaged in isolating problem. • No solution as yet.
U. Michigan / Merit • Discovered Jan '06 using AccessGrid. • Traffic from Stanford to MCBI/Merit starts correctly but stops after 208 seconds. • When stopped, IPLSng shows as pruned. • Merit identified problem with a switch in Chicago not allowing streams to set up correctly. • Problem resolved with OS upgrade.
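A flow that starts correctly and then stops after a fixed interval can be spotted from receiver-side packet timestamps. The sketch below is a hypothetical helper, not a tool from the talk. It reports how long a flow ran before its first long gap; the 5-second gap threshold is an assumption.

```python
def seconds_until_stall(arrival_times, observe_until, max_gap=5.0):
    """Return seconds from the first packet to the last packet seen before
    the first gap longer than `max_gap`, or None if the flow never stalls.

    `observe_until` is the time the observation ended, so that silence at
    the end of the capture also counts as a gap.
    """
    if not arrival_times:
        return None
    times = sorted(arrival_times)
    first = times[0]
    # Pair each packet with its successor; the last packet is paired
    # with the end of the observation window.
    for prev, nxt in zip(times, times[1:] + [observe_until]):
        if nxt - prev > max_gap:
            return prev - first
    return None
```

Getting the same value on repeated runs suggests a timer expiring somewhere in the path rather than random loss, which helps narrow down where to look.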
Diagnostic Help • Debugging strategies • Tools • Monitoring