Design Exploration of a Human-Machine Interface (HMI) Application • Francis Li • Sam Madden
The Application • Data glove interface • Wired, bulky • SmartDust scenario • A mote on each fingertip • Investigate implementations • Explore design alternatives
Proof-of-Concept Prototype • By SmartDust group • Atmel AVR Microprocessor • RFM TR1000 Radio • 6 accelerometers • Host PC performs processing • Analysis • Power: 45 mW measured • Continuous operation of processor, accelerometers, communication with host
Application Analysis • Processing (on PC) • Do 20 times per second, for each accelerometer • Read in X and Y samples (10 bits each) • Compute rolling average to smooth input data • Convert averages to polar coordinates • Dominates cost: sqrt, acos, atan • Secondary cost: floating point operations • Periodically, calculate gesture via simple template matching (static hand positions)
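A minimal sketch of the per-sample processing listed above, written in Python for readability rather than taken from the prototype's PC code. The window length, the class and function names, the use of `hypot`/`atan2` in place of the slide's sqrt/acos/atan, and the squared-error match metric are all assumptions for illustration.

```python
import math
from collections import deque


class RollingAverage:
    """Smooth a stream of samples over a fixed window (window size is assumed)."""

    def __init__(self, window=4):
        self.samples = deque(maxlen=window)

    def update(self, value):
        # The deque drops the oldest sample automatically once it is full.
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


def to_polar(x_avg, y_avg):
    """Convert averaged X/Y readings to (magnitude, angle in radians)."""
    return math.hypot(x_avg, y_avg), math.atan2(y_avg, x_avg)


def find_gesture(pose, templates):
    """Return the name of the stored static pose closest to the current one.

    pose: list of (magnitude, angle) tuples, one per accelerometer.
    templates: dict mapping gesture name -> list of (magnitude, angle) tuples.
    """
    def error(template):
        # Hypothetical metric: summed squared difference over all sensors.
        return sum((m - tm) ** 2 + (a - ta) ** 2
                   for (m, a), (tm, ta) in zip(pose, template))

    return min(templates, key=lambda name: error(templates[name]))
```

In use, each accelerometer would own two `RollingAverage` objects (X and Y), updated 20 times per second, with `find_gesture` called periodically on the six resulting polar pairs.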
Application Analysis (cont) • Communication (from Atmel to PC) • 20 samples/sec • 6 accelerometers • 4 bytes/sample → 480 bytes/sec • 115.6 kb/sec RF link • Radio = 12 mA @ 3 V when transmitting → 1.2 mW for radio alone • Real-world power >> 1.2 mW, due to software and analog overhead (real-world analysis later)
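The slide does not show how the 1.2 mW figure is derived. One reading that reproduces it, offered here as an assumption, is that the radio draws its full 36 mW only for the fraction of each second needed to push 480 bytes through the 115.6 kb/sec link. The constant names below are illustrative.

```python
SAMPLES_PER_SEC = 20
ACCELEROMETERS = 6
BYTES_PER_SAMPLE = 4
LINK_BITS_PER_SEC = 115_600   # "115.6 kb/sec" as given on the slide
RADIO_CURRENT_A = 0.012       # 12 mA when transmitting
RADIO_VOLTAGE_V = 3.0

# Payload data rate from the sensors to the host.
bytes_per_sec = SAMPLES_PER_SEC * ACCELEROMETERS * BYTES_PER_SAMPLE

# Assumed model: the radio is on only while bits are actually being sent.
duty_cycle = bytes_per_sec * 8 / LINK_BITS_PER_SEC
radio_on_mw = RADIO_CURRENT_A * RADIO_VOLTAGE_V * 1000
radio_avg_mw = radio_on_mw * duty_cycle

print(f"{bytes_per_sec} bytes/sec, duty cycle {duty_cycle:.1%}, "
      f"average radio power {radio_avg_mw:.2f} mW")
```

Under this model the average comes out at roughly 1.2 mW, matching the slide; protocol and analog overhead are ignored, which is why the real-world figure is much higher.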
Optimization Process • Match Application to HW • Local computation to reduce communication • Floating point → Fixed point • Match Hardware to Application • Distributed vs. Centralized • TI vs. Atmel • DSP
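To illustrate the floating point → fixed point step, here is a small sketch of fixed-point arithmetic using only integer operations. The Q8 format (8 fractional bits), the helper names, and the choice of an integer square root for the magnitude are assumptions; the slides do not say which representation the authors used.

```python
import math

Q = 8  # number of fractional bits (assumed Q8 format)


def to_fixed(x):
    """Convert a float to a Q8 fixed-point integer."""
    return int(round(x * (1 << Q)))


def to_float(f):
    """Convert a Q8 fixed-point integer back to a float."""
    return f / (1 << Q)


def fixed_mul(a, b):
    """Multiply two Q8 values; the shift restores the Q8 scaling."""
    return (a * b) >> Q


def magnitude_int(x, y):
    """Integer magnitude of raw 10-bit X/Y samples, with no floating point."""
    return math.isqrt(x * x + y * y)
```

For example, `fixed_mul(to_fixed(1.5), to_fixed(2.0))` yields the same integer as `to_fixed(3.0)`, with no floating-point instructions needed on the target.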
Communication vs. Computation • Estimates of local processing cost on Atmel (via simulation of GCC program) • Loop, 6 × 20 times/sec • Average: 2223 instr. × 2 • CalcPolar: 19017 instr. → 2.83×10^6 instructions • Report gesture once per second • FindGestureError: 5444 instr. • 10 gestures, 6 accelerometers → 5444 × 60 ≈ 3.26×10^5 instr. • Memory operations are 2 cycles/instruction • Total cycles ≈ 3.7M → 4 MHz → 13.5 mW • Communication = 8 bits/sec → negligible cost
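The instruction totals above can be checked with a few lines of arithmetic. The variable names are illustrative; the per-routine counts are the slide's. The products land within about 1% of the slide's rounded figures. The step from instructions to roughly 3.7M cycles depends on the share of 2-cycle memory operations, which the slide does not give, so it is not reproduced here.

```python
AVERAGE_INSTR = 2223         # per rolling average (run twice: X and Y)
CALC_POLAR_INSTR = 19017     # per polar conversion
FIND_GESTURE_INSTR = 5444    # per template comparison

LOOPS_PER_SEC = 6 * 20       # 6 accelerometers, 20 samples/sec each
COMPARISONS_PER_SEC = 10 * 6 # 10 gestures x 6 accelerometers, once per second

# Instructions spent in the sampling loop each second.
loop_instr = (2 * AVERAGE_INSTR + CALC_POLAR_INSTR) * LOOPS_PER_SEC

# Instructions spent matching gestures each second.
gesture_instr = FIND_GESTURE_INSTR * COMPARISONS_PER_SEC

total_instr = loop_instr + gesture_instr
print(f"loop: {loop_instr:.3g}, gesture: {gesture_instr:.3g}, "
      f"total: {total_instr:.3g} instructions/sec")
```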
Communication vs. Computation 2 • Cost of communication to host PC (measured) • 4317 nJ/bit • From Culler, Hill, Szewczyk, Woo, “System Architecture For Networked Sensors.” • 4317 nJ/bit × 480 bytes/sec × 8 bits/byte = 16.57 mW • Processor still draws power • Current implementation requires 13.5 mW • Using sleep, only 1.17 mW → 17.74 mW total
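The communication cost is a straight energy-per-bit calculation, sketched below with illustrative variable names. All input values are the slide's.

```python
ENERGY_PER_BIT_NJ = 4317      # measured cost of sending one bit
BYTES_PER_SEC = 480           # raw sample stream to the host
PROCESSOR_SLEEP_MW = 1.17     # processor cost when sleeping between samples

# nJ/bit * bits/sec = nJ/sec = nW; divide by 1e6 to get mW.
comm_mw = ENERGY_PER_BIT_NJ * BYTES_PER_SEC * 8 / 1e6
total_mw = comm_mw + PROCESSOR_SLEEP_MW

print(f"communication: {comm_mw:.2f} mW, total with sleep: {total_mw:.2f} mW")
```

The result agrees with the slide's 16.57 mW and 17.74 mW to within rounding in the last digit.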
Distributed vs. Centralized • Move some processing to each sensor • 6 processors • Each computing average, polar transform • Transmitting 4 × 8 = 32 bits once/second • Using Atmel processor on each mote • Computation • ≈ 0.5M cycles/sec → 2 mA @ 2.7 V → 5.4 mW • Communication • Very small: 4317 nJ/bit × 32 bits/sec = 0.13 mW • 5.53 mW/mote × 6 = 33.2 mW total (Bad idea!)
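A short sketch of the per-mote budget, using the slide's numbers and illustrative variable names. It shows why the distributed Atmel design loses: computation on six separate processors dominates, while communication becomes almost free.

```python
MOTES = 6
CPU_CURRENT_A = 0.002         # 2 mA at ~0.5M cycles/sec
CPU_VOLTAGE_V = 2.7
ENERGY_PER_BIT_NJ = 4317
BITS_PER_SEC = 4 * 8          # 4 bytes sent once per second

compute_mw = CPU_CURRENT_A * CPU_VOLTAGE_V * 1000
comm_mw = ENERGY_PER_BIT_NJ * BITS_PER_SEC / 1e6   # nW -> mW
per_mote_mw = compute_mw + comm_mw
total_mw = per_mote_mw * MOTES

print(f"compute {compute_mw:.2f} mW + comm {comm_mw:.2f} mW "
      f"= {per_mote_mw:.2f} mW/mote, {total_mw:.1f} mW for {MOTES} motes")
```

The communication term computes to about 0.14 mW; the slide quotes 0.13 mW, which appears to be truncated rather than rounded. The total still comes to about 33.2 mW.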
TI Microcontroller Evaluation • A microcontroller with better specs • MSP430P112: 330 µA/MHz active mode, 1.5 µA standby (6 ns wakeup) • Used IAR Systems compiler, profiler, development environment • Analysis • Centralized, 3.3 V, 4 MHz: 3.8 mW • Distributed, 2.5 V, 1 MHz: 0.48 mW per mote • Six processors → 2.9 mW
TI DSP Evaluation • TMS320C54x • Used TI Code Composer Studio, compiler, simulator • Power • Active mode, 3.3 V, 10 MHz: 33 mW • IDLE1: 0.36 mW • Analysis • Centralized: 7.8 mW • Distributed: 1.6 mW per mote • Six processors = 9.6 mW total
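The slide does not state how the C54x figures were obtained. A simple duty-cycle model, assumed here, reproduces them: the DSP runs at 33 mW for the cycles the workload needs each second and sits in IDLE1 at 0.36 mW for the rest. The C54x cycle counts are the ones quoted for comparison on the C55x slide; the function name is hypothetical.

```python
def average_power_mw(cycles_per_sec, clock_hz=10_000_000,
                     active_mw=33.0, idle_mw=0.36):
    """Average power when active for the needed cycles and idle otherwise.

    Assumed model: instant switching between active mode and IDLE1.
    """
    duty = cycles_per_sec / clock_hz
    return active_mw * duty + idle_mw * (1 - duty)


centralized_mw = average_power_mw(2_290_440)   # one DSP doing everything
distributed_mw = average_power_mw(381_740)     # one DSP per mote
six_motes_mw = 6 * distributed_mw

print(f"centralized {centralized_mw:.1f} mW, "
      f"distributed {distributed_mw:.1f} mW/mote, "
      f"six motes {six_motes_mw:.1f} mW")
```

Under these assumptions the model lands on about 7.8 mW centralized and 1.6 mW per mote, in line with the slide.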
TI DSP Evaluation Part 2 • TMS320C55x (two parallel MACs) • Same tools, with C55x compiler, simulator • Power: No details available... • Advertised: 0.9 V, 0.05 mW/MHz • Analysis • Centralized: 1170240 cycles (vs. 2290440 on C54x) • 2 MHz: 0.1 mW • Distributed: 195040 cycles (vs. 381740 on C54x) • 1 MHz: 0.05 mW • Six processors: 0.3 mW total
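The C55x estimates follow from the advertised 0.05 mW/MHz once a clock rate is chosen. The sketch below assumes the clock is rounded up to the next whole MHz that covers the required cycles per second and that the processor runs continuously at that rate; both are assumptions, and the function name is hypothetical.

```python
import math

MW_PER_MHZ = 0.05   # advertised figure at 0.9 V


def c55x_estimate(cycles_per_sec):
    """Return (clock in MHz, power in mW) for a given workload."""
    clock_mhz = math.ceil(cycles_per_sec / 1_000_000)
    return clock_mhz, clock_mhz * MW_PER_MHZ


central_mhz, central_mw = c55x_estimate(1_170_240)
mote_mhz, mote_mw = c55x_estimate(195_040)
six_motes_mw = 6 * mote_mw

# Cycle-count ratio between the two DSP generations for the same code.
speedup = 2_290_440 / 1_170_240

print(f"centralized: {central_mhz} MHz, {central_mw:.2f} mW; "
      f"distributed: {mote_mhz} MHz, {mote_mw:.2f} mW/mote, "
      f"{six_motes_mw:.2f} mW total; C54x/C55x cycles = {speedup:.2f}")
```

The cycle counts show the C55x needing roughly half the cycles of the C54x, consistent with its two parallel MACs.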
Other Explorations • Hand optimized code • Possible to massively reduce computation cost • FP/Transcendentals conspicuously painful • Outside scope of our exploration • Radio Hardware • Bluetooth ~ 100 times more efficient • Reconfigurable Computing • Other circuitry (e.g. accelerometers)
Results Summary • Cost, in mW, of various implementations • 17.74 using sleep mode, 28 without • 31/104% improvement with same hardware • 170x improvement with new hardware
Conclusions • By finding better mappings between software, hardware, and application, big performance gains are possible. • Effective use of local processor resources can reduce communication overheads, which are significant. • DSPs and other specialized processors can be a big win and don’t require hand-coded assembly or reconfigurable design.