Sphinx on Handhelds
David Huggins-Daines
dhuggins@cs.cmu.edu
Sphinx on Handhelds?
• Handheld/embedded devices are pretty speedy these days
• LVCSR on them is not unreasonable
• An open-source one does not exist yet
• CALO’s new focus on mobility
• S2S translation projects could use it
• Sublime, smartphone applications, etc.
• ISL has it, so should we!
Handheld challenges
• CPU speed
  • Typically 200-400MHz ARM/XScale
  • Faster than the workstations Sphinx started out on
  • No hardware floating-point instructions
  • ARM has very fast and sophisticated integer ISA
• Memory and storage capacity/speed
  • DRAM is very limited (32 or 64MB)
  • Storage is very slow (typically CF cards)
• Inefficient and clumsy operating systems
  • WinCE has no stdio, broken malloc, 32MB limit
  • PalmOS is much, much worse!
Plan for Sphinx on Handhelds
• Start out with Sphinx2
  • It’s fast
  • People use it already
• Convert “hot spots” to integer math
• Precompute model files
  • Avoid parsing (no stdio, remember)
  • Allow memory-mapped I/O (subvert the 32MB limit on WinCE)
• Disable non-useful features in libraries
  • e.g. flat lexicon search, CDHMM
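The "precompute model files" and "memory-mapped I/O" items can be illustrated with a minimal sketch. It is not the Sphinx2 file format: the header layout, `MODEL_MAGIC`, and the names `model_write`, `model_map` and `model_unmap` are all hypothetical. It also assumes a POSIX system (such as the Linux on the Zaurus), so a WinCE port would have to use that platform's own file-mapping calls. The writer runs on the build host, where stdio is available; the reader never parses or copies anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MODEL_MAGIC 0x4d444c31u  /* hypothetical tag, not a real Sphinx value */

/* Hypothetical layout: fixed header, then n_values raw int32 entries,
 * all in the byte order of the machine that wrote the file. */
typedef struct {
    uint32_t magic;
    uint32_t n_values;
} model_header_t;

typedef struct {
    void          *base;      /* start of the mapping */
    size_t         size;      /* length of the mapping in bytes */
    const int32_t *values;    /* points into the mapping, no copy is made */
    uint32_t       n_values;
} model_map_t;

/* Build-time step: dump parameters that are already converted to integers. */
static int model_write(const char *path, const int32_t *values, uint32_t n)
{
    model_header_t hdr = { MODEL_MAGIC, n };
    FILE *fp = fopen(path, "wb");
    int ok;

    if (fp == NULL)
        return -1;
    ok = fwrite(&hdr, sizeof hdr, 1, fp) == 1
      && fwrite(values, sizeof *values, n, fp) == n;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Run-time step: map the file read-only and point straight into it. */
static int model_map(const char *path, model_map_t *out)
{
    struct stat st;
    const model_header_t *hdr;
    void *base;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size < (off_t)sizeof *hdr) {
        close(fd);
        return -1;
    }
    base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (base == MAP_FAILED)
        return -1;

    hdr = base;
    if (hdr->magic != MODEL_MAGIC    /* also rejects a byte-order mismatch */
        || hdr->n_values > ((size_t)st.st_size - sizeof *hdr) / sizeof(int32_t)) {
        munmap(base, (size_t)st.st_size);
        return -1;
    }
    out->base = base;
    out->size = (size_t)st.st_size;
    out->values = (const int32_t *)(hdr + 1);
    out->n_values = hdr->n_values;
    return 0;
}

static void model_unmap(model_map_t *m)
{
    munmap(m->base, m->size);
    m->base = NULL;
    m->values = NULL;
    m->n_values = 0;
}
```

Because the pages are read-only and file-backed, the operating system can load them on demand and discard them under memory pressure, which is the property that makes mapping attractive when DRAM is scarce.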
Current Status
• Sphinx2 on Sharp Zaurus
  • Linux, 40MB system RAM, 206MHz ARM
  • Performance on RM1: 1.7x realtime
  • No degradation in accuracy
• Integer front-end and GMM code complete
  • Front end also has a “faster” mode
    • 10% faster, 10% degradation in accuracy
• Memory consumption is too high
  • WSJ5k can just barely run
  • Sphinx2 consumes about 16MB of heap space
  • Requires quantized mixture weights (-8bsen)
• Sphinx3.x is much smaller … and slower
Implementation details
• FFT is done with 16:16 fixed point
  • Bits 31:16 are whole part and sign
  • Bits 15:0 are fractional part
  • I.e. all numbers scaled by 65536
  • Lossless multiplication done using 4 integer shift-multiply-accumulates (ARM is really good at this)
• Mel-spectrum calculated in log scale
  • Using base 1.0001 in order to exploit existing add-table implementation
• “Faster” mode uses 28:4 fixed point instead
  • Overflows saturated to INT_MAX
  • Zeroes floored to log(2^-4), the log of the smallest nonzero 28:4 value (very important!)
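The 16:16 multiplication described above can be sketched as follows. This is an illustration of the technique, not the actual Sphinx code: the names `fixmul`, `FLOAT2FIX` and `FIX2FLOAT` are assumptions. Each operand is split into a signed high half and an unsigned low half, and the four partial products are shifted and accumulated so that no 64-bit intermediate is needed.

```c
#include <stdint.h>

/* 16:16 fixed point: bits 31:16 whole part and sign, bits 15:0 fraction. */
typedef int32_t fixed32;

/* Hypothetical conversion helpers: scale by 65536 and truncate. */
#define FLOAT2FIX(x) ((fixed32)((x) * 65536.0))
#define FIX2FLOAT(x) ((double)(x) / 65536.0)

/*
 * Multiply two 16:16 values using only 32-bit arithmetic.
 *
 * With a = ah*2^16 + al and b = bh*2^16 + bl (ah, bh signed; al, bl in
 * 0..65535), the 16:16 product is
 *
 *   (a*b) >> 16 = (ah*bh << 16) + ah*bl + al*bh + (al*bl >> 16)
 *
 * Only the last term discards bits, and those are bits below the 16:16
 * resolution. The sum is done in unsigned arithmetic so that intermediate
 * wraparound is well defined; the final conversion back to int32_t assumes
 * the usual two's-complement behaviour. A product that does not fit in
 * 16:16 wraps around; this sketch does not saturate.
 */
static fixed32 fixmul(fixed32 a, fixed32 b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t al = ua & 0xffffu;
    uint32_t bl = ub & 0xffffu;
    uint32_t ah = ua >> 16;
    uint32_t bh = ub >> 16;

    /* Sign-extend the high halves by hand. */
    if (ah & 0x8000u) ah |= 0xffff0000u;
    if (bh & 0x8000u) bh |= 0xffff0000u;

    return (fixed32)(((ah * bh) << 16) + ah * bl + al * bh + ((al * bl) >> 16));
}
```

For example, `fixmul(FLOAT2FIX(1.5), FLOAT2FIX(2.0))` yields the 16:16 representation of 3.0, which is 196608.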
Implementation details (continued)
• Abstract types for intermediate values
  • mfcc_t, powspec_t, mean_t, var_t
  • #define FIXED_POINT to make them ints
• Arithmetic macros (fixpoint.h)
  • fixed32 type analogous to float32
  • Addition and subtraction work as expected
  • MFCCMUL(), MFCC2FLOAT(), FLOAT2MFCC() macros become no-ops in floating-point build
  • GMMADD(), GMMSUB() do saturating addition and subtraction
  • ARM has special instructions for this too! Wow!
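A minimal sketch of how such a type and macro layer could look is given below. The type and macro names come from the slide, but every macro body here is an assumption, not the contents of the real fixpoint.h. In particular, the 16:16 radix for `mfcc_t`, the helpers `sat_add32` and `sat_sub32`, and the example function `mfcc_dot` are hypothetical. The point is that code written against `mfcc_t` and the macros compiles unchanged in both builds.

```c
#include <stdint.h>

#define FIXED_POINT 1        /* remove this line for the floating-point build */

typedef float   float32;
typedef int32_t fixed32;

/* Portable saturating add/subtract (hypothetical helpers). ARM cores with
 * the DSP extensions provide QADD and QSUB instructions for this. */
static fixed32 sat_add32(fixed32 a, fixed32 b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (fixed32)s;
}

static fixed32 sat_sub32(fixed32 a, fixed32 b)
{
    int64_t d = (int64_t)a - b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (fixed32)d;
}

#ifdef FIXED_POINT
typedef fixed32 mfcc_t;      /* assumed 16:16 for this sketch */
#define FLOAT2MFCC(x) ((mfcc_t)((x) * 65536.0))
#define MFCC2FLOAT(x) ((float32)(x) / 65536.0f)
/* 64-bit intermediate for brevity; assumes arithmetic right shift. */
#define MFCCMUL(a, b) ((mfcc_t)(((int64_t)(a) * (b)) >> 16))
#define GMMADD(a, b)  sat_add32((a), (b))
#define GMMSUB(a, b)  sat_sub32((a), (b))
#else
typedef float32 mfcc_t;
#define FLOAT2MFCC(x) (x)
#define MFCC2FLOAT(x) (x)
#define MFCCMUL(a, b) ((a) * (b))
#define GMMADD(a, b)  ((a) + (b))   /* assumption: plain arithmetic here */
#define GMMSUB(a, b)  ((a) - (b))
#endif

/* Example of build-independent code: only multiplication needs a macro,
 * since ordinary + and - work for both representations. */
static mfcc_t mfcc_dot(const mfcc_t *a, const mfcc_t *b, int n)
{
    mfcc_t acc = FLOAT2MFCC(0.0);
    int i;

    for (i = 0; i < n; ++i)
        acc += MFCCMUL(a[i], b[i]);
    return acc;
}
```

Keeping the representation behind a typedef means the choice between integer and floating-point arithmetic is made once, at compile time, rather than at every call site.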
Future Work
• Rationalize the file formats
• General WinCE porting (Mohit)
• Front-end optimization
  • Implement fixed-point FHT
• Investigate Sphinx 3.x for embedded
  • SubVQ and GS can make it fast and cut memory consumption even more
  • Much nicer architecture
  • But not widely used, API not as stable