I'm assuming that puts an end to my dreams of Wit.ai supporting device-local voice recognition with context-limited vocabularies to drive user interfaces though...
I found pocketsphinx pretty easy to work with. I use it for voice command recognition for home automation stuff. It's even pretty accurate when using limited-vocabulary models, after some tweaking. There are Python bindings, though they lag slightly behind the C API, and even the C API is well-commented and the code is clean.
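To give a rough idea of the shape of it, here's a minimal sketch using the older SWIG-based Python bindings (the API has shifted between versions); the acoustic model, language model, and dictionary paths are placeholders, and it decodes a pre-recorded raw clip rather than a live mic:

    # Minimal pocketsphinx decoding sketch (SWIG-based Python bindings).
    # Paths and file names below are placeholders.
    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')       # acoustic model directory
    config.set_string('-lm', 'commands.lm')        # limited-vocabulary language model
    config.set_string('-dict', 'commands.dict')    # matching pronunciation dictionary
    decoder = Decoder(config)

    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:         # 16 kHz, 16-bit mono PCM
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    if decoder.hyp() is not None:
        print(decoder.hyp().hypstr)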
Do you have a write-up, blog post, or article about pocketsphinx? How do you create/train the vocabulary models? If you use text-to-speech too (CMU Flite, eSpeak), maybe you know some good resources there as well?
I haven't written anything up yet. To summarize, I use the standard acoustic model (hub4wsj_sc_8k) with a combination of keyword activation and a fixed grammar (in JSGF format). It's normally listening for a wakeup keyword, and when it finds one, it switches into grammar mode until it hears a complete utterance or times out, then switches back to keyword mode. It works pretty well, though tuning the keyword sensitivity is annoying.
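The switching logic looks roughly like this with the SWIG-based Python bindings. It assumes a Decoder already configured with the acoustic model and pronunciation dictionary (as in the sketch above, minus the -lm line); the wake phrase, file names, get_audio_chunk() and handle_command() are placeholders, and the timeout handling is left out:

    # Keyword spotting -> JSGF grammar -> back to keyword spotting.
    decoder.set_kws('wakeup', 'keyphrase.list')         # e.g. "hey computer /1e-20/"
    decoder.set_jsgf_file('commands', 'commands.jsgf')  # fixed command grammar
    decoder.set_search('wakeup')                        # start in keyword mode

    in_speech = False
    decoder.start_utt()
    while True:
        decoder.process_raw(get_audio_chunk(), False, False)
        if decoder.get_search() == 'wakeup':
            # Keyword mode: any hypothesis means the wake word was spotted.
            if decoder.hyp() is not None:
                decoder.end_utt()
                decoder.set_search('commands')
                decoder.start_utt()
                in_speech = False
        else:
            # Grammar mode: wait for the utterance to finish, then switch back.
            if decoder.get_in_speech():
                in_speech = True
            elif in_speech:
                decoder.end_utt()
                if decoder.hyp() is not None:
                    handle_command(decoder.hyp().hypstr)
                decoder.set_search('wakeup')
                decoder.start_utt()
                in_speech = False

The sensitivity tuning I mentioned is mostly a matter of adjusting those /1e-XX/ thresholds in the keyword list until false triggers and misses balance out.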
The pocketsphinx-specific code is actually quite simple:
You can see the keyword and jsgf files in that directory, for reference. The pronunciation dictionary is generated from one of the standard dictionaries, selecting just the words present in the grammar.
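That filtering step can be as simple as something like this (file names are placeholders, and it assumes a plain word-per-line list pulled out of the grammar):

    # Keep only the dictionary entries for words that actually appear in the
    # grammar. cmudict variants differ in case, so normalize to lower case.
    with open('grammar_words.txt') as f:
        wanted = {line.strip().lower() for line in f if line.strip()}

    with open('cmudict-en-us.dict') as full, open('commands.dict', 'w') as out:
        for line in full:
            if not line.strip():
                continue
            word = line.split()[0]
            base = word.split('(')[0].lower()   # drop alternate markers like "(2)"
            if base in wanted:
                out.write(line)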
Note: If you look in the whole directory, there's a mix of decent code and ugly hacks in there, and I didn't make any attempt at making it customizable. It's just for me. Btw, it all runs on a raspberry pi.
Basically, you can define vocabulary sets to check against, and you provide a callback to know when a word has been detected. Depending on the complexity of the vocabularies, you might need to spring for the Rejecto plugin to improve the results.