The Pi isn't really fast enough to process the speech in real time. DeepSpeech by Mozilla was cited as an offline alternative to the Google speech API, but it's difficult to set up with Mycroft and doesn't work very well (lack of data and lag - https://mycroft.ai/voice-mycroft-ai/). Because of this, Mozilla set up Common Voice (https://commonvoice.mozilla.org/en) to help build open datasets of voice recordings.
> The Pi isn't really fast enough to process the speech in real time.
If you've got an iPhone... put it into airplane mode so that it is local only. You'll note that Siri no longer works when you do this. However... open up the Notes app and tap the microphone. Dictate some interesting text:
> Mister Smith said that he wanted a two by four and half of a pie.
If you don't have an iDevice: it transcribes this as
> Mr. Smith said he wanted a 2 x 4 and 1/2 of a pie
That is without a network and done in real time. Granted, an iPhone and an RPi differ in processing capability, but offline speech to text is feasible on a device of limited capabilities.
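The interesting part of that transcription is the rewriting of spoken forms into written ones ("two by four" → "2 x 4", "half" → "1/2"), usually called inverse text normalization. A toy rule-based sketch of the idea (real systems use learned models, not a handful of regexes; these rules are purely illustrative):

```python
import re

# Illustrative spoken-number vocabulary for the toy rules below.
NUMBERS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
NUM_ALT = "|".join(NUMBERS)

def normalize(text: str) -> str:
    # "two by four" -> "2 x 4" (spoken dimensions to written form)
    def dim(m: re.Match) -> str:
        return f"{NUMBERS[m.group(1)]} x {NUMBERS[m.group(2)]}"
    text = re.sub(rf"\b({NUM_ALT}) by ({NUM_ALT})\b", dim, text)
    # "half of a" -> "1/2 of a" (spoken fraction to written form)
    text = re.sub(r"\bhalf of a\b", "1/2 of a", text)
    # "Mister" -> "Mr." (title abbreviation)
    return text.replace("Mister", "Mr.")

print(normalize("Mister Smith said that he wanted a two by four and half of a pie"))
# -> Mr. Smith said that he wanted a 2 x 4 and 1/2 of a pie
```

The hard part in practice is ambiguity ("two by four" the lumber vs. "two by four o'clock"), which is why dictation engines treat this as a modelling problem rather than a rule list.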
Yeah, but this is the closed-source Apple implementation of speech to text versus Mozilla's abandoned DeepSpeech. I'm sure it's possible to get it working well on a Pi, but I don't have the time to create and maintain a personalised speech training set and then optimise the resultant models.
Fair 'nuff... though I was after: even with an older-model iPhone and no net connection, speech to text can be done locally (even with some interesting transformations of "two by four").
> In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities.
I didn't know the specifics of it; that has a lot more information and is an interesting read.
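The privacy property of that proxying is simply that the upstream STT provider sees requests from Mycroft's servers carrying only the audio, with nothing tying them to an individual user. A minimal sketch of the idea (the header names and function are hypothetical, not Mycroft's actual implementation):

```python
# Hypothetical sketch: before forwarding an STT request upstream, the
# proxy drops anything that could identify the user, so the provider
# sees only anonymous audio from the proxy's IP.
IDENTIFYING_HEADERS = {
    "authorization", "cookie", "x-forwarded-for",
    "user-agent", "x-device-id",
}

def strip_identity(headers: dict) -> dict:
    """Return a copy of the request headers with identifying ones removed."""
    return {k: v for k, v in headers.items()
            if k.lower() not in IDENTIFYING_HEADERS}

cleaned = strip_identity({
    "Cookie": "session=abc123",
    "X-Device-Id": "mycroft-unit-42",
    "Content-Type": "audio/wav",
})
print(cleaned)  # only Content-Type survives
```

Of course, the user is then trusting Mycroft's proxy instead of Google, which is the trade-off the quoted passage is describing.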
One of the bits in there caught my eye...
> We created a language-specific phonetic specification of the "Hey Siri" phrase. In US English, we had two variants, with different first vowels in "Siri"—one as in "serious" and the other as in "Syria." We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: "Hey, Siri." Each phonetic symbol results in three speech sound classes (beginning, middle and end) each of which has its own output from the acoustic model.
And the British version getting false positives on wake up with world politics.
The specifics of the wake word detection - and that it's done with an ML model rather than a low-power wake word chip akin to https://www.syntiant.com/post/syntiant-low-power-wake-word-s... - are also interesting, and it's impressive that they were able to get it to run at such low power.
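The quoted scheme - each phonetic symbol expanding into beginning/middle/end speech sound classes that the acoustic model scores - can be sketched roughly like this (the phoneme strings and variant names are made up for illustration; they are not Apple's actual phone inventory):

```python
# Rough sketch of expanding a phonetic specification of "Hey Siri" into
# acoustic-model output classes, per the quoted description: two vowel
# variants, and each phoneme yields three classes (begin, middle, end).
# Phoneme symbols here are illustrative placeholders.
VARIANTS = {
    "serious-vowel": ["HH", "EY", "S", "IH", "R", "IY"],
    "syria-vowel":   ["HH", "EY", "S", "IH2", "R", "IY"],
}

def sound_classes(phonemes: list) -> list:
    # Each phonetic symbol contributes begin/mid/end output classes.
    return [f"{p}_{part}" for p in phonemes
            for part in ("begin", "mid", "end")]

for name, phones in VARIANTS.items():
    classes = sound_classes(phones)
    print(name, len(classes))  # 6 phonemes -> 18 output classes
```

This also hints at why the detector tolerates "Hey, Siri" with a pause: the break falls between phoneme groups rather than inside one, so the per-class scoring can ride over it.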
DeepSpeech is very old software. Vosk works just fine: https://github.com/alphacep/vosk-api. People even run the tiny Whisper models on a Pi, though they have to wait ages.