The Pi isn't really fast enough to process the speech in real time. DeepSpeech by Mozilla was cited as an offline alternative to the Google speech API, but it's difficult to set up with Mycroft and doesn't work very well (lack of data and lag - https://mycroft.ai/voice-mycroft-ai/). Because of this, Mozilla set up Common Voice (https://commonvoice.mozilla.org/en) to help build open datasets of voice recordings.
> The Pi isn't really fast enough to process the speech in real time.
If you've got an iPhone... put it into airplane mode so that it is local only. You'll note that Siri no longer works when you do this. However... open up the Notes app and tap the microphone. Dictate some interesting text:
> Mister Smith said that he wanted a two by four and half of a pie.
If you don't have an iDevice: it transcribes this as
> Mr. Smith said he wanted a 2 x 4 and 1/2 of a pie
That is without a network and done in real time. Granted, an iPhone and an RPi differ in processing capability, but offline speech to text is feasible on a device of limited capabilities.
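The interesting part of that transcription is the rewriting of spoken forms into written ones ("two by four" → "2 x 4", "half" → "1/2"), usually called inverse text normalization. A toy rule-based sketch of the idea (real systems use learned models, not a handful of regexes; these rules are purely illustrative):

```python
import re

# Illustrative spoken-number vocabulary for the toy rules below.
NUMBERS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
NUM_ALT = "|".join(NUMBERS)

def normalize(text: str) -> str:
    # "two by four" -> "2 x 4" (spoken dimensions to written form)
    def dim(m: re.Match) -> str:
        return f"{NUMBERS[m.group(1)]} x {NUMBERS[m.group(2)]}"
    text = re.sub(rf"\b({NUM_ALT}) by ({NUM_ALT})\b", dim, text)
    # "half of a" -> "1/2 of a" (spoken fraction to written form)
    text = re.sub(r"\bhalf of a\b", "1/2 of a", text)
    # "Mister" -> "Mr." (title abbreviation)
    return text.replace("Mister", "Mr.")

print(normalize("Mister Smith said that he wanted a two by four and half of a pie"))
# -> Mr. Smith said that he wanted a 2 x 4 and 1/2 of a pie
```

The hard part in practice is ambiguity ("two by four" the lumber vs. "two by four o'clock"), which is why dictation engines treat this as a modelling problem rather than a rule list.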
Yeah, but this is the closed-source Apple implementation of speech to text versus Mozilla's abandoned DeepSpeech. I'm sure it's possible to get it working well on a Pi, but I don't have the time to create and maintain a personalised speech training set and then optimise the resultant models.
Fair 'nuff... though I was after: even with an older-model iPhone and no net connection, speech to text can be done locally (even with some interesting transformations of "two by four").
> In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities.
I didn't know the specifics of it; that has a lot more information and is an interesting read.
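The privacy property of that proxying is simply that the upstream STT provider sees requests from Mycroft's servers carrying only the audio, with nothing tying them to an individual user. A minimal sketch of the idea (the header names and function are hypothetical, not Mycroft's actual implementation):

```python
# Hypothetical sketch: before forwarding an STT request upstream, the
# proxy drops anything that could identify the user, so the provider
# sees only anonymous audio from the proxy's IP.
IDENTIFYING_HEADERS = {
    "authorization", "cookie", "x-forwarded-for",
    "user-agent", "x-device-id",
}

def strip_identity(headers: dict) -> dict:
    """Return a copy of the request headers with identifying ones removed."""
    return {k: v for k, v in headers.items()
            if k.lower() not in IDENTIFYING_HEADERS}

cleaned = strip_identity({
    "Cookie": "session=abc123",
    "X-Device-Id": "mycroft-unit-42",
    "Content-Type": "audio/wav",
})
print(cleaned)  # only Content-Type survives
```

Of course, the user is then trusting Mycroft's proxy instead of Google, which is the trade-off the quoted passage is describing.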
One of the bits in there caught my eye...
> We created a language-specific phonetic specification of the "Hey Siri" phrase. In US English, we had two variants, with different first vowels in "Siri"—one as in "serious" and the other as in "Syria." We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: "Hey, Siri." Each phonetic symbol results in three speech sound classes (beginning, middle and end) each of which has its own output from the acoustic model.
And the British version getting false positives on wake up with world politics.
The specifics of the wake word detection - and that it's done with an ML model rather than a low-power wake word chip akin to https://www.syntiant.com/post/syntiant-low-power-wake-word-s... - are also interesting, and it's impressive that they were able to get it to run at such low power.
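The quoted scheme - each phonetic symbol expanding into beginning/middle/end speech sound classes that the acoustic model scores - can be sketched roughly like this (the phoneme strings and variant names are made up for illustration; they are not Apple's actual phone inventory):

```python
# Rough sketch of expanding a phonetic specification of "Hey Siri" into
# acoustic-model output classes, per the quoted description: two vowel
# variants, and each phoneme yields three classes (begin, middle, end).
# Phoneme symbols here are illustrative placeholders.
VARIANTS = {
    "serious-vowel": ["HH", "EY", "S", "IH", "R", "IY"],
    "syria-vowel":   ["HH", "EY", "S", "IH2", "R", "IY"],
}

def sound_classes(phonemes: list) -> list:
    # Each phonetic symbol contributes begin/mid/end output classes.
    return [f"{p}_{part}" for p in phonemes
            for part in ("begin", "mid", "end")]

for name, phones in VARIANTS.items():
    classes = sound_classes(phones)
    print(name, len(classes))  # 6 phonemes -> 18 output classes
```

This also hints at why the detector tolerates "Hey, Siri" with a pause: the break falls between phoneme groups rather than inside one, so the per-class scoring can ride over it.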
DeepSpeech is very old software. Vosk works just fine: https://github.com/alphacep/vosk-api. People even run the tiny Whisper models on a Pi, though they have to wait ages.