Audio adversarial examples for speech-to-text

I don’t usually go in for detailed technical papers on stuff that’s not directly relevant to what I’m working on, but I made an exception for this. Here’s the abstract:

We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our white-box iterative optimization-based attack to Mozilla’s implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduce a new domain to study adversarial examples.

In other words, the researchers managed to fool a neural network devoted to speech recognition into transcribing a phrase different to that which was uttered.

So how does it work?

By starting with an arbitrary waveform instead of speech (such as music), we can embed speech into audio that should not be recognized as speech; and by choosing silence as the target, we can hide audio from a speech-to-text system.

The authors state that merely perturbing the input so that some different transcription comes out is a standard adversarial attack. A targeted adversarial attack goes further:

Not only are we able to construct adversarial examples converting a person saying one phrase to that of them saying a different phrase, we are also able to begin with arbitrary non-speech audio sample and make that recognize as any target phrase.
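To get a feel for what a "white-box iterative optimization-based attack" on a chosen target means, here is a toy sketch in Python. It is not the authors' code and it does not touch DeepSpeech: a random three-class linear classifier stands in for the speech recognizer, a random vector stands in for one second of audio, and the names `targeted_attack`, `eps`, `lr` and `steps` are my own illustrative choices. The only idea it shares with the paper's description is the loop: read the model's gradient (white-box), nudge the input towards the chosen output, keep the change small, repeat.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def targeted_attack(W, x, target, eps=0.05, lr=0.005, steps=200):
    """Search for a perturbation delta with max |delta| <= eps such that
    the toy model argmax(W @ (x + delta)) outputs `target`.
    Hypothetical helper for illustration only."""
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(steps):
        if np.argmax(W @ (x + delta)) == target:
            break  # the model already outputs the chosen target
        p = softmax(W @ (x + delta))
        # Gradient of the cross-entropy loss (towards `target`) w.r.t. the input.
        grad = W.T @ (p - onehot)
        # Signed gradient step, then clip so the perturbation stays small.
        delta = np.clip(delta - lr * np.sign(grad), -eps, eps)
    return delta

# Toy stand-ins (assumptions): 16000 samples in [-1, 1], 3 output classes.
rng = np.random.default_rng(0)
n = 16000
W = rng.normal(size=(3, n)) / np.sqrt(n)
x = rng.uniform(-1.0, 1.0, size=n)

original = int(np.argmax(W @ x))
target = (original + 1) % 3          # pick any class other than the original
delta = targeted_attack(W, x, target)
adversarial = int(np.argmax(W @ (x + delta)))

print("original:", original, "target:", target, "adversarial:", adversarial)
print("largest change to any sample:", float(np.abs(delta).max()))
```

The clipping step is what keeps the altered input close to the original, while the gradient step is what steers the output towards the phrase (here, the class) the attacker picked. A real attack would swap the toy classifier for the actual network and a loss over whole transcriptions.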

This kind of stuff is possible due to open source projects, in particular Mozilla's DeepSpeech (the model attacked here) and Mozilla Common Voice. Great stuff.

Source: arXiv