Tag: audio

There’s no viagra for enlightenment

This quotation from the enigmatic Russell Brand seemed appropriate for the subject of today’s article: the impact of so-called ‘deepfakes’ on everything from porn to politics.

First, what exactly are ‘deepfakes’? Mark Wilson explains in an article for Fast Company:

In early 2018, [an anonymous Reddit user named Deepfakes] uploaded a machine learning model that could swap one person’s face for another face in any video. Within weeks, low-fi celebrity-swapped porn ran rampant across the web. Reddit soon banned Deepfakes, but the technology had already taken root across the web–and sometimes the quality was more convincing. Everyday people showed that they could do a better job adding Princess Leia’s face to The Force Awakens than the Hollywood special effects studio Industrial Light and Magic did. Deepfakes had suddenly made it possible for anyone to master complex machine learning; you just needed the time to collect enough photographs of a person to train the model. You dragged these images into a folder, and the tool handled the convincing forgery from there.

Mark Wilson

As you’d expect, deepfakes bring up huge ethical issues, as Jessica Lindsay reports for Metro. It’s a classic case of our laws not being able to keep up with what’s technologically possible:

With the advent of deepfake porn, the possibilities have expanded even further, with people who have never starred in adult films looking as though they’re doing sexual acts on camera.

Experts have warned that these videos enable all sorts of bad things to happen, from paedophilia to fabricated revenge porn.


This can be done to make a fake speech to misrepresent a politician’s views, or to create porn videos featuring people who did not star in them.

Jessica Lindsay

It’s not just video, either, with Google’s AI now able to translate speech from one language to another and keep the same voice. Karen Hao embeds examples in an article for MIT Technology Review demonstrating where this is all headed.

The results aren’t perfect, but you can sort of hear how Google’s translator was able to retain the voice and tone of the original speaker. It can do this because it converts audio input directly to audio output without any intermediary steps. In contrast, traditional translational systems convert audio into text, translate the text, and then resynthesize the audio, losing the characteristics of the original voice along the way.

Karen Hao

The impact on democracy could be quite shocking, with the ability to create video and audio that feels real but is actually completely fake.

However, as Mike Caulfield notes, the technology doesn’t even have to be that sophisticated to create something that can be used in a political attack.

There’s a video going around that purportedly shows Nancy Pelosi drunk or unwell, answering a question about Trump in a slow and slurred way. It turns out that it is slowed down, and that the original video shows her quite engaged and articulate.


In musical production there is a technique called double-tracking, and it’s not a perfect metaphor for what’s going on here but it’s instructive. In double tracking you record one part — a vocal or solo — and then you record that part again, with slight variations in timing and tone. Because the two tracks are close, they are perceived as a single track. Because they are different though, the track is “widened” feeling deeper, richer. The trick is for them to be different enough that it widens the track but similar enough that they blend.

Mike Caulfield

This is where blockchain could actually be a useful technology. Caulfield often talks about the importance of ‘going back to the source’: in other words, checking the provenance of what you’re reading, watching, or listening to. There’s potential here for checking that something is actually the original document/video/audio.
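To make the ‘is this the original?’ check a little more concrete, here’s a minimal sketch in Python. It assumes the publisher of a clip has made a SHA-256 digest of the original file available somewhere trustworthy (a blockchain entry would be one option, but that part is out of scope here). The function names and the byte strings standing in for media files are purely illustrative.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_digest: str) -> bool:
    """True if the bytes hash to the digest the original publisher announced."""
    return sha256_hex(data) == published_digest.lower()


# Stand-ins for real media files (assumption: in practice you would read
# the bytes of the downloaded video or audio file instead).
original_clip = b"bytes of the original interview footage"
altered_clip = b"bytes of the original interview footage, slowed down"

# The digest the source would publish alongside the original.
published = sha256_hex(original_clip)

print(matches_published_digest(original_clip, published))
print(matches_published_digest(altered_clip, published))
```

A check like this only tells you whether a file is byte-for-byte identical to the one that was published. Any re-encoding, however innocent, also changes the digest, so it’s a way of confirming an original rather than of detecting a fake.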

Ultimately, however, people believe what they want to believe. If they want to believe Donald Trump is an idiot, they’ll read and share things showing him in a negative light. It doesn’t really matter if it’s true or not.


Noise cancelling for cars is a no-brainer

We’re all familiar with noise cancelling headphones. I’ve got some that I use for transatlantic trips, and they’re great for minimising any repeating background noise.

Twenty years ago, when I was studying A-Level Physics, I was also building a new PC. I realised that, if I placed a microphone inside the computer case, and fed that into the audio input on the soundcard, I could use software to invert the sound wave and thus virtually eliminate fan noise. It worked a treat.
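For the curious, the principle can be sketched in a few lines of Python. This is an idealised illustration rather than what I actually ran back then: the fan is modelled as a pure 120 Hz tone (an assumption), and the sample rate and function names are made up for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)
FAN_HZ = 120.0      # assumed frequency of the fan hum


def fan_noise(n_samples):
    """Simulate what the microphone inside the case picks up."""
    return [
        0.5 * math.sin(2 * math.pi * FAN_HZ * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def anti_noise(mic_samples):
    """Invert the wave: every sample is multiplied by -1."""
    return [-s for s in mic_samples]


def mix(a, b):
    """What the listener hears when two signals play together."""
    return [x + y for x, y in zip(a, b)]


def peak(samples):
    return max(abs(s) for s in samples)


noise = fan_noise(800)
inverted = anti_noise(noise)

# Perfectly aligned: the noise and its inverse cancel completely.
peak_before = peak(noise)
peak_after = peak(mix(noise, inverted))

# One sample late: cancellation is no longer complete.
late = [0.0] + inverted[:-1]
peak_late = peak(mix(noise, late))

print(peak_before, peak_after, peak_late)
```

The last part shows why timing matters: delay the inverted signal by even a single sample and some of the noise comes back.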

It doesn’t surprise me, therefore, to find that Bose, best known for its headphones, is offering car manufacturers something similar with “road noise control”.

With accelerometers, multiple microphones, and algorithms, it’s much more complicated than what I rigged up in my bedroom as a teenager. But the principle remains the same.

Source: The Next Web

Audio Adversarial speech-to-text

I don’t usually go in for detailed technical papers on stuff that’s not directly relevant to what I’m working on, but I made an exception for this. Here’s the abstract:

We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our white-box iterative optimization-based attack to Mozilla’s implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduce a new domain to study adversarial examples.

In other words, the researchers managed to fool a neural network devoted to speech recognition into transcribing a phrase different to that which was uttered.

So how does it work?

By starting with an arbitrary waveform instead of speech (such as music), we can embed speech into audio that should not be recognized as speech; and by choosing silence as the target, we can hide audio from a speech-to-text system

The authors state that merely changing the audio so that it’s transcribed as something different is a standard adversarial attack. But a targeted adversarial attack is different:

Not only are we able to construct adversarial examples converting a person saying one phrase to that of them saying a different phrase, we are also able to begin with arbitrary non-speech audio sample and make that recognize as any target phrase.
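To get a feel for what a ‘white-box iterative’ targeted attack means, here’s a toy sketch in Python. To be clear, this is not the paper’s method and it has nothing to do with DeepSpeech: the ‘recogniser’ is a made-up linear scorer over eight numbers that outputs either “yes” or “no”, and the weights, step size and perturbation limit are all assumptions chosen for illustration. The shared idea is only this: because the attacker can see inside the model, they can repeatedly nudge the input in whichever direction moves the output towards the phrase they want, while keeping each change tiny.

```python
# Hypothetical stand-in for a speech recogniser: a fixed linear scorer.
# A positive score is "transcribed" as "yes", anything else as "no".
WEIGHTS = [0.9, -0.4, 0.3, -0.7, 0.5, -0.2, 0.8, -0.6]


def score(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


def transcribe(x):
    return "yes" if score(x) > 0 else "no"


def targeted_attack(x, target, eps=0.05, step=0.005, max_iters=100):
    """Search for a perturbation, at most eps per sample, that makes the
    toy model output `target`. Returns the altered input, or None."""
    direction = 1.0 if target == "yes" else -1.0
    delta = [0.0] * len(x)
    for _ in range(max_iters):
        adv = [v + d for v, d in zip(x, delta)]
        if transcribe(adv) == target:
            return adv
        # White-box step: for a linear model the gradient of the score
        # with respect to the input is simply WEIGHTS.
        for i, w in enumerate(WEIGHTS):
            sign = 1.0 if w > 0 else -1.0
            nudged = delta[i] + step * direction * sign
            delta[i] = max(-eps, min(eps, nudged))  # keep the change small
    adv = [v + d for v, d in zip(x, delta)]
    return adv if transcribe(adv) == target else None


original = [0.1, 0.2, -0.1, 0.1, 0.0, 0.1, 0.05, 0.1]
adversarial = targeted_attack(original, target="yes")

print(transcribe(original))
print(transcribe(adversarial))
print(max(abs(a - o) for a, o in zip(adversarial, original)))
```

The altered input differs from the original by only a small amount per sample, yet the toy model’s output flips to the chosen target. The real attack works on a far larger model and on actual audio, but the cat-and-mouse logic is similar.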

This kind of stuff is possible due to open source projects, in particular Mozilla Common Voice. Great stuff.

Source: arXiv