How to dub a video with AI


DALE MARKOWITZ: So here on "Making With Machine Learning,"

I record all my videos in English,

because it's the only language that I speak.

But you guys?

You're from all over the world and you speak

a lot of different languages.

Right now, we do translated subtitles,

but I wondered if we could use AI to make translated dubs.


DALE MARKOWITZ: I got this idea from my extremely inspiring

co-worker here at Google named Markku.

A couple of months ago, Markku filmed a YouTube video

showing you how you could create real-time translated subtitles

that would actually be projected on your body,

and even helped me build my own version.

So I thought, OK.

He did speech-to-text, but why not

add AI text-to-speech, like an AI voice actor?

Now, I know a lot of you said in the comments

you hate computer voices.

SPEAKER 1: But I don't mind them at all.

DALE MARKOWITZ: So here's how I did it.

This project is actually kind of straightforward.

First, I'll pull the audio out of my videos

and then I'll use the Speech-to-Text

API to transcribe them.

Next, I'll use the Translation API

to convert the transcripts to any language that I want.

And then finally, I'll use the Text-to-Speech API to speak

those translated transcripts.

OK, so the first step is to enable some Google Cloud APIs.

For this project, we'll need the Speech-to-Text API,

the Text-to-Speech API, and the Translation API.

I decided to do this project in Python,

and you can find all of my code in the Making with Machine

Learning GitHub repo.

First, I use a library called pydub to extract

the audio from the video file and save it as a wav.

Next, I temporarily uploaded that file to the cloud

so that it could be used with the Speech to Text API.
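These first two steps might look something like the sketch below. It's not the code from the repo — the bucket name and file paths are placeholders, and it assumes `pydub` (which needs ffmpeg to read video files) and `google-cloud-storage` are installed:

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that the Speech-to-Text API accepts."""
    return f"gs://{bucket_name}/{blob_name}"


def extract_audio(video_path, wav_path):
    """Pull the audio track out of a video file and save it as a wav."""
    from pydub import AudioSegment  # pip install pydub (needs ffmpeg for video)

    AudioSegment.from_file(video_path).export(wav_path, format="wav")


def upload_to_gcs(bucket_name, wav_path):
    """Temporarily upload the wav so Speech-to-Text can read it from the cloud."""
    from google.cloud import storage  # pip install google-cloud-storage

    blob = storage.Client().bucket(bucket_name).blob(wav_path)
    blob.upload_from_filename(wav_path)
    return gcs_uri(bucket_name, wav_path)
```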

Here's where I actually call the Google Cloud Speech-to-Text API.


There are a couple of different ways

that you can configure this API.

First, you specify the language_code,

which is what language you're transcribing from.

enable_automatic_punctuation tells

the API to look out for punctuation,

like question marks or periods.

enable_word_time_offsets gives you

each word and the exact time the speaker

said it. diarization_config tells the API to look out

for multiple speakers.

And you also have the ability to use enhanced, more accurate

models for certain languages.
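As a sketch, those options can be passed to the Python client (`google-cloud-speech`) as a plain dict — all the field names below are real config fields, but the language, speaker count, and timeout are placeholder choices for illustration:

```python
# Recognition config covering the options described above.
stt_config = {
    "language_code": "en-US",              # the language you're transcribing FROM
    "enable_automatic_punctuation": True,  # look out for question marks, periods, etc.
    "enable_word_time_offsets": True,      # per-word start/end timestamps
    "diarization_config": {                # look out for multiple speakers
        "enable_speaker_diarization": True,
        "max_speaker_count": 2,
    },
    "use_enhanced": True,                  # enhanced models, where available
    "model": "video",                      # enhanced model tuned for video audio
}


def transcribe(gcs_audio_uri):
    """Run a long-running transcription job on audio already uploaded to GCS."""
    from google.cloud import speech  # pip install google-cloud-speech

    operation = speech.SpeechClient().long_running_recognize(
        config=stt_config, audio={"uri": gcs_audio_uri}
    )
    return operation.result(timeout=600)
```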

By the way, here's an example of what

the data that actually comes out of that API looks like.

Next, we have to chunk the transcript output

so that we can feed it to the Translation API.

At first, I thought I could do this

by splitting words into sentences based on punctuation

marks, and then translating each individual sentence.

But then, like a dummy, I realized that that only

works for Romance languages.

And in Japanese, you can't just split sentences by period.

So instead, I had to check for gaps

in when the speaker started and stopped speaking.

And then those were the chunks that I

passed to the Translation API.
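That gap-based chunking boils down to a small pure function over the word timestamps from enable_word_time_offsets. The one-second pause threshold below is my guess, not necessarily the value used in the video:

```python
def chunk_by_pauses(words, max_gap=1.0):
    """Split a word-level transcript into chunks wherever the speaker
    pauses longer than max_gap seconds.

    `words` is a list of (word, start_sec, end_sec) tuples, as produced
    by the Speech-to-Text API when enable_word_time_offsets is on.
    """
    chunks, current, prev_end = [], [], None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > max_gap:
            chunks.append(" ".join(current))  # pause found: close the chunk
            current = []
        current.append(word)
        prev_end = end
    if current:
        chunks.append(" ".join(current))
    return chunks


words = [("So", 0.0, 0.2), ("here's", 0.3, 0.6), ("how", 0.6, 0.8),
         ("First", 2.5, 2.9), ("I'll", 3.0, 3.2)]
chunk_by_pauses(words)  # → ["So here's how", "First I'll"]
```

Because this splits on silence rather than punctuation, it works the same way for Japanese or Finnish as it does for English.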

Actually calling the Translation API only takes

a couple of lines of code.
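Roughly those couple of lines, sketched with the `translate_v2` client (the target language here is an arbitrary example; the API also accepts a whole list of chunks in one batched call):

```python
def extract_translations(api_results):
    """Pull just the translated strings out of the API's response dicts."""
    return [r["translatedText"] for r in api_results]


def translate_chunks(chunks, target_language="ja"):
    """Translate a list of transcript chunks in one batched call."""
    from google.cloud import translate_v2 as translate  # pip install google-cloud-translate

    return extract_translations(
        translate.Client().translate(chunks, target_language=target_language)
    )
```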

Here's what the results of the Translation API

look like on my transcripts.

Finally, I called the Text-to-Speech API

to speak the translated words.

You can configure the API to use different computer voices,

and you can control the speaking rate, which

I used to make sure that the computer voice sped

up or slowed down to match the rate

of the speaker in the video.
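A sketch of that step, assuming `google-cloud-texttospeech` is installed. The rate-matching helper is my own reconstruction of the idea: if the synthesized audio would naturally take longer than the original speaker did, speed it up, and vice versa (the API accepts speaking rates between 0.25 and 4.0):

```python
def fit_speaking_rate(natural_secs, target_secs):
    """Pick a speaking_rate so speech that naturally lasts natural_secs
    fits the target_secs the original speaker took, clamped to the
    API's supported range of 0.25-4.0."""
    return max(0.25, min(4.0, natural_secs / target_secs))


def speak(text, language_code="ja-JP", speaking_rate=1.0):
    """Synthesize one translated chunk; returns raw MP3 bytes."""
    from google.cloud import texttospeech  # pip install google-cloud-texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3,
            speaking_rate=speaking_rate,  # sped up or slowed down to match the speaker
        ),
    )
    return response.audio_content
```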

SPEAKER 1: Sometimes, I talk very fast.

And sometimes, I talk more slow.

DALE MARKOWITZ: OK, so it's about midnight right now,

but I just got all of the individual pieces of the dubber working.


So I'm going to try it out for the first time.


SPEAKER 2: Welcome to the no time testing center shed.

We are here collecting learning and various sports

using this picture scooter.

So send the helicopter into the air and start collecting data.

DALE MARKOWITZ: There are actually two places

where errors could come from.

First of all, the Speech-to-Text API could make a mistake.

It could mishear me and mistranscribe me.

But also, translation can be inaccurate.

So I'm going to show you two approaches

to upping the accuracy.

First, we can up the quality of Speech-to-Text by using

a feature called phraseHints.

Here, you can specify words or phrases

that you think are more likely to appear in the video,

and then the API will be more likely to recognize them,

especially if they're uncommon words or proper nouns.
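In the Python client's config, phraseHints live in the speech_contexts field. The phrases below are placeholders — the kind of uncommon words and proper nouns from this project that the API might otherwise miss:

```python
# Recognition config with phrase hints added via speech_contexts.
stt_config_with_hints = {
    "language_code": "en-US",
    "enable_automatic_punctuation": True,
    "speech_contexts": [
        # Words or phrases more likely to appear in the video (placeholders).
        {"phrases": ["Markku", "pydub", "diarization"]}
    ],
}
```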

For translation, we can use something

called a glossary, which is just a CSV that

specifies the way we want certain words or phrases to be translated.


For example, in Markku's video, the translations

for drone and tarp were a little off.

So even though Markku said tarp in Finnish,

the API was translating it to shed.

Using a glossary, we can make spot fixes like this.
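The glossary file really is just source-term, target-term rows. The entries below are illustrative guesses at the Finnish words, not taken from Markku's actual glossary; note that glossaries require the v3 (Advanced) Translation API, where the CSV is uploaded to Cloud Storage, registered once with TranslationServiceClient.create_glossary, and then referenced via a glossary_config on each translate_text call:

```python
import csv
import io

# A unidirectional glossary: one source_term,target_term pair per row.
# These Finnish entries are placeholders for illustration.
GLOSSARY_CSV = """\
drooni,drone
pressu,tarp
"""

pairs = list(csv.reader(io.StringIO(GLOSSARY_CSV)))
print(pairs)  # → [['drooni', 'drone'], ['pressu', 'tarp']]
```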

Putting these two tools together,

I was able to make the accuracy of the dubs a lot better.

They're not perfect, but here's what they sound like.

SPEAKER 2: Welcome to the AI coach testing center tarp.

We are here to collect learning data on various sports using

this image scooter.

So send the drone into the air and start collecting data.

DALE MARKOWITZ: Let's listen to a few other examples.


SPEAKER 2: I have to make a New Year's resolution

to eat less treats and play more sports.


SPEAKER 4: There have been companies,

non-profit organizations, and educational institutes.

DALE MARKOWITZ: So I always thought

it would make sense to take something simpler,

like software engineering.


DALE MARKOWITZ: What do you think?

Would you watch one of these AI-dubbed videos?

Let me know in the comments below.

And I'll see you next time.
