DALE MARKOWITZ: So here on "Making With Machine Learning,"
I record all my videos in English,
because it's the only language that I speak.
But you guys?
You're from all over the world and you speak
a lot of different languages.
Right now, we translate them as subs,
but I wondered if we could use AI to translate them as dubs.
DALE MARKOWITZ: I got this idea from my extremely inspiring
co-worker here at Google named Markku.
A couple of months ago, Markku filmed a YouTube video
showing you how you could create real-time translated subtitles
that would actually be projected on your body,
and even helped me build my own version.
So I thought, OK.
He did speech to text, but why not
add AI text-to-speech like an AI voice actor?
Now, I know a lot of you said in the comments
you hate computer voices.
SPEAKER 1: But I don't mind them at all.
DALE MARKOWITZ: So here's how I did it.
This project is actually kind of straightforward.
First, I'll pull the audio out of my videos
and then I'll use the Speech-to-Text
API to transcribe them.
Next, I'll use the Translation API
to convert the transcripts to any language that I want.
And then finally, I'll use the Text-to-Speech API to speak
those translated transcripts.
OK, so the first step is to enable some Google Cloud APIs.
For this project, we'll need the Speech-to-Text API,
the Text-to-Speech API, and the Translation API.
I decided to do this project in Python,
and you can find all of my code in the Making with Machine
Learning GitHub repo.
First, I use a library called pydub to extract
the audio from the video file and save it as a wav.
Next, I temporarily uploaded that file to the cloud
so that it could be used with the Speech to Text API.
Here's where I actually call the Google Cloud Speech-to-Text API.
There are a couple of different ways
that you can configure this API.
First, you specify the language_code,
which is what language you're transcribing from.
Setting enable_automatic_punctuation tells
the API to look out for punctuation,
like question marks or periods.
enable_word_time_offsets gives you
both the word and the exact time the speaker
said them. diarization_config tells the API to look out
for multiple speakers.
And you also have the ability to use enhanced, more accurate
models for certain languages.
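Putting those options together, the recognition request might be configured like this. The bucket path is a placeholder, and the model choice is an example, not necessarily the exact values from my code:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",                 # language you're transcribing from
    enable_automatic_punctuation=True,     # look out for ? . , etc.
    enable_word_time_offsets=True,         # per-word start/end timestamps
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True),  # look out for multiple speakers
    use_enhanced=True,                     # enhanced model, where available
    model="video",
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.wav")

# Longer audio goes through the asynchronous long-running method.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result()
```

Running this needs a Google Cloud project with credentials configured.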
By the way, here's an example of what
the data that actually comes out of that API looks like.
Next, we have to chunk the transcript output
so that we can feed it to the Translation API.
At first, I thought I could do this
by splitting words into sentences based on punctuation
mark, and then translating each individual sentence.
But then, like a dummy, I realized that that only
works for Romance languages.
And in Japanese, you can't just split sentences by period.
So instead, I had to check for gaps
in when the speaker started and stopped speaking.
And then those were the chunks that I
passed to the Translation API.
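That gap-based chunking is plain Python. Here's a minimal sketch; the threshold value and the (word, start, end) tuple shape are simplified assumptions about the timing data the Speech-to-Text API returns:

```python
GAP_THRESHOLD = 1.0  # seconds of silence that starts a new chunk (assumed)

def chunk_by_gaps(words, gap=GAP_THRESHOLD):
    """Group (word, start_sec, end_sec) tuples into chunks, splitting
    wherever the speaker paused longer than `gap` seconds."""
    chunks = []
    current = []
    last_end = None
    for word, start, end in words:
        if last_end is not None and start - last_end > gap:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
        last_end = end
    if current:
        chunks.append(" ".join(current))
    return chunks

words = [("konnichiwa", 0.0, 0.8), ("minna", 0.9, 1.3),
         ("hajimemashou", 3.0, 3.9)]
print(chunk_by_gaps(words))  # → ['konnichiwa minna', 'hajimemashou']
```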
Actually calling the Translation API only takes
a couple of lines of code.
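With the v2 client library, it really is just a couple of lines; the target language here is an example:

```python
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("Send the drone into the air.",
                          target_language="fi")
print(result["translatedText"])
```

As with the other Cloud APIs, this needs project credentials to run.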
Here's what the results of the Translation API
look like on my transcripts.
Finally, I called the Text-to-Speech API
to speak the translated words.
You can configure the API to use different computer voices,
and you can control the speaking rate, which
I used to make sure that the computer voice sped
up or slowed down to match the rate
of the speaker in the video.
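A sketch of that call follows. The voice selection and the speaking_rate value are illustrative; in the real pipeline the rate would be computed from the original clip's timing rather than hard-coded:

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Tervetuloa!"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="fi-FI",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.2),  # speed up to match the original speaker
)

with open("dub.mp3", "wb") as f:
    f.write(response.audio_content)
```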
SPEAKER 1: Sometimes, I talk very fast.
And sometimes, I talk more slow.
DALE MARKOWITZ: OK, so it's about midnight right now,
but I just got all of the individual pieces of the dubber working.
So I'm going to try it out for the first time.
MARKKU: [SPEAKING FINNISH]
SPEAKER 2: Welcome to the no time testing center shed.
We are here collecting learning and various sports
using this picture scooter.
So send the helicopter into the air and start collecting data.
DALE MARKOWITZ: There are actually two places
where errors could come from.
First of all, the Speech-to-Text API could make a mistake.
It could mishear me and mistranscribe me.
But also, translation can be inaccurate.
So I'm going to show you two approaches
to upping the accuracy.
First, we can up the quality of Speech-to-Text by using
a feature called phraseHints.
Here, you can specify words or phrases
that you think are more likely to appear in the video,
and then the API will be more likely to recognize them,
especially if they're uncommon words or proper nouns.
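In the Python client, those phrase hints go into the speech_contexts field of the recognition config; the phrases below are just examples:

```python
from google.cloud import speech

config = speech.RecognitionConfig(
    language_code="fi-FI",
    enable_word_time_offsets=True,
    # Words and proper nouns likely to appear in this video.
    speech_contexts=[speech.SpeechContext(
        phrases=["drone", "tarp", "Markku"])],
)
```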
For translation, we can use something
called a glossary, which is just a CSV that
specifies the way we want certain words or phrases
to be translated.
For example, in Markku's video, the translations
for drone and tarp were a little off.
So even though Markku said tarp in Finnish,
the API was translating it to shed.
Using a glossary, we can make spot fixes like this.
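Here's a sketch of registering such a glossary with the v3 Translation API. The project ID, bucket path, and the Finnish terms in the CSV are placeholders I made up for illustration:

```python
# glossary.csv, uploaded to Cloud Storage, pairs source and target terms:
#   fi,en
#   drooni,drone
#   pressu,tarp
from google.cloud import translate_v3 as translate

client = translate.TranslationServiceClient()
parent = "projects/my-project/locations/us-central1"

glossary = translate.Glossary(
    name=client.glossary_path("my-project", "us-central1", "my-glossary"),
    language_codes_set=translate.Glossary.LanguageCodesSet(
        language_codes=["fi", "en"]),
    input_config=translate.GlossaryInputConfig(
        gcs_source=translate.GcsSource(
            input_uri="gs://my-bucket/glossary.csv")),
)

# Glossary creation is a long-running operation.
operation = client.create_glossary(parent=parent, glossary=glossary)
operation.result()
```

Once created, the glossary is passed along with each translate_text request so those spot fixes apply.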
Putting these two tools together,
I was able to make the accuracy of the dubs a lot better.
They're not perfect, but here's what they sound like.
SPEAKER 2: Welcome to the AI coach testing center tarp.
We are here to collect learning data on various sports using
this image scooter.
So send the drone into the air and start collecting data.
DALE MARKOWITZ: Let's listen to a few other examples.
MARKKU: [SPEAKING FINNISH]
SPEAKER 2: I have to make a New Year's resolution
to eat less treats and play more sports.
SPEAKER 3: [SPEAKING SPANISH]
SPEAKER 4: There have been companies,
non-profit organizations, and educational institutes.
DALE MARKOWITZ: So I always thought
it would make sense to take something simpler,
like software engineering.
SPEAKER 5: [SPEAKING RUSSIAN]
DALE MARKOWITZ: What do you think?
Would you watch one of these AI-dubbed videos?
Let me know in the comments below.
And I'll see you next time.