Lecture - 23 Video Coding : Basic Building Blocks

This lecture is on video coding, and in this lecture we are going to discuss

the basic building blocks and how to exploit the temporal redundancy.

So far we have seen the image compression standards. Now a still image, as you have already

seen, does not have temporal information; it has got only the spatial information.

We have to send one frame of image once and for all, and that is it. That frame is encoded

by the encoder at the transmitter end, and at the receiver end it is decoded so that

the receiver can reconstruct the image. In the case of video, however, we have a continuous

stream of images which will be available from a video camera or some stored video medium.

Now what are the different applications? The applications of video communications

can be broadly categorized into two major areas. One is what is called conversational

video, and under this category we have video telephony and video conferencing.

Video telephony and video conferencing could be through wired or wireless media.

The other major application domain is streaming video. This video which you are watching

through the net, which you are able to download into your own terminal and see

as video on demand, is a streaming video application.

Therefore, video on demand is one of the major applications of streaming video, where some

recorded video is already there in a digital medium, say in a computer,

and you can just get that video stream and play it

on your system. That is what you are having as a streaming video. Streaming video means

that as the video stream is being sent you are also simultaneously watching; you need

not wait until the complete video is downloaded. Even if a part of it has been

sent from the encoder to the receiver, once you start receiving, once you are in a position

to play some frames, the playing can commence while the subsequent video stream would be

sent from the encoder to the decoder; that is the streaming application. So one has got

video on demand, HDTV applications, image or video database services etc.; those are the

streaming video applications. So these are the two major areas, conversational and streaming

video, which will be addressed in video coding.

Now, what is the basic difference that we are going to have between the still image

coding or compression and video coding? The first and foremost difference is that,

in the case of video we are having the temporal aspect of the signal. The signal is not only

varying in space but the signal is also varying in time. And there is a great deal of similarity

between the signal that varies in time. Because what is being done is that the video is captured

at a particular frame rate. Let us say that the frame rate of the video is 30 frames per

second. 30 frames per second means that within one second we are capturing 30 different frames;

we are sampling the frames about 33 milliseconds apart. If

we compare successive frames then we will find that between the successive

frames there is a great deal of similarity, because the physical movement that can take

place within a time span of about 33 milliseconds is highly limited. I mean, even this video

what you are watching here how much of movement can you see from one frame to the next?

If you are really analyzing you will be finding that the background is stationary, it is only

that when a person is talking like here I am talking you will be finding that there

is a lip movement, there is eye movement, sometimes shoulder movements are there so

a limited degree of movement you will be observing so that there is a great deal of temporal

redundancy that is present and this temporal redundancy also can be exploited in addition

to the spatial redundancy.
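
As a rough illustration of this temporal redundancy, the following minimal Python sketch counts how many pixels change between two successive frames. The 8 by 8 toy frames, the intensity values and the helper names (make_frame, changed_fraction) are assumptions made up for the example; they are not taken from the lecture.

```python
def make_frame(obj_col, size=8):
    """Build a toy frame: flat background (10) with a 2x2 bright object (200)."""
    frame = [[10] * size for _ in range(size)]
    for r in (3, 4):
        for c in (obj_col, obj_col + 1):
            frame[r][c] = 200
    return frame


def changed_fraction(prev_frame, curr_frame):
    """Return the fraction of pixels whose intensity differs between two frames."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if p != c:
                changed += 1
    return changed / total


# The object shifts one pixel to the right between frame k-1 and frame k.
frame_k_minus_1 = make_frame(2)
frame_k = make_frame(3)
fraction = changed_fraction(frame_k_minus_1, frame_k)
print(f"{fraction:.4f} of the pixels changed")
```

In this toy case only the leading and trailing edges of the moving object change, so the large majority of the frame is identical to the previous one.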

In the case of a still image we could see that only the spatial redundancy could be exploited.

And how did we exploit that? We had transformed the still image from its original spatial

domain into a transformed domain using, for example, the discrete cosine transform or the discrete

wavelet transform, and then we worked with those transformed coefficients, encoded them

and sent the bit-stream. We had essentially used the transform domain approach,

although there are alternatives to the transform domain approach also. Just as in the case

of speech we could use the differential pulse code modulation scheme, adaptive differential

pulse code modulation schemes etc., images also can very well make use of that;

DPCM could be used for images also.

What is DPCM or an ADPCM whatever you can talk of? It is basically a predictive coding

technique. But that prediction has to be a spatial prediction. In the case of still image

one has to predict a pixel by observing the values of the past transmitted pixels. In

the case of speech what is the mechanism of prediction? We had temporal prediction that

the time domain samples which are arriving based on the past time domain samples we are

predicting what the next sample is going to be, so it is a temporal prediction whereas

in still image it has to be a spatial prediction. But now in the case of video we are having

both variations with respect to space and variation with respect to time.

As far as the spatial redundancy is concerned it is a proven fact that transform domain

technique can exploit the spatial redundancies much better as compared to predictive encoding

for spatial domain. But again looking at what we had learnt in the speech, we had seen that

for time domain redundancy exploitation predictive coding is also a very good and effective technique.

So why not have a combination of these two techniques? As far as the removal of the spatial

redundancy is concerned, adopt a transform domain technique,

whereas for the removal of temporal redundancy adopt a predictive

coding technique. So what essentially evolves is a hybrid coding technique. It leads

to a hybrid video codec, where the first and foremost thing that is

done is that the temporal redundancy is exploited by predictive coding.

Now how would you perform predictive coding? If you are accepting a video sequence then

successive frames are coming in, and if you have

a memory to store the past frame then you can make use of the past frame to predict

what the present frame is going to be. Because, as I had told you, between successive

frames there may not be much change in the information, so the past frame can very

well predict what the next frame is going to be. But the prediction

is definitely not going to be exact, because there will be some changes. Even though there

are parts where there will be no change, like the stationary background especially,

there will be areas where there will be changes. So if you are predicting only based

on the past frame you are going to have some prediction error. That prediction error would

have some kind of spatial similarity. That means,

between successive pixels in space we will be finding that there is not much change

in the prediction error signal, and that can be very well exploited by a transform domain

technique.


Hence, I can put it like this: the spatial redundancy of the temporal prediction

error signal is exploited by transform

domain coding. So, in the video encoding process that we are going to utilize we are going

to have a combination of predictive coding for temporal redundancy and transform domain

coding for residual spatial redundancy. We can call it a residue

because it is the temporal prediction error that we are encoding through

the transform domain. So it is a combination of predictive and transform domain coding

and that is why this is called hybrid video coding.

Therefore, in a typical hybrid video codec what we are going to have is like this:

here we are having the input video, and here we are going to have a block where the predicted video

will be subtracted from the input video.

Now how that predicted video is generated, that part we will be seeing. But at least

one thing is very clear: if the input is with a plus sign and the prediction is with a minus sign,

then what we get here is the prediction error,

or the residual signal, and the residual signal has to be transform coded. So there will be

a transform block; that transform block could be DCT or DWT. Let us say DCT, because

DCT is the transform that has been used in all the popular video coding standards

that have evolved of late.

I will be mentioning the video coding standards very shortly. But DCT is just a

specific case of the transform that one can use for the residual signal, and this DCT has

to be followed up by a quantizer. These blocks, remember, we also had in the JPEG encoding

scheme. There it was an original image which was DCT encoded and then quantized; in this

case it is the residual video signal. So, the quantized signal could then be coded through

an entropy coding technique, just like the way we did for the still image, so this is

an entropy encoder and the entropy encoder will generate the final bit-stream.

Now this is the forward path of the encoder, which looks very clear. That means to say

that what this bit-stream is encoding is not the signal itself but the residual

signal, the prediction error signal. So somehow the information about

how to predict should also be conveyed from the encoder to the decoder,

so that the decoder can in turn regenerate what the original video

was. Of course exact reconstruction will not be possible because we have a quantizer present

here; the quantizer is lossy, so exact reconstruction will not be there. But you can start thinking

about what you will be finding in the decoder, because the decoder will have a reversal of this

process.


So what will a decoder block diagram look like?

The bit-stream will be received; the bit-stream will go through

an entropy decoder. Then, since there is a quantizer in the encoder, we have to have a de-quantizer. What is the

de-quantizer? In the process of quantization what are we doing? We are dividing the coefficient

value by some quantization matrix element. In this case these already quantized values

have to be multiplied by the quantization matrix element, so there will be a de-quantizer.
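
The divide-then-multiply relationship between the quantizer and the de-quantizer can be sketched in Python as below. This is only a sketch under stated assumptions: the 2 by 2 coefficient block and the quantization matrix values are invented for illustration, and rounding to the nearest integer after the division is assumed as the quantization rule.

```python
def quantize(coeffs, qmatrix):
    """Divide each coefficient by its quantization matrix element and round."""
    return [[round(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qmatrix)]


def dequantize(levels, qmatrix):
    """Multiply each quantized level by the same quantization matrix element."""
    return [[lv * q for lv, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, qmatrix)]


# Hypothetical transform coefficients and quantization matrix (illustrative only).
coeffs = [[100.0, -37.0],
          [12.0, 3.0]]
qmatrix = [[16, 11],
           [12, 14]]

levels = quantize(coeffs, qmatrix)          # what the encoder sends
reconstructed = dequantize(levels, qmatrix)  # what the decoder recovers
print(levels)
print(reconstructed)
```

The reconstructed coefficients are close to, but in general not equal to, the original ones; that difference is the loss introduced by the quantizer.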

And here we had the DCT or the direct transform in this case we are going to have an inverse

transform so we are going to have the inverse transform or in this specific case IDCT inverse

DCT we have to do.

But here what do we get? Here we are going to get only the error signal.

But the error signal is not the one that we want to show on the display.

In the display we want to show the proper video, not the residual signal. But how did

we obtain the error signal? It is input minus predicted; that is the error. Thus, if we add the

predicted to the error what do we get? We get the input; not the exact input, because of

the quantizer. So let us call this input signal s (n 1, n 2, k). So long I was indicating the pixel

intensity at the position (n 1, n 2) as s (n 1, n 2); now

I am denoting the intensity as s (n 1, n 2, k), so there are three variables that

have come in, and what is the third variable? The third variable is the time.

So it is a frame number, so this is also an integer. So (n 1, n 2, k), all of

these are integers, so that essentially I can say that (n 1, n 2, k) belongs

to a three dimensional integer space lambda cube. So what it means to say is that this

is the input video signal, and the predicted video signal we can denote as

s cap (n 1, n 2, k), so that the error we will be denoting as e (n 1, n 2, k).

Now here what we are going to get; at the IDCT output we want to get e (n 1, n 2, k)

but we will not get exactly e (n 1, n 2, k) we will get e dot (n 1, n 2, k) why; because

of this quantizer at the encoder. So we have got e dot (n 1, n 2, k) and if we have a block

where we add or sum up e dot (n 1, n 2, k) with the predicted video that is s cap (n

1, n 2, k) what are we going to get; we are going to get the reconstructed video or we

will write it as s dot (n 1, n 2, k).
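
Written out as equations, with s cap typeset as \hat{s} and the dotted quantities as \dot{e} and \dot{s}, the relations described so far are the first two lines below. The third line is a short consequence added here for clarity: it follows by substituting the first line into the second, and it shows that the reconstruction error equals the quantization error of the residual.

```latex
\begin{align}
e(n_1, n_2, k) &= s(n_1, n_2, k) - \hat{s}(n_1, n_2, k) \\
\dot{s}(n_1, n_2, k) &= \hat{s}(n_1, n_2, k) + \dot{e}(n_1, n_2, k) \\
\dot{s}(n_1, n_2, k) - s(n_1, n_2, k) &= \dot{e}(n_1, n_2, k) - e(n_1, n_2, k)
\end{align}
```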

So this will be what we are going to expect at the decoder and this is what we are expecting

at the encoder but I have not completed my job yet.

What I have shown here is only a part of the encoder and a part of the decoder only looking

at the forward path. But I have left this issue open that how to generate the predicted

video. The only fact that I made a brief mention is that the predicted video should be generated

from the past video frames. So I must be having the signal corresponding to s (n 1, n 2, k

minus 1) or in general k minus L where L is a number which is greater than zero. So any

frame which is s (n 1, n 2, k minus L) with L greater than 0 basically refers to the past

frames and if I get some of the past frames then I should be able to predict this one.

So how to generate the past frame. Again here I have got only the error signal and that

too at this point it is transformed already.

Now you can well argue that, well, generation of the predicted video is not a problem. You

take the input video, the input video you take directly, you store it into some memory,

you get the past video frame and you compare the past video frame with the input video

find out that how much of displacements every block of the image has undergone and then

you can generate the predicted video from here.

Why do you have to derive the signal from here; because after all if you have to derive

you have to derive from this point from the quantizer. Well, encoder can do that. But

can a decoder do that? No, because the decoder does not have this input video information

anymore. What the decoder is going to have is only this encoded information after the

entropy so by a process of entropy decoding because entropy encoding is a lossless process

at the end of entropy decoding the decoder gets back exactly what you have over here

the decoder will get back at its end.

So the philosophy of generating the predicted video should be that

the encoder predicts in the same manner as the decoder is going to predict,

so that whatever error the encoder has got, those error residues can be encoded and sent into

the bit-stream, and the decoder can generate the prediction in a very similar manner. The

decoder has the quantized signal available, but the decoder does not have

the input video signal, and for this reason we have to tap and derive the stored video from

the quantizer output.

So how do we do that? Here we have to do a mimicking of the decoder

part. What did we do at the decoder; we had put forward an entropy decoder, de-quantizer,

IDCT and generated e dot. In this case what we have to do is we do not have to use entropy

decoder because this is before the entropy encoding where we are tapping. So what should

be the block that we are going to have over here; de-quantizer. So here we keep a de-quantizer,

the next block should be an IDCT and output of IDCT would give me e dot (n 1, n 2, k).

Now if I have a predicted video (I still do not know how I have generated the predicted video),

then what I should do is add to this e dot the predicted video,

that is s cap (n 1, n 2, k), and what do we obtain? We obtain s dot (n

1, n 2, k), exactly what the decoder also is able to obtain, provided there is a good mechanism

of getting the predicted video. So here we will be getting s dot (n 1, n 2, k) and now

we can have a frame store. This is frame storage. So at the output of the frame store what we

get we get s dot (n 1, n 2, k minus 1). If it is storage of only one frame then we have

s dot (n 1, n 2, k minus 1) available at the output of the frame store.
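
The reason for tapping after the quantizer can be checked with a small Python sketch of the loop. Several simplifications here are assumptions and not part of the lecture: frames are short 1-D lists, the transform and entropy coding stages are left out, the quantizer is a uniform scalar quantizer with a hypothetical step size STEP, the frame store starts at zero, and the prediction is simply the stored frame (zero motion).

```python
STEP = 8  # hypothetical quantizer step size


def quantize(residual):
    return [round(e / STEP) for e in residual]


def dequantize(levels):
    return [lv * STEP for lv in levels]


def encode_sequence(frames):
    """Encoder loop: predict from the reconstructed past frame, not the original."""
    frame_store = [0] * len(frames[0])        # holds s_dot(k-1)
    stream, encoder_recon = [], []
    for s in frames:
        s_cap = frame_store                   # zero-motion prediction (assumption)
        e = [a - b for a, b in zip(s, s_cap)]
        levels = quantize(e)                  # this is what goes to the bit-stream
        e_dot = dequantize(levels)            # mimic the decoder inside the encoder
        s_dot = [p + d for p, d in zip(s_cap, e_dot)]
        frame_store = s_dot
        stream.append(levels)
        encoder_recon.append(s_dot)
    return stream, encoder_recon


def decode_sequence(stream, width):
    """Decoder loop: same de-quantizer, adder and frame store as in the encoder."""
    frame_store = [0] * width
    decoded = []
    for levels in stream:
        e_dot = dequantize(levels)
        s_dot = [p + d for p, d in zip(frame_store, e_dot)]
        frame_store = s_dot
        decoded.append(s_dot)
    return decoded


frames = [[10, 50, 200, 37], [12, 49, 190, 40], [15, 52, 180, 41]]
stream, encoder_recon = encode_sequence(frames)
decoded = decode_sequence(stream, width=4)
print(decoded == encoder_recon)
```

Because both sides build their frame store from the same de-quantized residuals, the encoder's stored frames and the decoder's output stay identical, and the quantization error does not accumulate from frame to frame.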

Now you have s (n 1, n 2, k) the incoming video over here and then you are having the

s dot (n 1, n 2, k minus 1) the past frame; you compare the present frame with the past

frame and using that what can you get? You can find out that there will be some

regions where there will be no change in the intensity and there will be some regions where

there will be some change in the intensity because of the movements.

Basically it is something like this: you have a frame, and supposing you have got an

object which is at this position, say the centroid of the object is at some point (n 1, n 2);

this is at frame k minus 1. And at frame k what will happen

is that the same object moves; the very simplistic assumption is that it undergoes a translational

movement. If it does a translational movement, in that

case what are we going to have? The translational movement is going to have an x component and

a y component or, since we take the directions as (n 1, n 2), the translational movement will

have one component along the n 1 direction and one component along the n 2 direction, so that

the translational movement can be specified

by a two element vector. We are going to call that the two

element displacement vector and we can denote it as the vector d, whose elements

are (d 1, d 2) transpose.
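
In symbols, the translational model can be written as below. The sign convention (content found at a position in frame k came from the position shifted back by d in frame k minus 1) is an assumption chosen for this sketch, and the relation is meant to hold only for the pixels of the moving object, ignoring intensity changes and noise.

```latex
\begin{align}
\mathbf{d} &= (d_1, d_2)^{T} \\
s(n_1, n_2, k) &= s(n_1 - d_1,\; n_2 - d_2,\; k - 1)
\end{align}
```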

So d 1 is the displacement along the n 1 direction, d 2 is the displacement along the n 2 direction, and

this is the displacement vector, also known as the motion vector. Basically the motion

vector refers to the same displacement vector only. Now, if we compare the frame k minus 1 with

frame k, what we will be observing is that there will be some regions where the displacement

is zero and there will be some moving objects where the displacements will be nonzero. So

we have to estimate those displacements. We can do that by matching, by doing a proper

matching of the past frame with the present frame. In the process of matching,

the ideal situation would be that we can generate

a pixel to pixel matching, but that will be highly time consuming and not very practicable,

simply because just by comparing a pixel's intensity with another frame's pixel intensity

you cannot be sure that this pixel corresponds to that pixel; by fluke or by accident

the two pixels can have the same intensity. Only if the pixel together with

its neighborhood matches with a pixel and its neighborhood

in the other image can we be confident of saying that this is a proper matching.

So it is not just a pixel matching but a neighborhood matching.

Now how to specify the neighborhood? Well, one can define a small region. But if we

specify a region arbitrarily then the issue will come as to how to specify the

region: do we take circular regions, do we take rectangular regions, do we take any arbitrary

regions? Well, the simplest of those could be a square or a rectangular region, and

we are already familiar with the partitioning of an image into non-overlapping blocks; we

did that in the DCT or transform domain coding, where we were subdividing the image into

a set of non-overlapping blocks. Applying a very similar philosophy we can divide the

frame into a set of non-overlapping blocks and for each block we can measure how

much displacement it has undergone.

So the process of finding out this displacement it is basically a search technique and this

technique is popularly known as motion estimation. This aspect was completely missing in the

case of still image compression. Still image did not have anything called motion estimation

but in this case, in the case of video we are going to have a motion estimation process

so that the displacements undergone by every block can be measured with respect to its

past frame.

What we are going to have is that this frame store output has to be compared with the input

video and we are going to call that block as motion estimator.

And what is this motion estimator going to generate?

The motion estimator will find out the motion vectors for every block.

And why is this motion vector useful to

us? If I know how much motion is undergone by which block then I can fictitiously apply

that motion to my stored frame and generate the predicted frame. If I know that

this block is expected to move this much, I can generate a fictitious displacement and I can

predict a frame. So what we do is that there will be a motion prediction

block, and the motion prediction block will work on the stored frame by applying the motion

vectors, which I call MV; by applying the motion vectors to the stored frame we

can generate the predicted video. So this is how we complete the path.
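
The motion prediction block can be sketched in Python as follows. The details are assumptions made for the example: n 1 is taken as the row index and n 2 as the column index, a block with motion vector (d 1, d 2) is copied from the stored frame at the position shifted back by that vector, references falling outside the frame are clamped to the border, and the frame dimensions are multiples of the block size. The function name motion_compensate and the dictionary layout of the motion vectors are hypothetical.

```python
def motion_compensate(reference, motion_vectors, block=4):
    """Build the predicted frame by applying one motion vector per block.

    reference:       stored past frame as a list of rows
    motion_vectors:  dict mapping (block_top, block_left) -> (d1, d2)
    """
    rows, cols = len(reference), len(reference[0])
    predicted = [[0] * cols for _ in range(rows)]
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            d1, d2 = motion_vectors[(top, left)]
            for r in range(top, top + block):
                for c in range(left, left + block):
                    # Fetch the pixel the block content is assumed to come from,
                    # clamping to the frame border when the shift leaves the frame.
                    rr = min(max(r - d1, 0), rows - 1)
                    cc = min(max(c - d2, 0), cols - 1)
                    predicted[r][c] = reference[rr][cc]
    return predicted


# A 4x8 stored frame made of two 4x4 blocks; the right block moved one column right.
reference = [[r * 8 + c for c in range(8)] for r in range(4)]
motion_vectors = {(0, 0): (0, 0), (0, 4): (0, 1)}
predicted = motion_compensate(reference, motion_vectors)
print(predicted[0])
```

The left block, whose motion vector is zero, is copied unchanged, while the right block is filled from the stored frame one column to the left.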

Now this is in the encoder part. Let me

draw it in a different color. The path that I have shown within this red dotted border

can also be used in the decoder; am I right?

There is no difficulty; only this motion estimation I cannot do, because motion estimation requires

the input video and I do not have the input video at the decoder. For motion prediction what

do we require? We require the stored frame and the motion vector. That means to say

that the decoder must get the information about this motion vector. So this motion vector

also has to be sent into the bit-stream. So we must derive a tapping from here; just to

avoid clumsiness in the drawing I use a different color. This is the motion vector, which will

go into the entropy encoder. The motion vectors also will be having some similarities between

themselves, that means to say that the motion vector of one block and the motion vectors of the

neighboring blocks are going to have some kind of similarity, so you can

exploit that in the entropy encoding process. So the motion vectors can go through the entropy

encoder and go into the bit-stream. So what we are sending in this bit-stream

is not only the encoded error bit-stream but also the encoded motion

vector bit-stream.
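
One possible way to make the similarity between neighboring motion vectors visible to an entropy coder is to send each vector as a difference from the previous block's vector. The lecture does not specify this scheme; the Python sketch below is an assumption for illustration, with made-up motion vectors and hypothetical function names.

```python
def differential_encode(motion_vectors):
    """Replace each motion vector by its difference from the previous one."""
    prev = (0, 0)
    diffs = []
    for d1, d2 in motion_vectors:
        diffs.append((d1 - prev[0], d2 - prev[1]))
        prev = (d1, d2)
    return diffs


def differential_decode(diffs):
    """Rebuild the motion vectors by accumulating the differences."""
    prev = (0, 0)
    motion_vectors = []
    for e1, e2 in diffs:
        prev = (prev[0] + e1, prev[1] + e2)
        motion_vectors.append(prev)
    return motion_vectors


# Motion vectors of consecutive blocks along a row (illustrative values).
mvs = [(0, 0), (1, 2), (1, 2), (1, 3), (0, 0)]
diffs = differential_encode(mvs)
print(diffs)
```

When neighboring blocks move alike, many of the differences are zero or small, and values concentrated around zero are what an entropy coder can represent with fewer bits.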

Therefore at the decoder block diagram this bit-stream can have the two components: one

is the motion vector and the other is the signal.

As far as the signal is concerned, after the entropy decoding we make a bifurcation.

The signal part goes to the de-quantizer and IDCT to derive this e dot (n 1, n 2, k) and

the motion vector can be used in the motion prediction and frame

store. I can in fact draw it as a single block: motion prediction plus frame store.

Now what does the frame store require? The frame store requires s dot (n 1, n 2, k).

In the encoder it required s dot (n 1, n 2, k); in the decoder also s dot (n 1, n

2, k) is available from here, so s dot (n 1, n 2, k) could be fed to the frame store block.

The motion vector is coming over here, so the motion prediction plus frame store could be

done from here, and this generates the predicted video s cap (n 1, n 2, k), just like

the way it is done even here also. All that we have done is that this frame store and

motion prediction we have combined into one single block. But the output of the motion

prediction block is the predicted video s cap (n 1, n 2, k). So even here also it will

be s cap (n 1, n 2, k).

This is the complete decoder. This is the basic philosophy behind the hybrid video coding.

Anybody having any doubts at this stage, before we proceed further into this? Yes. Well, the

question that you are asking basically goes back to the issue of transmission of both

audio and video in a synchronized form. That is a different problem altogether, which

is addressed as a part of multimedia communication, so I am not addressing that issue in this

course. But it is a very intricate issue: yes, the audio and video bit-streams, which

are independently generated, have to be properly synchronized and then multiplexed

into the bit-stream, and there are techniques and standards for that, so one has to

follow those to do this.

The idea that I am pursuing here pertains only to video coding. Now there are a few issues

that need discussion at this stage. What are the new elements that we learnt from this

block diagram? One major aspect that we have learnt is that a very efficient motion estimation

technique has to be carried out at the encoder. Why efficient? Because, as we had

seen by reference to an example, we have to determine the motion

vectors for every individual block.

Now supposing I subdivide this image into 8 by 8 non-overlapping blocks like this and then

I estimate the motion vector for each individual block; just try to see the computational intricacies

which are involved in that.

To find out the motion vector we have to do a matching, and we do not know a priori

whether this particular block of pixels has moved to the left or to the right or to the

top or to the bottom; it can move in any of these directions. I mean, assuming a translational

movement, it has got two degrees of freedom; it can move in positive x, negative x, positive

y, negative y. That means to say that one has to have a search, and the search is complicated

in the sense that whenever we are matching, we match one block with a block that is present in

the past frame; the past frame is actually used as a reference.

In this case, in the block diagram, what we have essentially done is that the image that

we obtained from the frame store is the past frame, the reference frame, whereas the input video

is being called the candidate frame. The candidate frame is

the one which we are going to encode and the past frame is what we are using as a reference,

so the two terms that we use are the candidate and the reference.

Now, what we have to do while searching is that we have to search within a range, and

for every search position we have to find out what the similarity is between

this block of pixels and that in the reference

frame; the candidate block is to be compared with

a reference block. If the block size is 8 by 8 then we have to compare the values

for 64 pixels, and that we have to do for only one search position,

but there will be several search positions.
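
To get a feel for the numbers, the following Python sketch counts the pixel comparisons. The search range of plus or minus 7 pixels in each direction and the 352 by 288 frame size are assumptions picked for the example; the lecture only fixes the 8 by 8 block size.

```python
def search_positions(search_range):
    """Number of candidate displacements for a range of +/- search_range pixels."""
    return (2 * search_range + 1) ** 2


def comparisons_per_block(block_size, search_range):
    """Pixel comparisons needed to test every search position for one block."""
    return block_size * block_size * search_positions(search_range)


def comparisons_per_frame(width, height, block_size, search_range):
    """Pixel comparisons for all non-overlapping blocks of one frame."""
    blocks = (width // block_size) * (height // block_size)
    return blocks * comparisons_per_block(block_size, search_range)


positions = search_positions(7)
per_block = comparisons_per_block(8, 7)
per_frame = comparisons_per_frame(352, 288, 8, 7)
print(positions, per_block, per_frame)
```

Under these assumptions there are 225 search positions and 14,400 pixel comparisons per block, which comes to roughly 22.8 million comparisons for every frame.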

Now, one need not use an unlimited search range,

because as I have told, the movement between consecutive frames is going to

be quite restricted; that is why we need not search over a

very large search range. We can restrict the search range. But even within the restricted

search range we have to do a lot of searches. So there will be considerable computational issues

involved, and there are some techniques for very efficient search

computation, although the optimal search technique is always the one which carries out an

exhaustive search, or what is known as the full search block motion. This way of estimating

the motion by subdividing the image into blocks like this is called the block matching

technique or block motion estimation technique, whereby we are

subdividing the image into small blocks and estimating their motion, and whenever we are

searching over the entire search range and estimating the motion we call it full search

block motion, or in short form it is called FSBM.
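
A full search over the whole range can be sketched in Python as below. The matching criterion is not specified at this point in the lecture, so the sum of absolute differences (SAD) used here is an assumption, as are the function name full_search, the 4 by 4 block size, the search range of 2 pixels and the sign convention (the block in the candidate frame is compared with the reference block shifted back by the displacement).

```python
def full_search(candidate, reference, top, left, block=4, search_range=2):
    """Test every displacement in the range and return (best_vector, best_cost)."""
    rows, cols = len(reference), len(reference[0])
    best_vector, best_cost = None, None
    for d1 in range(-search_range, search_range + 1):
        for d2 in range(-search_range, search_range + 1):
            ref_top, ref_left = top - d1, left - d2
            # Skip displacements whose reference block leaves the frame.
            if ref_top < 0 or ref_left < 0:
                continue
            if ref_top + block > rows or ref_left + block > cols:
                continue
            # Sum of absolute differences over the block (assumed criterion).
            cost = sum(
                abs(candidate[top + i][left + j] - reference[ref_top + i][ref_left + j])
                for i in range(block)
                for j in range(block)
            )
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (d1, d2), cost
    return best_vector, best_cost


# Reference frame with distinct pixel values; the candidate frame is the same
# content moved down by 1 row and right by 2 columns (uncovered pixels set to 0).
reference = [[10 * r + c for c in range(8)] for r in range(8)]
candidate = [[reference[r - 1][c - 2] if r >= 1 and c >= 2 else 0
              for c in range(8)] for r in range(8)]

vector, cost = full_search(candidate, reference, top=2, left=2)
print(vector, cost)
```

For this synthetic pair of frames the search recovers the displacement that was applied, with a matching cost of zero, at the price of evaluating all 25 positions in the range.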

Now FSBM guarantees that you will be able to find the optimal position at which

the match is the best. Now how to quantify that match, or rather how to measure the quality

of that match, that I will cover in the next lecture, and then we will also talk about some

efficient techniques for that which have been used in some of the popular standards, and

we will also talk about some of the popular video compression standards. Thank you.
