The correct answer here is D. The correlation coefficient is going to

be somewhere around negative 0.5. How do you know that?

Well, a lot of these choices are obviously incorrect. So negative one and

positive one are out; clearly all the dots don't lie on the line.

The correlation is also not zero, because you can see that the line has some slope.

And E would be incorrect because you can see that there is some amount of

modest correlation here. It's certainly not strong, but it's

negative, because our slope is negative. And it's about 0.5 in magnitude, since the line

is going down and it's kind of a modest correlation.

The correct answer here is B: a covariance of zero indicates no

correlation. A correlation coefficient is a standardized

covariance, not the other way around; that's why A is incorrect.
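
To make the "standardized covariance" idea concrete, here is a minimal sketch. The height and weight numbers are made up for illustration and are not from the quiz. It shows that changing the units changes the covariance but leaves the correlation coefficient alone.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Correlation = covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data, invented for this example only.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [52, 61, 65, 78, 84]
height_cm = [h * 100 for h in height_m]  # same heights, different units

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm)  # the covariance is 100 times larger in centimeters
print(r_m, r_cm)      # the correlation is the same in both unit systems
```

The covariance carries the units of both variables, so it rescales with them; dividing by the two standard deviations cancels the units.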

C is incorrect, because a correlation coefficient with a highly

significant P-value does not necessarily indicate a strong correlation in terms of

the magnitude of the correlation. You can have an R of 0.001 that was

highly statistically significant if your sample size was big enough.
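
Here is a rough sketch of that point. The two sample sizes are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only because the second sample is very large.

```python
import math

def two_sided_p_value(r, n):
    """Approximate two-sided p-value for testing that the true correlation is zero."""
    # Standard test statistic for a correlation coefficient.
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # Normal approximation to the t distribution (an assumption; fine for large n).
    return math.erfc(abs(t) / math.sqrt(2))

r = 0.001
p_small_sample = two_sided_p_value(r, 100)         # hypothetical small study
p_huge_sample = two_sided_p_value(r, 20_000_000)   # hypothetical enormous study

print(p_small_sample)  # nowhere near significant
print(p_huge_sample)   # highly significant, yet r is still only 0.001
```

The magnitude of the correlation is identical in both runs; only the sample size, and therefore the p-value, changes.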

And D is incorrect because it's the correlation coefficient that's unitless;

the covariance has units. So, in this question, we're given a

linear regression equation with an outcome of cholesterol,

which has an intercept of 150 plus two times our age past 40; two is our slope

here. Notice the units are in age past 40, so

you have to be careful about that. If we wanted to get the predicted

cholesterol for a 60 year old, his or her age past 40 is only 20.

So we would plug 20 into the model here. So our yi hat, our predicted value for

this person would be 150 plus two times 20, which is equal to 190.
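
The same plug-in step can be sketched in a few lines. The function name is just for illustration; the intercept, slope, and the "age past 40" units come from the question.

```python
def predicted_cholesterol(age):
    """Predicted cholesterol from the model: 150 + 2 * (age past 40)."""
    intercept = 150
    slope = 2
    age_past_40 = age - 40  # the predictor is age past 40, not raw age
    return intercept + slope * age_past_40

print(predicted_cholesterol(60))  # 150 + 2 * 20 = 190
```

Plugging in the raw age of 60 instead of 20 would give 270, which is the mistake the units are meant to warn you about.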

In this question, you have to pick out the regression equation from the output

like I did in some of the videos. So what is our model here?

Our model is that our outcome, our optimism score, is equal to the

intercept, 77.97, I'm finding that right here,

plus our Beta, 0.276, times exercise, and our exercise is in units of hours per week.

So, which of the following is true? So, A is actually correct, because it

says a person who exercises 0 hours per week, is predicted to have an optimism

score of 77.97. Well, that's indeed true; you can just

plug in zero here. The slope term is actually going to drop out of the

model. So that's what an intercept means.

The intercept is the predicted value of the outcome when the x

variable, here exercise, is zero. That's where the line crosses the y axis.

So 77.97, that would be the correct answer here.

B is incorrect, because exercise is not significantly related to optimism.

Our P value came out to be 0.12. C is incorrect, because every additional

hour of weekly exercise was associated with a 0.276 increase in optimism,

not a decrease. Otherwise, it would have a negative sign.

And D is incorrect, because it is misconstruing the intercept.

The intercept is 77.97, not the slope.
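
As a quick check on how to read the intercept and the slope, here is a small sketch of the fitted equation. The function and constant names are my own; the two coefficients are the ones read off the output.

```python
INTERCEPT = 77.97  # predicted optimism score at 0 hours of exercise per week
SLOPE = 0.276      # change in predicted optimism per additional weekly hour

def predicted_optimism(hours_per_week):
    """Predicted optimism score from the fitted line."""
    return INTERCEPT + SLOPE * hours_per_week

at_zero = predicted_optimism(0)
one_hour_difference = predicted_optimism(1) - predicted_optimism(0)

print(at_zero)              # the intercept
print(one_hour_difference)  # the slope, a positive change
```

At zero hours the slope term contributes nothing, so the prediction is just the intercept; moving up one hour shifts the prediction by the slope.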

So, we're given here a regression equation for cholesterol:

150 is the intercept, two is the slope.

Our units are age past 40. So if we have a 60 year old, then we would

say his age past 40 is 20, so we plug 20 into the model.

That would give him a predicted value, a yi hat, of 190.

This is asking for the residual, so what is his observed, his actual value?

The residual is your observed value minus your predicted value.

His observed value, his actual value, is 250.
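
The residual arithmetic for this person can be sketched as follows. The variable names are illustrative; the numbers are the ones given in the question.

```python
def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted

predicted = 150 + 2 * (60 - 40)  # model prediction for a 60 year old
observed = 250                   # his actual cholesterol

res = residual(observed, predicted)
print(predicted, res)
```

A positive residual means the person sits above the regression line; a negative one would mean the model over-predicted.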

His predicted value is 190. So he would have a residual of positive

60, which is A. In this question, we're told that we've

run a model with DSST scores as our outcome variable.

And our first model contains vitamin D and age as predictors.

The Beta for vitamin D we're given is 0.30, and we're given the 95% confidence

interval, 0.2 to 0.4, which excludes zero. Which means we know that it's

statistically significant at the 0.05 level.

We then run another model in which we add an additional predictor into the model,

we add exercise. The new Beta for vitamin D turns out to

be 0.15, and the confidence interval goes from 0.05 to 0.25.

So, by adding exercise to the model, we drastically reduced the size of the Beta

coefficient, the slope between vitamin D and DSST.

It was reduced by 50%. That's an indication that we have

confounding, exercise is a confounder of the relationship between vitamin D and

DSST score. So, A is correct here.
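
The comparison of the two Betas can be sketched like this. The helper names are my own; the coefficients and confidence intervals are the ones given in the question.

```python
def percent_change(before, after):
    """Percent change in a coefficient after adding a covariate."""
    return (after - before) / before * 100

def excludes_zero(lower, upper):
    """A 95% confidence interval that excludes zero means p < 0.05."""
    return lower > 0 or upper < 0

beta_without_exercise = 0.30  # model with vitamin D and age
beta_with_exercise = 0.15     # model with vitamin D, age, and exercise

change = percent_change(beta_without_exercise, beta_with_exercise)
significant_before = excludes_zero(0.2, 0.4)
significant_after = excludes_zero(0.05, 0.25)

print(change)                                 # a 50% drop in the slope
print(significant_before, significant_after)  # significant in both models
```

Both intervals exclude zero, so significance alone would not flag anything; it is the change in the size of the slope that points to confounding.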

If you picked B, you got misled by the P values here.

So we do have a statistically significant Beta in both cases.

That is, the relationship with vitamin D doesn't completely disappear when we

adjust for exercise. But we don't look at the P values for

judging confounding. We look at the magnitude of the slope of

the Beta coefficients. In this question we're given the mean

blood pressure of three groups. So for the high exercise group it was

120. For the medium exercise group it was 125.

And for the low exercise group, it was 135.

If I fit a linear regression model where I don't have anything else in the model

other than this categorical predictor, what's the resulting linear

regression equation going to look like? Well, my outcome here is systolic blood

pressure. I have to choose something to be my reference group.

I'm going to be dummy coding this categorical predictor.

Well, which one am I going to choose to be the reference group?

Well, you could choose any of these to be the reference group, but you're limited

in your choices in the multiple choice question.

So, you can see that in most of them, I picked 120,

the lowest blood pressure, with the high exercise group, to be my reference.

So I made this one my reference group, because it happened to have the lowest

blood pressure. So the intercept becomes the mean for the

reference group. Then I'm going to have two dummy coded

predictors in my model. I'm going to have a dummy coded predictor

for if you're in the medium exercise group, versus anybody else.

Their mean is 5 points higher than the reference group.

So their beta is going to have a value of five.

And we're going to be multiplying that times one if you're a medium exerciser;

that's how we dummy code. Then we're going to have to have a dummy

coded predictor in the model for the low exercise group.

The difference between the low exercise group and the high exercise group in the

mean blood pressure was 15. So they get an additional 15 for their

blood pressure. We're going to be multiplying that by

one, if you're in the low group. Everybody else will get a zero for that

dummy coded variable. So the correct answer here would be D,

because only D represents the correct equation.
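
The dummy coded equation can be sketched as below, with the high exercise group as the reference. The function name and the group labels are illustrative; the coefficients are the group mean differences from the question.

```python
def predicted_sbp(group):
    """Predicted systolic blood pressure: 120 + 5 * medium + 15 * low."""
    # Dummy variables: 1 if you are in that group, 0 for everybody else.
    medium = 1 if group == "medium" else 0
    low = 1 if group == "low" else 0
    return 120 + 5 * medium + 15 * low

for group in ("high", "medium", "low"):
    print(group, predicted_sbp(group))
```

With only this categorical predictor in the model, the predictions simply reproduce the three group means: 120, 125, and 135.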

In this next question, we're given unadjusted means for three groups, the

high, medium and low exercise groups for blood pressure.

They were 120, 125 and 135, so, the lowest exercising people have higher

blood pressure. We then adjusted those data for age.

The adjusted estimates for the mean then become 115, 125, and 140.

So what we notice is that when you adjust for age, the disparity between the three

groups actually increases. It looks like there's an even bigger

difference between especially the high and low exercise groups.
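
The widening gap is simple arithmetic, sketched here with the means from the question. The dictionary layout and function name are just one way to organize it.

```python
unadjusted = {"high": 120, "medium": 125, "low": 135}
age_adjusted = {"high": 115, "medium": 125, "low": 140}

def high_low_gap(means):
    """Difference in mean blood pressure, low exercise minus high exercise."""
    return means["low"] - means["high"]

gap_before = high_low_gap(unadjusted)
gap_after = high_low_gap(age_adjusted)

print(gap_before, gap_after)
```

The gap grows from 15 to 25 points once age is accounted for.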

So what does that tell us about the relationship here with age?

In general I've told you here that blood pressure will increase with age.

So if blood pressure is increasing with age, and adjusting for age makes our

differences between the groups look even bigger,

then what we can conclude is that the high exercise group actually was older on

average than the low exercise group. When we adjusted for age there's actually

a bigger difference here. In other words, in the original sample,

where we didn't adjust for age, we saw a 15 point difference between the highest

and the lowest exercise groups. However, when we adjusted for age, that

went to a 25 point difference. So, what must be happening here is that

we happen to have some older people in the high exercise group

and some younger people in the low exercise group.

And once we adjust for age we're revealing that the difference in blood

pressure between the exercise groups is even bigger than we saw when we hadn't

adjusted for age. In this question we are told that we've

got a study where we've got a group of just 14 women, a subgroup.

We've run a linear regression model on just those 14 women, and we get this

miraculous R squared of 99%. Well, if it looks too good to be true, it

is. Once you stuff too many predictors in the

model, especially with a small sample size like this, you're at high risk of

over-fitting. That high R squared value reflects an

over-fit model. So C is correct here: the model suffers

from over-fitting.
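
To see how easily this happens, here is a minimal simulation sketch. Everything in it is hypothetical: the question does not say how many predictors were used, so I assume 13 pure-noise predictors plus an intercept, and the outcome is random noise too. With 14 observations and 14 parameters, least squares reduces to solving a square linear system, which the code does directly.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update the right-hand side too.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(y, fitted):
    """R squared = 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(0)
n_women = 14
n_predictors = 13  # hypothetical count, chosen so parameters = observations

# Design matrix: an intercept column plus predictors that are pure noise.
X = [[1.0] + [random.gauss(0, 1) for _ in range(n_predictors)]
     for _ in range(n_women)]
y = [random.gauss(0, 1) for _ in range(n_women)]  # outcome is noise as well

beta = solve_linear_system(X, y)
fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
r2 = r_squared(y, fitted)

print(round(r2, 4))
```

None of these predictors has any real relationship with the outcome, yet the R squared should come out essentially equal to 1, because the model has enough parameters to pass through every one of the 14 points.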