Simple linear regression model
Moon Oulatta, PhD
Department of Economics
Introduction
A linear regression model allows one to analyze the empirical
relationship between two (simple regression) or more variables
(multiple regression) with a stated level of statistical confidence.
A linear regression model is linear in its coefficients, but the
functional relationship between the dependent variable and the
independent variable can be linear or nonlinear.
There are various linear estimators that can be used to estimate
the population parameters of a linear regression model. Here,
we are mainly concerned with finding the best linear unbiased
estimator (BLUE). We show that the ordinary least squares (OLS)
estimator is BLUE under certain conditions.
Classical assumptions
A linear population regression model can be defined as follows

y = γ_0 + γ_1 z + ε    (1)

Equation (1) denotes a simple population regression function,
which depicts a linear relationship between y (dependent variable)
and z (independent variable). ε is an error term that captures
any factors affecting y that are omitted from equation (1). γ_0
and γ_1 are the unknown population parameters: these are the
parameters that we are interested in estimating with sample data by
using the OLS approach.
γ_0 is the constant of the regression, and γ_1 is the slope, which
measures the linear effect of z on y while holding other factors
constant:

∂y/∂z = γ_1

How do we know that γ_1 is the true effect of z on y? We need to
make a few assumptions (the Gauss-Markov assumptions) about the
nature of the relationship between the error term (ε) and the
independent variable (z).
First, we make an assumption about the first moment of the
distribution of the error term:

E(ε) = 0

We strengthen this with mean independence, E(ε|z) = E(ε), which
means that the mean of the error term does not depend on the
explanatory variable (z).
Mean independence implies that the two random variables (ε, z)
are also uncorrelated, which is necessary to ensure that γ_1 is the
true and unbiased effect of z on y (see the whiteboard for the
mathematical proof).
Hence the zero conditional mean assumption:

E(ε|z) = E(ε) = 0, which implies cov(ε, z) = 0

Using the zero conditional mean assumption, we can show that
the average value of y changes by exactly γ_1 for a one-unit
change in z:

E(y|z) = γ_0 + γ_1 z
The conditional variance of the error term is assumed to be
constant, which implies that the variance of the error term (ε) is
the same for any value of z (this is known as homoscedasticity):

var(ε|z) = var(ε) = σ²_ε

In addition, we assume that the error term is normally distributed,
with a zero mean and a constant variance:

ε ~ N(0, σ²_ε)
Ordinary least squares (method)
We cannot sample the entire population of y and z, but we can
obtain an unbiased sample from the population. The OLS approach
requires that the sample observations satisfy the following
condition:

(z_i, y_i), i = 1, 2, . . . , n → i.i.d.

i.i.d. means that the sample observations must be independent and
identically distributed (each observation has an equal chance of
being selected). To achieve this, the sample should be drawn
from a random process (see the lecture on sampling design).
Using a random sample of data for y and z, we estimate the
sample regression model and obtain our predictions of the
dependent variable as follows

ŷ_i = γ̂_0 + γ̂_1 z_i    (2)

where ŷ_i are the predicted values of y_i based on the sample data.
γ̂_0 and γ̂_1 are the OLS estimators of the unknown population
parameters.
There will always be some deviation between the actual data (y_i)
and what we predict (ŷ_i) based on the sample regression model:
these deviations are referred to as the residuals of the regression
model, which can be estimated as follows

ε̂_i = y_i − ŷ_i    (3)

The unexplained variance in a regression model measures the
portion of variation in the actual data that is not explained by the
regression model. It is computed by finding the residual sum of
squares:

Σ_{i=1}^{n} ε̂_i² = Σ_{i=1}^{n} (y_i − ŷ_i)²
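As a concrete illustration, the residuals from equation (3) and their sum of squares can be computed in a few lines. This is a minimal sketch using NumPy, with made-up toy numbers rather than data from the lecture:

```python
import numpy as np

def residuals_and_ssr(y, y_hat):
    """Residuals (3) and the residual sum of squares for given predictions."""
    resid = y - y_hat            # each residual: actual minus predicted
    ssr = np.sum(resid ** 2)     # unexplained variance (sum of squared residuals)
    return resid, ssr

# toy data: actual values and hypothetical fitted values
y = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 6.0])
resid, ssr = residuals_and_ssr(y, y_hat)
print(ssr)  # 0.25 + 0.25 + 0.0 = 0.5
```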
The main objective of the OLS method is to minimize the
unexplained variance. The OLS optimization problem (objective
function) is defined as follows

arg min_{γ̂_0, γ̂_1} Φ = Σ_{i=1}^{n} ε̂_i²    (4)

where in equation (4), the objective is to choose γ̂_0 and γ̂_1 so as
to minimize the unexplained variance.
Taking the first-order conditions of equation (4) yields the
following results:

∂Φ/∂γ̂_0 → −2 Σ_{i=1}^{n} (y_i − γ̂_0 − γ̂_1 z_i) = 0

∂Φ/∂γ̂_1 → −2 Σ_{i=1}^{n} z_i (y_i − γ̂_0 − γ̂_1 z_i) = 0
Using the first-order conditions, we can solve for the OLS estimators
of the constant and the slope as follows (see the whiteboard for
the mathematical proof):

γ̂_0 = ȳ − γ̂_1 z̄    (5)

γ̂_1 = Σ_{i=1}^{n} (z_i − z̄)(y_i − ȳ) / Σ_{i=1}^{n} (z_i − z̄)²    (6)

The OLS estimators are BLUE when the classical assumptions hold.
Another important assumption is that the variance of the
independent variable cannot be zero.
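Equations (5) and (6) translate directly into code. The sketch below (assuming NumPy; the data are invented for illustration) recovers the coefficients of an exact line:

```python
import numpy as np

def ols_simple(z, y):
    """OLS estimators for y = g0 + g1*z + e, via equations (5) and (6)."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    z_bar, y_bar = z.mean(), y.mean()
    g1 = np.sum((z - z_bar) * (y - y_bar)) / np.sum((z - z_bar) ** 2)  # slope (6)
    g0 = y_bar - g1 * z_bar                                            # constant (5)
    return g0, g1

# data generated from an exact line y = 1 + 2z, so OLS recovers it exactly
z = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * z
g0, g1 = ols_simple(z, y)
print(g0, g1)  # 1.0 2.0
```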
Sampling distribution
The OLS estimators are derived from sample data, so they are
treated as random variables with a sampling distribution. We will
rely on the characteristics of the sampling distribution of the OLS
estimators to make inferences about the unknown coefficients of the
population regression.
Here, we design a Monte Carlo experiment to show that the
sampling distribution of the OLS estimators is normally
distributed for large samples.
First, we define the true population regression function as follows

y = 1 + 3z + ε    (7)

where ε is the error term, which is normally distributed with a zero
mean and a constant variance, and z is an independent variable that
follows a beta distribution. Note that y is a linear combination of
ε, which means that y is expected to be normally distributed.
We randomly draw 2,000 samples of 50 observations each, without
replacement, from a large population of 15,000 observations. We
store the OLS estimators from the 2,000 regressions to compute the
sampling distributions of the constant and the slope.
For large samples, we show that the OLS estimators are normally
distributed and BLUE (see the RStudio example for instructions
on the Monte Carlo simulations).
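The Monte Carlo design above can be sketched as follows. This is a hypothetical stand-in for the RStudio example: it draws fresh samples from the model rather than subsampling a fixed population of 15,000, and the Beta(2, 5) shape and unit error variance are assumptions, not values stated in the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_slopes(n_reps=2000, n_obs=50):
    """Sampling distribution of the OLS slope for y = 1 + 3z + e, equation (7)."""
    slopes = np.empty(n_reps)
    for r in range(n_reps):
        z = rng.beta(2.0, 5.0, size=n_obs)       # z ~ Beta(2, 5): assumed shape
        eps = rng.normal(0.0, 1.0, size=n_obs)   # e ~ N(0, 1): assumed variance
        y = 1.0 + 3.0 * z + eps
        z_c = z - z.mean()
        slopes[r] = np.sum(z_c * (y - y.mean())) / np.sum(z_c ** 2)  # equation (6)
    return slopes

slopes = simulate_slopes()
print(slopes.mean())  # clusters around the true slope of 3
```

A histogram of `slopes` is what the lecture's Monte Carlo figure summarizes: approximately normal and centered on the true parameter.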
Here, we report the true relationship between y and z as follows:
Figure: Population Regression Function
Here, using the Monte Carlo experiment, we show that the sampling
distribution of the OLS estimators is normally distributed for larger
samples.
Figure: Monte Carlo (Simulation)
The sampling distribution of the OLS estimators can be derived
as follows

γ̂_0 ~ N( γ_0 , σ²_ε [ 1/n + z̄² / Σ_{i=1}^{n} (z_i − z̄)² ] )    (8)

γ̂_1 ~ N( γ_1 , σ²_ε / Σ_{i=1}^{n} (z_i − z̄)² )    (9)

When the variance of the population errors is unknown, we can
estimate it by relying on an unbiased estimator of the population
variance, which is a function of the residual variance. Larger
samples provide more efficient estimators, because they reduce the
sampling variance of the OLS estimators.
Statistical inference
We want to use sample data collected for y and z to estimate
the unknown population parameters in equation (1) by relying on
the OLS approach.
We will make an assumption about the true relationship between
y and z and test this assumption by using a test statistic that
helps us determine whether the assumption is statistically valid.
We will rely on the sampling distribution of the OLS estimators to
derive the sampling distribution of the test statistic and test the
given hypothesis about the population parameter.
The sampling distribution of the OLS estimators is normal, as shown
earlier. However, because the variance of the population errors is
unknown, we will rely on the Student t-distribution instead of the
normal distribution to conduct statistical inference.
We can state the null hypothesis that there is no true linear
relationship between y and z, which implies that the true effect of z
on y is equal to zero:

H_0 : γ_1 = 0

Alternatively, the alternative hypothesis can be stated as follows

H_A : γ_1 ≠ 0
We need to choose a confidence level as follows

confidence level = (1 − α)

α denotes the level of significance, or in other words the maximum
amount of risk we are willing to tolerate for making a type-1 error.
Define the test statistic as follows

T = (γ̂_1 − γ_1) / s.e.(γ̂_1) ~ t(n − 2)    (10)

Using the unbiased estimator (mean squared error) of the variance
of the population errors

σ̂²_ε = Σ_{i=1}^{n} ε̂_i² / (n − 2)    (11)

we can compute the estimate of the standard error as follows

s.e.(γ̂_1) = σ̂_ε / √( Σ_{i=1}^{n} (z_i − z̄)² )    (12)

s.e.(γ̂_1) measures the precision of the OLS estimate of the slope.
Compute the critical test statistic (t_c) by relying on the degrees
of freedom (n − 2) and the level of significance (see the Excel file
on Blackboard).

Figure: Two-Tail Test (T-Distribution)

If |T| ≤ t_c, we fail to reject H_0. Alternatively, when |T| > t_c,
we can reject H_0.
The confidence interval provides a range of estimates for the
unknown population parameters. For instance, the confidence
interval for the slope can be computed as follows

CI = γ̂_1 ± s.e.(γ̂_1) · t_c    (13)

We should reject H_0 when zero is not included in the confidence
interval for the parameter.
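Equation (13) is a one-liner in code. Plugging in the slope estimate (0.232), its standard error (0.0960), and the critical value (2.201) from the application later in the lecture shows an interval that excludes zero:

```python
def confidence_interval(g1_hat, se, t_c):
    """Two-sided confidence interval for the slope, equation (13)."""
    return g1_hat - se * t_c, g1_hat + se * t_c

# values taken from the lecture's application section
lo, hi = confidence_interval(0.232, 0.0960, 2.201)
print(lo, hi)  # roughly (0.021, 0.443): zero is excluded, so H0 is rejected
```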
Application (population regression)
Scenario: policymakers in Latin America are interested in
examining the relationship between credit provided by the
financial sector (c) and secondary school enrollment (s) as a
means to improve access to education. They believe that the true
relationship can be modeled as follows

s = γ_0 + γ_1 c + ε    (14)

where ε is the error term. γ_0 and γ_1 are the unknown population
parameters: these parameters are to be estimated by applying the
OLS approach to sample data.
Application (sample data)
Figure: Private Financing & School Enrollment
Application (OLS estimators)
Using sample data, we can compute the following statistics:
Σ_{i=1}^{13} (c_i − c̄)² = 6956.456;
Σ_{i=1}^{13} (c_i − c̄)(s_i − s̄) = 1614.142;
s̄ = 73.06; c̄ = 49.29. We can compute the OLS estimators as
follows

γ̂_1 = Σ_{i=1}^{13} (c_i − c̄)(s_i − s̄) / Σ_{i=1}^{13} (c_i − c̄)² = 0.232

γ̂_0 = s̄ − γ̂_1 c̄ = 61.62
Application (inference)
First, we state the null hypothesis that there is no linear
relationship between s and c as follows

H_0 : γ_1 = 0

then we define the alternative hypothesis as follows

H_A : γ_1 ≠ 0

Second, we choose a 95% confidence level (this is the standard).
Using the Student t-distribution, we can find the five-percent
critical test statistic for a two-tailed test with 11 degrees of
freedom as follows

t_c = 2.201
Next, we have to compute the test statistic (T), which requires an
estimate of the standard error of the OLS slope. First, we compute
the residual variance as follows

σ̂²_ε = Σ_{i=1}^{n} ε̂_i² / (n − 2) = 705.30 / (13 − 2) = 64.118

Using the fact that √σ̂²_ε = 8.007, we can compute the standard
error of the slope as follows

s.e.(γ̂_1) = σ̂_ε / √( Σ_{i=1}^{n} (c_i − c̄)² ) = 8.007 / 83.4053 = 0.0960

The test statistic, which follows a t(n − 2) distribution under H_0,
can be computed as follows

T = 0.232 / 0.0960 = 2.42
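The arithmetic of this application can be reproduced end to end from the summary statistics reported above (a sketch; only the rounded figures quoted in the lecture are used as inputs):

```python
import math

# summary statistics reported for the lecture's sample (n = 13)
S_cc = 6956.456        # sum of (c_i - c_bar)^2
S_cs = 1614.142        # sum of (c_i - c_bar)(s_i - s_bar)
s_bar, c_bar = 73.06, 49.29
ssr, n = 705.30, 13    # residual sum of squares and sample size

g1 = S_cs / S_cc                          # slope estimate: 0.232
g0 = s_bar - g1 * c_bar                   # constant estimate: 61.62
sigma2 = ssr / (n - 2)                    # residual variance (11): 64.118
se = math.sqrt(sigma2) / math.sqrt(S_cc)  # standard error (12): 0.0960
T = g1 / se                               # test statistic (10): 2.42
print(round(g1, 3), round(g0, 2), round(sigma2, 3), round(se, 4), round(T, 2))
```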
Application (conclusion)
Here, |T| > t_c. Therefore, we can reject the null hypothesis
and conclude that there is a statistically significant linear
relationship between domestic credit and secondary school enrollment.
More importantly, a one percentage point increase in the credit rate
is associated with a 0.23 percentage point increase in the school
enrollment rate.
Application (selected issues)
There are omitted variables that affect secondary school enrollment
and that are also correlated with credit provided by the financial
sector (for example, GDP per capita).
Furthermore, the relationship between school enrollment and credit
is not necessarily unidirectional: there is a possibility of reverse
causation.
The omitted-variable issue and the reverse-causation problem imply
that the zero conditional mean assumption is violated, which leads
to biased estimates:

E(γ̂_1) = γ_1 + cov(c, ε) / var(c)
Here, the sample size is relatively small, which makes the OLS
estimates less efficient. Students should rely on large samples to
obtain efficient OLS estimators.
The sample is not necessarily representative of the entire Latin
American region (we need a robust sampling method). This sample
selection bias makes the OLS estimators unreliable and biased.
The domestic credit variable could be subject to measurement error,
which would lead to biased OLS estimates.
Students should test the homoscedasticity condition by analyzing
the pattern of the residuals against the predicted values. This is an
important diagnostic for ensuring that the OLS estimates are
efficient.
Concluding remarks
A linear regression model is useful for analyzing the empirical
relationship between two or more variables. Here, we discussed the
key assumptions (for example, the Gauss-Markov assumptions) of a
simple linear regression model. These assumptions provide a strong
foundation for understanding the multiple linear regression model.
We discussed the inference part of the linear regression model and
provided a practical example to illustrate it.
Lastly, we discussed the key empirical issues that students are likely
to face when estimating a linear regression. Ultimately, this lecture
provides a robust exposure to the fundamentals of linear regression.