Simple linear regression model
Moon Oulatta, PhD
Department of Economics
Introduction
A linear regression model allows one to analyze the empirical
relationship between two (simple regression) or more variables
(multiple regression) with a stated level of statistical confidence.
A linear regression model is linear in its coefficients, but the
functional relationship between the dependent variable and the
independent variable can be linear or nonlinear.
There are various linear estimators that can be used to estimate
the population parameters of a linear regression model. Here,
we are mainly concerned with finding the best linear unbiased
estimator (BLUE). We show that the ordinary least squares (OLS)
estimator is BLUE under certain conditions.
Classical assumptions
A linear population regression model can be defined as follows

y = γ_0 + γ_1 z + ε    (1)

Equation (1) denotes a simple population regression function,
which depicts a linear relationship between y (dependent variable)
and z (independent variable). ε is an error term that captures
any factors affecting y that are omitted from equation (1). γ_0
and γ_1 are the unknown population parameters: these are the
parameters that we are interested in estimating with sample data by
using the OLS approach.
γ_0 is the constant of the regression, and γ_1 is the slope, which
measures the linear effect of z on y while holding other factors
constant:

∂y/∂z = γ_1

How do we know that γ_1 is the true effect of z on y? We need to
make a few assumptions (the Gauss-Markov assumptions) about the
nature of the relationship between the error term (ε) and the
independent variable (z).
First, we make an assumption about the first moment of the
distribution of the error term:

E(ε) = 0

We strengthen this with mean independence, E(ε|z) = E(ε), which
means that the mean of the error term does not depend on the
explanatory variable (z).
Mean independence implies that the two random variables (ε, z)
are also uncorrelated, which is necessary to ensure that γ_1 is the
true and unbiased effect of z on y (see the whiteboard for the
mathematical proof).
Hence the zero conditional mean assumption:

E(ε|z) = E(ε) = 0, which implies cov(ε, z) = 0

Using the zero conditional mean assumption, we can show that
the average value of y changes by exactly γ_1 for a one-unit
change in z:

E(y|z) = γ_0 + γ_1 z
The conditional variance of the error term is assumed to be
constant, which implies that the variance of the error term (ε) is
the same for any value of z (this is known as homoscedasticity):

var(ε|z) = var(ε) = σ²_ε

In addition, we assume that the error term is normally distributed,
with a zero mean and a constant variance:

ε ~ N(0, σ²_ε)
Ordinary least squares (method)
We cannot sample the entire population of y and z, but we can
obtain an unbiased sample from the population. The OLS approach
requires that the sample observations satisfy the following
condition:

(z_i, y_i), i = 1, 2, . . . , n → i.i.d.

i.i.d. means that the sample observations must be independent and
identically distributed (each observation has an equal chance of
being selected). To achieve this, the sample should be drawn
from a random process (see the lecture on sampling design).
Using a random sample of data for y and z, we estimate the
sample regression model and obtain our predictions of the
dependent variable as follows

ŷ_i = γ̂_0 + γ̂_1 z_i    (2)

where ŷ_i are the predicted values of y_i based on the sample data.
γ̂_0 and γ̂_1 are the OLS estimators of the unknown population
parameters.
There will always be some deviation between the actual data (y_i)
and what we predict (ŷ_i) based on the sample regression model:
these deviations are referred to as the residuals of the regression
model, which can be estimated as follows

ε̂_i = y_i − ŷ_i    (3)

The unexplained variance in a regression model measures the
portion of variation in the actual data that is not explained by the
regression model. It is computed by finding the residual sum of
squares:

Σ_{i=1}^{n} ε̂_i² = Σ_{i=1}^{n} (y_i − ŷ_i)²
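As a concrete illustration, the residuals from equation (3) and their sum of squares can be computed in a few lines. This is a minimal sketch using NumPy, with made-up toy numbers rather than data from the lecture:

```python
import numpy as np

def residuals_and_ssr(y, y_hat):
    """Residuals (3) and the residual sum of squares for given predictions."""
    resid = y - y_hat            # each residual: actual minus predicted
    ssr = np.sum(resid ** 2)     # unexplained variance (sum of squared residuals)
    return resid, ssr

# toy data: actual values and hypothetical fitted values
y = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 6.0])
resid, ssr = residuals_and_ssr(y, y_hat)
print(ssr)  # 0.25 + 0.25 + 0.0 = 0.5
```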
The main objective of the OLS method is to minimize the
unexplained variance. The OLS optimization problem (objective
function) is defined as follows

arg min_{γ̂_0, γ̂_1} Φ = Σ_{i=1}^{n} ε̂_i²    (4)

where in equation (4), the objective is to choose γ̂_0 and γ̂_1 so as
to minimize the unexplained variance.
Taking the first-order conditions of equation (4) yields the
following results:

∂Φ/∂γ̂_0 → −2 Σ_{i=1}^{n} (y_i − γ̂_0 − γ̂_1 z_i) = 0

∂Φ/∂γ̂_1 → −2 Σ_{i=1}^{n} z_i (y_i − γ̂_0 − γ̂_1 z_i) = 0
Using the first-order conditions, we can solve for the OLS estimators
of the constant and the slope as follows (see the whiteboard for
the mathematical proof):

γ̂_0 = ȳ − γ̂_1 z̄    (5)

γ̂_1 = Σ_{i=1}^{n} (z_i − z̄)(y_i − ȳ) / Σ_{i=1}^{n} (z_i − z̄)²    (6)

The OLS estimators are BLUE when the classical assumptions hold.
Another important assumption is that the variance of the
independent variable cannot be zero.
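Equations (5) and (6) translate directly into code. The sketch below (assuming NumPy; the data are invented for illustration) recovers the coefficients of an exact line:

```python
import numpy as np

def ols_simple(z, y):
    """OLS estimators for y = g0 + g1*z + e, via equations (5) and (6)."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    z_bar, y_bar = z.mean(), y.mean()
    g1 = np.sum((z - z_bar) * (y - y_bar)) / np.sum((z - z_bar) ** 2)  # slope (6)
    g0 = y_bar - g1 * z_bar                                            # constant (5)
    return g0, g1

# data generated from an exact line y = 1 + 2z, so OLS recovers it exactly
z = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * z
g0, g1 = ols_simple(z, y)
print(g0, g1)  # 1.0 2.0
```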
Sampling distribution
The OLS estimators are derived from sample data, so they are
treated as random variables with a sampling distribution. We will
rely on the characteristics of the sampling distribution of the OLS
estimators to make inferences about the unknown coefficients of the
population regression.
Here, we design a Monte Carlo experiment to show that the
sampling distribution of the OLS estimators is normally
distributed for large samples.
First, we define the true population regression function as follows

y = 1 + 3z + ε    (7)

where ε is the error term, which is normally distributed with a zero
mean and a constant variance, and z is an independent variable that
follows a beta distribution. Note that y is a linear combination of
ε, which means that y is expected to be normally distributed.
We randomly draw 2,000 samples of 50 observations each, without
replacement, from a large population of 15,000 observations. We
store the OLS estimators from the 2,000 regressions to compute the
sampling distributions of the constant and the slope.
For large samples, we show that the OLS estimators are normally
distributed and BLUE (see the RStudio example for instructions
on the Monte Carlo simulations).
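The Monte Carlo design above can be sketched as follows. This is a hypothetical stand-in for the RStudio example: it draws fresh samples from the model rather than subsampling a fixed population of 15,000, and the Beta(2, 5) shape and unit error variance are assumptions, not values stated in the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_slopes(n_reps=2000, n_obs=50):
    """Sampling distribution of the OLS slope for y = 1 + 3z + e, equation (7)."""
    slopes = np.empty(n_reps)
    for r in range(n_reps):
        z = rng.beta(2.0, 5.0, size=n_obs)       # z ~ Beta(2, 5): assumed shape
        eps = rng.normal(0.0, 1.0, size=n_obs)   # e ~ N(0, 1): assumed variance
        y = 1.0 + 3.0 * z + eps
        z_c = z - z.mean()
        slopes[r] = np.sum(z_c * (y - y.mean())) / np.sum(z_c ** 2)  # equation (6)
    return slopes

slopes = simulate_slopes()
print(slopes.mean())  # clusters around the true slope of 3
```

A histogram of `slopes` is what the lecture's Monte Carlo figure summarizes: approximately normal and centered on the true parameter.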
Here, we report the true relationship between y and z as follows:
Figure: Population Regression Function
Here, using the Monte Carlo experiment, we show that the sampling
distribution of the OLS estimators is normally distributed for larger
samples.
Figure: Monte Carlo (Simulation)
The sampling distribution of the OLS estimators can be derived
as follows

γ̂_0 ~ N( γ_0 , σ²_ε [ 1/n + z̄² / Σ_{i=1}^{n} (z_i − z̄)² ] )    (8)

γ̂_1 ~ N( γ_1 , σ²_ε / Σ_{i=1}^{n} (z_i − z̄)² )    (9)

When the variance of the population errors is unknown, we can
estimate it by relying on an unbiased estimator of the population
variance, which is a function of the residual variance. Larger
samples provide more efficient estimators, because they reduce the
sampling variance of the OLS estimators.
Statistical inference
We want to use sample data collected for y and z to estimate
the unknown population parameters in equation (1) by relying on
the OLS approach.
We will make an assumption about the true relationship between
y and z and test this assumption by using a test statistic that
helps us determine whether the assumption is statistically valid.
We will rely on the sampling distribution of the OLS estimators to
derive the sampling distribution of the test statistic and test the
given hypothesis about the population parameter.
The sampling distribution of the OLS estimators is normal, as shown
earlier. However, because the variance of the population errors is
unknown, we will rely on the Student t-distribution instead of the
normal distribution to conduct statistical inference.
We can state the null hypothesis that there is no true linear
relationship between y and z, which implies that the true effect of z
on y is equal to zero:

H_0 : γ_1 = 0

Alternatively, the alternative hypothesis can be stated as follows

H_A : γ_1 ≠ 0
We need to choose a confidence level as follows

confidence level = (1 − α)

α denotes the level of significance, or in other words the maximum
amount of risk we are willing to tolerate for making a type-1 error.
Define the test statistic as follows

T = (γ̂_1 − γ_1) / s.e.(γ̂_1) ~ t(n − 2)    (10)

Using the unbiased estimator (mean squared error) of the variance
of the population errors

σ̂²_ε = Σ_{i=1}^{n} ε̂_i² / (n − 2)    (11)

we can compute the estimate of the standard error as follows

s.e.(γ̂_1) = σ̂_ε / √( Σ_{i=1}^{n} (z_i − z̄)² )    (12)

s.e.(γ̂_1) measures the precision of the OLS estimate of the slope.
Compute the critical test statistic (t_c) by relying on the degrees
of freedom (n − 2) and the level of significance (see the Excel file
on Blackboard).

Figure: Two-Tail Test (T-Distribution)

If |T| ≤ t_c, we fail to reject H_0. Alternatively, when |T| > t_c,
we can reject H_0.
The confidence interval provides a range of estimates for the
unknown population parameters. For instance, the confidence
interval for the slope can be computed as follows

CI = γ̂_1 ± s.e.(γ̂_1) · t_c    (13)

We should reject H_0 when zero is not included in the confidence
interval for the parameter.
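Equation (13) is a one-liner in code. Plugging in the slope estimate (0.232), its standard error (0.0960), and the critical value (2.201) from the application later in the lecture shows an interval that excludes zero:

```python
def confidence_interval(g1_hat, se, t_c):
    """Two-sided confidence interval for the slope, equation (13)."""
    return g1_hat - se * t_c, g1_hat + se * t_c

# values taken from the lecture's application section
lo, hi = confidence_interval(0.232, 0.0960, 2.201)
print(lo, hi)  # roughly (0.021, 0.443): zero is excluded, so H0 is rejected
```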
Application (population regression)
Scenario: policymakers in Latin America are interested in
examining the relationship between credit provided by the
financial sector (c) and secondary school enrollment (s) as a
means to improve access to education. They believe that the true
relationship can be modeled as follows

s = γ_0 + γ_1 c + ε    (14)

where ε is the error term. γ_0 and γ_1 are the unknown population
parameters: these parameters are to be estimated by applying the
OLS approach to sample data.
Application (sample data)
Figure: Private Financing & School Enrollment
Application (OLS estimators)
Using sample data, we can compute the following statistics:
Σ_{i=1}^{13} (c_i − c̄)² = 6956.456;
Σ_{i=1}^{13} (c_i − c̄)(s_i − s̄) = 1614.142;
s̄ = 73.06; c̄ = 49.29. We can compute the OLS estimators as
follows

γ̂_1 = Σ_{i=1}^{13} (c_i − c̄)(s_i − s̄) / Σ_{i=1}^{13} (c_i − c̄)² = 0.232

γ̂_0 = s̄ − γ̂_1 c̄ = 61.62
Application (inference)
First, we state the null hypothesis that there is no linear
relationship between s and c as follows

H_0 : γ_1 = 0

then we define the alternative hypothesis as follows

H_A : γ_1 ≠ 0

Second, we choose a 95% confidence level (this is the standard).
Using the Student t-distribution, we can find the five-percent
critical test statistic for a two-tailed test with 11 degrees of
freedom as follows

t_c = 2.201
Next, we have to compute the test statistic (T), which requires an
estimate of the standard error of the OLS slope. First, we compute
the residual variance as follows

σ̂²_ε = Σ_{i=1}^{n} ε̂_i² / (n − 2) = 705.30 / (13 − 2) = 64.118

Using the fact that √σ̂²_ε = 8.007, we can compute the standard
error of the slope as follows

s.e.(γ̂_1) = σ̂_ε / √( Σ_{i=1}^{n} (c_i − c̄)² ) = 8.007 / 83.4053 = 0.0960

The test statistic, which follows a t(n − 2) distribution under H_0,
can be computed as follows

T = 0.232 / 0.0960 = 2.42
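The arithmetic of this application can be reproduced end to end from the summary statistics reported above (a sketch; only the rounded figures quoted in the lecture are used as inputs):

```python
import math

# summary statistics reported for the lecture's sample (n = 13)
S_cc = 6956.456        # sum of (c_i - c_bar)^2
S_cs = 1614.142        # sum of (c_i - c_bar)(s_i - s_bar)
s_bar, c_bar = 73.06, 49.29
ssr, n = 705.30, 13    # residual sum of squares and sample size

g1 = S_cs / S_cc                          # slope estimate: 0.232
g0 = s_bar - g1 * c_bar                   # constant estimate: 61.62
sigma2 = ssr / (n - 2)                    # residual variance (11): 64.118
se = math.sqrt(sigma2) / math.sqrt(S_cc)  # standard error (12): 0.0960
T = g1 / se                               # test statistic (10): 2.42
print(round(g1, 3), round(g0, 2), round(sigma2, 3), round(se, 4), round(T, 2))
```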
Application (conclusion)
Here, |T| > t_c. Therefore, we can reject the null hypothesis
and conclude that there is a statistically significant linear
relationship between domestic credit and secondary school enrollment.
More importantly, a one percentage point increase in the credit rate
is associated with a 0.23 percentage point increase in the school
enrollment rate.
Application (selected issues)
There are omitted variables that affect secondary school enrollment
and that are also correlated with credit provided by the financial
sector (for example, GDP per capita).
Furthermore, the relationship between school enrollment and credit
is not necessarily unidirectional: there is a possibility of reverse
causation.
The omitted-variable issue and the reverse-causation problem imply
that the zero conditional mean assumption is violated, which leads
to biased estimates:

E(γ̂_1) = γ_1 + cov(c, ε) / var(c)
Here, the sample size is relatively small, which makes the OLS
estimates less efficient. Students should rely on large samples to
obtain efficient OLS estimators.
The sample is not necessarily representative of the entire Latin
American region (we need a robust sampling method). This sample
selection bias makes the OLS estimators unreliable and biased.
The domestic credit variable could be subject to measurement error,
which would lead to biased OLS estimates.
Students should test the homoscedasticity condition by analyzing
the pattern of the residuals against the predicted values. This is an
important diagnostic for ensuring that the OLS estimates are
efficient.
Concluding remarks
A linear regression model is useful for analyzing the empirical
relationship between two or more variables. Here, we discussed the
key assumptions (for example, the Gauss-Markov assumptions) of a
simple linear regression model. These assumptions provide a strong
foundation for understanding the multiple linear regression model.
We discussed the inference part of the linear regression model and
provided a practical example to illustrate it.
Lastly, we discussed the key empirical issues that students are likely
to face when estimating a linear regression. Ultimately, this lecture
provides a robust exposure to the fundamentals of linear regression.