
3. Multiple Regression Analysis: Estimation

Welcome to this comprehensive exploration of Multiple Regression Analysis. While simple linear regression provides valuable insights into the relationship between two variables, real-world phenomena are rarely determined by a single factor. Multiple regression allows us to analyze how several independent variables simultaneously influence a dependent variable, providing more realistic and nuanced insights.

This chapter builds upon the foundation of simple linear regression to introduce the powerful framework of multiple regression. We’ll explore the mechanics of Ordinary Least Squares (OLS) estimation in the multiple regression context, understand the crucial concept of ceteris paribus interpretation, and examine the assumptions that underpin the validity of our estimates. Through practical examples using real-world datasets from the Wooldridge package, we’ll demonstrate how multiple regression is applied across economics and behavioral sciences.

Let’s begin by setting up our environment with the necessary libraries.

import numpy as np
import pandas as pd
import patsy as pt
import statsmodels.formula.api as smf
import statsmodels.stats.outliers_influence as smo
import wooldridge as wool
from IPython.display import display

3.1 Multiple Regression in Practice

In multiple regression analysis, we extend the simple regression model to incorporate more than one explanatory variable. This allows us to control for various factors that might influence the dependent variable and isolate the effect of each independent variable, holding others constant.

The general form of a multiple linear regression model is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + u$$

Where:

  • $y$ is the dependent variable (the variable we want to explain).

  • $x_1, x_2, \ldots, x_k$ are the independent variables (also called regressors or explanatory variables) that we believe influence $y$.

  • $\beta_0$ is the intercept, representing the expected value of $y$ when all independent variables are zero.

  • $\beta_1, \beta_2, \ldots, \beta_k$ are the partial regression coefficients (or slope coefficients). Each $\beta_j$ represents the change in $y$ for a one-unit increase in $x_j$, holding all other independent variables constant. This is the crucial ceteris paribus (other things equal) interpretation in multiple regression.

  • $u$ is the error term (or disturbance), representing unobserved factors that also affect $y$. As in simple regression (Chapter 2), we assume $E(u|x_1, \ldots, x_k) = 0$, which implies $E(u) = 0$.

Let’s explore several examples from the textbook to understand how multiple regression is applied in practice.

Example 3.1: Determinants of College GPA

Question: What factors influence a student’s college GPA? We might hypothesize that a student’s high school GPA (hsGPA) and their score on a standardized test like the ACT (ACT) are important predictors of college performance.

Model: We can formulate a multiple regression model to investigate this:

$$\text{colGPA} = \beta_0 + \beta_1 \text{hsGPA} + \beta_2 \text{ACT} + u$$

  • colGPA: College Grade Point Average (dependent variable)

  • hsGPA: High School Grade Point Average (independent variable)

  • ACT: ACT score (independent variable)

We expect $\beta_1 > 0$ and $\beta_2 > 0$, suggesting that higher high school GPAs and ACT scores are associated with higher college GPAs, holding the other factor constant.

gpa1 = wool.data("gpa1")

reg = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1)
results = reg.fit()
results.summary()

Example 3.3: Hourly Wage Equation

Question: What factors determine an individual’s hourly wage? Education, experience, and job tenure are commonly believed to be important determinants.

Model: We can model the logarithm of wage as a function of education, experience, and tenure:

$$\log(\text{wage}) = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{exper} + \beta_3 \text{tenure} + u$$

  • $\log(\text{wage})$: Natural logarithm of hourly wage (dependent variable). Using the log of wage is common in economics, as it often leads to a more linear relationship and allows for percentage-change interpretations of the coefficients.

  • educ: Years of education (independent variable)

  • exper: Years of work experience (independent variable)

  • tenure: Years with current employer (independent variable)

In this model, the coefficients on educ, exper, and tenure (each multiplied by 100) give the approximate percentage change in wage for a one-unit increase in the respective variable, holding the others constant. For example, $100 \cdot \beta_1 \approx \%\Delta\text{wage} / \Delta\text{educ}$.

wage1 = wool.data("wage1")

reg = smf.ols(formula="np.log(wage) ~ educ + exper + tenure", data=wage1)
results = reg.fit()
results.summary()
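Since the coefficients in a log-level model are only approximate percentage effects, it can be useful to also compute the exact percentage change implied by each estimate, $100 \cdot (e^{\hat{\beta}_j} - 1)$. The following short sketch (an illustration added here, not part of the textbook example) refits the wage equation and tabulates both versions:

wage1 = wool.data("wage1")
results_wage = smf.ols(formula="np.log(wage) ~ educ + exper + tenure", data=wage1).fit()

# Approximate (100*beta_hat) vs. exact (100*(exp(beta_hat)-1)) percentage effects on wage
pct_effects = pd.DataFrame(
    {
        "Approximate %": 100 * results_wage.params,
        "Exact %": 100 * (np.exp(results_wage.params) - 1),
    },
).drop(index="Intercept")  # the intercept has no percentage-change interpretation
pct_effects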

Example 3.4: Participation in 401(k) Pension Plans

Question: What factors influence the participation rate in 401(k) pension plans among firms? Let’s consider the firm’s match rate and the age of its 401(k) plan.

Model:

$$\text{prate} = \beta_0 + \beta_1 \text{mrate} + \beta_2 \text{age} + u$$

  • prate: Participation rate in 401(k) plans (percentage of eligible workers participating) (dependent variable)

  • mrate: Firm’s match rate (e.g., if mrate=0.5, firm matches 50 cents for every dollar contributed by employee) (independent variable)

  • age: Age of the firm’s 401(k) plan, in years (independent variable)

We expect $\beta_1 > 0$ because a more generous match rate should encourage participation. The expected sign of $\beta_2$ is less clear a priori, although older, more established plans have had more time to enroll eligible workers, which suggests $\beta_2 > 0$.

k401k = wool.data("401k")

reg = smf.ols(formula="prate ~ mrate + age", data=k401k)
results = reg.fit()
results.summary()

Example 3.5a: Explaining Arrest Records

Question: Can we explain the number of arrests a young man has in 1986 based on his criminal history and employment status?

Model (without average sentence length):

$$\text{narr86} = \beta_0 + \beta_1 \text{pcnv} + \beta_2 \text{ptime86} + \beta_3 \text{qemp86} + u$$

  • narr86: Number of arrests in 1986 (dependent variable)

  • pcnv: Proportion of prior arrests that led to conviction (independent variable)

  • ptime86: Months spent in prison in 1986 (independent variable)

  • qemp86: Number of quarters employed in 1986 (independent variable)

We expect $\beta_1 < 0$: a higher proportion of prior arrests leading to conviction signals a higher probability of punishment, which should deter crime and thus reduce arrests. We also expect $\beta_2 < 0$: months spent in prison in 1986 reduce the time a man is out of prison and able to be arrested. Finally, we expect $\beta_3 < 0$, because employment should reduce criminal activity and therefore arrests.

crime1 = wool.data("crime1")

# model without avgsen:
reg = smf.ols(formula="narr86 ~ pcnv + ptime86 + qemp86", data=crime1)
results = reg.fit()
results.summary()

Example 3.5b: Explaining Arrest Records

Model (with average sentence length): Let’s add another variable, avgsen, the average sentence served from prior convictions, to the model to see if it influences arrests in 1986.

$$\text{narr86} = \beta_0 + \beta_1 \text{pcnv} + \beta_2 \text{avgsen} + \beta_3 \text{ptime86} + \beta_4 \text{qemp86} + u$$

  • avgsen: Average sentence served from prior convictions, in months (independent variable). We expect $\beta_2 < 0$ if longer average sentences deter crime.

crime1 = wool.data("crime1")

# model with avgsen:
reg = smf.ols(formula="narr86 ~ pcnv + avgsen + ptime86 + qemp86", data=crime1)
results = reg.fit()
results.summary()

Example 3.6: Hourly Wage Equation (Simple Regression)

Question: What happens if we omit relevant variables from our wage equation? Let’s revisit the wage equation but only include education as an explanatory variable.

Model (Simple Regression):

$$\log(\text{wage}) = \beta_0 + \beta_1 \text{educ} + u$$

wage1 = wool.data("wage1")

reg = smf.ols(formula="np.log(wage) ~ educ", data=wage1)
results = reg.fit()
results.summary()

3.2 OLS in Matrix Form

While statsmodels handles the calculations for us, it’s crucial to understand the underlying mechanics of Ordinary Least Squares (OLS) estimation, especially in the context of multiple regression. Matrix algebra provides a concise and powerful way to represent and solve the OLS problem for multiple regression.

In matrix form, the multiple regression model can be written as:

$$y = X\beta + u$$

Where:

  • $y$ is an $(n \times 1)$ column vector of the dependent variable observations.

  • $X$ is an $(n \times (k+1))$ matrix of independent variable observations, called the design matrix. The first column of $X$ is typically a column of ones (for the intercept), and the subsequent columns contain the observations on the independent variables $x_1, x_2, \ldots, x_k$.

  • $\beta$ is a $((k+1) \times 1)$ column vector of the unknown regression coefficients $(\beta_0, \beta_1, \ldots, \beta_k)'$.

  • $u$ is an $(n \times 1)$ column vector of the error terms.

The OLS estimator $\hat{\beta}$ that minimizes the sum of squared residuals can be expressed in matrix form as:

$$\hat{\beta} = (X'X)^{-1}X'y$$

Where:

  • $X'$ is the transpose of the matrix $X$.

  • $(X'X)^{-1}$ is the inverse of the matrix product $X'X$.

Let’s demonstrate this matrix calculation using the gpa1 dataset from Example 3.1.

gpa1 = wool.data("gpa1")

# determine sample size & no. of regressors:
n = len(gpa1)
k = 2  # Number of independent variables (hsGPA and ACT)

# Extract dependent variable (college GPA)
y = gpa1["colGPA"]
# Show shape info
shape_info = pd.DataFrame(
    {
        "Variable": ["y (colGPA)"],
        "Shape": [str(y.shape)],
    },
)
shape_info

# Method 1: Manual construction of design matrix X
# Design matrix X = [1, hsGPA, ACT] for each observation
X = pd.DataFrame(
    {
        "const": 1,  # Intercept column (β₀)
        "hsGPA": gpa1["hsGPA"],  # High school GPA (β₁)
        "ACT": gpa1["ACT"],  # ACT test score (β₂)
    },
)

# Method 2: Using patsy for automatic formula-based design matrix
# More convenient for complex models with interactions/polynomials
y2, X2 = pt.dmatrices(
    "colGPA ~ hsGPA + ACT",  # R-style formula
    data=gpa1,
    return_type="dataframe",
)

# Display design matrix structure
matrix_info = pd.DataFrame(
    {
        "Description": ["Design matrix dimensions", "Interpretation"],
        "Value": [f"{X.shape}", "(n observations × k+1 variables)"],
    },
)
display(matrix_info)
X.head()

The code above constructs the $X$ matrix and $y$ vector from the gpa1 data. The patsy library provides a convenient way to create design matrices directly from formulas, which is often more efficient for complex models.

# Calculate OLS estimates using the matrix formula: β̂ = (X'X)⁻¹X'y

# Step 1: Convert to numpy arrays for matrix operations
X_array = np.array(X)  # Design matrix as numpy array (n × k+1)
y_array = np.array(y).reshape(n, 1)  # Dependent variable as column vector (n × 1)

# Step 2: Calculate intermediate matrices for clarity
XtX = X_array.T @ X_array  # X'X matrix (k+1 × k+1)
Xty = X_array.T @ y_array  # X'y vector (k+1 × 1)

# Display matrix operation results
matrix_ops = pd.DataFrame(
    {
        "Operation": ["X'X matrix", "X'X symmetry check", "X'y vector"],
        "Shape": [str(XtX.shape), "-", str(Xty.shape)],
        "Result": ["(k+1 × k+1)", str(np.allclose(XtX, XtX.T)), "(k+1 × 1)"],
    },
)
matrix_ops

# Step 3: Apply OLS formula
XtX_inverse = np.linalg.inv(XtX)  # (X'X)⁻¹
beta_estimates = XtX_inverse @ Xty  # β̂ = (X'X)⁻¹X'y

# Display results
coef_results = pd.DataFrame(
    {
        "Variable": ["Intercept", "hsGPA", "ACT"],
        "Coefficient (β̂)": [beta_estimates[i, 0] for i in range(3)],
    },
)
coef_results

This code performs the matrix operations to calculate $\hat{\beta}$. The resulting beta_estimates should match the coefficients we obtained from statsmodels earlier.

After estimating $\hat{\beta}$, we can calculate the residuals $\hat{u}$:

$$\hat{u} = y - X\hat{\beta}$$

And the estimator for the error variance $\sigma^2$:

$$\hat{\sigma}^2 = \frac{1}{n-k-1} \hat{u}'\hat{u} = \frac{\text{SSR}}{n-k-1}$$

The denominator $(n-k-1)$ is the degrees of freedom in multiple regression, where $n$ is the sample size and $(k+1)$ is the number of parameters estimated (including the intercept). The square root of $\hat{\sigma}^2$ is the Standard Error of the Regression (SER).

Key Insight: The division by $n-k-1$ (rather than $n$) corrects for the degrees of freedom lost in estimating $k+1$ parameters, making $\hat{\sigma}^2$ an unbiased estimator of $\sigma^2$ under MLR.1-MLR.5.

# residuals, estimated variance of u and SER:
u_hat = y.values.reshape(-1, 1) - X.values @ beta_estimates  # Residuals as an (n x 1) column vector
sigsq_hat = (u_hat.T @ u_hat).item() / (n - k - 1)  # Estimated error variance (scalar)
SER = np.sqrt(sigsq_hat)  # Standard Error of Regression
SER  # Display SER
np.float64(0.3403157569643908)

Finally, the estimated variance-covariance matrix of the OLS estimator $\hat{\beta}$ is given by:

$$\widehat{\text{var}(\hat{\beta})} = \hat{\sigma}^2 (X'X)^{-1}$$

The standard errors for each coefficient are the square roots of the diagonal elements of this variance-covariance matrix.

# estimated variance-covariance matrix of beta_hat and standard errors:
Vbeta_hat = sigsq_hat * np.linalg.inv(
    X.values.T @ X.values,
)  # Variance-covariance matrix
se = np.sqrt(np.diagonal(Vbeta_hat))  # Standard errors (diagonal elements' square root)
se  # Display standard errors
array([0.34082212, 0.09581292, 0.01077719])

The manually calculated coefficients (beta_estimates), SER, and standard errors (se) should be very close (or identical, up to rounding) to the values reported in the results.summary() output from statsmodels for Example 3.1. This demonstrates that statsmodels uses the matrix-based OLS formulas under the hood.
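As a final cross-check, we can refit Example 3.1 with statsmodels and place the manual results next to its output. This comparison cell is an addition to the example; it assumes beta_estimates, se, and gpa1 from the cells above are still in memory:

results_gpa = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1).fit()

# Manual matrix-based estimates vs. statsmodels (coefficients and standard errors)
comparison = pd.DataFrame(
    {
        "Manual coefficient": beta_estimates.flatten(),
        "Statsmodels coefficient": results_gpa.params.values,
        "Manual SE": se,
        "Statsmodels SE": results_gpa.bse.values,
    },
    index=["Intercept", "hsGPA", "ACT"],
)
comparison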

3.4 Ceteris Paribus Interpretation and Omitted Variable Bias

A key advantage of multiple regression is its ability to provide ceteris paribus interpretations of the coefficients. “Ceteris paribus” is Latin for “other things being equal” or “holding other factors constant.” In the context of multiple regression, the coefficient on a particular independent variable represents the effect of that variable on the dependent variable while holding all other included independent variables constant.

However, if we omit a relevant variable from our regression model, and this omitted variable is correlated with the included independent variables, we can encounter omitted variable bias (OVB). This means that the estimated coefficients on the included variables will be biased and inconsistent, and they will no longer have the desired ceteris paribus interpretation with respect to the omitted variable.

Why Omitted Variable Bias Violates MLR.4: When we omit a relevant variable $x_k$, it becomes part of the error term: $u' = u + \beta_k x_k$. If the omitted $x_k$ is correlated with any included $x_j$, then $E(u'|x_1, \ldots, x_{k-1}) \neq 0$, violating the zero conditional mean assumption. This makes OLS biased and inconsistent.

Let’s revisit the college GPA example to illustrate omitted variable bias. Suppose the “true” model is:

$$\text{colGPA} = \beta_0 + \beta_1 \text{hsGPA} + \beta_2 \text{ACT} + u$$

But we mistakenly estimate a simple regression model, omitting hsGPA:

$$\text{colGPA} = \gamma_0 + \gamma_1 \text{ACT} + v$$

If hsGPA is correlated with ACT (which is likely: students with higher high school GPAs tend to score higher on standardized tests), then the estimated coefficient $\hat{\gamma}_1$ in the simple regression will be biased. It captures not only the direct effect of ACT on colGPA but also part of the effect of hsGPA on colGPA, because hsGPA moves together with ACT.

Let’s see this empirically. First, we estimate the “full” model (including both hsGPA and ACT):

gpa1 = wool.data("gpa1")

# parameter estimates for full model:
reg = smf.ols(
    formula="colGPA ~ ACT + hsGPA",
    data=gpa1,
)  # Order of regressors doesn't matter in formula
results = reg.fit()
b = results.params  # Extract estimated coefficients
b  # Display coefficients from full model
Intercept    1.286328
ACT          0.009426
hsGPA        0.453456
dtype: float64

Now, let’s consider the relationship between the included variable (ACT) and the omitted variable (hsGPA). We can regress the omitted variable (hsGPA) on the included variable (ACT):

# relation between regressors (hsGPA on ACT):
reg_delta = smf.ols(formula="hsGPA ~ ACT", data=gpa1)
results_delta = reg_delta.fit()
delta_tilde = results_delta.params  # Extract coefficient of ACT in this regression
delta_tilde  # Display coefficients from regression of hsGPA on ACT
Intercept    2.462537
ACT          0.038897
dtype: float64

The slope $\delta_{\text{ACT}}$ from this regression represents how much hsGPA changes, on average, for a one-unit change in ACT; it captures the linear association between hsGPA and ACT.

The omitted variable bias formula provides an approximation for the bias in the simple regression coefficient $\hat{\gamma}_1$ when we omit hsGPA. In this case, the bias in the coefficient of ACT (when hsGPA is omitted) is approximately:

$$\text{Bias}(\hat{\gamma}_1) \approx \beta_1 \times \delta_{\text{ACT}}$$

Where:

  • $\beta_1$ is the coefficient on the omitted variable (hsGPA) in the full model.

  • $\delta_{\text{ACT}}$ is the coefficient on the included variable (ACT) in the regression of the omitted variable (hsGPA) on the included variable (ACT).

Let's calculate this approximate bias and see how it relates to the difference between the coefficient of ACT in the full model ($\beta_2$) and the coefficient of ACT in the simple model ($\gamma_1$).

# omitted variables formula for b1_tilde (approximate bias in ACT coefficient when hsGPA is omitted):
b1_tilde = b["ACT"] + b["hsGPA"] * delta_tilde["ACT"]  # Applying the bias formula
b1_tilde  # Display approximate biased coefficient of ACT
np.float64(0.027063973943178377)

Finally, let’s estimate the simple regression model (omitting hsGPA) and see the actual coefficient of ACT:

# actual regression with hsGPA omitted (simple regression):
reg_om = smf.ols(formula="colGPA ~ ACT", data=gpa1)
results_om = reg_om.fit()
b_om = results_om.params  # Extract coefficient of ACT from simple regression
b_om  # Display coefficient of ACT in simple regression
Intercept    2.402979
ACT          0.027064
dtype: float64

Comparing b_om["ACT"] (the coefficient on ACT in the simple regression) with b["ACT"] (the coefficient on ACT in the full regression), we can see that they differ. Furthermore, b1_tilde, the coefficient implied by the omitted variable formula, matches b_om["ACT"]; for OLS this relationship holds exactly in the sample.
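We can verify this identity directly from the estimates already in memory (b, delta_tilde, and b_om from the cells above); this check is an addition to the textbook example:

# Exact sample identity: gamma_tilde_ACT = beta_hat_ACT + beta_hat_hsGPA * delta_tilde_ACT
implied = b["ACT"] + b["hsGPA"] * delta_tilde["ACT"]

identity_check = pd.DataFrame(
    {
        "Quantity": [
            "ACT coefficient, simple regression",
            "Implied by omitted variable formula",
            "Difference",
        ],
        "Value": [b_om["ACT"], implied, b_om["ACT"] - implied],
    },
)
identity_check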

Conclusion on Omitted Variable Bias: Omitting a relevant variable like hsGPA that is correlated with an included variable like ACT can lead to biased estimates. In this case, the coefficient on ACT in the simple regression is larger than in the multiple regression, likely because it is picking up some of the positive effect of hsGPA on colGPA. This highlights the importance of including all relevant variables in a regression model to obtain unbiased and consistent estimates and to achieve correct ceteris paribus interpretations.

3.3 The Gauss-Markov Assumptions

To establish the statistical properties of the OLS estimators in multiple regression, we need to specify the assumptions under which these properties hold. The Gauss-Markov assumptions provide the foundation for understanding when OLS produces the Best Linear Unbiased Estimators (BLUE). Let’s examine each assumption:

The Five Gauss-Markov Assumptions (MLR.1 - MLR.5)

These assumptions extend the Simple Linear Regression (SLR) assumptions from Chapter 2 to the multiple regression context.

MLR.1: Linear in Parameters (extends SLR.1) The population model is linear in the parameters:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$

This assumption requires that the dependent variable is a linear function of the parameters $\beta_j$, though the relationship can be nonlinear in the variables themselves (e.g., $x_2 = x_1^2$, $x_3 = \log(x_1)$). The key is linearity in parameters, not necessarily in variables.
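For example, a wage equation with a logged dependent variable and a squared experience term is nonlinear in the variables but still linear in the parameters, so it can be estimated by OLS. A brief sketch (the quadratic specification here is for illustration only, not one of the textbook examples above):

wage1 = wool.data("wage1")

# log(wage) = b0 + b1*educ + b2*exper + b3*exper^2 + u  is still linear in the betas
reg_quad = smf.ols(formula="np.log(wage) ~ educ + exper + I(exper**2)", data=wage1)
results_quad = reg_quad.fit()
results_quad.params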

MLR.2: Random Sampling (extends SLR.2) We have a random sample of size $n$, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}$, drawn from the population model.

This ensures our sample is representative of the population we want to study and that observations are independent across $i$.

MLR.3: No Perfect Collinearity (extends SLR.3) In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

More precisely: (i) each $x_j$ has sample variation ($\widehat{\text{Var}}(x_j) > 0$), and (ii) no $x_j$ can be written as an exact linear combination of the other independent variables. This assumption ensures that we can obtain unique OLS estimates. Perfect multicollinearity would make it impossible to isolate the individual effects of the explanatory variables. Note that high (but not perfect) correlation among regressors is allowed, though it increases standard errors.
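To see what perfect collinearity does to the OLS formula, consider adding a regressor that is an exact multiple of another: $X'X$ then has deficient rank and cannot be inverted, so no unique estimates exist. A minimal sketch with an artificially duplicated column (not a model from the textbook):

gpa1 = wool.data("gpa1")

# ACT_doubled is an exact linear function of ACT, so the design matrix has deficient rank
X_collinear = pd.DataFrame(
    {
        "const": 1.0,
        "ACT": gpa1["ACT"],
        "ACT_doubled": 2 * gpa1["ACT"],
    },
)

XtX_collinear = X_collinear.values.T @ X_collinear.values
rank_report = pd.DataFrame(
    {
        "Columns in X": [X_collinear.shape[1]],
        "Rank of X'X": [np.linalg.matrix_rank(XtX_collinear)],  # rank < columns signals exact collinearity
    },
)
rank_report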

MLR.4: Zero Conditional Mean (extends SLR.4) The error term $u$ has an expected value of zero given any values of the independent variables:

$$E(u|x_1, x_2, \ldots, x_k) = 0$$

This is the most crucial assumption for unbiasedness of the OLS estimators. It implies that all factors in $u$ are, on average, unrelated to $x_1, \ldots, x_k$, and it implies $E(u) = 0$ and $\text{Cov}(x_j, u) = 0$ for all $j = 1, \ldots, k$. If any $x_j$ is correlated with $u$, the OLS estimators will be biased and inconsistent. Violations typically arise from omitted variables, measurement error in regressors, or simultaneity.

MLR.5: Homoscedasticity (extends SLR.5) The error term $u$ has the same variance given any values of the independent variables:

$$\text{Var}(u|x_1, x_2, \ldots, x_k) = \sigma^2$$

This means the variance of the error term is constant across all combinations of explanatory variables. Homoscedasticity is required for the efficiency (BLUE) property and for the usual OLS standard errors to be valid. Under MLR.1-MLR.4 alone (without homoscedasticity), OLS estimators remain unbiased and consistent, but they are not BLUE, and standard errors must be corrected (e.g., using robust, heteroscedasticity-consistent standard errors, as discussed in Chapter 8).
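When homoscedasticity is questionable, statsmodels can report heteroscedasticity-robust standard errors through the cov_type argument of fit(). The following sketch compares the usual and HC1-robust standard errors for the college GPA model (robust inference is covered in Chapter 8):

gpa1 = wool.data("gpa1")
reg = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1)

results_usual = reg.fit()  # standard errors that assume homoscedasticity (MLR.5)
results_robust = reg.fit(cov_type="HC1")  # heteroscedasticity-robust (HC1) standard errors

pd.DataFrame(
    {
        "Usual SE": results_usual.bse,
        "Robust SE (HC1)": results_robust.bse,
    },
)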

Properties Under the Gauss-Markov Assumptions

Theorem 3.1: Unbiasedness of OLS (MLR.1-MLR.4) Under assumptions MLR.1 through MLR.4 (linearity, random sampling, no perfect collinearity, and zero conditional mean), the OLS estimators are unbiased:

$$E(\hat{\beta}_j) = \beta_j \quad \text{for } j = 0, 1, 2, \ldots, k$$

This means that on average, across repeated random samples from the same population, the OLS estimates equal the true population parameters. Crucially, homoscedasticity (MLR.5) is not required for unbiasedness—only the first four assumptions are needed.
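A small Monte Carlo simulation can illustrate this unbiasedness result: if we generate many samples from a population that satisfies MLR.1-MLR.4 with known coefficients, the OLS estimates average out to the true values. The parameter values and data-generating process below are chosen arbitrarily for illustration:

# Monte Carlo illustration of Theorem 3.1 (unbiasedness of OLS)
rng = np.random.default_rng(12345)

n_sims, n = 1000, 200
beta_true = np.array([1.0, 0.5, -0.3])  # true (beta_0, beta_1, beta_2), chosen arbitrarily

estimates = np.empty((n_sims, 3))
for s in range(n_sims):
    x1 = rng.normal(0, 1, n)
    x2 = 0.5 * x1 + rng.normal(0, 1, n)  # x1 and x2 are correlated, but not perfectly
    u = rng.normal(0, 1, n)  # error drawn independently of x1 and x2, so MLR.4 holds
    y_sim = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + u

    X_sim = np.column_stack([np.ones(n), x1, x2])
    estimates[s] = np.linalg.solve(X_sim.T @ X_sim, X_sim.T @ y_sim)  # OLS via normal equations

pd.DataFrame(
    {
        "True beta": beta_true,
        "Average OLS estimate": estimates.mean(axis=0),
    },
    index=["Intercept", "x1", "x2"],
)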

Theorem 3.2: Variance of OLS Estimators (MLR.1-MLR.5) Under assumptions MLR.1 through MLR.5 (including homoscedasticity), the variance-covariance matrix of the OLS estimators $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$, conditional on the sample values of $\mathbf{X}$, is:

$$\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$$

where $\sigma^2 = \text{Var}(u|x_1, \ldots, x_k)$ is the constant conditional variance of the error term. The diagonal elements of this matrix are the variances of the individual OLS estimators, and the off-diagonal elements are the covariances. Standard errors are the square roots of the diagonal elements. This result allows us to conduct valid statistical inference (hypothesis tests, confidence intervals) when homoscedasticity holds.
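In statsmodels, the estimated version of this matrix (with $\hat{\sigma}^2$ in place of the unknown $\sigma^2$) is returned by the cov_params() method of a fitted model, and its diagonal reproduces the squared standard errors. A short check using the college GPA model:

gpa1 = wool.data("gpa1")
results_gpa = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1).fit()

V_hat = results_gpa.cov_params()  # estimated variance-covariance matrix of the coefficients
se_from_V = np.sqrt(np.diag(V_hat))  # standard errors are the square roots of the diagonal

pd.DataFrame(
    {
        "SE from cov_params()": se_from_V,
        "SE from results.bse": results_gpa.bse.values,
    },
    index=V_hat.index,
)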

The Gauss-Markov Theorem

Theorem 3.3: Gauss-Markov Theorem (MLR.1-MLR.5) Under the Gauss-Markov assumptions MLR.1 through MLR.5, the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the Best Linear Unbiased Estimators (BLUE) of $\beta_0, \beta_1, \ldots, \beta_k$.

“Best” means that among all linear unbiased estimators (estimators that are linear functions of $y$), OLS has the smallest variance for each coefficient. More precisely, for any other linear unbiased estimator $\tilde{\beta}_j$ of $\beta_j$, we have $\text{Var}(\hat{\beta}_j) \leq \text{Var}(\tilde{\beta}_j)$. This theorem provides the fundamental justification for using OLS in linear regression analysis under the classical assumptions.

Intuition Behind the Gauss-Markov Theorem: The OLS estimator achieves minimum variance among linear unbiased estimators because:

  1. Orthogonality principle: OLS residuals are orthogonal to all regressors, ensuring no systematic patterns remain

  2. Efficient use of information: OLS optimally weights observations based on the variation in $X$

  3. Homoscedasticity is crucial: equal error variances justify equal weighting; with heteroscedasticity, weighted least squares (WLS) would be more efficient

Important Notes:

  • BLUE property requires all five assumptions, including homoscedasticity (MLR.5)

  • Unbiasedness requires only MLR.1-MLR.4

  • Consistency requires even weaker conditions than unbiasedness (see Chapter 5 on asymptotics)

3.5 Standard Errors, Multicollinearity, and VIF

In multiple regression, we not only estimate coefficients but also need to assess their precision. Standard errors (SEs) of the estimated coefficients are crucial for conducting hypothesis tests and constructing confidence intervals. Larger standard errors indicate less precise estimates.

The standard error of a coefficient in multiple regression depends on several factors, including:

  • Sample size (n): Larger sample sizes generally lead to smaller standard errors (more precision).

  • Error variance ($\sigma^2$): Higher error variance (more noise in the data) leads to larger standard errors.

  • Total sample variation in the independent variable $x_j$: More variation in $x_j$ (holding other regressors constant) leads to a smaller standard error for $\hat{\beta}_j$.

  • Correlation among independent variables (multicollinearity): Higher correlation among the independent variables generally leads to larger standard errors.

Multicollinearity arises when two or more independent variables in a regression model are highly linearly correlated. While multicollinearity does not bias the OLS coefficient estimates, it increases their standard errors, making it harder to precisely estimate the individual effects of the correlated variables and potentially leading to statistically insignificant coefficients even if the variables are truly important.

The Variance Inflation Factor (VIF) is a common measure of multicollinearity. For each independent variable $x_j$, the VIF is calculated as:

$$VIF_j = \frac{1}{1 - R_j^2}$$

Where $R_j^2$ is the R-squared from regressing $x_j$ on all the other independent variables in the model.

  • A VIF of 1 indicates no multicollinearity.

  • VIFs greater than 1 indicate increasing multicollinearity.

  • As a rule of thumb, VIFs greater than 10 are often considered to indicate high multicollinearity, although there is no strict cutoff.

Let’s illustrate the calculation of standard errors and VIF using the gpa1 example again.

gpa1 = wool.data("gpa1")

# full estimation results including automatic SE from statsmodels:
reg = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1)
results = reg.fit()

# Extract SER (Standard Error of Regression) directly from results object:
SER = np.sqrt(
    results.mse_resid,
)  # mse_resid is Mean Squared Error of Residuals (estimated sigma^2)

# Calculate VIF for hsGPA to assess multicollinearity
# VIF measures how much the variance of β̂_hsGPA is inflated due to correlation with ACT

# Step 1: Auxiliary regression - regress hsGPA on other predictors (just ACT here)
auxiliary_regression = smf.ols(formula="hsGPA ~ ACT", data=gpa1)
auxiliary_results = auxiliary_regression.fit()
R2_auxiliary = auxiliary_results.rsquared  # R² from auxiliary regression

# Step 2: Calculate VIF using formula: VIF = 1 / (1 - R²)
VIF_hsGPA = 1 / (1 - R2_auxiliary)

# VIF Calculation for hsGPA
vif_results = pd.DataFrame(
    {
        "Metric": [
            "R² from auxiliary regression",
            "VIF calculation",
            "VIF for hsGPA",
            "Interpretation",
        ],
        "Value": [
            f"{R2_auxiliary:.4f}",
            f"1/(1-{R2_auxiliary:.4f})",
            f"{VIF_hsGPA:.2f}",
            f"Variance inflated by {VIF_hsGPA:.2f}x",
        ],
    },
)
vif_results

The VIF for hsGPA (and similarly for ACT) will quantify the extent to which the variance of its estimated coefficient is inflated due to its correlation with the other independent variable (ACT).

Now, let’s manually calculate the standard error of the coefficient on hsGPA using a formula that incorporates the VIF. A simplified formula for the standard error of $\hat{\beta}_j$ (the coefficient on $x_j$) in multiple regression, highlighting the role of VIF, can be expressed as (under certain simplifying assumptions about variable scaling):

$$SE(\hat{\beta}_j) \approx \frac{1}{\sqrt{n}} \cdot \frac{SER}{\text{sd}(x_j)} \cdot \sqrt{VIF_j}$$

Where:

  • $n$ is the sample size.

  • $SER$ is the Standard Error of the Regression.

  • $\text{sd}(x_j)$ is the sample standard deviation of $x_j$.

  • $VIF_j$ is the Variance Inflation Factor for $x_j$.

This formula illustrates how the standard error is directly proportional to $\sqrt{VIF_j}$: higher VIFs lead to larger standard errors.

# Manual calculation of SE of hsGPA coefficient using VIF:
n = results.nobs  # Sample size
sdx = np.std(gpa1["hsGPA"], ddof=1) * np.sqrt(
    (n - 1) / n,
)  # Standard deviation of hsGPA using the 1/n (ddof=0) formula, as used in the SE expression
SE_hsGPA = 1 / np.sqrt(n) * SER / sdx * np.sqrt(VIF_hsGPA)  # Approximate SE calculation

# Display comparison of manual vs statsmodels standard errors
se_comparison = pd.DataFrame(
    {
        "Method": ["Manual Calculation", "Statsmodels"],
        "SE for hsGPA": [SE_hsGPA, results.bse["hsGPA"]],
    },
)
se_comparison

Compare the manually calculated SE_hsGPA with the standard error for hsGPA reported in the results.summary() output. They should be reasonably close.

For models with more than two independent variables, we can use the variance_inflation_factor function from statsmodels.stats.outliers_influence to easily calculate VIFs for all regressors.

wage1 = wool.data("wage1")

# extract matrices using patsy for wage equation with educ, exper, tenure:
y, X = pt.dmatrices(
    "np.log(wage) ~ educ + exper + tenure",
    data=wage1,
    return_type="dataframe",
)

# get VIFs for all regressors (including intercept):
K = X.shape[1]  # Number of columns in X (including intercept)
VIF = np.empty(K)  # Initialize an array to store VIFs
for i in range(K):
    VIF[i] = smo.variance_inflation_factor(
        X.values,
        i,
    )  # Calculate VIF for each regressor

# VIFs for independent variables only (excluding intercept):
VIF_no_intercept = VIF[1:]  # Slice VIF array to exclude intercept's VIF (index 0)
variable_names = X.columns[1:]  # Get variable names excluding intercept
vif_df = pd.DataFrame(
    {"Variable": variable_names, "VIF": VIF_no_intercept},
)  # Create DataFrame for better presentation

vif_df  # Display VIFs for independent variables

In summary, standard errors quantify the uncertainty in our coefficient estimates. Multicollinearity, a condition of high correlation among independent variables, can inflate standard errors, reducing the precision of our estimates. VIF is a useful tool for detecting and assessing the severity of multicollinearity in multiple regression models.

Chapter Summary

This chapter has provided a comprehensive introduction to multiple regression analysis, moving beyond the simple regression model to examine relationships between a dependent variable and multiple independent variables simultaneously. Through practical examples and theoretical foundations, we have covered the essential concepts and techniques that form the backbone of econometric analysis.

Key Concepts Mastered

Multiple Regression Framework: We learned why multiple regression is necessary for real-world analysis, where outcomes are typically influenced by several factors. The general multiple regression model allows us to examine ceteris paribus effects - the impact of each variable while holding others constant.

OLS Estimation and Interpretation: We explored how Ordinary Least Squares extends to multiple regression, both conceptually and through matrix algebra. Each coefficient in multiple regression represents the partial effect of its corresponding variable, providing more nuanced and reliable insights than simple regression.

The Gauss-Markov Assumptions: We examined the five key assumptions (MLR.1-MLR.5) that ensure OLS estimators are Best Linear Unbiased Estimators (BLUE). Understanding when these assumptions hold - and when they are violated - is crucial for proper model specification and interpretation.

Omitted Variable Bias: Through examples like the wage equations, we demonstrated how excluding relevant variables can lead to biased estimates, emphasizing the importance of careful model specification.

Statistical Properties: We covered the calculation and interpretation of standard errors, the role of multicollinearity, and tools like the Variance Inflation Factor (VIF) for assessing the precision of our estimates.

Practical Applications

Throughout this chapter, we applied multiple regression to diverse economic and social phenomena:

  • Educational outcomes (college GPA determinants)

  • Labor economics (wage determination)

  • Public policy (401k participation)

  • Criminology (factors affecting arrests)

  • Political economy (campaign spending and voting)

These examples illustrate the versatility of multiple regression across different fields and research questions.

Looking Forward

The concepts and techniques mastered in this chapter form the foundation for more advanced topics in econometrics. Understanding multiple regression estimation, interpretation, and assumptions prepares you for hypothesis testing, model building, and addressing more complex econometric challenges.

The ability to properly specify, estimate, and interpret multiple regression models is essential for empirical research in economics, business, and the social sciences. The tools and insights developed here will serve as building blocks for the more sophisticated econometric techniques explored in subsequent chapters.