This notebook delves into the estimation of multiple regression models. Building upon the foundation of simple linear regression, we will explore how to analyze the relationship between a dependent variable and multiple independent variables simultaneously. This is crucial in real-world scenarios where outcomes are rarely determined by a single factor. We’ll use Python libraries like statsmodels and pandas, along with datasets from the wooldridge package, to illustrate these concepts with practical examples.
import numpy as np
import pandas as pd
import patsy as pt
import statsmodels.formula.api as smf
import statsmodels.stats.outliers_influence as smo
import wooldridge as wool
3.1 Multiple Regression in Practice
In multiple regression analysis, we extend the simple regression model to incorporate more than one explanatory variable. This allows us to control for various factors that might influence the dependent variable and isolate the effect of each independent variable, holding others constant.
The general form of a multiple linear regression model is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

Where:
- $y$ is the dependent variable (the variable we want to explain).
- $x_1, x_2, \dots, x_k$ are the independent variables (or regressors, explanatory variables) that we believe influence $y$.
- $\beta_0$ is the intercept, representing the expected value of $y$ when all independent variables are zero.
- $\beta_1, \beta_2, \dots, \beta_k$ are the partial regression coefficients. Each $\beta_j$ represents the change in $y$ for a one-unit increase in $x_j$, holding all other independent variables constant. This is the crucial ceteris paribus interpretation in multiple regression.
- $u$ is the error term (or disturbance), representing unobserved factors that also affect $y$.
Let’s explore several examples from the textbook to understand how multiple regression is applied in practice.
Example 3.1: Determinants of College GPA

Question: What factors influence a student’s college GPA? We might hypothesize that a student’s high school GPA (hsGPA) and their score on a standardized test like the ACT (ACT) are important predictors of college performance.

Model: We can formulate a multiple regression model to investigate this:

$$\text{colGPA} = \beta_0 + \beta_1 \text{hsGPA} + \beta_2 \text{ACT} + u$$

- colGPA: College Grade Point Average (dependent variable)
- hsGPA: High School Grade Point Average (independent variable)
- ACT: ACT score (independent variable)

We expect $\beta_1 > 0$ and $\beta_2 > 0$, suggesting that higher high school GPA and ACT scores are associated with higher college GPA, holding the other factor constant.
gpa1 = wool.data("gpa1")
reg = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: colGPA R-squared: 0.176
Model: OLS Adj. R-squared: 0.164
Method: Least Squares F-statistic: 14.78
Date: Mon, 10 Feb 2025 Prob (F-statistic): 1.53e-06
Time: 16:49:36 Log-Likelihood: -46.573
No. Observations: 141 AIC: 99.15
Df Residuals: 138 BIC: 108.0
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.2863 0.341 3.774 0.000 0.612 1.960
hsGPA 0.4535 0.096 4.733 0.000 0.264 0.643
ACT 0.0094 0.011 0.875 0.383 -0.012 0.031
==============================================================================
Omnibus: 3.056 Durbin-Watson: 1.885
Prob(Omnibus): 0.217 Jarque-Bera (JB): 2.469
Skew: 0.199 Prob(JB): 0.291
Kurtosis: 2.488 Cond. No. 298.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results:
The results.summary() output from statsmodels provides a wealth of information. Let’s focus on the estimated coefficients:
- Intercept ($\hat{\beta}_0$): The estimated intercept is approximately 1.286. This is the predicted college GPA when both high school GPA and ACT score are zero, which is not practically meaningful in this context but is a necessary part of the model.
- hsGPA ($\hat{\beta}_1$): The estimated coefficient for hsGPA is approximately 0.453. This means that, holding ACT score constant, a one-point increase in high school GPA is associated with a 0.453-point increase in college GPA.
- ACT ($\hat{\beta}_2$): The estimated coefficient for ACT is approximately 0.0094. This means that, holding high school GPA constant, a one-point increase in ACT score is associated with only a 0.0094-point increase in college GPA, a small effect that is not statistically significant here (p = 0.383).

Key takeaway: Multiple regression allows us to examine the effect of each variable while controlling for the others. For instance, the coefficient on hsGPA (0.453) is the estimated effect of hsGPA after accounting for the influence of ACT.
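To make the ceteris paribus interpretation concrete, we can compare the fitted model’s predictions for two hypothetical students who differ only in high school GPA. This is just an illustrative sketch; the student profiles below are made-up values, not observations from the dataset.

# Illustrative (hypothetical) profiles: same ACT score, hsGPA differs by one point
new_students = pd.DataFrame({"hsGPA": [3.0, 4.0], "ACT": [25, 25]})
predictions = results.predict(new_students)
print(f"Predicted colGPA:\n{predictions}\n")
# Because ACT is held fixed, the difference in predictions equals the hsGPA coefficient (about 0.453):
print(f"Difference in predictions: {predictions[1] - predictions[0]:.4f}")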
Example 3.3: Hourly Wage Equation

Question: What factors determine an individual’s hourly wage? Education, experience, and job tenure are commonly believed to be important determinants.

Model: We can model the logarithm of wage as a function of education, experience, and tenure:

$$\log(\text{wage}) = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{exper} + \beta_3 \text{tenure} + u$$

- log(wage): Natural logarithm of hourly wage (dependent variable). Using the log of wage is common in economics as it often leads to a more linear relationship and allows for percentage change interpretations of coefficients.
- educ: Years of education (independent variable)
- exper: Years of work experience (independent variable)
- tenure: Years with current employer (independent variable)

In this model, the coefficients on educ, exper, and tenure represent the approximate proportional change in wage for a one-unit increase in each respective variable, holding the others constant. For example, $\%\Delta\text{wage} \approx 100 \cdot \beta_1 \, \Delta\text{educ}$.
wage1 = wool.data("wage1")
reg = smf.ols(formula="np.log(wage) ~ educ + exper + tenure", data=wage1)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.316
Model: OLS Adj. R-squared: 0.312
Method: Least Squares F-statistic: 80.39
Date: Mon, 10 Feb 2025 Prob (F-statistic): 9.13e-43
Time: 16:49:36 Log-Likelihood: -313.55
No. Observations: 526 AIC: 635.1
Df Residuals: 522 BIC: 652.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.2844 0.104 2.729 0.007 0.080 0.489
educ 0.0920 0.007 12.555 0.000 0.078 0.106
exper 0.0041 0.002 2.391 0.017 0.001 0.008
tenure 0.0221 0.003 7.133 0.000 0.016 0.028
==============================================================================
Omnibus: 11.534 Durbin-Watson: 1.769
Prob(Omnibus): 0.003 Jarque-Bera (JB): 20.941
Skew: 0.021 Prob(JB): 2.84e-05
Kurtosis: 3.977 Cond. No. 135.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results:
- educ ($\hat{\beta}_1$): The coefficient for educ is approximately 0.092. This suggests that, holding experience and tenure constant, an additional year of education is associated with an approximate 9.2% increase in hourly wage.
- exper ($\hat{\beta}_2$): The coefficient for exper is approximately 0.004. Holding education and tenure constant, an additional year of experience is associated with an approximate 0.4% increase in hourly wage. The effect of experience is much smaller than that of education in this model.
- tenure ($\hat{\beta}_3$): The coefficient for tenure is approximately 0.022. Holding education and experience constant, an additional year of tenure with the current employer is associated with an approximate 2.2% increase in hourly wage. Tenure appears to have a larger impact than general experience in this model, possibly reflecting firm-specific skills or returns to seniority.
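The percentage interpretation of a log-level coefficient is only an approximation, and it becomes less accurate as the coefficient grows. Below is a small sketch of the exact calculation, $100 \cdot (\exp(\hat{\beta}_j) - 1)$, using the fitted results from above.

# Exact percentage change implied by each slope coefficient (intercept excluded):
exact_pct = 100 * (np.exp(results.params.iloc[1:]) - 1)
print(f"Exact percentage changes:\n{exact_pct}\n")
# For educ, the approximate effect is 9.2%, while the exact effect is closer to 9.6%.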
Example 3.4: Participation in 401(k) Pension Plans

Question: What factors influence the participation rate in 401(k) pension plans among firms? Let’s consider the firm’s match rate and the age of its 401(k) plan.

Model:

$$\text{prate} = \beta_0 + \beta_1 \text{mrate} + \beta_2 \text{age} + u$$

- prate: Participation rate in 401(k) plans (percentage of eligible workers participating) (dependent variable)
- mrate: Firm’s match rate (e.g., if mrate = 0.5, the firm matches 50 cents for every dollar contributed by the employee) (independent variable)
- age: Age of the firm’s 401(k) plan, in years (independent variable)

We expect $\beta_1 > 0$ because a higher match rate should encourage participation. We might also expect $\beta_2 > 0$, since older, more established plans have had more time to enroll workers and build participation, although the expected sign is less obvious a priori.
k401k = wool.data("401k")
reg = smf.ols(formula="prate ~ mrate + age", data=k401k)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: prate R-squared: 0.092
Model: OLS Adj. R-squared: 0.091
Method: Least Squares F-statistic: 77.79
Date: Mon, 10 Feb 2025 Prob (F-statistic): 6.67e-33
Time: 16:49:36 Log-Likelihood: -6422.3
No. Observations: 1534 AIC: 1.285e+04
Df Residuals: 1531 BIC: 1.287e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 80.1190 0.779 102.846 0.000 78.591 81.647
mrate 5.5213 0.526 10.499 0.000 4.490 6.553
age 0.2431 0.045 5.440 0.000 0.155 0.331
==============================================================================
Omnibus: 375.579 Durbin-Watson: 1.910
Prob(Omnibus): 0.000 Jarque-Bera (JB): 805.992
Skew: -1.387 Prob(JB): 9.57e-176
Kurtosis: 5.217 Cond. No. 32.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results:
- mrate ($\hat{\beta}_1$): The coefficient for mrate is approximately 5.52. This indicates that a one-unit increase in the match rate (e.g., going from no match to a dollar-for-dollar match) is associated with an estimated increase in the participation rate of about 5.5 percentage points, holding plan age constant.
- age ($\hat{\beta}_2$): The coefficient for age is approximately 0.243. Holding the match rate constant, each additional year of plan age is associated with a 0.243 percentage point increase in the participation rate, suggesting that older, more established plans have somewhat higher participation.
Example 3.5a: Explaining Arrest Records

Question: Can we explain the number of arrests a young man has in 1986 based on his criminal history and employment status?

Model (without average sentence length):

$$\text{narr86} = \beta_0 + \beta_1 \text{pcnv} + \beta_2 \text{ptime86} + \beta_3 \text{qemp86} + u$$

- narr86: Number of arrests in 1986 (dependent variable)
- pcnv: Proportion of prior arrests that led to conviction (independent variable)
- ptime86: Months spent in prison in 1986 (independent variable)
- qemp86: Number of quarters employed in 1986 (independent variable)
We expect $\beta_1 < 0$ because a higher conviction rate might deter future crime. We also expect $\beta_2 < 0$: months spent in prison in 1986 mean less time on the street and therefore less opportunity to be arrested (an incapacitation effect, although the relationship can be complex). Finally, we expect $\beta_3 < 0$ because employment should reduce the likelihood of arrests.
crime1 = wool.data("crime1")
# model without avgsen:
reg = smf.ols(formula="narr86 ~ pcnv + ptime86 + qemp86", data=crime1)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: narr86 R-squared: 0.041
Model: OLS Adj. R-squared: 0.040
Method: Least Squares F-statistic: 39.10
Date: Mon, 10 Feb 2025 Prob (F-statistic): 9.91e-25
Time: 16:49:36 Log-Likelihood: -3394.7
No. Observations: 2725 AIC: 6797.
Df Residuals: 2721 BIC: 6821.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.7118 0.033 21.565 0.000 0.647 0.776
pcnv -0.1499 0.041 -3.669 0.000 -0.230 -0.070
ptime86 -0.0344 0.009 -4.007 0.000 -0.051 -0.018
qemp86 -0.1041 0.010 -10.023 0.000 -0.124 -0.084
==============================================================================
Omnibus: 2394.860 Durbin-Watson: 1.836
Prob(Omnibus): 0.000 Jarque-Bera (JB): 106169.153
Skew: 4.002 Prob(JB): 0.00
Kurtosis: 32.513 Cond. No. 8.27
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results (Model 3.5a):
- pcnv ($\hat{\beta}_1$): The coefficient for pcnv is approximately -0.150. A higher proportion of prior arrests leading to conviction is associated with fewer arrests in 1986, holding prison time and employment constant, consistent with a deterrent effect.
- ptime86 ($\hat{\beta}_2$): The coefficient for ptime86 is approximately -0.034. Each additional month spent in prison in 1986 is associated with a slightly lower number of arrests in 1986, controlling for the conviction proportion and employment, consistent with an incapacitation effect.
- qemp86 ($\hat{\beta}_3$): The coefficient for qemp86 is approximately -0.104. For each additional quarter of employment in 1986, the number of arrests in 1986 is estimated to decrease by about 0.104, holding other factors constant. Employment appears to have a deterrent effect on arrests.
Example 3.5b: Explaining Arrest Records

Model (with average sentence length): Let’s add another variable, avgsen, the average sentence served from prior convictions, to the model to see if it influences arrests in 1986.

$$\text{narr86} = \beta_0 + \beta_1 \text{pcnv} + \beta_2 \text{avgsen} + \beta_3 \text{ptime86} + \beta_4 \text{qemp86} + u$$

- avgsen: Average sentence served from prior convictions, in months (independent variable). We expect $\beta_2 < 0$ if longer average sentences deter crime.
crime1 = wool.data("crime1")
# model with avgsen:
reg = smf.ols(formula="narr86 ~ pcnv + avgsen + ptime86 + qemp86", data=crime1)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: narr86 R-squared: 0.042
Model: OLS Adj. R-squared: 0.041
Method: Least Squares F-statistic: 29.96
Date: Mon, 10 Feb 2025 Prob (F-statistic): 2.01e-24
Time: 16:49:36 Log-Likelihood: -3393.5
No. Observations: 2725 AIC: 6797.
Df Residuals: 2720 BIC: 6826.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.7068 0.033 21.319 0.000 0.642 0.772
pcnv -0.1508 0.041 -3.692 0.000 -0.231 -0.071
avgsen 0.0074 0.005 1.572 0.116 -0.002 0.017
ptime86 -0.0374 0.009 -4.252 0.000 -0.055 -0.020
qemp86 -0.1033 0.010 -9.940 0.000 -0.124 -0.083
==============================================================================
Omnibus: 2396.990 Durbin-Watson: 1.837
Prob(Omnibus): 0.000 Jarque-Bera (JB): 106841.658
Skew: 4.006 Prob(JB): 0.00
Kurtosis: 32.611 Cond. No. 10.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results (Model 3.5b):
Comparing this to Model 3.5a, we see:
- avgsen ($\hat{\beta}_2$): The coefficient for avgsen is approximately 0.007. Contrary to the deterrence hypothesis, the point estimate is positive, but it is small and statistically insignificant (p = 0.116 in the summary). This suggests that average sentence length, in this model and dataset, does not have a statistically significant deterrent effect on arrests in 1986, once we control for other factors.
- Changes in other coefficients: Notice that the coefficients for pcnv, ptime86, and qemp86 have changed only slightly compared to Model 3.5a. This is a common occurrence in multiple regression when regressors are added or removed. The coefficient on qemp86 is still negative and statistically significant, suggesting that employment remains a relevant factor.

Comparison of 3.5a and 3.5b: Adding avgsen did not substantially change the findings regarding pcnv, ptime86, and qemp86, and avgsen itself was not found to be statistically significant. This highlights the importance of considering multiple potential determinants and testing their individual and collective effects within a multiple regression framework.
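To make the comparison between the two specifications easier to read, we can refit both models and collect their coefficients in a single table. This is a small sketch; the column labels are just illustrative names.

# Refit both specifications and compare coefficients side by side:
res_a = smf.ols(formula="narr86 ~ pcnv + ptime86 + qemp86", data=crime1).fit()
res_b = smf.ols(formula="narr86 ~ pcnv + avgsen + ptime86 + qemp86", data=crime1).fit()
# Rows align on variable name; avgsen is NaN in the model that excludes it.
comparison = pd.DataFrame({"Model 3.5a": res_a.params, "Model 3.5b": res_b.params})
print(f"Coefficient comparison:\n{comparison}\n")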
Example 3.6: Hourly Wage Equation (Simple Regression)

Question: What happens if we omit relevant variables from our wage equation? Let’s revisit the wage equation but include only education as an explanatory variable.

Model (Simple Regression):

$$\log(\text{wage}) = \beta_0 + \beta_1 \text{educ} + u$$
wage1 = wool.data("wage1")
reg = smf.ols(formula="np.log(wage) ~ educ", data=wage1)
results = reg.fit()
print(f"results.summary(): \n{results.summary()}\n")
results.summary():
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 119.6
Date: Mon, 10 Feb 2025 Prob (F-statistic): 3.27e-25
Time: 16:49:36 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.097 5.998 0.000 0.393 0.775
educ 0.0827 0.008 10.935 0.000 0.068 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Results and Comparing to Example 3.3:
- educ ($\tilde{\beta}_1$): In this simple regression model, the coefficient for educ is approximately 0.083. This is somewhat smaller than the coefficient on educ (0.092) in the multiple regression model (Example 3.3) that included exper and tenure.

Omitted Variable Bias (Preview): The difference in the coefficient on educ between the simple regression and the multiple regression illustrates the concept of omitted variable bias. By omitting exper and tenure, which are likely correlated with both educ and log(wage), we attribute part of their effect to education in the simple regression; the direction and size of the resulting bias depend on how the omitted variables are correlated with educ. We will formally explore omitted variable bias in Section 3.3.
3.2 OLS in Matrix Form

While statsmodels handles the calculations for us, it’s crucial to understand the underlying mechanics of Ordinary Least Squares (OLS) estimation, especially in the context of multiple regression. Matrix algebra provides a concise and powerful way to represent and solve the OLS problem for multiple regression.

In matrix form, the multiple regression model can be written as:

$$y = X\beta + u$$

Where:
- $y$ is an $n \times 1$ column vector of the dependent variable observations.
- $X$ is an $n \times (k+1)$ matrix of independent variable observations, called the design matrix. The first column of $X$ is typically a column of ones (for the intercept), and the subsequent columns contain the observations of the independent variables $x_1, \dots, x_k$.
- $\beta$ is a $(k+1) \times 1$ column vector of the unknown regression coefficients $\beta_0, \beta_1, \dots, \beta_k$.
- $u$ is an $n \times 1$ column vector of the error terms.

The OLS estimator $\hat{\beta}$ that minimizes the sum of squared residuals can be expressed in matrix form as:

$$\hat{\beta} = (X'X)^{-1} X'y$$

Where:
- $X'$ is the transpose of the matrix $X$.
- $(X'X)^{-1}$ is the inverse of the matrix product $X'X$.

Let’s demonstrate this matrix calculation using the gpa1 dataset from Example 3.1.
gpa1 = wool.data("gpa1")
# determine sample size & no. of regressors:
n = len(gpa1)
k = 2 # Number of independent variables (hsGPA and ACT)
# extract y:
y = gpa1["colGPA"]
# extract X & add a column of ones:
X = pd.DataFrame({"const": 1, "hsGPA": gpa1["hsGPA"], "ACT": gpa1["ACT"]})
# alternative with patsy (more streamlined for formula-based design matrices):
y2, X2 = pt.dmatrices("colGPA ~ hsGPA + ACT", data=gpa1, return_type="dataframe")
# display first few rows of X:
print(f"X.head(): \n{X.head()}\n")
X.head():
const hsGPA ACT
0 1 3.0 21
1 1 3.2 24
2 1 3.6 26
3 1 3.5 27
4 1 3.9 28
The code above constructs the design matrix $X$ and the vector $y$ from the gpa1 data. The patsy library provides a convenient way to create design matrices directly from formulas, which is often more efficient for complex models.
# parameter estimates using matrix formula:
X = np.array(X) # Convert pandas DataFrame to numpy array for matrix operations
y = np.array(y).reshape(n, 1) # Reshape y to be a column vector (n x 1)
b = np.linalg.inv(X.T @ X) @ X.T @ y # Matrix formula for OLS estimator
print(f"b (estimated coefficients):\n{b}\n")
b (estimated coefficients):
[[1.28632777]
[0.45345589]
[0.00942601]]
This code performs the matrix operations to calculate $\hat{\beta} = (X'X)^{-1} X'y$. The result b should match the coefficients we obtained from statsmodels earlier.

After estimating $\hat{\beta}$, we can calculate the residuals:

$$\hat{u} = y - X\hat{\beta}$$

And the estimator for the error variance $\sigma^2$:

$$\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n - k - 1}$$

The denominator $n - k - 1$ represents the degrees of freedom in multiple regression, where $n$ is the sample size and $k + 1$ is the number of parameters estimated (including the intercept). The square root of $\hat{\sigma}^2$ is the Standard Error of the Regression (SER).
# residuals, estimated variance of u and SER:
u_hat = y - X @ b # Calculate residuals
sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1) # Estimated error variance
SER = np.sqrt(sigsq_hat) # Standard Error of Regression
print(f"SER: {SER}\n")
SER: [[0.34031576]]
Finally, the estimated variance-covariance matrix of the OLS estimator $\hat{\beta}$ is given by:

$$\widehat{\text{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$$
The standard errors for each coefficient are the square roots of the diagonal elements of this variance-covariance matrix.
# estimated variance-covariance matrix of beta_hat and standard errors:
Vbeta_hat = sigsq_hat * np.linalg.inv(X.T @ X) # Variance-covariance matrix
se = np.sqrt(np.diagonal(Vbeta_hat)) # Standard errors (diagonal elements' square root)
print(f"se (standard errors):\n{se}\n")
se (standard errors):
[0.34082212 0.09581292 0.01077719]
The manually calculated coefficients (b), SER, and standard errors (se) should be very close (or identical, apart from rounding) to the values reported in the results.summary() output from statsmodels for Example 3.1. This demonstrates that statsmodels is using the matrix-based OLS formulas under the hood.
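As a quick numerical check, we can refit the model with statsmodels and verify that the estimates agree. This is a small sketch using np.allclose with its default tolerances; results_sm is just an illustrative variable name.

# Refit with statsmodels and compare against the manual matrix results:
results_sm = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1).fit()
print(f"Coefficients match: {np.allclose(b.flatten(), results_sm.params.values)}")
print(f"Standard errors match: {np.allclose(se, results_sm.bse.values)}")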
3.3 Ceteris Paribus Interpretation and Omitted Variable Bias
A key advantage of multiple regression is its ability to provide ceteris paribus interpretations of the coefficients. “Ceteris paribus” is Latin for “other things being equal” or “holding other factors constant.” In the context of multiple regression, the coefficient on a particular independent variable represents the effect of that variable on the dependent variable while holding all other included independent variables constant.
However, if we omit a relevant variable from our regression model, and this omitted variable is correlated with the included independent variables, we can encounter omitted variable bias. This means that the estimated coefficients on the included variables will be biased and inconsistent, and they will no longer have the desired ceteris paribus interpretation with respect to the omitted variable.
Let’s revisit the college GPA example to illustrate omitted variable bias. Suppose the “true” model is:

$$\text{colGPA} = \beta_0 + \beta_1 \text{ACT} + \beta_2 \text{hsGPA} + u$$

But we mistakenly estimate a simple regression model, omitting hsGPA:

$$\text{colGPA} = \tilde{\beta}_0 + \tilde{\beta}_1 \text{ACT} + v$$

If hsGPA is correlated with ACT (which is likely, since students with higher high school GPAs tend to score higher on standardized tests), then the estimated coefficient $\tilde{\beta}_1$ in the simple regression model will be biased. It will capture not only the direct effect of ACT on colGPA but also part of the effect of hsGPA on colGPA, because hsGPA is correlated with ACT.

Let’s see this empirically. First, we estimate the “full” model (including both hsGPA and ACT):
gpa1 = wool.data("gpa1")
# parameter estimates for full model:
reg = smf.ols(
formula="colGPA ~ ACT + hsGPA",
data=gpa1,
) # Order of regressors doesn't matter in formula
results = reg.fit()
b = results.params # Extract estimated coefficients
print(f"Coefficients from full model (b):\n{b}\n")
Coefficients from full model (b):
Intercept 1.286328
ACT 0.009426
hsGPA 0.453456
dtype: float64
Now, let’s consider the relationship between the included variable (ACT) and the omitted variable (hsGPA). We can regress the omitted variable (hsGPA) on the included variable (ACT):
# relation between regressors (hsGPA on ACT):
reg_delta = smf.ols(formula="hsGPA ~ ACT", data=gpa1)
results_delta = reg_delta.fit()
delta_tilde = results_delta.params # Extract coefficient of ACT in this regression
print(f"Coefficients from regression of hsGPA on ACT (delta_tilde):\n{delta_tilde}\n")
Coefficients from regression of hsGPA on ACT (delta_tilde):
Intercept 2.462537
ACT 0.038897
dtype: float64
The slope $\tilde{\delta}_1$ from this regression represents how much hsGPA changes on average for a one-unit change in ACT. It captures the correlation between hsGPA and ACT.
The omitted variable bias formula describes how the simple regression coefficient relates to the full-model coefficients when hsGPA is omitted. In this case, the coefficient on ACT in the simple regression is:

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$$

so the bias relative to the full model is $\hat{\beta}_2 \tilde{\delta}_1$.

Where:
- $\hat{\beta}_2$ is the coefficient on the omitted variable (hsGPA) in the full model.
- $\tilde{\delta}_1$ is the coefficient on the included variable (ACT) in the regression of the omitted variable (hsGPA) on the included variable (ACT).

Let’s compute this implied value of $\tilde{\beta}_1$ and compare it with the coefficient on ACT from the full model ($\hat{\beta}_1$) and from the actual simple regression.
# omitted variables formula for b1_tilde (approximate bias in ACT coefficient when hsGPA is omitted):
b1_tilde = b["ACT"] + b["hsGPA"] * delta_tilde["ACT"] # Applying the bias formula
print(f"Approximate biased coefficient of ACT (b1_tilde):\n{b1_tilde}\n")
Approximate biased coefficient of ACT (b1_tilde):
0.027063973943178603
Finally, let’s estimate the simple regression model (omitting hsGPA) and look at the actual coefficient on ACT:
# actual regression with hsGPA omitted (simple regression):
reg_om = smf.ols(formula="colGPA ~ ACT", data=gpa1)
results_om = reg_om.fit()
b_om = results_om.params # Extract coefficient of ACT from simple regression
print(f"Coefficient of ACT in simple regression (b_om):\n{b_om}\n")
Coefficient of ACT in simple regression (b_om):
Intercept 2.402979
ACT 0.027064
dtype: float64
Comparing b_om["ACT"] (the coefficient on ACT in the simple regression) with b["ACT"] (the coefficient on ACT in the full regression), we see that they differ noticeably (about 0.027 versus 0.009). Moreover, b1_tilde, the value implied by the omitted variable formula, matches b_om["ACT"]; with a single omitted regressor, this algebraic relationship between the simple and multiple regression estimates holds exactly in the sample.

Conclusion on Omitted Variable Bias: Omitting a relevant variable like hsGPA that is correlated with an included variable like ACT leads to biased estimates. In this case, the coefficient on ACT in the simple regression is larger than in the multiple regression, because it picks up some of the positive effect of hsGPA on colGPA. This highlights the importance of including all relevant variables in a regression model to obtain unbiased and consistent estimates and to preserve the ceteris paribus interpretation.
3.4 Standard Errors, Multicollinearity, and VIF
In multiple regression, we not only estimate coefficients but also need to assess their precision. Standard errors (SEs) of the estimated coefficients are crucial for conducting hypothesis tests and constructing confidence intervals. Larger standard errors indicate less precise estimates.
The standard error of a coefficient in multiple regression depends on several factors, including:
- Sample size (n): Larger sample sizes generally lead to smaller standard errors (more precision).
- Error variance ($\sigma^2$): Higher error variance (more noise in the data) leads to larger standard errors.
- Total sample variation in the independent variable ($SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$): More variation in $x_j$ (holding other regressors constant) leads to smaller standard errors for $\hat{\beta}_j$.
- Correlation among independent variables (Multicollinearity): Higher correlation among independent variables (multicollinearity) generally leads to larger standard errors.
Multicollinearity arises when two or more independent variables in a regression model are highly linearly correlated. While multicollinearity does not bias the OLS coefficient estimates, it increases their standard errors, making it harder to precisely estimate the individual effects of the correlated variables and potentially leading to statistically insignificant coefficients even if the variables are truly important.
The Variance Inflation Factor (VIF) is a common measure of multicollinearity. For each independent variable $x_j$, the VIF is calculated as:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables in the model.
- A VIF of 1 indicates no multicollinearity.
- VIFs greater than 1 indicate increasing multicollinearity.
- As a rule of thumb, VIFs greater than 10 are often considered to indicate high multicollinearity, although there is no strict cutoff.
Let’s illustrate the calculation of standard errors and VIF using the gpa1 example again.
gpa1 = wool.data("gpa1")
# full estimation results including automatic SE from statsmodels:
reg = smf.ols(formula="colGPA ~ hsGPA + ACT", data=gpa1)
results = reg.fit()
# Extract SER (Standard Error of Regression) directly from results object:
SER = np.sqrt(
results.mse_resid,
) # mse_resid is Mean Squared Error of Residuals (estimated sigma^2)
# Regress hsGPA on ACT to calculate R-squared for VIF of hsGPA:
reg_hsGPA = smf.ols(formula="hsGPA ~ ACT", data=gpa1)
results_hsGPA = reg_hsGPA.fit()
R2_hsGPA = results_hsGPA.rsquared # R-squared from this auxiliary regression
VIF_hsGPA = 1 / (1 - R2_hsGPA) # Calculate VIF for hsGPA
print(f"VIF for hsGPA: {VIF_hsGPA:.3f}\n") # Format to 3 decimal places
VIF for hsGPA: 1.136
The VIF for hsGPA (and similarly for ACT) quantifies the extent to which the variance of its estimated coefficient is inflated by its correlation with the other independent variable (ACT).
Now, let’s manually calculate the standard error of the coefficient on hsGPA using a formula that makes the role of the VIF explicit. The standard error of $\hat{\beta}_j$ (the coefficient on $x_j$) can be written as:

$$\text{se}(\hat{\beta}_j) = \frac{1}{\sqrt{n}} \cdot \frac{\hat{\sigma}}{\text{sd}(x_j)} \cdot \sqrt{VIF_j}$$

Where:
- $n$ is the sample size.
- $\hat{\sigma}$ is the Standard Error of the Regression (SER).
- $\text{sd}(x_j)$ is the sample standard deviation of $x_j$ (computed with $n$ rather than $n-1$ in the denominator, as in the code below).
- $VIF_j$ is the Variance Inflation Factor for $x_j$.

This formula shows that the standard error is proportional to $\sqrt{VIF_j}$: higher VIFs lead to larger standard errors.
# Manual calculation of SE of hsGPA coefficient using VIF:
n = results.nobs # Sample size
sdx = np.std(gpa1["hsGPA"], ddof=1) * np.sqrt(
(n - 1) / n,
) # Sample standard deviation of hsGPA (using population std formula)
SE_hsGPA = 1 / np.sqrt(n) * SER / sdx * np.sqrt(VIF_hsGPA) # Approximate SE calculation
print(
f"Manually calculated SE for hsGPA coefficient: {SE_hsGPA:.4f}\n",
) # Format to 4 decimal places
print(
f"SE for hsGPA coefficient from statsmodels summary: {results.bse['hsGPA']:.4f}\n",
) # Extract BSE from statsmodels and format
Manually calculated SE for hsGPA coefficient: 0.0958
SE for hsGPA coefficient from statsmodels summary: 0.0958
Compare the manually calculated SE_hsGPA with the standard error for hsGPA reported in the results.summary() output. They should agree up to rounding.
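Rather than comparing by eye, we can check the agreement programmatically; here is a one-line sketch using np.isclose with its default tolerances.

# Programmatic check that the manual SE matches the statsmodels SE:
print(f"SEs agree: {np.isclose(SE_hsGPA, results.bse['hsGPA'])}")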
For models with more than two independent variables, we can use the variance_inflation_factor function from statsmodels.stats.outliers_influence to calculate VIFs for all regressors.
wage1 = wool.data("wage1")
# extract matrices using patsy for wage equation with educ, exper, tenure:
y, X = pt.dmatrices(
"np.log(wage) ~ educ + exper + tenure",
data=wage1,
return_type="dataframe",
)
# get VIFs for all regressors (including intercept):
K = X.shape[1] # Number of columns in X (including intercept)
VIF = np.empty(K) # Initialize an array to store VIFs
for i in range(K):
VIF[i] = smo.variance_inflation_factor(
X.values,
i,
) # Calculate VIF for each regressor
print(f"VIFs for each regressor (including intercept):\n{VIF}\n")
# VIFs for independent variables only (excluding intercept):
VIF_no_intercept = VIF[1:] # Slice VIF array to exclude intercept's VIF (index 0)
variable_names = X.columns[1:] # Get variable names excluding intercept
vif_df = pd.DataFrame(
{"Variable": variable_names, "VIF": VIF_no_intercept},
) # Create DataFrame for better presentation
print("\nVIFs for independent variables (excluding intercept):\n")
print(vif_df)
VIFs for each regressor (including intercept):
[29.37890286 1.11277075 1.47761777 1.34929556]
VIFs for independent variables (excluding intercept):
Variable VIF
0 educ 1.112771
1 exper 1.477618
2 tenure 1.349296
Interpreting VIFs in the Wage Equation: The VIF values for educ, exper, and tenure are all well below 10 (roughly 1.1 to 1.5), which suggests that multicollinearity is not a severe problem in this model. (The large VIF reported for the intercept column is not interpreted as multicollinearity among the explanatory variables and is usually ignored.) If the VIFs were high, it would indicate that some of these variables are highly correlated, potentially making it difficult to precisely estimate their individual effects on wage.
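These low VIFs reflect only modest correlation among the regressors. As a complementary check, the small sketch below computes the pairwise correlation matrix of the three explanatory variables directly from the data.

# Pairwise correlations among the regressors in the wage equation:
corr_matrix = wage1[["educ", "exper", "tenure"]].corr()
print(f"Correlation matrix:\n{corr_matrix}\n")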
In summary, standard errors quantify the uncertainty in our coefficient estimates. Multicollinearity, a condition of high correlation among independent variables, can inflate standard errors, reducing the precision of our estimates. VIF is a useful tool for detecting and assessing the severity of multicollinearity in multiple regression models.