This notebook explores the issue of heteroskedasticity in the context of linear regression models estimated using Ordinary Least Squares (OLS). Heteroskedasticity occurs when the variance of the error term, conditional on the explanatory variables, is not constant across observations. This violates one of the Gauss-Markov assumptions required for OLS to be the Best Linear Unbiased Estimator (BLUE).
Consequences of Heteroskedasticity:
- OLS coefficient estimates remain unbiased and consistent (under the other standard assumptions).
- However, the usual OLS standard errors and test statistics (t-tests, F-tests) are no longer valid, even in large samples: they typically lead to incorrect inference (e.g., confidence intervals with incorrect coverage rates, hypothesis tests with incorrect size). A short simulation after the imports below illustrates this.
- OLS is no longer the most efficient (minimum variance) linear unbiased estimator.
We will cover:
- How to obtain valid inference in the presence of heteroskedasticity using robust standard errors.
- Methods for testing whether heteroskedasticity is present.
- Weighted Least Squares (WLS) as an alternative estimation method that can be more efficient than OLS when the form of heteroskedasticity is known or can be reasonably estimated.
First, let’s install and import the necessary libraries.
# %pip install numpy pandas patsy statsmodels wooldridge -q
import numpy as np
import pandas as pd
import patsy as pt # Used for creating design matrices easily from formulas
import statsmodels.api as sm # Provides statistical models and tests
import statsmodels.formula.api as smf # Convenient formula interface for statsmodels
import wooldridge as wool # Access to Wooldridge textbook datasets
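To make the inference problem concrete, here is a minimal simulation sketch (not from the textbook; the data-generating process and seed are purely illustrative). With an error standard deviation proportional to the regressor, t-tests of a true null based on default OLS standard errors tend to over-reject, while HC3-robust tests stay close to the nominal 5% level.
# Minimal Monte Carlo sketch: default vs HC3-robust inference under heteroskedasticity
rng = np.random.default_rng(12345)
n, reps = 200, 500
rejections_default, rejections_hc3 = 0, 0
for _ in range(reps):
    x = rng.uniform(1, 10, n)
    u = rng.normal(0, 1, n) * x  # error sd grows with x -> heteroskedasticity
    y = 1 + 0.5 * x + u  # true slope is 0.5
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    # test H0: beta_x = 0.5 (the true value) at the 5% level with default SEs
    rejections_default += abs(fit.params[1] - 0.5) / fit.bse[1] > 1.96
    # same test with HC3-robust SEs
    fit_hc3 = fit.get_robustcov_results(cov_type="HC3")
    rejections_hc3 += abs(fit_hc3.params[1] - 0.5) / fit_hc3.bse[1] > 1.96
print(f"Rejection rate with default SEs (nominal 5%): {rejections_default / reps:.3f}")
print(f"Rejection rate with HC3 robust SEs (nominal 5%): {rejections_hc3 / reps:.3f}")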
8.1 Heteroskedasticity-Robust Inference
Even if heteroskedasticity is present, we can still use OLS coefficient estimates but compute different standard errors that are robust to the presence of heteroskedasticity (of unknown form). These are often called White standard errors or heteroskedasticity-consistent (HC) standard errors.
Using robust standard errors allows for valid t-tests, F-tests, and confidence intervals based on the OLS estimates, even if the error variance is not constant. statsmodels provides several versions of robust standard errors (HC0, HC1, HC2, HC3), which differ mainly in their finite-sample adjustments. HC0 is the original White estimator, while HC1, HC2, and HC3 apply different small-sample corrections and often perform better in smaller samples (HC3 is often recommended).
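Concretely, the variants differ in how each squared residual $\hat{u}_i^2$ is scaled: HC0 uses it as-is, HC1 rescales HC0 by $n/(n-k)$, and HC2 and HC3 divide $\hat{u}_i^2$ by $(1 - h_{ii})$ and $(1 - h_{ii})^2$ respectively, where $h_{ii}$ is the leverage of observation $i$. A small helper (illustrative, not part of statsmodels) can tabulate the variants side by side for any model built with smf.ols, e.g. compare_hc_se(reg) for the model defined below:
# Illustrative helper: refit a model under each covariance estimator and
# collect the resulting standard errors in one DataFrame for comparison.
def compare_hc_se(model):
    cov_types = ["nonrobust", "HC0", "HC1", "HC2", "HC3"]
    return pd.DataFrame({ct: model.fit(cov_type=ct).bse for ct in cov_types})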
Example 8.2: Heteroskedasticity-Robust Inference for GPA Equation
We estimate a model for college cumulative GPA (cumgpa) using data for students observed in the spring semester (spring == 1). We compare the standard OLS results with results using robust standard errors.
# Load the GPA data
gpa3 = wool.data("gpa3")
# Define the regression model using statsmodels formula API
# We predict cumgpa based on SAT score, high school percentile, total credit hours,
# gender, and race dummies.
# We only use data for the spring semester using the 'subset' argument.
reg = smf.ols(
formula="cumgpa ~ sat + hsperc + tothrs + female + black + white",
data=gpa3,
subset=(gpa3["spring"] == 1), # Use only spring observations
)
# --- Estimate with Default (Homoskedasticity-Assumed) Standard Errors ---
results_default = reg.fit()
# Display the results in a table
print("--- OLS Results with Default Standard Errors ---")
table_default = pd.DataFrame(
{
"b": round(results_default.params, 5), # OLS Coefficients
"se": round(results_default.bse, 5), # Default Standard Errors
"t": round(results_default.tvalues, 5), # t-statistics based on default SEs
"pval": round(results_default.pvalues, 5), # p-values based on default SEs
},
)
print(f"Default OLS Estimates:\n{table_default}\n")
# Interpretation (Default): Based on these standard errors, sat, hsperc, tothrs, and
# female appear statistically significant (p < 0.05), while black and white do not.
# However, if heteroskedasticity is present, these SEs and p-values are unreliable.
--- OLS Results with Default Standard Errors ---
Default OLS Estimates:
b se t pval
Intercept 1.47006 0.22980 6.39706 0.00000
sat 0.00114 0.00018 6.38850 0.00000
hsperc -0.00857 0.00124 -6.90600 0.00000
tothrs 0.00250 0.00073 3.42551 0.00068
female 0.30343 0.05902 5.14117 0.00000
black -0.12828 0.14737 -0.87049 0.38462
white -0.05872 0.14099 -0.41650 0.67730
# --- Estimate with White's Original Robust Standard Errors (HC0) ---
# We fit the same model, but specify cov_type='HC0' to get robust SEs.
results_white = reg.fit(cov_type="HC0")
# Display the results
print("--- OLS Results with Robust (HC0) Standard Errors ---")
table_white = pd.DataFrame(
{
"b": round(results_white.params, 5), # OLS Coefficients (same as default)
"se": round(results_white.bse, 5), # Robust (HC0) Standard Errors
"t": round(results_white.tvalues, 5), # Robust t-statistics
"pval": round(results_white.pvalues, 5), # Robust p-values
},
)
print(f"HC0 Robust Estimates:\n{table_white}\n")
# Interpretation (HC0): The coefficient estimates 'b' are identical to the default OLS run.
# However, several standard errors change: e.g., the SE for 'hsperc' increased from 0.00124
# to 0.00140, reducing its t-statistic (though it remains highly significant), while the
# SE for 'black' decreased from 0.14737 to 0.11810.
# The conclusions about significance might change depending on the variable and significance level.
--- OLS Results with Robust (HC0) Standard Errors ---
HC0 Robust Estimates:
b se t pval
Intercept 1.47006 0.21856 6.72615 0.00000
sat 0.00114 0.00019 6.01360 0.00000
hsperc -0.00857 0.00140 -6.10008 0.00000
tothrs 0.00250 0.00073 3.41365 0.00064
female 0.30343 0.05857 5.18073 0.00000
black -0.12828 0.11810 -1.08627 0.27736
white -0.05872 0.11032 -0.53228 0.59453
# --- Estimate with Refined Robust Standard Errors (HC3) ---
# HC3 applies a different small-sample correction, often preferred over HC0.
results_refined = reg.fit(cov_type="HC3")
# Display the results
print("--- OLS Results with Robust (HC3) Standard Errors ---")
table_refined = pd.DataFrame(
{
"b": round(results_refined.params, 5), # OLS Coefficients (same as default)
"se": round(results_refined.bse, 5), # Robust (HC3) Standard Errors
"t": round(results_refined.tvalues, 5), # Robust t-statistics
"pval": round(results_refined.pvalues, 5), # Robust p-values
},
)
print(f"Refined HC3 Robust Estimates:\n{table_refined}\n")
# Interpretation (HC3): The HC3 robust standard errors are slightly larger than the HC0 SEs
# (e.g., SE for 'hsperc' is 0.00144 with HC3 vs 0.00140 with HC0). In this specific case,
# the differences between HC0 and HC3 are minor and don't change the conclusions about
# statistical significance compared to HC0. Using robust standard errors confirms that
# sat, hsperc, tothrs, and female have statistically significant effects on cumgpa in this
# sample, while black and white remain insignificant, even if heteroskedasticity is present.
--- OLS Results with Robust (HC3) Standard Errors ---
Refined HC3 Robust Estimates:
b se t pval
Intercept 1.47006 0.22938 6.40885 0.00000
sat 0.00114 0.00020 5.84017 0.00000
hsperc -0.00857 0.00144 -5.93407 0.00000
tothrs 0.00250 0.00075 3.34177 0.00083
female 0.30343 0.06004 5.05388 0.00000
black -0.12828 0.12819 -1.00074 0.31695
white -0.05872 0.12044 -0.48758 0.62585
Robust standard errors can also be used for hypothesis tests involving multiple restrictions, such as F-tests. We test the joint significance of the race dummies (black and white), comparing the standard F-test (assuming homoskedasticity) with robust F-tests.
# Reload data if needed
gpa3 = wool.data("gpa3")
# Define the model again
reg = smf.ols(
formula="cumgpa ~ sat + hsperc + tothrs + female + black + white",
data=gpa3,
subset=(gpa3["spring"] == 1),
)
# Define the null hypothesis for the F-test: H0: beta_black = 0 AND beta_white = 0
hypotheses = ["black = 0", "white = 0"]
# --- F-Test using Default (Homoskedasticity-Assumed) VCOV ---
results_default = reg.fit() # Fit with default SEs
ftest_default = results_default.f_test(hypotheses)
fstat_default = ftest_default.statistic # Extract F-statistic value
fpval_default = ftest_default.pvalue
print("--- F-Test Comparison ---")
print(f"Default F-statistic: {fstat_default:.4f}")
print(f"Default F p-value: {fpval_default:.4f}\n")
# Interpretation (Default F-Test): With an F-statistic of 0.68 and a p-value of 0.51, the
# default F-test fails to reject the null hypothesis: the race dummies are jointly
# insignificant, assuming homoskedasticity.
--- F-Test Comparison ---
Default F-statistic: 0.6796
Default F p-value: 0.5075
# --- F-Test using Robust (HC3) VCOV ---
results_hc3 = reg.fit(cov_type="HC3") # Fit with HC3 robust SEs
ftest_hc3 = results_hc3.f_test(
hypotheses,
) # Perform F-test using the robust VCOV matrix
fstat_hc3 = ftest_hc3.statistic
fpval_hc3 = ftest_hc3.pvalue
print(f"Robust (HC3) F-statistic: {fstat_hc3:.4f}")
print(f"Robust (HC3) F p-value: {fpval_hc3:.4f}\n")
# Interpretation (HC3 F-Test): The robust F-statistic (0.67) is close to the default (0.68),
# and the p-value (0.51) is essentially unchanged. The conclusion remains the same: we fail
# to reject the null hypothesis, so race is jointly insignificant even after accounting for
# potential heteroskedasticity.
Robust (HC3) F-statistic: 0.6725
Robust (HC3) F p-value: 0.5111
# --- F-Test using Robust (HC0) VCOV ---
results_hc0 = reg.fit(cov_type="HC0") # Fit with HC0 robust SEs
ftest_hc0 = results_hc0.f_test(hypotheses)
fstat_hc0 = ftest_hc0.statistic
fpval_hc0 = ftest_hc0.pvalue
print(f"Robust (HC0) F-statistic: {fstat_hc0:.4f}")
print(f"Robust (HC0) F p-value: {fpval_hc0:.4f}\n")
# Interpretation (HC0 F-Test): The HC0 robust F-test gives very similar results to the HC3 test
# in this case. In general, if the default and robust test statistics lead to different
# conclusions, the robust result is preferred.
Robust (HC0) F-statistic: 0.7478
Robust (HC0) F p-value: 0.4741
8.2 Heteroskedasticity Tests
While robust inference provides a way to proceed despite heteroskedasticity, it’s often useful to formally test for its presence. Tests can help understand the data better and decide whether alternative estimation methods like WLS might be beneficial (for efficiency).
Breusch-Pagan Test
The Breusch-Pagan (BP) test checks if the error variance is systematically related to the explanatory variables.
- $H_0$: Homoskedasticity ($\text{Var}(u \mid x_1, \dots, x_k) = \sigma^2$, constant variance)
- $H_1$: Heteroskedasticity ($\text{Var}(u \mid x_1, \dots, x_k)$ depends on $x_1, \dots, x_k$)
The test involves:
- Estimate the original model by OLS and obtain the squared residuals, $\hat{u}^2$.
- Regress the squared residuals on the original explanatory variables: $\hat{u}^2 = \delta_0 + \delta_1 x_1 + \dots + \delta_k x_k + v$.
- Test the joint significance of $\delta_1, \dots, \delta_k$. If they are jointly significant (using an F-test or LM test), we reject $H_0$ and conclude heteroskedasticity is present.
Example 8.4: Heteroskedasticity in a Housing Price Equation (Levels)
We test for heteroskedasticity in a model explaining house prices (price) using lot size (lotsize), square footage (sqrft), and number of bedrooms (bdrms).
# Load housing price data
hprice1 = wool.data("hprice1")
# 1. Estimate the original model (price in levels)
reg = smf.ols(formula="price ~ lotsize + sqrft + bdrms", data=hprice1)
results = reg.fit()
# Display the OLS results (for context)
print("--- OLS Results (Levels Model) ---")
table_results = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"OLS Estimates:\n{table_results}\n")
--- OLS Results (Levels Model) ---
OLS Estimates:
b se t pval
Intercept -21.7703 29.4750 -0.7386 0.4622
lotsize 0.0021 0.0006 3.2201 0.0018
sqrft 0.1228 0.0132 9.2751 0.0000
bdrms 13.8525 9.0101 1.5374 0.1279
# --- Breusch-Pagan Test (LM version) using statsmodels function ---
# We need the residuals from the original model and the design matrix (X).
# patsy.dmatrices helps create the X matrix easily from the formula.
y, X = pt.dmatrices(
"price ~ lotsize + sqrft + bdrms", # Formula defines X
data=hprice1,
return_type="dataframe",
)
# Pass the residuals and X to the function.
# Returns: LM statistic, LM p-value, F statistic (alternative form), F p-value
print("--- Breusch-Pagan Test (Levels Model) ---")
result_bp_lm = sm.stats.diagnostic.het_breuschpagan(results.resid, X)
bp_lm_statistic = result_bp_lm[0]
bp_lm_pval = result_bp_lm[1]
print(f"BP LM statistic: {bp_lm_statistic:.4f}")
print(f"BP LM p-value: {bp_lm_pval:.4f}\n")
# Interpretation (BP LM Test): The LM statistic is 14.09, and the p-value is 0.0028.
# Since p < 0.01, we reject the null hypothesis of homoskedasticity even at the 1% level.
# There is strong evidence that the error variance depends on the explanatory variables.
--- Breusch-Pagan Test (Levels Model) ---
BP LM statistic: 14.0924
BP LM p-value: 0.0028
# --- Breusch-Pagan Test (F version) calculated manually ---
# This demonstrates the underlying steps.
# 2. Get squared residuals
hprice1["resid_sq"] = results.resid**2
# 3. Regress squared residuals on original predictors
reg_resid = smf.ols(formula="resid_sq ~ lotsize + sqrft + bdrms", data=hprice1)
results_resid = reg_resid.fit()
# The F-statistic from this regression is the BP F-test statistic.
bp_F_statistic = results_resid.fvalue
bp_F_pval = results_resid.f_pvalue
print(f"BP F statistic: {bp_F_statistic:.4f}")
print(f"BP F p-value: {bp_F_pval:.4f}\n")
# Interpretation (BP F Test): The F-statistic is 5.34, and the p-value is 0.0020.
# This also leads to rejecting the null hypothesis of homoskedasticity at the 1% level.
# The conclusion matches the LM version. Using robust standard errors for the original
# price model is recommended.
BP F statistic: 5.3389
BP F p-value: 0.0020
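For completeness, the LM version can also be computed by hand from the same auxiliary regression: LM = n · R², which is asymptotically chi-squared with k = 3 degrees of freedom under the null. A short sketch (scipy is assumed available, as it is a statsmodels dependency):
import scipy.stats as stats

bp_lm_manual = results_resid.nobs * results_resid.rsquared  # LM = n * R^2
bp_lm_pval_manual = 1 - stats.chi2.cdf(bp_lm_manual, df=3)  # chi2(3) under H0
print(f"Manual BP LM statistic: {bp_lm_manual:.4f}")
print(f"Manual BP LM p-value: {bp_lm_pval_manual:.4f}")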
White Test
The White test is a more general test for heteroskedasticity that doesn’t assume a specific linear relationship between the variance and the predictors. It tests whether the variance depends on any combination of the levels, squares, and cross-products of the original regressors.
A simplified version (often used in practice) regresses the squared residuals on the OLS fitted values $\hat{y}$ and squared fitted values $\hat{y}^2$: $\hat{u}^2 = \delta_0 + \delta_1 \hat{y} + \delta_2 \hat{y}^2 + v$.
An F-test (or LM test) of $H_0\colon \delta_1 = \delta_2 = 0$ is performed.
Example 8.5: BP and White test in the Log Housing Price Equation
Often, taking the logarithm of the dependent variable can mitigate heteroskedasticity. We now test the log-log housing price model.
# Load housing price data again if needed
hprice1 = wool.data("hprice1")
# Estimate the model with log(price) as the dependent variable
reg_log = smf.ols(
formula="np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms",
data=hprice1,
)
results_log = reg_log.fit()
# --- Breusch-Pagan Test (Log Model) ---
y_log, X_bp_log = pt.dmatrices(
"np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms",
data=hprice1,
return_type="dataframe",
)
print("--- Breusch-Pagan Test (Log Model) ---")
result_bp_log = sm.stats.diagnostic.het_breuschpagan(results_log.resid, X_bp_log)
bp_statistic_log = result_bp_log[0]
bp_pval_log = result_bp_log[1]
print(f"BP LM statistic: {bp_statistic_log:.4f}")
print(f"BP LM p-value: {bp_pval_log:.4f}\n")
# Interpretation (BP Test, Log Model): The LM statistic is 4.22, and the p-value is 0.2383.
# We fail to reject the null hypothesis of homoskedasticity at conventional levels (e.g., 5% or 10%).
# This suggests that the BP test does not find evidence of heteroskedasticity related linearly
# to the predictors in this log-transformed model.
--- Breusch-Pagan Test (Log Model) ---
BP LM statistic: 4.2232
BP LM p-value: 0.2383
# --- White Test (Simplified Version using Fitted Values, Log Model) ---
# Create the design matrix for the White test auxiliary regression
X_wh_log = pd.DataFrame(
{
"const": 1, # Include constant
"fitted_reg": results_log.fittedvalues, # OLS fitted values from log model
"fitted_reg_sq": results_log.fittedvalues**2, # Squared fitted values
},
)
# Use the het_breuschpagan function with the new design matrix X_wh_log
print("--- White Test (Log Model) ---")
result_white_log = sm.stats.diagnostic.het_breuschpagan(results_log.resid, X_wh_log)
white_statistic_log = result_white_log[0]
white_pval_log = result_white_log[1]
print(f"White LM statistic: {white_statistic_log:.4f}")
print(f"White LM p-value: {white_pval_log:.4f}\n")
# Interpretation (White Test, Log Model): The White test LM statistic is 3.45, and the
# p-value is 0.1784. Like the BP test, the White test fails to reject the null hypothesis
# of homoskedasticity at conventional significance levels. Neither test finds evidence of
# heteroskedasticity in the log model, suggesting that taking logarithms of price, lotsize,
# and sqrft has substantially mitigated the heteroskedasticity found in the levels model.
# Robust standard errors can still be used as an inexpensive safeguard.
--- White Test (Log Model) ---
White LM statistic: 3.4473
White LM p-value: 0.1784
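For comparison, statsmodels also implements the full White test (levels, squares, and cross-products of all regressors) directly. A short sketch applying it to the log model and design matrix created above:
# Full White test via statsmodels (uses all squares and cross-products of X)
result_white_full = sm.stats.diagnostic.het_white(results_log.resid, X_bp_log)
print(f"Full White LM statistic: {result_white_full[0]:.4f}")
print(f"Full White LM p-value: {result_white_full[1]:.4f}")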
8.3 Weighted Least Squares (WLS)
If heteroskedasticity is detected and we have an idea about its form, we can use Weighted Least Squares (WLS). WLS is a transformation of the original model that yields estimators that are BLUE (efficient) under heteroskedasticity, provided the variance structure is correctly specified.
The idea is to divide the original equation by the standard deviation of the error term, $\sqrt{h(\mathbf{x})}$, where $\text{Var}(u \mid \mathbf{x}) = \sigma^2 h(\mathbf{x})$. This gives more weight to observations with smaller error variance and less weight to observations with larger error variance.
In practice, $h(\mathbf{x})$ is unknown, so we use an estimate $\hat{h}_i$. This leads to Feasible Generalized Least Squares (FGLS). A common approach is to specify a model for the variance function, estimate it, and use the predicted variances to construct the weights $w_i = 1/\hat{h}_i$.
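The mechanics are easy to verify: WLS with weights $1/h_i$ is numerically identical to OLS applied to the equation divided through by $\sqrt{h_i}$. A self-contained sketch with synthetic data (not from the textbook):
# Synthetic check: WLS(weights=1/h) equals OLS on the /sqrt(h)-transformed data
rng = np.random.default_rng(0)
n = 100
X = sm.add_constant(rng.uniform(1, 5, n))  # constant plus one regressor
h = X[:, 1]  # assumed variance function: h(x) = x
y = X @ np.array([1.0, 2.0]) + rng.normal(0, np.sqrt(h))
b_wls = sm.WLS(y, X, weights=1 / h).fit().params
b_ols_transformed = sm.OLS(y / np.sqrt(h), X / np.sqrt(h)[:, None]).fit().params
print(b_wls, b_ols_transformed)  # identical coefficient vectors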
Example 8.6: Financial Wealth Equation (WLS with Assumed Weights)
We model net total financial assets (nettfa) for single-person households (fsize == 1) as a function of income (inc), age (age), gender (male), and 401k eligibility (e401k). It’s plausible that the variance of nettfa increases with income. We assume $\text{Var}(u \mid \text{inc}) = \sigma^2 \cdot \text{inc}$. Thus, the standard deviation is $\sigma \sqrt{\text{inc}}$, and the appropriate weight for WLS is $1/\text{inc}$.
# Load 401k subsample data
k401ksubs = wool.data("401ksubs")
# Create subset for single-person households
k401ksubs_sub = k401ksubs[
k401ksubs["fsize"] == 1
].copy() # Use .copy() to avoid SettingWithCopyWarning
# --- OLS Estimation (with Robust SEs for comparison) ---
# We estimate by OLS first, but use robust standard errors (HC0) because we suspect heteroskedasticity.
reg_ols = smf.ols(
formula="nettfa ~ inc + I((age-25)**2) + male + e401k", # Note: age modeled quadratically around 25
data=k401ksubs_sub,
)
results_ols = reg_ols.fit(cov_type="HC0") # Use robust SEs
# Display OLS results
print("--- OLS Results with Robust (HC0) SEs (Singles Only) ---")
table_ols = pd.DataFrame(
{
"b": round(results_ols.params, 4),
"se": round(results_ols.bse, 4),
"t": round(results_ols.tvalues, 4),
"pval": round(results_ols.pvalues, 4),
},
)
print(f"OLS Robust Estimates:\n{table_ols}\n")
--- OLS Results with Robust (HC0) SEs (Singles Only) ---
OLS Robust Estimates:
b se t pval
Intercept -20.9850 3.4909 -6.0114 0.0000
inc 0.7706 0.0994 7.7486 0.0000
I((age - 25) ** 2) 0.0251 0.0043 5.7912 0.0000
male 2.4779 2.0558 1.2053 0.2281
e401k 6.8862 2.2837 3.0153 0.0026
# --- WLS Estimation (Assuming Var = sigma^2 * inc) ---
# Define the weights as 1/inc. statsmodels expects a list or array of weights.
wls_weight = list(1 / k401ksubs_sub["inc"])
# Estimate using smf.wls, passing the weights vector
reg_wls = smf.wls(
formula="nettfa ~ inc + I((age-25)**2) + male + e401k",
weights=wls_weight,
data=k401ksubs_sub,
)
# By default, WLS provides standard errors assuming the weight specification is correct.
results_wls = reg_wls.fit()
# Display WLS results
print("--- WLS Results (Weights = 1/inc) ---")
table_wls = pd.DataFrame(
{
"b": round(results_wls.params, 4), # WLS Coefficients
"se": round(results_wls.bse, 4), # WLS Standard Errors (assume correct weights)
"t": round(results_wls.tvalues, 4), # WLS t-statistics
"pval": round(results_wls.pvalues, 4), # WLS p-values
},
)
print(f"WLS Estimates:\n{table_wls}\n")
# Interpretation (OLS vs WLS):
# Comparing WLS to OLS (robust), the coefficient estimates differ somewhat (e.g., the 'inc'
# coefficient is 0.77 in OLS and 0.74 in WLS). The WLS standard errors are generally smaller
# than the OLS robust SEs (e.g., the SE for 'inc' is 0.0643 in WLS vs 0.0994 in OLS robust).
# This suggests WLS is more efficient *if* the assumption Var = sigma^2 * inc is correct.
# The coefficient on e401k (eligibility) is positive and significant in both models,
# but the point estimate is smaller in WLS (5.19 vs 6.89).
--- WLS Results (Weights = 1/inc) ---
WLS Estimates:
b se t pval
Intercept -16.7025 1.9580 -8.5304 0.0000
inc 0.7404 0.0643 11.5140 0.0000
I((age - 25) ** 2) 0.0175 0.0019 9.0796 0.0000
male 1.8405 1.5636 1.1771 0.2393
e401k 5.1883 1.7034 3.0458 0.0024
What if our assumed variance function ($\text{Var}(u \mid \text{inc}) = \sigma^2 \cdot \text{inc}$) is wrong? The WLS estimator will still be consistent (under standard assumptions), but its standard errors might be incorrect, and it might not be efficient. We can compute robust standard errors for the WLS estimator to get valid inference even if the weights are misspecified.
# Reload data and prepare WLS if needed
k401ksubs = wool.data("401ksubs")
k401ksubs_sub = k401ksubs[k401ksubs["fsize"] == 1].copy()
wls_weight = list(1 / k401ksubs_sub["inc"])
reg_wls = smf.wls(
formula="nettfa ~ inc + I((age-25)**2) + male + e401k",
weights=wls_weight,
data=k401ksubs_sub,
)
# --- WLS Results with Default (Non-Robust) SEs ---
results_wls_default = reg_wls.fit()
print("--- WLS Results with Default (Non-Robust) Standard Errors ---")
table_default_wls = pd.DataFrame(
{
"b": round(results_wls_default.params, 4),
"se": round(results_wls_default.bse, 4),
"t": round(results_wls_default.tvalues, 4),
"pval": round(results_wls_default.pvalues, 4),
},
)
print(f"Default WLS SEs:\n{table_default_wls}\n")
--- WLS Results with Default (Non-Robust) Standard Errors ---
Default WLS SEs:
b se t pval
Intercept -16.7025 1.9580 -8.5304 0.0000
inc 0.7404 0.0643 11.5140 0.0000
I((age - 25) ** 2) 0.0175 0.0019 9.0796 0.0000
male 1.8405 1.5636 1.1771 0.2393
e401k 5.1883 1.7034 3.0458 0.0024
# --- WLS Results with Robust (HC3) Standard Errors ---
# Fit the WLS model but request robust standard errors.
results_wls_robust = reg_wls.fit(cov_type="HC3")
print("--- WLS Results with Robust (HC3) Standard Errors ---")
table_robust_wls = pd.DataFrame(
{
"b": round(
results_wls_robust.params,
4,
), # WLS Coefficients (same as default WLS)
"se": round(results_wls_robust.bse, 4), # Robust SEs applied to WLS
"t": round(results_wls_robust.tvalues, 4), # Robust t-statistics
"pval": round(results_wls_robust.pvalues, 4), # Robust p-values
},
)
print(f"Robust WLS SEs:\n{table_robust_wls}\n")
# Interpretation (Default WLS SE vs Robust WLS SE):
# Comparing the robust WLS SEs to the default WLS SEs, we see some differences, though
# less dramatic than the OLS vs Robust OLS comparison earlier. For instance, the robust SE
# for 'inc' (0.0752) is larger than the default WLS SE (0.0643), while the robust SEs for
# 'male' (1.31) and 'e401k' (1.57) are smaller than their defaults (1.56 and 1.70).
# This suggests that the initial assumption Var = sigma^2 * inc might not perfectly capture
# the true heteroskedasticity. However, the conclusions about significance remain largely
# unchanged in this case. Using robust standard errors with WLS provides insurance against
# misspecification of the variance function used for weights.
--- WLS Results with Robust (HC3) Standard Errors ---
Robust WLS SEs:
b se t pval
Intercept -16.7025 2.2482 -7.4292 0.0000
inc 0.7404 0.0752 9.8403 0.0000
I((age - 25) ** 2) 0.0175 0.0026 6.7650 0.0000
male 1.8405 1.3132 1.4015 0.1611
e401k 5.1883 1.5743 3.2955 0.0010
Example 8.7: Demand for Cigarettes (FGLS with Estimated Weights)
Here, we don’t assume the form of heteroskedasticity beforehand. Instead, we estimate it using Feasible GLS (FGLS).
- Estimate the original model (cigarette demand) by OLS.
- Obtain the residuals and square them.
- Model the log of the squared residuals as a function of the predictors: $\log(\hat{u}^2) = \delta_0 + \delta_1 x_1 + \dots + \delta_k x_k + e$. This models the variance function $h(\mathbf{x}) = \exp(\delta_0 + \delta_1 x_1 + \dots + \delta_k x_k)$.
- Obtain the fitted values from this regression, $\hat{g}_i$, and exponentiate them to get estimates of the variance function: $\hat{h}_i = \exp(\hat{g}_i)$.
- Use weights $w_i = 1/\hat{h}_i$ in a WLS estimation of the original model.
# Load smoking data
smoke = wool.data("smoke")
# --- Step 1: OLS Estimation of the Cigarette Demand Model ---
reg_ols_smoke = smf.ols(
formula="cigs ~ np.log(income) + np.log(cigpric) +"
"educ + age + I(age**2) + restaurn", # restaurn is restriction dummy
data=smoke,
)
results_ols_smoke = reg_ols_smoke.fit()
print("--- OLS Results (Cigarette Demand) ---")
table_ols_smoke = pd.DataFrame(
{
"b": round(results_ols_smoke.params, 4),
"se": round(results_ols_smoke.bse, 4),
"t": round(results_ols_smoke.tvalues, 4),
"pval": round(results_ols_smoke.pvalues, 4),
},
)
print(f"OLS Estimates:\n{table_ols_smoke}\n")
--- OLS Results (Cigarette Demand) ---
OLS Estimates:
b se t pval
Intercept -3.6398 24.0787 -0.1512 0.8799
np.log(income) 0.8803 0.7278 1.2095 0.2268
np.log(cigpric) -0.7509 5.7733 -0.1301 0.8966
educ -0.5015 0.1671 -3.0016 0.0028
age 0.7707 0.1601 4.8132 0.0000
I(age ** 2) -0.0090 0.0017 -5.1765 0.0000
restaurn -2.8251 1.1118 -2.5410 0.0112
# --- Test for Heteroskedasticity (BP Test) ---
y_smoke, X_smoke = pt.dmatrices(
"cigs ~ np.log(income) + np.log(cigpric) + educ +age + I(age**2) + restaurn",
data=smoke,
return_type="dataframe",
)
print("--- Breusch-Pagan Test (Cigarette Demand) ---")
result_bp_smoke = sm.stats.diagnostic.het_breuschpagan(results_ols_smoke.resid, X_smoke)
bp_statistic_smoke = result_bp_smoke[0]
bp_pval_smoke = result_bp_smoke[1]
print(f"BP LM statistic: {bp_statistic_smoke:.4f}")
print(f"BP LM p-value: {bp_pval_smoke:.4f}\n")
# Interpretation (BP Test): The p-value is very small (< 0.0001), strongly rejecting
# the null of homoskedasticity. FGLS is likely warranted for efficiency.
--- Breusch-Pagan Test (Cigarette Demand) ---
BP LM statistic: 32.2584
BP LM p-value: 0.0000
# --- Step 2 & 3: Model the Variance Function ---
# Get residuals, square them, take the log (add small constant if any residuals are zero)
smoke["resid_ols"] = results_ols_smoke.resid
# Ensure no log(0) issues if resid can be exactly zero (unlikely with continuous data)
# A common fix is adding a tiny constant or dropping zero residuals if they exist.
# Here, we assume residuals are non-zero. If errors occur, check this.
smoke["logu2"] = np.log(smoke["resid_ols"] ** 2)
# Regress log(u^2) on the original predictors
reg_varfunc = smf.ols(
formula="logu2 ~ np.log(income) + np.log(cigpric) +"
"educ + age + I(age**2) + restaurn",
data=smoke, # Need to handle potential -inf if resid^2 was 0
missing="drop", # Drop rows where logu2 might be invalid
)
results_varfunc = reg_varfunc.fit()
print("--- Variance Function Estimation Results (log(u^2) regressed on X) ---")
table_varfunc = pd.DataFrame(
{
"b": round(results_varfunc.params, 4),
"se": round(results_varfunc.bse, 4),
"t": round(results_varfunc.tvalues, 4),
"pval": round(results_varfunc.pvalues, 4),
},
)
print(f"Variance Function Estimates:\n{table_varfunc}\n")
# Interpretation (Variance Function): This regression tells us which variables are
# significantly related to the log error variance. Here every predictor except log(cigpric)
# (i.e., log(income), educ, age, age^2, and restaurn) is a significant predictor of the variance.
--- Variance Function Estimation Results (log(u^2) regressed on X) ---
Variance Function Estimates:
b se t pval
Intercept -1.9207 2.5630 -0.7494 0.4538
np.log(income) 0.2915 0.0775 3.7634 0.0002
np.log(cigpric) 0.1954 0.6145 0.3180 0.7506
educ -0.0797 0.0178 -4.4817 0.0000
age 0.2040 0.0170 11.9693 0.0000
I(age ** 2) -0.0024 0.0002 -12.8931 0.0000
restaurn -0.6270 0.1183 -5.2982 0.0000
# --- Step 4 & 5: FGLS Estimation using Estimated Weights ---
# Get fitted values from the variance function regression
smoke["logh_hat"] = results_varfunc.fittedvalues
# Calculate the estimated variance h_hat = exp(logh_hat)
smoke["h_hat"] = np.exp(smoke["logh_hat"])
# Calculate the weights for WLS: w = 1 / h_hat
wls_weight_fgls = list(1 / smoke["h_hat"])
# Estimate the original demand model using WLS with estimated weights
reg_fgls_wls = smf.wls(
formula="cigs ~ np.log(income) + np.log(cigpric) +"
"educ + age + I(age**2) + restaurn",
weights=wls_weight_fgls,
data=smoke,
missing="drop", # Ensure consistent sample with variance estimation
)
results_fgls_wls = reg_fgls_wls.fit()
print("--- FGLS (WLS with Estimated Weights) Results ---")
table_fgls_wls = pd.DataFrame(
{
"b": round(results_fgls_wls.params, 4),
"se": round(results_fgls_wls.bse, 4), # FGLS standard errors
"t": round(results_fgls_wls.tvalues, 4),
"pval": round(results_fgls_wls.pvalues, 4),
},
)
print(f"FGLS Estimates:\n{table_fgls_wls}\n")
# Interpretation (FGLS vs OLS):
# Comparing the FGLS estimates to the original OLS estimates:
# - The coefficient for log(cigpric) is -2.94 in FGLS vs -0.75 in OLS; it remains insignificant.
# - The coefficient for the restaurant smoking restriction ('restaurn') is -3.46 in FGLS vs
#   -2.83 in OLS, and remains significant.
# - Standard errors generally shrink: e.g., the SE for log(income) decreased from 0.73 (OLS)
#   to 0.44 (FGLS), and log(income) becomes statistically significant under FGLS.
# - FGLS estimates are preferred for efficiency if the variance model is reasonably well-specified.
# One could also compute robust standard errors for the FGLS estimates as a further check.
--- FGLS (WLS with Estimated Weights) Results ---
FGLS Estimates:
b se t pval
Intercept 5.6355 17.8031 0.3165 0.7517
np.log(income) 1.2952 0.4370 2.9639 0.0031
np.log(cigpric) -2.9403 4.4601 -0.6592 0.5099
educ -0.4634 0.1202 -3.8570 0.0001
age 0.4819 0.0968 4.9784 0.0000
I(age ** 2) -0.0056 0.0009 -5.9897 0.0000
restaurn -3.4611 0.7955 -4.3508 0.0000
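Following up on the note above, a short sketch of that further check: refitting the FGLS model with robust (HC3) standard errors, which remain valid even if the estimated variance function is misspecified.
# Robust (HC3) standard errors applied to the FGLS fit
results_fgls_robust = reg_fgls_wls.fit(cov_type="HC3")
table_fgls_robust = pd.DataFrame(
    {
        "b": round(results_fgls_robust.params, 4),  # FGLS coefficients (unchanged)
        "se": round(results_fgls_robust.bse, 4),  # robust SEs for the FGLS fit
    },
)
print(f"FGLS Estimates with Robust (HC3) SEs:\n{table_fgls_robust}\n")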
This notebook covered the detection of heteroskedasticity (Breusch-Pagan, White tests), inference robust to heteroskedasticity (White/HC standard errors), and estimation via Weighted Least Squares (WLS/FGLS) to potentially gain efficiency when heteroskedasticity is present. Choosing between OLS with robust SEs and WLS/FGLS often depends on whether efficiency gains are a primary concern and how confident one is in specifying the variance function for WLS/FGLS.