This notebook explores the use of qualitative regressors, also known as categorical variables, in multiple regression analysis. Qualitative regressors allow us to incorporate non-numeric factors like gender, marital status, occupation, or region into our regression models. We will use dummy variables (also called indicator variables) to represent these qualitative factors numerically, enabling us to analyze their impact on a dependent variable alongside quantitative regressors.
We will use the `statsmodels` library in Python to perform Ordinary Least Squares (OLS) regressions and the `wooldridge` library to access datasets from the textbook "Introductory Econometrics" by Jeffrey M. Wooldridge.
import numpy as np # noqa: F401
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import wooldridge as wool
7.1 Linear Regression with Dummy Variables as Regressors
Dummy variables are binary variables that take on the value 1 or 0 to indicate the presence or absence of a specific attribute or category. In regression analysis, they are used to include qualitative information in a quantitative framework.
Example 7.1: Hourly Wage Equation

In this example, we will investigate how gender affects hourly wages, controlling for education, experience, and tenure. We will use the `wage1` dataset from the `wooldridge` package. The dataset includes information on wages, education, experience, tenure, and gender (`female` = 1 if female, 0 if male).
wage1 = wool.data("wage1")
reg = smf.ols(formula="wage ~ female + educ + exper + tenure", data=wage1)
results = reg.fit()
# print regression table:
table = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table: \n{table}\n")
table:
b se t pval
Intercept -1.5679 0.7246 -2.1640 0.0309
female -1.8109 0.2648 -6.8379 0.0000
educ 0.5715 0.0493 11.5836 0.0000
exper 0.0254 0.0116 2.1951 0.0286
tenure 0.1410 0.0212 6.6632 0.0000
Explanation:

- `wage ~ female + educ + exper + tenure`: This formula specifies the regression model. We are regressing `wage` (hourly wage) on `female` (dummy variable for gender), `educ` (years of education), `exper` (years of experience), and `tenure` (years with current employer).
- `smf.ols(...)`: This function from `statsmodels.formula.api` sets up an Ordinary Least Squares (OLS) regression.
- `reg.fit()`: This fits the regression model to the data.
- The printed table displays the regression coefficients (`b`), standard errors (`se`), t-statistics (`t`), and p-values (`pval`) for each regressor.
Interpretation of Results:

- Intercept: The estimated intercept is approximately -1.57. This is the predicted hourly wage for a male (`female` = 0) with zero years of education, experience, and tenure. It is not practically meaningful here (a negative wage, for a combination of values that hardly occurs), but it is a necessary part of the regression model.
- female: The coefficient for `female` is -1.8109. This suggests that, holding education, experience, and tenure constant, women earn, on average, about $1.81 less per hour than men in this dataset. The negative coefficient indicates a wage gap between women and men.
- educ: The coefficient for `educ` is 0.5715. This means that, holding other factors constant, each additional year of education is associated with an increase in hourly wage of approximately $0.57.
- exper: The coefficient for `exper` is 0.0254. For each additional year of experience, hourly wage is predicted to increase by about $0.03, holding other factors constant.
- tenure: The coefficient for `tenure` is 0.1410. For each additional year of tenure with the current employer, hourly wage is predicted to increase by about $0.14, holding other factors constant.
- P-values: The p-values for `female`, `educ`, and `tenure` are very small (close to zero), indicating that these variables are statistically significant at conventional significance levels (e.g., 0.05 or 0.01). The p-value for `exper` (0.0286) is larger but still below 0.05, so experience is statistically significant at the 5% level, though not at the 1% level, once the other factors are controlled for.
This simple example demonstrates how dummy variables can be used to quantify the effect of qualitative factors like gender on a dependent variable like wages in a multiple regression framework.
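To make the dummy-variable interpretation concrete, here is a small illustrative sketch (not part of the textbook example) that predicts wages for two hypothetical workers who differ only in gender; the values chosen for education, experience, and tenure are arbitrary.

# illustrative predictions for two hypothetical workers who differ only in gender
# (arbitrary values for educ, exper, and tenure):
newdata = pd.DataFrame(
    {
        "female": [0, 1],
        "educ": [12, 12],
        "exper": [5, 5],
        "tenure": [2, 2],
    },
)
pred = results.predict(newdata)
print(f"predicted wages: \n{pred}\n")
# the difference between the two predictions equals the coefficient on female:
print(f"difference: {round(pred[1] - pred[0], 4)}\n")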
Example 7.6: Log Hourly Wage Equation
Using the logarithm of the dependent variable, such as wage, is common in economics. When the dependent variable is in log form, the coefficients on the regressors (after multiplying by 100) can be interpreted as approximate percentage changes in the dependent variable for a one-unit change in the regressor.
In this example, we will use the natural logarithm of hourly wage as the dependent variable and explore the interaction effect between marital status and gender, along with education, experience, and tenure (including quadratic terms for experience and tenure to capture potential non-linear relationships).
wage1 = wool.data("wage1")
reg = smf.ols(
formula="np.log(wage) ~ married*female + educ + exper +"
"I(exper**2) + tenure + I(tenure**2)",
data=wage1,
)
results = reg.fit()
# print regression table:
table = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table: \n{table}\n")
table:
b se t pval
Intercept 0.3214 0.1000 3.2135 0.0014
married 0.2127 0.0554 3.8419 0.0001
female -0.1104 0.0557 -1.9797 0.0483
married:female -0.3006 0.0718 -4.1885 0.0000
educ 0.0789 0.0067 11.7873 0.0000
exper 0.0268 0.0052 5.1118 0.0000
I(exper ** 2) -0.0005 0.0001 -4.8471 0.0000
tenure 0.0291 0.0068 4.3016 0.0000
I(tenure ** 2) -0.0005 0.0002 -2.3056 0.0215
Explanation:

- `np.log(wage) ~ married*female + educ + exper + I(exper**2) + tenure + I(tenure**2)`:
  - `np.log(wage)`: The dependent variable is now the natural logarithm of `wage`.
  - `married*female`: This includes the main effects of `married` and `female` as well as their interaction term (`married:female`). The interaction term allows us to see if the effect of being female differs depending on marital status (or vice versa).
  - `I(exper**2)` and `I(tenure**2)`: We use `I(...)` to include squared terms of `exper` and `tenure` in the formula. This allows for a quadratic relationship between experience/tenure and log wage, capturing potential diminishing returns to experience and tenure.
Interpretation of Results:

- Intercept: The intercept is approximately 0.3214. This is the predicted log wage for a single (married=0) male (female=0) with zero education, experience, and tenure.
- married: The coefficient for `married` is 0.2127. This suggests that married men earn approximately 21.3% more than single men (the reference group), holding other factors constant. (For small coefficients, the percentage change is approximately 100 times the coefficient.)
- female: The coefficient for `female` is -0.1104. This is the log wage difference between single women (female=1, married=0) and single men (female=0, married=0): single women earn approximately 11.0% less than single men, holding other factors constant.
- married:female: The interaction term `married:female` has a coefficient of -0.3006. This is the additional effect of being female when married, compared to being female and single. To find the wage difference between married women and married men, we combine the coefficients for `female` and `married:female`: -0.1104 + (-0.3006) = -0.4110. So, married women earn roughly 41% less than married men, holding other factors constant (the exact percentage effect is computed in the sketch below).
- educ, exper, I(exper**2), tenure, I(tenure**2): These coefficients represent the effects of education, experience, and tenure on log wage, controlling for gender and marital status. For example, the coefficient for `educ` (0.0789) suggests that each additional year of education is associated with approximately a 7.89% increase in hourly wage, holding other factors constant. The negative coefficient on `I(exper**2)` suggests diminishing returns to experience.
- P-values: All variables, including the interaction term `married:female`, are statistically significant at conventional levels, indicating that marital status and gender, both individually and in interaction, play a significant role in explaining wage variations.
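Because some of these coefficients are not small, the "100 times the coefficient" rule is only a rough approximation. As an illustration (not part of the textbook code), the exact percentage effects can be computed as 100 * (exp(b) - 1) from the estimated parameters:

# approximate vs. exact percentage effects (illustration):
b = results.params
# single women vs. single men:
print(f"approx. single women vs. single men: {100 * b['female']:.1f}%")
print(f"exact single women vs. single men: {100 * (np.exp(b['female']) - 1):.1f}%")
# married women vs. married men (female + married:female):
diff = b["female"] + b["married:female"]
print(f"approx. married women vs. married men: {100 * diff:.1f}%")
print(f"exact married women vs. married men: {100 * (np.exp(diff) - 1):.1f}%")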
7.2 Boolean Variables
In Python (and programming in general), boolean variables are used to represent true/false or yes/no conditions. In the context of regression with dummy variables, a boolean variable that is `True` or `False` corresponds directly to 1 or 0, respectively.

In the `wage1` dataset, the `female` variable is already coded as 1 for female and 0 for male. We can explicitly create a boolean variable, although it is not strictly necessary in this case, since `statsmodels` (and regression formulas in general) can interpret 1/0 variables as dummy variables.
wage1 = wool.data("wage1")
# regression with boolean variable:
wage1["isfemale"] = wage1["female"] == 1
reg = smf.ols(formula="wage ~ isfemale + educ + exper + tenure", data=wage1)
results = reg.fit()
# print regression table:
table = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table: \n{table}\n")
table:
b se t pval
Intercept -1.5679 0.7246 -2.1640 0.0309
isfemale[T.True] -1.8109 0.2648 -6.8379 0.0000
educ 0.5715 0.0493 11.5836 0.0000
exper 0.0254 0.0116 2.1951 0.0286
tenure 0.1410 0.0212 6.6632 0.0000
Explanation:

- `wage1["isfemale"] = wage1["female"] == 1`: This line creates a new column named `isfemale` in the `wage1` DataFrame. It assigns `True` to `isfemale` if the value in the `female` column is 1 (meaning female), and `False` otherwise. In essence, `isfemale` is a boolean representation of the `female` dummy variable.
- `reg = smf.ols(formula="wage ~ isfemale + educ + exper + tenure", data=wage1)`: We then run the same regression as in Example 7.1, but now using `isfemale` instead of `female` in the formula.
Interpretation of Results:

The regression table is virtually identical to the one in Example 7.1. The coefficient, standard error, t-statistic, and p-value for `isfemale` are the same as for `female` in the previous example.
Key takeaway:

Using a boolean variable (`isfemale`) that evaluates to `True` or `False` is functionally equivalent to using a dummy variable coded as 1 or 0 in regression analysis with `statsmodels`. `statsmodels` internally treats boolean `True` as 1 and `False` as 0 when such a variable is used as a regressor. Therefore, creating an explicit boolean variable like `isfemale` does not change the regression outcome compared to directly using the 1/0 coded `female` variable.
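A quick illustrative check confirms this equivalence: casting the boolean column to integers reproduces the original 0/1 dummy exactly.

# illustrative check: the boolean column, cast to integers, equals the 0/1 dummy
same = (wage1["isfemale"].astype(int) == wage1["female"]).all()
print(f"isfemale (as int) equals female: {same}\n")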
7.3 Categorical Variables
Categorical variables represent groups or categories, and they can have more than two categories. Examples include occupation, industry, region, or education levels (e.g., high school, bachelor’s, master’s, PhD). To include categorical variables in a regression model, we typically use dummy variable encoding. If a categorical variable has k categories, we include k-1 dummy variables in the regression to avoid perfect multicollinearity (the dummy variable trap). One category is chosen as the “reference” or “baseline” category, and the coefficients on the dummy variables represent the difference in the dependent variable between each category and the reference category, holding other regressors constant.
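To see how this dummy encoding works mechanically, here is a small toy illustration (the `region` variable below is made up and not part of any dataset used in this notebook) using `pd.get_dummies` with `drop_first=True`, which produces the k-1 dummy columns:

# toy illustration of dummy encoding for a categorical variable with k = 3 categories:
toy = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})
dummies = pd.get_dummies(toy["region"], prefix="region", drop_first=True)
print(f"dummies: \n{dummies}\n")
# 'north' (the first category) is dropped and serves as the reference category;
# the remaining k - 1 = 2 columns are the dummy variables.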
In this section, we will use the `CPS1985.csv` dataset, which contains data from the Current Population Survey in 1985. We will investigate how gender and occupation affect wages, controlling for education and experience.
CPS1985 = pd.read_csv("../data/CPS1985.csv")
# rename variable to make outputs more compact:
CPS1985["oc"] = CPS1985["occupation"]
# table of categories and frequencies for two categorical variables:
freq_gender = pd.crosstab(CPS1985["gender"], columns="count")
print(f"freq_gender: \n{freq_gender}\n")
freq_occupation = pd.crosstab(CPS1985["oc"], columns="count")
print(f"freq_occupation: \n{freq_occupation}\n")
freq_gender:
col_0 count
gender
female 245
male 289
freq_occupation:
col_0 count
oc
management 55
office 97
sales 38
services 83
technical 105
worker 156
Explanation:

- `CPS1985 = pd.read_csv("../data/CPS1985.csv")`: Loads the dataset from a CSV file. You might need to adjust the path `../data/CPS1985.csv` depending on where you have saved the data file relative to your notebook.
- `CPS1985["oc"] = CPS1985["occupation"]`: Creates a shorter alias `oc` for the `occupation` variable for more compact output in regression tables.
- `pd.crosstab(...)`: This pandas function creates cross-tabulation tables, which are useful for displaying the frequency of each category of a categorical variable.
  - `freq_gender = pd.crosstab(CPS1985["gender"], columns="count")`: Counts the occurrences of each gender category.
  - `freq_occupation = pd.crosstab(CPS1985["oc"], columns="count")`: Counts the occurrences of each occupation category.
Output:

The `freq_gender` output shows the counts for 'female' and 'male' in the dataset. The `freq_occupation` output shows the counts for each occupation category present in the dataset ('management', 'office', 'sales', 'services', 'technical', and 'worker'). This gives us an overview of the distribution of these categorical variables.
# directly using categorical variables in regression formula:
reg = smf.ols(
formula="np.log(wage) ~ education + experience + C(gender) + C(oc)",
data=CPS1985,
)
results = reg.fit()
# print regression table:
table = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table: \n{table}\n")
table:
b se t pval
Intercept 0.9050 0.1717 5.2718 0.0000
C(gender)[T.male] 0.2238 0.0423 5.2979 0.0000
C(oc)[T.office] -0.2073 0.0776 -2.6699 0.0078
C(oc)[T.sales] -0.3601 0.0936 -3.8455 0.0001
C(oc)[T.services] -0.3626 0.0818 -4.4305 0.0000
C(oc)[T.technical] -0.0101 0.0740 -0.1363 0.8916
C(oc)[T.worker] -0.1525 0.0763 -1.9981 0.0462
education 0.0759 0.0101 7.5449 0.0000
experience 0.0119 0.0017 7.0895 0.0000
Explanation:

- `formula="np.log(wage) ~ education + experience + C(gender) + C(oc)"`:
  - `C(gender)` and `C(oc)`: The `C()` function in the `statsmodels` formula language tells the regression to treat `gender` and `oc` (occupation) as categorical variables. `statsmodels` automatically creates dummy variables for each category except the reference category. By default, the first category in alphabetical order is used as the reference: 'female' for `gender` and 'management' for `oc`.
Interpretation of Results:

- Reference Categories: 'female' is the reference category for gender and 'management' is the reference category for occupation (each is the alphabetically first category and therefore does not appear in the table).
- C(gender)[T.male]: The coefficient for `C(gender)[T.male]` represents the log wage difference between males and females, holding education, experience, and occupation constant. Since the coefficient (0.2238) is positive and significant, males earn roughly 22% more than otherwise comparable females.
- C(oc)[T.category_name]: For each occupation category except the reference category 'management', there is a coefficient such as `C(oc)[T.office]`, `C(oc)[T.sales]`, etc. For example, `C(oc)[T.office]` represents the log wage difference between individuals in 'office' occupations and individuals in the reference occupation ('management'), holding gender, education, and experience constant.
- education and experience: The coefficients for `education` and `experience` are interpreted as before, representing the effect of each additional year of education or experience on log wage, holding gender, occupation, and the other variable constant.
- P-values: P-values help assess the statistical significance of each coefficient. Small p-values indicate that the corresponding variable is a statistically significant predictor of log wage.
# rerun regression with different reference category:
reg_newref = smf.ols(
formula="np.log(wage) ~ education + experience + "
'C(gender, Treatment("male")) + '
'C(oc, Treatment("technical"))',
data=CPS1985,
)
results_newref = reg_newref.fit()
# print results:
table_newref = pd.DataFrame(
{
"b": round(results_newref.params, 4),
"se": round(results_newref.bse, 4),
"t": round(results_newref.tvalues, 4),
"pval": round(results_newref.pvalues, 4),
},
)
print(f"table_newref: \n{table_newref}\n")
table_newref:
b se t pval
Intercept 1.1187 0.1765 6.3393 0.0000
C(gender, Treatment("male"))[T.female] -0.2238 0.0423 -5.2979 0.0000
C(oc, Treatment("technical"))[T.management] 0.0101 0.0740 0.1363 0.8916
C(oc, Treatment("technical"))[T.office] -0.1972 0.0678 -2.9082 0.0038
C(oc, Treatment("technical"))[T.sales] -0.3500 0.0863 -4.0541 0.0001
C(oc, Treatment("technical"))[T.services] -0.3525 0.0750 -4.7030 0.0000
C(oc, Treatment("technical"))[T.worker] -0.1425 0.0705 -2.0218 0.0437
education 0.0759 0.0101 7.5449 0.0000
experience 0.0119 0.0017 7.0895 0.0000
Explanation:

- `C(gender, Treatment("male"))`: Here, we explicitly specify "male" as the reference category for the `gender` variable using `Treatment("male")` within the `C()` function. Now, the coefficient for `C(gender, Treatment("male"))[T.female]` represents the log wage difference between females and males (with males as the baseline).
- `C(oc, Treatment("technical"))`: Similarly, we set "technical" as the reference category for the `oc` (occupation) variable. Coefficients for the other occupation categories are now interpreted relative to the 'technical' occupation group.
Interpretation of Results:

- Reference Categories (Changed): Now, 'male' is the reference category for gender, and 'technical' is the reference category for occupation.
- C(gender, Treatment("male"))[T.female]: This coefficient now represents the log wage difference between females and males, with males as the reference group. It is exactly the negative of the male coefficient from the previous regression (-0.2238 vs. 0.2238).
- C(oc, Treatment("technical"))[T.category_name]: For each occupation category except 'technical', the coefficient represents the log wage difference between that occupation and the 'technical' occupation, holding gender, education, and experience constant. For example, `C(oc, Treatment("technical"))[T.management]` represents the log wage difference between 'management' and 'technical' occupations.
Key takeaway:

By using `C()` and `Treatment()`, we can easily incorporate categorical variables into our regression models and control which category serves as the reference group. The interpretation of the coefficients for categorical variables is always relative to the chosen reference category.
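If you are unsure which category was chosen as the reference, one simple (illustrative) check is to look at the estimated parameter names: any category that does not appear in a `[T.<category>]` term is the baseline.

# the parameter names reveal which dummy columns were created;
# the category missing from the [T.<category>] terms is the reference group:
print(f"default coding: \n{list(results.params.index)}\n")
print(f"explicit reference categories: \n{list(results_newref.params.index)}\n")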
7.3.1 ANOVA Tables
ANOVA (Analysis of Variance) tables in regression context help to assess the overall significance of groups of regressors. They decompose the total variance in the dependent variable into components attributable to different sets of regressors. In the context of categorical variables, ANOVA tables can be used to test the joint significance of all dummy variables representing a categorical variable.
CPS1985 = pd.read_csv("../data/CPS1985.csv")
# run regression:
reg = smf.ols(
formula="np.log(wage) ~ education + experience + gender + occupation",
data=CPS1985,
)
results = reg.fit()
# print regression table:
table_reg = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table_reg: \n{table_reg}\n")
table_reg:
b se t pval
Intercept 0.9050 0.1717 5.2718 0.0000
gender[T.male] 0.2238 0.0423 5.2979 0.0000
occupation[T.office] -0.2073 0.0776 -2.6699 0.0078
occupation[T.sales] -0.3601 0.0936 -3.8455 0.0001
occupation[T.services] -0.3626 0.0818 -4.4305 0.0000
occupation[T.technical] -0.0101 0.0740 -0.1363 0.8916
occupation[T.worker] -0.1525 0.0763 -1.9981 0.0462
education 0.0759 0.0101 7.5449 0.0000
experience 0.0119 0.0017 7.0895 0.0000
Explanation:

- `formula="np.log(wage) ~ education + experience + gender + occupation"`: In this formula, we use `gender` and `occupation` directly as categorical variables without explicitly wrapping them in `C()`. `statsmodels` can often detect categorical variables automatically (especially string or object columns) and create the dummy variables behind the scenes. However, it is generally better to be explicit and use `C()` for clarity and control, especially when you want to specify reference categories.
# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f"table_anova: \n{table_anova}\n")
table_anova:
sum_sq df F PR(>F)
gender 5.414018 1.0 28.067296 1.727015e-07
occupation 7.152529 5.0 7.416013 9.805485e-07
education 10.980589 1.0 56.925450 2.010374e-13
experience 9.695055 1.0 50.261001 4.365391e-12
Residual 101.269451 525.0 NaN NaN
Explanation:

- `table_anova = sm.stats.anova_lm(results, typ=2)`: This function from `statsmodels.stats` calculates the ANOVA table for the fitted regression model (`results`). `typ=2` specifies Type II ANOVA, which is generally recommended for regression models, especially when there is some imbalance in the design (which is common in observational data).
Interpretation of ANOVA Table:

The ANOVA table output includes the following columns:

- sum_sq (sum of squares): The sum of squares explained by each regressor (or group of regressors).
- df (degrees of freedom): The degrees of freedom for each regressor (or group of regressors).
- F (F-statistic): The F-statistic for testing the null hypothesis that all coefficients associated with a particular regressor (or group of regressors) are jointly zero.
- PR(>F) (p-value): The p-value associated with the F-statistic.

For example, in the ANOVA table output:

- gender: The row for `gender` gives an F-statistic and a p-value. This tests the null hypothesis that gender has no effect on wage once education, experience, and occupation are controlled for. A small p-value (e.g., < 0.05) suggests that gender is a statistically significant factor in explaining wage variation, even after controlling for other variables.
- occupation: Similarly, the row for `occupation` tests the joint significance of all occupation dummy variables. It tests the null hypothesis that occupation has no effect on wage once education, experience, and gender are controlled for. A small p-value indicates that occupation is a significant factor.
- education and experience: The ANOVA table also provides F-tests for `education` and `experience`, testing their significance in the model.
- Residual: This row represents the unexplained variance (residuals) in the model.
Key takeaway:
ANOVA tables provide F-tests for the joint significance of regressors, especially useful for categorical variables with multiple dummy variables. They help assess the overall contribution of each set of regressors to explaining the variation in the dependent variable.
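The occupation row of this Type II ANOVA table should match (up to rounding) the F-test obtained by comparing the full model with a restricted model that omits `occupation`. The nested-model comparison below is an illustrative sketch; the restricted model is introduced here only for this purpose.

# nested-model comparison: restricted model without occupation vs. full model
reg_restr = smf.ols(
    formula="np.log(wage) ~ education + experience + gender",
    data=CPS1985,
)
results_restr = reg_restr.fit()
# the F-statistic in the second row should match the occupation row of the ANOVA table:
table_comp = sm.stats.anova_lm(results_restr, results)
print(f"table_comp: \n{table_comp}\n")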
7.4 Breaking a Numeric Variable Into Categories
Sometimes, it might be useful to convert a continuous numeric variable into a categorical variable. This can be done for several reasons:
- Non-linear effects: If you suspect that the effect of a numeric variable is not linear, categorizing it can capture non-linearities without explicitly modeling a complex functional form.
- Easier interpretation: Categorical variables can sometimes be easier to interpret, especially when dealing with ranges or levels of a variable.
- Dealing with outliers or extreme values: Grouping extreme values into categories can reduce the influence of outliers.
Example 7.8: Effects of Law School Rankings on Starting Salaries

In this example, we will examine the effect of law school rankings on starting salaries of graduates. The `lawsch85` dataset from `wooldridge` contains information on law school rankings (`rank`), starting salaries (`salary`), LSAT scores (`LSAT`), GPA (`GPA`), library volumes (`libvol`), and cost (`cost`).
lawsch85 = wool.data("lawsch85")
# define cut points for the rank:
cutpts = [0, 10, 25, 40, 60, 100, 175]
# create categorical variable containing ranges for the rank:
lawsch85["rc"] = pd.cut(
lawsch85["rank"],
bins=cutpts,
labels=["(0,10]", "(10,25]", "(25,40]", "(40,60]", "(60,100]", "(100,175]"],
)
# display frequencies:
freq = pd.crosstab(lawsch85["rc"], columns="count")
print(f"freq: \n{freq}\n")
freq:
col_0 count
rc
(0,10] 10
(10,25] 16
(25,40] 13
(40,60] 18
(60,100] 37
(100,175] 62
Explanation:

- `lawsch85 = wool.data("lawsch85")`: Loads the `lawsch85` dataset.
- `cutpts = [0, 10, 25, 40, 60, 100, 175]`: Defines the cut points for creating categories based on law school rank. These cut points divide the rank variable into different rank ranges.
- `lawsch85["rc"] = pd.cut(...)`: The `pd.cut()` function bins the `rank` variable into categories.
  - `lawsch85["rank"]`: The variable to be categorized.
  - `bins=cutpts`: The cut points that define the boundaries of the categories.
  - `labels=[...]`: Labels for each category. For example, ranks between 0 and 10 (exclusive of 0, inclusive of 10) are labeled "(0,10]".
- `freq = pd.crosstab(lawsch85["rc"], columns="count")`: Creates a frequency table showing the number of law schools in each rank category.
Output:

The `freq` output shows the number of law schools falling into each defined rank category (e.g., how many schools are ranked between 0 and 10, between 10 and 25, etc.).
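Note that `pd.cut` uses right-closed intervals by default, so a school ranked exactly 10 falls into the "(0,10]" bin rather than "(10,25]". A tiny illustration with made-up rank values:

# illustration of right-closed binning with the cut points defined above:
example_ranks = pd.Series([1, 10, 11, 175])
print(pd.cut(example_ranks, bins=cutpts))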
# run regression:
reg = smf.ols(
formula='np.log(salary) ~ C(rc, Treatment("(100,175]")) +'
"LSAT + GPA + np.log(libvol) + np.log(cost)",
data=lawsch85,
)
results = reg.fit()
# print regression table:
table_reg = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table_reg: \n{table_reg}\n")
table_reg:
b se t pval
Intercept 9.1653 0.4114 22.2770 0.0000
C(rc, Treatment("(100,175]"))[T.(0,10]] 0.6996 0.0535 13.0780 0.0000
C(rc, Treatment("(100,175]"))[T.(10,25]] 0.5935 0.0394 15.0493 0.0000
C(rc, Treatment("(100,175]"))[T.(25,40]] 0.3751 0.0341 11.0054 0.0000
C(rc, Treatment("(100,175]"))[T.(40,60]] 0.2628 0.0280 9.3991 0.0000
C(rc, Treatment("(100,175]"))[T.(60,100]] 0.1316 0.0210 6.2540 0.0000
LSAT 0.0057 0.0031 1.8579 0.0655
GPA 0.0137 0.0742 0.1850 0.8535
np.log(libvol) 0.0364 0.0260 1.3976 0.1647
np.log(cost) 0.0008 0.0251 0.0335 0.9734
Explanation:

- `formula='np.log(salary) ~ C(rc, Treatment("(100,175]")) + LSAT + GPA + np.log(libvol) + np.log(cost)'`:
  - `C(rc, Treatment("(100,175]"))`: We include the categorized rank variable `rc` as a categorical variable using `C()`. We also specify `Treatment("(100,175]")` to set the category "(100,175]" (law schools ranked 100-175, the lowest rank category) as the reference category.
  - `LSAT + GPA + np.log(libvol) + np.log(cost)`: These are the other control variables included in the regression: LSAT score, GPA, log of library volumes, and log of cost.
Interpretation of Results:

- Reference Category: The reference category is law schools ranked "(100,175]".
- C(rc)[T.(0,10]] to C(rc)[T.(60,100]]: The coefficients on the rank-category dummies represent the log salary difference between law schools in each rank range and law schools in the reference category "(100,175]", holding LSAT, GPA, library volumes, and cost constant. For example, the coefficient on the "(0,10]" dummy (0.6996) gives the approximate percentage salary premium for graduates of law schools ranked (0,10] compared to those from schools ranked (100,175]; because this coefficient is large, the exact percentage effect is considerably bigger than the rough 70% approximation (see the sketch below). As expected, the coefficients are positive and decrease as the rank range worsens, i.e., better-ranked schools are associated with higher starting salaries.
- LSAT, GPA, np.log(libvol), np.log(cost): These coefficients represent the effects of LSAT score, GPA, log of library volumes, and log of cost on log salary, controlling for law school rank category.
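The conversion from the "(0,10]" coefficient to the exact percentage premium, as an illustrative computation:

# exact percentage salary premium implied by the (0,10] coefficient (illustration):
b_top10 = results.params['C(rc, Treatment("(100,175]"))[T.(0,10]]']
print(f"approximate premium: {100 * b_top10:.1f}%")
print(f"exact premium: {100 * (np.exp(b_top10) - 1):.1f}%")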
# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f"table_anova: \n{table_anova}\n")
table_anova:
sum_sq df F PR(>F)
C(rc, Treatment("(100,175]")) 1.868867 5.0 50.962988 1.174406e-28
LSAT 0.025317 1.0 3.451900 6.551320e-02
GPA 0.000251 1.0 0.034225 8.535262e-01
np.log(libvol) 0.014327 1.0 1.953419 1.646748e-01
np.log(cost) 0.000008 1.0 0.001120 9.733564e-01
Residual 0.924111 126.0 NaN NaN
Explanation & Interpretation:

As before, `sm.stats.anova_lm(results, typ=2)` calculates the ANOVA table for this regression model. The row for `C(rc, Treatment("(100,175]"))` provides an F-test and p-value for the joint significance of all the dummy variables representing the rank categories. This tests the null hypothesis that law school rank (categorized as `rc`) has no effect on starting salary once LSAT, GPA, library volumes, and cost are controlled for. The very small p-value in this row indicates that law school rank category is a statistically significant factor in predicting starting salaries.
7.5 Interactions and Differences in Regression Functions Across Groups
We can allow for regression functions to differ across groups by including interaction terms between group indicators (dummy variables for qualitative variables) and quantitative regressors. This allows the slope coefficients of the quantitative regressors to vary across different groups defined by the qualitative variable.
gpa3 = wool.data("gpa3")
# model with full interactions with female dummy (only for spring data):
reg = smf.ols(
formula="cumgpa ~ female * (sat + hsperc + tothrs)",
data=gpa3,
subset=(gpa3["spring"] == 1),
)
results = reg.fit()
# print regression table:
table = pd.DataFrame(
{
"b": round(results.params, 4),
"se": round(results.bse, 4),
"t": round(results.tvalues, 4),
"pval": round(results.pvalues, 4),
},
)
print(f"table: \n{table}\n")
table:
b se t pval
Intercept 1.4808 0.2073 7.1422 0.0000
female -0.3535 0.4105 -0.8610 0.3898
sat 0.0011 0.0002 5.8073 0.0000
hsperc -0.0085 0.0014 -6.1674 0.0000
tothrs 0.0023 0.0009 2.7182 0.0069
female:sat 0.0008 0.0004 1.9488 0.0521
female:hsperc -0.0005 0.0032 -0.1739 0.8621
female:tothrs -0.0001 0.0016 -0.0712 0.9433
Explanation:

- `gpa3 = wool.data("gpa3")`: Loads the `gpa3` dataset, which contains data on college GPA, SAT scores, high school percentile (`hsperc`), total hours (`tothrs`), and gender (`female`), among other variables.
- `subset=(gpa3["spring"] == 1)`: We use only the data for the spring semester (`spring == 1`).
- `formula="cumgpa ~ female * (sat + hsperc + tothrs)"`:
  - `female * (sat + hsperc + tothrs)`: This specifies a full interaction model between the dummy variable `female` and the quantitative regressors `sat`, `hsperc`, and `tothrs`. It is equivalent to including `female + sat + hsperc + tothrs + female:sat + female:hsperc + female:tothrs`.
  - This model allows both the intercept and the slopes of `sat`, `hsperc`, and `tothrs` to differ between females and males.
Interpretation of Results:

- Intercept: This is the intercept for the reference group, males (`female` = 0). It represents the predicted cumulative GPA for a male with `sat` = 0, `hsperc` = 0, and `tothrs` = 0.
- female: This is the difference in intercepts between females and males, evaluated at `sat` = 0, `hsperc` = 0, and `tothrs` = 0.
- sat, hsperc, tothrs: These are the slopes of `sat`, `hsperc`, and `tothrs`, respectively, for the reference group (males). They represent the change in cumulative GPA for a one-unit increase in each of these variables for males.
- female:sat, female:hsperc, female:tothrs: These are the interaction terms.
  - `female:sat`: Represents the difference in the slope of `sat` between females and males. So, the slope of `sat` for females is (coefficient of `sat` + coefficient of `female:sat`).
  - Similarly, `female:hsperc` and `female:tothrs` represent the differences in the slopes of `hsperc` and `tothrs` between females and males.
To get the regression equation for each group:

- For males (`female` = 0):
  `cumgpa = b_Intercept + b_sat * sat + b_hsperc * hsperc + b_tothrs * tothrs`,
  where `b_Intercept`, `b_sat`, `b_hsperc`, and `b_tothrs` are the coefficients on `Intercept`, `sat`, `hsperc`, and `tothrs`, respectively.
- For females (`female` = 1):
  `cumgpa = (b_Intercept + b_female) + (b_sat + b_female:sat) * sat + (b_hsperc + b_female:hsperc) * hsperc + (b_tothrs + b_female:tothrs) * tothrs`,
  where `b_female`, `b_female:sat`, `b_female:hsperc`, and `b_female:tothrs` are the coefficients on `female`, `female:sat`, `female:hsperc`, and `female:tothrs`, respectively.
# F-Test for H0 (the interaction coefficients of 'female' are zero):
hypotheses = ["female:sat = 0", "female:hsperc = 0", "female:tothrs = 0"]
ftest = results.f_test(hypotheses)
fstat = ftest.statistic
fpval = ftest.pvalue
print(f"fstat: {fstat}\n")
print(f"fpval: {fpval}\n")
fstat: 8.179111637044619
fpval: 2.544637191829608e-06
Explanation:

- `hypotheses = ["female:sat = 0", "female:hsperc = 0", "female:tothrs = 0"]`: We define the null hypotheses we want to test. Here, we are testing whether the coefficients of all interaction terms involving `female` are jointly zero. If these coefficients are zero, there is no difference in the slopes of `sat`, `hsperc`, and `tothrs` between females and males.
- `ftest = results.f_test(hypotheses)`: Performs an F-test of these joint hypotheses.
- `fstat = ftest.statistic` and `fpval = ftest.pvalue`: Extract the F-statistic and p-value from the test results.
Interpretation:

- p-value (fpval): If the p-value is small (e.g., < 0.05), we reject the null hypothesis, meaning that at least one of the interaction coefficients is significantly different from zero, so the effect of at least one of the quantitative regressors (`sat`, `hsperc`, `tothrs`) on `cumgpa` differs between females and males. If the p-value were large, we would fail to reject the null hypothesis, suggesting insufficient evidence that the slopes differ across genders. Here, the F-statistic is about 8.18 with a p-value of roughly 2.5e-06, so we clearly reject the null hypothesis: at least one slope differs significantly between females and males. (An equivalent nested-model test is sketched below.)
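An equivalent way to carry out this joint test, sketched here for illustration, is to estimate a restricted model that drops the three interaction terms and compare it to the full model with `compare_f_test`:

# restricted model without the female interactions (same spring subsample):
reg_restr = smf.ols(
    formula="cumgpa ~ female + sat + hsperc + tothrs",
    data=gpa3,
    subset=(gpa3["spring"] == 1),
)
results_restr = reg_restr.fit()
# F-test of the full interaction model against the restricted model:
fstat_r, fpval_r, df_diff = results.compare_f_test(results_restr)
print(f"fstat_r: {fstat_r}\n")
print(f"fpval_r: {fpval_r}\n")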
gpa3 = wool.data("gpa3")
# estimate model for males (& spring data):
reg_m = smf.ols(
formula="cumgpa ~ sat + hsperc + tothrs",
data=gpa3,
subset=(gpa3["spring"] == 1) & (gpa3["female"] == 0),
)
results_m = reg_m.fit()
# print regression table:
table_m = pd.DataFrame(
{
"b": round(results_m.params, 4),
"se": round(results_m.bse, 4),
"t": round(results_m.tvalues, 4),
"pval": round(results_m.pvalues, 4),
},
)
print(f"table_m: \n{table_m}\n")
table_m:
b se t pval
Intercept 1.4808 0.2060 7.1894 0.0000
sat 0.0011 0.0002 5.8458 0.0000
hsperc -0.0085 0.0014 -6.2082 0.0000
tothrs 0.0023 0.0009 2.7362 0.0066
Explanation:

- `subset=(gpa3["spring"] == 1) & (gpa3["female"] == 0)`: We now subset the data to include only spring-semester observations for male students (`female == 0`).
- `formula="cumgpa ~ sat + hsperc + tothrs"`: We run a regression of `cumgpa` on `sat`, `hsperc`, and `tothrs` for this male-only subsample.
Interpretation:

The regression table `table_m` shows the estimated regression equation specifically for male students. The coefficients are the intercept and the slopes of `sat`, `hsperc`, and `tothrs` for males only. These point estimates match the coefficients on `Intercept`, `sat`, `hsperc`, and `tothrs` in the full interaction model (the first regression in this section), since those were the coefficients for the male (reference) group; only the standard errors differ slightly, because the interaction model pools both groups when estimating the error variance.
# estimate model for females (& spring data):
reg_f = smf.ols(
formula="cumgpa ~ sat + hsperc + tothrs",
data=gpa3,
subset=(gpa3["spring"] == 1) & (gpa3["female"] == 1),
)
results_f = reg_f.fit()
# print regression table:
table_f = pd.DataFrame(
{
"b": round(results_f.params, 4),
"se": round(results_f.bse, 4),
"t": round(results_f.tvalues, 4),
"pval": round(results_f.pvalues, 4),
},
)
print(f"table_f: \n{table_f}\n")
table_f:
b se t pval
Intercept 1.1273 0.3616 3.1176 0.0025
sat 0.0018 0.0003 5.1950 0.0000
hsperc -0.0090 0.0029 -3.0956 0.0027
tothrs 0.0022 0.0014 1.5817 0.1174
Explanation:

- `subset=(gpa3["spring"] == 1) & (gpa3["female"] == 1)`: This time, we subset the data to spring-semester observations for female students (`female == 1`).
- `formula="cumgpa ~ sat + hsperc + tothrs"`: We run the same regression model for this female-only subsample.
Interpretation:

The regression table `table_f` shows the estimated regression equation specifically for female students. The coefficients are the intercept and the slopes of `sat`, `hsperc`, and `tothrs` for females only. These coefficients correspond to the combined coefficients from the full interaction model. For example:

- The intercept for females (in `table_f`) equals (coefficient of `Intercept` + coefficient of `female`) from the interaction model (`table`), up to rounding.
- The slope of `sat` for females (in `table_f`) equals (coefficient of `sat` + coefficient of `female:sat`) from the interaction model (`table`).
- And so on for `hsperc` and `tothrs`.

(A numerical check of this correspondence is sketched at the end of this section.)
Key takeaway:
- Interaction terms allow regression functions to differ across groups. Full interaction models allow both intercepts and slopes to vary.
- Testing the joint significance of interaction terms helps determine if there are statistically significant differences in the slopes across groups.
- Estimating separate regressions for each group is an alternative way to explore and compare regression functions across groups, and the results should be consistent with those from a full interaction model.
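As an illustrative check of the last point, the female-group coefficients implied by the full interaction model (stored in `results` earlier in this section) can be computed and placed next to the coefficients from the separate female-only regression:

# female-group coefficients implied by the interaction model vs. separate regression:
b = results.params  # full interaction model estimated at the start of this section
implied_female = pd.Series(
    {
        "Intercept": b["Intercept"] + b["female"],
        "sat": b["sat"] + b["female:sat"],
        "hsperc": b["hsperc"] + b["female:hsperc"],
        "tothrs": b["tothrs"] + b["female:tothrs"],
    },
)
comparison = pd.DataFrame(
    {
        "interaction model": round(implied_female, 4),
        "separate model": round(results_f.params, 4),
    },
)
print(f"comparison: \n{comparison}\n")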