Naive Regressions and Weak Causal Claims

Identification assumptions are key to drawing justifiable causal inferences.

Jun 23, 2023


Today, I came upon an article by Rikki Schlott in the NY Post, “Don’t stop at affirmative action: End college legacy admissions too.” The topic of Schlott’s article is controversial, and not one I intend to address in this post. Instead, I’m interested in the empirical specification implicit in her statistical analysis.

Citing NBER research on Harvard admissions from 2019, Schlott reports that “Harvard’s admissions rate averaged 6% from 2009 to 2014” and “a legacy applicant with a close relative who graduated from Harvard had a 33.6% chance of acceptance.” Schlott suggests that the 27.6 percentage point difference in acceptance rates between all applicants and legacy applicants is explained by legacy status itself. Schlott’s implicit empirical specification, then, would look something like this:

Y = A + BL + E

Here, Y is the admission outcome, whose average within a group is that group’s acceptance rate (in %); A is the acceptance rate for non-legacy applicants (in %); L is a dummy variable that equals 1 if the applicant is a legacy and 0 otherwise; B is the change in the acceptance rate (in percentage points) associated with L being “on” (i.e., equal to 1); and E is the error term.
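A minimal sketch in Python makes the specification concrete. With a single dummy regressor, OLS simply recovers the difference in group means, so B is nothing more than the legacy acceptance rate minus the comparison group's acceptance rate. The applicant counts below are hypothetical, chosen only so that the two rates match the quoted figures. Note also that the 6% in the article is the overall admissions rate, which this sketch treats as the non-legacy rate purely for illustration.

```python
import numpy as np

def naive_legacy_regression(admitted, legacy):
    """OLS of an admission indicator on a legacy dummy: Y = A + B*L + E."""
    # Design matrix: a column of ones (intercept A) and the dummy L.
    X = np.column_stack([np.ones_like(legacy, dtype=float), legacy.astype(float)])
    coef, *_ = np.linalg.lstsq(X, admitted.astype(float), rcond=None)
    return coef[0], coef[1]

# Hypothetical pool (assumed counts, not real data):
# 1,000 non-legacy applicants with 60 admits (6%),
# 500 legacy applicants with 168 admits (33.6%).
legacy = np.concatenate([np.zeros(1000, dtype=int), np.ones(500, dtype=int)])
admitted = np.concatenate([
    np.ones(60, dtype=int), np.zeros(940, dtype=int),   # non-legacy outcomes
    np.ones(168, dtype=int), np.zeros(332, dtype=int),  # legacy outcomes
])

a_hat, b_hat = naive_legacy_regression(admitted, legacy)

# A is the comparison-group rate; B is the raw gap between the two groups.
print(f"A = {100 * a_hat:.1f}%, B = {100 * b_hat:.1f} percentage points")
```

The estimated B here is just the raw 27.6 point gap. The regression adds nothing beyond subtracting one group mean from the other, which is why it cannot, on its own, separate legacy status from everything else that differs between the groups.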

To conclude that B captures the causal effect of being a legacy applicant, one must assume that the treated (legacy) and control (non-legacy) groups are otherwise identical. Clearly, such an identification assumption is entirely implausible; a balance test between the general and legacy applicant pools would find all sorts of significant differences between the two groups that independently affect Y, the acceptance rate. By not accounting for these confounding variables in her regression, Schlott is most likely, and unintentionally, exaggerating the effect of being a legacy applicant.
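The direction of the bias follows from the standard omitted-variable-bias result. As a sketch, assume for illustration that there is a single omitted factor X (say, a hypothetical index of applicant strength) with coefficient C in the fuller regression, and that E is uncorrelated with both L and X. The tildes mark the quantities from the naive regression that leaves X out:

```latex
\begin{align*}
\text{Fuller model:}\quad & Y = A + BL + CX + E \\
\text{Naive model:}\quad  & Y = \tilde{A} + \tilde{B}L + \tilde{E} \\
\text{Relationship:}\quad & \tilde{B} = B + C\,\delta,
  \qquad \delta = \mathbb{E}[X \mid L = 1] - \mathbb{E}[X \mid L = 0]
\end{align*}
```

If X raises the chance of admission (C > 0) and legacy applicants score higher on X on average (δ > 0), then the naive coefficient overstates B. With several omitted factors, the same logic applies term by term, and the net bias depends on the sum of those terms.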

I imagine that, even controlling for a matrix of factors that may differ between the control and treatment groups, the isolated legacy effect is still statistically significant and substantive. Still, the 27.6 percentage point figure is most likely positively biased.
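A small simulation illustrates how both statements can hold at once: a real legacy effect, and a naive estimate that overstates it. Everything in this sketch is assumed for illustration, including the single confounder (`strength`), the 10 point "true" legacy effect, and every other coefficient. None of it is estimated from the Harvard data.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients for y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical confounder: an applicant-strength index on [0, 1].
strength = rng.random(n)

# Assumption: stronger applicants are more likely to be legacies.
legacy = (rng.random(n) < 0.05 + 0.25 * strength).astype(float)

# Assumed "true" model: legacy status adds 10 points to the admission
# probability, and strength adds up to 30 points.
TRUE_LEGACY_EFFECT = 0.10
p_admit = 0.02 + TRUE_LEGACY_EFFECT * legacy + 0.30 * strength
admitted = (rng.random(n) < p_admit).astype(float)

# Naive specification: Y = A + B*L + E (strength omitted).
naive_b = ols(admitted, legacy)[1]

# Controlled specification: Y = A + B*L + C*strength + E.
controlled_b = ols(admitted, legacy, strength)[1]

print(f"naive B:      {100 * naive_b:.1f} points")
print(f"controlled B: {100 * controlled_b:.1f} points")
```

In this simulated world, the controlled coefficient lands near the assumed 10 point effect, while the naive coefficient comes out larger because it absorbs part of the effect of applicant strength. How large the gap is in the real admissions data is an empirical question that this sketch cannot answer.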

In any event, I thought Schlott’s piece provided useful fodder for a brief explanation of some basic econometrics. Statistics is completely unintuitive, and I am confident Schlott did not intend to make stronger claims than are warranted.

P.S. Schlott penned a wonderful piece on modern dating in the Father’s Day edition of National Review, which I encourage my readers to check out.

P.P.S. I am completely disinterested apropos the topic. (Nobody in my family went to Dartmouth, my belovéd school.)

The Inquisitive Individualist
