
In this post, I summarize some famous examples of selection bias in finance and economics. Hopefully they serve as a useful reminder that even when identification and estimation are sound, there are always lingering concerns about external validity.


Overview

In general, the data we use come from a third-party provider that collects them according to its own criteria. This means there are two broad types of selection bias to worry about:

  1. Criteria-based Selection: This is when a subset of the data is systematically excluded due to some criteria, imposed either by the data provider or by the accompanying regulation.
  2. Survivorship-based Selection: This happens when entities drop out of the sample naturally because they no longer survive. For example, a firm that delists is no longer included in Compustat.
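
To make the two mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the log-normal firm sizes, the outcome that rises with size, the size cutoff standing in for a provider's coverage rule, and the exit threshold standing in for delisting. None of it is calibrated to Compustat or any real data set.

```python
import random
import statistics


def simulate_selection(n_firms=10_000, size_cutoff=1.0, exit_threshold=-0.5, seed=0):
    """Compare the mean outcome in a simulated population with the mean in
    two selected samples. All parameters are hypothetical."""
    rng = random.Random(seed)
    firms = []
    for _ in range(n_firms):
        log_size = rng.gauss(0.0, 1.0)  # hypothetical log firm size
        # Hypothetical outcome (say, profitability) that rises with size.
        outcome = 0.5 * log_size + rng.gauss(0.0, 1.0)
        firms.append((log_size, outcome))

    population = [o for _, o in firms]

    # Criteria-based selection: only firms above a size cutoff are covered.
    criteria_sample = [o for s, o in firms if s > size_cutoff]

    # Survivorship-based selection: firms with poor outcomes exit the sample.
    survivor_sample = [o for _, o in firms if o > exit_threshold]

    return {
        "population": statistics.fmean(population),
        "criteria": statistics.fmean(criteria_sample),
        "survivors": statistics.fmean(survivor_sample),
    }


means = simulate_selection()
for name, value in means.items():
    print(f"{name:>10}: mean outcome = {value:.3f}")
```

Under these assumptions both selected samples overstate the population mean: the first because coverage depends on a variable correlated with the outcome, the second because coverage depends on the outcome itself.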

Example 1. Compustat

Compustat, perhaps one of the most widely used sources of accounting and financial data, primarily covers public companies. This matters when studying topics that may affect private and public firms differently, such as the role of financing constraints or investment behavior.
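
As a rough sketch of why this matters for external validity, suppose (purely as an assumption for illustration) that the effect of some policy declines with firm size, and that only firms above a size cutoff are "public" and therefore observed. The function name, the cutoff, and the effect equation below are all hypothetical.

```python
import random
import statistics


def average_effects(n_firms=20_000, public_cutoff=1.5, seed=1):
    """Return the average effect in the full simulated population and in the
    subset of large ("public") firms. All parameters are hypothetical."""
    rng = random.Random(seed)
    all_effects, public_effects = [], []
    for _ in range(n_firms):
        log_size = rng.gauss(0.0, 1.0)  # hypothetical log firm size
        # Assumed heterogeneity: smaller firms respond more strongly.
        effect = 1.0 - 0.3 * log_size + rng.gauss(0.0, 0.1)
        all_effects.append(effect)
        if log_size > public_cutoff:  # only large firms are observed
            public_effects.append(effect)
    return statistics.fmean(all_effects), statistics.fmean(public_effects)


population_effect, public_effect = average_effects()
print(f"average effect, all firms:    {population_effect:.3f}")
print(f"average effect, public firms: {public_effect:.3f}")
```

In this setup the estimate from the public-firm sample can be perfectly well identified for public firms and still understate the effect for the typical firm, which is the external validity concern in a nutshell.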

To see the sampling bias of Compustat most bluntly, consider this table from Crouzet and Mehrotra (2020), who juxtapose the Quarterly Financial Report (QFR) data with Compustat:

[Table: Crouzet and Mehrotra (2020), QFR sample versus Compustat manufacturing segment]

The last column reports the average value for the Compustat manufacturing segment, while the first four columns report the distribution of equivalent statistics for the QFR sample. The Compustat average is close to the average size of the top 1% of the QFR sample!

So how much does this affect results? Take a look at this fascinating figure from Zwick and Mahon (2017), who estimate the effect of temporary tax incentives on investment:

[Figure: Zwick and Mahon (2017), effect of temporary tax incentives on investment]