Introduction

Questions concerning causal relationships are at the core of science; understanding causality is also increasingly important in data applications in technology, business, and policy. At the same time, unparalleled recent growth in the scale of data collection and computation presents correspondingly exceptional opportunities to learn from these data. However, making credible inferences about causal relationships remains difficult.1

The first step toward credible causal inference is to establish what the target quantity we want to study actually is. The simplest form causal inference takes is an investigator asking how one variable causally affects another variable. We call the first of these a treatment and denote it with \(D\). We call the second of these an outcome and denote it with \(Y\). So we want to understand how changes in \(D\) causally affect the distribution of values that \(Y\) takes. However, as we’ll see, we’ll need to be more rigorous than that simple statement.

There are multiple frameworks for discussing and exploring causality. We’ll focus on the counterfactual or potential outcomes framework (Rubin 1974, 1978, 1990) which can be connected to structural causal models (Pearl 2009). For more comprehensive introductions to the ideas discussed here, see Pearl (2009), Hernán and Robins (2020), Imbens and Rubin (2015), Cunningham (2021) and similar books. These are all excellent.

Causal questions like “do carbon emissions increase overall global temperatures,” “does additional education increase career earnings,” “does an individual’s ethnicity increase their likelihood of being stopped by the police,” or “does chemotherapy, in addition to radiation, lead to better oncological outcomes” require us to analyze counterfactual versions of variables of interest. For example, in determining the effect of education on career earnings, we might analyze whether average career earnings would have been lower if all individuals had received an additional year of education, relative to if no individuals had received an additional year of education. Importantly, note where causal questions diverge from statistical ones. We’re not merely interested in whether people who have more education tend to have higher incomes; this could be the result of how hard-working individuals are: perhaps those who work harder to get more education also work harder in the workplace. We want to know whether having more education causally raises incomes, perhaps by increasing skills that are valued on the job market. Such analysis requires a formalization of the counterfactual values that variables could take under different interventions on the treatment.

Potential Outcomes and Causal Effects

In the potential outcomes framework, each unit \(i\) in the selected sample has a treatment value \(D[i]\), an outcome value \(Y[i]\), as well as other covariate measurements. These are the realized or factual versions of these variables that we’re used to in any statistics course. Now let’s define potential or counterfactual versions of these variables. Note that for now, we’ll primarily consider potential outcomes, but potential treatments and any other type of variable are possible.

Let \(Y_d[i]\) be the value that the variable \(Y\) would have taken for unit \(i\), if the variable \(D\) for unit \(i\) had been set, possibly counterfactually, to the value \(d\). Call this the potential outcome or counterfactual outcome of \(Y[i]\).

Extending beyond the single unit, \(Y_d\) is the random variable that represents the values that the variable \(Y\) would have taken, if the variable \(D\) had been set, possibly counterfactually, to the value \(d\). Call this the potential outcome or counterfactual value of \(Y\). This concept will be refined further below.

As we discussed above, causal studies are typically interested in inference over some causal effect of the treatment on the outcome. These can be formalized in counterfactual language as well.

The unit-level additive causal effect of setting \(D\) to \(d\) relative to \(D\) to \(d'\) is \(Y_{d}[i] - Y_{d'}[i]\), for unit \(i\). The average additive causal effect of setting \(D\) to \(d\) relative to \(D\) to \(d'\) is \(\mathbb{E}[Y_{d} - Y_{d'}]\). This is sometimes called the average treatment effect or ATE. Other causal effects are possible.2
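To make this concrete, here is a minimal sketch in Python using a hypothetical table of potential outcomes (the numbers, and the binary coding \(d=1\), \(d'=0\), are purely illustrative):

```python
import numpy as np

# Hypothetical potential outcomes for 5 units, with d = 1 and d' = 0.
y1 = np.array([3.0, 1.5, 4.0, 3.5, 2.0])  # Y_d[i]
y0 = np.array([2.0, 1.5, 3.0, 2.5, 1.0])  # Y_{d'}[i]

unit_effects = y1 - y0        # unit-level additive causal effects
ate = unit_effects.mean()     # average additive causal effect (ATE)

print(unit_effects)  # [1.  0.  1.  1.  1.]
print(ate)           # 0.8
```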

A key part of credible causal inference is stating explicitly what your target causal parameter (effect) is. Is it the ATE? Is it the causal risk ratio (CRR; \(\frac{p(Y_d=1)}{p(Y_{d'}=1)}\))? Or the average (additive) treatment effect for units that were actually treated (ATT; \(\mathbb{E}[Y_{d} - Y_{d'}\mid D=d]\))? Something else? Without specifying your target, it is hard to determine whether you’ve actually estimated it.

Sample vs Population ATE

Often we focus on estimating the average causal effect in a particular sample. We call this the Sample Average Treatment Effect (SATE). Here, uncertainty arises from the treatment assignment process (and whether it is random relative to the outcome), along with the usual statistical uncertainty. When this is our target, the inferences we make are limited to the sample in our study.

Compare this to the Population Average Treatment Effect (PATE), which is the average causal effect for the population from which the sample was drawn. Estimating it therefore requires knowledge about the sampling process. Here, we need to account for two sources of uncertainty: uncertainty from the sampling process and uncertainty from treatment assignment.

If sampling is random from a given population, then \(\mathbb{E}[SATE] = PATE\).

Fundamental Problem of Causal Inference

Now this all sounds great, and it seems we need only calculate individual causal effects using the potential outcomes we’re interested in and then take their mean (or whatever summary we want). Sadly, no.

For each unit \(i\), only one of its counterfactual values for \(Y\) (or any other variable) can be observed. This is because each unit either is or isn’t treated in reality; we can’t do both. This is sometimes called the “fundamental problem of causal inference.” (Rubin 1978; Holland 1986; Imbens and Rubin 2015; Westreich et al. 2015) Thus, unit-level causal effects cannot be observed and we’ll have to make assumptions in order to get at causal effects, rather than just statistical associations.

Assumptions

So causal inference introduces assumptions, external to the data, about the causal processes that gave rise to the data. The hope is that the assumptions are believable and that they combine to allow us to use data to try to make statements about causal effects.

We first assume that the counterfactual value \(Y_d[i]\) for unit \(i\) does not depend on the treatment received by other units, and that slight variations in the specific treatments associated with the value \(D=d\) are not relevant and all lead to the same outcome \(Y_d\). These are the components of the stable unit treatment value assumption (SUTVA) (Rubin 1990).3

Next, we assume that the value of \(Y_d[i]\) is the same as the value of \(Y[i]\) if \(D[i]\) is set to the treatment value that unit \(i\) was observed to experience, which is called the consistency assumption. We can see this through the following equation, assuming a binary treatment: \(Y[i] = D[i]Y_1[i] + (1-D[i])Y_0[i]\). This is a sort of switch, where \(Y[i] = Y_1[i]\) if \(D[i]=1\) and \(Y[i] = Y_0[i]\) if \(D[i]=0\).
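A minimal sketch of this switch for a hypothetical binary treatment (the arrays and values are illustrative, not from any real data):

```python
import numpy as np

# Hypothetical potential outcomes and an (arbitrary) binary treatment assignment.
y1 = np.array([3.0, 1.5, 4.0, 3.5, 2.0])
y0 = np.array([2.0, 1.5, 3.0, 2.5, 1.0])
d = np.array([1, 0, 0, 1, 0])

# Consistency as a switch: Y[i] = D[i] * Y_1[i] + (1 - D[i]) * Y_0[i].
y = d * y1 + (1 - d) * y0
print(y)  # [3.  1.5 3.  3.5 1. ] -- the potential outcome matching the received treatment
```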

We also typically assume that measurement error is minimal and sometimes that we have a random and representative sample of the population we’re interested in. These are strong assumptions and much work focuses on these. But we make them here for simplicity.

Finally, we need some way to use the data we do have, which contains outcomes for some treated and some untreated units but never both potential outcomes, to estimate a causal effect. As in much of statistics, we do this by borrowing information across units, as we’ll see more clearly in the next section. In particular, the assumption that allows for this is usually some sort of ignorability or conditional ignorability assumption. These assumptions state that the treatment was assigned as though it were randomly assigned with respect to the outcome, within strata of the variables being conditioned on. Ignorability takes the form of the independence statement \(Y_d \perp\!\!\!\!\perp D\). Conditional ignorability takes the form of the conditional independence statement \(Y_d \perp\!\!\!\!\perp D|Z\), for some covariates \(Z\). Again, ignorability states that, unconditionally, the treatment was assigned essentially randomly, with respect to the outcome. And conditional ignorability states that, within strata of the specified set of variables, the treatment was assigned essentially randomly, with respect to the outcome. These ignorability assumptions are at the core of much of the concerns about validity of causal statements and require the researcher to defend them using knowledge about the data generating process external to the data. For example, ignorability might be plausible in a randomized controlled trial, but perhaps only conditional ignorability, with a rich set of covariates, is plausible for a given observational study.

We also need the positivity assumption: each treatment level \(D=d\) has positive probability within each stratum \(Z=z\) that itself has positive probability. See Hernán and Robins (2020), chapter 3. None of these assumptions should be made without careful consideration.

Identification

Together, these assumptions can be used to equate the target causal effect with expressions of quantities that can be estimated from observed data.

We use the term identification to mean equating a causal effect (or, more generally, an expression containing counterfactual quantities) with an expression involving only quantities that can be estimated from observed data.

It is key to note that causal inference consists of two steps. First, we must identify the target causal effect with an expression involving quantities that can be estimated from observed data. Second, we must use the observed data to actually estimate this expression.

Let us see how we might identify a causal quantity using the assumptions we’ve introduced. First, let’s look at an ignorability assumption and the ATE.

\[\begin{aligned} ATE &= \mathbb{E}[Y_{d} - Y_{d'}] \\ &= \mathbb{E}[Y_{d}] - \mathbb{E}[Y_{d'}] \\ &= \mathbb{E}[Y_{d}|D=d] - \mathbb{E}[Y_{d'}|D=d'] &&\text{by }Y_d \perp\!\!\!\!\perp D \\ &= \mathbb{E}[Y|D=d] - \mathbb{E}[Y|D=d'] &&\text{by consistency} \\ \end{aligned}\]

This can be estimated using data by what we call the difference in means (DIM):

\[DIM = \frac{1}{n_d}\sum_{D_i = d} Y_i - \frac{1}{n_{d'}}\sum_{D_i = d'} Y_i\]
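A minimal sketch of the DIM estimator, assuming a binary treatment and a hypothetical randomized data-generating process (all names and numbers are illustrative):

```python
import numpy as np

def difference_in_means(y, d):
    """Difference in mean outcomes between units with d == 1 and units with d == 0."""
    return y[d == 1].mean() - y[d == 0].mean()

# Hypothetical randomized experiment, so ignorability holds by design.
rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)            # random binary assignment
y = 1.0 + 2.0 * d + rng.normal(size=n)    # true ATE is 2.0 in this simulation

print(difference_in_means(y, d))          # should be close to 2.0
```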

Now let’s look at a conditional ignorability assumption and the ATE.

\[\begin{aligned} ATE &= \mathbb{E}[Y_{d} - Y_{d'}] \\ &= \mathbb{E}[Y_{d}] - \mathbb{E}[Y_{d'}] \\ &= \sum_z \mathbb{E}[Y_{d}|Z=z] p(Z=z) - \sum_z \mathbb{E}[Y_{d'}|Z=z] p(Z=z) \\ &= \sum_z \mathbb{E}[Y_{d}|D=d,Z=z] p(Z=z) - \sum_z \mathbb{E}[Y_{d'}|D=d', Z=z] p(Z=z) &&\text{by }Y_d \perp\!\!\!\!\perp D|Z \\ &= \sum_z \mathbb{E}[Y|D=d,Z=z] p(Z=z) - \sum_z \mathbb{E}[Y|D=d', Z=z] p(Z=z) &&\text{by consistency} \end{aligned}\]

And this can also be estimated with data in a similar way.
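For instance, with a discrete \(Z\), the identified expression suggests a plug-in (standardization) estimator: take the difference in means within each stratum of \(Z\) and average over the marginal distribution of \(Z\). Here is a sketch using a hypothetical confounded data-generating process:

```python
import numpy as np

def standardized_ate(y, d, z):
    """Plug-in estimate of sum_z (E[Y|D=1,Z=z] - E[Y|D=0,Z=z]) * p(Z=z) for discrete z."""
    est = 0.0
    for z_val in np.unique(z):
        in_stratum = z == z_val
        diff = y[in_stratum & (d == 1)].mean() - y[in_stratum & (d == 0)].mean()
        est += diff * in_stratum.mean()
    return est

# Hypothetical confounded data: Z causes both D and Y.
rng = np.random.default_rng(1)
n = 100_000
z = rng.integers(0, 2, size=n)
d = rng.binomial(1, 0.2 + 0.6 * z)                  # treatment probability depends on Z
y = 1.0 + 2.0 * d + 3.0 * z + rng.normal(size=n)    # true ATE is 2.0

print(y[d == 1].mean() - y[d == 0].mean())   # naive DIM: biased upward by confounding
print(standardized_ate(y, d, z))             # should be close to 2.0
```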

Structural Causal Model and DAGs

How do we determine whether some set of covariates provides conditional ignorability? How do we determine if there exists any set of covariates that can provide conditional ignorability? These questions are difficult, and, in practice, we can never be certain that some set of covariates will provide the ignorability we need. The onus is on researchers to make plausible arguments for ignorability. To aid in this, we can build a model of how the treatment and outcome causally relate to each other and relevant covariates. Such a model should capture all the structural information that is available about the causal mechanisms relating important variables, as well as the uncertainty about such relationships. The causal relationships can be non-parametrically encoded in a structural causal model. I adapt the following definition from Pearl (2009), chapter 7.

A structural causal model is a triple \(M = <U,V,F>\), where

  • \(U\) is a set of background variables (one for each variable in \(V\)) that are determined by factors outside the model (typically these variables will be unobserved);
  • \(V\) is a set \(\{V_1,V_2,\dots,V_n\}\) of variables that are determined by variables in the model (i.e., by \(U \cup V\); these variables can be observed or unobserved);
  • \(F\) is a set \(\{f_1,f_2,\dots,f_n\}\) of functions such that each \(f_i\) is a mapping from \(U_i\cup {PA}_i\) to \(V_i\), where \(U_i \subset U\) and \({PA}_i \subset V \backslash V_i\) and the entire set \(F\) forms a mapping from \(U\) to \(V\) (in general the specific functional forms of these will not be known). That is, each \(f_i\) in \(v_i = f_i({pa}_i,u_i)\), \(i=1,\dots,n\), assigns a value to \(V_i\) that depends on the values of a select set of variables in \(V\cup U\), and the entire set \(F\) has a unique solution \(F(u)\). (The choice of \({PA}_i\) reflects the researcher’s understanding of the causal relationships between variables.)

When \(M\) is taken in addition to \(p(u)\) it is called a probabilistic structural causal model, where \(p(u)\) is a probability function defined over the domain of \(U\).

As stated in Pearl (2009), every causal model, \(M\), is associated with a causal graph, \(G\), in which each variable in \(V\) is represented by a node and directed edges represent parent-child causal relationships, where the edge points from the parent to the child.4 Such causal graphs are referred to as directed acyclic graphs (DAGs).

Here is a simple structural causal model and DAG with a treatment \(D\), outcome \(Y\), and a common-cause confounder \(Z\). These non-parametrically capture the causal structure between these variables.

\[\begin{aligned} Z &= f_Z(U_Z) \\ D &= f_D(Z,U_D) \\ Y &= f_Y(D,Z,U_Y) \\ \end{aligned}\]
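A minimal simulation of this structural causal model, with hypothetical choices for the functions \(f_Z\), \(f_D\), \(f_Y\) and the distributions of the background variables \(U_Z\), \(U_D\), \(U_Y\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Background variables U_Z, U_D, U_Y (modeled here as independent noise; a choice, not a given).
u_z = rng.normal(size=n)
u_d = rng.normal(size=n)
u_y = rng.normal(size=n)

# Structural equations with hypothetical functional forms.
z = (u_z > 0).astype(float)                # Z = f_Z(U_Z)
d = (0.5 * z + u_d > 0).astype(float)      # D = f_D(Z, U_D)
y = 2.0 * d + 3.0 * z + u_y                # Y = f_Y(D, Z, U_Y)
```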

Two more definitions adapted from Pearl (2009), chapter 7 will connect the causal model with the counterfactual quantities that we recently introduced.

Let \(M\) be a causal model, \(D\) be a set of variables in \(V\), and \(d\) a particular realization of \(D\). A submodel \(M_d\) of \(M\) is the causal model \(M_d = <U,V,F_d>\), where \(F_d\) is formed by deleting the functions for the variables in \(D\) and replacing them with constant functions \(D=d\).

Let \(D\) and \(Y\) be two subsets of variables in \(V\). The counterfactual value of \(Y\) had \(D\) been set to \(d\), written \(Y_d\), is the solution for \(Y\) of the set of equations \(F_d\), given the realized values of the background variables.
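In code, the submodel \(M_d\) amounts to deleting the structural equation for \(D\), plugging in the constant \(d\), and solving the remaining equations with the same background draws. A sketch, repeating the hypothetical setup from the simulation above:

```python
import numpy as np

# Repeat the hypothetical simulation of M from the sketch above.
rng = np.random.default_rng(2)
n = 100_000
u_z = rng.normal(size=n)
u_d = rng.normal(size=n)
u_y = rng.normal(size=n)
z = (u_z > 0).astype(float)
d = (0.5 * z + u_d > 0).astype(float)
y = 2.0 * d + 3.0 * z + u_y

# Submodel M_d: delete f_D, replace it with a constant, and solve with the same background draws.
y_1 = 2.0 * 1.0 + 3.0 * z + u_y   # counterfactual Y_{d=1}
y_0 = 2.0 * 0.0 + 3.0 * z + u_y   # counterfactual Y_{d=0}

print((y_1 - y_0).mean())                    # ATE in this model: exactly 2.0
print(y[d == 1].mean() - y[d == 0].mean())   # naive comparison: biased, since Z confounds D and Y
```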

Twin Networks and SWIGs

You may have noticed that potential outcomes do not appear in the DAGs that we’ve seen so far. How can we use a graph that doesn’t include potential outcomes to make statements about potential outcomes?

Balke and Pearl (1994) and Pearl (2009) introduce a graphical tool for visualizing actual-world variables and counterfactual-world variables together. This tool, called the twin network, can help us in determining ignorability for a causal model or graph under consideration. The idea is essentially to create twin graphs: one that represents the actual world and is identical to the original causal graph, and one that represents the counterfactual world described by the submodel in which we intervene on certain variables to set them to specific values. One can think of a butterfly opening its wings, as the actual world is reflected in the counterfactual world. The two graphs are identical except that no edges enter into the intervention nodes on the counterfactual side, which captures the graphical version of the submodel, and all variables downstream of the intervention nodes are subscripted with lower-case versions of the intervention node labels. The two graphs share the background variable nodes and any other variables that are not downstream of the intervention nodes, since these are not changed in the counterfactual world.

We are then able to check conditional independence relationships between the observed variables and the counterfactual variables by evaluating their d-separation5 in the twin network. When two variables are d-separated in the twin network, they are independent or conditionally independent. When we look at the conditional independence of the potential outcome and treatment, we can evaluate the type of ignorability we need. (Pearl 2009)

Let’s consider the above example. We can see that the potential outcome \(Y_d\) is d-separated from \(D\) by \(Z\), which means that \(Y_d \perp\!\!\!\!\perp D|Z\). This is precisely the type of ignorability we need to identify causal quantities in this causal model.

Richardson and Robins (2013b) and Richardson and Robins (2013a) also introduce a graphical approach to visualizing how counterfactual values relate to factual world variables called Single World Intervention Graphs (SWIGs). SWIGs are often simpler than twin networks. However, they do not provide the same picture of both the actual world and the counterfactual world that twin networks do; for example, actual world post-treatment variables do not appear in SWIGs, though they can be used to block generalized non-causal paths between the treatment and potential outcome of interest in some settings.

The Adjustment Criterion

Shpitser, VanderWeele, and Robins (2010) present a graphical criterion for determining ignorability directly from the DAG, rather than requiring the user to draw the twin network or SWIG. This is known as the adjustment criterion, and it generalizes Pearl’s backdoor criterion. The following definitions are adapted from Shpitser, VanderWeele, and Robins (2010).

A set of nodes \(Z\) in \(G\) satisfies the adjustment criterion relative to \(D\) (treatment) and \(Y\) (outcome) if

  • No element of \(Z\) lies on a causal path from \(D\) to \(Y\) or is a descendant of a node on a causal path from \(D\) to \(Y\). (An element of \(Z\) could be a descendant of \(D\) itself, if it is not on a causal path from \(D\) to \(Y\).)
  • \(Z\) blocks every non-causal path between \(D\) and \(Y\).

Assume the adjustment criterion holds for \(Z\) relative to \(D\) (treatment) and \(Y\) (outcome) in \(G\).

Then \(Y_d \perp\!\!\!\!\perp D|Z\).

It’s easy to see using the DAG above that \(Z\) satisfies the adjustment criterion and so \(Y_d \perp\!\!\!\!\perp D|Z\).
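As a sanity check, we can verify this graphically in code. The sketch below (my own hypothetical check, not part of any cited source) uses networkx to delete the edges out of \(D\) and test whether \(Z\) d-separates \(D\) from \(Y\); in this simple DAG the adjustment criterion reduces to Pearl’s backdoor criterion, which is what this tests. It assumes a networkx version that exposes `nx.d_separated` (newer releases rename it `nx.is_d_separator`).

```python
import networkx as nx

# DAG for the running example: Z -> D, Z -> Y, D -> Y.
g = nx.DiGraph([("Z", "D"), ("Z", "Y"), ("D", "Y")])

# Backdoor-style check: remove the edges out of D, then ask whether Z d-separates D from Y.
g_no_out = g.copy()
g_no_out.remove_edges_from(list(g.out_edges("D")))

print(nx.d_separated(g_no_out, {"D"}, {"Y"}, {"Z"}))  # True: Z is a valid adjustment set
```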

I’ll stop this note here; future notes will cover estimation and go into more depth on causal inference.

References

Balke, Alexander, and J. Pearl. 1994. “Probabilistic Evaluation of Counterfactual Queries.” In AAAI.
Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press. https://mixtape.scunning.com/index.html.
Hernán, M., and J. Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60. https://doi.org/10.1080/01621459.1986.10478354.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press. https://doi.org/10.1017/CBO9781139025751.
Pearl, Judea. 2009. Causality. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511803161.
Richardson, T. S., and J. M. Robins. 2013a. “Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality.” Working Paper, Center for Statistics and the Social Sciences, University of Washington, Seattle, no. 128.
———. 2013b. “Single World Intervention Graphs: A Primer.” Working Paper, University of Washington, Seattle.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. https://doi.org/10.1037/h0037350.
———. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” The Annals of Statistics 6 (1): 34–58. https://doi.org/10.1214/aos/1176344064.
———. 1990. “Formal Mode of Statistical Inference for Causal Effects.” Journal of Statistical Planning and Inference 25 (3): 279–92. https://doi.org/10.1016/0378-3758(90)90077-8.
Shpitser, Ilya, Tyler VanderWeele, and James M. Robins. 2010. “On the Validity of Covariate Adjustment for Estimating Causal Effects.” In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, UAI 2010, 527–36. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, UAI 2010. AUAI Press.
Westreich, Daniel, Jessie K Edwards, Stephen R Cole, Robert W Platt, Sunni L Mumford, and Enrique F Schisterman. 2015. “Imputation Approaches for Potential Outcomes in Causal Inference.” International Journal of Epidemiology 44 (5): 1731–37. https://doi.org/10.1093/ije/dyv135.

  1. This might look familiar... the intro paragraph is courtesy of my home page: adam-rohde.github.io.↩︎

  2. For example, we might be interested only in the subset of units that actually gets treated or perhaps in the ratio (rather than difference) in potential outcomes.↩︎

  3. There are lots of very interesting and important settings in which this definitely does not hold (e.g., vaccination). And it is possible that it doesn’t hold exactly very often. But we make this assumption here and will not dwell on it, as, perhaps incorrectly, is the typical practice in most causal studies.↩︎

  4. For each, \(v_i = f_i({pa}_i,u_i)\), the variable \(V_i\) is the child and the variables \({PA}_i\) are the parents. Note that usually only the variables in \(V\) are included in the graph. In the discussion in subsequent sections, we will argue that certain of the \(U\) variables should also be included in graphs. We will also start to refer to both \({PA}_i\) and \(U_i\) as parents of \(V_i\).↩︎

  5. D-separation essentially means that a path is blocked. Specifically, a node set \(Z\) d-separates (blocks) a path if either some \(W\) on the path is a collider and neither \(W\) nor any of its descendants is in \(Z\), or some \(W\) on the path is not a collider and is in \(Z\). See Pearl (2009), chapter 1 for details.↩︎