When the Fundamental Problem of Causal Inference Ain't No Problem

Brady Neal

The Fundamental Problem of Causal Inference

Consider the potential outcomes $Y_i(t)$ and $Y_i(t')$, where $Y_i(t)$ denotes the outcome $Y$ that unit (individual) $i$ would have if unit $i$ receives treatment $t$. If you prefer do-notation, $P \big(Y | \text{do}(T = t), i \big)$ is the do-notation equivalent of $Y_i(t)$. The fundamental problem of causal inference is that it is impossible to observe both $Y_i(t)$ and $Y_i(t')$ (Holland, 1986). This is because you cannot give unit $i$ treatment $t$, observe $Y_i(t)$, rewind time to before treatment $t$ was given, and give unit $i$ treatment $t'$ to observe $Y_i(t')$. We illustrate this below in a table with question marks in place of the potential outcomes that cannot be observed.

Unit	$T$	$Y$	$Y_i(t)$	$Y_i(t')$
1	$t$	$1$	$1$	$?$
2	$t'$	$1$	$?$	$1$
3	$t'$	$0$	$?$	$0$
4	$t$	$0$	$0$	$?$
5	$t'$	$1$	$?$	$1$
6	$t$	$0$	$0$	$?$

You might be thinking you can just give unit $i$ treatment $t$ to observe $Y_i(t)$, wait some time, and then give unit $i$ treatment $t'$. However, the second potential outcome you observe will not necessarily be $Y_i(t')$. Really, it is $Y_i(t, \text{wait}, t')$, which denotes the compound treatment of first giving treatment $t$, then waiting, and then giving treatment $t'$. If you assume waiting sufficiently long nullifies the effects of the initial treatment $t$, you get $Y_i(t, \text{wait}, t') = Y_i(\text{wait}, t')$. If you further assume that all relevant confounding factors are exactly the same after waiting as they were when you gave the initial treatment $t$, then you get $Y_i(\text{wait}, t') = Y_i(t')$. However, these are two strong assumptions.

The fundamental problem means that it is impossible to calculate the unit-level causal effect $Y_i(t') - Y_i(t)$, because at most one of the two terms is ever observed for a given unit. This is why much of causal inference is focused on average treatment effects (ATE) $\mathbb{E}[Y(t') - Y(t)] = \mathbb{E}[Y(t')] - \mathbb{E}[Y(t)]$, which are possible to calculate under assumptions such as SUTVA and ignorability (see, e.g., Morgan & Winship (2014, Sections 2.5 and 4.3.1)).
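To make the ATE idea concrete, here is a minimal sketch that computes a difference in means from the six rows of the table above. It only equals the ATE under the assumptions just mentioned (SUTVA and ignorability), and six units is far too few for a trustworthy estimate. The names `observed` and `difference_in_means` are mine, chosen for illustration.

```python
from statistics import mean

# Rows of the table above: (unit, treatment received, observed outcome Y).
observed = [
    (1, "t", 1),
    (2, "t'", 1),
    (3, "t'", 0),
    (4, "t", 0),
    (5, "t'", 1),
    (6, "t", 0),
]


def difference_in_means(rows):
    """Estimate E[Y(t')] - E[Y(t)] using only the outcomes we observed.

    This is a valid ATE estimate only if treatment assignment is ignorable
    (an assumption, not something the table itself can verify).
    """
    y_t = [y for _, treatment, y in rows if treatment == "t"]
    y_t_prime = [y for _, treatment, y in rows if treatment == "t'"]
    return mean(y_t_prime) - mean(y_t)


ate_estimate = difference_in_means(observed)
print(f"Estimated ATE: {ate_estimate:.3f}")
```

Notice that no question mark from the table is ever filled in. The estimate works at the population level and says nothing about any single unit's effect.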

No Problem in Simulations

However, experiments in a virtual world such as a computer program don’t suffer from the fundamental problem of causal inference. Whenever our experiments are taking place in a program where we can change the treatment by changing the program, we can observe both $Y_i(t)$ and $Y_i(t')$. To observe $Y_i(t)$, we just run the program with treatment = t. All we have to do to observe $Y_i(t')$ is modify the program by changing the assignment statement to treatment = t' and rerun the program. It’s that simple. What are referred to as “simulations” in causal inference are just programs, so the fundamental problem of causal inference ain’t no problem there either.
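Here is a minimal sketch of what that looks like. The function `run_program`, its made-up treatment effects, and the use of a fixed random seed to pin down "unit $i$" are all assumptions for illustration, not a description of any particular simulation.

```python
import random


def run_program(treatment, seed=0):
    """Hypothetical 'program': the seed fixes everything except the treatment."""
    rng = random.Random(seed)        # the seed plays the role of unit i
    noise = rng.gauss(0.0, 1.0)      # identical draw no matter which treatment we pick
    effect = {"t": 0.0, "t'": 1.0}[treatment]  # made-up effects of each treatment
    return effect + noise


# Observe both potential outcomes for the same unit by rerunning the program.
y_t = run_program("t")
y_t_prime = run_program("t'")
unit_level_effect = y_t_prime - y_t
print(f"Y_i(t) = {y_t:.3f}, Y_i(t') = {y_t_prime:.3f}, effect = {unit_level_effect:.3f}")
```

The important design choice is that all randomness flows through a seeded generator. If the program drew from an unseeded source, the second run would be a different unit and the difference would no longer be a unit-level effect.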

This is not a particularly novel observation, but I think it is important to emphasize to machine learning researchers because so many of their experiments actually take place in computer simulations that they have a high degree of control over. I’ll present some highly simplified examples below. Think of these as seeds for you to come up with your own examples that could be much more complicated.

Supervised Learning Example

Consider a neural network where everything about the learning process is fixed except for the learning rate, so you are looking at a particular network $i$ (unit $i$). Your “treatment” is the learning rate $\alpha$. Your outcome of interest is the test error $\mathcal{E}$. You have already tried a learning rate of $.01$ and observed $\mathcal{E}_i(\alpha = .01) = .15$. You are wondering what would happen if you were to change the learning rate to $.005$. Because you have control of the program, you just change the learning rate to $.005$, keeping everything else in the program the same. You then observe $\mathcal{E}_i(\alpha = .005) = .13$. This procedure may be very familiar to you. You bypassed the fundamental problem of causal inference and computed a unit-level causal effect: $\mathcal{E}_i(.005) - \mathcal{E}_i(.01) = .13 - .15 = -.02$. The fact that this causal effect is negative is why you would choose a learning rate of $.005$ instead of a learning rate of $.01$.
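The same procedure can be sketched in a few lines. To keep it self-contained I swap the neural network for a one-parameter linear model trained by gradient descent on synthetic data, so the numbers it prints are not the $.15$ and $.13$ above, and the sign of the effect may well differ. The function `train_and_evaluate` and all of its defaults are hypothetical.

```python
import random


def train_and_evaluate(learning_rate, seed=0, steps=100):
    """Train a toy model y = w * x and return its test mean squared error.

    The seed fixes the data, the train/test split, and the initial weight,
    so the learning rate is the only thing that changes between runs.
    """
    rng = random.Random(seed)

    def make_data(n):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # made-up ground truth
        return xs, ys

    train_x, train_y = make_data(50)
    test_x, test_y = make_data(50)
    w = rng.gauss(0.0, 1.0)  # initial weight

    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(train_x, train_y)) / len(train_x)
        w -= learning_rate * grad

    return sum((w * x - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_x)


error_a = train_and_evaluate(learning_rate=0.01)
error_b = train_and_evaluate(learning_rate=0.005)
unit_level_effect = error_b - error_a  # E_i(.005) - E_i(.01) for this toy unit
print(f"E_i(.01) = {error_a:.4f}, E_i(.005) = {error_b:.4f}, effect = {unit_level_effect:.4f}")
```

In a real deep learning setup, holding "everything else" fixed takes more care than one seed: data ordering, weight initialization, and nondeterministic GPU kernels would all need to be pinned down too.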

Reinforcement Learning Example

Consider a reinforcement learning agent in state $s$ at time $t$ ($S_t = s$). I will omit the subscript $i$, but it is implicit in this example. Think of the specific unit $i$ as specifying everything else that is held constant in the program such as the random seed, specific task, specific learner, etc. Say the action space is $\{a, a', a''\}$. These are your possible “treatments.” Your outcomes of interest are the reward $R_{t + 1}$ and next state $S_{t + 1}$, which are both dependent on your action (potentially stochastically, depending on your environment). You can observe all 3 potential outcomes: $(R_{t + 1}(a), S_{t + 1}(a)), (R_{t + 1}(a'), S_{t + 1}(a')), (R_{t + 1}(a''), S_{t + 1}(a''))$. All you have to do is run the program 3 times from that particular state, taking a different action each time.
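One way to "run the program 3 times from that particular state" is to snapshot the environment, including its random number generator, and branch from the snapshot once per action. The sketch below assumes a made-up `ToyEnv` class whose dynamics and reward are invented for illustration. Real environments may or may not support copying like this.

```python
import copy
import random


class ToyEnv:
    """Hypothetical stochastic environment with an integer state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        shift = {"a": -1, "a'": 0, "a''": 1}[action]  # made-up dynamics
        noise = self.rng.choice([-1, 0, 1])            # environment stochasticity
        self.state = self.state + shift + noise
        reward = -abs(self.state)                      # made-up reward
        return reward, self.state


env = ToyEnv(seed=0)
env.step("a")          # act once so the agent is in some state s at time t
state_s = env.state

# Branch from an identical snapshot for each action, so the state and the
# random draws are held constant and only the action differs.
potential_outcomes = {}
for action in ("a", "a'", "a''"):
    branch = copy.deepcopy(env)
    potential_outcomes[action] = branch.step(action)  # (R_{t+1}, S_{t+1})

for action, (reward, next_state) in potential_outcomes.items():
    print(f"action {action}: reward = {reward}, next state = {next_state}")
```

Copying the generator along with the state is what keeps the environment's noise identical across the three branches. The original `env` is left untouched at state $s$.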

How About the Real World?

Of course, the fundamental problem of causal inference is still a problem outside of computer programs. However, the better we can model the world (in computers), the less of a problem the fundamental problem of causal inference becomes, in general. If we ever have good enough models of the world, we will be able to observe both $Y_i(t)$ and $Y_i(t')$ by simply rerunning the model with $t$ and with $t'$, just as I described above for a regular computer program. And modeling the world is an active research topic. For example, there was recently the NeurIPS 2018 Workshop on Modeling the Physical World: Learning, Perception, and Control.

Concluding Thoughts

If you care about causal effects in computer programs and you have access to the assignments for the causal variables you care about, you do not need to worry about the fundamental problem of causal inference. Because experiments in machine learning largely take place in computers, this has broad implications in this field.

Progress on modeling the physical world will directly translate to progress on estimating more and more accurate unit-level causal effects in the physical world.

Acknowledgments

Thanks to Nitarshan Rajkumar and Deepak Sharma for reviewing this blog post and giving feedback. Thanks to Alex Lamb for recommending I add some examples.

References

Holland, P. W. (1986). Statistics and Causal Inference. Journal of the American Statistical Association.
Morgan, S. L., & Winship, C. (2014). Counterfactuals and Causal Inference: Methods and Principles for Social Research.