The Fundamental Problem of Causal Inference

Consider the potential outcomes $Y_i(t)$ and $Y_i(t')$, where $Y_i(t)$ denotes the outcome $Y$ that unit (individual) $i$ would have if unit $i$ receives treatment $t$. If you prefer do-notation, $P\big(Y \mid \text{do}(T = t), i\big)$ is the do-notation equivalent of $Y_i(t)$. The fundamental problem of causal inference is that it is impossible to observe both $Y_i(t)$ and $Y_i(t')$ (Holland, 1986). This is because you cannot give unit $i$ treatment $t$, observe $Y_i(t)$, rewind time to before treatment $t$ was given, and give unit $i$ treatment $t'$ to observe $Y_i(t')$. We illustrate this below in a table with question marks in place of the potential outcomes that cannot be observed.

| Unit | $T$  | $Y$ | $Y_i(t)$ | $Y_i(t')$ |
|------|------|-----|----------|-----------|
| 1    | $t$  | 1   | 1        | ?         |
| 2    | $t'$ | 1   | ?        | 1         |
| 3    | $t'$ | 0   | ?        | 0         |
| 4    | $t$  | 0   | 0        | ?         |
| 5    | $t'$ | 1   | ?        | 1         |
| 6    | $t$  | 0   | 0        | ?         |

You might be thinking you can just give unit $i$ treatment $t$ to observe $Y_i(t)$, wait some time, and then give unit $i$ treatment $t'$. However, the second potential outcome you observe will not necessarily be $Y_i(t')$. Really, it is $Y_i(t, \text{wait}, t')$, which denotes the compound treatment of first giving treatment $t$, then waiting, and then giving treatment $t'$. If you assume waiting sufficiently long nullifies the effects of the initial treatment $t$, you get $Y_i(t, \text{wait}, t') = Y_i(\text{wait}, t')$. If you further assume that all relevant confounding factors are exactly the same after waiting as they were when you gave the initial treatment $t$, then you get $Y_i(\text{wait}, t') = Y_i(t')$. However, these are two strong assumptions.

This means that it is impossible to calculate the unit-level causal effect $Y_i(t') - Y_i(t)$. This is why much of causal inference is focused on average treatment effects (ATE) $\mathbb{E}[Y(t') - Y(t)] = \mathbb{E}[Y(t')] - \mathbb{E}[Y(t)]$, which are possible to calculate under assumptions such as SUTVA and ignorability (see, e.g., Morgan & Winship (2014, Sections 2.5 and 4.3.1)).
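To make this concrete, here is a minimal sketch of the difference-in-means estimator applied to the observed columns of the table above. This estimator identifies the ATE only under those assumptions, e.g., when treatment is randomly assigned:

```python
import numpy as np

# Observed columns of the table above: treatment and outcome per unit.
T = np.array(["t", "t'", "t'", "t", "t'", "t"])
Y = np.array([1, 1, 0, 0, 1, 0])

# Difference in group means. Under SUTVA and ignorability (e.g., randomized
# treatment assignment), this estimates E[Y(t')] - E[Y(t)], even though no
# unit-level effect Y_i(t') - Y_i(t) is ever observed.
ate_hat = Y[T == "t'"].mean() - Y[T == "t"].mean()
print(ate_hat)  # (2/3) - (1/3) = 0.333...
```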

No Problem in Simulations

However, experiments in a virtual world such as a computer program don’t suffer from the fundamental problem of causal inference. Whenever our experiments take place in a program where we can change the treatment by changing the program, we can observe both $Y_i(t)$ and $Y_i(t')$. To observe $Y_i(t)$, we just run the program with `treatment = t`. All we have to do to observe $Y_i(t')$ is modify the program by changing the assignment statement to `treatment = t'` and rerun the program. It’s that simple. What are referred to as “simulations” in causal inference are just programs, so the fundamental problem of causal inference ain’t no problem there either.
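As a toy illustration, here is a minimal sketch where the entire “experiment” is a deterministic function of the treatment (`run_program` is a made-up stand-in for whatever your program actually computes):

```python
def run_program(treatment):
    # Toy stand-in for an arbitrary simulation: everything other than
    # the treatment is fixed inside the program.
    return 3.0 * treatment + 1.0

y_t = run_program(treatment=2.0)        # observe Y_i(t)
y_t_prime = run_program(treatment=5.0)  # "rewind" and observe Y_i(t')

# The unit-level causal effect, impossible to observe outside a program:
print(y_t_prime - y_t)  # 9.0
```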

This is not a particularly novel observation, but I think it is important to emphasize to machine learning researchers because so many of their experiments actually take place in computer simulations that they have a high degree of control over. I’ll present some highly simplified examples below. Think of these as seeds for you to come up with your own examples that could be much more complicated.

Supervised Learning Example

Consider a neural network where everything about the learning process is fixed except for the learning rate, so you are looking at a particular network $i$ (unit $i$). Your “treatment” is the learning rate $\alpha$. Your outcome of interest is the test error $\mathcal{E}$. You have already tried a learning rate of $.01$ and observed $\mathcal{E}_i(\alpha = .01) = .15$. You are wondering what would happen if you were to change the learning rate to $.005$. Because you have control of the program, you just change the learning rate to $.005$, keeping everything else in the program the same. You then observe $\mathcal{E}_i(\alpha = .005) = .13$. This procedure may be very familiar to you. You bypassed the fundamental problem of causal inference and computed a unit-level causal effect: $\mathcal{E}_i(.005) - \mathcal{E}_i(.01) = .13 - .15 = -.02$. The fact that this causal effect is negative is why you would choose a learning rate of $.005$ instead of a learning rate of $.01$.
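In code, this procedure might look like the sketch below. Here `train_and_eval` is a hypothetical stand-in for your fixed training pipeline; its body is rigged to reproduce the numbers above so the sketch runs end to end, since the point is the experimental procedure, not the training itself:

```python
def train_and_eval(learning_rate):
    # Hypothetical stand-in for a fixed training pipeline (fixed seed,
    # data, architecture, etc. -- that is what pins down unit i). In a
    # real experiment this would train the network and return the test
    # error; it is faked here so the sketch is runnable.
    return 0.126 + abs(learning_rate - 0.004) * 4

error_01 = train_and_eval(learning_rate=0.01)    # E_i(alpha = .01)  -> .15
error_005 = train_and_eval(learning_rate=0.005)  # E_i(alpha = .005) -> .13

# Unit-level causal effect of lowering the learning rate:
print(round(error_005 - error_01, 2))  # -0.02, so prefer .005
```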

Reinforcement Learning Example

Consider a reinforcement learning agent in state $s$ at time $t$ ($S_t = s$). I will omit the subscript $i$, but it is implicit in this example. Think of the specific unit $i$ as specifying everything else that is held constant in the program, such as the random seed, specific task, specific learner, etc. Say the action space is $\{a, a', a''\}$. These are your possible “treatments.” Your outcomes of interest are the reward $R_{t+1}$ and next state $S_{t+1}$, which both depend on your action (potentially stochastically, depending on your environment). You can observe all three potential outcomes: $(R_{t+1}(a), S_{t+1}(a))$, $(R_{t+1}(a'), S_{t+1}(a'))$, and $(R_{t+1}(a''), S_{t+1}(a''))$. All you have to do is run the program three times from that particular state, taking a different action each time.
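Here is a minimal sketch of this, assuming a toy environment whose entire state lives in a Python object, so `copy.deepcopy` can play the role of “rewinding time” (for a stochastic environment you would also need to snapshot its random state):

```python
import copy

class ToyEnv:
    """Deterministic toy environment standing in for your RL task."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action             # next state S_{t+1}
        reward = float(self.state == 2)  # reward R_{t+1}
        return self.state, reward

env = ToyEnv()
env.step(1)  # bring the environment to the state s of interest

# Snapshot the environment at state s, then observe the potential outcome
# (R_{t+1}(a), S_{t+1}(a)) for every action a in the action space.
potential_outcomes = {}
for a in (0, 1, 2):  # the action space {a, a', a''}
    branch = copy.deepcopy(env)  # "rewind time" by copying the program state
    next_state, reward = branch.step(a)
    potential_outcomes[a] = (reward, next_state)

print(potential_outcomes)  # {0: (0.0, 1), 1: (1.0, 2), 2: (0.0, 3)}
```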

How About the Real World?

Of course, the fundamental problem of causal inference is still a problem outside of computer programs. However, the better we can model the world (in computers), the less of a problem the fundamental problem of causal inference becomes, in general. If we ever have good enough models of the world, we will be able to observe both $Y_i(t)$ and $Y_i(t')$ by simply rerunning the model with $t$ and with $t'$, just as I described above for a regular computer program. And modeling the world is an active research topic. For example, there was recently the NeurIPS 2018 Workshop on Modeling the Physical World: Learning, Perception, and Control.

Concluding Thoughts

If you care about causal effects in computer programs and you have access to the assignments for the causal variables you care about, you do not need to worry about the fundamental problem of causal inference. Because experiments in machine learning largely take place in computers, this has broad implications for the field.

Progress on modeling the physical world will directly translate to progress on estimating more and more accurate unit-level causal effects in the physical world.

Acknowledgments

Thanks to Nitarshan Rajkumar and Deepak Sharma for reviewing this blog post and giving feedback. Thanks to Alex Lamb for recommending I add some examples.

References

  1. Holland, P. W. (1986). Statistics and Causal Inference. Journal of the American Statistical Association, 81(396), 945–960.
  2. Morgan, S. L., & Winship, C. (2014). Counterfactuals and Causal Inference: Methods and Principles for Social Research (2nd ed.). Cambridge University Press.