Simulation of basic causal structures

causal inference

Author

René Valenzuela

Published

October 28, 2024

Modified

March 15, 2025

Introduction

We consider the problem of estimating the effect of a variable \(X\) on a target variable \(Y\) from a sample drawn from an unknown distribution. For simplicity, in our first example both \(X\) and \(Y\) are binary. There is also a third binary variable, \(Z\).

	X	Z	Y
0	1	1	1
1	0	0	0
2	1	1	0
3	1	0	0
4	1	0	1
5	0	1	0
6	1	0	0
7	0	0	0
8	0	0	0
9	1	1	0

We assume that we have no additional knowledge. The first step is to understand what we mean by the effect of \(X\) on \(Y\). The implicit question is: if I set the value of \(X\) to 1, how will that affect the value of \(Y\)?

One possible estimate of this effect is the difference in conditional means \[ \hat{e}_{1} = \mathbb{E}\left[Y \mid X = 1\right] - \mathbb{E}\left[Y \mid X = 0\right] \]

e1 = obs.loc[obs.X == 1, 'Y'].mean() - obs.loc[obs.X == 0, 'Y'].mean()
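As a concrete check, the estimate can be computed on the ten rows displayed above. This is a minimal sketch that assumes `obs` is a pandas DataFrame holding exactly those rows; the full sample behind the post may well be larger, in which case the number will differ.

```python
import pandas as pd

# The ten rows shown in the table above (assumed here to be the whole sample).
obs = pd.DataFrame({
    "X": [1, 0, 1, 1, 1, 0, 1, 0, 0, 1],
    "Z": [1, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "Y": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Difference of conditional means: E[Y | X = 1] - E[Y | X = 0].
mean_y_x1 = obs.loc[obs.X == 1, "Y"].mean()
mean_y_x0 = obs.loc[obs.X == 0, "Y"].mean()
e1 = mean_y_x1 - mean_y_x0
print(f"{mean_y_x1:.3f} - {mean_y_x0:.3f} = {e1:.3f}")
```

On these ten rows, two of the six units with \(X = 1\) have \(Y = 1\) and none of the four units with \(X = 0\) do, so the estimate is \(1/3\).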

The phrase "correlation does not imply causation" refers to the inability to legitimately deduce a cause-and-effect relationship between two events or variables solely on the basis of an observed association or correlation between them (Correlation does not imply causation). Although this statement agrees with common sense, only recently has a proper causal inference framework been developed that allows us to formally specify the difference.

Correlation in its broadest sense may indicate any type of association, i.e., any statistical relationship, whether causal or not, between two random variables. One way to write this formally is through the conditional expectation \[ \mathbb{E}\left[ Y \mid X = x \right] = f(x) \] where \(f\) varies with \(x\).

We want to show that the magnitude of a direct causal effect differs from the expected value of one variable conditional on the other taking a given value.
We also want to understand the correct approach to fitting a model given the data and its corresponding SCM.

Example

We consider the following structural causal model (as defined in Section 1.5 of Pearl 2016) \[ \begin{align*} f_{Z} &: Z = 1_{U_{Z} > 0} \\ f_{X} &: X = 1_{Z + U_{X} > 0.5} \\ f_{Y} &: Y = 1_{X + Z + U_{Y} > 2} \end{align*} \] where \(U = \{ U_{X}, U_{Y}, U_{Z} \}\) is the set of exogenous variables, \(V = \{ X, Y, Z \}\) is the set of endogenous variables and \(1_{A}\) is the indicator function of the event \(A\). In the associated graphical causal model, \(Z\) points to both \(X\) and \(Y\), and \(X\) points to \(Y\).

We will assume that the \(U_{i}\) are standard normal random variables. Recall that for binary variables probabilities are equal to means, so conditional probabilities are particularly easy to compute.
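The structural equations above can be simulated directly. The following is a minimal sketch and not necessarily the code that produced the numbers reported in this post: the helper name `simulate`, the sample size, the seed and the independence of the exogenous variables are all assumptions made for illustration. The intervention \(\text{do}(X = 1)\) is implemented by replacing \(f_{X}\) with the constant 1 while leaving \(f_{Z}\), \(f_{Y}\) and the exogenous noise untouched.

```python
import numpy as np

def simulate(n, seed=0):
    """Draw n samples from the SCM and from the same model under do(X = 1)."""
    rng = np.random.default_rng(seed)
    # Exogenous variables: independent standard normals (independence is assumed).
    u_z, u_x, u_y = rng.standard_normal((3, n))

    # Observational regime: evaluate the structural equations in causal order.
    z = (u_z > 0).astype(int)
    x = (z + u_x > 0.5).astype(int)
    y = (x + z + u_y > 2).astype(int)

    # Interventional regime do(X = 1): f_X is replaced by the constant 1,
    # while f_Z, f_Y and the exogenous noise are reused unchanged.
    x_do = np.ones(n, dtype=int)
    y_dox = (x_do + z + u_y > 2).astype(int)
    return x, z, y, y_dox

x, z, y, y_dox = simulate(200_000)
print(np.mean(y), np.mean(y_dox), np.mean(y[x == 1]))
```

Reusing the same noise draws for both regimes means the only difference between `y` and `y_dox` is the value fed into \(f_{Y}\) for \(X\), which is what the intervention is meant to isolate.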

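We see that \(P\left( Y = 1 \mid \text{do}\left(X = 1\right)\right)\) is different from \(P\left( Y = 1 \mid X = 1\right)\). The three values printed below are the estimates of \(P(Y = 1)\), \(P(Y = 1 \mid \text{do}(X = 1))\) and \(P(Y = 1 \mid X = 1)\), in that order.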

print(f'{np.mean(y)}, {np.mean(y_dox)}, {np.mean(y[x==1])}')

0.2335, 0.3309, 0.39857510389867407
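The gap between the last two numbers can also be checked analytically. The derivation below is a sketch that assumes the exogenous variables are mutually independent (the text only states that they are standard normal) and writes \(\Phi\) for the standard normal distribution function. Under these assumptions \(P(Z = 1) = 0.5\), \(P(X = 1) = 0.5\) and \(P(Z = 1 \mid X = 1) = \Phi(0.5) \approx 0.6915\), so that

\[ \begin{align*}
P\left(Y = 1 \mid \text{do}\left(X = 1\right)\right) &= \sum_{z} P\left(Y = 1 \mid X = 1, Z = z\right) P\left(Z = z\right) \\
&= 0.5 \left(1 - \Phi(0)\right) + 0.5 \left(1 - \Phi(1)\right) \approx 0.5 \cdot 0.5 + 0.5 \cdot 0.1587 \approx 0.329 \\
P\left(Y = 1 \mid X = 1\right) &= \sum_{z} P\left(Y = 1 \mid X = 1, Z = z\right) P\left(Z = z \mid X = 1\right) \\
&\approx 0.6915 \cdot 0.5 + 0.3085 \cdot 0.1587 \approx 0.395
\end{align*} \]

Both values are close to the simulated ones. The two expressions differ only in how \(Z\) is weighted: conditioning on \(X = 1\) makes \(Z = 1\) more likely, whereas the intervention leaves the distribution of \(Z\) unchanged.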

SEM versus SCM

Consider the following structural equation model, \[ \begin{align*} \text{asp} &= 2\,\text{hd} \\ \text{str} &= \text{asp} + \text{hd} \end{align*} \]

which is represented as follows

flowchart LR
  hd([HD])
  asp([ASP])
  str([STR])
  hd -- 2 --> asp
  hd -- 1 --> str
  asp -- 1 --> str

Our goal is to estimate these coefficients from a sample of the data.

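import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Size of sample
n_sample = 100

# rng is assumed to be an unseeded NumPy random Generator
rng = np.random.default_rng()
abt = pd.DataFrame({'hd':rng.random(n_sample)})
abt['asp'] = 2 * abt['hd']
abt['str'] = abt['hd'] + abt['asp']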

We first fit a linear regression of str on asp alone to estimate the coefficient.

reg = LinearRegression().fit(abt[['asp']], abt[['str']])

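The coefficient of determination is 1.0, so the fit is perfect. The fitted coefficient, however, need not equal the structural coefficient of asp on str, which is 1 in the diagram above.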

print(reg.coef_, reg.intercept_)

[[1.5]] [4.4408921e-16]

reg_adj = LinearRegression().fit(abt[['hd', 'asp']], abt[['str']])

reg_adj.score(abt[['hd', 'asp']], abt[['str']])

1.0

print(reg_adj.coef_, reg_adj.intercept_)

[[0.6 1.2]] [-2.22044605e-16]
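Because asp is an exact multiple of hd, the two regressors are perfectly collinear and the coefficients of the adjusted regression are not uniquely determined: any pair \((a, b)\) with \(a + 2b = 3\) reproduces str exactly, including the structural values \((1, 1)\). The sketch below illustrates this with plain NumPy instead of scikit-learn and with an arbitrary seed, both of which are assumptions made for illustration. It also omits the intercept that `LinearRegression` fits by default, since the data pass through the origin.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
hd = rng.random(100)
asp = 2 * hd          # asp is an exact multiple of hd
str_ = hd + asp       # trailing underscore avoids shadowing the built-in str

# Design matrix with columns (hd, asp); the columns are linearly dependent.
X = np.column_stack([hd, asp])
rank = np.linalg.matrix_rank(X)

# For a rank-deficient system, lstsq returns the minimum-norm solution.
coef, _, _, _ = np.linalg.lstsq(X, str_, rcond=None)

# The structural pair (1, 1) also satisfies a + 2b = 3 and fits equally well.
alt = np.array([1.0, 1.0])
print(rank, coef, np.allclose(X @ alt, str_))
```

The design matrix has rank 1, and the minimum-norm solution is \((0.6, 1.2)\), matching the coefficients reported above. A perfect score therefore says nothing about which of the infinitely many exact solutions corresponds to the structural model.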

References

Section 1.6 R examples. Causal Data Science with Directed Acyclic graphs. Udemy
Pearl 2016: Judea Pearl, Madelyn Glymour and Nicholas P. Jewell. Causal Inference in Statistics: A Primer.
Correlation does not imply causation: Wikipedia article Correlation does not imply causation