Simulation of basic causal structures
Introduction
We consider the problem of estimating the effect of a variable \(X\) on a target variable \(Y\) from a sample obtained from an unknown distribution. For simplicity, in our first example both \(X\) and \(Y\) are binary variables, and there is an extra variable \(Z\) which is also binary.
We assume that we have no additional knowledge. The first step is to understand what we mean by the effect of \(X\) on \(Y\). Here the implicit question is: if I set the value of \(X\) to 1, how will that affect the value of \(Y\)? The first ten observations of our sample obs look like this:

 | X | Z | Y |
---|---|---|---|
0 | 1 | 1 | 1 |
1 | 0 | 0 | 0 |
2 | 1 | 1 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 1 |
5 | 0 | 1 | 0 |
6 | 1 | 0 | 0 |
7 | 0 | 0 | 0 |
8 | 0 | 0 | 0 |
9 | 1 | 1 | 0 |
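A sample like this can be produced, for instance, by the simulation sketched below. It draws from the structural causal model introduced in the Example section (from the analyst's point of view the distribution is still unknown); the variable names, the seed and the sample size are arbitrary choices of ours:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

n = 10_000  # assumed sample size

# Exogenous (noise) variables: standard normal, as assumed below
u_z = rng.standard_normal(n)
u_x = rng.standard_normal(n)
u_y = rng.standard_normal(n)

# Endogenous variables, following the structural equations of the Example
z = (u_z > 0).astype(int)
x = (z + u_x > 0.5).astype(int)
y = (x + z + u_y > 2).astype(int)

obs = pd.DataFrame({'X': x, 'Z': z, 'Y': y})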
One possibility to estimate this effect would be \[ \hat{e}_{1} = \mathbb{E}\left[Y \mid X = 1\right] - \mathbb{E}\left[Y \mid X = 0\right] \]
e1 = obs.loc[obs.X == 1, 'Y'].mean() - obs.loc[obs.X == 0, 'Y'].mean()
Correlation does not imply causation refers to the inability to legitimately deduce a cause-and-effect relationship between two events or variables solely on the basis of an observed association between them (Correlation does not imply causation). Although this statement agrees with common sense, only recently has a proper causal inference framework been developed that allows us to formally specify the difference.
Correlation in its broadest sense may indicate any type of association, i.e. any statistical relationship, whether causal or not, between two random variables. Formally, there is an association between \(X\) and \(Y\) when \[ \mathbb{E}\left[ Y \mid X = x \right] = f(x) \] for some non-constant function \(f\).
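With the sample obs from above, this function can be estimated directly; a minimal sketch:

# Estimate f(x) = E[Y | X = x] by the sample mean of Y within each level of X
print(obs.groupby('X')['Y'].mean())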
- We want to show that the magnitude of a causal effect of \(X\) on \(Y\) is different from the behaviour of \(Y\) conditional on \(X\) being set to a given value, i.e. that in general \(P\left(Y \mid \text{do}\left(X = x\right)\right) \neq P\left(Y \mid X = x\right)\)
- We want to understand the correct approach to fitting a model given the data and its corresponding SCM
Example
We consider the following structural causal model (as defined in Section 1.5 of (Pearl 2016)) \[ \begin{align*} f_{Z} &: Z = 1_{U_{Z} > 0} \\ f_{X} &: X = 1_{Z + U_{X} > 0.5} \\ f_{Y} &: Y = 1_{X + Z + U_{Y} > 2} \end{align*} \] where \(U = \{ U_{X}, U_{Y}, U_{Z} \}\) is the set of exogenous variables, \(V = \{ X, Y, Z \}\) is the set of endogenous variables and \(1_{A}\) is the indicator function of the event \(A\). The associated graphical causal model is
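flowchart LR
  z([Z])
  x([X])
  y([Y])
  z --> x
  z --> y
  x --> y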
We will assume the \(U_{i}\) are independent standard normal random variables. Recall that for binary variables probabilities are equal to means, \(P\left(Y = 1\right) = \mathbb{E}\left[Y\right]\), so probabilities can be estimated by sample averages.
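Estimating \(P\left(Y \mid X = 1\right)\) amounts to averaging \(y\) over the samples where \(x = 1\). For \(P\left(Y \mid \text{do}\left(X = 1\right)\right)\) we instead replace the structural equation \(f_{X}\) by the constant \(X = 1\) and leave the other mechanisms unchanged. A sketch, reusing the arrays from the simulation above (with a different seed the estimates printed below will differ slightly from the ones shown):

# Intervention do(X = 1): the equation for X is replaced by X = 1,
# while f_Z and f_Y (and the noise variables) stay exactly as before
x_do = np.ones_like(x)
y_dox = ((x_do + z + u_y) > 2).astype(int)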
We see that \(P\left( Y \mid \text{do}\left(X = 1\right)\right)\) is different from \(P\left( Y \mid X = 1\right)\); the three values printed below are the estimates of \(P\left(Y = 1\right)\), \(P\left(Y = 1 \mid \text{do}\left(X = 1\right)\right)\) and \(P\left(Y = 1 \mid X = 1\right)\).
print(f'{np.mean(y)}, {np.mean(y_dox)}, {np.mean(y[x==1])}')
0.2335, 0.3309, 0.39857510389867407
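Since the exogenous variables are standard normal, these probabilities can also be computed exactly and compared with the simulated estimates; a sketch (assuming scipy is available):

from scipy.stats import norm

# P(Z = 1) = P(U_Z > 0)
p_z1 = 1 - norm.cdf(0)  # = 0.5

# P(Y = 1 | do(X = 1)) = sum over z of P(Z = z) * P(U_Y > 1 - z)
p_do = p_z1 * (1 - norm.cdf(0)) + (1 - p_z1) * (1 - norm.cdf(1))

# P(X = 1 | Z = z) = P(U_X > 0.5 - z), and Bayes' rule gives P(Z = 1 | X = 1)
p_x1_z1 = 1 - norm.cdf(-0.5)
p_x1_z0 = 1 - norm.cdf(0.5)
p_z1_x1 = p_z1 * p_x1_z1 / (p_z1 * p_x1_z1 + (1 - p_z1) * p_x1_z0)

# P(Y = 1 | X = 1) = sum over z of P(Z = z | X = 1) * P(U_Y > 1 - z)
p_cond = p_z1_x1 * (1 - norm.cdf(0)) + (1 - p_z1_x1) * (1 - norm.cdf(1))

print(p_do, p_cond)  # roughly 0.329 and 0.395, close to the simulated values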
SEM versus SCM
Consider the following structural equation model, \[ \begin{align*} \text{asp} &= 2\,\text{hd} \\ \text{str} &= \text{asp} + \text{hd} \end{align*} \]
which is represented as follows
flowchart LR
  hd([HD])
  asp([ASP])
  str([STR])
  hd -- 2 --> asp
  hd -- 1 --> str
  asp -- 1 --> str
Our goal is to estimate the coefficients from a sample of the data.
from sklearn.linear_model import LinearRegression

# Size of sample
n_sample = 100

# Generate a sample following the structural equations above
abt = pd.DataFrame({'hd': rng.random(n_sample)})
abt['asp'] = 2 * abt['hd']
abt['str'] = abt['hd'] + abt['asp']
We use a linear regression model to generate the estimate, first regressing \(\text{str}\) on \(\text{asp}\) alone
reg = LinearRegression().fit(abt[['asp']], abt[['str']])
The coefficient of determination is given by
reg.score(abt[['asp']], abt[['str']])
1.0
print(reg.coef_, reg.intercept_)
[[1.5]] [4.4408921e-16]
The fit is perfect, but the estimated coefficient is 1.5 rather than the structural coefficient 1: since \(\text{hd} = \text{asp} / 2\), substituting into the second equation gives \(\text{str} = 1.5\,\text{asp}\). The regression measures the association between \(\text{asp}\) and \(\text{str}\), which mixes the direct effect with the influence of the common cause \(\text{hd}\).
To adjust for the common cause \(\text{hd}\) we include it in the regression
reg_adj = LinearRegression().fit(abt[['hd', 'asp']], abt[['str']])
reg_adj.score(abt[['hd', 'asp']], abt[['str']])
1.0
print(reg_adj.coef_, reg_adj.intercept_)
[[0.6 1.2]] [-2.22044605e-16]
Even after adjusting we do not recover the structural coefficients \((1, 1)\). Because \(\text{asp} = 2\,\text{hd}\) holds exactly, the regressors are perfectly collinear, and every pair of coefficients \((a, b)\) with \(a + 2b = 3\) fits the data perfectly; the least-squares solver simply returns the minimum-norm solution \((0.6, 1.2)\). It is the SCM, not the regression, that tells us which of these solutions is structurally meaningful.
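A quick check of this non-uniqueness, comparing the structural coefficients \((1, 1)\) with the minimum-norm pair \((0.6, 1.2)\):

# Both coefficient pairs satisfy a + 2b = 3 and reproduce str exactly
pred_structural = 1.0 * abt['hd'] + 1.0 * abt['asp']
pred_min_norm = 0.6 * abt['hd'] + 1.2 * abt['asp']
print(np.allclose(pred_structural, abt['str']),
      np.allclose(pred_min_norm, abt['str']))  # True True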
References
- Section 1.6, R examples. Causal Data Science with Directed Acyclic Graphs. Udemy.
- Pearl 2016: Judea Pearl, Madelyn Glymour and Nicholas P. Jewell. Causal Inference in Statistics: A Primer.
- Correlation does not imply causation: Wikipedia article "Correlation does not imply causation".