Causality & Machine Learning

From CSclasswiki


Week 1 (12 Sept 2019)

Tim Maudlin's Book Review of Judea Pearl's The Book of Why: The New Science of Cause and Effect

Key Points:

  • Data set with high correlation between red cars and frequency of accidents
  • Counterfactuals and their "obvious" relationship to causation
    • Is Maudlin's criticism of Pearl's treatment of counterfactuals too strong?
      • Counterfactuals
      - A counterfactual makes an assertion about what would have happened had the world been other than it is in some way (Maudlin, 2019).
  • Brief mention of Structural Equation Models (SEMs)
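Maudlin's definition above can be made concrete with Pearl's three-step recipe (abduction, action, prediction) for evaluating a counterfactual in a structural model. The model and numbers below are hypothetical, chosen only to make the steps visible:

```python
# Pearl-style counterfactual in a toy linear SCM (hypothetical model).
# Structural equations: X = U_x ;  Y = 2*X + U_y
# Query: "had X been x_new instead of its observed value, what would Y have been?"

def counterfactual_y(x_obs, y_obs, x_new):
    # Step 1 (abduction): recover the exogenous terms from the observation.
    u_y = y_obs - 2 * x_obs
    # Step 2 (action): replace the equation for X with X := x_new.
    x = x_new
    # Step 3 (prediction): propagate through the unchanged equation for Y.
    return 2 * x + u_y

# Observed: X = 3, Y = 7 (so U_y = 1). Setting X = 1 counterfactually gives Y = 3.
print(counterfactual_y(3, 7, 1))  # -> 3
```

Step 1 is what distinguishes a counterfactual from a plain intervention: the exogenous background of the actual world is carried over before the change is made.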

Items to further investigate for next week:

  • Find more philosophical criticisms of The Book of Why and Causality (similar to that of Maudlin)
  • Are there data sets out there depicting cases similar to Maudlin's red car example?
  • What exactly are SEMs?
  • Are there relevant DoWhy examples?

Week 2 (19 Sept 2019)

More Philosophical Criticisms of Pearl's Work

Hunting Causes and Using Them: Approaches in Philosophy and Economics (2007) by Nancy Cartwright

  • Note: Cartwright does not offer a direct response to Pearl's work; it is the other way around. May still be worth looking into.
  • Pearl argues that any causal system can be considered as a non-parametric structural equation model (NPSEM).
    • NPSEM: the value of each node is taken as a function of its parents and some individual error term.
      • The error terms between different nodes may in general be correlated, to represent common causes.
        • I'm still trying to understand what this exactly means.
  • In her book, Cartwright gives an example involving a car engine.
    • She claims this example cannot be modeled in the NPSEM framework.
  • Pearl disputes this in his review of Cartwright's book.
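The NPSEM description above (each node a function of its parents plus an individual error term, with correlated errors standing in for unobserved common causes) can be sketched as follows; the particular functions and distributions are invented for illustration:

```python
import random

# NPSEM sketch (hypothetical functions and distributions).
# Each node is an arbitrary (non-parametric) function of its parents and an
# individual error term; the error terms for X and Y are correlated because
# both include the hidden common cause h.

def sample():
    h = random.gauss(0, 1)          # unobserved common cause
    e_x = h + random.gauss(0, 0.5)  # correlated error term for X
    e_y = h + random.gauss(0, 0.5)  # correlated error term for Y
    x = max(0.0, e_x)               # X = f_x(e_x): any function is allowed
    y = x ** 2 + e_y                # Y = f_y(X, e_y): node = f(parents, error)
    return x, y

random.seed(0)
draws = [sample() for _ in range(5)]
```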

Getting Rid of Intervention by Alexander Reutlinger

Suppose that the causal claim the Big Bang is a cause of Y is true (for some Y). According to the interventionist theory, the Big Bang is a cause of Y is true iff the following two interventionist counterfactuals are true: (a) if there were an intervention I = i on the Big Bang such that the Big Bang occurred as it actually did, then Y = y would be the case as it actually is, and (b) if there were an intervention I = i* on the Big Bang such that the Big Bang were to occur differently than it actually did, then Y would take a counterfactual value y*. However, the existence of a logically possible intervention seems to be completely dispensable for stating the truth conditions of the Big Bang is a cause of Y. I will now begin to argue for this claim by way of an example of an intervention on the Big Bang suggested by Tim Maudlin. Having pointed out that it is neither practically nor physically possible to intervene on the Big Bang, Maudlin presents an instructive example of a physically impossible intervention on the Big Bang... (pp. 12-13).
  • A criticism of Pearl's claim re intervention:
Pearl’s remark
If you wish to include the whole universe in the model, causality disappears because interventions disappear—the manipulator and the manipulated lose their distinction. (2009: 419-20)
seems to assume that the imagined intervention has to meet some additional constraint beyond this (having to do in some way with the possibility of realizing the intervention). Reutlinger challenges this assumption by...
  • Also, see Reutlinger's (2016) Explanation Beyond Causation? for next week.
JOR comment:
Reutlinger's paper is a critique of Woodward's book-length 2003 theory of causation, which relies on intervention. Much of the discussion in this paper is not so relevant, because Woodward is aimed at the semantics of causal inference, "not intended as a contribution to practical problems in causal inference." Nevertheless, the discussion does seem to me to raise substantive questions on exactly what is meant by a Pearl intervention. I sense there are assumptions underlying Pearl's interventions that are perhaps not clear, and might be debatable. However, they are only obliquely addressed in this paper. I enjoyed the discussion of interventions that are not physically possible, only logically possible. Spontaneous decay of uranium is interesting, but not problematical for Pearl. I found there is too much reliance on Big Bang interventions, which weakens several arguments. There should be better examples that would be resistant to Pearl's rejoinder.

Causality and Statistical Learning by Andrew Gelman

Pearl, as befits his computer-science background, supports a permissive view, but only in theory; he accepts that any practical causal inference will be only as good as the substantive models that underlie it, and he is explicit about the additional assumptions required to translate associations into causal inference. (In some ways he is close to the statisticians and econometricians who emphasize that randomization allows more robust inferences for causal inference about population averages, although he is also interested in causal inference when randomization or its equivalents are not available.) I hold conflicting views, in a manner similar to those expressed by Sobel (2008). On one hand, I do not see how one can get scientifically strong causal inferences from observational data alone without strong theory; it seems to me a hopeless task to throw a data matrix at a computer program and hope to learn about causal structure (more on this below). On the other hand, I recognize that much of our everyday causal intuition, while not having the full quality of scientific reasoning, is still useful, and it seems a bit of a gap to simply use the label “descriptive” for all inference that is not supported by experiment or very strong theory...the outputs of structural modeling programs can easily be misinterpreted (pp. 959-60).

Counterfactuals and Causal Models by Stephen L. Morgan and Christopher Winship

  • Another source re sociology and political science.
    • Why does there seem to be more tension between Pearl's work and sociological work on causation?
  • Morgan and Winship discuss the claim that the 2000 election would have gone in favor of Al Gore if felons and ex-felons had been allowed to vote.
    • They argue that the counterfactual world would have had very different candidates and issues, so you cannot characterize the alternative causal state.
      • For next week, relate more specifically to Maudlin's critique and pinpoint Pearl's exact conception of counterfactuals in The Book of Why and Causality.

Potentially Usable Data Sets

Automobile accident data sets (1982-2009)

IP investigations and technological progress or trade patterns

International debt and political instability

Sleep schedule and academic performance

  • Context:
    • Taylor, D. J., Vatthauer, K. E., Bramoweth, A. D., Ruggero, C., & Roane, B. (2013). The role of sleep in predicting college academic performance: is it a unique predictor?. Behavioral sleep medicine, 11(3), 159–172. doi:10.1080/15402002.2011.602776
Few studies have looked at the predictability of academic performance (i.e., cumulative grade point average [GPA]) using sleep when common nonsleep predictors of academic performance are included. The present project studied psychological, demographic, educational, and sleep risk factors of decreased academic performance in college undergraduates.
  • Can a dataset be found?


What is precisely the relationship between causation and correlation?

From Bollen, Kenneth A., and Judea Pearl. "Eight myths about causality and structural equation models." In Handbook of causal analysis for social research, pp. 301-328. Springer, Dordrecht, 2013.


SEM does not aim to establish causal relations from associations alone. Perhaps the best way to make this point clear is to state formally and unambiguously what SEM does aim to establish. SEM is an inference engine that takes in two inputs, qualitative causal assumptions and empirical data, and derives two logical consequences of these inputs: quantitative causal conclusions and statistical measures of fit for the testable implications of the assumptions. Failure to fit the data casts doubt on the strong causal assumptions of zero coefficients or zero covariances and guides the researcher to diagnose or repair the structural misspecifications. Fitting the data does not “prove” the causal assumptions, but it makes them tentatively more plausible. Any such positive results need to be replicated and to withstand the criticisms of researchers who suggest other models for the same data.
Structural Equation Modeling (SEM)
methodology for representing, estimating, and testing a network of relationships between variables
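Bollen and Pearl's "two inputs, two outputs" description can be illustrated with a minimal linear SEM; the chain model and its coefficients below are hypothetical:

```python
import random, statistics

# Minimal linear SEM sketch of the "two inputs, two outputs" idea.
# Inputs: causal assumptions (chain X -> M -> Y, zero X -> Y coefficient) plus data.
# Outputs: estimated structural coefficients plus a check of the testable
# implication of the assumptions, X independent of Y given M (the partial
# correlation of the residuals should vanish).

def slope(a, b):  # OLS slope of b on a
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / sum((u - ma) ** 2 for u in a)

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    da = sum((u - ma) ** 2 for u in a)
    db = sum((v - mb) ** 2 for v in b)
    return num / (da * db) ** 0.5

random.seed(1)
X = [random.gauss(0, 1) for _ in range(4000)]
M = [2 * x + random.gauss(0, 1) for x in X]   # M = 2X + u_m
Y = [-m + random.gauss(0, 1) for m in M]      # Y = -M + u_y

b_mx = slope(X, M)   # quantitative causal conclusion: estimate of 2
b_ym = slope(M, Y)   # estimate of -1

# Fit check: residualize X and Y on M, then correlate the residuals.
rx = [x - slope(M, X) * m for x, m in zip(X, M)]
ry = [y - b_ym * m for y, m in zip(Y, M)]
partial_r = corr(rx, ry)   # near zero if the model fits
```

Failure of the fit check (a clearly nonzero partial correlation) would cast doubt on the zero-coefficient assumption, just as the quoted passage describes; passing it does not "prove" the assumptions.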

Helpful Links:

Relevant DoWhy examples?

DoWhy and the relationship between title length and click-through rate

Does treatment cause outcome?

Week 3 (26 Sept 2019)

Review of Last Week's Philosophical Criticisms of Pearl's Work

The Book of Why: The New Science of Cause and Effect (2019) by Judea Pearl and Dana Mackenzie

  • W.r.t. RCTs (Randomized Controlled Trials):
Daniel's experiment was strikingly modern in all these ways. Prospective controlled trials are still a hallmark of sound science. However, Daniel didn't think of one thing: confounding bias. Suppose that Daniel and his friends are healthier than the control group to start with. In that case, their robust appearance after ten days on the diet may have nothing to do with the diet itself; it may reflect their overall health. Maybe they would have prospered even more if they had eaten the king's meat! Confounding bias occurs when a variable influences both who is selected for the treatment and the outcome of the experiment. Sometimes the confounders are known; other times they are merely suspected and act as a lurking third variable. (pp. 137-138)
The randomized controlled trial is indeed a wonderful invention - but until recently the generations of statisticians who followed Fisher could not prove that what they got from the RCT was indeed what they sought to obtain. They did not have a language to write down what they were looking for - namely, the causal effect of X on Y. [...] Confounding is a causal concept - it belongs on rung two of the Ladder of Causation (pp. 139-40).
Even as recently as 2001, reviewers rebuked a paper of mine while insisting Confounding is solidly founded in standard statistics. Fortunately, the number of such reviewers has shrunk dramatically in the past decade. There is now an almost universal consensus, at least among epidemiologists, philosophers, and social scientists, that (1) confounding needs and has a causal solution, and (2) causal diagrams provide a complete and systematic way of finding that solution (p. 141).
- Where is this consensus?
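The confounding passage above is easy to simulate: let baseline health influence both who chooses the diet and the outcome, with the diet itself doing nothing. All numbers are made up:

```python
import random, statistics

# Confounding-bias sketch (hypothetical numbers): health h influences both
# who selects the diet (treatment t) and the outcome, while the diet has zero
# true effect. The naive observed difference is then pure bias; randomizing t
# removes it.

random.seed(2)

def outcome(t, h):
    return h + random.gauss(0, 0.5)   # true treatment effect is 0 (t unused)

def naive_difference(n=20000):
    treated, control = [], []
    for _ in range(n):
        h = random.gauss(0, 1)                        # baseline health (confounder)
        t = 1 if h + random.gauss(0, 1) > 0 else 0    # healthier people choose the diet
        (treated if t else control).append(outcome(t, h))
    return statistics.fmean(treated) - statistics.fmean(control)

def randomized_difference(n=20000):
    treated, control = [], []
    for _ in range(n):
        h = random.gauss(0, 1)
        t = random.randint(0, 1)                      # RCT: assignment ignores h
        (treated if t else control).append(outcome(t, h))
    return statistics.fmean(treated) - statistics.fmean(control)
```

Here `naive_difference()` comes out strongly positive (about 1.1) even though the diet does nothing, while `randomized_difference()` hovers near zero.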

Hunting Causes and Using Them: Approaches in Philosophy and Economics (2007) by Nancy Cartwright

  • Modularity re causation (for context):
    • Modularity is a prominent concept in philosophy of mind and science, especially with respect to human cognitive faculties.
      • Modularity of mind is the notion that a mind may, at least in part, be composed of innate neural structures or modules which have distinct, established, and evolutionarily developed functions.
    • In this case, modularity accounts
require that each [causal] law describe a 'mechanism' for the effect, a mechanism that can vary independently of the law for any other effect (p. 13).
  • Cartwright, in fact, does directly address Pearl's concept of causation.
I. But first I would like to look in some detail at an argument in support of modularity that has received less attention in the philosophical literature. Judea Pearl and Stephen LeRoy both make claims about ambiguity that I also find an echo of in Woodward. Causal analysis, Pearl tells us, 'deals with changes' - and here he means changes under an 'intervention' that changes only the cause (and anything that must change in train). So
Pearl/LeRoy requirement: a causal law for the effect of x_c on x_e is supposed to state unambiguously what difference a unit change of x_c (by 'intervention') will make on x_e.
  • Return to Reutlinger's critique of Woodward (2016).
I always find it puzzling why we should think that a law for the effect of e on c should tell us what happens to e when the set of laws is itself allowed to alter or even when c is brought about in various ways. I would have thought that if there was an answer to the question, it would be contained in some other general facts - like the facts about the underlying structure that gives rise to the laws and that permits certain kinds of changes in earlier variables. My reconstruction of Pearl and LeRoy's answer to my puzzle takes them to be making a very specific claim about what a causal law is (in the kind of deterministic frameworks we have been considering):
A causal law about the effect of x_n on any other variable is Nature's instruction for determining what happens when either:
  1. the causal law describing the causes of x_n varies from x_n = f_n(x_1, ..., x_n-1) + u_n to x_n = X, OR
  2. the exogenous variable for x_n (i.e. u_n) varies and nothing else varies except what these variations compel.
So for every system of causal laws:
(i) such variation in any cause must be possible, and
(ii) the law in question must yield an unambiguous answer for what happens to the effect under such variation in a cause.
Hence the requirement called 'modularity'. (See bullet above for modularity re causation.) But there must be something wrong with this conception of causal laws. When Pearl talked about this recently at LSE he illustrated this requirement with a Boolean input-output diagram for a circuit. In it, not only could the entire input for each variable be changed independently of that for each other, so too could each Boolean component of that input. But most arrangements we study are not like that. They are rather like a toaster or a laser or a carburettor.
I shall illustrate with a causal account of the carburettor, or rather, a small part of the operation of the carburettor - the control of the amount of gas that enters the chamber before ignition (pp. 14-15).
II. The point is this. In Pearl's circuit-board, there is one distinct physical mechanism to underwrite each distinct causal connection. But that is incredibly wasteful of space and materials, which matters for the carburettor. One of the central tricks for an engineer in designing a carburettor is to ensure that one and the same physical design - for example, the design of the chamber - can underwrite or ensure a number of different causal connections that we need all at once.
Just look back at my diagrammatic equations, where we can see a large number of laws all of which depend on the same physical features - the geometry of the carburettor. So no one of these laws can be changed on its own. To change any one requires a redesign of the carburettor, which will change the others in train. By design the different causal laws are harnessed together and cannot be changed singly, so modularity fails.
  • Contrast with Cartwright's understanding of engineering design re causality and the Instantiating Sapience paper from 2019?
My conclusion though is not that we must discard modularity. Rather it is not a universal characteristic of some univocal concept of (generic) causation. There are different causal questions we can ask. We can, for instance, ask the causal question we see in the Pearl/LeRoy requirement: how much will the effect change for a unit change in the cause if the unit change in the cause were to be introduced 'by intervention'? (pp. 15-16).
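Cartwright's reconstruction (vary the equation for x_n to x_n = X while every other equation stays fixed) is precisely how Pearl's do-operator works, and the sketch below makes explicit the modularity assumption she disputes. The equations are hypothetical:

```python
# Sketch of the Pearl/LeRoy "intervention" that Cartwright reconstructs:
# do(X2 = c) replaces the structural equation for X2 with the constant c and
# leaves every other equation untouched -- exactly the modularity assumption
# at issue. (Hypothetical equations, for illustration.)

def run(u1, u2, u3, do_x2=None):
    x1 = u1                                          # X1 = u1
    x2 = 3 * x1 + u2 if do_x2 is None else do_x2     # X2 = 3*X1 + u2, unless set by intervention
    x3 = x2 - x1 + u3                                # X3 = X2 - X1 + u3 (unchanged by the intervention)
    return x1, x2, x3

# Observational run vs. the same exogenous background under do(X2 = 0):
print(run(1, 0, 0))            # -> (1, 3, 2)
print(run(1, 0, 0, do_x2=0))   # -> (1, 0, -1)
```

Cartwright's carburettor point is that in real devices the equation for x3 could not be held fixed while the equation for x2 is replaced, because one physical design underwrites both.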

Counterfactuals and Causal Models by Stephen L. Morgan and Christopher Winship

  • Reading in progress.

More Philosophical Criticisms of Pearl's Work

Instantiating Sapience (2019) by Andrea Giuseppe Ragno

This philosophical paper...titled Instantiating Sapience, describes The Book of Why as "an engineering approach to AI." I must confess that I cannot understand the many non-engineering alternatives described in the paper; perhaps one of our readers could.
  • Thesis: I offer a brief methodological perspective on the project about the implementation of sapient conceptual skills in technical devices, i.e. part of what has been defined as Artificial General Intelligence (AGI).
    • Refers to Kant, Hegel, Popper, Brandom and Trafford, Putnam and Sellars, and finally Judea Pearl and his Ladder of Causation.
    • End objective: showing a new functionalist paradigm, i.e. the engineering approach to functionalism, in order to overcome and exploit flaws, vagueness, and doubts about contemporary AI and to advance novel theories about the implementation of conceptual skills in machines (p. 82).


  • Tensions re Is Judea Pearl's model of causation an engineerable feat or not?
  • If we accept Cartwright's reconstruction of Pearl's understanding of causation, then what would this model look like on a practical level?
  • Could an inherent weakness within Pearl's model be that, indeed, modularity is assumed, and that if this is so, the problem of infinite regress is inevitable?
  • What is the connection between Woodward and Pearl re intervention?

JOR Questions 26Sep2019

(1) What is the meaning of a Pearl directed edge in a causal diagram? Can we find Pearl's explanation? Is it just "information flow"? Is he implicitly accepting a particular definition of causal influence? Perhaps a statistical definition, which philosophers have found problematical?
(2) What is the meaning of a Pearl node? Is it something (a variable) that can be (theoretically if not practically) measured statistically? So something like gravity could not be a node? Or could it?
(3) Pearl says a causal diagram displays "assumptions." The questions above are wondering whether there are assumptions already in assuming assumptions can be modeled in a causal diagram.
(4) It appears that an intervention is setting a variable to a fixed value. This appears to assume a limited sense of what constitutes an intervention. Again I am thinking about gravity, setting gravity to be repulsive.
(5) Can we find any critics who question the causal diagrams as capturing what needs to be captured?

JOR 1Oct 2019

Elements of Causal Inference. Peters, Janzing & Schölkopf. Highly mathematical. Still useful.
p.7: Reichenbach's common cause principle: If two rand vars X,Y are stat dependent, then there exists a 3rd variable Z that causally influences both.
p.16: Indep Mechanisms. Good example of city temp vs. altitude.
Ch.4: Learning cause-effect models. This is close to my idea to use stat observations to infer the best causal model.
p.58: Hints of using Kolmogorov complexity as foundation.
p.90: Myopia example, very instructive.
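Reichenbach's common cause principle (p. 7) can be checked in a toy linear model: Z drives both X and Y, making them dependent, while residualizing on Z (a stand-in for conditioning) removes the dependence. The model is invented for illustration:

```python
import random, statistics

# Reichenbach common-cause sketch (hypothetical linear model): Z causally
# influences both X and Y, so X and Y come out statistically dependent, yet
# residualizing on Z removes the dependence.

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    da = sum((u - ma) ** 2 for u in a)
    db = sum((v - mb) ** 2 for v in b)
    return num / (da * db) ** 0.5

def slope(a, b):  # OLS slope of b on a
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / sum((u - ma) ** 2 for u in a)

random.seed(3)
Z = [random.gauss(0, 1) for _ in range(5000)]
X = [z + random.gauss(0, 1) for z in Z]          # Z -> X
Y = [2 * z + random.gauss(0, 1) for z in Z]      # Z -> Y

r_xy = corr(X, Y)                                # clearly nonzero (~0.63)
rx = [x - slope(Z, X) * z for x, z in zip(X, Z)]
ry = [y - slope(Z, Y) * z for y, z in zip(Y, Z)]
r_given_z = corr(rx, ry)                         # near zero
```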

Week 4 (03 Oct 2019)

Addressing the Above Questions

Pearl's Directed Edges, Nodes, and Assumptions in Causal Diagrams

  • "Causal Diagrams for Empirical Research (With Discussions)" by Judea Pearl in Biometrika (1995), 82, 4, pp. 669-710.
    • Arrows = Inference? Causal link?
      • Indicative of functional dependencies
      • What is a confounding arc?
    • Nodes
      • Nodes are random, manipulable variables Xi and Xj
        • Every causal diagram with latent variables can be converted to an equivalent diagram involving measured variables interconnected by arrows and confounding arcs. This conversion corresponds to substituting out all latent variables from the structural equations of (3) and then constructing a new diagram by connecting any two variables Xi and Xj by (i) an arrow from Xj to Xi, whenever Xj appears in the equation for Xi, and (ii) a confounding arc whenever the same error term appears in both fi and fj. The result is a diagram in which all unmeasured variables are exogenous and mutually independent (pp. 681-2).
    • Assumptions
      • The usefulness of directed acyclic graphs as economical schemes for representing conditional independence assumptions is well evidenced in the literature (Pearl, 1988; Whittaker, 1990) (p. 671)
      • The use of directed acyclic graphs as carriers of independence assumptions has also been instrumental in predicting the effect of interventions when these graphs are given a causal interpretation (Spirtes, Glymour & Scheines, 1993, p. 78; Pearl, 1993b). Pearl (1993b), for example, treated interventions as variables in an augmented probability space, and their effects were obtained by ordinary conditioning (p. 672).
      • Such models are called identifying because their structures communicate a sufficient number of assumptions to permit the identification of the target quantity pr(y|x) (p. 681).
      • There have been three approaches to expressing causal assumptions in mathematical form. The most common approach in the statistical literature invokes Rubin's model (Rubin, 1974), in which probability functions are defined over an augmented space of observable and counterfactual variables. In this model, causal assumptions are expressed as independence constraints over the augmented probability function, as exemplified by Rosenbaum & Rubin's (1983) definitions of ignorability conditions. An alternative but related approach, still using the standard language of probability, is to define augmented probability functions over variables representing hypothetical interventions (Pearl, 1993b) (pp. 684-685).
  • "Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics" by Guido W. Imbens (July 2019),
    • Guido mentions:
"…work in economics is closer in spirit to the potential outcome framework" (p.1).
The graphical approach has gained much traction in the wider computer science community, see for example the recent text “Elements of Causal Inference,” ([Peters, Janzing, and Schölkopf, 2017]) and the work on causal discovery ([Glymour, Scheines, and Spirtes, 2014, Hoyer, Janzing, Mooij, Peters, and Schölkopf, 2009, Lopez-Paz, Nishihara, Chintala, Schölkopf, and Bottou, 2017]) (p. 1).

What is Pearl's Understanding of an "Intervention"

  • On the Interpretation of do(x) by Judea Pearl (2019). Forthcoming in Journal of Causal Inference.
We view do(x) as an ideal intervention that provides valuable information on the effects of manipulable variables and is thus empirically testable (p.1).
Pearl's structural causal modeling framework (SCM) is as follows:
The structural causal modeling (SCM) framework described in (Pearl, 1995, 2011, 2019) defines and computes quantities of the form Q = E[Y |do(X = x)] which are interpreted as the causal effect of X on Y (p. 1).
Several critics of SCM have voiced concerns about this interpretation of Q when X [the variable with causal effect on Y] is non-manipulable; that is, X is a variable whose value cannot be controlled directly by an experimenter (Cartwright, 2007; Heckman and Vytlacil, 2007; Pearl, 2009; Hernán, 2016; Pearl, 2018). Indeed, asking for the effect of setting X to a constant x makes perfect sense when X is a treatment, say drug-1 or diet-2, but how can we imagine an action do(X = x) when X is non-manipulable, like gender, race, or even a state of a variable such as blood-pressure or cholesterol level? (p.1).
From here, we might say:
  1. That Pearl distinguishes between variable X as "non-manipulable" suggests he is not completely limited in his approach to interventions. He may be allowing for interventions to not be so fixed, OR
  2. There's a difference between X = x being fixed and do(X = x) being fixed. In this case, this would be a difference between the nature of the variable itself and the nature of the intervention.
However, he (2019) notes three uses of his SCM model:
Q = E[Y |do(X = x)] is equal to q(x) where q(x) is some function of x, computed from the joint distribution of observed variables in the model.
  1. Q represents a theoretical limit on the causal effects of manipulable interventions that might become available in the future.
  2. Q imposes constraints on the causal effects of currently manipulable variables.
  3. Q serves as an auxiliary mathematical operation in the derivation of causal effects of manipulable variables (p.2).
According to Pearl's list of uses, (1) and (2) are both true. Then, what exactly is a "manipulable intervention" with respect to a "manipulable variable"? Is an intervention manipulable by virtue of the variable used with the do-operator being manipulable?
Pearl then goes on to present the following to discuss Q as a limit on pending interventions in Section 2.1:
Consider a set I = {I1, I2, . . . , In} of manipulable interventions whose effects on outcome Y we wish to compare. Assume that these interventions are suspected of affecting Y through their effect on X, and X is not directly manipulable (p. 2).
Are there now direct vs. indirect manipulations of variables? What do these entail?
He further explains an ideal intervention with modern-day examples:
However, an ideal intervention may not be feasible given the current state of technology, but may become feasible in the future. For example, cloud seeding made “rain” manipulable in our century, and genetic engineering may render gene variations manipulable in the future. If we simulate the impact of such an ideal intervention, one with no side effects and with a deterministic f, its resultant effect on Y will be Q (p.3).
At the end of Section 2.1, it still doesn't seem clear to me what his understanding of "pending interventions" is. He then goes on to discuss "feasible" vs. "infeasible" interventions w.r.t. Q as "'the theoretical effect' of an infeasible intervention" (p. 3).
What is the relationship between this definition of Q and use #1 in the list above? Is there a relationship between this definition of Q and "pending interventions"?
Pearl makes this note in his conclusion:
Armed with these clarifications, researchers need not be concerned with the distinction between manipulable and non-manipulative variables, except of course in the design of actual experiments. In the analytical stage, including model specification, identification and estimation, all variables can be treated equally, and are therefore equally eligible to receive the do-operator and to deliver the ramifications in its effect (p. 8).
I'm not convinced.
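Whatever one makes of the manipulability debate, the mechanics of computing Q = E[Y | do(X = x)] are uncontroversial when a backdoor set is observed. A sketch using the adjustment formula, Q(x) = sum_z E[Y | X = x, Z = z] * P(Z = z), on simulated discrete data (all probabilities hypothetical; the true effect is 0.2 by construction):

```python
import random

# Backdoor-adjustment sketch for Q = E[Y | do(X = x)] with one observed
# confounder Z. All probabilities are made up for illustration.

random.seed(4)
data = []
for _ in range(50000):
    z = 1 if random.random() < 0.5 else 0
    x = 1 if random.random() < (0.8 if z else 0.2) else 0        # Z -> X
    y = 1 if random.random() < 0.3 + 0.2 * x + 0.4 * z else 0    # X -> Y <- Z
    data.append((z, x, y))

def do_expectation(x_val):
    # Adjustment formula: sum over z of E[Y | X = x, Z = z] * P(Z = z).
    q = 0.0
    for z_val in (0, 1):
        ys = [y for z, x, y in data if z == z_val and x == x_val]
        p_z = sum(1 for z, _, _ in data if z == z_val) / len(data)
        q += (sum(ys) / len(ys)) * p_z
    return q

effect = do_expectation(1) - do_expectation(0)   # estimates the true 0.2
```

Note that nothing here asks whether X is "manipulable"; the do-expression is evaluated purely from the model and the observed joint distribution, which is Pearl's point in use (3).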

Critics of his Causal Diagrams?

  • "Causal Diagrams for Empirical Research (With Discussions)" by Judea Pearl in Biometrika (1995), 82, 4, pp. 669-710.
…the whole enterprise of structural equation modelling has become the object of serious controversy and misunderstanding among researchers (Freedman, 1987; Wermuth, 1992; Whittaker, 1990, p. 302; Cox & Wermuth, 1993) (p. 685).
  • Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics by Guido W. Imbens (July 2019),
This is of course fine, but the result is that the models in TBOW are all, in a favorite phrase in TBOW “toy models,” that we should not take seriously. This is common to other discussions of graphical causal models (e.g., [Koller, Friedman, and Bach, 2009]). It would have been useful if the authors had teamed up with subject matter experts and discussed some substantive examples where DAGs, beyond the very simple ones implied by randomized experiments, are aligned with experts’ opinions. Such examples, whether in social sciences or otherwise, would serve well in the effort to convince economists that these methods are useful in practice (p.6).
Criticism: Pearl is not engaged as he should be with econometrics literature and should consult more expert opinions…

Philosophical Criticisms (cont.)

Causal Reasoning without Counterfactuals by Phil Dawid (2000)

  • Still in the process of reading.

Week 5 (10 Oct 2019)

See Word document for a Simplified Overview of Pearl's Causal Diagrams.

Week 6 (17 Oct 2019)

Understand statistical tests to determine independence

Independence: P(A and B) = P(A) * P(B)

  1. Chi-square test of independence: determines whether there is a relationship between two categorical variables in the population
    • Example questions that this test can answer:
      • For young adults in the United States, is gender related to body image?
      • Is alcohol abuse by New York firefighters dependent on participation in the 9/11 rescue operation?
      • In the United States, is race associated with political views (conservative, moderate, liberal)?
    • Conditional percentages and independence
      • Definitions here:
      • Recall the definition of independence from Probability and Probability Distribution. Two events, A and B, are independent if the probability of A is the same as the probability of A when B has already occurred. We write this statement as P(A) = P(A | B).
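The chi-square test described above is simple enough to compute from first principles; the 2x2 table of counts below is made up (loosely echoing the gender and body-image question):

```python
# Chi-square test of independence on a hypothetical 2x2 table of counts
# (rows: gender, columns: body-image rating). Under independence the expected
# count in each cell is (row total * column total) / n.

observed = [[35, 15],
            [25, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # here: 1
# chi_sq ~ 4.17 exceeds the 0.05 critical value 3.84 for df = 1,
# so we would reject independence at the 5% level.
```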

P(A and B | C): is there independence or dependence?

  • X and Y are conditionally independent given Z iff P(X and Y | Z) = P(X | Z) * P(Y | Z). Two ways to test this:
    1. Sum the individual X^2 or G^2 test statistics for X–Y independence across the levels of Z. If X, Y, and Z have I, J, and K levels respectively, the total degrees of freedom for this test are K(I − 1)(J − 1).
    2. Cochran-Mantel-Haenszel Test (using option CMH in PROC FREQ/ TABLES/ in SAS and mantelhaen.test in R). This test produces the Mantel-Haenszel statistic, also known as the "average partial association" statistic.
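The Cochran-Mantel-Haenszel statistic can likewise be computed from first principles on K stratified 2x2 tables; the counts below are hypothetical. (R's mantelhaen.test applies a continuity correction by default, so its value will differ slightly from this uncorrected version.)

```python
# Cochran-Mantel-Haenszel statistic for K stratified 2x2 tables, without
# continuity correction. Each stratum is [[a, b], [c, d]] -- hypothetical counts.

strata = [[[10, 5], [5, 10]],
          [[10, 5], [5, 10]]]

num, var = 0.0, 0.0
for (a, b), (c, d) in strata:
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n                                   # E[a] under the null
    v_a = (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    num += a - e_a
    var += v_a

cmh = num ** 2 / var   # compare against a chi-square distribution with 1 df
```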

Read Cartwright

Chapter 1: How to Get Causes from Probabilities

  • Statistical correlations as indicators of causation
    • Thesis of Chapter 1: in the context of causal modelling theory, probabilities are an entirely reliable instrument for discovering causal laws.
    • It is widely agreed by proponents of causal modelling techniques that causal relations cannot be analysed in terms of probabilistic regularities. Nevertheless, statistical correlations seem to be some kind of indicator of causation (p. 11).
      • I think Pearl would agree here.
    • Draws on econometrics.
  • Pg. 23 - three questions
    • Humean problem: under what conditions do the parameters in a set of linear equations determine causal connections?
    • Problem of identification
    • Problem of estimation
  • a serious problem for any attempt to ground causal claims in functional relationships: 'If altering the equations to a mathematically equivalent set alters the causal relations expressed, then those relations involve additional structure not entirely caught by the equations alone' (p. 20).
    • What is the relationship between this understanding of causal claims re functional relationships and Pearl's definition of his arrows as indicative of functional dependencies?
  • In progress.
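Cartwright's p. 20 worry about mathematically equivalent equation sets can be seen in a toy simulation: data generated by a structural equation for Y in terms of X also supports an equally good regression of X on Y, so the functional relationships alone do not pin down the causal relations. Numbers are invented:

```python
import random, statistics

# Sketch of the p. 20 point: the joint distribution produced by the
# structural equation Y = 0.8*X + u fits a regression in either direction,
# so the equations alone (or any mathematically equivalent rewriting of
# them) underdetermine the causal relations. Hypothetical numbers.

random.seed(5)
X = [random.gauss(0, 1) for _ in range(10000)]
Y = [0.8 * x + random.gauss(0, 1) for x in X]    # "true" law: X -> Y

mx, my = statistics.fmean(X), statistics.fmean(Y)
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)

b_yx = sxy / sxx   # slope of Y on X: ~0.8, matching the structural equation
b_xy = sxy / syy   # slope of X on Y: ~0.49, an equally well-fitting "law"
# Both fits are statistically impeccable; the data alone cannot tell them apart.
```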

Chapter 2: No Causes In, No Causes Out

  • Thesis: There is no going from pure theory to causes, no matter how powerful the theory (p. 39).
  • In progress.

ML and Causal Discovery

Communications of the ACM, Judea Pearl: The Seven Tools of Causal Inference

Tool 7. Causal discovery. The ''d-separation criterion'' described earlier enables machines to detect and enumerate the testable implications of a given causal model. This opens the possibility of inferring, with mild assumptions, the set of models that are compatible with the data and to represent this set compactly. Systematic searches have been developed that, in certain circumstances, can prune the set of compatible models significantly to the point where causal queries can be estimated directly from that set. Alternatively, Shimizu et al. proposed a method for discovering causal directionality based on functional decomposition [24]. The idea is that in a linear model X → Y with non-Gaussian noise, P(y) is a convolution of two non-Gaussian distributions and would be, figuratively speaking, "more Gaussian" than P(x). The relation of "more Gaussian than" can be given precise numerical measure and used to infer directionality of certain arrows. Tian and Pearl developed yet another method of causal discovery based on the detection of "shocks," or spontaneous local changes in the environment that act like "nature's interventions," and unveil causal directionality toward the consequences of those shocks.
  • Cited for bolded section above: Tian, J. and Pearl, J. A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence (Edmonton, AB, Canada, July 28-Aug. 1). AAAI Press/MIT Press, Menlo Park, CA, 2002, 567–573.
    • Link:
    • Paper's objective: to establis[h] a necessary and sufficient criterion for the identifiability of the causal effects of a singleton variable on all other variables in the model, and a powerful sufficient criterion for the effects of a singleton variable on any set of variables (p. 1).
      • New definitions of arrows:
        • The arrows represent the potential existence of direct causal relationships between the corresponding variables, and the bi-directed arcs represent spurious dependencies due to unmeasured confounders (p. 1).
      • Shocks
        • Not explicitly mentioned in this article. (Maybe Pearl has incorrect citation in ACM article?)
        • Very technical, not certain if particularly useful.
        • Another source: Tian and Pearl, Causal Discovery from Changes (2013)
          • Link:
          • Paper's objective: We propose a new method of discovering causal relations in data, based on the detection and interpretation of local spontaneous changes in the environment (p. 512).
            • Draws on dynamic changes, in contrast to the static statistical distributions assumed by previous methods such as Cooper & Yoo's (1999) Bayesian method of causal discovery and Spirtes et al.'s (1993) use of experimental data to locate and identify causal relationships.
            • Example: natural disasters, armed conflicts, epidemics, labor disputes, and even mundane decisions by other agents, are unexpected eventualities that are not naturally captured in distribution functions. The occurrence of such eventualities tend to alter the distribution under study and yield changes that are markedly different from ordinary statistical fluctuations. Whereas static analysis views these changes as nuisance, and attempts to adjust and compensate for them, we will view them as a valuable source of information about the data-generating process (p. 512).
              • AI and multiagent environments?
            • The idea is based on economics - see Kevin Hoover (1990)
            • Tian and Pearl mention modularity in this paper -- goes back to Cartwright's interpretation in 2007.
              • The factorization in Eq. (1) obtains causal character through the assumption of modularity; (p. 513).
            • Explains independence equivalence of causal diagrams.
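The "more Gaussian" heuristic quoted from Pearl above can be illustrated numerically. A minimal sketch (my own toy setup, not Shimizu et al.'s actual algorithm): in a linear model X → Y with uniform (non-Gaussian) noise, excess kurtosis is used as a rough proxy for non-Gaussianity (it is 0 for a Gaussian), and the effect Y should land closer to 0 than the cause X.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 20_000

# Linear non-Gaussian model: X -> Y, with uniform noise.
x = rng.uniform(-1, 1, n)        # cause: uniform, excess kurtosis ~ -1.2
noise = rng.uniform(-1, 1, n)
y = 2 * x + noise                # effect: a convolution of two uniforms

# Excess kurtosis is 0 for a Gaussian; the convolution should be
# "more Gaussian" (closer to 0) than the cause, hinting at X -> Y.
kx, ky = kurtosis(x), kurtosis(y)
print(f"excess kurtosis of X: {kx:.2f}, of Y: {ky:.2f}")
assert abs(ky) < abs(kx)
```

This is only the directionality intuition; the full LiNGAM method of Shimizu et al. uses independent component analysis rather than a single kurtosis comparison.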

Data sets - randomly generated, applicable real-life scenarios

  • In preparation for DoWhy

Yoshua Bengio and Causality + Deep Learning

JOR 21Oct2019

Glymour, Zhang, Spirtes. "Review of Causal Discovery Methods Based on Graphical Models."
Front. Genet. 10:524, 2019.
p.1: "traditional way is to use interventions or randomized experiments."
p.2: PC: Assumes no confounder; asymptotically correct. FCI (Fast Causal Inference): asymptotically correct even w confounders. Both output equivalence classes of graphs. (See Eberhardt Fig.2.)
p.2: FCMs (Func Causal Models): can distinguish between different DAGs in the same equivalence class. "the causal direction between X and Y is identifiable because the indep condition between the noise and cause holds only for the true causal direction and is violated for the wrong direction."
p.6: "why it is possible to identify the causal direction btwn two variables in the linear case." Worth studying these examples.
p.13: "the condition that noise is independent from causes plays an important role."

Week 7 (24 Oct 2019)

Causal Discovery

Glymour, Zhang, Spirtes. Review of Causal Discovery Methods Based on Graphical Models. Front. Genet. 10:524, 2019.

  • Link:
  • Goal of the paper: It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, which is known as causal discovery or causal structure search. [T]his paper aims to give a[n] introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades... (p.1).
1 Introduction
  • We focus on it here because while apparently first proposed in 2000 for studies of gene expression (Murphy and Mian, 1999; Friedman et al., 2000; Spirtes et al., 2000), the models have found wide use in systems biology, especially in omics and in neural connectivity studies... (p.1).
    • Pearl proposes DAGs for ML systems, largely inspired by econometrics literature. Are Glymour et al. saying that research on causal discovery was uniquely and independently developed within the field of biology? Could there be a connection between this development in 2000 and Pearl's first edition (2000) of Causality (and even econometrics literature re DAGs)?
    • In traditional causality research, algorithms for identification of causal effects, or inferences about the effects of interventions, when the causal relations are completely or partially known, address a different class of problems; see Pearl (2000) and references therein (p.1).
      • What exactly is the difference?
    • Look further into FCMS for next week.
2 Directed Graphical Causal Models
  • A class of DGCMs, commonly presented as “structural equation models” (SEMs), or functional causal models (FCMs), assumes the value of each variable is a deterministic function of its direct causes in the graph and the unmeasured disturbances (p.2).
    • Are SEMs and FCMs the same?
  • It is important to note that d-separation and related properties provide necessary but not sufficient conditions for conditional independence relations in the joint probability distribution over the values of the variables (p.3).
    • Where to place d-separation w.r.t. information found in previous weeks and this week?
  • What is the success rate of using these algorithms (e.g. PC and FCI) in biological studies? If there is success in genetic applications, can we extrapolate a similar degree of success for phenomena/events on a larger scale (i.e. policy-making)?
    • ...although the past 30 years witnessed remarkable progress in development of theories and practical methods for automated causal search, they received rather limited applications in biology... (p.11).
  • Two types of causal search algorithms:

One makes use of conditional independence relations in the data to find a Markov equivalence class of directed causal structures. Typical algorithms include the PC algorithm, FCI, and the GES algorithm. The other makes use of suitable classes of structural equation models and is able to find a unique causal structure under certain assumptions, for which the condition that noise is independent from causes plays an important role (p.12-13).

Eberhardt, F. Introduction to the Foundations of Causal Discovery. International Journal of Data Science and Analytics 3, 81-91 (2017).

1 Introduction
  • Like many scientific concepts, causal relations are not features that can be directly read off from the data, but have to be inferred. The field of causal discovery is concerned with this inference and the assumptions that support it (p. 1).
    • This is reasonable. Successful example: inferring a possible causal relationship between cigarette smoking and illness -- later confirmed that smoking causes lung cancer.
  • R.A. Fisher's work on experimental design: randomization and causal discovery
    • Randomization breaks confounding, whether common cause is observed or not.
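Fisher's point can be simulated directly (a toy setup of my own: a confounder Z drives both X and Y, and X has no causal effect on Y). Under passive observation X and Y are correlated via Z; randomizing X breaks the dependence.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
z = rng.normal(size=n)                  # unobserved common cause

# Observational regime: Z drives both X and Y; X has NO effect on Y.
x_obs = z + rng.normal(size=n)
y_obs = z + rng.normal(size=n)

# Randomized regime: X is assigned independently of Z (the "coin flip").
x_rand = rng.normal(size=n)
y_rand = z + rng.normal(size=n)

corr_obs = np.corrcoef(x_obs, y_obs)[0, 1]     # spurious, ~0.5
corr_rand = np.corrcoef(x_rand, y_rand)[0, 1]  # ~0
print(corr_obs, corr_rand)
```

The observational correlation would survive conditioning strategies only if Z were measured; randomization works whether or not the common cause is observed, which is Eberhardt's point.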
2 Basic Assumptions of Causal Discovery
  • We noted earlier that correlation does not imply causation (and similarly, the converse is not true either, though that may not be as obvious initially). Yet, we do take both dependencies and independencies as indicators of causal relations (or the lack thereof) [...] Consequently, there seem to be principles we use – more or less explicitly – that connect probabilistic relations to causal relations. (p. 3).
    • See my question from a previous week.
    • Dependent upon two principles/conditions:
  1. causal Markov
  2. causal faithfulness
- these together imply a correspondence between the (conditional) independences in the probability distribution and the causal connectivity relations within the graph G (p.3)
- that together ensure that (conditional) d-separation corresponds to (conditional) probabilistic independence (p.3):

X ⊥_G Y | C ⇔ X ⊥_P Y | C

(d-separation of X and Y given C in the graph G on the left; probabilistic independence in the distribution P on the right)
4 Non-linear additive noise models
  • Pearl himself is focusing on non-linearity re causal models.
  • Eberhardt notes re non-linearity: i.e. when each variable x_j is determined by a non-linear function f_j of the values of its parents plus some additive noise (p.7).
  • He points out: To my knowledge there is very little work (other than some subcases of the additive noise models referred to above) that has developed general restrictions to enable identifiability of the causal structure for discrete models with finite state spaces (p.9).
    • Could work on entropy and deep learning address this additive noise?
    • discrete models with finite state spaces - sounds similar to understandings of classic AI.
5 Restrictions on multinomial distributions
  • Using Boolean SAT solvers and propositional logic
    • Once all the information has been encoded in constraints in propositional logic, one can use standard Boolean SAT(isfiability) solvers to determine solutions consistent with the joint set of constraints (p.8).
      • Could there be a link between symbolic AI and current ML applications here (i.e. KRR and ML via causality)?
  • See Hyttinen et al's article (2013) below.
6 Experiments and Background Knowledge
  • Eberhardt notes: Hyttinen et al. used a query-based approach to check the conditions for the applications of the rules of the do-calculus when the true graph is unknown (p.13).
    • See Hyttinen et al.'s (2015) article below.

Hyttinen, A., Hoyer, P., Eberhardt, F., Järvisalo, M. Discovering cyclic causal models with latent variables: A general SAT-based procedure. In: Proceedings of UAI, pp. 301–310. AUAI Press (2013)

  • Link:
  • Goal of paper: We present a very general approach to learning the structure of causal models based on d-separation constraints, obtained from any given set of overlapping passive observational or experimental data sets (p.1).
  • Look into again!
  • Use of solvers and logic!!(KRR)
    • Our approach is based on a logical representation of causal pathways, which permits the integration of quite general background knowledge, and inference is performed using a Boolean satisfiability (SAT) solver (p.1).
    • The SAT-based approach to causal structure discovery by Triantafillou et al. (2010) uses an encoding based on partial ancestral graphs (PAGs), a particular form of equivalence class (p. 4).
      • See 2010 paper.
    • Claassen, T., Heskes, T. A Logical Characterization of Constraint-Based Causal Discovery (2011)
      • Link:
      • Goal of paper: We present a novel approach to constraint-based causal discovery, that takes the form of straightforward logical inference, applied to a list of simple, logical statements about causal relations that are derived directly from observed (in)dependencies. It is both sound and complete, in the sense that all invariant features of the corresponding partial ancestral graph (PAG) are identified, even in the presence of latent variables and selection bias.
        • Is there a relationship between this and Pearl's concept of causal discovery? If not, could this addition improve on Pearl's ideas?

Hyttinen, A., Eberhardt, F., Järvisalo, M. Do-calculus When the True Graph is Unknown. In: Proceedings of UAI (2015)

Geiger, D., Pearl, J. On the Logic of Causal Models. (2013)

Eberhardt, F., Scheines, R. Interventions and Causal Inference. Philosophy of Science 74:5, 2006.

Week 8 (31 Oct 2019)

See tentative outline/table of contents. Regroup and reorganize thoughts, plans, research previously done.

Week 9 (07 November 2019)

In progress with writing an incomplete draft of thesis. Began conversation with Professor Jordan Crouser on finding the appropriate dataset. Conversation will continue on 19 November 2019.

Week 10 (14 November 2019)

In progress with writing an incomplete draft of thesis.

JOR Notes 14Nov2019

The "cars" R dataset might be usable: cars. It is from the 1920's, but that is irrelevant. The important point is that it is clear that speed is the cause of ("drives") the stopping distance, and not the other way around. So maybe it could be used to test causality discovery.


Week 11 (21 November 2019)

In progress with writing draft of thesis.

JOR Notes 24Nov2019


The left plot is Y vertical vs X horizontal, the right the reverse. Both show linear regressions. The data was generated as Y = 2 X + ε, with ε uniformly distributed in [-1,+1].
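This data-generation scheme also lends itself to a direction-finding sketch (assuming X itself is sampled uniformly, which the note leaves unspecified). In the true direction the regression residuals are fully independent of the regressor; in the reverse direction they are merely uncorrelated with it by construction of least squares, so a check on squared values can expose the asymmetry:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.uniform(-1, 1, n)              # assumed cause distribution
y = 2 * x + rng.uniform(-1, 1, n)      # Y = 2X + eps, eps ~ U[-1, +1]

def residual_dependence(cause, effect):
    """Fit OLS effect ~ cause; return a higher-order dependence measure
    between the residuals and the regressor (near 0 only when the
    residuals are truly independent of the regressor)."""
    slope, intercept = np.polyfit(cause, effect, 1)
    resid = effect - (slope * cause + intercept)
    return abs(np.corrcoef(resid**2, cause**2)[0, 1])

forward = residual_dependence(x, y)    # true direction: ~0
backward = residual_dependence(y, x)   # wrong direction: clearly > 0
print(forward, backward)
```

The backward fit's residual spread shrinks near the extremes of Y (the joint support is a parallelogram), so the squared residuals depend on Y. This is the same additive-noise-model asymmetry discussed in the Glymour et al. and Eberhardt readings above.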

Week 12 (28 November 2019)

Still working on draft. Explore the possibility that causality may not necessarily be the answer to today's "black box problem" of ML, although it may illuminate future AI mechanisms moving forward.

Week 13 (05 December 2019)

Draft writing. Sources found increasingly confirm the above possibility.

Week 14 (12 December 2019)

Established beginning structure of chapter on causal discovery and formatted Mma code into figures in the experimental section of the chapter.

JOR Notes (16 Feb 2020)

Bengio et al. "Independently Controllable Features." 2017.

One useful aspect of this short, preliminary paper is its explanation of autoencoders: a pair of function approximators, where f maps the input space X to a "latent space" H, and g maps back to X, trained so that the reconstruction error is minimized. The latent space has much smaller dimension than the input space, so dimensionality reduction is forced.
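A minimal numeric sketch of that encoder/decoder pair (a hypothetical linear autoencoder trained by plain gradient descent on synthetic data; the paper's models are nonlinear networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data living near a 2-D subspace of a 10-D input space X.
n, d_in, d_latent = 500, 10, 2
basis = rng.normal(size=(d_latent, d_in))
data = rng.normal(size=(n, d_latent)) @ basis + 0.01 * rng.normal(size=(n, d_in))

# f: X -> H (encoder weights W_e), g: H -> X (decoder weights W_d).
W_e = 0.1 * rng.normal(size=(d_in, d_latent))
W_d = 0.1 * rng.normal(size=(d_latent, d_in))

def loss(X):
    # mean squared reconstruction error of g(f(X))
    return np.mean((X @ W_e @ W_d - X) ** 2)

lr = 0.01
initial = loss(data)
for _ in range(500):
    H = data @ W_e                        # encode
    err = H @ W_d - data                  # reconstruction error
    W_d -= lr * (H.T @ err) / n           # gradient step on decoder
    W_e -= lr * (data.T @ (err @ W_d.T)) / n  # gradient step on encoder
final = loss(data)
print(initial, final)
```

Because d_latent (2) is much smaller than d_in (10), minimizing the error forces the latent code to capture the data's low-dimensional structure, which is the dimensionality-reduction point made above.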

They are proposing a method to "learn how the world works" by interacting and observing the effects of the interaction. Very preliminary.

18 February 2020

Bernhard Scholkopf, "Causality for Machine Learning." 2019.

  • Citation of Reichenbach (1956) re the connection between causality and statistical dependence: the Common Cause Principle (p. 3). What is this principle, and what does accepting it assume about the relationship between correlation/statistical dependence and causation?
  • Disentanglement, specifically "causal (or disentangled) factorization" vs. "entanglement" (p. 4) -- what is this more precisely?
  • Algorithmic independence:
So far, I have discussed links between causal and statistical structures. The fundamental of the two is the causal structure, since it captures the physical mechanisms that generate statistical dependencies in the first place. The statistical structure is an epiphenomenon [i.e., a secondary effect or byproduct that arises from but does not causally influence a process] that follows if we make the unexplained variables random. [...] This motivated us to devise an algorithmic model of causal structures in terms of Kolmogorov complexity (Janzing and Scholkopf, 2010) (p. 7).

Here it is implied that causality generates correlation. So if a relationship is causal, then there must be at least one correlation that is a byproduct of the causal relationship. But with causal discovery, we are trying to estimate causal relations from statistical dependencies. Logically, does the reverse of the conditional hold? According to Reichenbach's Common Cause Principle, it should be the case that the reverse of the conditional does hold. Then, relative to one another, which conditional is stronger/more robust?

  • "The problem of causal discovery from observational data" (p. 8): can sometimes recover aspects of the underlying graph from observations by performing conditional independence tests. However, there are several problems with this approach. One is that in practice, our datasets are always finite, and conditional independence testing is a notoriously difficult problem, especially if conditioning sets are continuous and multi-dimensional. ...conditional testing is hard without additional assumptions. The other problem is that in the case of only two variables, the ternary concept of conditional independence collapses and the Markov condition thus has no nontrivial implications. [...] It turns out that both problems can be addressed by making assumptions on function classes (pp. 8-9).

---> Additive noise model (used w.r.t. Dataset #1 in Mathematica):

  • Causal representation learning
Traditional causal discovery and reasoning assumes that the units are random variables connected by a causal graph. Real-world observations, however, are usually not structured into those units to begin with, e.g., objects in images (Lopez-Paz et al., 2017). The emerging field of causal representation learning hence strives to learn these variables from data, much like machine learning went beyond symbol[ic] AI in not requiring that the symbols that algorithms manipulate be given a priori (cf. Bonet and Geffner (2019)).

Bengio et al., "Independently Controllable Features." 2017.

  • Mentions reward-providing environments and Monte-Carlo estimations (p.3).

Monte Carlo (MC) methods are a subset of computational algorithms that use repeated random sampling to make numerical estimations of unknown parameters. They allow for the modeling of complex situations involving many random variables, and for assessing the impact of risk. Succinctly put, MC methods provide the basis for resampling techniques for estimating a quantity, such as the accuracy of a model on a finite dataset. Source:
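As a concrete instance of the resampling use mentioned above, a bootstrap sketch (with made-up per-example correctness values standing in for a real model's test results):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-example correctness of a classifier on 200 test points.
correct = rng.random(200) < 0.8          # true accuracy ~0.8
point_estimate = correct.mean()

# Bootstrap: repeatedly resample the test set with replacement and
# recompute accuracy, giving an empirical distribution of the estimate.
boot = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy ~ {point_estimate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The width of the interval quantifies how much the accuracy estimate could move on a different finite sample, which is exactly the "estimating a quantity from a finite dataset" use described above.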

Phyllis Illari & Federica Russo, Causality: Philosophical Theory Meets Scientific Practice. 2014.

§ 15.2 Causality as regular patterns

  • Hume and the relationship between causality and correlation:
[Hume] believes that the only access to causal relations is empirical, through observed regular successions of events [similar to correlative, statistical dependencies]. [...] The principle of causality that helps us order what we observe arises by custom or habit in our own minds (p. 163).
  • Concern: Then, if this relationship is mind-dependent, then does it truly reflect the events in our physical world?
    • To say that causality is mind-dependent does not imply denial of the existence of the world or the empirical basis of our knowledge. (p. 163). However, rather than denial, what about accuracy?
  • Stathis Psillos: interested in the metaphysics of causality (Psillos, 2002; Psillos, 2009) -- the idea that we only observe regular succession in the world. We do not observe any relation that we are entitled to call causation (p. 163).

§ 15.3 Updating regularity for current science

  • Examples where regular succession -> causality fails:
    • Smoking causes lung cancer. But many smokers do not get lung cancer. While the causal relation between smoking and cancer is not contested, its 'regularity' is not univocally established (p. 164).
    • There can also be regular succession without causality: night invariably follows day, but day does not cause night (p. 164).
  • See Ch 4: Mackie provides an account of complex regularities: several causes acting at the same time, a number of background conditions and possibly negative causal factors (p. 164).
    • Can Dataset #1 (Y->X) be explained by negative causal factors?
  • See Ch 5: ...regularity seems to tie causality to the generic case (p. 164): single-case instantiations of a general causal relation. [...] [I]f we give primacy to single-case causal relations, then regularities at the generic level may just be generalizations, whose causal character is then undermined.
  • Additive noise model mentioned on p. 165.
Based on this model, regularity is about frequency of instantiation in time and/or [s]pace of the same causal relation. In other words, in the data set analysed, cause- and effect-variables wi[ll] show a robust dependence (p. 165). The authors want to scientifically define regularity here.

Tim Miller, Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019): 1-38.

Probabilities probably don’t matter — while truth and likelihood are important in explanation and probabilities really do matter, referring to probabilities or statistical relationships in explanation is not as effective as referring to causes. The most likely explanation is not always the best explanation for a person, and importantly, using statistical generalisations to explain why events occur is unsatisfying, unless accompanied by an underlying causal explanation for the generalisation itself (p.3).
  • Cites Hume also re regularity on p. 5 (see above).
  • Our case: Interventionist theories

of causality [191,58] state that event C can be deemed a cause of event E if and only if any change to event E can be brought about solely by intervening on event C. Probabilistic theories, which are extensions of interventionist theories, state that event C is a cause of event E if and only if the occurrence of C increases the probability of E occurring (p. 5).

  • Ultimately argues that an intelligent agent must be able to reason about its own causal

model (p. 12).

Atoosa Kasirzadeh, Mathematical decisions and non-causal elements of explainable AI. 2019.

  • Argument: Recent conceptual discussion on the nature of the explainability of Artificial Intelligence (AI) has largely been limited to data-driven investigations. This paper identifies some shortcomings of this approach to help strengthen the debate on this subject.