This post is by Kevin Gray and Tyler VanderWeele
A lot of marketing research is aimed at uncovering why consumers do what they do and not just predicting what they’ll do next. Marketing scientist Kevin Gray asks Harvard Professor Tyler VanderWeele about causal analysis, arguably the next frontier in analytics.
Kevin Gray: If we think about it, most of our daily conversations invoke causation, at least informally. We often say things like “I dropped by this store instead of my usual place because I needed to go to the laundry and it was on the way” or “I always buy chocolate ice cream because that’s what my kids like.” First, to get started, can you give us nontechnical definitions of causation and causal analysis?
Tyler VanderWeele: Well, it turns out that there a number of different contexts in which words like “cause” and “because” are used. Aristotle, in his Physics and again in his Metaphysics, distinguished between what he viewed as four different types of causes: material causes, formal causes, efficient causes, and final causes. Aristotle described the material cause as that out of which the object is made; the formal cause as that into which the object is made; the efficient cause as that which makes the object; and the final cause that for which the object is made. Each of Aristotle’s “causes” offers some form of explanation or answers a specific question: Out of what?. . . Into what. . . ? By whom or what. . .? For what purpose. . .?
Causal inference literature in statistics, and in the biomedical and social sciences focus on what Aristotle called “efficient causes.” Science in general focuses on efficient causes and perhaps, to a certain extent, material and formal causes. We only really use “cause” today to refer to efficient causes and perhaps sometimes final causes. However, when we give explanations like, “I always buy chocolate ice cream because that’s what my kids like” we are talking about human actions and intention and these Aristotle referred to as final causes. We can try to predict actions, and possibly even reasons, but again the recent developments in causal inference literature in statistics and the biomedical and social sciences focus more on “efficient causes.” Even such efficient causes are difficult to define precisely. The philosophical literature is full of attempts at a complete characterization and we arguably still are not there yet (e.g. a necessary and sufficient set of conditions for something to be considered “a cause”).
However, what there is relative consensus on is that there are certain sufficient conditions for something to be “a cause.” These are often tied to counterfactuals, so that if there are settings in which an outcome would have occurred if a particular event took place, but the outcome would not have occurred if that event hadn’t taken place then this would be a sufficient condition for that event to be a cause. Most of the work in the biomedical and social sciences on causal inference has focused on this sufficient condition of counterfactual dependence in thinking about causes. This has essentially been the focus of most “causal analysis”, an analysis of counterfactuals.
KG: Could you give us a very brief history of causal analysis and how our thinking about causation has developed over the years?
TV: In addition to Aristotle above, another major turning point was Hume’s writing on causation which fairly explicitly tied causation to counterfactuals. Hume also questioned whether causation was anything except the properties of spatial and temporal proximity, plus the constant conjunction of that which we called the cause and that which we called the effect, plus some idea in our minds that the cause and effect should occur together. In more contemporary times within the philosophical literature David Lewis’ work on counterfactuals provided a more explicit tie between causation and counterfactuals and similar ideas began to appear in the statistics literature with what we now call the potential outcomes framework, ideas and formal notation suggested by Neyman and further developed by Rubin, Robins, Pearl and others. Most, but not all, contemporary work in the biomedical and social sciences uses this approach and effectively tries to ask if some outcome would be different if the cause of interest itself had been different.
KG: “Correlation is not causation” has become a buzz phrase in the business world recently, though some seem to misinterpret this as implying that any correlation is meaningless. Certainly, however, trying to untangle a complex web of cause-and-effect relationships is usually not easy – unless a machine we’ve designed and built ourselves breaks down, or some analogous situation. What are the key challenges in causal analysis? Can you suggest simple guidelines marketing researchers and data scientists should bear in mind?
TV: One of the central challenges in causal inference is confounding, the possibility that some third factor, prior to both the supposed cause and the supposed effect is in fact what is responsible for both. Ice cream consumption and murder are correlated, but ice cream probably does not itself increase murder rates. Rather, both go up during summer months. When analyzing data, we try to control for such common causes of the exposure or treatment or cause of interest and the outcome of interest. We often try to statistically control for any variable that precedes and might be related to supposed cause or the outcome or effect we are studying to try to rule this possibility out.
However, we generally do not want to control for anything that might be affected by the exposure or cause of interest because these might be on the pathway from cause to effect and could explain the mechanisms for the effect. If that is so, then the cause may still lead to the effect but we simply know more about the mechanisms. I have in fact written a whole book on this topic. But if we are just trying to control for confounding, so as to provide evidence for a cause-effect relationship then we generally only want to control for things preceding both the cause and the effect.
Of course, in practice we can never be certain we have controlled for everything possible that precedes and might explain them both. We are never certain that we have controlled for all confounding. It is thus important to carry out sensitivity analysis to assess how strong an unmeasured confounder would have been related to both the cause and the effect to explain away a relationship. A colleague and I recently proposed a very simple way to carry this out. We call it the E-value, which we hope will supplement in causal analysis, the traditional p-value, which is a measure of evidence that two things are associated, not that they are causally related. I think this sort of sensitivity analysis for unmeasured or uncontrolled confounding is very important in trying to establish causation. It should be used with much greater frequency.
KG: Many scholars in medical research, economics, psychology and other fields have been actively developing methodologies for analyzing causation. Are there differences in the ways causal analysis is approached in different fields?
TV: I previously noted the importance of trying to control for common causes of the supposed cause and the outcome of interest. This is often the approach taken in observational studies in much of the biomedical and social science literature. Sometimes it is possible to randomize the exposure or treatment of interest and this can be a much more powerful way to try to establish causation. This is often considered the gold standard for establishing causation. Many randomized clinical trials in medicine have used this approach and it is also being used with increasing frequency in social science disciplines like psychology and economics.
Sometimes, economists especially, try to use what is sometimes called a natural experiment, where it seems as though something is almost randomized by nature. Some of the more popular of such techniques are instrumental variables and regression discontinuity designs. There are a variety of such techniques and these require different types of data and assumptions and analysis approaches. In general, the approach used is going to depend on the type data that is available, and whether it is possible to randomize, and this will of course vary by discipline.
KG: In your opinion, what are the most promising developments in causal analysis, i.e., what’s big on the horizon?
TV: Some areas that might have exciting developments in the future include causal inference with network data, causal inference with spatial data, causal inference in the context of strategy and game theory, and the bringing together of causal inference and machine learning.
KG: Do Big Data and Artificial Intelligence (AI) have roles in causal analysis?
TV: Certainly. In general, the more data that we have the better off we are in about ability to make inferences. Of course, the amount of the data is not the only thing that is relevant. We also care about the quality of the data and the design of the study that was used to generate it. We also must not forget the basic lessons on confounding in the context of big data. I fear many of the principles of causal inference we have learned over the years are sometimes being neglected in the big-data age. Big data is helpful but the same interpretative principles concerning causation still apply. We do not just want lots of data; rather the ideal data for causal inference will still include as many possible confounding variables as possible, quality measurements, and longitudinal data collected over time. In all of the discussions about big data we really should be focused on the quantity-quality trade-off.
Machine learning techniques also have an important role in trying to help us understand which variables, of the many possible, are most important to control for in our efforts to rule out confounding. I think this is, and will continue to be, an important application and area of research for machine learning techniques. Hopefully our capacity to draw causal inferences will continue to improve.
KG: Thank you, Tyler!
Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.
Tyler VanderWeele is Professor of Epidemiology at Harvard University. He is the author of Explanation in Causal Inference: Methods for Mediation and Interaction and numerous papers on causal analysis.