 Regular article
 Open Access
Leveraging change point detection to discover natural experiments in data
EPJ Data Science volume 11, Article number: 49 (2022)
Abstract
Change point detection has many practical applications, from anomaly detection in data to scene changes in robotics; however, finding changes in high dimensional data is an ongoing challenge. We describe a self-training, model-agnostic framework to detect changes in arbitrarily complex data. The method consists of two steps. First, it labels data as before or after a candidate change point and trains a classifier to predict these labels. The accuracy of this classifier varies for different candidate change points. Second, by modeling the accuracy change we can infer the true change point and the fraction of data affected by the change (a proxy for detection confidence). We demonstrate how our framework can achieve low bias over a wide range of conditions and detect changes in high dimensional, noisy data more accurately than alternative methods. We use the framework to identify changes in real-world data and measure their effects using regression discontinuity designs, thereby uncovering potential natural experiments, such as the effect of pandemic lockdowns on air pollution and the effect of policy changes on performance and persistence in a learning platform. Our method opens new avenues for data-driven discovery due to its flexibility, accuracy and robustness in identifying changes in data.
1 Introduction
The explosive growth of Big Data has transformed the study of human behavior [1]. Yet one critical use case, inferring the effect of policies and interventions, has proven challenging. To address this challenge, researchers are developing causal inference methods to quantify the effects of actions within heterogeneous observational data [2–5]. One approach to causal inference leverages “natural experiments,” fortuitous occurrences that serve to segment a population into a treatment group that was affected by a change and a control group that was not. Angrist [6], for example, examined the impact of military service on individuals’ lifetime earnings using the Vietnam War draft lottery to separate individuals who performed military service (the treatment group) from those who did not serve (the control group). Comparing these populations allowed Angrist to estimate the effect of military service on earnings. Since this pioneering study, others have used abrupt changes (raising the legal drinking age [7], changing the minimum wage [8], or modifying a website’s user interface [9]) to infer the effects of policies [10, 11].
Identifying natural experiments requires creativity and luck, which has made this an underutilized tool in the social sciences. One of the main difficulties is to identify exogenous events that may significantly affect a population. This task, however, can be made easier with change point detection, a method that detects events that suddenly modify a feature distribution. Once these change points are found, researchers can look within a narrow time range for events that contributed to these changes and use regression discontinuity to measure their effects. Change point detection, however, is challenging because social data is typically massive (many people) but sparse (few observations per individual), high dimensional (many features), dynamic, and noisy.
A growing body of research has proposed methods to detect change points, from simple approaches based on cumulative summation [12, 13] to more sophisticated methods based on Markov models [14, 15] and Bayesian statistics [16]. Many of the existing methods, however, are bespoke to problem domains or are only meant for time series. Bayesian approaches, for example, usually need data to follow a particular set of distributions. Moreover, while these methods will identify where the change occurs, many are not able to quantify estimation error or their confidence in the change. Despite the strengths and successes of existing change point detection methods, there is a critical need for an accurate and general-purpose method that can be applied to various data, including high-dimensional sparse data like video, audio, and EKG sensor signals.
Our contribution
We describe Meta Change Point Detection (MtChD), a self-supervised method for detecting changes in high dimensional data. The method extends a confusion-based training meta-model used to detect phase transitions in matter [17] by introducing a mathematical model of classification accuracy to more precisely infer both when the change occurs and the fraction of data affected by the change [18]. The method labels data as occurring before (0) or after (1) each candidate change point and trains a classifier to predict the labels. A mathematical model is then fit to the classification accuracy as a function of the candidate change point along a feature, t. The model parameters provide an estimate of the expected change point as well as the fraction of data affected, which is a proxy for change confidence: we trust the change point more if a large fraction of data is affected [18].
We apply MtChD to a range of data, both synthetic and real-world, to demonstrate that it has low bias under a wide range of conditions and accurately detects changes in noisy and high-dimensional data, including images and text. Our method uses standard classifiers, such as a random forest or a multi-layer perceptron (MLP), to outperform state-of-the-art change detection methods, even on sparse, noisy, and incomplete real-world data. We show that our method accurately infers events in real-world data that are useful for discovering regression discontinuities that represent potential natural experiments. Examples uncovered by our method include the impact of COVID-19 lockdowns on air pollution and the impact of website policy changes on student performance in a learning platform. Due to MtChD’s flexibility, accuracy and robustness, the proposed framework significantly advances the state-of-the-art in change point detection, thereby opening new opportunities for data-driven discovery.
The rest of the paper is organized as follows. First, we review research on change point detection. Next, we present details of our confusion-based training method and derive the mathematical model of accuracy. We then thoroughly evaluate the performance and robustness of the method on an array of synthetic and real-world datasets, and apply RDD to the discovered real-world events.
2 Related work
2.1 Change point detection
Change point detection has a long history. An early method, called CUSUM [12], can detect changes in univariate time series data, but it assumes the data follow a normal distribution with known parameters and only detects changes in the mean. A major improvement over CUSUM is the family of general likelihood ratio (GLR) test-based algorithms [19–22]. The GLR-based algorithms seek to reject a null hypothesis that observations before and after a proposed change point follow the same parameterized distribution. The estimated change point is the point at which this null hypothesis is least likely compared to a two-distribution hypothesis. With the help of advanced search algorithms [23–27], newer change point detection algorithms based on cost functions can detect multiple (rather than single) change points. A collection of cost functions and search algorithms is available as a Python library called ruptures [23].
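To make the cost-function view concrete, the following is a minimal sketch of single change point detection on a univariate series. It is not the implementation in ruptures; it simply picks the split that minimizes the summed squared deviation of each segment from its own mean, which corresponds to a Gaussian mean-shift likelihood under the assumption of constant variance. The function name `best_split_l2` and the toy series are illustrative assumptions.

```python
import random


def best_split_l2(signal):
    """Return the index k that best splits signal into signal[:k] and signal[k:].

    The cost of a segment is its sum of squared deviations from its own mean;
    the best split minimizes the total cost of the two segments.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two observations")

    # Prefix sums let us evaluate any segment cost in constant time.
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def cost(lo, hi):
        # Sum of squared deviations of signal[lo:hi] from its mean.
        m = hi - lo
        s = prefix[hi] - prefix[lo]
        return (prefix_sq[hi] - prefix_sq[lo]) - s * s / m

    return min(range(1, n), key=lambda k: cost(0, k) + cost(k, n))


# Toy series (assumed values): the mean shifts from 0 to 2 at index 60.
rng = random.Random(0)
series = [rng.gauss(0.0, 0.5) for _ in range(60)] + [rng.gauss(2.0, 0.5) for _ in range(40)]
estimated_index = best_split_l2(series)
```

A cost of this kind only reacts to a shift in the mean of a single feature, which is one reason such methods can struggle when the change is visible only in the joint distribution of many features.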
Alternative methods for change point detection include hidden Markov model (HMM) and alternative cost function approaches. Change point detection can, for example, be formulated as a state transition in an HMM [15]. There are also Bayesian change point detection methods [16, 28–30]. Moreover, apart from cost function-based change point detection, there exist penalized quasi-likelihood [31] and kernel methods [32]. Unsupervised Change Analysis [33] is the method most closely aligned with ours, as it uses a similar labeling method, but that paper focuses on explaining changes rather than quantifying the change point.
Existing methods have significant drawbacks. First, methods are not generalizable. For example, kernel-based support vector machine methods do not perform as well as deep learning methods on image datasets [34]. Moreover, the computational complexity of segmentation-based methods and Bayesian methods scales quadratically with data length, which makes these methods impractical for long datasets. Although some methods, such as PELT segmentation [27], scale linearly, they require certain assumptions about the data and cost function.
Our method improves on previous methods in several ways. First, it can estimate the fraction of data affected by change, a proxy of change confidence. Moreover, our method can handle many data forms and be applied to many supervised learning models. Finally, our method scales almost linearly with respect to the length of data. This is because our method requires a small number of training rounds (usually no more than 20) for the candidate change points.
2.2 Natural experiments
Natural experiments have become a popular tool to measure the effects of treatments and policy changes. Angrist’s pioneering study [6] used the Vietnam War draft lottery as a natural experiment to measure the effect of military service on individuals’ lifetime earnings. The lottery created a quasi-random assignment, putting some individuals in the treatment group (drafted) and others in the control group (not drafted). Other studies have since leveraged abrupt exogenous changes unrelated to an outcome to separate the population into treated (after the change) and untreated (before the change) groups and compare outcomes for these groups. Regression discontinuity design (RDD), a framework for measuring effects of changes, is a subcategory of natural experiments [35]. Studies have used natural experiments to explore the effect of raising the minimum drinking age on traffic accidents [7], the effect of minimum wage on employment [8], and the impact of the prenatal environment on individuals’ future health [36]. However, identifying natural experiments requires creativity and insight on the part of researchers to connect some random event in the natural world to their research question. Our method offers a systematic approach to sift through observational data to identify candidates for causal inference, such as RDDs.
3 Methodology
Problem statement
Assume we have data of the form \((X_{i}, t_{i}), i = 1,\dots ,n\), where X is an arbitrarily high dimensional vector and t is a different data dimension, such as time. We refer to t as the indicator and look for a change point in t. Assume there is a change at \(t_{0}\) such that some data before the change and some after the change have different distributions. In many datasets, however, only a fraction of data, \(0\le \alpha \le 1\), may show observable changes. Our goal is to infer the change point, \(t_{0}\), and the fraction of data that undergoes the change, α, given the observations \((X_{i}, t_{i})\).
Step 1: Confusion-based training
Similar to [17], we assume a candidate change point \(t=t_{a}\) and label the observed data before \(t_{a}\) as belonging to class \(\tilde{y}_{i} = 0\), and the data after \(t_{a}\) as class \(\tilde{y}_{i} = 1\).
We then train a classifier to predict the labels \(\tilde{y}_{i}\) from the features \(X_{i}\). We plot the accuracy of the classifier as a function of \(t_{a}\) for the entire range of the indicator t. If a true change point exists in the observed range of t, the accuracy vs. \(t_{a}\) curve will rise significantly above the baseline accuracy, which is the majority class ratio of the labels ỹ. The shape of the curve will be affected both by the actual change point, \(t_{0}\), and the fraction of data points affected by the change, α. Any classifier can be used; we use random forests and MLPs in the applications described in this paper. For each candidate change point \(t_{a}\), classifiers are trained on random splits of 50% of data, validated on 30%, and tested on 20%. The test set is used to judge the accuracy of the learned models for each \(t_{a}\). This step is known as confusion-based training.
Accuracy varies significantly with \(t_{a}\): near the beginning and end of the dataset, accuracy is nearly 1 (we get high accuracy since a large portion of data is labeled “0” or “1”), but accuracy drops when we move away from these extremes. If \(t_{a}\) is near \(t_{0}\), the accuracy will again be high because in this case the created labels ỹ match the true change in data. Thus an accuracy versus \(t_{a}\) plot will have a “W” shape [17].
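The labeling and training loop of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: a one-feature threshold rule stands in for the random forest or MLP, a single train/test split replaces the 50/30/20 split, and the toy data, function names, and parameter values are all hypothetical.

```python
import random


def predict(x, threshold, label_above):
    """Threshold rule: label_above when x exceeds the threshold, else the other label."""
    return label_above if x > threshold else 1 - label_above


def fit_stump(xs, ys, n_thresholds=40):
    """Pick the threshold rule with the best training accuracy (stand-in classifier)."""
    lo, hi = min(xs), max(xs)
    # j = -1 puts the threshold below every x, i.e. "always predict one label".
    thresholds = [lo + (hi - lo) * j / n_thresholds for j in range(-1, n_thresholds + 1)]
    best_rule, best_hits = None, -1
    for thr in thresholds:
        for label_above in (0, 1):
            hits = sum(predict(x, thr, label_above) == y for x, y in zip(xs, ys))
            if hits > best_hits:
                best_rule, best_hits = (thr, label_above), hits
    return best_rule


def accuracy_curve(xs, ts, candidates, rng, train_frac=0.7):
    """Held-out accuracy of predicting the before/after label for each candidate t_a."""
    indices = list(range(len(xs)))
    curve = []
    for t_a in candidates:
        labels = [0 if t <= t_a else 1 for t in ts]  # 0 = before t_a, 1 = after
        rng.shuffle(indices)  # fresh random split for every candidate
        cut = int(train_frac * len(indices))
        train, test = indices[:cut], indices[cut:]
        thr, label_above = fit_stump([xs[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(xs[i], thr, label_above) == labels[i] for i in test)
        curve.append(hits / len(test))
    return curve


# Toy data (assumed): every point is affected; x shifts from N(0, 1) to N(3, 1) at t0 = 0.5.
rng = random.Random(0)
ts = [rng.random() for _ in range(2000)]
xs = [rng.gauss(0.0 if t <= 0.5 else 3.0, 1.0) for t in ts]
candidates = [0.1, 0.25, 0.5, 0.75, 0.9]
curve = accuracy_curve(xs, ts, candidates, rng)
```

On data like this, the accuracy should be high at the extreme candidates, dip at 0.25 and 0.75, and peak again at 0.5, which is the “W” shape described above.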
Step 2: Modeling accuracy vs. \(t_{a}\) curve
We show that by modeling this accuracy curve we can better infer \(t_{0}\) and, in contrast to Step 1 alone, we can also estimate α. We assume that the change happens instantaneously to simplify calculations. We model the CDF of t, \(F(t)\), using a cubic spline of the empirical CDF, \(\tilde{F}(t)=\frac{1}{n}\sum_{i}\mathbb{1}(t_{i}\le t)\). (Other options should not significantly affect the results.) Data X can fall into three categories (or three distinguishable distributions): (a) a distribution that does not change, \(S_{u}\), which comprises \(1-\alpha \) of all data; (b) a distribution before the change (\(t\leq t_{0}\)), \(S_{0}\); (c) a distribution after the change (\(t>t_{0}\)), \(S_{1}\). We do not know these distributions a priori but we assume the trained classifier will be able to distinguish these distributions using data X.
Assume that the distribution of t is independent of the event \(X \in S_{u}\), \(X \in S_{0}\) or \(X \in S_{1}\). With the real change point located at \(t_{0}\), given any t, we assume that among the α fraction of data affected by the change, a \(\theta (t-t_{0})\) fraction belongs to \(S_{1}\) and a \(1-\theta (t-t_{0})\) fraction belongs to \(S_{0}\). Here \(\theta (\cdot )\) is the Heaviside step function, representing an instantaneous change, but a gradual change can be modeled using a sigmoid-like function. We can estimate the fractions of data in \(S_{u}\), \(S_{0}\), and \(S_{1}\) as \(P_{S_{u}} = 1-\alpha \), \(P_{S_{0}} = \alpha F(t_{0})\), and \(P_{S_{1}} = \alpha (1-F(t_{0}))\), respectively.
Recall we label data as “0” if \(t \leq t_{a}\) and “1” otherwise. Given candidate change point \(t_{a}\), \(P_{S_{u,0}}=(1-\alpha )F(t_{a})\) of data in \(S_{u}\) is labeled “0” and \(P_{S_{u,1}}=(1-\alpha )-P_{S_{u,0}}\) is labeled “1”. On top of this, for a data point in \(S_{u}\), the expected predicting accuracy should be \(\frac{1}{1-\alpha}\max (P_{S_{u,0}}, P_{S_{u,1}})\). Similarly, we can calculate the ratio of data labeled as “0” or “1” in \(S_{0}\) and \(S_{1}\), respectively. For \(S_{1}\), which has fraction \(P_{S_{1}} = \alpha (1-F(t_{0}))\) and contains only data with \(t>t_{0}\), the fraction of data labeled “1” is \(P_{S_{1,1}} = \alpha (1-F(\max (t_{a}, t_{0})))\).
And the fraction of data labeled “0” is \(P_{S_{1,0}} = P_{S_{1}} - P_{S_{1,1}}\).
The expected predicting accuracy for \(S_{1}\) is thus \(\frac{1}{\alpha (1-F(t_{0}))}\max (P_{S_{1,0}}, P_{S_{1,1}})\). Finally, \(S_{0}\) has a fraction of \(P_{S_{0}} = \alpha F(t_{0})\). The total fraction of data labeled “0” in both \(S_{0}\) and \(S_{1}\) is \(P_{S_{0,0}} + P_{S_{1,0}} = \alpha F(t_{a})\).
This gives \(P_{S_{0,0}} = \alpha F(t_{a}) - P_{S_{1,0}}\). Therefore the fraction in \(S_{0}\) incorrectly labeled as “1” is \(P_{S_{0,1}} = P_{S_{0}} - P_{S_{0,0}}\).
The expected predicting accuracy for a data point in \(S_{0}\) is then \(\frac{1}{\alpha F(t_{0})}\max{( P_{S_{0,0}}, P_{S_{0,1}})}\).
We then utilize the results above to estimate the accuracy as a function of \(t_{a}\) using the average predicting accuracy in \(S_{u}\), \(S_{0}\) and \(S_{1}\) weighted by the fraction of these three sets. Namely, the expected accuracy, written here as \(A(t_{a})\), is \(A(t_{a}) = \max (P_{S_{u,0}}, P_{S_{u,1}}) + \max (P_{S_{0,0}}, P_{S_{0,1}}) + \max (P_{S_{1,0}}, P_{S_{1,1}})\).
These variables depend only on the empirically estimated CDF, \(F(t)\), and the free parameters \(t_{0}\) and α. We therefore do not need to know the distributions of \(S_{0}\), \(S_{1}\) and \(S_{u}\). To estimate \(t_{0}\) and α, we can do a grid search and use a mean squared error cost function to fit the observed accuracy. The standard errors of α and \(t_{0}\) are estimated via multiple random splits of data. The source code is available on our GitHub repository.^{Footnote 1}
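The accuracy model and grid search of Step 2 can be sketched as below. This is an illustrative reading of the text rather than the released code: it assumes data with \(t \le t_{a}\) is labeled “0”, uses the raw empirical CDF instead of a cubic spline, and the function names, grids, and toy values are hypothetical.

```python
import bisect


def make_ecdf(ts):
    """Empirical CDF of the indicator values (no spline smoothing)."""
    ordered = sorted(ts)
    n = len(ordered)
    return lambda t: bisect.bisect_right(ordered, t) / n


def model_accuracy(t_a, t0, alpha, F):
    """Expected accuracy at candidate t_a for change point t0 and affected fraction alpha."""
    # Unaffected data S_u: the classifier can only guess the majority label.
    su0 = (1 - alpha) * F(t_a)
    su1 = (1 - alpha) - su0
    # Affected data after the change, S_1 (only points with t > t0).
    s11 = alpha * (1 - F(max(t_a, t0)))
    s10 = alpha * (1 - F(t0)) - s11
    # Affected data before the change, S_0.
    s00 = alpha * F(t_a) - s10
    s01 = alpha * F(t0) - s00
    return max(su0, su1) + max(s00, s01) + max(s10, s11)


def fit_change_point(candidates, observed, F, t0_grid, alpha_grid):
    """Grid search for (t0, alpha) minimizing the mean squared error to observed accuracy."""
    def mse(params):
        t0, alpha = params
        return sum((model_accuracy(t_a, t0, alpha, F) - acc) ** 2
                   for t_a, acc in zip(candidates, observed)) / len(candidates)

    return min(((t0, a) for t0 in t0_grid for a in alpha_grid), key=mse)


# Check on a noise-free toy curve: uniform indicator, t0 = 0.5, alpha = 0.6 (assumed values).
F = make_ecdf([i / 1000 for i in range(1000)])
candidates = [j / 10 for j in range(1, 10)]
observed = [model_accuracy(t_a, 0.5, 0.6, F) for t_a in candidates]
t0_hat, alpha_hat = fit_change_point(
    candidates, observed, F,
    t0_grid=[j / 20 for j in range(1, 20)],
    alpha_grid=[j / 10 for j in range(11)],
)
```

In practice `observed` would hold the measured test accuracies from Step 1, and repeating the fit over several random data splits would give the spread used for the standard errors.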
3.1 State-of-the-art
We compare our method against state-of-the-art change detection methods. These methods can be divided into two groups: optimal segmentation algorithms and Bayesian change point detection. The optimal segmentation algorithms we compare against include dynamic programming (DP) [24], binary segmentation [25], bottom-up methods [26], and window-based methods [23] with \(L_{1}\), \(L_{2}\), normal distribution loss and RBF kernel loss functions. These algorithms are implemented in the Python package ruptures [23]. We also compare against GLR, which is equivalent to optimal segmentation with a normal distribution likelihood cost function. Bayesian change point detection requires a prior and a likelihood function. We used uniform and geometric distributions as priors and applied Gaussian, individual feature model [30], and full covariance model [30] as likelihood functions. We used a Python implementation for Bayesian change point detection available from GitHub.^{Footnote 2}
4 Results
We demonstrate the accuracy and robustness of our method on data from a variety of domains. We first apply it to synthetic data to evaluate the method’s performance and robustness with respect to noise, then apply it to real-world data to discover changes corresponding to external events. Finally, we illustrate how leveraging regression discontinuities around the newly discovered changes enables us to estimate the effects of events and policies.
4.1 Discovering changes in synthetic data
4.1.1 Synthetic “chessboard pattern”
In this experiment, we generate two-dimensional numeric data in a chessboard pattern, with two features \(x_{1}\) and \(x_{2}\), each in the range \([0, 1]\), as shown in Fig. 1. At a time \(t_{0}\), data points spread uniformly at random within the blue squares of an \(n_{c}\times n_{c}\) chessboard move to the orange squares of the chessboard. Mathematically, the squares of an \(n_{c}\times n_{c}\) chessboard are distinguished by the parity of \(\lfloor n_{c} x_{1}\rfloor + \lfloor n_{c} x_{2}\rfloor \): data generated before \(t_{0}\) fall in squares of one parity and data generated after \(t_{0}\) fall in squares of the other.
For the first part of this experiment, we set \(t_{0} = 0.5\) and the size of the data \(N = 8K\). We use different arrangements of the chessboard, \(n_{c} = \{2, 4, 6, 8, 10\}\). For higher \(n_{c}\), the data is grouped into smaller chess squares with fewer data points per square. For the second part of this experiment, we fix \(n_{c} = 6\) (a six by six chessboard) and vary \(t_{0}\) between 0.2 and 0.8.
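A minimal sketch of how such chessboard data could be generated is shown below. It assumes that t is drawn uniformly from \([0, 1]\) and that squares with an even row-plus-column index are the ones occupied before \(t_{0}\); the text does not fix these choices, so both the convention and the function name `chessboard_sample` are illustrative assumptions.

```python
import math
import random


def chessboard_sample(n, n_c, t0, rng):
    """Draw n points (x1, x2, t); the occupied squares flip colour at t0."""
    rows = []
    while len(rows) < n:
        x1, x2, t = rng.random(), rng.random(), rng.random()
        # Square colour = parity of the (column + row) index of the square.
        parity = (math.floor(n_c * x1) + math.floor(n_c * x2)) % 2
        wanted = 0 if t <= t0 else 1  # assumed convention: even squares before t0
        if parity == wanted:  # rejection sampling keeps points on the right colour
            rows.append((x1, x2, t))
    return rows


# First part of the experiment: t0 = 0.5, N = 8K, here with a six by six board.
data = chessboard_sample(n=8000, n_c=6, t0=0.5, rng=random.Random(0))
```

Because each feature is uniform on \([0, 1]\) both before and after the change, the marginal means and variances barely move; the change is visible only in the joint pattern of \(x_{1}\) and \(x_{2}\).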
We run our method and the comparison algorithms six times on random data splits. For the optimal segmentation methods, we randomly sample 70% of data in each trial. Due to computational limitations, we only sample 18.8% of data (around 1.5K) for Bayesian change point detection.
The results are shown in Table 1. In the tables and figures, μ and σ are the estimated mean and standard error of parameters, respectively. For our method, α represents the fraction of data changed. We see that for small \(n_{c}\), optimal segmentation methods perform as well as ours, but for \(n_{c} \ge 6\), our method outperforms the comparison methods. Of the two classifiers used by our method, random forest performs better.
4.1.2 Synthetic images
Our method can also identify changes in diverse high-dimensional data, such as text [18] and images. To illustrate this we generate a series of synthetic 64 by 64 pixel gray scale images that qualitatively change at \(t_{0}=0.5\) from solid to hollow circles (Fig. 2). These images can represent, for example, organisms that were originally alive and then died; thus our task would be to determine the moment an organism died, a finding that is very useful in the field of survival analysis [37]. The gray scale of the solid and hollow circles is \(\gamma = 0.8\) and the gray scale of the background is \(\gamma = 0.2\). To create more realistic data, we position the circles randomly within the image and inject different levels of Gaussian noise to model poor quality data. After adding noise, pixel gray scale values are truncated to the range \([0.0, 1.0]\). We also assign each image a random time t uniformly distributed between 0 and 1. For every noise level, we generated a dataset of 4,000 images.
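The image generation described above could look roughly like the following sketch. The circle radius, ring thickness, and the rule that keeps the circle inside the frame are not given in the text and are assumptions here, as is the function name `circle_image`; only a handful of images are generated to keep the example fast.

```python
import random


def circle_image(t, t0, rng, size=64, radius=10, ring=3, noise_sd=0.1):
    """Return a size x size gray scale image (list of rows) for an image with time t."""
    # Random centre, kept far enough from the border that the circle fits.
    cx = rng.uniform(radius, size - 1 - radius)
    cy = rng.uniform(radius, size - 1 - radius)
    hollow = t > t0  # solid circle before the change, hollow circle after
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            d = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            inside = d <= radius and (not hollow or d >= radius - ring)
            value = (0.8 if inside else 0.2) + rng.gauss(0.0, noise_sd)
            row.append(min(1.0, max(0.0, value)))  # truncate to [0, 1]
        image.append(row)
    return image


# A few (t, image) pairs with t uniform on [0, 1] and the change at t0 = 0.5.
rng = random.Random(0)
times = [rng.random() for _ in range(8)]
images = [(t, circle_image(t, 0.5, rng)) for t in times]
```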
We check the robustness of the estimated change point against noise. Table 2 shows the inferred change point and estimated value of α as a function of noise for the synthetic image data. Due to the spatial correlation of image data and the superior predictive power of the CNN classifier, the inferred change point is close to (and often not statistically significantly different from) the true change point, and α is close to 1.0, even for very noisy image frames. Alternative methods were infeasible because of the high dimensionality and large size of the data.
4.2 Discovering changes in realworld data
We now demonstrate the ability of MtChD to identify changes in realworld data.
4.2.1 COVID-19 air quality
We first apply our method to air pollution data to see if pollution drops around the onset of the COVID-19 pandemic. We collected daily air quality data from January 1 to May 26, 2020 for major U.S. cities from AQICN (aqicn.org). This data includes daily concentrations of nitrogen dioxide, carbon monoxide, and fine particulates less than 2.5 microns across (PM2.5), totalling 4.3K observations for 37 cities across the U.S. once missing data are removed. We also include the population within 50 km of the city as a feature because people within this area may have contributed to the concentration of pollutants. We can use our model to determine when the change started, and compare these results to the gold standard: the dates on which stay-at-home orders were issued by states. These orders limited business and commercial activity, which likely led to the dramatic decline in pollution, and therefore act as the ground truth external events for RDDs. The earliest such order was announced in California on March 19, 2020 and the latest in South Carolina on April 7.
We compare our method to state-of-the-art algorithms in Table 3. Our method is the only one that inferred a reasonable change point for the data, March 21, 2020 ± 3 days, roughly in the middle of all state stay-at-home orders. We show the accuracy deviation for MtChD in Fig. 3. A random forest classifier gives better accuracy than the MLP, and the mathematical model fits the accuracy deviation well. Although our method can work with any classifier, the performance on a given dataset can be improved by choosing a classifier that best fits the data. Two empirical ways to determine which classifier to use are (a) choosing the classifier that gives the largest accuracy deviation or (b) choosing the classifier that gives the highest α.
4.2.2 Khan academy
As a second example, we apply our method to the learning platform Khan Academy (khanacademy.org), which offers courses on a variety of subjects where students watch videos and test their knowledge by answering questions. The Khan Academy platform underwent substantial changes to its user interface around April 1, 2013 (or \(1.3648\times 10^{9}\) in Unix epoch time) [38], which affected user performance. This change acts as a ground truth event we want to detect. After discovering this event, we can fit regressions of scores before and after the event and determine via an RDD whether this policy significantly changed student performance scores.
Data was collected by Khan Academy over the period from June 2012 to February 2014 and contains 16K questions answered by 13K students totalling 681K data points. Despite the large number of students, the data is very sparse: the vast majority of students were typically active for less than 20 minutes and never returned to the site. The performance data records whether the student solved the problem correctly on their first attempt and without a hint. When the user failed, they were able to attempt the problem again, and the number of attempts made on a problem is recorded. Additional features recorded include the time since the previous problem, the number of problems in a student session, and the number of sessions. Segmentation methods implemented in ruptures are not memory efficient, therefore we only sample 0.5% of the data (about 3.5K entries) uniformly at random. For Bayesian change point detection, we sampled around 1.6K data points uniformly at random.
Both our method and optimal segmentation algorithms can identify the change from user performance data (Table 3), although optimal segmentation algorithms have larger error. Bayesian change point detection does not give a reasonable change point for this data. The accuracy deviation curve is shown in Fig. 4. The random forest classifier and MLP classifier have comparable performance when used to estimate change points.
4.3 Measuring effects of changes via regression discontinuity design
We demonstrate how we can use regression discontinuity design to measure the effects of changes on the population. Automatically discovered changes can therefore help uncover potential natural experiments in data.
4.3.1 Persistence and performance in learning on Khan academy
Our analysis uncovered an abrupt change around April 2013 in the Khan Academy data (Sect. 4.2.2). The change only affected user performance in a fraction of all sessions, quantified by parameter α in Table 3. This change was likely due to a major redesign of the platform’s user interface [38], although we do not know exactly what changed. We found no indication that the population was any different before and after the change. Therefore, the April 2013 change could be used for an RDD, with some users “assigned” quasi-randomly to visit the platform before the interface change and some after. This created an effective control condition (before the change) and treatment condition (after the change). The external event allows us to control for some of the confounders when investigating correlates of performance in learning platforms. Specifically, comparing the treated group to the control group helps identify the link between persistence (working longer on problems first answered incorrectly) and performance (answering the problem correctly on the first attempt).
Figure 5(a) shows average performance over time, measured as the fraction of problems the user solved correctly on their first attempt. Performance decreases gradually for all users over the two-year period (blue line), despite seasonal variation. However, for users working on problems that take more than 100 seconds to answer, i.e., hard problems, performance increases after the change (orange line). To estimate the effect of the change, we binned the data and fit the outcomes before and after the change as functions of time using two kernel models (see Appendix A.2 for details). The effect is strongest in users who solve hard problems correctly on their first attempt (Fig. 5(b)). At the same time, users became more persistent, i.e., more likely to continue working on a problem they did not solve correctly on the first attempt (Fig. 6(a)). The effect is bigger for users working on hard problems (Fig. 6(b)). Thus, the change had two effects: it made users working on hard (to them) problems more persistent, and this improved their performance on other hard problems, i.e., made them more likely to correctly solve these problems on the first try. Improvement in performance for these users was large, ∼10%, which corresponds to a full letter grade in a class setting. Psychological studies have identified traits, such as conscientiousness or grit, that allow some people to practice a skill so as to achieve mastery [39]. Our study supports the link between persistence and improved performance.
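The idea of fitting the outcome separately on each side of the change and comparing the two fits at the change point can be sketched as follows. Note that this sketch swaps in ordinary least squares lines for the kernel models used in the analysis above, and the function names, bandwidth, and toy outcome values are assumptions for illustration only.

```python
import random


def fit_line(ts, ys):
    """Ordinary least squares fit y = a + b * t; returns (a, b)."""
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / var_t
    return mean_y - b * mean_t, b


def rdd_jump(ts, ys, cutoff, bandwidth):
    """Difference between the right and left fitted outcomes at the cutoff."""
    left = [(t, y) for t, y in zip(ts, ys) if cutoff - bandwidth <= t <= cutoff]
    right = [(t, y) for t, y in zip(ts, ys) if cutoff < t <= cutoff + bandwidth]
    a_left, b_left = fit_line(*zip(*left))
    a_right, b_right = fit_line(*zip(*right))
    return (a_right + b_right * cutoff) - (a_left + b_left * cutoff)


# Toy outcome (assumed): slow downward trend with a jump of 0.1 after t = 0.5.
rng = random.Random(0)
ts = [i / 200 for i in range(200)]
ys = [0.6 - 0.1 * t + (0.1 if t > 0.5 else 0.0) + rng.gauss(0.0, 0.02) for t in ts]
estimated_jump = rdd_jump(ts, ys, cutoff=0.5, bandwidth=0.3)
```

The bandwidth controls how much data on each side of the cutoff enters the fit; a narrower window reduces bias from long-run trends at the cost of a noisier estimate.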
4.3.2 COVID-19 lockdowns reduced air pollution
We detect a change on Mar. 21, 2020 in the COVID-19 Air Quality data (Sect. 4.2.1). The change is consistent with the dates of the COVID-19 lockdown orders in the US, in which people had to stay at home to reduce the spread of the disease. We calculated the change in nitrogen dioxide levels before and after Mar. 21, 2020 as shown in Fig. 7. For both Manhattan and San Francisco, nitrogen dioxide levels drop significantly (by around 5 ppb) after the lockdown. The reduction in air pollution is due to reduced traffic after the lockdown. Our findings of the date and effect of the change are confirmed by Venter et al. [40].
5 Discussion
We introduce Meta Change Point Detection (MtChD), a novel method to detect changes in high dimensional data. The method identifies changes in a wide range of data, from tabular to images. Moreover, it gives us the fraction of data changed, which we find can act as a confidence metric. Our comprehensive experiments validated the method on synthetic and real-world data that are difficult for other methods, and showed that it can robustly identify changes in sparse and noisy data. We also demonstrate that our method has low bias and higher accuracy than competing state-of-the-art methods, and efficiently handles large datasets.
MtChD can be used in tandem with regression discontinuity designs to discover effects of policies within observational data. By accurately estimating when a change occurs, we can uncover plausible exogenous events that produce these changes, and then use RDDs to determine average treatment effects of the event, thereby discovering natural experiments in data. Importantly, RDDs assume unconfoundedness: the treatment (i.e., change) is unaffected by the outcome variable. Therefore, RDDs on the change points themselves would not be methodologically sound. Instead, the method offers candidate events and additional research would then reveal what is an appropriate exogenous event and what features are confounded by this change. Therefore, our method substantially reduces research time needed to detect natural experiments.
We illustrate this idea by discovering important events in empirical data. Namely, by applying the change point detection, we identify a change in user performance on Khan Academy. We discover that for long problems, users are both more likely to be persistent and perform a full letter grade better. This finding is consistent with the notion that persistent people perform better [39]. It appears that simply encouraging users to keep working on problems they find challenging (i.e., problems they failed to solve on the first attempt) could make these users more successful later on. Our findings therefore hint that user interface design choices might make people more persistent.
Our method helps researchers automatically detect natural experiments otherwise hidden in high-dimensional empirical data [41]. Determining which dimensions produce causal effects is an ongoing problem, especially when the change may be heterogeneous across conditions, as in the case of Khan Academy [42].
Availability of data and materials
All code and synthetic data (including code to generate synthetic data) are available at https://github.com/yuziheusc/confusion_multi_change. The COVID-19 Air dataset is available at https://aqicn.org/dataplatform/covid19/. The US census dataset is available at https://data.census.gov/cedsci/. The processed and cleaned version of the data is available from the corresponding author upon request. The Khan Academy dataset is available from Khan Academy, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Khan Academy.
Abbreviations
RDD: Regression Discontinuity Design
MtChD: Meta Change Point Detection
COVID-19: Coronavirus Disease 2019
CUSUM: Cumulative Sum
GLR: General Likelihood Ratio
HMM: Hidden Markov Model
CNN: Convolutional Neural Network
DP: Dynamic Programming
BinSeg: Binary Segmentation
RBF: Radial Basis Function
RF: Random Forest
MLP: Multi-Layer Perceptron
PM2.5: Fine inhalable particles, with diameters that are generally 2.5 micrometers and smaller
References
1. Lazer D, Pentland A, Adamic L, Aral S, Barabasi AL, Brewer D, Christakis N, Contractor N, Fowler J, Gutmann M et al. (2009) Social science. Computational social science. Science 323:721–723
2. Pearl J (2009) Causal inference in statistics: an overview. Stat Surv 3:96–146
3. Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proc Natl Acad Sci 113(27):7353–7360. https://www.pnas.org/content/113/27/7353
4. Künzel SR, Sekhon JS, Bickel PJ, Yu B (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci 116(10):4156–4165
5. Bryan CJ, Tipton E, Yeager DS (2021) Behavioural science is unlikely to change the world without a heterogeneity revolution. Nat Hum Behav 5(8):980–989
6. Angrist JD (1990) Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records. Am Econ Rev 80(3):313–336. http://www.jstor.org/stable/2006669
7. Serdula MK, Brewer RD, Gillespie C, Denny CH, Mokdad A (2004) Trends in alcohol use and binge drinking, 1985–1999: results of a multistate survey. Am J Prev Med 26(4):294–298. http://www.sciencedirect.com/science/article/pii/S0749379703003933
8. Card D, Krueger AB (1993) Minimum wages and employment: a case study of the fast food industry in New Jersey and Pennsylvania. NBER Working Paper No. 4509
9. Oktay H, Taylor BJ, Jensen DD (2010) Causal discovery in social media using quasi-experimental designs. In: Proceedings of the first workshop on social media analytics, ser. SOMA’10. Association for Computing Machinery, New York, pp 1–9. https://doi.org/10.1145/1964858.1964859
10. Varian HR (2016) Causal inference in economics and marketing. Proc Natl Acad Sci 113(27):7310–7315. https://www.pnas.org/content/113/27/7310
11. Bor J, Moscoe E, Mutevedzi P, Newell ML, Bärnighausen T (2014) Regression discontinuity designs in epidemiology: causal inference without randomized trials. Epidemiology 5:729–737
12. Page ES (1954) Continuous inspection schemes. Biometrika 41(1–2):100–115. https://doi.org/10.1093/biomet/41.1-2.100
13. Page ES (1957) On problems in which a change in a parameter occurs at an unknown point. Biometrika 44(1–2):248–252. https://doi.org/10.1093/biomet/44.1-2.248
14. Baum LE, Petrie T (1966) Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat 37(6):1554–1563. https://doi.org/10.1214/aoms/1177699147
15. Raghavan V, Galstyan A, Tartakovsky AG (2013) Hidden Markov models for the activity profile of terrorist groups. Ann Appl Stat 2402–2430
16. Wilson RC, Nassar MR, Gold JI (2010) Bayesian online learning of the hazard rate in changepoint problems. Neural Comput 22(9):2452–2476
17. Van Nieuwenburg EP, Liu YH, Huber SD (2017) Learning phase transitions by confusion. Nat Phys 13(5):435–439
18. He Y, Rao A, Burghardt K, Lerman K (2021) Identifying shifts in collective attention to topics on social media. In: International conference on social computing, behavioral-cultural modeling and prediction and behavior representation in modeling and simulation. Springer, Berlin, pp 224–234
19. Siegmund D, Venkatraman E (1995) Using the generalized likelihood ratio statistic for sequential detection of a changepoint. Ann Stat 255–271
20. Willsky A, Jones H (1976) A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans Autom Control 21(1):108–112
21. Barber J (2015) A generalized likelihood ratio test for coherent change detection in polarimetric SAR. IEEE Geosci Remote Sens Lett 12(9):1873–1877
22. Willsky AS, Jones HL (1974) A generalized likelihood ratio approach to state estimation in linear systems subjects to abrupt changes. In: 1974 IEEE conference on decision and control including the 13th symposium on adaptive processes. IEEE, pp 846–853
23. Truong C, Oudre L, Vayatis N (2020) Selective review of offline change point detection methods. Signal Process 167:107299. http://www.sciencedirect.com/science/article/pii/S0165168419303494
24. Rigaill G (2015) A pruned dynamic programming algorithm to recover the best segmentations with 1 to k_max changepoints. J Soc Fr Stat 156(4):180–205
25. Fryzlewicz P et al. (2014) Wild binary segmentation for multiple changepoint detection. Ann Stat 42(6):2243–2281
26. Keogh E, Chu S, Hart D, Pazzani M (2001) An online algorithm for segmenting time series. In: Proceedings 2001 IEEE international conference on data mining. IEEE, pp 289–296
27. Killick R, Fearnhead P, Eckley IA (2012) Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc 107(500):1590–1598
28. Adams RP, MacKay DJ (2007) Bayesian online changepoint detection. Preprint arXiv:0710.3742
29. Niekum S, Osentoski S, Atkeson CG, Barto AG (2015) Online Bayesian changepoint detection for articulated motion models. In: 2015 IEEE international conference on robotics and automation (ICRA), pp 1468–1475
30. Xuan X, Murphy K (2007) Modeling changing dependency structure in multivariate time series. In: Proceedings of the 24th international conference on machine learning, pp 1055–1062
31. Bardet JM, Kengne WC, Wintenberger O (2010) Detecting multiple changepoints in general causal time series using penalized quasi-likelihood. Preprint arXiv:1008.0054
32. Arlot S, Celisse A, Harchaoui Z (2019) A kernel multiple changepoint algorithm via model selection. J Mach Learn Res 20(162):1–56
33. Hido S, Idé T, Kashima H, Kubo H, Matsuzawa H (2008) Unsupervised change analysis using supervised learning. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 148–159
34. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al. (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
35. Lee DS, Lemieux T (2010) Regression discontinuity designs in economics. J Econ Lit 48(2):281–355. https://www.aeaweb.org/articles?id=10.1257/jel.48.2.281
36. Almond D (2006) Is the 1918 influenza pandemic over? Long-term effects of in utero influenza exposure in the post-1940 US population. J Polit Econ 114(4):672–712
37. Stroustrup N, Ulmschneider BE, Nash ZM, López-Moyado IF, Apfeld J, Fontana W (2013) The Caenorhabditis elegans lifespan machine. Nat Methods 10:665–670
38. Chan M, O’Connor T, Peat S (2016) Using Khan Academy in community college developmental math courses. New England Board of Higher Education, Tech. Rep. s3.amazonaws.com/KAshare/impact/Results_and_Lessons_from_DMDP_Sept_2016.pdf
39. Duckworth AL, Peterson C, Matthews MD, Kelly DR (2007) Grit: perseverance and passion for long-term goals. J Pers Soc Psychol 92(6):1087
40. Venter ZS, Aunan K, Chowdhury S, Lelieveld J (2020) COVID-19 lockdowns cause global air pollution declines. Proc Natl Acad Sci 117(32):18984–18990
41. Herlands W, McFowland E III, Wilson AG, Neill DB (2018) Automated local regression discontinuity design discovery. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1512–1520
42. Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 113(523):1228–1242
Acknowledgements
The authors are grateful to Tad Hogg for helping explain the effects of the Khan Academy natural experiment.
Funding
This work was supported in part by DARPA under contracts HR00111990114 and HR001121C0168.
Author information
Contributions
KB, YH, and KL conceptualized the study. YH created software used in analysis. YH, KB, and KL analyzed results and wrote the paper. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Appendix
1.1 Details of confusion-based training
The MLP has four hidden layers of 64 neurons each, uses the ReLU activation function, and is trained for at most 100 epochs. The random forest classifier uses 100 decision trees with a maximum depth of 32 and entropy as the splitting criterion. To detect changes in video data, we use a convolutional neural network (CNN) with six convolutional layers. The kernel dimensions of each layer are 3 by 3, and the numbers of filters in the six layers are 32, 32, 64, 64, 128, and 128. Max pooling and dropout are applied after the second, fourth, and sixth convolutional layers. The max pooling kernel size and stride are both two, and the dropout ratio is 0.20. The output of the convolutional layers is sent into a fully connected neural network with one hidden layer of 64 neurons. This network also uses the ReLU activation function, and the model is trained for 30 epochs.
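To make the CNN description concrete, the following sketch traces feature-map sizes and parameter counts through the six convolutional layers and the 64-neuron hidden layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `trace_cnn`, the 64 by 64 RGB input, and the use of stride-1 "same" padding (so that only pooling changes the spatial size) are all hypothetical, since the text does not specify them.

```python
def trace_cnn(height, width, in_channels=3):
    """Trace shapes and parameter counts for the CNN described in the text.

    Assumptions (not stated in the paper): stride-1 convolutions with
    'same' padding, and an input with `in_channels` channels.
    """
    filters = [32, 32, 64, 64, 128, 128]  # filters per conv layer (from the text)
    kernel = 3                            # 3 by 3 kernels (from the text)
    layers = []
    channels = in_channels
    for i, out_channels in enumerate(filters, start=1):
        # Weights plus one bias per output filter.
        params = (kernel * kernel * channels + 1) * out_channels
        channels = out_channels
        if i % 2 == 0:
            # Max pooling (size 2, stride 2) after layers 2, 4 and 6
            # halves each spatial dimension; dropout adds no parameters.
            height //= 2
            width //= 2
        layers.append((i, height, width, channels, params))
    flat = height * width * channels      # size of the flattened conv output
    dense_params = (flat + 1) * 64        # hidden layer with 64 neurons
    return layers, flat, dense_params


if __name__ == "__main__":
    layers, flat, dense_params = trace_cnn(64, 64)  # hypothetical frame size
    for i, h, w, c, p in layers:
        print(f"conv{i}: output {h}x{w}x{c}, parameters {p}")
    print(f"flattened size {flat}, hidden-layer parameters {dense_params}")
```

Under these assumptions most of the parameters sit in the fully connected hidden layer, which is why the three pooling steps matter for keeping the model small.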
1.2 Kernel regression for effect estimation
We used kernel regressions with RBF kernels to model the average outcomes (persistence rate and correct rate) as functions of time. Namely,
The variable t is first standardized. We use \(\gamma = 1\). To accelerate the calculation, we apply a cutoff of 0.05 to the kernel weights \(k(t, t_{j})\). For the binned data in Fig. 5 and Fig. 6, we perform a separate kernel regression for each bin.
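The procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the common Nadaraya-Watson form in which the estimate at time t is the average of the outcomes y_j weighted by k(t, t_j) = exp(-γ (t - t_j)^2), and assuming that the cutoff means weights below 0.05 are dropped from the sums. Both are assumptions about details not spelled out here, and the function names are hypothetical.

```python
import math


def standardize(ts):
    """Rescale times to zero mean and unit variance."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in ts) / n)
    return [(t - mean) / std for t in ts]


def rbf_weight(t, t_j, gamma=1.0):
    """RBF kernel weight k(t, t_j) = exp(-gamma * (t - t_j)^2)."""
    return math.exp(-gamma * (t - t_j) ** 2)


def kernel_regression(t_query, ts, ys, gamma=1.0, cutoff=0.05):
    """Kernel-weighted average of ys at t_query.

    Observations whose weight falls below `cutoff` are skipped
    (assumed interpretation of the cutoff). Returns NaN when no
    observation is close enough to contribute.
    """
    num = 0.0
    den = 0.0
    for t_j, y_j in zip(ts, ys):
        w = rbf_weight(t_query, t_j, gamma)
        if w < cutoff:
            continue
        num += w * y_j
        den += w
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    # Toy outcome series (made-up values, for illustration only).
    days = [0, 1, 2, 3, 4, 5, 6, 7]
    rates = [0.40, 0.42, 0.41, 0.43, 0.55, 0.56, 0.54, 0.57]
    ts = standardize(days)
    smoothed = [kernel_regression(t, ts, rates) for t in ts]
    print([round(v, 3) for v in smoothed])
```

For binned data, the same function would simply be called once per bin on that bin's observations.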
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
He, Y., Burghardt, K.A. & Lerman, K. Leveraging change point detection to discover natural experiments in data. EPJ Data Sci. 11, 49 (2022). https://doi.org/10.1140/epjds/s13688-022-00361-7
DOI: https://doi.org/10.1140/epjds/s13688-022-00361-7
Keywords
 Change point detection
 High-dimensional data
 Regression discontinuity design
 Causal effect