2.1 Data
Twitter is a popular micro-blogging service that allows users to share thoughts and news with a global community via short messages (up to 140 characters or, from November 2017 onward, up to 280 characters in length). We purchased access to Twitter’s “decahose” streaming API and used it to collect a random 10% sample of all public tweets authored between September 9, 2008 and April 4, 2018 [16]. We then parsed these tweets to count appearances of words included in the LabMT dataset, a set of roughly 10,000 of the most commonly used words in English [15]. The dataset has been used to construct nonparametric sentiment analysis models [17] and to forecast mental illness [18], among other applications [19–21]. From these counts, we analyze the time series of word popularity as measured by rank of word usage: on day t, the most-used word is assigned rank 1, the second-most-used word rank 2, and so on, creating a time series of word rank \(r_{t}\) for each word.
2.2 Theory
2.2.1 Algorithmic details: description of the method
There are multiple fundamentally-deterministic mechanistic models for local dynamics of sociotechnical time series. Nonstationary local dynamics are generally well-described by exponential, bi-exponential, or power-law decay functions; mechanistic models thus usually generate one of these few functional forms. For example, Wu and Huberman described a stretched-exponential model for collective human attention [22], and Candia et al. derived a biexponential function for collective human memory on longer timescales [23]. Crane and Sornette assembled a Hawkes process for video views that produces power-law behavior by using power-law excitement kernels [24], and Lorenz-Spreen et al. demonstrated a speeding-up dynamic in collective social attention mechanisms [25], while De Domenico and Altmann put forward a stochastic model incorporating social heterogeneity and influence [26], and Ierley and Kostinski introduced a rank-based, signal-extraction method with applications to meteorology data [27]. In Sect. 2.2.2 we conduct a literature review, contrasting our methods with existing anomaly detection and similarity search time series data mining algorithms and demonstrating that the discrete shocklet transform (DST) and associated Shocklet Transform And Ranking (STAR) algorithm differ substantially from these existing algorithms. We have open-sourced implementations of the DST and STAR algorithm; code for these implementations is available at a publicly-accessible repository.
We do not assume any specific model in our work. Instead, by default we define a kernel \(\mathcal{K}^{(\cdot )}\) as one of a few basic functional forms: exponential growth,
$$ \mathcal{K}^{(S)}(\tau |W,\theta ) \sim \mathrm{rect}(\tau - \tau _{0})e^{\theta (\tau - \tau _{0})}; $$
(1)
monomial growth,
$$ \mathcal{K}^{(S)}(\tau |W,\theta ) \sim \mathrm{rect}( \tau - \tau _{0})\tau ^{\theta }; $$
(2)
power-law decay,
$$ \mathcal{K}^{(S)}(\tau |W,\theta ) \sim \mathrm{rect}( \tau - \tau _{0}) \vert \tau - \tau _{0} + \varepsilon \vert ^{-\theta }, $$
(3)
or sudden level change (corresponding with a changepoint detection problem),
$$ \mathcal{K}^{(Sp)}(\tau |W,\theta ) \sim \mathrm{rect}( \tau - \tau _{0})\bigl[\varTheta (\tau ) - \varTheta (-\tau )\bigr], $$
(4)
where \(\varTheta (\cdot )\) is the Heaviside step function. The function rect is the rectangular function (\(\mathrm{rect}(x)=1\) for \(0< x< W/2\) and \(\mathrm{rect}(x) = 0\) otherwise), while in the case of the power-law kernel we add a constant ε to ensure nonsingularity. The parameter W controls the support of \(\mathcal{K}^{(\cdot )}(\tau |W,\theta )\); the kernel is identically zero outside of the interval \([\tau _{0} - W/2, \tau _{0} + W/2]\). We define the window parameter W as follows: moving from a window size of W to a window size of \(W + \Delta W\) is equivalent to upsampling the kernel signal by the factor \(W + \Delta W\), applying an ideal lowpass filter, and downsampling by the factor W. In other words, if the kernel function \(\mathcal{K}^{(\cdot )}\) is defined for each of W linearly spaced points between \(-N/2\) and \(N/2\), moving from a window size of W to \(W + \Delta W\) is equivalent to computing \(\mathcal{K} ^{(\cdot )}\) for each of \(W + \Delta W\) linearly-spaced points between \(-N/2\) and \(N/2\). This holds the dynamic range of the kernel constant while accounting for the dynamics described by the kernel at all timescales of interest. We enforce the condition that \(\sum_{t=- \infty }^{\infty } \mathcal{K}^{(\cdot )}(t| W,\theta ) = 0\) for any window size W.
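To make the window-resampling definition concrete, a minimal Python sketch follows (assuming NumPy; the function name is illustrative and this sketch is not drawn from our released implementation, which may differ in detail). It samples the monomial-growth kernel of Eq. (2) on W linearly spaced points over a fixed support, so larger W adds samples without changing the kernel's dynamic range, and it enforces the zero-sum condition:

```python
import numpy as np

def monomial_kernel(W, theta):
    """Monomial-growth shock kernel (Eq. (2)) sampled at W points.

    Changing W re-evaluates the same functional form on W linearly
    spaced points over a fixed support, holding the kernel's dynamic
    range constant across window sizes.
    """
    tau = np.linspace(-1.0, 1.0, W)
    # rect truncation: the growth piece is nonzero only for tau > 0
    k = np.where(tau > 0.0, tau, 0.0) ** theta
    k -= k.mean()  # enforce the zero-sum condition sum_t K(t|W, theta) = 0
    return k
```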
It is decidedly not our intent to delve into the question of how and why deterministic underlying dynamics in sociotechnical systems arise. However, we will provide a brief justification for the functional forms of the kernels presented in the last paragraph as scaling solutions to a variety of parsimonious models of local deterministic dynamics:
If the time series \(x(t)\) exhibits exponential growth with a state-dependent growth damper \(D(x)\), the dynamics can be described by
$$ \frac{{d}x(t)}{{d}t} = \frac{\lambda }{D(x(t))}x(t),\qquad x(0) = x _{0}. $$
(5)
If \(D(x) = x^{1/n}\), the solution to this IVP scales as \(x(t) \sim t ^{n}\), which is the functional form given in Eq. (2). When \(D(x) \propto 1\) (i.e., there is no damper on growth) then the solution is an exponential function, the functional form of Eq. (1).
If instead the underlying dynamics correspond to exponential decay with a time- and state-dependent half-life \(\mathcal{T}\), we can model the dynamics by the system
$$\begin{aligned} &\frac{{d}x(t)}{{d}t} = -\frac{x(t)}{\mathcal{T}(t)},\qquad x(0) = x_{0}, \end{aligned}$$
(6)
$$\begin{aligned} &\frac{{d}\mathcal{T}(t)}{{d}t} = f\bigl(\mathcal{T}(t), x(t)\bigr),\qquad \mathcal{T}(0) = \mathcal{T}_{0}. \end{aligned}$$
(7)
If f is particularly simple and given by \(f(\mathcal{T}, x) = c\) with \(c > 0\), then the solution to Eq. (6) scales as \(x(t) \sim t^{-1/c}\), the functional form of Eq. (3). The limit \(c \rightarrow 0^{+}\) is singular and results in exponential-decay dynamics, given by reversing time in Eq. (1) (on which we expand later in this section).
As another example, the dynamics could be essentially static except when a latent variable φ changes state or moves past a threshold of some sort:
$$\begin{aligned} &\frac{{d}x(t)}{{d}t} = \delta \bigl( \varphi (t) - \varphi ^{*} \bigr),\qquad x(0) = x_{0}, \end{aligned}$$
(8)
$$\begin{aligned} &\frac{{d}\varphi (t)}{{d}t} = g\bigl(\varphi (t), x(t)\bigr),\qquad \varphi (0) = \varphi _{0}. \end{aligned}$$
(9)
In this case the dynamics are given by a step function from \(x_{0}\) to \(x_{0} + 1\) the first time \(\varphi (t)\) changes position relative to \(\varphi ^{*}\), and so on; these are the dynamics we present in Eq. (4).
This list is not exhaustive, nor do we intend it to be; as a sanity check, the first of these scaling arguments can be verified numerically, as sketched below.
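A minimal sketch using SciPy (all parameter values are illustrative) integrates Eq. (5) with \(D(x) = x^{1/n}\) and recovers the claimed late-time scaling \(x(t) \sim t^{n}\):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Eq. (5) with damper D(x) = x**(1/n): dx/dt = lam * x / D(x)
n, lam = 3.0, 1.0
rhs = lambda t, x: lam * x / x ** (1.0 / n)

sol = solve_ivp(rhs, t_span=(1.0, 1e4), y0=[1.0], rtol=1e-8,
                dense_output=True)

# Estimate the late-time scaling exponent from a log-log fit;
# it should approach n as t grows large
t = np.logspace(2, 4, 50)
slope = np.polyfit(np.log(t), np.log(sol.sol(t)[0]), 1)[0]
print(f"fitted exponent {slope:.2f}; expected n = {n}")
```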
We can use kernel functions \(\mathcal{K}^{(\cdot )}\) as basic building blocks of richer local mechanistic dynamics through function concatenation and the operation of the two-dimensional reflection group \(R_{4}\). Elements of this group correspond to \(r_{0} = \mathrm{id}\), \(r_{1} = \) reflection across the vertical axis (time reversal), \(r_{2} = \) negation (e.g., from an increase in usage frequency to a decrease in usage frequency), and \(r_{3} = r_{1} \cdot r_{2} = r_{2} \cdot r_{1}\). We can also model new dynamics by concatenating kernels, i.e., “gluing” kernels back-to-back. For example, we can generate “cusplets” with both anticipatory and relaxation dynamics by concatenating a shocklet \(\mathcal{K}^{(S)}\) with a time-reversed copy of itself:
$$ \mathcal{K}^{(C)}(\tau |W,\theta ) \sim \mathcal{K}^{(S)}(\tau |W, \theta ) \oplus \bigl[r_{1} \cdot \mathcal{K}^{(S)}(\tau |W,\theta )\bigr]. $$
(10)
We display an example of this concatenation operation in Fig. 2. For much of the remainder of the work, we conduct analysis using this symmetric kernel.
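A minimal sketch of this construction (NumPy; illustrative only) builds the symmetric cusplet by gluing a one-sided growth piece to its \(r_{1}\) image:

```python
import numpy as np

def cusplet_kernel(W, theta):
    """Symmetric cusplet of Eq. (10): a growth shock kernel glued
    back-to-back with its time-reversed copy (the r1 group element)."""
    ramp = np.linspace(0.0, 1.0, W // 2) ** theta  # one-sided shock piece
    cusp = np.concatenate([ramp, ramp[::-1]])      # K followed by r1 . K
    cusp -= cusp.mean()                            # re-enforce zero-sum
    return cusp

# The r2 element (negation) converts an increase-detecting kernel into a
# decrease-detecting one: -cusplet_kernel(W, theta)
```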
The discrete shocklet transform (DST) of the time series \(x(t)\) is defined by
$$ \mathrm{C}_{\mathcal{K}^{(S)}}(t, W|\theta ) = \sum _{\tau =-\infty }^{\infty } x(\tau + t)\mathcal{K}^{(S)}(\tau |W, \theta ), $$
(11)
which is the cross-correlation of the sequence and the kernel. This defines a \(T \times N_{W}\) matrix containing an entry for each point in time t and window width W considered.
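In code, Eq. (11) is a bank of cross-correlations, one per window size. A direct, unoptimized sketch follows (NumPy; our released implementation may differ in boundary handling and speed):

```python
import numpy as np

def discrete_shocklet_transform(x, kernel_fn, theta, windows):
    """Eq. (11): cross-correlate x with the kernel at each window size W.

    Returns a (len(x), len(windows)) matrix C[t, j], one column per W.
    """
    T = len(x)
    C = np.empty((T, len(windows)))
    for j, W in enumerate(windows):
        k = kernel_fn(W, theta)
        # 'same' mode keeps the output aligned with x in time
        C[:, j] = np.correlate(x, k, mode="same")
    return C
```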
To convey a visual sense of what the DST looks like when using a shock-like, asymmetric kernel, we compute the DST of a random walk \(x_{t} - x_{t-1} = z_{t}\), with \(z_{t} \sim \mathcal{N}(0,1)\), using a kernel function \(\mathcal{K}^{(S)}(\tau |W, \theta ) \sim \mathrm{rect}(\tau )\tau ^{\theta }\) with \(\theta = 3\), and display the resulting matrix for window sizes \(W \in [10, 250]\) in Fig. 4.
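A usage sketch mirroring this computation, built on the illustrative functions above:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.cumsum(rng.standard_normal(2000))  # random walk: x_t - x_{t-1} = z_t

windows = range(10, 251, 10)
C = discrete_shocklet_transform(x, monomial_kernel, theta=3, windows=windows)
# C[t, j] is large where x locally resembles the theta = 3 growth
# kernel at timescale W_j
```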
The effects of time reversal by the action of \(r_{1}\) are visible when comparing the first and third panels with the second and fourth panels, and the result of negating the kernel by acting on it with \(r_{2}\) is apparent in the negation of the matrix values when comparing the first panel with the second and the third with the fourth. We used a random walk as the example time series for this figure because, by definition, no underlying generative mechanism causes any shock-like dynamics; such dynamics appear only as a result of integrated noise. Because of this, large upward-pointing shocks are as likely as large downward-pointing shocks, which allows us to see the activation of both upward-pointing and downward-pointing kernel functions.
As a comparison with this null example, we computed the DST of a sociotechnical time series—the rank of the word “bling” among the LabMT words on Twitter—and of two draws from a null random walk model, and display the results in Fig. 5. Here, we calculated the DST using the symmetric kernel given in Eq. (10). (For more statistical details of the null model, see Appendix 1.) We also computed the discrete wavelet transform (DWT) of each of these time series and display the resulting wavelet transform matrices next to the shocklet transform matrices in Fig. 5. Direct comparison of the sociotechnical time series (\(r_{t}\)) with the draws from the null model reveals \(r_{t}\)’s moderate autocovariance as well as the large, shock-like fluctuation that occurs in late July of 2015. (The underlying driver of this fluctuation was the release of a popular song entitled “Hotline Bling” on July 31st, 2015.) In comparison, the draws from the null model have a covariance with much more prominent time scaling and do not exhibit dramatic shock-like fluctuations as \(r_{t}\) does. Comparing the DWT of these time series with the respective DST provides more evidence that the DST exhibits superior space-time localization of shock-like dynamics than does the DWT.
To aggregate deterministic behavior across all timescales of interest, we first marginalize over window sizes, defining
$$ \mathrm{C}_{\mathcal{K}^{(S)}}(t|\theta ) = \sum _{W} \mathrm{C}_{ \mathcal{K}^{(S)}}(t, W|\theta )p(W|\theta ), $$
(12)
for all windows W considered. The function \(p(W|\theta )\) is a probability mass function that encodes prior beliefs about the importance of particular values of W. For example, if we are interested primarily in time series that display shock or shock-like behavior that usually lasts for approximately one month, we might specify \(p(W|\theta )\) to be sharply peaked about \(W = 28\) days. Throughout this work we take an agnostic view on all possible window widths and so set \(p(W|\theta ) \propto 1\), reducing our analysis to a strictly maximum-likelihood-based approach. Summing over all values of the shocklet parameter θ defines the shock indicator function,
$$\begin{aligned} \mathrm{C}_{\mathcal{K}^{(S)}}(t) &= \sum_{\theta } \mathrm{C}_{ \mathcal{K}^{(S)}}(t|\theta ) p(\theta ) \end{aligned}$$
(13)
$$\begin{aligned} &= \sum_{\theta , W} \mathrm{C}_{\mathcal{K}^{(S)}}(t, W |\theta )p(W| \theta ) p(\theta ). \end{aligned}$$
(14)
In analogy with \(p(W|\theta )\), the function \(p(\theta )\) is a probability density function describing our prior beliefs about the importance of various values of θ. As we will show later in this section, and graphically in Fig. 6, the shock indicator function is relatively insensitive to choices of θ, possessing a nearly-identical \(\ell _{1}\) norm for wide ranges of θ and for different functional forms of \(\mathcal{K}^{(S)}\).
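Under the flat priors used here, Eqs. (12)–(14) reduce to simple averages of the DST matrix over W and then over θ; a sketch built on the illustrative `discrete_shocklet_transform` function above:

```python
import numpy as np

def shock_indicator(x, kernel_fn, thetas, windows):
    """Eqs. (12)-(14) with uniform priors over W and theta:
    average the DST over all window sizes, then over theta."""
    C_thetas = [
        discrete_shocklet_transform(x, kernel_fn, th, windows).mean(axis=1)
        for th in thetas
    ]  # each term is Eq. (12) with p(W | theta) uniform
    return np.mean(C_thetas, axis=0)  # Eqs. (13)/(14) with p(theta) uniform
```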
After calculation, we normalize \(\mathrm{C}_{\mathcal{K}^{(S)}}(t)\) so that it again sums to zero and has \(\max_{t} \mathrm{C}_{ \mathcal{K}^{(S)}}(t) - \min_{t} \mathrm{C}_{\mathcal{K}^{(S)}}(t) = 2\). The shock indicator function is used to find windows in which the time series displays anomalous shock-like behavior. These windows are defined as
$$ \bigl\{ t\in [0, T]:\text{intervals where } \mathrm{C}_{\mathcal{K} ^{(S)}}(t) \geq s \bigr\} , $$
(15)
where the parameter \(s > 0\) sets the sensitivity of the detection.
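The normalization and the windowing of Eq. (15) then take only a few lines (a sketch; the sensitivity parameter s is user-set):

```python
import numpy as np

def anomalous_windows(indicator, s):
    """Normalize the shock indicator (zero sum, peak-to-peak range of 2),
    then return the time indices satisfying Eq. (15)."""
    c = indicator - indicator.mean()   # re-center so the indicator sums to zero
    c = 2.0 * c / (c.max() - c.min())  # rescale to a peak-to-peak range of 2
    return np.flatnonzero(c >= s)      # indices of intervals where c >= s
```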
The DST is relatively insensitive to quantitative changes in its functional parameterization; it is a qualitative tool for highlighting time periods of unusual events in a time series. In other words, it does not detect statistical anomalies but rather time periods during which the time series appears to take on certain qualitative characteristics, without being too sensitive to a particular functional form. We analyzed two example sociotechnical time series—the rank of the word “bling” on Twitter (for reasons we will discuss presently) and the price time series of Bitcoin (symbol BTC) [28], the most actively-used cryptocurrency [29]—and one null model, a pure random walk. For each time series, we computed the shock indicator function using two kernels, each of which had a different functional form (one kernel given by the function of Eq. (10) and one of identical form but constructed by setting \(\mathcal{K}^{(S)}(\tau |W,\theta )\) to the function given in Eq. (1)), and evaluated each kernel over a wide range of its parameter θ. We also varied the maximum window size from \(W = 100\) to \(W = 1000\) to explore the sensitivity of the shock indicator function to this parameter. We display the results of this comparative analysis in Fig. 6. For each time series, we plot the \(\ell _{1}\) norm of the shock indicator function for each \((\theta , W)\) combination. We find that, as stated earlier in this section, the shock indicator function is relatively insensitive to both functional parameterization and the value of the parameter θ; for any fixed W, the \(\ell _{1}\) norm of the shock indicator function barely changes regardless of the value of θ or the choice of \(\mathcal{K}^{(\cdot )}\). However, the maximum window size does have a notable effect on the magnitude of the shock indicator function; higher values of W are associated with larger magnitudes. This is a reasonable finding, since a higher maximum W means that the DST is able to capture shock-like behavior that occurs over longer timespans, and hence may take on values of higher magnitude over longer periods than for comparatively lower maximum W.
That the shock indicator function is a relative quantity is both beneficial and problematic. The utility of this feature is that the dynamic behavior of time series derived from systems of widely-varying time and length scales can be directly compared; while the rank of a word on Twitter and—for example—the volume of trades in an equity security are entirely different phenomena measured in different units, their shock indicator functions are unitless and share similar properties. On the other hand, the shock indicator function carries with it no notion of dynamic range. Two time series \(x_{t}\) and \(y_{t}\) could have identical shock indicator functions but have spans differing by many orders of magnitude, i.e., \(\operatorname{diam} x_{t} \equiv \max_{t} x _{t} - \min_{t} x_{t} \gg \operatorname{diam} y_{t}\). (In other words, the diameter of a time series on an interval I is just the dynamic range of the time series over that interval.) We can directly compare time series inclusive of their dynamic range by computing a weighted version of the shock indicator function, \(\mathrm{C}_{\mathcal{K}}(t) \Delta x(t)\), which we term the weighted shock indicator function (WSIF). A simple choice of weight is
$$ \Delta x(t) = \mathop {\operatorname {diam}}_{t \in [t_{b}, t_{e}]}x_{t}, $$
(16)
where \(t_{b}\) and \(t_{e}\) are the beginning and end times of a particular window. We use this definition for the remainder of our paper, but one could easily imagine using other weighting functions, e.g., maximum percent change (perhaps applicable for time series hypothesized to increment geometrically instead of arithmetically).
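A sketch of this weighting, computing the diameter of the series over each contiguous flagged window (for example, as returned by the illustrative `anomalous_windows` function above) and scaling the indicator accordingly:

```python
import numpy as np

def weighted_shock_indicator(x, indicator, windows_idx):
    """Weight the shock indicator by the series's dynamic range
    (Eq. (16)) over each contiguous flagged window."""
    wsif = np.zeros_like(indicator)
    # split the flagged indices into contiguous windows [t_b, t_e]
    for seg in np.split(windows_idx, np.where(np.diff(windows_idx) > 1)[0] + 1):
        if seg.size == 0:
            continue
        t_b, t_e = seg[0], seg[-1]
        diam = x[t_b:t_e + 1].max() - x[t_b:t_e + 1].min()  # diam over window
        wsif[seg] = indicator[seg] * diam
    return wsif
```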
These final weighted shock indicator functions are the ultimate output of the shocklet transform and ranking (STAR) algorithm. The weighting corresponds to the actual magnitude of the dynamics and constitutes the “ranking” portion of the algorithm; it will be substantially larger than zero only if there existed intervals of time during which the time series exhibited shock-like behavior as indicated in Eq. (15). We present a conceptual, bird’s-eye view of the STAR algorithm (of which the DST is a core component) in Fig. 7. Though this diagram lacks technical detail, we include it to orient the reader to the conceptual process underpinning the algorithm.
2.2.2 Algorithmic details: comparison with existing methods
On a coarse scale, there are five nonexclusive categories of time series data mining tasks [30]: similarity search (also termed indexing), clustering, classification, summarization, and anomaly detection. The STAR algorithm is a qualitative, shape-based, timescale-independent, similarity search algorithm. As we have shown in the previous section, the discrete shocklet transform (a core part of the overarching STAR algorithm) is qualitative, meaning that it does not depend too strongly on the values of functional parameters or even on the functions used in the cross-correlation operation themselves, as long as those functions share the same qualitative dynamics (e.g., increasing rates of increase followed by decreasing rates of decrease for cusp-like dynamics); hence, it is primarily shape-based rather than relying on the quantitative definition of a particular functional form. STAR is timescale-independent, as it is able to detect shock-like dynamics over a wide range of timescales limited only by the maximum window size for which it is computed. Finally, we categorize STAR as a similarity search algorithm because this is the best-fitting of the five labels listed above; STAR searches sociotechnical time series for dynamics that are similar to the shock kernel, albeit similar in a qualitative sense and over an arbitrary timescale, not functionally similar in numerical value and characteristic timescale. However, it could also be considered a type of qualitative, shape-based anomaly detection algorithm, because we are searching for behavior that is, in some sense, anomalous compared to the usual baseline behavior of many time series (though see the discussion at the beginning of the anomaly detection subsection near the end of this section: STAR is an algorithm that can detect defined anomalous behavior, not an algorithm to detect arbitrary statistical anomalies).
As such, we are unaware of any existing algorithm that satisfies these four criteria and believe that STAR represents an entirely new class of algorithms for sociotechnical time series analysis. Nonetheless, we now provide a detailed comparison of the DST with other algorithms that solve related problems, and in Sect. 2.1 provide an in-depth quantitative comparison with another nonparametric algorithm (Twitter’s anomaly detection algorithm) that one could attempt to use to extract shock-like dynamics from sociotechnical time series.
Similarity search—here the objective is to find time series that optimize some similarity criterion between candidate time series and a given reference time series. Algorithms to solve this problem include nearest-neighbor methods (e.g., k-nearest neighbors [31] or locality-sensitive hashing-based methods [32, 33]), the discrete Fourier and wavelet transforms [5, 34–36], and bit-, string-, and matrix-based representations [30, 37–39]. With suitable modification, these algorithms can also be used to solve time series clustering problems. Generic dimensionality-reduction techniques, such as singular value decomposition/principal components analysis [40–42], can also be used for similarity search by searching through a dataset of lower dimension. Each of these classes of algorithms differs substantially in scope from the discrete shocklet transform. Chief among the differences is the focus on the entire time series. While the discrete shocklet transform implicitly searches the time series for similarity with the kernel function at all (user-defined) relevant timescales and returns qualitatively-matching behavior at the corresponding timescale, most of the algorithms considered above do no such thing; the user must break the time series into sliding windows of length τ and execute the algorithm on each sliding window, and if the user desires timescale-independence, they must then vary τ over a desired range. An exception to this statement is Mueen’s subsequence similarity search algorithm (MSS) [43], which computes sliding dot products (cross-correlations) between a long time series of length T and a shorter kernel of length M before defining a Euclidean distance objective for the similarity search task. When this sliding dot product is computed using the fast Fourier transform, the computational complexity of this task is \(\mathcal{O}(T \log T)\). This computational step is also at the core of the discrete shocklet transform, but is performed for multiple kernel function arrays (more precisely, for the kernel function resampled at multiple user-defined timescales). Unlike the discrete shocklet transform, MSS does not subsequently compute an indicator function and does not have the self-normalizing property, while the matrix profile algorithm [39] computes an indicator function of sorts (their “matrix profile”) but is not timescale-independent and is quantitative in nature; it does not search for a qualitative shape match as the discrete shocklet transform does. We are unaware of a similarity-search algorithm aside from STAR that is both qualitative in nature and timescale-independent.
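This shared computational core is easy to sketch: cross-correlation is convolution with a time-reversed kernel, so a single FFT-based convolution (here via SciPy's `fftconvolve`) yields the \(\mathcal{O}(T \log T)\) sliding dot product; the DST then repeats this step once per resampled kernel:

```python
from scipy.signal import fftconvolve

def sliding_dot_product(x, k):
    """Cross-correlation of x with kernel k in O(T log T) time.

    Correlation equals convolution with the time-reversed kernel, so a
    single FFT-based convolution suffices; the DST repeats this once
    per window size (i.e., per resampled kernel array).
    """
    return fftconvolve(x, k[::-1], mode="same")
```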
Clustering—given a set of time series, the objective is to partition them into groups, or clusters, that are more homogeneous within each cluster than between clusters. Viewing a collection of N time series of length T as a set of vectors in \(\mathbb{R}^{T}\), any clustering method that can be effectively used on high-dimensional data has potential applicability to clustering time series. Some of these general clustering methods include k-means and k-medians algorithms [44–46], hierarchical methods [47–49], and density-based methods [47, 50–52]. There are also methods designed specifically for clustering time series data, such as error-in-measurement models [53], hidden Markov models [54], simulated annealing-based methods [55], and methods designed for time series that are well-fit by particular classes of parametric models [56–59]. Although the discrete shocklet transform component of the STAR algorithm could be coerced into performing a clustering task by using different kernel functions and elements of the reflection group, clustering is not the intended purpose of the discrete shocklet transform or of STAR more generally. In addition, none of the clustering methods mentioned replicate the results of the STAR algorithm. These clustering methods uncover groups of time series that exhibit similar behavior over their entire domain; application of clustering methods to time series subsequences leads to meaningless results [60]. Clustering algorithms are also shape-independent in the sense that they cluster data into groups that share similar features, but do not search for specific known features or shapes in the data. In contrast, when using the STAR algorithm we have already specified a specific shape—for example, the shock shape demonstrated above—and are searching the data across timescales for occurrences of that shape. The STAR algorithm also does not require multiple time series in order to function effectively, differing from any clustering algorithm in this respect; a clustering algorithm applied to \(N=1\) data points trivially returns a single cluster containing the single data point. The STAR algorithm operates identically on one or many time series, as it treats each time series independently.
Classification—classification is the canonical supervised statistical learning problem in which data \(x_{i}\) are observed along with discrete labels \(y_{i}\) taken to be a function of the data, \(y_{i} = f(x_{i}) + \varepsilon \); the goal is to recover an approximation to f that precisely and accurately reproduces the labels for new data [61]. This is the category of time series data mining algorithms that least corresponds with the STAR algorithm. The STAR algorithm is unsupervised—it does not require training examples (“correct labels”) in order to find subsequences that qualitatively match the desired shape. As above, the STAR algorithm also does not require multiple time series to function well, while (non-Bayesian) classification algorithms rely on multiple data points in order to learn an approximation to f.
Summarization—since time series can be arbitrarily large and composed of many intricately-related features, it may be desirable to have a summary of their behavior that encompasses the time series’s “most interesting” features. These summaries can be numerical, graphical, or linguistic in nature. Underlying methodologies for time series summary tasks include wavelet-based approaches [62, 63], genetic algorithms [64, 65], fuzzy logic and systems [66–68], and statistical methods [69]. Though intermediate steps of the STAR algorithm can certainly be seen as time series summarization mechanisms (for example, the matrix computed by the DST, or the weighted shock indicator functions used in determining the rank relevance of individual time series at different points in time), the STAR algorithm was not designed for time series summarization and should not be used for this task, as it will be outperformed by essentially any algorithm that was actually designed for summarization. Any “summary” derived from the STAR algorithm will have utility only in summarizing segments of the time series whose behavior matches the kernel shape, or in distinguishing segments of the time series that have a shape similar to the kernel from ones that do not.
Anomaly detection—if a “usual” model can be defined for the system under study, an anomaly detection algorithm is a method that finds deviations from this usual behavior. Before we briefly review time series anomaly detection algorithms and compare them with the STAR algorithm, we distinguish between two subtly different concepts: the data mining notion of anomaly detection, and the physical or social scientific notion of anomalous behavior. In the first sense, any deviation from the “ordinary” model is termed an anomaly and marked as such. The ordinary model need not be a parametric model to which the data is compared; for example, it may be implicitly defined as the behavior that the data exhibits most of the time [70]. In the physical and social sciences, on the other hand, it may be observed that, given a particular set of laboratory or observational conditions, a material, state vector, or collection of agents exhibits phenomena that are anomalous when compared to a specific reference situation, even if this behavior is “ordinary” for the conditions under which the phenomena are observed. Examples of such anomalous behavior in physics and economics include: spectral behavior of polychromatic waves that is very unusual compared to the spectrum of monochromatic waves (even though it is typical for polychromatic waves near points where the wave’s phase is singular) [71]; the entire concept of anomalous diffusion, in which diffusive processes with mean square displacement scaling as \(\langle r^{2}(t)\rangle \sim t^{\alpha }\) are said to diffuse anomalously if \(\alpha \not \approx 1\) (since \(\alpha = 1\) is the scaling of the Wiener process’s mean square displacement) [72, 73], even though anomalous diffusion is the rule rather than the exception in intra-cellular and climate dynamics, as well as financial market fluctuations; and behavior that deviates substantially from the “rational expectations” of non-cooperative game theory, even though such deviations are regularly observed among human game players [74, 75]. The distinction between algorithms designed for the task of anomaly detection and algorithms or statistical procedures that test for the existence of anomalous behavior, as defined here, is thus a subtle but significant one. The DST and STAR algorithm fall into the latter category: we designed the STAR algorithm to extract windows of anomalous behavior as defined by comparison with a particular null qualitative time series model (absence of clear shock-like behavior), not to perform the task of anomaly detection writ large by indicating the presence of arbitrary samples or dynamics in a time series that do not in some way comport with the statistics of the entire time series.
With these caveats stated, it is not the case that there is no overlap between anomaly detection algorithms and algorithms that search for some physically-defined anomalous behavior in time series; in fact, as we show in Sect. 2.1, there is significant convergence between windows of shock-like behavior indicated by STAR and windows of anomalous behavior indicated by Twitter’s anomaly detection algorithm when the underlying time series exhibits relatively low variance. Statistical anomaly detection algorithms typically propose a semi-parametric model or nonparametric test and confront data with the model or test; if certain datapoints are very unlikely under the model or exceed certain theoretical boundaries derived in constructing the test, then these datapoints are said to be anomalous. Examples of algorithms that operate in this way include: Twitter’s anomaly detection algorithm (ADV), which relies on a generalized seasonal extreme studentized deviate (ESD) test [76, 77]; the EGADS algorithm, which relies on explicit time series models and outlier tests [78]; time-series model and graph methodologies [79, 80]; and probabilistic methods [81, 82]. Each of these methods is strictly focused on solving the first problem that we outlined at the beginning of this subsection: finding points at which one or more time series exhibit behavior that deviates substantially from the “usual” or assumed behavior for time series of a certain class. As we outlined, this goal differs substantially from the one for which we designed STAR: searching for segments of time series (which may vary widely in length) during which the time series exhibits behavior qualitatively similar to underlying deterministic dynamics (shock-like behavior) that we believe is anomalous when compared to non-sociotechnical time series.