In Table 1, we report the results of our analysis. Across all museums, we find that models including data from Google Trends exhibit a lower mean absolute percentage error (MAPE) than models based on historical visitor numbers alone, with the only exception being the ARIMA with Google Trends model for the National Gallery. We also note that in most cases the performance of ARIMA models is better than that of NNAR models. Figure 1 (second row) depicts how the absolute percentage error varies over time for the three example museums (the Tate Modern, the National Portrait Gallery and the Science Museum Group) when using ARIMA models. Again, we observe that including data from Google Trends as a predictor tends to reduce the error in the estimates. In Additional file 1, we depict the same results for the other 13 museums in our analysis (Figs. S3 to S5).
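For reference, the MAPE underlying these comparisons can be computed directly from the observed and estimated visitor numbers. The following is a minimal sketch (the function and variable names are ours, not drawn from any published implementation of this analysis):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes `actual` contains no zeros, which holds here since
    monthly visitor counts for these museums are strictly positive.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))
```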
While our findings hold across nearly all of the museums considered, the improvement delivered by including Google Trends data does differ between museums. This could be for a variety of reasons, including the underlying volume of search queries for the museum; the extent to which the Google Trends topic truly captures searches relating to the museum or museum group; the extent to which people have reason to search for the museum other than when they visit; the extent to which people visit without searching for the museum, for example because the museum is in a popular tourist area; or measurement and sampling noise in either the official visitor numbers or Google search data.
We perform further validation of our results with the modified Diebold-Mariano test, which compares errors from time series models to check whether two different models exhibit a statistically significant difference in forecast accuracy [45, 46]. This test can be used with a range of forecast error measures, but it has been shown that the mean absolute scaled error (MASE) satisfies all the required assumptions of the test, such as the asymptotic normality of the forecast errors, whereas other common measures such as the MAPE may not [47]. The MASE is a scale-invariant measure of the accuracy of forecasts [48], making it possible to directly compare forecast errors for museums with very different visitor numbers. The MASE is also symmetric [48], such that it results in an equal penalty for underestimates and overestimates of the number of visitors.
The MASE compares the absolute error of a forecast with the error that would be expected from a naive forecast. For seasonal data, the naive forecast is that each value will be equal to the value observed one season ago. For monthly data with annual seasonality, this is therefore the value of the time series twelve months ago.
The MASE for seasonal time series is hence defined as:
$$\begin{aligned} \mathrm{MASE} = \frac{\sum _{t=1}^{T} \vert e_{t}\vert }{\frac{T}{T-m}\sum _{t=m+1}^{T} \vert Y_{t}-Y_{t-m}\vert }, \end{aligned}$$
where \(e_{t}\) is the forecast error, defined as the actual value \(Y_{t}\) minus the value forecast by the model undergoing testing; m is the seasonal period, which is 12 for our analyses of monthly data with annual seasonality; and \(Y_{t-m}\) is the naive forecast estimate. By definition, the naive seasonal forecast model would score a MASE of 1. Values of the MASE lower than 1 imply that the model undergoing testing performs better than the naive forecast model.
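The definition above translates directly into code. Below is a minimal sketch of the seasonal MASE in Python; the names and structure are ours, and it follows the equation above, with the seasonal naive scaling term computed over the same series as the forecast errors:

```python
import numpy as np

def seasonal_mase(actual, forecast, m=12):
    """Seasonal mean absolute scaled error.

    actual, forecast: aligned arrays of observed and estimated values
    m: seasonal period (12 for monthly data with annual seasonality)
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    T = actual.size
    # numerator: sum of absolute forecast errors
    abs_errors = np.abs(actual - forecast)
    # denominator: scaled absolute error of the seasonal naive forecast,
    # i.e. |Y_t - Y_{t-m}| summed over t = m+1, ..., T
    naive_errors = np.abs(actual[m:] - actual[:-m])
    scale = (T / (T - m)) * naive_errors.sum()
    return abs_errors.sum() / scale
```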
We generate monthly estimates from January 2011 until December 2018 using the same procedure described above, varying the training window between 30 months and 72 months. Bearing in mind once again that we start training our models with data from January 2005, our analysis here generates estimates from January 2011 onwards, rather than January 2010, to allow us to explore how performance differs when we use a longer training window of six years (72 months).
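To make the rolling-window procedure concrete, the sketch below uses statsmodels' SARIMAX as a stand-in for the automatically selected ARIMA models used in our analysis; the model orders shown are placeholders, the software choice is an assumption, and the Google Trends series enters as an exogenous regressor whose current-month value is available at nowcast time:

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def rolling_estimates(visitors, trends, window, start="2011-01", end="2018-12"):
    """One-month-ahead estimates from a rolling training window.

    visitors, trends: monthly pd.Series sharing a month-start DatetimeIndex
    window: training window length in months (30 to 72 in this analysis)
    """
    estimates = {}
    for target in pd.period_range(start, end, freq="M"):
        t = target.to_timestamp()               # month being estimated
        history = visitors.loc[:t].iloc[:-1]    # all months before t
        train = history.iloc[-window:]          # last `window` months
        exog = trends.loc[train.index].to_frame()
        # placeholder orders; the ARIMA orders are selected
        # automatically in the original analysis
        model = SARIMAX(train, exog=exog, order=(1, 0, 0),
                        seasonal_order=(1, 0, 0, 12))
        fit = model.fit(disp=False)
        # nowcast month t using the Google Trends value for t itself
        pred = fit.forecast(steps=1, exog=trends.loc[[t]].to_frame())
        estimates[t] = float(pred.iloc[0])
    return pd.Series(estimates)
```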
Figure 1 (third and fourth rows) depicts the value of the MASE for the three example museums in our analysis (the Tate Modern, the National Portrait Gallery and the Science Museum Group). Again, visual inspection suggests that, regardless of training window size, models including data from Google Trends tend to exhibit a lower MASE than baseline models based on historical visitor numbers alone. This holds for both ARIMA and NNAR models. In Additional file 1, we provide similar illustrations of the results for the other 13 museums analysed here (Figs. S3, S4 and S5). While the boost to performance delivered by Google Trends data is greater for some museums than for others, and performance occasionally worsens, we tend to observe the same broad pattern across our sample of museums.
We then perform the Diebold-Mariano test as follows. For a given training window size, for each model, we calculate the MASE for estimates made across all museums. In our analysis, there are four different models: baseline ARIMA, ARIMA with Google Trends, baseline NNAR, and NNAR with Google Trends. To compare each model with every other model, we therefore carry out six pairwise comparisons. To correct for multiple hypothesis testing, we adjust the p-values returned by the Diebold-Mariano test using the false discovery rate correction [49]. Across all training windows, we find that models including data from Google Trends have a statistically significantly lower MASE than models based on historical visitor numbers alone. We report further details of this analysis in Additional file 1 (Tables S2, S3 and S4).
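A compact implementation of this testing procedure might look as follows. The Harvey-Leybourne-Newbold small-sample modification of the Diebold-Mariano statistic is written out explicitly [45, 46]; the pairing logic and variable names are ours, and the losses are assumed to be per-period absolute scaled errors:

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

def modified_dm_test(loss1, loss2, h=1):
    """Modified (Harvey-Leybourne-Newbold) Diebold-Mariano test.

    loss1, loss2: per-period forecast losses for the two models
    h: forecast horizon; 1 for the one-month-ahead estimates here
    Returns the test statistic and a two-sided p-value.
    """
    d = np.asarray(loss1) - np.asarray(loss2)   # loss differential
    n = d.size
    d_bar = d.mean()
    # long-run variance from autocovariances up to lag h-1
    gammas = [np.sum((d[k:] - d_bar) * (d[:n - k] - d_bar)) / n
              for k in range(h)]
    var_d = (gammas[0] + 2.0 * sum(gammas[1:])) / n
    dm = d_bar / np.sqrt(var_d)
    # small-sample correction, compared against Student's t
    adj = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    stat = adj * dm
    return stat, 2 * stats.t.sf(abs(stat), df=n - 1)

def compare_models(losses):
    """Six pairwise comparisons between the four models, FDR-corrected.

    losses: dict mapping each model name to its array of losses.
    """
    pairs = list(combinations(losses, 2))
    pvals = [modified_dm_test(losses[a], losses[b])[1] for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
    return dict(zip(pairs, p_adj))
```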
To complement the Diebold-Mariano analysis, we build a regression model of the mean absolute scaled errors to investigate whether the type of adaptive nowcasting model used is a key predictor of the size of the error once the museum, month and training window size are all taken into account. We fit a generalised linear model using a gamma distribution, a logarithmic link function and robust standard errors, with the model, museum, month and training window as predictors. With 4 different models, 16 museums, 96 months of data and 43 training window lengths, our regression model is fit on 264 192 observations in total.
Each independent variable enters the model as a categorical variable. For the model variable, the four categories correspond to the different models: ARIMA, NNAR, ARIMA with Google Trends, and NNAR with Google Trends. We use the baseline ARIMA model as our reference level in the regression.
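In statsmodels, such a specification might be fit as follows. This is a sketch under stated assumptions: the column names are ours, and HC1 is one of several heteroskedasticity-robust covariance estimators (we do not specify which robust estimator is used in the main text):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_error_model(df):
    """Gamma GLM with log link for the mean absolute scaled errors.

    df: one row per estimate, with columns mase (positive), model,
        museum, month and window; all but mase are categorical,
        with baseline ARIMA as the reference level for model.
    """
    formula = ("mase ~ C(model, Treatment(reference='ARIMA'))"
               " + C(museum) + C(month) + C(window)")
    glm = smf.glm(formula, data=df,
                  family=sm.families.Gamma(link=sm.families.links.Log()))
    # robust (heteroskedasticity-consistent) standard errors
    return glm.fit(cov_type="HC1")
```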
We are particularly interested in the coefficients of the model dummy variables, since these indicate whether models using Google Trends data have lower errors than the baseline ARIMA model. The fitted coefficient of the model dummy variable corresponding to the ARIMA with Google Trends model is statistically significant and negative (−0.132, \(p<0.001\)). Similarly, the coefficient for the NNAR with Google Trends model is also statistically significant and negative (−0.078, \(p<0.001\)), whereas the coefficient for the NNAR model with no Google Trends data is statistically significant and positive (0.078, \(p<0.01\); all regressions are fit on the sample described above). Both of these results suggest that models which include Google Trends data result in smaller errors than their baseline counterparts, for ARIMA and NNAR models alike. More details of this analysis are presented in Additional file 1 (Table S5).
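Given the logarithmic link, these coefficients have a multiplicative interpretation. For example, the ARIMA with Google Trends coefficient of −0.132 corresponds to a factor of \(e^{-0.132} \approx 0.88\): holding museum, month and training window fixed, the expected MASE is roughly 12% lower than for the baseline ARIMA reference level. Similarly, \(e^{-0.078} \approx 0.92\) implies an expected MASE roughly 8% lower for the NNAR with Google Trends model relative to the same reference level.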
Our analysis so far has shown that models which use data from Google Trends perform better than models based on historical visitor numbers alone. However, one question remains: must the Google Trends data relate to the museum or gallery in question, or would any data from Google Trends appear to improve our estimates? If Google Trends data from unrelated topics were to significantly improve our estimates, this would suggest that our findings might be the result of a spurious correlation between search data and visitor numbers data. To address this final question, we repeat our analysis using data from Google Trends for control topics with limited or no relation to museums and galleries.
For our control topics, we choose: England, Travel, Buckingham Palace, Hyde Park, London, United Kingdom, Holiday, and Color. Again, we restrict our Google Trends request to data on searches made in the United Kingdom. We generate estimates for all museums in our analysis for the time period between January 2011 and December 2018, and calculate the MASE.
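As an illustration of how comparable control data might be retrieved, the unofficial pytrends library offers one route. The sketch below is illustrative only: pytrends is not an official Google API, plain search terms differ from the Freebase topic identifiers that Google Trends topics resolve to, and the timeframe shown is an assumption:

```python
from pytrends.request import TrendReq

# illustrative only: a plain search term, not the topic ID
# used when requesting a Google Trends topic
pytrends = TrendReq(hl="en-GB")
pytrends.build_payload(["Buckingham Palace"], geo="GB",
                       timeframe="2005-01-01 2018-12-31")
control_series = pytrends.interest_over_time()
```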
Figure 2 depicts our results for ARIMA models. For simplicity, we present the results averaged across all museums. We again see lower MASEs for estimates generated using data on Google searches for the museums and galleries in comparison to estimates generated using historical visitor numbers alone (Fig. 2(A)). In contrast, data on Google searches for unrelated topics makes almost no difference to estimates of visitor numbers when compared to estimates based on historical visitor numbers alone (Fig. 2(B)). In fact, visual inspection suggests that models that draw on Google Trends data for irrelevant control topics tend to perform slightly worse than the baseline overall.
To verify that Google Trends data for control topics does not improve estimates of visitor numbers, we investigate the performance of both ARIMA and NNAR models. We again build a regression model of the mean absolute scaled errors, using a gamma distribution, a logarithmic link function and robust standard errors, with the model, museum, month and training window as predictors. We build one such regression model for each control topic. We find that the fitted coefficient of the model dummy variable for the ARIMA with Google Trends model is positive for all control topics, and statistically significantly so in the vast majority of cases. For the NNAR with Google Trends model, the coefficient is statistically significantly larger than the coefficient for the NNAR baseline for two of the control topics (both differences >0.01, both \(p \mathrm{s}< 0.025\)), with no significant difference for the other six control topics (all absolute differences <0.0007, all \(p \mathrm{s}> 0.24\)). We report on this analysis in further detail in Additional file 1 (Tables S6–S13).
Overall, the results therefore indicate that the MASE either remains roughly the same or increases when irrelevant Google Trends data is fed into the model. We conclude that trying to improve estimates of visitor numbers with data on Google searches for unrelated control topics does not work, and may make the estimates worse. In the light of our previous results, this provides further evidence that data on search queries for a specific museum or gallery truly does contain valuable information on the number of people visiting those sites.