3.1 Multi-armed bandit (MAB) problem
According to DARPA [13], teams expanded their sources of information as the challenge progressed, e.g. by purchasing information from other teams or obtaining it from Twitter posts. Retrieving information from new sources can be seen as a form of exploration, while verifying the existing information sources is exploitation. We assume that exploration and exploitation are two competing processes [12], due to the limited resources (mainly time) each team has. This implies that in each trial a team needs to choose between submitting information from what it considers the most reliable sources (exploitation) and submitting information from another source (exploration), so this kind of social search problem can be modelled as a MAB problem.
In the conventional MAB problem, a gambler facing multiple slot machines, each with an unknown probability distribution of rewards, needs to decide (a) which machines to play, (b) the order of play, and (c) the number of times to play each machine, in order to maximize rewards [15, 16]. The player should spend a portion of a limited budget exploring every machine (or some of them) to estimate the distribution of rewards, and then use the remaining budget to exploit the machines with the highest expected rewards.
Lai and Robbins [16] proposed an adaptive allocation rule that attains the asymptotic lower bound on regret when the reward distributions belong to a one-parameter exponential family. Building on their work, Agrawal and Hedge [21] extended the problem by introducing a switching cost, while also attaining the asymptotic lower bound on regret. A switching cost is incurred whenever a different machine is explored; it discourages frequent switching and arises in a number of practical problems, e.g. oil exploration [22], research and development [23], and website morphing [24, 25]. Other variants of the MAB problem have been studied with various objectives. Hauser et al. [24] explored the best time to morph a website (switch to another layout style) to increase a consumer’s purchase probability. In online shortest-path problems, the objective is to minimize delays in network links that are unknown at first but become more predictable over time [26, 27].
In the social search problem discussed in this paper, the objective is to minimize the time required to find all key information (all correct locations in this study). We propose a general MAB model for time-critical social search tasks. Given k people and a set of opinions M, each person j espouses a subset of opinions \(s_{j} \subseteq M\); a player has access to each of the k opinionated people and keeps a record W of the opinions heard so far. Consider the following process: the player sequentially selects a person j and verifies an opinion from \(s_{j}\), which is added to W. If the opinion is true, the player receives payoff 1, and otherwise receives nothing. The objective is to minimize \(|W|\) once the accumulated payoff reaches a certain threshold.
In the DNC, this general MAB model is extended with the following properties (a minimal code sketch of the resulting model is given after the list):
1. The sources are regarded as arms. In the DNC, trading with other teams is regarded as one kind of source.
2. The reliability of a source can be estimated by submitting certain pieces of the information it provides. In the DNC, some teams submitted a single location in order to have DARPA validate it. This can be an effective submission strategy, if not an optimal one, because submitting more than one location can be confusing if the score in the feedback is lower than the number of locations in the submission.
3. The switching cost \(t_{\mathrm{switch}} \times d\) occurs when a team explores d sources and each access to a new source takes \(t_{\mathrm{switch}}\), e.g. the time spent negotiating with other teams when trading locations. We assume that the switching cost in the DNC is one-off, meaning that returning to an already explored source does not generate additional cost.
4. Each submission is a trial. The cost \(t_{\mathrm{submission}} \times n\) occurs when a team submits n times and each submission takes \(t_{\mathrm{submission}}\) of waiting for the feedback.
5. A team receives payoff 1 only if the submitted location is correct and has not been observed before.
6. The objective is to minimize the total search cost \(t_{\mathrm{search}} = t_{\mathrm{submission}} \times n + t_{\mathrm{switch}} \times d\) when all key information is found, that is, payoff = 10. Note that key information can appear repeatedly in different sources.
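To make the extended model concrete, the following minimal sketch (our own illustration in Python; class and variable names are not from the paper) implements sources as arms, the one-off switching cost, the submission cost, and the payoff rule above.

```python
class SocialSearchBandit:
    """Minimal sketch of the extended MAB model (illustrative, not the authors' code).
    sources: list of lists, sources[j] holds the location ids available from source j.
    correct_cluster: dict mapping a correct location id to its cluster id (1..10);
    false locations are simply absent from this dict."""

    def __init__(self, sources, correct_cluster, t_submission=2, t_switch=20):
        self.sources = sources
        self.correct_cluster = correct_cluster
        self.t_submission = t_submission      # minutes of waiting per submission
        self.t_switch = t_switch              # minutes to access a new source
        self.explored = set()                 # sources already accessed (one-off cost)
        self.found_clusters = set()           # clusters already credited with payoff 1
        self.n_submissions = 0

    def submit(self, source_idx, location):
        """Submit one location obtained from a source; return the payoff (0 or 1)."""
        self.explored.add(source_idx)         # switching cost counted once per source
        self.n_submissions += 1
        cluster = self.correct_cluster.get(location)
        if cluster is not None and cluster not in self.found_clusters:
            self.found_clusters.add(cluster)  # correct and not observed before
            return 1
        return 0

    def finished(self):
        """All key information found, i.e. total payoff = 10."""
        return len(self.found_clusters) == 10

    def t_search(self):
        """Total search cost: t_submission * n + t_switch * d."""
        return self.t_submission * self.n_submissions + self.t_switch * len(self.explored)
```

A strategy then repeatedly picks a source and one of its locations to submit until finished() is true, and t_search() gives the team’s score.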
We consider the following set of strategies, which were previously studied in relation to MAB: ϵ-greedy and its variants, interval estimation (referred to as IntEstim in the following), SoftMax, and POKER [28, 29]. Since the winning criterion of the DNC is discovering all correct locations, in an ideal case a team could submit all 10 correct locations in only 10 trials. In reality, however, the number of submissions is higher than the number of sources, because false locations vastly outnumber correct ones (Figure 2). In such a case, the MAB heuristic algorithms ϵ-greedy and interval estimation are applied, as both have proven to be promising strategies [28, 29]. We do not consider the SoftMax strategy [30], the POKER strategy [28], or their variants, as the former underperforms the other strategies and the latter does not suit this case, where there are more trials than arms [28]. Overall, we test 4 strategies: basic ϵ-greedy, ϵ-first, ϵ-decreasing, and IntEstim.
The ϵ-greedy strategy and its variants share a common greedy behavior: the best arm (the one with the highest expected reward based on the acquired knowledge) is always pulled, except when a (uniformly) random action is taken [28]. The basic ϵ-greedy strategy defines a fixed value of ϵ, the probability that a random arm is selected in the next trial. The ϵ-first strategy explores during the first \(\epsilon N\) trials and exploits the best arms in the remaining \((1- \epsilon) N\) trials. As the estimate of each arm’s reward distribution becomes more accurate over time, a fixed ϵ would make exploration at later stages inefficient. As an improvement, a more adaptive greedy strategy called the ϵ-decreasing strategy was proposed, in which the value of ϵ decreases as the experiment progresses, resulting in highly explorative behavior at the beginning but highly exploitative behavior at the end [29]. Unlike the former two cases, where ϵ is fixed, the ϵ-decreasing strategy requires the user to fine-tune the parameter c, which controls the rate at which ϵ decreases, to achieve an approximately optimal solution. Following Theorem 3 of [29], let Δ be the difference between the expectation \(\mu^{*}\) of the best arm and the expectation μ of the second best arm. The decreasing ϵ is defined as \(\varepsilon\stackrel{\mathrm{def}}{=} \min \{ 1 , \frac{c k}{n \Delta^{2}} \}\), where k is the number of arms and n is the number of trials. The larger the value of c, the slower ϵ decreases and the more exploration is performed.
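As an illustration of how the three greedy variants differ only in how ϵ is scheduled, the sketch below (our own; the default parameter values are placeholders, not the ones used in the experiments) returns the exploration probability for trial n and then makes the greedy or random choice.

```python
import random

def epsilon_for_trial(variant, n, total_trials, eps=0.1, c=5.0, k=20, delta=0.1):
    """Exploration probability at trial n (1-indexed) for each epsilon-greedy variant."""
    if variant == "basic":         # fixed epsilon
        return eps
    if variant == "first":         # explore for the first eps*N trials, then exploit
        return 1.0 if n <= eps * total_trials else 0.0
    if variant == "decreasing":    # Theorem 3 of [29]: eps_n = min{1, c*k / (n * delta^2)}
        return min(1.0, c * k / (n * delta ** 2))
    raise ValueError(f"unknown variant: {variant}")

def greedy_choice(empirical_means, epsilon):
    """Pull the best observed arm with probability 1 - epsilon, otherwise a random arm."""
    if random.random() < epsilon:
        return random.randrange(len(empirical_means))
    return max(range(len(empirical_means)), key=lambda j: empirical_means[j])
```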
In the IntEstim strategy, each arm is assigned an “optimistic reward estimate” within a certain confidence interval, e.g. 95%, and the arm with the highest estimate is pulled [28]. The upper bound of the reward estimate of an arm at step n is computed based on Algorithm 10 in [31]. Denoting by z the z-score of the chosen confidence level, the upper bound is defined as follows:
$$ub( \hat{\mu},\nu) = \frac{\frac{\hat{\mu}}{\nu} + \frac{z^{2}}{2 \nu} + \frac{z}{\sqrt{\nu}} \sqrt{ \frac{\hat{\mu}}{\nu} \biggl( 1- \frac{\hat{\mu}}{\nu} \biggr) + \frac{z^{2}}{4 \nu}}}{1+ \frac{z^{2}}{\nu}}, $$
where μ̂ and ν are the accumulated reward and the number of times the arm has been pulled by step n, respectively. Unobserved or infrequently observed arms tend to have an overestimated reward mean, which leads to further exploration of those arms; the more an arm is pulled, the closer the estimate is to the true reward mean [28]. Two factors cause the upper bound to be large: (1) the arm has seldom been pulled, and (2) the observed reward distribution is good. Moreover, a higher confidence level z leads to more exploration [28]. This experiment uses confidence levels ranging from 50% to 99.98% to test the IntEstim strategy (the corresponding z-scores of different confidence levels can be found in [32]).
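For concreteness, the function below (our sketch, a direct transcription of the bound above) computes the optimistic estimate from an arm’s accumulated reward μ̂, its pull count ν, and the z-score of the chosen confidence level; IntEstim then pulls the arm with the largest bound.

```python
import math

def upper_bound(mu_hat, nu, z):
    """Optimistic reward estimate ub(mu_hat, nu) from the equation above.
    mu_hat: accumulated reward of the arm, nu: number of pulls so far,
    z: z-score of the confidence level (e.g. 1.96 for 95%)."""
    if nu == 0:
        return float("inf")                   # unobserved arms look maximally attractive
    p = mu_hat / nu                           # observed success rate
    radius = (z / math.sqrt(nu)) * math.sqrt(p * (1 - p) + z * z / (4 * nu))
    return (p + z * z / (2 * nu) + radius) / (1 + z * z / nu)

def int_estim_choice(rewards, pulls, z=1.96):
    """IntEstim step: pull the arm with the highest optimistic estimate."""
    bounds = [upper_bound(r, n, z) for r, n in zip(rewards, pulls)]
    return max(range(len(bounds)), key=lambda j: bounds[j])
```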
3.2 Experiment settings
In theory, the number of sources during a time-critical social search could be unlimited, since participating teams are free to explore Twitter feeds, Facebook groups, online forums, personal contacts and any other type of source without restriction. New sources can be acquired at any stage of the challenge, and teams have no a priori knowledge of the number of sources available. However, the course of the DNC demonstrated that teams accumulate all key information from a limited number of sources, which they also trade with each other. To reflect this, and to simplify the experiment, we assume that all information M (correct and false) is provided by a fixed number \(k = (20, 40, \ldots, 80)\) of sources that are equally accessible by any team.
Teams have no initial knowledge about the reliability of sources; this knowledge is gained by submitting the information obtained from them. A set \(M = (1, \ldots, m)\), \(m = 458\), of unique locations was submitted during the whole competition, so each source contains up to \(s_{j} = \frac{m}{k}\) pieces of information. To simplify the experiment, we assume that all sources contain an equal amount of information. The ten correct clusters, sorted by order of appearance, contain 20, 2, 17, 17, 4, 21, 3, 4, 13, and 1 locations respectively. Therefore, the set of locations is \(M = ( L_{c}, L_{f} )\), where \(L_{c} = ( l_{1}^{1}, \ldots, l_{1}^{20} ), ( l_{2}^{1}, l_{2}^{2} ), \ldots, ( l_{10}^{1} )\) is the collection of correct locations and \(L_{f} = ( l_{103}, \ldots, l_{458} )\) is the set of false ones. We assume that a one-off switching cost \(t_{\mathrm{switch}}\) occurs when a team explores a new source. Due to the lack of information about switching costs in the DNC, and to simplify the experiment, we assume that the switching time is the same for each new source; during each simulation run \(t_{\mathrm{switch}}\) is randomly set to 5, 10, 15, 20, 25, or 30 minutes. The modified MAB problem is tested in three configurations:
I. Locations in M (regardless of whether they are correct or false) are uniformly distributed among the k sources;
II. Locations in M (regardless of whether they are correct or false) are normally distributed among the k sources; and
III. Correct clusters are set to contain the same number of locations (10 locations each), so that \(L_{c} = ( l_{1}^{1}, \ldots, l_{1}^{10} ), \ldots, ( l_{10}^{1}, \ldots, l_{10}^{10} )\), and correct locations are normally distributed among the k sources.
Setting III tests the performance of all strategies when every piece of correct information appears an equal number of times.
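The three configurations can be generated along the following lines (our illustrative sketch; the text above does not specify the exact sampling procedure, so the centre and spread of the normal distribution over source indices are assumptions).

```python
import random

def assign_to_sources(num_locations, k, distribution="uniform", spread=0.15):
    """Assign each location id in range(num_locations) to one of k sources.
    'uniform': each source is equally likely (setting I);
    'normal': the source index is drawn from a normal distribution centred on the
    middle source (settings II and III), clamped to the valid range.
    (For strictly equal source sizes one could instead shuffle and split into k chunks.)"""
    assignment = {}
    for loc in range(num_locations):
        if distribution == "uniform":
            src = random.randrange(k)
        else:
            src = int(round(random.gauss((k - 1) / 2, spread * k)))
            src = min(max(src, 0), k - 1)
        assignment[loc] = src
    return assignment

# Setting I: all 458 locations spread uniformly over k = 20 sources.
setting_1 = assign_to_sources(458, 20, "uniform")
# Setting III: 10 clusters x 10 correct locations, normally distributed over sources.
setting_3_correct = assign_to_sources(100, 20, "normal")
```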
Since the actual competition lasted 900 minutes, we set the average interval between two submissions to \(t_{\mathrm{submission}}= \frac{900}{m} \approx 2\) minutes. A team completes the challenge when all 10 correct locations have been successfully submitted; therefore, the total search time is
$$t_{\mathrm{search}} = t_{\mathrm{submission}} \times n + t_{\mathrm{switch}} \times d, $$
where n is the number of trials, d is the number of explored sources, and \(t_{\mathrm{search}}\) is the score of the team. Each strategy is run 1,000 times and the average value of \(t_{\mathrm{search}}\) is reported.
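The evaluation itself reduces to averaging this score over repeated runs, as in the sketch below (our illustration; run_strategy_once is a stand-in for a full simulation of one strategy against the environment sketched in Section 3.1, and the dummy values it returns are not results from the paper).

```python
import random
import statistics

T_SUBMISSION = 2  # minutes per submission, 900 / 458 ≈ 2 as derived above

def t_search(n, d, t_switch):
    """Total search time: submission cost plus one-off switching costs."""
    return T_SUBMISSION * n + t_switch * d

def run_strategy_once(t_switch):
    """Placeholder for one full simulation run; it should return the score produced
    by a strategy's trials n and explored sources d. Dummy values used here."""
    n = random.randint(150, 450)   # hypothetical number of submissions
    d = random.randint(5, 20)      # hypothetical number of explored sources
    return t_search(n, d, t_switch)

scores = []
for _ in range(1000):
    t_switch = random.choice([5, 10, 15, 20, 25, 30])  # drawn per run, as in Section 3.2
    scores.append(run_strategy_once(t_switch))
print(f"average t_search over 1,000 runs: {statistics.mean(scores):.1f} minutes")
```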
3.3 Results
In the randomized dataset (setting I), all sources tend to have similar reliability, so exploration-oriented and exploitation-oriented strategies should not differ significantly. The experimental results confirm this expectation: no strategy stands out from the others, and \(t_{\mathrm{search}}\) converges at approximately 950 minutes (\(k=20\), \(t_{\mathrm{switch}}=20\)).
In the normally distributed dataset (setting II), all strategies can achieve the best \(t_{\mathrm{search}} \approx 750\) minutes (\(k=20\), \(t_{\mathrm{switch}}=20\)) when their parameters are properly set (Figure 7). It should be noted that \(t_{\mathrm{search}}\) can be as low as 400 minutes when a team explores the most reliable sources during the exploration phase. The ϵ-greedy strategy and its variants can underperform IntEstim if the value of ϵ is not properly set. Similar to the findings of [28], making ϵ decrease does not improve performance. The results of the ϵ-greedy strategy and its variants imply that highly exploitative behavior can lead to lower switching cost and better overall performance, which means teams should focus on the most reliable sources found so far if they adopt ϵ-greedy strategies. IntEstim, however, performs well regardless of how the user sets the confidence level, even though its switching cost is relatively higher than that of the best settings of the ϵ-greedy strategies. Therefore, the IntEstim strategy should be adopted in this kind of competition, where some key information appears rarely across sources. We also let the submission interval follow a Weibull distribution (\(\lambda=1\), \(\kappa \in ( 1, 5 )\), \(E ( X ) = 2\)), and the results hold as well. As some correct clusters contain only a relatively small number of locations, a team must switch between many sources to collect them if they are missed early in the exploration stage. Therefore, in setting II, the difference in switching cost between the best strategy and the worst one is marginal.
In setting III of the simulation, however, where all correct clusters have an equal number of locations, the switching cost dominates the variance of the total search time (see Section C of Additional file 2). A team can most likely collect all ten correct locations from a small number of sources. Therefore, the highly exploitative ϵ-greedy strategy and its variants outperform the IntEstim strategy by switching less. Given that the highly exploitative ϵ-greedy strategy and its variants achieve promising overall performance in settings II and III, they should also be adopted in more general social search problems with an unknown reward distribution.
In conclusion, the results suggest that there is no universal optimal strategy for time-critical social search tasks with different reward distributions. Even though the IntEstim strategy outperforms the others in the case of the DNC, it can generate a higher switching cost than the others on average; in cases where the switching cost is higher than the verification cost, the IntEstim strategy could therefore produce an undesirable solution. On the other hand, highly exploitative greedy behavior guarantees a minimum number of switches while performance is only marginally degraded. Therefore, in general time-critical social search tasks, where the reward distribution and switching cost are usually unknown, we suggest adopting the highly exploitative ϵ-greedy strategy and its variants.