In this section we model the temporal changes in book sales, allowing us to capture and predict the observed sales patterns.
5.1 Bestsellers reach their sales peak in less than ten weeks
In Sect. 3 we argued that the first year sales are the most important for a hardcover. Indeed, for the 2035 fiction bestsellers for which we have at least two years of sales data, we find that 96% of the sales took place in the first year. Similarly, 94% of the sales of the 1699 nonfiction bestsellers happen in the first year. To systematically explore the dynamics of the sales patterns, we start by showing the weekly sales of all bestselling fiction (Fig. 9(A)) and nonfiction (Fig. 9(B)) books. The thick line corresponds to the median sales values. The peak sales values vary significantly from book to book, with some books selling over 100,000 copies at their peak while others reach only a few hundred. We therefore use a logarithmic scale to display all sales curves. These plots already indicate that for both fiction and nonfiction peak sales occur within the first ten weeks after a book’s release.
We find that almost all books, regardless of category, peak in the first 15 weeks after publication (Figs. 9(C) and (D)). Furthermore, most fiction books have their peaks strictly in the first 2–6 weeks; in contrast, for nonfiction, even though peaks at weeks 2–5 are common, the peak can happen any time during the first 15 weeks. For example, both The Lost Symbol by Dan Brown and Go Set A Watchman by Harper Lee peaked in their 3rd week, and so did Sarah Palin’s Going Rogue. George W. Bush’s Decision Points had its peak sales even earlier, in the second week after publication.
Still, there are some outliers that peak much later, towards the end of their first year or well into their second. These exceptionally late peaks are typically triggered by exogenous events such as winning awards, being adapted for a movie or, in rare cases, receiving a prominent public figure’s endorsement. We showcase several such examples in Figs. 9(E) and (F). All the Light We Cannot See, shown in purple in Fig. 9(E), is a novel written by Anthony Doerr, published on May 6, 2014. Its sales had passed an initial peak and were declining when it was shortlisted for the National Book Award later that year. The sales numbers tripled the week after it lost the award to Redeployment by Phil Klay. The novel later won both the 2015 Pulitzer Prize for Fiction and the 2015 Andrew Carnegie Medal for Excellence in Fiction, causing further peaks in sales numbers, but the most drastic effect was seen at the end of 2015, when people overwhelmingly chose this multiple award-winning book during their holiday shopping. Another example of awards causing late peaks is the nonfiction book The Immortal Life Of Henrietta Lacks (red in Fig. 9(F)) by Rebecca Skloot, which won both the American Association for the Advancement of Science’s Young Adult Science Book award and the Wellcome Trust Book Prize. The Help (red in Fig. 9(E)) by Kathryn Stockett, on the other hand, was a ‘sleeper hit’ which gradually increased in sales until a movie adaptation was announced. The announcement, coinciding with the holiday shopping season, propelled the book’s sales to more than 60,000 a week. Another peak in sales happened when the first pictures of the movie’s cast appeared, and the following holiday shopping season was also beneficial for the book. Finally, Humans of New York author Brandon Stanton (purple in Fig. 9(F)) and his well-known Facebook page of the same title as the book were featured on CNN shortly after the book’s publication, causing a second peak in sales. But the book’s biggest success came when Stanton interviewed the then U.S. President Barack Obama in the Oval Office in January of 2015.
These exogenous events aside, the data indicate that the first few weeks of a book are crucial: this is when a book captures the interest of its readership. This is also the time when publishers invest in a book’s advertising, and the most likely period for a book to be featured at the front of bookstores and considered for reviews in various media. As such, we expect a book’s sales to be the highest in that period.
5.2 Sales follow a universal pattern
As can be seen in Figs. 9(A) and (B), most books follow a similar sales pattern: the sales increase very fast, reach their peak in the first ten weeks and drop dramatically afterwards. This similarity suggests the existence of a universal sales pattern, i.e. the possibility that the properties of all sales curves are the same, independent of the details and degree of complexity of each individual book’s sales narrative. This hypothesis allows us to develop a simple yet general model, helping us identify the mechanisms that drive the sales of books. In general, three fundamental mechanisms contribute to the observed sales patterns:
(i) Each book carries a different value for its audience, stemming from the author’s name recognition, the writing style, the marketing efforts by the publisher and even the quality of the book cover. Some books are anticipated and well-liked, resulting in high sales, some will be unexpected, lacking familiar elements and hard to get into, resulting in lower sales. To account for these inherent differences, we define a parameter called the book’s fitness, \(\eta_{i}\), that captures the book’s ability to respond to the taste of a wide readership.
(ii) A book that sells well will attract even more sales, an effect called preferential attachment [24, 25]. Preferential attachment in this context is likely rooted in collective effects, like recommendations from friends, critics and celebrities, online reviews, and bookstores that display a sought-after book in visible spots. Mathematically it implies that the likelihood of purchasing a book depends on its cumulative sales to date, \(S_{i}^{t}\).
(iii) Finally, even the best books lose their novelty and fade from the public eye some time after their publication. Barring exogenous events, once the book has reached its target audience, fewer and fewer individuals are interested in purchasing it. To model this gradual loss of interest, we need to add an aging term, using a form adopted from the decay of citations in research papers [26]
$$ A_{i}(t) = \frac{1}{\sqrt{2\pi }\sigma_{i} t}\exp \biggl[- \frac{(\ln t - \mu_{i})^{2}}{2\sigma_{i}^{2}}\biggr], $$
(1)
where \(\mu_{i}\) is the book’s immediacy, determined by the time at which the sales reach their peak, and \(\sigma_{i}\) is the decay rate capturing the longevity. In the case of books, a lognormal aging term (1) is motivated by the fact that the time of purchase t can be approximated as a multiplicative process, resulting from independent random factors contributing to a reader’s decision to buy a book. Such random multiplicative processes are known to lead to a lognormal distribution [27–31].
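To make the role of \(\mu_{i}\) and \(\sigma_{i}\) in Eq. (1) more concrete, the aging term can be evaluated numerically. The following is a minimal sketch rather than code from the original analysis; the function names `aging` and `aging_mode` are hypothetical, and the unit of t (e.g. weeks) is an assumption. Note that the maximum of the aging term alone need not coincide with the peak of the sales curve, which also depends on fitness and preferential attachment.

```python
import math


def aging(t, mu, sigma):
    """Lognormal aging term A_i(t) of Eq. (1); t > 0 is the time since publication."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma * t)


def aging_mode(mu, sigma):
    """Time at which A_i(t) alone is largest (the mode of a lognormal density)."""
    return math.exp(mu - sigma ** 2)
```

For fixed \(\sigma_{i}\), increasing \(\mu_{i}\) shifts the maximum of the aging term to later times, while for fixed \(\mu_{i}\) a larger \(\sigma_{i}\) spreads the term over a longer period.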
Combining these mechanisms, we can write the probability \(\Pi_{i}(t)\) of a book i to be purchased at a time t after publication as [26]
$$ \Pi_{i}(t) \sim \eta_{i} S_{i}^{t} A_{i}(t), $$
(2)
which depends on (i) the book’s fitness \(\eta_{i}\), (ii) the total number of sales until t, \(S_{i}^{t}\) (preferential attachment), and (iii) the aging factor (1). Combining (i)–(iii), we find that the total sales of book i at time t after publication follows (a detailed derivation is given in Section S2.2 of the supplementary materials of [26]),
$$ S_{i}^{t} = m\bigl[e^{\lambda_{i} \Phi (\frac{\ln t - \mu_{i}}{\sigma_{i}})} - 1\bigr], $$
(3)
where
$$ \Phi (x) = (2\pi )^{-1/2} \int_{-\infty }^{x} e^{-y^{2}/2}\,dy $$
(4)
is the cumulative normal distribution, related to the error function as \(\Phi (x)=1/2 \operatorname {erfc}(-x/ \sqrt{2})\), where erfc is the complementary error function given by \(\operatorname {erfc}(x) = 1-\operatorname {erf}(x)\), and \(\lambda_{i}\) is the relative fitness, proportional to \(\eta_{i}\).
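Equations (3) and (4) are straightforward to evaluate numerically. The sketch below is illustrative only and is not taken from the original study: the function names are hypothetical, the default \(m = 30\) is the value quoted later in this section for the fits, and measuring t in weeks is an assumption.

```python
import math


def phi(x):
    """Standard normal CDF of Eq. (4), written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def cumulative_sales(t, lam, mu, sigma, m=30.0):
    """Cumulative sales S_i^t of Eq. (3) at time t > 0 after publication."""
    return m * (math.exp(lam * phi((math.log(t) - mu) / sigma)) - 1.0)


def period_sales(t, lam, mu, sigma, m=30.0):
    """Sales in the unit interval ending at t, as a difference of cumulative sales."""
    previous = cumulative_sales(t - 1, lam, mu, sigma, m) if t > 1 else 0.0
    return cumulative_sales(t, lam, mu, sigma, m) - previous
```

Because \(\Phi\) is bounded between 0 and 1, the cumulative sales returned by `cumulative_sales` increase monotonically with t and saturate at a finite value for large t.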
To demonstrate how the model (2)–(4) can reflect actual sales, in Fig. 10(B) we show the sales pattern of The Appeal by John Grisham, which sold over a quarter million copies in a single week after publication. We obtained the parameters \(\lambda = 10.37\), \(\mu = 2.03\) and \(\sigma = 1.12\) by fitting Eq. (3) to the book’s cumulative sales; the fit, shown in Fig. 10(C), closely tracks the real sales pattern (\(R^{2}=0.99\)). In fact, the model (3) can reproduce a wide range of sales patterns by varying only the three parameters μ, σ and λ (Fig. 10(E)).
A key prediction of model (3) is that, once expressed in suitably rescaled variables, all sales curves should follow the same universal curve. These rescaled variables are \(\tilde{t} \equiv (\ln t - \mu_{i})/\sigma_{i}\) and \(\tilde{S} \equiv \ln (1+S_{i}^{t}/m)/\lambda_{i}\), and by substituting them into (3) we obtain
$$ \tilde{S} = \Phi (\tilde{t}). $$
(5)
As an example, we show the rescaled curve for The Appeal in Fig. 10(D). The rescaled time \({\tilde{t}=1}\) roughly corresponds to the time of the peak sale, and for this book there were almost no sales before that point.
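The collapse expressed by Eq. (5) can be checked numerically on synthetic curves. The sketch below is an illustration under stated assumptions, not the procedure used to produce Fig. 10: the helper names are hypothetical, \(m = 30\) is taken from the fits described in this section, and the curves passed to `rescale` are assumed to follow Eq. (3) exactly.

```python
import math


def phi(x):
    """Standard normal CDF, Eq. (4)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def model_sales(t, lam, mu, sigma, m=30.0):
    """Cumulative sales from Eq. (3)."""
    return m * math.expm1(lam * phi((math.log(t) - mu) / sigma))


def rescale(t, sales, lam, mu, sigma, m=30.0):
    """Map a point (t, S_i^t) to the rescaled variables used in Eq. (5)."""
    t_tilde = (math.log(t) - mu) / sigma
    s_tilde = math.log1p(sales / m) / lam
    return t_tilde, s_tilde


def max_collapse_error(lam, mu, sigma, weeks=range(1, 105)):
    """Largest deviation of the rescaled curve from Phi for one synthetic book."""
    errors = []
    for t in weeks:
        t_tilde, s_tilde = rescale(t, model_sales(t, lam, mu, sigma), lam, mu, sigma)
        errors.append(abs(s_tilde - phi(t_tilde)))
    return max(errors)
```

For curves generated from Eq. (3), `max_collapse_error` is zero up to floating-point rounding for any choice of parameters; for real sales data the deviation from \(\Phi\) measures how well the model describes a given book.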
If the model fits the sales pattern for all books, we expect the rescaled curves derived for all books to collapse into a single curve. We therefore measured the \(\mu_{i}\), \(\sigma_{i}\) and \(\lambda_{i}\) values for all books in the New York Times bestseller data using \(m = 30\) and least-squares fitting on the available sales range for each book. We then rescaled the sales curve of each book accordingly; the rescaled sales curves are shown in Fig. 10(F) for fiction and (G) for nonfiction. The fact that the curves collapse into a single one indicates that the model correctly captures the sales pattern of most books.
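A parameter fit of this kind can be sketched with the standard library alone. The example below is not the fitting procedure used in this study: instead of a least-squares fit of Eq. (3) to the raw cumulative sales, it minimizes the squared error in the transformed variable \(\ln (1+S_{i}^{t}/m)\), where \(\lambda_{i}\) enters linearly and can be solved in closed form for each candidate pair \((\mu_{i}, \sigma_{i})\) on a grid. The function name `fit_book`, the grid ranges and the grid step are all assumptions made for illustration.

```python
import math


def phi(x):
    """Standard normal CDF, Eq. (4)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def fit_book(times, cum_sales, m=30.0, step=0.05):
    """Grid search over (mu, sigma); lambda follows in closed form for each pair.

    Minimizes sum_k [ln(1 + S_k/m) - lambda * Phi((ln t_k - mu)/sigma)]^2.
    Returns (lambda, mu, sigma).
    """
    y = [math.log1p(s / m) for s in cum_sales]
    log_t = [math.log(t) for t in times]
    best = None
    for i in range(0, 101):              # mu from 0.0 to 5.0 (assumed range)
        mu = i * step
        for j in range(2, 61):           # sigma from 0.1 to 3.0 (assumed range)
            sigma = j * step
            p = [phi((lt - mu) / sigma) for lt in log_t]
            denom = sum(pk * pk for pk in p)
            if denom == 0.0:
                continue
            # Linear least squares for lambda at fixed (mu, sigma).
            lam = sum(yk * pk for yk, pk in zip(y, p)) / denom
            sse = sum((yk - lam * pk) ** 2 for yk, pk in zip(y, p))
            if best is None or sse < best[0]:
                best = (sse, lam, mu, sigma)
    _, lam, mu, sigma = best
    return lam, mu, sigma
```

The resolution of the recovered \(\mu_{i}\) and \(\sigma_{i}\) is limited by the grid step; a gradient-based optimizer started from the grid optimum would refine the estimate. Fitting in the transformed variable also weights early, low-sales weeks more heavily than a fit to raw cumulative sales would.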
One limitation of the proposed model is that it cannot account for exogenous events like awards, movie adaptations or mentions by prominent venues or celebrities. These events may land an otherwise unnoticed book on the New York Times bestseller list years after its original publication as we have seen in Figs. 9(C)–(F). Yet, these are exceedingly rare cases and most bestsellers follow a more typical sales pattern, one that is well accounted for by our model.
Taken together, we find that by using the fundamental mechanisms of fitness, preferential attachment and aging, we can explain and accurately model the sales curves of all bestsellers, regardless of genre. We can do this by relying on our observation that all books follow a well defined, regular path in selling copies including the timing of the peak sales, exceptions being rare.
5.3 Predicting future sales
In the previous section we have seen that only three parameters are needed to describe the sales history of any bestseller: the fitness λ, the immediacy μ and the decay rate σ. In Figs. 11(A)–(C) we show the probability distributions of each parameter for all bestsellers. In (A) we see that the fitness distribution \(P(\lambda )\) is very similar for fiction and nonfiction bestsellers. This is to be expected, since these are all bestselling books and therefore all have high fitness. Yet, the variation of relative fitness is slightly higher for fiction than for nonfiction, indicating a broader range. The observation that fiction bestsellers show more variability than nonfiction bestsellers is consistent with earlier findings about the one year (Fig. 3) and weekly (Fig. 4) sales. Additionally, the λ distribution for fiction peaks at a slightly higher value than that for nonfiction, indicating a higher relative fitness on average. This is because fiction books sell more copies than nonfiction books on average, as discussed in Sect. 3.
The relative fitness alone determines how many copies a book will sell during its lifetime. Taking \(t \rightarrow \infty \) in Eq. (3), so that \(\Phi \rightarrow 1\), we obtain [26]
$$ S_{i}^{\infty } = m\bigl(e^{\lambda_{i}} - 1 \bigr), $$
(6)
predicting that the total number of sales of a book in its lifetime depends only on a single parameter, the relative fitness λ. Consequently, if the model captures the data well, we expect a good match between the predicted and measured total sales, even when the fitness parameter is obtained from a time period shorter than the book’s lifetime, allowing us to predict the total sales using (6). Results for different choices of the time period used to calculate λ are shown in Fig. 11(D) for the first 25 weeks and (E) for the first 50 weeks after book release. We find that a fit derived from the first 25 weeks results in quite accurate predictions for the total sales of most books, indicating that, within months of publication, our model can accurately predict how many copies a book will sell during its lifetime. As the number of weeks used for the fit increases, so does the accuracy of the prediction. Additionally, total sales of books with higher sales peaks are predicted more accurately, as indicated by the relative closeness of the red and orange dots to the 45 degree line, as opposed to the green and blue dots, which are generally more spread out.
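Equation (6) can be evaluated directly once λ has been estimated. The following minimal sketch uses a hypothetical function name, and it assumes that \(m = 30\), the value used for the fits in Sect. 5.2, also applies to the parameters quoted for The Appeal.

```python
import math


def lifetime_sales(lam, m=30.0):
    """Predicted lifetime sales from Eq. (6): m * (exp(lambda) - 1)."""
    return m * math.expm1(lam)


# Fitted relative fitness quoted in the text for The Appeal; m = 30 is assumed.
appeal_total = lifetime_sales(10.37)

# Sensitivity: shifting lambda by 0.1 rescales the prediction by about exp(0.1).
ratio = lifetime_sales(10.47) / lifetime_sales(10.37)
```

Under these assumptions `appeal_total` comes out slightly below one million copies. Because λ enters Eq. (6) exponentially, small errors in the fitted fitness translate into proportionally larger errors in the predicted total, which is consistent with predictions improving as more weeks are included in the fit.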
Figure 11(B) shows the probability distribution for the immediacy parameter \(P(\mu )\), indicating that both fiction and nonfiction books have similar immediacy distributions, i.e. all bestsellers reach their sales peak at similar times. This result is consistent with Fig. 9, including the observation that \(P(\mu )\) peaks at a slightly higher value for nonfiction than for fiction, pointing to later peak sales times for nonfiction compared to fiction bestsellers.
Finally, the probability distribution for the decay rate values \(P(\sigma )\) is shown in Fig. 11(C). The distributions for both fiction and nonfiction are quite narrow, following each other closely except at the very top, indicating very similar longevity and decay rates for all bestsellers. Yet, the distribution is slightly broader for fiction compared to nonfiction, indicating that on average, the longevity and continued success of fiction books vary more than nonfiction books, even among bestsellers.
The dashed lines on all three distributions show where the parameter values for The Appeal fall, indicating an extremely high fitness pointing to very high sales, typical immediacy pointing to average peak sale timing and lower than average decay rate pointing to a relatively slow drop in sales after the peak.
We find, overall, that the model (3) correctly describes the sales pattern of a book, accurately predicting the total sales once the book has been out for some time. However, we have seen in Fig. 9 that for the majority of bestsellers, most sales happen during the first few months. Consequently, a prediction of the future sales made many months after the publication date is of limited practical value, for instance for inventory management.