3.1 Features
Readers tend to choose books by authors they have read before or books written by celebrities; they often have a strong preference for specific genres and are more likely to notice well-marketed books. Our features were designed and consolidated with domain experts to capture each of these aspects of book selection.
Some of the aforementioned factors are easily quantified. For authors, visibility can be measured using Wikipedia pageviews at any given date, capturing how many people visit an author’s Wikipedia page over time [14, 20,21,22]. The sales of an author’s previous books are provided by Bookscan. The genre information is contained in the Bisac code, as discussed in Sect. 3.1.2. Topic information is produced by applying Non-negative Matrix Factorization to book descriptions collected from Amazon and Goodreads [23, 24]. Advertising, however, is difficult to quantify. Marketing and advertising are usually the publishers’ responsibility, and some publishers devote more marketing resources than others. Therefore, we use the publisher as a proxy for the extent of resources and know-how devoted to marketing. Publishers also play a role beyond marketing: they pass quality judgment by selecting books for publication, hence publisher quality/prestige also facilitates access to more prestigious authors and better manuscripts. Finally, we also consider seasonal fluctuations in book sales, previously demonstrated to be predictive [16].
In summary, we explore three feature categories: (1) author, which includes the author’s visibility and previous book sales; (2) book, which includes a book’s genre, topic and publishing month; and (3) publisher, which captures the prominence of the book’s publisher, a proxy for its marketing and distribution power. Next, we discuss each of these feature categories separately.
3.1.1 Author features
Author visibility:
We use Wikipedia pageviews as a proxy of the public’s interest in an author, capturing his or her fame or overall visibility. There are many aspects of visibility: cumulative visibility, representing all visits since the page’s creation date, is more relevant for some authors, while recent visibility matters more for others. To capture these multiple aspects, we explore several author-linked parameters for each book, forming the visibility feature group:
-
Cumulative visibility, \(F^{\mathrm{tot}}\), counts the total pageviews of an author up until the book’s publication date.
-
Longevity, \(t^{F}\), counts the days since the first appearance of an author’s Wikipedia page until the book’s publication date.
-
Normalized cumulative visibility, \(f^{\mathrm{tot}}\), divides the cumulative visibility by its longevity, i.e., \(f^{\mathrm{tot}} = \frac{F ^{\mathrm{tot}}}{t^{F}}\).
-
Recent visibility, \(F^{\mathrm{rec}}\), counts the total pageviews of an author during the month before the book’s publication. It captures the momentary popularity of the author around publication time.
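The four visibility parameters above can be illustrated on a daily pageview series. This is a minimal sketch: the function, the dictionary input format, and the 30-day window for "recent" are our own illustrative choices, not the paper's exact implementation.

```python
from datetime import date

def visibility_features(pageviews, pub_date):
    """Compute the four visibility features from an author's daily
    Wikipedia pageviews, given as a dict mapping date -> view count.
    Illustrative helper; names and input format are assumptions."""
    days = sorted(d for d in pageviews if d < pub_date)
    if not days:
        return {"F_tot": 0, "t_F": 0, "f_tot": 0.0, "F_rec": 0}
    F_tot = sum(pageviews[d] for d in days)          # cumulative visibility
    t_F = (pub_date - days[0]).days                  # longevity in days
    f_tot = F_tot / t_F if t_F > 0 else 0.0          # normalized visibility
    F_rec = sum(pageviews[d] for d in days
                if (pub_date - d).days <= 30)        # month before publication
    return {"F_tot": F_tot, "t_F": t_F, "f_tot": f_tot, "F_rec": F_rec}

pub = date(2020, 1, 31)
views = {date(2019, 12, 1): 3, date(2020, 1, 1): 10, date(2020, 1, 20): 5}
fv = visibility_features(views, pub)
```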
Previous sales:
We use the Bookscan weekly sales data to calculate the previous sales of all books written by an author. As with an author’s visibility, there are multiple ways to incorporate previous sales. For example, previous sales in genres other than that of the predicted book are relevant for authors who change genres during their career (for genre information see Sect. 3.1.2). We use the following information for each book, representing the previous sales feature group:
-
Total sales, \(S^{\mathrm{tot}}\), obtained by querying an author’s entire publishing history from Bookscan and summing up the sales of her previous books up until the publication date of the predicted book.
-
Sales in this genre, \(S^{\mathrm{tot}}_{\mathrm{in}}\), counts the author’s previous total sales in the same genre as the predicted book.
-
Sales in other genres, \(S^{\mathrm{tot}}_{\mathrm{out}}\), counts the author’s previous sales in other genres.
-
Career length, \(t^{p}\), counts the number of days from the date of the author’s first book publication until the publication date of the upcoming book.
-
Normalized sales, \(s^{\mathrm{tot}}\), normalizes the total sales based on the author’s career length, i.e., \(s^{\mathrm{tot}} = \frac{S^{ \mathrm{tot}}}{t^{p}}\).
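The previous-sales feature group can be sketched analogously, assuming the author's history is available as a list of records (a simplification: the real Bookscan data is weekly, and the record format here is our own).

```python
from datetime import date

def previous_sales_features(history, book_genre, pub_date):
    """Aggregate an author's previous sales from a list of
    (publication_date, genre, sales) records. Illustrative sketch."""
    prior = [(d, g, s) for d, g, s in history if d < pub_date]
    S_tot = sum(s for _, _, s in prior)                      # total sales
    S_in = sum(s for _, g, s in prior if g == book_genre)    # same genre
    S_out = S_tot - S_in                                     # other genres
    t_p = (pub_date - min(d for d, _, _ in prior)).days if prior else 0
    s_tot = S_tot / t_p if t_p > 0 else 0.0                  # normalized
    return {"S_tot": S_tot, "S_in": S_in, "S_out": S_out,
            "t_p": t_p, "s_tot": s_tot}

history = [(date(2015, 1, 1), "FIC", 1000), (date(2018, 1, 1), "HIS", 500)]
pf = previous_sales_features(history, "FIC", date(2020, 1, 1))
```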
3.1.2 Book features
Genre information:
Fiction and nonfiction books have different sales patterns, as shown in previous work [16], and within fiction and nonfiction each sub-genre may have its own behavior as well. We obtain direct genre information from the Bisac Code [17], a standard code that categorizes books into 53 major topics like “FICTION”, “HISTORY”, “COMPUTERS”, etc. Across these major topics there are more than 4000 sub-genres; for example, there are 189 genres under fiction, such as “FIC022000” for “FICTION / Mystery & Detective / General”. While we would like to account for each genre separately, some genres have too few books to offer representative statistics. To solve this problem, we use clustering to reduce the number of genres, aggregating genres with comparable size (i.e., number of books) and comparable sales potential. The clustering criteria are the number of books in each genre and the median sales of the genre’s books listed among top-selling (top 100) books, rather than the content of the topics. We cluster fiction and nonfiction separately using the K-means (\(k=5\)) clustering algorithm [25]. Figure 1 shows the clustering outcomes for fiction and nonfiction. For example, General Fiction and Literary are clustered into Fiction Group B. Some clusters are unexpected content-wise, like Nonfiction Group B, which combines Religion, Business & Economics and History, indicating that these three genres are similar in size and sales potential. The result of genre clustering is used to group books and calculate features.
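The genre clustering step can be sketched as follows, with each genre summarized by its number of books and the median sales of its top-selling books. The input values here are synthetic, and the feature standardization before K-means is our assumption; the paper does not specify the scaling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One row per genre: (number of books, median sales of its top-selling
# books). Synthetic values standing in for the Bookscan statistics.
X = np.column_stack([
    rng.integers(50, 5000, size=20),
    rng.integers(100, 100000, size=20),
]).astype(float)

# Standardize so both criteria contribute comparably (an assumption),
# then cluster genres into k=5 groups as in the paper.
Xs = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Xs)
```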
We use various distribution descriptors (the mean, median, standard deviation, and the 10th, 25th, 75th and 90th percentiles; the same set hereafter) of book sales within each genre cluster, forming the genre cluster feature group. This set of features quantifies the properties of each explored distribution.
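Since the same seven descriptors recur for the genre cluster, topic, month and imprint feature groups, a small helper suggests how they might be assembled (the function name and flat-dict output format are illustrative, not the paper's):

```python
import numpy as np

def distribution_descriptors(sales, prefix):
    """The seven descriptors used for every sales distribution
    (genre cluster, topic, month, imprint) as a flat feature dict."""
    s = np.asarray(sales, dtype=float)
    return {
        f"{prefix}_mean": s.mean(),
        f"{prefix}_median": np.median(s),
        f"{prefix}_std": s.std(),
        f"{prefix}_p10": np.percentile(s, 10),
        f"{prefix}_p25": np.percentile(s, 25),
        f"{prefix}_p75": np.percentile(s, 75),
        f"{prefix}_p90": np.percentile(s, 90),
    }

d = distribution_descriptors([1, 2, 3, 4, 5], "genre")
```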
Topic information:
Genre information is assigned by publishers and can differ from how readers categorize books. For example, books under Bisac Code “BUS” (Business) can cover very different subjects, varying from finance to the science of success. Therefore, we extract each book’s topics from one-paragraph book summaries created by publishers, which offer a better sense of the actual content of the book. We utilize Non-negative Matrix Factorization (NMF) techniques from Natural Language Processing [23, 24], which output two matrices: a topic-keyword matrix and a book-topic matrix. The topic-keyword matrix allows us to create a topic-keyword bipartite graph showing the composition of each topic, as shown in Fig. 2. For each topic, we obtain the book sales distribution and the descriptors introduced in the previous section, such as the mean, median, standard deviation, and different percentiles of the distribution. Then, for each book, represented as a linear combination of several topics with weights taken from the book-topic matrix, the features are calculated as the weighted average of each statistic over the book’s topics.
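The topic-extraction step can be sketched with scikit-learn. The TF-IDF vectorization choice and the toy summaries are our assumptions; the paper only specifies that NMF is applied to the book descriptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy one-paragraph summaries standing in for the publisher-written
# book descriptions collected from Amazon and Goodreads.
summaries = [
    "a detective investigates a murder in a small town",
    "a murder mystery with a brilliant detective",
    "how to grow your business and lead your company",
    "practical advice for business leaders and managers",
]

# Bag-of-words matrix (TF-IDF weighting is an assumption), then NMF:
# W is the book-topic matrix, H is the topic-keyword matrix.
tfidf = TfidfVectorizer(stop_words="english")
V = tfidf.fit_transform(summaries)
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(V)   # book-topic weights
H = nmf.components_        # topic-keyword weights
```

Each row of `W` gives the weights of the linear combination of topics representing one book, which are then used to average the per-topic sales descriptors.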
Publishing month:
A previous study of New York Times Bestsellers demonstrated that more books are sold during the holiday season in December [16]. We therefore aggregate all fiction and nonfiction hardcovers in our baseline collection of books published between 2008 and 2014 by their publishing month, confirming that book sales are influenced by the publication month (Fig. 3). Specifically, books published in October and November have higher one-year sales, while books published in December, January or February have lower sales. To account for the role of the publication month, we develop a month feature group: for each category (fiction and nonfiction) we obtain the book sales distribution for each month and include the resulting distribution descriptors as features.
3.1.3 Publisher features
In the Bookscan data, each book is assigned to a publisher and an imprint. In the publishing industry, a publishing house usually has multiple imprints with different missions. Some imprints may be dedicated to a single genre: for example, Portfolio under Penguin Random House only publishes business books. Each imprint independently decides which books to publish and takes responsibility for its editorial process and marketing. Some imprints are more attractive to authors because they offer higher advances and have more marketing resources. Additionally, more prominent imprints tend to be more selective, and books published by those imprints have higher sales.
To capture the prominence of a particular imprint, we looked at our baseline collection of books published between 2008 and 2014 and discovered that the variation in sales within each imprint can span several orders of magnitude (Fig. 4). For example, for Random House, the highest-selling book sold one million copies in a year while the lowest-selling book sold less than a hundred copies. Similar to the publishing month, we develop an imprint feature group where for each category (fiction and nonfiction) we obtain the book sales distribution of each imprint and use the distribution descriptors as predictive features.
3.2 Learning algorithms
Book sales follow a heavy-tailed distribution (see Fig. 5), and in general the prediction and regression of such heavy-tailed outcomes is challenging [26, 27]. Indeed, the variance and higher-order moments of heavy-tailed distributions are not well-defined, and statistical methods that assume bounded variance lead to biased estimates. The literature on heavy-tail regression has developed methods based on prior correction or on weighing data points [28, 29]. However, most regression methods show limited performance in learning non-linear decision boundaries and underpredict high-selling books. These high-selling books, however, are the most important for publishers, hence accuracy for them is most desired.
3.2.1 Learning to place
To address the imbalance and heavy-tailed outcome prediction problems, we employ the Learning to Place algorithm [30], which addresses the following problem: given a sequence of previously published books ranked by their sales, where would we place a new book in this sequence, estimating its sales based on this placement?
Learning to Place has two stages: (1) learn a pairwise preference classifier that predicts whether a new book will sell more or less than each book in the training set; (2) given the information from stage 1, place the new book in the ordered list of previously published books sorted by their sales. Note that going from pairwise preferences to even a partial ordering, let alone a ranking, is not trivial: the pairwise preferences may conflict. For example, the classifier might predict that A is better than B, B is better than C, and C is better than A. Our majority-vote technique in the second stage is designed to resolve such conflicts by choosing the placement that best agrees with the pairwise predictions. We briefly describe the two main stages of the Learning to Place algorithm below; they are graphically explained in Fig. 6.
In the training phase, for each book pair, i and j with feature vectors \(f_{i}\) and \(f_{j}\), we concatenate the two feature vectors \(X_{ij} = [f_{i}, f_{j}]\). For the target (a.k.a. response) variable, if book i’s sales number is greater than book j’s, we assign \(y_{ij} =1\), otherwise we assign \(y_{ij} = -1\) (ties are ignored in the training phase). Formally, denoting with \(s_{i}\) the sales of book i and with B the set of books in the training set, we have the training data as:
$$\begin{aligned} &X_{ij} = [f_{i}, f_{j}], \quad\text{for each } (i,j) \in B \times B, i \neq j, s_{i} \neq s_{j}, \\ &y_{ij} = \begin{cases} 1, & s_{i} > s_{j}, \\ -1, & s_{i} < s_{j}. \end{cases} \end{aligned}$$
Defining the training data this way converts the problem into a binary classification problem, in which we predict 1 or −1 for each book pair. We can therefore feed this training data to a classification algorithm (classifier) F that fits the label y (i.e., the target variable) and learns weights on each feature in the matrix X. In our study, we use the Random Forest Classifier [31] for this phase.
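The construction of the pairwise training set and the classifier fit might look as follows. This is a sketch on synthetic data; `make_pairs` is our own illustrative helper, not code from the paper.

```python
import numpy as np
from itertools import permutations
from sklearn.ensemble import RandomForestClassifier

def make_pairs(features, sales):
    """Build the pairwise training set: one row per ordered book pair
    (i, j) with s_i != s_j, X_ij = [f_i, f_j], y_ij = +1 or -1."""
    X, y = [], []
    for i, j in permutations(range(len(sales)), 2):
        if sales[i] == sales[j]:
            continue  # ties are ignored in the training phase
        X.append(np.concatenate([features[i], features[j]]))
        y.append(1 if sales[i] > sales[j] else -1)
    return np.array(X), np.array(y)

# Toy example: 4 books with 3 synthetic features each.
rng = np.random.default_rng(0)
feats = rng.random((4, 3))
sales = [100, 5000, 250, 90000]
X, y = make_pairs(feats, sales)
clf = RandomForestClassifier(random_state=0).fit(X, y)
```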
Stage 2 of Learning to Place happens during inference (i.e., the testing phase). First, the pairwise preferences are computed using the binary classification model. For each new (test) book k, we obtain
$$ X_{ki} = [f_{k}, f_{i}], \quad\text{for each } i \in B. $$
We then apply the classifier F on \(X_{ki}\) to obtain the predicted pairwise preference between the predicted book and all other books in the training data,
$$ \hat{y}_{ki} = F(X_{ki}). $$
Then, Learning to Place assigns the place of the new book by treating each book in the training data as a “voter”. Books (voters) from the training data are sorted by sales, dividing the sales axis into intervals. If \(\hat{y}_{ki} = 1\) (i.e., book k should sell more than book i), the sales intervals to the right of \(s_{i}\) each receive a “vote”. If \(\hat{y}_{ki} = -1\), book i “votes” for the intervals to the left of \(s_{i}\). After the voting process, we obtain a voting distribution for each test book and take the interval with the most “votes” as the predicted sales interval for book k. See Fig. 6 for a depiction of the voting procedure.
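The voting procedure can be sketched as follows: a minimal implementation of the interval-voting idea, where the function name and interval indexing are our own conventions.

```python
import numpy as np

def place_by_votes(y_hat, train_sales):
    """Stage 2: each training book votes for the sales intervals to
    the right (y=+1) or left (y=-1) of its own sales value; return
    the index of the interval with the most votes. y_hat[i] is the
    predicted preference against training book i; the n sorted books
    delimit n+1 intervals on the sales axis."""
    order = np.argsort(train_sales)
    n = len(order)
    votes = np.zeros(n + 1)
    for rank, idx in enumerate(order):
        if y_hat[idx] == 1:          # new book sells more than book idx:
            votes[rank + 1:] += 1    # vote for intervals to the right
        else:                        # new book sells less than book idx:
            votes[:rank + 1] += 1    # vote for intervals to the left
    return int(np.argmax(votes))

# The classifier says the new book outsells the two lowest sellers but
# not the two highest, so the middle interval (index 2) wins the vote.
interval = place_by_votes([1, 1, -1, -1], [100, 500, 2000, 9000])  # -> 2
```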
3.2.2 Baseline methods
-
Linear Regression We compare the Learning to Place method with Linear Regression. Most of the features we explore are heavy-tail distributed, as are the one-year sales. We therefore take the logarithm of our dependent and independent variables, obtaining the model:
$$ \log (PS) \sim \sum_{i} a_{i} \log (f_{i}) + \text{const,} $$
where \(PS\) denotes the predicted one-year sales of a book and \(f_{i}\) the ith feature among the studied features.
-
K-Nearest Neighbors (KNN) We employ regression based on k-nearest neighbors as an additional baseline model. The target variable is predicted by local interpolation of the targets of the nearest neighbors in the training set. We use the Euclidean distance metric, consider five nearest neighbors (\(k=5\)), and preprocess the features in the same fashion as in Linear Regression.
-
Neural Network The above two baselines do not capture nonlinear relationships between features; therefore, we use a simple Multilayer Perceptron with one hidden layer of 100 neurons as another baseline. The features are preprocessed in the same fashion as in Linear Regression.
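The three baselines can be sketched together on synthetic data. The log-transform offset of 1 (to guard against log of zero) and the synthetic features and sales are our assumptions; the model settings follow the text (log-log linear regression, KNN with \(k=5\) and Euclidean distance, an MLP with one hidden layer of 100 neurons).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
F = rng.uniform(1, 1e4, size=(200, 5))        # synthetic raw features
sales = F[:, 0] * rng.lognormal(0, 1, 200)    # synthetic heavy-tailed sales

# Log-transform both the dependent and independent variables.
X = np.log(F + 1)
y = np.log(sales + 1)

models = {
    "linreg": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),   # Euclidean by default
    "mlp": MLPRegressor(hidden_layer_sizes=(100,), max_iter=500,
                        random_state=0),
}
preds = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```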
3.3 Model testing
To test the model, we use k-fold cross validation [32, 33] with \(k = 5\), applying the evaluation methods to the test sample of each fold. We choose not to use the classic \(R^{2}\) score: book sales are heavy-tail distributed and we are more interested in the error in log space, where \(R^{2}\) is not well-defined because the error does not follow a Gaussian distribution, the basic assumption behind \(R^{2}\). The performance measures are as follows:
-
AUC and ROC: We evaluate the ranking produced by the algorithm against the true ranking. We consider the true value of each training instance as a threshold and binarize the predicted and target values with respect to this threshold. From these two binarized lists, we compute the true positive rate (TPR) and the false positive rate (FPR). Varying the threshold between high- and low-sale books traces the ROC (Receiver Operating Characteristic) curve, from which we calculate the AUC (Area Under Curve) score (see Additional file 1).
-
High-end RMSE: We calculate the RMSE (Root-Mean-Square Error) for high-selling books, measuring the accuracy of the sales prediction for books in the top 20th percentile of sales. Since book sales follow a heavy-tailed distribution, we calculate the RMSE on the log values of the predicted and actual sales.
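The high-end RMSE can be sketched as follows. The base-10 logarithm and the +1 offset are our assumptions; the text only specifies that the RMSE is computed on log values for books in the top 20th percentile.

```python
import numpy as np

def high_end_rmse(y_true, y_pred, top_frac=0.2):
    """RMSE in log space, restricted to books whose true sales fall
    in the top 20% of the test set. Sketch; log base and offset are
    illustrative choices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cutoff = np.quantile(y_true, 1 - top_frac)   # top-20th-percentile cut
    mask = y_true >= cutoff
    err = np.log10(y_true[mask] + 1) - np.log10(y_pred[mask] + 1)
    return float(np.sqrt(np.mean(err ** 2)))
```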