Leveraging augmentation techniques for tasks with unbalancedness within the financial domain: a two-level ensemble approach

Ranjbaran, Golshid; Reforgiato Recupero, Diego; Lombardo, Gianfranco; Consoli, Sergio

doi:10.1140/epjds/s13688-023-00402-9

Regular article
Open access
Published: 10 July 2023

Golshid Ranjbaran¹,
Diego Reforgiato Recupero²,
Gianfranco Lombardo³ &
…
Sergio Consoli ORCID: orcid.org/0000-0001-7357-5858⁴

EPJ Data Science volume 12, Article number: 24 (2023)

Abstract

Modern financial markets produce massive datasets that need to be analysed using new modelling techniques like those from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, such as predicting the likelihood of companies’ bankruptcy or detecting fraud. However, real-world financial data are often unbalanced, meaning that the classes are not equally represented in such datasets. This is the main issue, since any Machine Learning model is then trained mainly on the majority class, leading to inaccurate predictions. In this paper, we explore different data augmentation techniques to deal with very unbalanced financial data. We consider a number of publicly available datasets, then apply state-of-the-art augmentation strategies to them, and finally evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated according to their accuracy and their micro and macro F1 scores, and finally by analysing the precision and recall over the minority class. We show that a consistent improvement is achieved when data augmentation is employed. The obtained classification results look promising and indicate the effectiveness of augmentation strategies on financial tasks. On the basis of these results, we present an approach focused on classification tasks within the financial domain that takes a dataset as input, identifies what kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset along with an ensemble of different methods to tackle the underlying classification.

1 Introduction

Financial Technology (Fintech) aims to introduce new approaches to improve and automate the delivery and usage of financial services [11–15]. When Fintech first emerged in the 21st century,^{Footnote 1} the term initially referred to the back-end systems of established financial institutions. Then, there was a shift to more consumer-oriented services. Today, Fintech includes different sectors and industries such as education, retail banking, fundraising and nonprofit, and investment management.

The advent of Big Data and all the recent advances in Machine Learning (ML) and Deep Learning (DL) have the potential to revolutionize the banking industry through practical applications. Indeed, the fact that ML and DL can process a vast amount of data at high speed plays a key role that makes them suitable and applicable to real-world scenarios.

Several financial problems can be cast as computer science tasks related to classification [21]. In a classification task, given a fixed set of categories and a set of objects, the goal is to assign one or more categories to each object. In practical datasets within the financial domain, depending on the underlying task, data are often distributed unevenly among classes. This makes the dataset unbalanced and leads to a decrease in the predictive performance of ML and DL approaches [70]. In particular, this phenomenon leads to classification issues for datasets where only a very small number of samples belong to a certain class (minority class) whereas the remaining classes (majority classes) have a very large number of data instances. When imbalanced datasets need to be handled, basic ML and DL approaches mainly focus on the majority classes because of their frequency. Identifying the instances of the minority classes becomes more difficult, as they are often mislabeled as noise.
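The effect described above can be illustrated with a minimal sketch. The toy labels below (990 majority samples, 10 minority samples) and the trivial classifier that always predicts the majority class are assumptions made purely for illustration and are not taken from any dataset used in this paper. The sketch shows that accuracy alone can look excellent while the minority class is never detected.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive one."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in classes]
    return sum(scores) / len(scores)


# Hypothetical toy labels: class 0 is the majority, class 1 the minority.
y_true = [0] * 990 + [1] * 10
# A degenerate model that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
_, minority_recall, _ = precision_recall_f1(y_true, y_pred, positive=1)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} minority_recall={minority_recall:.3f} macro_f1={macro:.3f}")
```

This is why metrics such as macro F1 and the precision and recall over the minority class are reported alongside accuracy.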

To overcome the unbalancedness issue, many techniques and algorithms have been developed to reduce the gap between the classes [21]. They apply a resampling process at the data level, which aims at balancing minority and majority data samples before ML and DL approaches are trained on the data. These resampling techniques can be mainly categorised into under-sampling and over-sampling approaches [2, 6]. Under-sampling is commonly performed by randomly removing data samples from the majority class of the training set [25]. However, this random process is very likely to remove critical or important data samples from the training set, resulting in a severe loss in performance of the classification algorithm [24]. In addition, under-sampling is not even applicable to datasets whose unbalancedness is too extreme (as in the finance domain), since the removal process would drop too many data samples and yield a very poor training set, which would make the training of any ML or DL approach on such data practically impossible.
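As a rough sketch of random under-sampling, the snippet below keeps as many samples of each class as there are in the smallest class. The function name `random_undersample`, the toy data, and the 50-to-9950 class split are hypothetical and chosen only to show how much training data is discarded when the imbalance is extreme.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly keep, for every class, as many samples as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    kept_x, kept_y = [], []
    for y, xs in by_class.items():
        kept_x.extend(rng.sample(xs, target))  # sampling without replacement
        kept_y.extend([y] * target)
    return kept_x, kept_y


# Hypothetical toy dataset: 50 minority samples (label 1) out of 10,000.
samples = list(range(10_000))
labels = [1 if i < 50 else 0 for i in samples]

bal_x, bal_y = random_undersample(samples, labels)
discarded_fraction = 1 - len(bal_x) / len(samples)
print(len(bal_x), f"{discarded_fraction:.2%} of the data discarded")
```

In this toy setting the balanced training set retains only 100 of the 10,000 samples, which illustrates why under-sampling is impractical for extremely unbalanced data.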

In this context, to handle highly unbalanced data, different over-sampling, or data augmentation, strategies can be adopted [80]. Their goal is to increase the size of data used for training a model by artificially replicating the data instances of the minority class adopting some intelligent look-ahead mechanisms. Within the financial domain where datasets are often highly unbalanced, augmentation techniques are beneficial and help ML and DL approaches in increasing their prediction accuracies. This is the object of the work reported in this paper. In particular, we consider several problems within the financial domain most of which have the characteristic of being associated with very unbalanced datasets and focus on various data augmentation techniques in order to properly handle such extremely unbalanced data. Specifically, we tackle the following tasks:

Classifying Polish and American companies depending on whether they went bankrupt or not;
Classifying success or failure cases of bank marketing where an agent calls a client to sell a long-term deposit via telemarketing calls;
Classifying a credit card transaction as regular or fraudulent;
Classifying a credit approval as granted or not by the bank for a certain customer asking for a loan.

The datasets we have used for each task, except for the credit approval, are characterized by unbalancedness. Although the credit approval dataset is balanced, it consists of too few instances to train a classifier well. We aim to experiment with our proposed augmentation strategies in this particular kind of context too. Thus, we apply state-of-the-art augmentation strategies to all the datasets and evaluate the results for several ML and DL approaches. The considered augmentation strategies are of two kinds: i) those that augment the instances associated with the minority class and ii) those that evenly augment a balanced dataset represented by too few instances. We then compare the classification results against two levels of baselines: i) when no augmentation strategies are applied and ii) when basic augmentation strategies are used to increase the training data. On the basis of the results, which show the benefit brought by the augmentation techniques, we present a two-level ensemble approach to perform classification within the financial domain which: i) takes a dataset as input, ii) identifies from its features whether it has any unbalance issue and of what kind, iii) applies all the augmentation techniques that work best with the discovered kind, iv) performs an ensemble of different ML methods using the different augmented datasets. Note that a first-level ensemble is applied when generating the augmented datasets and a second-level ensemble is used when employing several ML methods for the classification task.
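The four steps of the two-level ensemble can be sketched as follows. This is only an illustrative outline under stated assumptions, not the implementation released with the paper: the thresholds in `imbalance_kind`, the two toy augmenters, the two toy classifiers, and the use of majority voting to combine predictions are all hypothetical choices made to keep the example self-contained.

```python
from collections import Counter


def imbalance_kind(labels, ratio_threshold=0.25, size_threshold=1000):
    """Hypothetical rule of thumb; both thresholds are illustrative only."""
    counts = Counter(labels)
    minority_share = min(counts.values()) / len(labels)
    if minority_share < ratio_threshold:
        return "minority"  # augment only the minority class
    if len(labels) < size_threshold:
        return "small"     # balanced but tiny: augment all classes evenly
    return "none"


def duplicate_minority(xs, ys):
    """Toy augmenter: replicate minority samples up to the majority size."""
    counts = Counter(ys)
    minority = min(counts, key=counts.get)
    extra = [x for x, y in zip(xs, ys) if y == minority]
    reps = max(counts.values()) // len(extra) - 1
    return xs + extra * reps, ys + [minority] * (len(extra) * reps)


def interpolate_minority(xs, ys):
    """Toy augmenter: add midpoints between neighbouring minority samples."""
    counts = Counter(ys)
    minority = min(counts, key=counts.get)
    pts = sorted(x for x, y in zip(xs, ys) if y == minority)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return xs + mids, ys + [minority] * len(mids)


def train_centroid(xs, ys):
    """Toy classifier: predict the class whose mean is closest."""
    centroids = {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
                 for c in set(ys)}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))


def train_1nn(xs, ys):
    """Toy classifier: predict the label of the nearest training sample."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(x - p[0]))[1]


def two_level_predict(train_x, train_y, test_x, augmenters, trainers):
    votes = [[] for _ in test_x]
    for augment in augmenters:      # first level: one dataset per augmenter
        aug_x, aug_y = augment(train_x, train_y)
        for train in trainers:      # second level: one model per ML method
            predict = train(aug_x, aug_y)
            for i, x in enumerate(test_x):
                votes[i].append(predict(x))
    # Majority vote over all (augmenter, model) pairs (an assumed combiner).
    return [Counter(v).most_common(1)[0][0] for v in votes]


# Hypothetical one-dimensional toy data: 8 majority and 2 minority samples.
train_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 5.0, 5.4]
train_y = [0] * 8 + [1] * 2
kind = imbalance_kind(train_y)
augmenters = [duplicate_minority, interpolate_minority] if kind == "minority" else []
predictions = two_level_predict(train_x, train_y, [0.5, 5.1], augmenters,
                                [train_centroid, train_1nn])
print(kind, predictions)
```

Each (augmenter, classifier) pair contributes one vote per test sample, so the same structure could also be used to compare individual combinations against the full ensemble.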

Therefore, the contributions of this paper are the following:

We tackle various tasks within the financial domain characterized by different kinds of unbalancedness in the data that we had to deal with using augmentation techniques;
We leverage state-of-the-art augmentation techniques to balance the training data and to bring benefits to the subsequent classification task;
We show that in every case, augmentation techniques are beneficial for classification tasks within the financial domain and suggest best practices to adopt such strategies in other similar problems in the same domain;
We adopt augmentation techniques on the bankruptcy dataset in the USA, which has been recently presented in [48] and has only been used for classification tasks by just leveraging undersampling techniques because of the strong unbalance condition of the bankruptcy class;
We define a two-level ensemble approach focused on the financial domain for classification tasks that can be used to select the best combination (augmentation, ML) approach and also to evaluate any method trying to automatically infer either the augmentation approach to use or the ML approach to run;
We share the code in a public repository^{Footnote 2} and keep it general so that it is possible to replicate our work and easily adapt it to tackle other classification tasks suffering from similar unbalance problems.

The remainder of this paper is organized as follows. Section 2 contains the related work on augmentation techniques for tasks within the financial domain. Section 3 describes the financial tasks that we have considered and tackled in this paper. The augmentation techniques that we have analysed and used within the mentioned tasks are detailed in Sect. 4. In Sect. 5 we present an overview of the ML algorithms we used to address the considered financial problems. The experiments that we have carried out, together with the results obtained for the aforementioned tasks and the proposed augmentation techniques, are described in Sect. 6. Furthermore, Sect. 7 gives the details of the two-level ensemble approach mentioned above. Finally, Sect. 8 ends the paper with conclusions and future directions.

2 Related work

A number of problems in finance can be formulated from a ML perspective as classification tasks, where it is often the case that the class of interest contains far fewer samples than the other classes [70]. As an example, suppose you want to identify the companies in a financial market that will go bankrupt. Here, the number of bankrupt companies is much smaller than the number of healthy ones, and the performance of any classifier trained on this data is generally poor, since it is hard to estimate an effective decision boundary that distinguishes such companies from the healthy ones with few observations. As a result, the minority class of bankrupt companies is typically categorized as data outliers or even noise of the healthy companies’ distribution. On the other hand, the design choice of undersampling the majority class of healthy companies leads to an improvement of the recall over the bankruptcy events but a very poor precision over the healthy ones. In both cases, performance is low, especially in dynamic and temporal contexts where the data distribution is not stationary over time [48]. The issue of imbalanced data in the financial domain is therefore very critical, and a number of research works have appeared in the literature to handle various financial problems by resampling the data to balance minority and majority classes before training a classifier [65]. Augmentation techniques, in particular, obtain a balanced dataset by increasing the individuals of the minority class with new synthetic samples [80], and are usually employed when highly unbalanced data need to be handled [24].

In this context, different over-sampling, or data augmentation, strategies can be adopted [80]. Among these, perhaps the most widely used and effective one is the Synthetic Minority Oversampling Technique (SMOTE) by Chawla et al. [17], which generates synthetic data for the minority class by using the similarities computed with k-Nearest Neighbors (kNN) for each of the minority samples. Veganzones and Séverin [74] used SMOTE in the context of bankruptcy prediction from the financial ratios of a large set of companies in France. Dal Pozzolo et al. [26] adopted SMOTE and an ensemble of incremental learning classifiers for the detection of fraudulent credit card transactions. A main disadvantage of SMOTE is that the over-sampled synthetic data may overlap in some cases with samples of the majority class, creating redundancy in the training phase of the ML algorithm. To deal with this issue, variations of this method have also arisen in the literature. Le et al. [43] considered various resampling techniques based on SMOTE to improve the performance of basic classifiers on the Korean companies’ bankruptcy data, ranging from 2016 to 2017. Similarly, in [31] the authors tested a hybrid approach combining SMOTE with an ensemble of classification algorithms for bankruptcy prediction on a real dataset from Spain. Pranavi et al. [60] combined SMOTE with a Random Forest for detecting fraudulent transactions, increasing the overall classification accuracy of the algorithm to a remarkable 90%. Garcia [33] proposed a combination of SMOTE with cluster-based under-sampling, leading to promising classification results for bankruptcy prediction. In [45] the authors proposed a fast and accurate ML model called XGBS, using the extreme gradient boosting model and the squared logistics loss (SqLL) for handling the bankruptcy forecasting problem, and validating the approach on imbalanced datasets for firms in Korea, US, and Japan. In [44], the authors developed an ensemble approach to handle the problem of data imbalance in bankruptcy forecasting, combining three algorithms, namely the CBoost algorithm [46], the technique with a cost-sensitive (HAOC) framework, and the XGBS algorithm [45]. CBoost uses the k-means clustering algorithm to calculate the initial weight vector for the training set. Next, CBoost performs several iterations to determine a set of weak classifiers, and finally, XGBS combines this set to create the final classifier.
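The core idea of SMOTE mentioned at the start of this paragraph can be sketched in a few lines. The snippet is a simplified illustration rather than the reference algorithm of [17]: the function name `smote_sketch`, the toy minority points, and the default `k=3` are assumptions, and a real application would more likely rely on an established library implementation. Each synthetic sample is placed at a random position on the segment joining a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority samples.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
synthetic = smote_sketch(minority, n_new=6, k=2)
for point in synthetic:
    print(point)
```

Because every synthetic point lies between two existing minority samples, it can fall in a region also occupied by majority samples, which is the overlap drawback noted above.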

Other relevant augmentation techniques for unbalanced financial problems have also been proposed in the literature. The authors of [36] proposed a novel boosting regression data resampling method based on a conditional variational autoencoder that can be used in different regression tasks, including those with unbalanced datasets. Others designed deep learning approaches for the prediction of hourly movement directions of different banking stocks leveraging stock prices and technical features [34]. The latter were reduced through recursive feature elimination [81].

Alarfaj et al. [3] compared different decision tree splitting criteria for credit card fraud detection, deriving a new measure for separating class samples which obtains decision tree solutions with higher performance. Alfaiz and Fati [4] considered over-sampling in various ML techniques, including random forests, decision trees, logistic regression, support vector machines, and artificial neural networks, to detect fraudulent credit card transactions, achieving their best performance when data augmentation was included, compared with the models alone. Last, but not least, Chugh and Malik [19] employed random forest and the Adaboost algorithm with data augmentation to detect fraudulent transactions in various countries.

Besides the issue of imbalanced data, it is worth mentioning another important problem when handling economic and financial data, which is data imputation. It is common to find missing values in financial time series, such as stock market data. The reasons are diverse: the closing periods of markets during holidays, the inability to capture financial data in a specified period of time, recording errors and noise, and so on [68]. Missing data make it daunting to predict future financial time series points using the most up-to-date market information. Thus, when the problem of missing data arises, there is an urgent need to handle it [72]. A number of data imputation methods have been proposed in the literature. For a comprehensive overview the reader is referred to [18, 67], and in particular to [39] for specifically handling financial time series.

Another common issue that is often encountered in financial time series is conditional heteroskedasticity [42]. In these cases, the level of volatility cannot be predicted over time, and weighted regressions represent a frequent approach to produce estimations [20, 47]. With heteroskedasticity, the least squares assumption of constant variance in the residuals is violated. Weighted regression minimizes, with the correct set of weights, the sum of weighted squared residuals to produce residuals with a constant variance [20].
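A minimal sketch of weighted regression for a single explanatory variable is given below. It assumes, for illustration only, that the noise variance of each observation is known and that the weights are set to the inverse of those variances; the function name `weighted_least_squares` and the toy data are hypothetical. The closed-form solution minimizes the sum of weighted squared residuals for the line y = a + b x.

```python
def weighted_least_squares(xs, ys, ws):
    """Fit y = a + b*x by minimizing sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    return intercept, slope


# Hypothetical data: the first four points lie on y = 2 + 3x, while the
# last one is a very noisy observation far from that line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 40.0]
variances = [1.0, 1.0, 1.0, 1.0, 1e9]   # assumed known noise variances
weights = [1.0 / v for v in variances]  # inverse-variance weights

a_w, b_w = weighted_least_squares(xs, ys, weights)
a_o, b_o = weighted_least_squares(xs, ys, [1.0] * len(xs))  # ordinary least squares
print(f"weighted: a={a_w:.3f}, b={b_w:.3f}; unweighted: a={a_o:.3f}, b={b_o:.3f}")
```

With equal weights the noisy observation dominates the fit, whereas the inverse-variance weights let the low-noise observations determine the line.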

As will be shown in the following, in our paper we have used data augmentation on various classical, very unbalanced, financial tasks. We have sampled the datasets using different data augmentation techniques to overcome class imbalances. The balanced datasets were used to train a set of popular classifiers, which have been validated according to common classification metrics. In line with the findings of the state-of-the-art works that applied specific augmentation techniques and specific machine learning or deep learning approaches, we have shown the benefit of augmentation techniques on different financial tasks involving unbalanced datasets and on different machine learning methods. Also, ours is the first work to propose a classification task using augmentation techniques on a new dataset related to the bankruptcy of publicly traded companies in the American stock market.

3 Proposed tasks

In this section, we describe the tasks that we have tackled in this paper.

3.1 Bankruptcy prediction

Corporate bankruptcy prediction is one of the main tasks in credit risk assessment because of the economic damage and social consequences of bankruptcy. Since the 2007/2008 financial crisis, it has become a priority for most financial institutions, regulatory agencies, and academics [66]. Bankruptcy prediction has been widely researched as a binary classification problem with several ML techniques [27, 54, 78, 84]. The basic goal is to assess the likelihood of companies’ default by looking for relationships between different types of financial data and the future financial status of a firm [30]. Barboza et al. [7] show that, on average, ML models exhibit 10% higher accuracy than scoring-based ones [55, 77]. Specifically, Support Vector Machines (SVM), Random Forests (RF), as well as bagging and boosting techniques were tested for predicting bankruptcy events and compared with results from discriminant analysis, Logistic Regression, and Neural Networks. However, the main open problem related to this task consists in dealing with a very large imbalance among the classes due to the rarity of bankruptcy events in the real economy. This issue becomes even worse when considering DL approaches, which usually require a vast amount of data for training [40, 51]. Private and public companies in general have different dynamics that may significantly impact the probability of running into financial troubles like bankruptcy. In light of this, to further analyze our results, we applied the same methodology to two different publicly available datasets:

Bankruptcy prediction for private companies: This dataset is related to the bankruptcy prediction of private companies in Poland between 2000 and 2012. The dataset has been proposed in [84] and is publicly available on the UCI ML Repository.^{Footnote 3} It provides financial statements for both healthy and bankrupted companies in their last 5 years of activity. Table 1 shows the total number of companies under investigation, within each time frame. Each company-year observation is composed of 64 accounting variables from the financial statements. A detailed description of these features is provided in the UCI repository.
Table 1 Statistical data about the bankruptcy dataset for Polish companies
Bankruptcy prediction for public companies in the stock market: This dataset is related to the bankruptcy prediction of publicly traded companies in the American stock market (New York Stock Exchange and NASDAQ) for the period between 1999 and 2018. It has been proposed in [48] and is publicly available on GitHub.^{Footnote 4} It provides data from 8262 different companies and, in particular, 18 accounting variables for each fiscal year. Companies are labelled each year depending on their state in the next year and according to the Securities and Exchange Commission (SEC) rules. Table 2 shows the dataset distribution.
Table 2 Firms distribution by year for the American stock market dataset

3.2 Bank marketing

Marketing managers try to improve the effects of their campaigns by carefully choosing the target audiences and the best communication channels. ML techniques can be used to improve these direct marketing initiatives [76]. One of the freely available datasets collected for this purpose comes from Portuguese marketing campaigns.^{Footnote 5} Deposit subscriptions as well as actual statistics were gathered from a marketing campaign of a Portuguese banking institution. The business goal is to find a model that can explain the success of a contact, that is, whether a client subscribes to a deposit. Such a model boosts campaign efficiency through better use of the available resources (such as human effort, phone calls, and time) and the selection of a high-quality and affordable group of potential buyers [53]. In this task, telemarketing calls are used to sell long-term deposits to target clients. Depending on who initiated the contact (the client or the contact center), contacts can be categorized as inbound or outbound. Both categories are present in the dataset. In an outbound contact, human agents call a list of clients during a campaign to sell the deposit; in an inbound contact, the customer calls the contact center for another reason and is then invited to subscribe to the deposit [53]. The result is an interaction that can either be a success or a failure, which translates into a binary classification task. This study takes into account real data gathered from a Portuguese retail bank between May 2008 and June 2013, for a total of 52,944 phone contacts. Only 6557 entries of the dataset are related to successes, making the dataset highly imbalanced. Each record includes the output target, i.e. the contact outcome (“failure”, “success”), and candidate input features like age, education, housing, marital status, etc. Some of these features are textual, so they must be encoded into numbers.

Some of the bank marketing features considered in the research studies are reported in Table 3. The table specifies, for example, the age, marital status, and annual balance of each consumer. It also reports whether the customer owns a debit card, whether he/she experienced a loan payment delay, etc. The agent who made the call, the date, and the duration of the conversation are additional details included in the dataset.

Table 3 Examples of some Bank Marketing features
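As noted in the description of this task, textual features must be encoded into numbers before training. One common option is one-hot encoding, sketched below. This is an assumed choice for illustration, since the encoding scheme is not specified here, and the field names and values of the three records are hypothetical rather than actual rows of the dataset. The last lines compute the share of successful contacts from the counts reported above (6557 successes out of 52,944 contacts), which is roughly 12%.

```python
def one_hot_encode(records, categorical):
    """Turn each textual field into one 0/1 column per observed value."""
    # Sorted category values seen for each categorical field.
    levels = {f: sorted({r[f] for r in records}) for f in categorical}
    encoded = []
    for r in records:
        # Numeric fields are kept as floats, in their original order.
        row = [float(v) for f, v in r.items() if f not in categorical]
        for f in categorical:
            row.extend(1.0 if r[f] == level else 0.0 for level in levels[f])
        encoded.append(row)
    return encoded, levels


# Hypothetical records; field names and values are illustrative only.
records = [
    {"age": 35, "marital": "married", "housing": "yes"},
    {"age": 52, "marital": "single", "housing": "no"},
    {"age": 41, "marital": "married", "housing": "no"},
]
encoded, levels = one_hot_encode(records, categorical=["marital", "housing"])
for row in encoded:
    print(row)

# Class share computed from the counts reported for this dataset.
success_share = 6557 / 52944
print(f"successful contacts: {success_share:.1%}")
```

The encoded rows can then be passed to any of the augmentation techniques and classifiers, which expect purely numeric inputs.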
