 Regular article
 Open Access
Quantifying decision making for data science: from data acquisition to modeling
 Saurabh Nagrecha^{1} and
 Nitesh V Chawla^{1}
Received: 27 January 2016
Accepted: 9 August 2016
Published: 20 August 2016
Abstract
Organizations, irrespective of their size and type, are increasingly becoming data-driven or aspire to become data-driven. There is a rush to quantify the value of their own internal data, or the value of integrating their internal data with external data, and of performing modeling on such data. A question that analytics teams often grapple with is whether to acquire more data, expend additional effort on more complex modeling, or both. If these decisions can be quantified a priori, they can be used to guide budget and investment decisions. To that end, we quantify the Net Present Value (NPV) of the tasks of additional data acquisition or more complex modeling, which are critical to the data science process. We develop a framework, NPVModel, for a comparative analysis of various external data acquisition and in-house model development scenarios using NPVs of costs and returns as a measure of feasibility. We then demonstrate the effectiveness of NPVModel in prescribing strategies for various scenarios. Our framework not only acts as a suggestion engine, but also provides valuable insights into budgeting and roadmap planning for Big Data ventures.
Keywords
 cost sensitive learning
 business value
 external data
1 Introduction
Organizations are rapidly embracing data science to help inform their decision making and generate an impact in their business or operations, whether through increased revenue, reduced costs, or improved efficiencies. To that end, organizations are incorporating analytics programs to not only deliver value from their internal data, but also connect their internal data with external data sources to develop a more complete data profile for modeling.^{1} Acquiring external data is not cheap and requires an investment from the organization. Similarly, developing more advanced or complex models may also require investment in people or computational resources. Analytics teams may therefore grapple with the following questions: 1) Can they optimize aspects of the data science process to lower the costs, resulting in a higher overall Return on Investment (RoI)? 2) Is there an objective way to compare the value of different strategies: data acquisition, modeling, or both?
While there is a paradigm of cost-sensitive learning or budgeted learning, it does not take into account the explicit costs of data acquisition or modeling, or the Net Present Value (NPV) of implementing the overall data science process or the analytics program. This adds a whole new dimension to the problem of implementing and deploying analytics strategies. For example, an organization switching external data providers would incur upfront costs to switch the data pipeline, potential warehousing, integration into existing models, etc. These concerns are further complicated by the variety of offerings by potential data vendors in terms of features, instances, costs, and delivery model.
Problem statement
1. Should one invest dollars on external data acquisition or more complex/advanced modeling or both?
2. How much money should one invest in external data acquisition? What is a fair price estimate to pay for the expected outcomes?
3. What returns should one expect?
4. How can one pivot from an existing strategy over time? How should teams and organizations chart a roadmap for the data analytics project horizon?
These questions can form an objective data-driven set of strategies for analytics teams as they consider the cost and impact of analytics programs for an organization. The answers to these questions rely on a deeper understanding of how one acquires external data and how one develops their analytics solutions (modeling).
1.1 Assumptions
We assume that the following costs, or estimates of them, are known beforehand: internal model development costs, external data acquisition costs, opportunity costs, misclassification costs at each stage, and costs involved with pivoting strategies. These costs can be dynamic and subject to change over the life cycle of the project. If the costs are unknown, then we can use minimum and maximum costs to establish bounds on the final NPV. The cost matrix presented in Section 3.3 shows how this is done. One relevant use case for this would be to help negotiate the price for external data with an external provider.
We do not, a priori, know whether or not a certain external dataset will improve performance. Instead, in our approach, we use a standard industry practice: running the model on a pilot external dataset, and then evaluating its performance. In our framework, NPVModel, we take this further and convert it into the NPV of costs for the external (pilot) dataset. If the NPV of costs is lowered, then we can say that this external dataset was useful. The generalizability of this relies on the assumption that the pilot data (on which these decisions are made) is indicative of the test data (which is used in practice). Over the course of time, if the external data no longer adds to the performance of the model, then this is clearly reflected in its NPV of costs, which will be greater than or equal to that of a model run without external data.
Since it is highly subjective to comment on an industry-agnostic return on investment (RoI) from in-house data science development, we abstain from commenting on the inner workings of the cost-benefit obtained from in-house development of machine learning models. However, we do allow for practitioners to use their own values for internal model development in order to exploit the full range of strategies incorporated in the NPVModel.
It is also assumed that relevant external data providers serve as readily compatible sources of data relevant to the prediction task. In practice, some offerings may contain irrelevant features bundled in with the relevant features. Features known to be irrelevant can be weeded out at the data integration level, or at the modeling level (feature selection during preprocessing). Section 3 contains further details on the subject.
1.2 Target metrics
A machine learning or statistical model’s performance is evaluated on metrics like accuracy, precision, recall, Area Under the ROC Curve (AUC), \(f(\beta)\), etc. on the incoming test data; but in order for these to be usable for business decisions, they need to be converted to a dollar value. A static dollar value at a given point in the future needs to be contextualized for current considerations. Thus, the NPV of this dollar amount offers a good estimate of that cost, and as a result, the time value of the predictions obtained. This now enables us to compare inherently different strategies head-to-head, solely on the basis of expected returns. We refine the question as: what is the monetary value of a dynamic prediction system over a period of time?
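As a minimal worked illustration of this time-value adjustment, the standard NPV of a stream of costs can be written as follows. The symbols \(C_{t}\), \(r\), and \(T\) are notation assumed here for illustration and are not taken from the paper's own equations.

\[
\mathrm{NPV} = \sum_{t=0}^{T} \frac{C_{t}}{(1+r)^{t}}
\]

Here \(C_{t}\) is the dollar cost incurred in period \(t\), \(r\) is the discount rate per period, and \(T\) is the last period considered. For example, under these assumptions a cost of 1000 dollars incurred two periods from now at \(r = 0.10\) is worth \(1000 / 1.1^{2} \approx 826.45\) dollars today.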
1.3 Contribution
There can be an inherent trade-off between investing resources into external data acquisition and in-house model development. The feasibility of these strategies is evaluated statically, frequently during initial model development. However, a static evaluation fails to take into account the time value of model development. To account for this consideration, our approach is to decouple the considerations regarding acquisition of external data and model development, translate each component’s contribution to monetary returns, and then evaluate the relative strength of an investment strategy.
Our contribution is to propose and develop a recommendation framework, NPVModel, which suggests the best possible business practice for analytics tasks or strategy. In this framework, one can unify costs of model development, external data acquisition, and those of the time value of predictions; this facilitates the development of strategies that derive synergy from the appropriate confluence of model development and external data acquisition.
2 Related work
Since this paper reconciles multiple aspects of machine learning, we provide a brief overview of the relevant techniques in literature from the following subdisciplines.
2.1 Cost sensitive classification
A survey of cost-sensitive learning techniques over the years is covered in [1]. Motivated by their popularity, [2] focuses on cost-sensitive learning for tree-based classification techniques.
These papers evaluate performance by using multiclass datasets from the UCI repository, associating arbitrary costs with (mis)predictions for each class. These papers demonstrate effective performance by showing that their cost is lower than the contemporary state of the art.
Cost sensitive learning can be implemented using an inherently cost-sensitive classification approach, or using a “wrapper” which converts an otherwise cost-agnostic classifier into a cost sensitive one. Popular techniques for inherently cost sensitive classifiers involve minority class resampling, treating thresholds of minority class differently, tweaking the splitting criteria for minority class, pre- and post-pruning of hypotheses, and combinations thereof.
MetaCost [3] is a widely accepted technique that acts as a cost-sensitive wrapper around an existing classification technique. Approaches like CostSensitiveClassifier (CSC) [4], Cost Sensitive Naive Bayes [5], and Empirical Thresholding [6] operate similarly and can be alternatively used as wrappers instead of MetaCost. Other meta-learning wrapper techniques such as Costing [7] and Weighting [8] employ sampling in the training phase and then use cost-agnostic classifiers on the resampled data. Of the two (wrappers and resampling), wrapper based meta-learning techniques are easier to integrate into the workflow of existing solutions.
Our main focus is in the application of cost sensitive methods as part of a broader framework, and not the finer workings or comparisons of each of the specific solutions. To that end, we use MetaCost as a representative method from the wrapper techniques discussed above.
2.2 Timeliness of prediction
Cost sensitive classification generally works in a static framework, and does not explicitly address the timeliness of predictions. Applications require a classification system to perform over the duration of their deployment. Domingos [9] considers the feasibility of implementing a cost model for machine learning systems, and illustrates a net-present-value investment-return model until perpetuity. In this paper, we consider a dollar-value based cost, but it can be easily modified to other cost considerations as well.
2.3 External data acquisition
Another avenue to potentially enhance the business value of a prediction system is to incorporate external data. However, this often raises the question: is external data acquisition feasible? In order for it to be feasible, external data must provide an increase in performance, for which it must enhance the discernibility of the classifier. Provost and Weiss have discussed [10] the impact of class distribution in the training data on classifier performance. This answers the question regarding what quality of external data one must aim for, when such data is available at a premium. Transforming machine learning metrics to dollar amounts has been discussed in [11]. Again, the authors’ treatment of the subject is limited to a static setting. The paper by Weiss et al. [12] is one of the few papers to discuss costs of acquiring external data, though it discusses it in terms of a CPU based cost.
2.4 Active learning
In the spirit of labeled instances being available at a premium, the field of active learning has various solutions that must be acknowledged with respect to the NPVModel. Active learners seek out additional data labels at a cost. Cost sensitive active learners can be induced as shown in [13]. Ideas from Proactive learning [14] can be incorporated just the same as for active learning. Proactive learning goes beyond some of the assumptions in active learning and relaxes the assumptions that the external ‘oracle’ is always right, always available, always costs the same to query, or that there is just one oracle. NPVModel applies the concept of time value of costs to the existing idea of active learning.
Overall, in literature, we see that the problems of model development, external data acquisition, and time value of prediction have been separately addressed, but no singular work ties these concepts together.
3 Proposed framework: NPVModel
Data science or analytics strategies can be characterized in terms of external data acquisition and/or model development. The data acquisition costs could include the cost of purchasing and integrating new data into a data science workflow. The cost of model development can be the human resources expended towards model development as well as possible computational cost. Costs associated with cloud computing services^{2} can be estimated and factored in. Each of these decisions has an underlying investment-return model. Our goal is to best characterize this model in terms of an objective NPV so as to compare vastly differing strategies head-on. Section 3.1 discusses several models used for external data investment strategies. Each strategy’s NPV of cost is calculated and the strategy with the highest NPV (i.e. lowest NPV cost) is deemed most feasible. The most feasible strategy is a solution that informs analysts of three parameters: what kind of external data model is best, what kind of model development is best, and when any or each of these should be deployed, specific to their environment.
3.1 External data acquisition
External data can be obtained under three basic models: additional training data instances, additional features/attributes, or both, as illustrated in Figure 1. In the case of additional instances, one can purchase these all at once, or in batches [15, 16]. The former is relevant in cases where the external data provider does not update their data at least within the period of the project. In reality, these could be external data sources where it may not even be necessary to update the data sources so long as sufficient external data is available. The latter is reflective of the practice followed by many external data providers, who themselves keep updating their data warehouses with new data instances. We use both models for our experiments where external data instances are added.

Case A: (The basic model) No external data.

Case B: Up-to-date batch-wise external instances.

Case C: One-time external instance dump.

Case D: External features for each of the in-house instances.
The setup of the experiments in this paper is such that each dataset’s prediction task is evaluated with and without various cases of external data. Since we aim to make no assumption regarding what form the external data can take, the setup covers each scenario from Section 3.1, viz. Cases A through D. In case of external data instances, the instances are only taken into account for model development and not for prediction. In case of external features, the feature values are added to the feature space of the internal data. This means that upon querying an external dataset with an instance (\(X_{i} = (x_{\mathrm{int},1},\ldots,x_{\mathrm{int},m})\)), we get a new set of features (\((x_{\mathrm{ext},1'},\ldots,x_{\mathrm{ext},m'})\)), which are added to the feature space of the internal dataset, resulting in the instance \(X_{i} = (x_{\mathrm{int},1},\ldots,x_{\mathrm{int},m},x_{\mathrm{ext},1'}, \ldots,x_{\mathrm{ext},m'})\).
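The external-feature case (Case D) can be sketched as a keyed lookup followed by concatenation. This is only an illustration: the function name `augment_with_external`, the lookup key, the sample values, and the padding of missing external records with `None` are all assumptions, not details given in the paper.

```python
from typing import Dict, List, Optional, Sequence


def augment_with_external(
    internal: Sequence[float],
    key: str,
    external_lookup: Dict[str, Sequence[float]],
    n_external: int,
    missing: Optional[float] = None,
) -> List[Optional[float]]:
    """Append the external features found under `key` to an internal feature vector."""
    ext = external_lookup.get(key)
    if ext is None:
        # Assumption: pad with a placeholder so every instance keeps the same width.
        ext = [missing] * n_external
    return list(internal) + list(ext)


# Hypothetical external table keyed by a region identifier.
external = {"area_25": [0.31, 27.6], "area_8": [0.12, 12.9]}

# Internal instance (x_int,1 ... x_int,m) extended with (x_ext,1' ... x_ext,m').
x_i = augment_with_external([1.0, 0.0, 3.0], "area_25", external, n_external=2)
x_missing = augment_with_external([1.0], "unknown_area", external, n_external=2)
```

How missing external records are handled is a design choice left open here; a practitioner might instead impute values or drop the instance.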
3.2 Machine learning model development
In-house model development comes with many associated costs. In order to evaluate the RoI of an in-house modeling (analytics) team, we break it down into its two components: the investment (the salaries and upkeep costs) and the returns (the difference in NPV of predictive performance). We set forth tiers of costs-to-company for certain in-house modeling costs. These costs are merely estimates of how much overall investment has gone into development, and serve to indicate the break-even point where it would be feasible to pursue in-house model development.
3.3 Predictions to dollar value
Thus, when dealing with optimization on the cost matrix, we can simply focus on η as a parameter. We consider sweeps of η in order to show how a system would react when subject to differing cost objectives as trade-offs between false positives and false negatives.
1. Get cost matrix (CM);
2. Learn model (\(M_{0}\)) from training data (\(Tr_{0}\)) at \(t=0\);
3. Learned model is optimized on cost matrix;
4. Predict subsequent test data instances (\(Te_{1}\)) based on model \(M_{0}\);
5. Combine \(Te_{1}\) with existing data \(Tr_{0}\), and retrain model \(M_{0}\) to \(M_{1}\) to minimize costs;
6. Repeat for subsequent instances.
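The steps above can be sketched as a predict-then-retrain loop. The sketch below is a toy under stated assumptions: a single-feature threshold learner stands in for whichever classifier is actually used, and the names `fit_threshold`, `run_batches`, `c_fp`, and `c_fn` are hypothetical.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[float, int]  # (single numeric feature, label in {0, 1})


def batch_cost(preds: Sequence[int], labels: Sequence[int],
               c_fp: float, c_fn: float) -> float:
    """Dollar cost of one batch: false positives and false negatives are priced."""
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return fp * c_fp + fn * c_fn


def fit_threshold(train: Sequence[Instance], c_fp: float, c_fn: float) -> float:
    """Toy learner: choose the cut on the feature that minimizes training cost."""
    labels = [y for _, y in train]
    candidates = sorted({x for x, _ in train})
    candidates.append(float("inf"))  # corresponds to "predict everything negative"
    return min(
        candidates,
        key=lambda t: batch_cost([int(x >= t) for x, _ in train], labels, c_fp, c_fn),
    )


def run_batches(train0: Sequence[Instance], test_batches: Sequence[Sequence[Instance]],
                c_fp: float, c_fn: float) -> List[float]:
    """Return the misclassification cost incurred on each incoming batch."""
    train = list(train0)
    threshold = fit_threshold(train, c_fp, c_fn)            # steps 2 and 3
    costs = []
    for batch in test_batches:
        preds = [int(x >= threshold) for x, _ in batch]     # step 4
        costs.append(batch_cost(preds, [y for _, y in batch], c_fp, c_fn))
        train.extend(batch)                                 # step 5: merge the batch
        threshold = fit_threshold(train, c_fp, c_fn)        # step 5: retrain
    return costs


train0 = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
batches = [[(0.5, 1), (0.3, 0)], [(0.6, 1), (0.4, 0)]]
per_batch_costs = run_batches(train0, batches, c_fp=1.0, c_fn=10.0)
```

In this toy run the first batch contains a positive that the initial model misses; after retraining on that batch the cut moves and the second batch is classified without error.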
Each batch generates an associated cost at that given time instance. Added to this cost is the cost of whatever strategy is in effect: the cost of external data, the cost of model development, etc. This total cost needs to be contextualized in terms of its time value; therefore, an appropriate discount rate is applied and an NPV calculation is made over all the batches. Each strategy is evaluated in terms of NPV alone and compared with the baseline.
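A minimal sketch of this per-batch NPV calculation follows. The function names, the sample dollar figures, and the choice to discount the first batch at period 0 (controlled by the hypothetical `first_period` argument) are assumptions for illustration.

```python
from typing import Sequence


def npv(cash_flows: Sequence[float], rate: float, first_period: int = 0) -> float:
    """Discount a sequence of per-period amounts back to the present."""
    return sum(
        c / (1.0 + rate) ** t for t, c in enumerate(cash_flows, start=first_period)
    )


def strategy_npv_cost(misclassification_costs: Sequence[float],
                      strategy_costs: Sequence[float], rate: float) -> float:
    """NPV of total cost: per-batch prediction cost plus the cost of the strategy."""
    if len(misclassification_costs) != len(strategy_costs):
        raise ValueError("need one strategy cost per batch")
    totals = [m + s for m, s in zip(misclassification_costs, strategy_costs)]
    return npv(totals, rate)


# Hypothetical figures: the baseline pays nothing for data but misclassifies more;
# the external-data strategy pays 30 per batch and misclassifies less.
baseline = strategy_npv_cost([100, 100, 100], [0, 0, 0], rate=0.10)
with_external = strategy_npv_cost([60, 60, 60], [30, 30, 30], rate=0.10)
external_is_feasible = with_external < baseline
```

With these made-up numbers the external-data strategy has the lower NPV of cost, so it would be preferred over the baseline.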
1. Get cost matrix (\(CM_{0}\));
2-4. … (same as before);
5. Combine \(Te_{1}\) with existing data \(Tr_{0}\);
6. Retrain model \(M_{0}\) to \(M_{1}\) using a discounted cost matrix \(CM_{1}\) and data costs;
7. Repeat for subsequent instances.
This brings us to the overall comparison of the objective function: the cost. Our approach is to calculate this cost function separately for each time value in the future. This makes it behave very differently from the conventional static picture of cost sensitive learning: for example, an error in the immediate time frame is now costlier than that same error occurring at a future time. More discussion on the Discount Rate is available in Additional file 1, Section 3.
3.4 Integrating with standard techniques
The second matrix in the above equation considers costs for correctly classified instances as well. This is because the cost of external data acquisition and that of model development are agnostic to the prediction outcome. Since this cost matrix is time sensitive, it is important that the correct version of this matrix be used for all considerations, or else the NPV consideration would not be relevant to the desired time period. It should be noted that this modified cost matrix is consistent with the total cost calculations as per Section 3.3.
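One way to picture such a modified cost matrix is sketched below. It is an assumption-laden illustration rather than the paper's equation: the dictionary layout, the name `modified_cost_matrix`, a flat per-instance overhead for data and model costs, and discounting of the whole cell by \((1+r)^{t}\) are all choices made here.

```python
from typing import Dict


def modified_cost_matrix(cm0: Dict[str, float], per_instance_overhead: float,
                         rate: float, t: int) -> Dict[str, float]:
    """Cost matrix for period t, expressed in present-day dollars.

    Every cell, including the correct outcomes TP and TN, carries the same
    overhead because data and model costs do not depend on the prediction.
    """
    discount = (1.0 + rate) ** t
    return {
        outcome: (cost + per_instance_overhead) / discount
        for outcome, cost in cm0.items()
    }


# Hypothetical base misclassification costs and overhead per instance.
cm0 = {"TP": 0.0, "FP": 1.0, "FN": 10.0, "TN": 0.0}
cm_now = modified_cost_matrix(cm0, per_instance_overhead=0.25, rate=0.10, t=0)
cm_later = modified_cost_matrix(cm0, per_instance_overhead=0.25, rate=0.10, t=2)
```

Under these assumptions the same false negative costs less in present-day terms when it occurs two periods out than when it occurs immediately, and correct predictions are no longer free.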
3.5 Variable cost modeling
The above costs are known beforehand for the training set. In practice, one might not have this luxury for the test set. The paper by Zadrozny and Elkan [17] obtains estimates of overall cost in the test set by establishing boundaries.
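When only minimum and maximum costs are available per period, simple interval bounds on the NPV can be computed as sketched here. This is a plain interval calculation and not the estimation procedure of [17]; the function names and sample ranges are hypothetical, and costs are assumed to be non-negative.

```python
from typing import Sequence, Tuple


def npv(cash_flows: Sequence[float], rate: float) -> float:
    """Discount per-period amounts, with the first period at t = 0."""
    return sum(c / (1.0 + rate) ** t for t, c in enumerate(cash_flows))


def npv_bounds(cost_ranges: Sequence[Tuple[float, float]],
               rate: float) -> Tuple[float, float]:
    """Lower and upper bounds on the NPV of cost from per-period (min, max) costs."""
    for low, high in cost_ranges:
        if low < 0 or high < low:
            raise ValueError("each range must satisfy 0 <= low <= high")
    # Discounting preserves order, so the extremes come from all-min and all-max.
    lower = npv([low for low, _ in cost_ranges], rate)
    upper = npv([high for _, high in cost_ranges], rate)
    return lower, upper


# Hypothetical ranges: each batch costs somewhere between 80 and 120 dollars.
lower, upper = npv_bounds([(80.0, 120.0), (80.0, 120.0)], rate=0.10)
```

If the upper bound for one strategy falls below the lower bound for another, the comparison is settled without knowing the exact costs.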
4 Experimental setup
Datasets used: the aim is to obtain datasets that resemble those used in contemporary cost sensitive prediction tasks and that have corresponding external datasets
| Dataset | Minority class (%) | Instances | External data instances | Time stamps | Costs |
| --- | --- | --- | --- | --- | --- |
| Pendigits | 8.3 | 13,821 | simulated | simulated | simulated |
| Medicare | 12.9 | 611,785 | 853,360 | simulated | simulated |
| Open city data | 33.2 | 250,000 | 77 | actual | actual |
4.1 Datasets
UCI dataset: pendigits
The imbalanced UCI Machine Learning Repository datasets, which have varying levels of binary class imbalance, are used in many standard cost sensitive learning papers [2, 3, 10]. Since this paper’s contribution is not directly tied to the number of instances in a dataset or its class imbalance, we choose the pendigits dataset as an example from the UCI datasets. Choosing any other dataset would result in a similar process for gleaning insights, which is the overarching goal of this paper.
To establish baselines and preserve simplicity, the multiclass pendigits dataset has been converted to a binary class dataset in keeping with [10]. It should be noted that the techniques discussed in this paper can be directly extended to multiclass problems. Since the UCI datasets are meant for standalone prediction tasks, we discuss how external data in a dynamic form is simulated from UCI datasets in Additional file 1, Section 5.
Medicare data
Medicare.gov contains official data from the Centers for Medicare & Medicaid Services (CMS). Given a set of descriptors, we would like to predict whether a given health care professional is a physician or not. The National Downloadable File contains information about all of the providers enrolled under the Medicare program [18]. This is to be considered our in-house data. The procedures they perform are enumerated in the Medicare Provider Utilization and Payment Data: Physician and Other Supplier file [19]. This file can be used as an external feature lookup dataset. The particulars of delineating class and feature selection are as per Section 5 in Additional file 1.
The feature vector is composed of: location of medical school at a city level, graduation year from medical school, the gender of the health care professional, and participation in various initiatives like PQRS [21], EHR [22], eRx [23], and Million Hearts [24].
Open city data
The Public Safety dataset of crimes in Chicago can be used to predict whether an arrest will be made in a particular case. This dataset is available publicly [20] and it reflects reported incidents of crime that occurred in the City of Chicago from 2001 to approximately the present day. As our external queryable feature set, we have various census based socioeconomic indicators at the “community area” level. With this in mind, we want to find the applicability of these socioeconomic factors towards improving the predictive power of the crime reports dataset. The external data instances are queried using the relevant “community area”, thus postulating that socioeconomic factors may be indicative in predicting arrests.
4.2 Testing workflow
We take the datasets from Section 4.1 in batches (real or simulated, as the case may be). We then create versions of them augmented with queried external data according to Section 3.1, and perform predictions as per Section 3.3 using techniques in Additional file 1, Section 2. We repeat this for various cost factors to cover different penalties for False Negatives.
5 Results
The experiments in Section 4 are designed to support the applicability of NPVModel to real business decisions. As a result, the parameters each experiment deals with reflect the same idea. The results for each dataset have been adjusted for number of instances in order to enable a cursory comparison across various datasets. These results serve as use cases for decisions that can be made with NPVModel.
5.1 Interpreting the results
5.2 Should one get external data?
As the results in Figure 3 indicate, it may not always be the best idea to get external data. In the pendigits dataset, it is directly advisable to acquire external data in the basic classifier model (DT). In order to consider the feasibility of a hybrid approach involving CSDT, one must also factor in the costs of model development as per specification. This is discussed in Section 5.4.
5.3 How much external instance data should one get?
5.4 Model development strategies
5.5 Cost factor
When comparing similar strategies across cost factors, we see that the cost factor of 1 incurs larger costs overall, since the majority class is penalized more than its counterpart, and thus by sheer abundance, drives up the NPV costs. It is also evident that the convexity of the curves changes by changing the cost factor; the price point of $1 is a good example of this. The best strategy for a cost factor of 1 is to get the same amount of external instances as the incoming instances. For a cost factor of 100, the best suggestion is to get half of the number of incoming instances instead.
In terms of strategy, inferences from one cost factor model are not directly transferable to the other by simply adjusting for a different cost penalty model.
5.6 Price negotiation for external data
In order to minimize the total cost function, we would like to minimize the price for external data. As seen from Figure 3, there exists a maximum permissible price point, below which acquiring external data is feasible. From a business perspective, it is thus useful to locate these points. It is to be noted that since these maximum permissible price points are bound by feasibility, they are indirectly linked to the underlying cost factor. A higher cost factor would drive the maximum feasible price point higher, as can be seen from Figure 8. For a cost factor of 1, it is feasible to acquire external data if each instance costs $0.25 or less, whereas in the case of a cost factor of 100, we can afford to acquire external instances at $1 each.
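Under a simple linear pricing assumption, the maximum permissible price point can be located in closed form, as sketched below. The function name `max_feasible_price`, the per-instance pricing model, and the sample figures are assumptions for illustration; the paper reads these points off its empirical NPV curves.

```python
from typing import Sequence


def max_feasible_price(baseline_costs: Sequence[float],
                       external_costs: Sequence[float],
                       instances_per_batch: Sequence[float],
                       rate: float) -> float:
    """Highest per-instance price at which external data still lowers NPV cost.

    baseline_costs: per-batch misclassification cost without external data.
    external_costs: per-batch misclassification cost with external data,
                    excluding the purchase price itself.
    """
    def discounted(xs: Sequence[float]) -> float:
        return sum(x / (1.0 + rate) ** t for t, x in enumerate(xs))

    saving = discounted(baseline_costs) - discounted(external_costs)
    volume = discounted(instances_per_batch)
    if volume <= 0:
        raise ValueError("need a positive number of purchased instances")
    # A negative saving means no positive price is feasible.
    return max(0.0, saving / volume)


# Hypothetical figures: external data saves 40 dollars per batch of 40 instances.
break_even = max_feasible_price([100, 100], [60, 60], [40, 40], rate=0.10)
no_deal = max_feasible_price([60, 60], [100, 100], [40, 40], rate=0.10)
```

A buyer could use such a break-even figure as the ceiling in a price negotiation: any quoted price above it makes the baseline strategy the cheaper one.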
5.7 Scalability
NPVModel consists of two main components: (1) training models on data and (2) computing NPV over a search space of parameters. These parameters can include all relevant external data strategies (Section 5.2), amount of external data (Section 5.3), cost of external data (Section 5.6), various candidate models (Section 5.4), cost factor(s) (Section 5.5) and discount factor(s) (Additional file 1, Section 3).
The training component’s complexity is directly dependent on the classification techniques considered. NPVModel facilitates the use of any user-specified classifier to be considered in the training stage, so the overall complexity for this stage is simply the complexity of the classifier. No additional cost is imposed by the NPVModel.
The second component consists of computing the modified cost matrix from Section 3.4 for each of the applicable parameters in the grid search, and each of its historical test sets. The overall complexity of this step is simply the combined cardinality of the grid comprising all parameter permutations, scaled by the number of test instances. If a particular parameter cannot be changed, its cardinality in the grid search is simply equal to 1. All of the parameters listed in this paper appear as design choices in the complexity of implementing NPVModel.
For a given time, the cost computation is dependent on the respective model having been trained. Since the NPV calculation is independent across parameters, this becomes an embarrassingly parallel computation.
Overall, NPVModel’s scalability is directly affected by that of the training complexity of models considered (classifier dependent), the number of incoming test instances (linear), and of the parameter space one seeks to optimize over (multiplicative).
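The multiplicative growth of the parameter grid, and the independence of its points, can be sketched as follows. The grid values are hypothetical, and `evaluate` is a placeholder standing in for the real NPV computation of one configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from math import prod
from typing import Dict, List, Sequence

# Hypothetical search space; a parameter that cannot change has cardinality 1.
grid: Dict[str, Sequence] = {
    "strategy": ["A", "B", "C", "D"],
    "price": [0.25, 0.5, 1.0],
    "cost_factor": [1, 100],
    "discount_rate": [0.05],
}


def grid_points(space: Dict[str, Sequence]) -> List[dict]:
    """Enumerate every combination of parameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def n_cost_evaluations(space: Dict[str, Sequence], n_test_instances: int) -> int:
    """Grid cardinality scaled by the number of test instances."""
    return prod(len(values) for values in space.values()) * n_test_instances


def evaluate(point: dict) -> float:
    """Placeholder only: a real implementation would return the NPV of cost."""
    return point["price"] * point["cost_factor"]


points = grid_points(grid)
# Each grid point is independent of the others, so they can be mapped in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(evaluate, points))
```

For CPU-bound NPV computations a process pool or a cluster scheduler would likely be the more suitable executor; a thread pool is used here only to keep the sketch short.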
Throughout the experiments in the paper, we see that the insights can be derived for small datasets (pendigits) and for larger datasets (medicare and open city data). The modified cost matrix defined in Section 3.4 is agnostic to the size of the dataset in question. This makes NPVModel a powerful and easy addition to the existing modeling techniques and considerations.
6 Conclusion
We proposed and demonstrated a unified framework, NPVModel, to help quantify the trade-offs between data acquisition and modeling. The main contribution of this paper is to serve as a strategy board for business decisions when implementing and deploying analytics programs. Throughout the paper, effort has been made to address contemporary analytics and data science needs. In Section 3, we bridge the gap in the literature between external data acquisition and model development strategies, and provide a method to obtain a dollar value of predictive output. The methods outlined here can not only help choose a strategy for immediate use, but also provide a horizon for the future. Section 5 discusses how an organization might need to think of cost factor and discount rate as design parameters in the context of their data science process. We not only demonstrate whether or not it is a good idea to invest in external data or model development, but we also demonstrate which implementations of such strategies are worthwhile.
Future work
In this paper, we considered the scenario in which the acquisition of external feature data occurs for all instances in the dataset. That might not apply for all possible use cases. We propose that active learning can then be applied within this construct. Furthermore, we did not incorporate the notion of “data aging”; that is, we place the same importance on older and newer data. So, it remains to be investigated whether older data’s value decays over time. If so, how does it affect data management strategies?
Modeling in our work assumes development and application of statistical or machine learning based algorithms/methods and performing analytics.
Declarations
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1447795.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 1. Ling CX, Sheng VS (2010) Cost-sensitive learning. In: Encyclopedia of machine learning, pp 231-235
 2. Lomax S, Vadera S (2013) A survey of cost-sensitive decision tree induction algorithms. ACM Comput Surv 45(2):16
 3. Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 155-164
 4. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques
 5. Chai X, Deng L, Yang Q, Ling CX (2004) Test-cost sensitive naive Bayes classification. In: Fourth IEEE international conference on data mining, 2004. ICDM’04. IEEE, New York, pp 51-58
 6. Sheng VS, Ling CX (2006) Thresholding for making classifiers cost-sensitive. In: Proceedings of the national conference on artificial intelligence, vol 21. AAAI Press, Menlo Park, p 476
 7. Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Third IEEE international conference on data mining, 2003. ICDM 2003. IEEE, New York, pp 435-442
 8. Ting KM (1998) Inducing cost-sensitive trees via instance weighting. Springer, Berlin
 9. Domingos P (1998) How to get a free lunch: a simple cost model for machine learning applications. In: Proceedings of AAAI-98/ICML-98 workshop on the methodology of applying machine learning, pp 1-7
 10. Weiss GM, Provost F (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Intell Res 19:315-354
 11. Dalessandro B, Perlich C, Raeder T (2014) Bigger is better, but at what cost? Estimating the economic value of incremental data assets. Big Data 2(2):87-96
 12. Weiss GM, Tian Y (2008) Maximizing classifier utility when there are data acquisition and modeling costs. Data Min Knowl Discov 17(2):253-282
 13. Greiner R, Grove AJ, Roth D (2002) Learning cost-sensitive active classifiers. Artif Intell 139(2):137-174
 14. Donmez P, Carbonell JG (2008) Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the 17th ACM conference on information and knowledge management. ACM, New York, pp 619-628
 15. TLOxp (2015) TLOxp Pricing alternatives available for all industries. http://www.tlo.com/pricing.html. [Online; accessed 17-June-2016]
 16. LexisNexis (2015) LexisNexis Pricing Plans. http://www.lexisnexis.com/gsa/76/plans.asp. [Online; accessed 02-June-2015]
 17. Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 204-213
 18. Center for Medicare, Medicaid Service: Provider Utilization and Payment. http://www.cms.gov/ResearchStatisticsDataandSystems/StatisticsTrendsandReports/MedicareProviderChargeData/PhysicianandOtherSupplier.html. Accessed: 2015-03-16
 19. Medicare Provider Utilization and Payment Data. http://www.cms.gov/ResearchStatisticsDataandSystems/StatisticsTrendsandReports/MedicareProviderChargeData/PhysicianandOtherSupplier.html. Accessed: 2015-06-03
 20. Crimes 2001-present. https://data.cityofchicago.org/PublicSafety/Crimes2001topresent/ijzpq8t2. Accessed: 2015-06-03
 21. Physician Quality Reporting System. http://www.cms.gov/Medicare/QualityInitiativesPatientAssessmentInstruments/PQRS/. Accessed: 2015-06-03
 22. The EHR Incentive Program. http://www.cms.gov/RegulationsandGuidance/Legislation/EHRIncentivePrograms/index.html?redirect=/EHRIncentiveprograms. Accessed: 2015-06-03
 23. The ERx Incentive Program. https://www.cms.gov/Medicare/QualityInitiativesPatientAssessmentInstruments/ERxIncentive/index.html?redirect=/ERxIncentive/. Accessed: 2015-06-03
 24. The Million Hearts Initiative. http://millionhearts.hhs.gov/. Accessed: 2015-06-03
 25. Cloud AEC (2011) Amazon web services. Retrieved November 9, 2011