Figure 2 | EPJ Data Science

Figure 2

From: Uncovering the size of the illegal corporate service provider industry in the Netherlands: a network approach

Flowchart of our approach, which consists of two phases. (i) The data cleaning and feature engineering phase (top part, in black). We started by downloading the datasets on licensed CSPs, companies, and directors. Directors (right branch) with one or two positions were excluded, and the remaining 36,543 directors were merged with company data to construct the network features. Licensed CSPs (left branch) were augmented using company data to reach a final set of 909 licensed CSPs. (ii) The modeling and validation phase (bottom part, in color). First, we used the nearest neighbors algorithm to find directors similar to the 909 licensed CSPs (red branch). We kept all directors that were among the 100 closest directors to at least three licensed CSPs. We manually validated 100 of them to estimate the size of the illegal CSP industry at 161–572 entities. Second, we conducted a validation test using penalized logistic regression (blue branch). We found 3,677 new potential candidates and manually validated 100 of them to estimate that the first approach missed 9–199 illegal CSPs. Taken together, we estimate the size of the illegal CSP industry at 402 entities (95% confidence interval 212–668). Solid arrows indicate a transformation or creation of a dataset. Dashed arrows indicate inputs.
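The nearest-neighbors selection rule in the red branch (keep a director if it is among the 100 closest directors to at least three licensed CSPs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_candidates`, the use of Euclidean distance, the toy feature vectors, and the assumption that the director pool excludes the licensed CSPs themselves are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist


def select_candidates(csp_features, director_features, k=100, min_hits=3):
    """Return ids of directors that are among the k nearest directors
    of at least min_hits licensed CSPs (Euclidean distance is an assumption)."""
    hits = Counter()
    for csp_vec in csp_features.values():
        # Rank every director by distance to this licensed CSP.
        ranked = sorted(
            director_features,
            key=lambda d: dist(csp_vec, director_features[d]),
        )
        # Count one hit for each of the k closest directors.
        hits.update(ranked[:k])
    return {d for d, n in hits.items() if n >= min_hits}


# Toy data: hypothetical two-dimensional feature vectors.
csps = {"csp_a": (0.0, 0.0), "csp_b": (0.1, 0.0), "csp_c": (0.0, 0.1)}
directors = {"d1": (0.05, 0.05), "d2": (5.0, 5.0), "d3": (0.2, 0.1)}

# With k=2, only d1 and d3 are close to all three toy CSPs.
candidates = select_candidates(csps, directors, k=2, min_hits=3)
```

In the caption's setting, `k` would be 100 and `min_hits` would be 3, applied to the 909 licensed CSPs and the network features of the 36,543 directors. A brute-force ranking like this one is only practical for small inputs; an indexed nearest-neighbor search would be the natural choice at that scale.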
