Fair automated assessment of noncompliance in cargo ship networks

Cargo ships navigating global waters are required to be sufficiently safe and compliant with international treaties. Governmental inspectorates currently assess in a rule-based manner whether a ship is potentially noncompliant and thus needs inspection. One of the dominant ship characteristics in this assessment is the ‘colour’ of the flag a ship is flying, where countries with a positive reputation have a so-called ‘white flag’. The colour of a flag may disproportionately influence the inspector, causing more frequent and stricter inspections of ships flying a non-white flag, resulting in confirmation bias in historical inspection data. In this paper, we propose an automated approach for the assessment of ship noncompliance, realising two important contributions. First, we reduce confirmation bias by using fair classifiers that decorrelate the flag from the risk classification returned by the model. Second, we extract mobility patterns from a cargo ship network, allowing us to derive meaningful features for ship classification. Crucially, these features model the behaviour of a ship, rather than its static properties. Our approach shows both a higher overall prediction performance and improved fairness with respect to the flag. Ultimately, this work enables inspectorates to better target noncompliant ships, thereby improving overall maritime safety and environmental protection.


Introduction
Maritime cargo transport is of major importance to global trade, often being the most cost-effective way to move goods from one place to another. It results in many ship movements across the world; around 80% of world merchandise is carried by sea [1]. However, maritime transport also comes with risks, such as labour exploitation, culpable ship accidents, and environmental pollution. These risks need to be mitigated by shipowners. To ensure mutual trust that ships from all countries adhere to international laws, port state control inspections are conducted when ships berth at a port. There are two possible outcomes of an inspection: either the ship is found compliant, or particular deficiencies are recorded. These port state control inspections check for compliance with many regulations, covering any deficiency that could lead to one of the aforementioned maritime risks. If severe enough, such deficiencies can lead to a detention, meaning that the ship is not allowed to depart the port before the deficiencies are rectified, or to a ban, meaning that the ship is no longer allowed to enter certain ports. In this work we aim to predict in an automated manner whether a ship will have a deficiency in port state control, and thus is potentially noncompliant, which we consider equivalent to a ship posing a high risk.
In recent years, governments have established stricter laws to mitigate the negative consequences of maritime transport. Members of the Paris Memorandum of Understanding (Paris MoU) introduced a so-called 'new inspection regime' [2]. Arguably the biggest innovation in the renewed memorandum is the introduction of a ship risk profile. It awards a score to each ship based on a weighted sum of six factors [2]. The factors used in the risk profile for a given ship are its: (1) type, (2) age, (3) commercially issued safety certificate, (4) owning company's performance, (5) historical misconduct and (6) the flag a ship is flying, or equivalently, the country of registration [3]. Ships with a low risk profile should be inspected every three years, while ships with a high risk profile should be inspected every six months. The new inspection regime, with its ship risk profile, allows inspectorates to focus on noncompliant ships. It also leads to efficient use of the inspection capacity and budget, as every unnecessary port state control inspection costs the inspectorates on average around $1,000 [4]. This money can then be allocated to find more noncompliant ships. In [5] it was estimated that a noncompliant ship saves on average around $400,000 on maintenance by not complying with regulations, whereas the loss of a ship can incur costs up to $67,000,000. Shipowners with a low risk profile can benefit from a reduced inspection burden, saving precious turn-around time in the port.
Of the six factors used in the current ship risk profile, the flag plays an important role [6,7]. The flag is considered black, grey, or white, based on the detention ratio of the country over a three-year rolling period [8]. Fleets from countries on the black list were detained significantly more often over a three-year rolling period than fleets from countries on the white list. We mention three drawbacks of considering the flag in the ship risk profile. First, there are ethical concerns. The use of the flag can be considered disparate treatment [9], because ships are intentionally treated differently based on membership of a privileged class, being the white flag. Second, there are opportunities for ships to change flags, opening up the possibility for noncompliant ships to 'hide' under a white flag [10]. Although changing flags does not necessarily improve compliance, the current new inspection regime would grant such a ship a lower risk profile. In an ideal situation, merely changing an administrative property of a ship should not change the assessment of the risk associated with that ship. Third, inspectors can use their own discretion (possibly leading to subjectivity) to decide how thorough an inspection is. Hence, ships flying a black flag could be subjected to stricter inspections, resulting in a higher probability of finding a deficiency [11,12]. This potentially greater focus on ships flying a black flag may mean that these ships are inspected more often and more strictly, contributing to a confirmation bias in historical inspection data [10]. The potential danger of inspectors' bias has been recognised and great efforts are made to harmonise the training of inspectors, thereby making the overall inspectorate system consistent [3]. Nevertheless, complete global harmonisation has not been achieved yet [12].
In our study, we could choose to start ignoring the flag of a ship altogether to reduce the aforementioned confirmation bias, thus providing what in the literature [13] is known as equal opportunity. However, correlations between the other characteristics of a ship and its target exist, thus the classifier will indirectly learn to use the flag of a ship, resulting in inequality of outcomes. Considering all drawbacks of using the flag in risk prediction, we argue that it might be better to get equal outcomes and therefore investigate how we can decorrelate the flag with respect to the outcome of the automated prediction of noncompliance. We do so by employing a so-called fair classifier [14], that can classify whether a ship is noncompliant but prevents (to a specified extent) correlation between its output and the ship's flag. Such a fair classifier may reduce the confirmation bias and therefore improve overall fairness of the risk assessment, compared to the aforementioned ship risk profile, which we consider to be the baseline model.
We consider the actual behaviour of the ships for prediction of noncompliance, in contrast to using the aforementioned six factors of the ship risk profile. Ship behaviour has been used to find anomalous ships [15], which may be indicative of noncompliance. An example of ship behaviour that might be characteristic of noncompliance is sailing primarily on routes with a lot of competition, potentially leading to the cutting of costs at the expense of safety. While we do not know the fares on specific routes, our proposed classifier will still take relations between noncompliance and the sailed routes into account. In the current study, we derive a cargo ship network from data containing notifications of ships calling at a port. This data is available to all inspectorates that are members of the European Maritime Safety Agency.
In the network, nodes are ports, and edges are journeys of ships that travel between ports. By considering the structural function each port has in the network, we extract mobility patterns for each ship. These mobility patterns are provided to the fair machine learning classifier, enabling automated assessment of the risk of ships based on their behaviour. The use of these mobility patterns is novel, since data on port calls has only recently become available to inspectorates.
Hence, our goal is to devise an automated and accurate assessment of ship noncompliance. We do so by answering two research questions. First, how can we obtain our goal using behavioural data? Second, how can we obtain our goal in a fair manner?
The structure of the remainder of this paper is as follows. We start with related work on the ship risk profile and ship risk classification in general. Then, we explain the data used in this work. Subsequently, we describe our methods used to answer the research questions. We then present the results, and end with a discussion and conclusions.

Related work
It is widely recognised that the introduction of the 'new inspection regime', and thereby the ship risk profile, has been beneficial to a reduction of the number of noncompliant ships [7,12,16,17]. We remark that some weaknesses of the current ship risk profile have already been identified in literature [18-24]. We mention two of them, together with the solutions that were provided. We then continue with discussing related work on the cargo ship network.
The first weakness in the existing ship risk profile, which assesses risks based on a weighted sum of six characteristic ship factors, is that the weights are manually determined [25]. In doing so, the model ignores any interactions between the factors. Here we remark that more complex models may take into account more dependencies and correlations, thereby improving performance [25,26]. To this end, machine learning classifiers have been introduced that can learn the weights automatically and do capture correlations between the factors. Gao et al. proposed to use a support vector machine and k-nearest neighbours pipeline to find high risk ships [25]. The support vector machine takes more complex (and non-linear) interactions into account and generalises well, while k-nearest neighbours makes the overall approach noise tolerant. Yan et al. acknowledged that only a small fraction of ships is detained, and therefore used a balanced random forest classifier to predict ship detentions [24].
The second weakness of the ship risk profile is that so far relatively static factors are used in the assessment of risk, meaning that the factors rarely change for a given ship. To remedy this weakness, datasets have been exploited that better reflect the current condition of a ship and hence may improve prediction. Xu et al. used web scraping techniques to gather more information from inspection reports [20]. Knapp and Franses used company inspections and data from other inspection regimes to enhance the ship risk profile [27]. Yan et al. proposed to add more historic information to the model, such as the number of flag changes and casualties in the last five years [24]. Also, they suggested making information exchange between different inspection regimes more coherent, such that deficiencies and detentions in other regions can be used as well [26]. An additional suggestion is to enrich the risk profile by incorporating more specific information, such as data pertaining to whether the ship has been involved in an accident.
In this work we use port call data modelled as a cargo ship network. The first publication of such (global) cargo ship network was by Kaluza et al. [28]. They noted that the diameter of the network, the longest shortest path length, was smaller than expected for a random network with the same number of nodes and edges, with a value of only 8. Also, they found that the average topological distance between any two ports in the world was only 2.5. Likewise, Liu, Wang and Zhang found a diameter of only 7 and an average distance of 3.3 [29]. Peng et al. studied the robustness of the cargo ship network based on transponder data available in 2018 [30]. They differentiated between the different ship types (oil tankers, container, dry bulk) and reported the properties for each of the sub-networks derived for just those ships. No measure of the distances in the network was reported, but a density (of ∼0.02) similar to the first published cargo ship network was found. Finally, Van Veen analysed the cargo ship network as derived from data of port calls [31]. Although the data involved only journeys either departing or arriving at one of the members of the Paris MoU, a diameter of 7 was found and an average distance of 2.49, which is similar to the reported values of other works. In the data section, we compare the properties of these networks to those of the cargo ship network as derived by us. Ultimately, we predict noncompliance using a classifier supplied with mobility patterns extracted from the cargo ship network. Our approach thus addresses the two weaknesses observed in the ship risk profile that is currently used by inspectorates.

Data
The purpose of the paper is to classify in a fair manner the noncompliance of ships, using behavioural data. The data used in the paper stems from two sources; (1) port calls and (2) inspections.
The first data source, being the port calls, contains notifications of all cargo ships calling at a port. Our port call data contains only calls at ports participating in the Paris Memorandum of Understanding, and each call is accompanied by the following five pieces of information: (1) the port that is called at, (2) the arrival date, (3) the duration that the ship is berthed, (4) the flag of the ship when it called, and (5) the ship risk profile (low, medium, high risk) computed at the time of entering the port. From this port call data, we can reconstruct the journeys that took place. A journey of a ship goes from a departure port to an arrival port, and has an associated travel time.
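The reconstruction of journeys from consecutive port calls can be sketched as follows. This is a minimal illustration and not necessarily the procedure used in this work: the tuple layout, the function name `reconstruct_journeys`, the example ship identifier, and the assumption that a ship departs exactly when its berth ends are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def reconstruct_journeys(port_calls):
    """Turn port calls into journeys between consecutive ports.

    port_calls: iterable of (ship_id, port, arrival, berth_duration) tuples.
    Returns a list of (ship_id, departure_port, arrival_port, travel_time).
    """
    calls_per_ship = defaultdict(list)
    for ship_id, port, arrival, berth in port_calls:
        calls_per_ship[ship_id].append((arrival, port, berth))

    journeys = []
    for ship_id, calls in calls_per_ship.items():
        calls.sort(key=lambda call: call[0])  # chronological order
        for (arr0, port0, berth0), (arr1, port1, _) in zip(calls, calls[1:]):
            departure = arr0 + berth0  # assumption: ship leaves when the berth ends
            journeys.append((ship_id, port0, port1, arr1 - departure))
    return journeys


# Hypothetical port calls of a single ship (identifier 1 is made up).
calls = [
    (1, "Rotterdam", datetime(2018, 1, 1), timedelta(hours=24)),
    (1, "Antwerp", datetime(2018, 1, 3), timedelta(hours=12)),
    (1, "Hamburg", datetime(2018, 1, 5), timedelta(hours=6)),
]
journeys = reconstruct_journeys(calls)
```

Each resulting journey carries the departure port, the arrival port, and the travel time, which is the information needed later for both the network and the temporal features.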
The second data source, being the inspections, provides information about ships that had a deficiency. Also, we know if such a deficiency has led to a detention. Ships without deficiencies were assumed to be compliant, because every ship should be inspected at least every three years at one of the ports participating in the Paris MoU [3]. The inspection results are used as ground truth for our classifier.
Ships in these two datasets are linked together by means of the International Maritime Organisation number, a unique identifier used in the maritime sector. We select data from the years that occur in both sources (2014-2018), resulting in over 3 million calls from 28,416 cargo ships to a port in one of the 30 countries. Most of these ships, 97.3% (27,647), did not change their flag during the years under consideration. Of these ships, the total numbers with a white, grey, or black flag are 26,300, 672, and 675, respectively. Because only a small proportion of ships is flying a black or grey flag, we take them together and refer to the group as non-white flags. As mentioned before, ships can easily and quickly change their flag to either a so-called flag of convenience or to a more trustworthy flag with a better reputation [32]. In the data, 2.7% (769) of all ships changed their flag in 2014-2018. More details on the used data are presented in the Appendix. In the next section, we present our approach to the prediction of noncompliance.

Methods
We aim to create a machine learning classifier that performs fair automated assessment of the risk for each ship. To this end, two types of features are used as input to the classifier; network features and temporal features.
We start by explaining the construction of the cargo ship network. In the second part we explain our approach to feature engineering, dealing with both the network features and temporal features. In the third part, we discuss the classifier in the context of machine learning. We elucidate the fair random forest classifier and explain the performance measures and fairness measures.

Cargo ship network
To obtain the structural importance of each port, later used to characterise the behaviour of ships, we construct a cargo ship network. The edges of the directed weighted network are obtained by considering the journeys of all ships, linking a port to another port if at least one ship made a journey visiting those two ports immediately after each other. Edges are weighted according to how many such journeys exist between the two ports. Hence, each node of the network is a port.
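As an illustration of this construction, the following minimal sketch counts journeys per ordered port pair to obtain the weighted directed edges. The journey list and the port names are hypothetical examples, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical journeys as (departure_port, arrival_port) pairs.
journeys = [
    ("Rotterdam", "Antwerp"),
    ("Rotterdam", "Antwerp"),
    ("Antwerp", "Hamburg"),
    ("Hamburg", "Rotterdam"),
]

# Directed edge weight = number of journeys from the first to the second port.
edge_weights = Counter(journeys)

# Every port that occurs in at least one journey becomes a node.
ports = {port for journey in journeys for port in journey}
```

Because the pairs are ordered, a journey from Rotterdam to Antwerp and one from Antwerp to Rotterdam contribute to two different edges.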
Below, we explain the structural properties of the cargo ship network in terms of its density, diameter, average distance and clustering coefficient (for a definition of these elementary network measures, see [33]). These structural properties help us understand whether our cargo ship network is in fact similar to earlier constructed networks of the same type. The structural importance of each port is obtained by computation of the following twelve centrality measures:
• (1) in-degree, (2) out-degree, (3) degree,
• (4) in-strength, (5) out-strength, (6) strength,
• (7) closeness centrality and (8) weighted closeness centrality,
• (9) betweenness centrality and (10) weighted betweenness centrality,
• (11) eigenvector centrality and (12) weighted eigenvector centrality.
These centrality measures are used in the features provided to a machine learning classifier. Degree and strength capture (a) the number of routes and (b) the weighted number of routes connected to a port, respectively. The in-strength of a port is thus equal to the number of journeys towards that port. Closeness centrality is equal to the reciprocal of the average shortest path distance from a node to all other nodes [34]. A more central node is closer to all other nodes and hence has a high closeness centrality. The betweenness centrality is equal to the number of shortest paths between every pair of nodes that pass through the node under consideration [35]. A node with high betweenness centrality plays an important role in the network; a disruption of this node will affect many shortest paths. The eigenvector centrality is determined using eigendecomposition of the adjacency matrix [36]. A high eigenvector centrality value means that the node is connected to many nodes that themselves also have a high eigenvector centrality value. With these centrality measures, the aim is to capture a diverse set of measures for the structural role of a port in the cargo ship network.
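The twelve measures can be computed with NetworkX along the following lines. This is a sketch on a hypothetical four-port network, not the exact configuration used in this work. In particular, using the reciprocal of the journey count as the distance for the weighted closeness and betweenness is an assumption made here, since shortest-path measures need a distance rather than a strength.

```python
import networkx as nx

# Hypothetical toy network: edge weight = number of journeys between two ports.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 3), ("B", "C", 1), ("C", "A", 2),
    ("A", "C", 1), ("C", "D", 1), ("D", "A", 1),
])

# Assumption: many journeys make two ports "close", so distance = 1 / weight.
for _, _, data in G.edges(data=True):
    data["distance"] = 1.0 / data["weight"]

centrality = {
    "in_degree": dict(G.in_degree()),
    "out_degree": dict(G.out_degree()),
    "degree": dict(G.degree()),
    "in_strength": dict(G.in_degree(weight="weight")),
    "out_strength": dict(G.out_degree(weight="weight")),
    "strength": dict(G.degree(weight="weight")),
    "closeness": nx.closeness_centrality(G),
    "weighted_closeness": nx.closeness_centrality(G, distance="distance"),
    "betweenness": nx.betweenness_centrality(G),
    "weighted_betweenness": nx.betweenness_centrality(G, weight="distance"),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "weighted_eigenvector": nx.eigenvector_centrality(
        G, max_iter=1000, weight="weight"
    ),
}
```

Each entry maps a port to its value, so the vector of twelve values for a port is obtained by looking it up in every dictionary.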
The training set (used to learn the classifier) and the test set (used to estimate the performance of the classifier) should be independent. To prevent data used to construct the network from also being used in training and testing, we construct the network from a separate hold-out set. Hence, we assign every ship $i \in I$ to one of two disjoint sets (here, $I$ denotes the set containing all ships). A 10% sample of all ships in $I$ is used for network construction ($I_{\text{network}}$), whereas the remaining ships ($I_{\text{classification}}$) are used in the classification part.
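A minimal sketch of such a split is given below. The function name, the random seed, and the use of uniform random sampling are assumptions for illustration; the text only specifies that 10% of the ships are held out for network construction.

```python
import random


def split_ships(ship_ids, network_fraction=0.10, seed=0):
    """Split ships into a hold-out set for the network and a classification set."""
    ids = sorted(ship_ids)  # fixed order makes the sample reproducible
    rng = random.Random(seed)
    n_network = round(network_fraction * len(ids))
    network_ships = set(rng.sample(ids, n_network))
    classification_ships = set(ids) - network_ships
    return network_ships, classification_ships


# Hypothetical ship identifiers 0..99.
network_ships, classification_ships = split_ships(range(100))
```

Splitting on ships rather than on individual port calls ensures that no journey of a classified ship contributes to the network it is later characterised by.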

Feature engineering
We have two types of features that describe the behaviour of ships in $I_{\text{classification}}$: network features and temporal features.

Network features
The network features aim to capture what type of ports a given ship visits, which can correlate with noncompliant behaviour. We obtain the network features in four steps.
Step 1. Determination of centrality measures. We characterise each journey of a given ship by the structural importance in the cargo ship network of both the departure and arrival port. If the port is observed in the cargo ship network, the twelve centrality measures (see Sect. 4.1) are determined. For each centrality measure, we combine the value obtained from the departure port and the value obtained from the arrival port using the four arithmetic operations separately (sum, multiplication, absolute difference, division). After this step, we have 12 · 4 = 48 values characterising each journey.
Step 2. Binning. To capture the distribution of the values obtained for each journey, we make a histogram of these centrality measures, by splitting each of the values obtained in the previous step into ten equal-width bins. The edges of all these bins are learned from the journeys of $I_{\text{network}}$, to prevent information leakage. After this step, we have 48 · 10 = 480 values for each journey.
Step 3. Aggregation. Now, the classifier is ultimately provided with information about the instances, the individual ships. Hence, we need to aggregate the information of each journey to a fixed set of values per ship. The 480 values, obtained from step 2, can then be aggregated for each ship by summation of all journeys. Thereafter, we normalise these values by dividing them by the total number of journeys. We use the total number of journeys as a separate feature, and add it to the list of features. The procedure of normalising allows the classifier to compare the distributions, regardless of the number of journeys of a ship. In this way, we obtain 480 + 1 = 481 features.
Step 4. Encoding the missingness. In step 1 we explained that the centrality measures are only defined if the port was observed in the cargo ship network. Obviously, the information that a port is missing in the network is informative for the classifier. Hence, we will encode this missingness. We do so by two separate features. The first feature equals the number of journeys where only one port was unobserved. The second feature equals the number of journeys where both ports were unobserved. In the end, we thus have 481 + 2 = 483 network features.
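The four steps can be sketched for a single ship as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, a ratio with a zero denominator is set to zero, values outside the learned bin range are placed in the outermost bins, and the histogram is normalised by the total number of journeys (including those with unobserved ports).

```python
import numpy as np


def combine(dep, arr):
    """Step 1: four arithmetic combinations of two centrality values."""
    ratio = dep / arr if arr != 0 else 0.0  # assumption: undefined ratio -> 0
    return [dep + arr, dep * arr, abs(dep - arr), ratio]


def ship_network_features(journeys, centrality, bin_edges):
    """journeys: (departure, arrival) port pairs of ONE ship.
    centrality: dict port -> sequence of m centrality values.
    bin_edges: array of shape (4 * m, n_bins + 1), learned on hold-out ships."""
    n_values, n_bins = bin_edges.shape[0], bin_edges.shape[1] - 1
    counts = np.zeros((n_values, n_bins))
    one_missing = both_missing = 0
    for dep, arr in journeys:
        missing = (dep not in centrality) + (arr not in centrality)
        if missing == 1:
            one_missing += 1
        elif missing == 2:
            both_missing += 1
        if missing:
            continue  # Step 4: centralities are undefined for this journey
        values = [v for d, a in zip(centrality[dep], centrality[arr])
                  for v in combine(d, a)]
        for k, value in enumerate(values):
            # Step 2: bin index, clipped so outliers land in the outer bins
            b = int(np.searchsorted(bin_edges[k], value, side="right")) - 1
            counts[k, min(max(b, 0), n_bins - 1)] += 1
    n = len(journeys)
    hist = counts.ravel() / n if n else counts.ravel()  # Step 3: normalise
    return np.concatenate([hist, [n, one_missing, both_missing]])


# Hypothetical example: two observed ports with 12 random centrality values.
rng = np.random.default_rng(0)
centrality = {"A": rng.random(12), "B": rng.random(12)}
bin_edges = np.tile(np.linspace(0.0, 2.0, 11), (48, 1))  # ten equal-width bins
journeys = [("A", "B"), ("B", "A"), ("A", "X"), ("X", "Y")]
features = ship_network_features(journeys, centrality, bin_edges)
```

With twelve centrality measures and ten bins, the returned vector has 480 + 1 + 2 = 483 entries, matching the count derived above.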

Temporal features
The temporal features are computed from the duration of a ship's journeys and port berths. Abnormally short or long berths or journeys may be indicative of noncompliance. For example, very short berths may lead to rushing through safety procedures, while significantly longer berths may be indicative of problems with the port authorities. We calculate the temporal features from (a) the berth duration in ports and (b) the travel time of journeys. To preserve the estimated distribution of the berth durations and travel times during aggregation, we first make a histogram of these values for each ship. The histogram is made by splitting, for each ship, all berth and journey durations into ten equal-width bins. To prevent information leakage, the boundaries of the bins are learned from the port calls and journeys occurring in $I_{\text{network}}$. For each ship, we divide the counts in (a) the histogram of the berth durations and (b) the histogram of the journey durations by the total number of berths and journeys, respectively. In this way, 2 · 10 = 20 temporal features are obtained. In total, we have 483 network features and 20 temporal features, resulting in 503 features describing each ship. We represent the 503 features of a ship $i$ by a vector $x_i$ in the remainder of this paper.
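The temporal features of one ship can be sketched as two normalised histograms. The function name, the unit (hours), the bin range, and the choice to clip durations outside the learned range into the outermost bins are assumptions for illustration.

```python
import numpy as np


def temporal_features(berth_hours, travel_hours, berth_edges, travel_edges):
    """Normalised histograms of berth and travel durations for ONE ship."""
    parts = []
    for values, edges in ((berth_hours, berth_edges), (travel_hours, travel_edges)):
        # Assumption: durations outside the learned range go to the outer bins.
        clipped = np.clip(np.asarray(values, dtype=float), edges[0], edges[-1])
        counts, _ = np.histogram(clipped, bins=edges)
        parts.append(counts / max(len(values), 1))
    return np.concatenate(parts)


# Hypothetical bin edges (ten equal-width bins), as if learned on hold-out ships.
berth_edges = np.linspace(0.0, 100.0, 11)
travel_edges = np.linspace(0.0, 500.0, 11)
features = temporal_features([5, 15, 15, 250], [40, 460], berth_edges, travel_edges)
```

Because each histogram is divided by the number of observations, ships with few and many port calls yield comparable feature vectors.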

Machine learning classifier
We employ a machine learning classifier to perform the automated assessment of noncompliance. The goal of the classifier is to learn, for the ships $i \in I_{\text{classification}}$ with feature vector $x_i \in X$ and target scalar $y_i \in Y$, a function $f : X \to [0, 1]$ that returns a score $Z_i = f(x_i)$. The positive instances, i.e., $y_i = 1$, indicate a noncompliant ship and the negative instances a compliant ship. As explained in the introduction, in search of a particular type of fairness our aim is to reduce the classifier's dependency on a sensitive attribute $s_i \in S$, where $s_i = 0$ marks a ship with a white flag (non-sensitive) and $s_i = 1$ otherwise.
We employ a fair random forest classifier [37], which is a modified random forest classifier. In brief, a random forest classifier works as follows. For every tree in the forest, a bootstrapped sample of the training data is taken. Then, a decision tree is grown, by recursively doing three steps: (1) select a sample from all features available; (2) optimise a criterion (commonly the information gain) calculated on each of the sampled features; and (3) split the node into two child nodes based on the outcome of the optimisation. The score of an instance is calculated as the fraction of positives in a child node. For further details, we refer the reader to [38].
Random forest classifiers have, like other tree learning algorithms, some beneficial properties. We mention two of them. The first property is that their good performance has been confirmed in different domains, even with minimal tuning [38]. The second property is that the criterion considered does not have to be differentiable, in contrast to many other classifiers. Both properties allow us to use a specifically designed criterion, called Splitting Criterion Area under the curve For Fairness (SCAFF) [37]. The criterion ensures both that different labels are separated and that the sensitive classes remain mixed. It is defined as follows:

$$\mathrm{SCAFF}(\Theta) = (1 - \Theta) \cdot \mathrm{AUC}_Y - \Theta \cdot \mathrm{AUC}_S,$$

with $\mathrm{AUC}_Y \in [0, 1]$ marking the well-known Area Under the receiver operating characteristic Curve:

$$\mathrm{AUC}_Y = \frac{1}{s^{+} s^{-}} \sum_{i : y_i = 1} \sum_{j : y_j = 0} \sigma(Z_i, Z_j), \qquad \sigma(Z_i, Z_j) = \begin{cases} 1 & \text{if } Z_i > Z_j, \\ 0.5 & \text{if } Z_i = Z_j, \\ 0 & \text{otherwise,} \end{cases}$$

where $s^{+}$ and $s^{-}$ mark the number of positive and negative instances, respectively. An $\mathrm{AUC}_Y$ value of 0.5 suggests random classification, while $\mathrm{AUC}_Y = 1$ indicates a perfect classifier. The $\mathrm{AUC}_S$ considers the sensitive attribute as positive class. It is defined as follows:

$$\mathrm{AUC}_S = \frac{1}{|\{i : s_i = 1\}| \cdot |\{j : s_j = 0\}|} \sum_{i : s_i = 1} \sum_{j : s_j = 0} \sigma(Z_i, Z_j),$$

with $\sigma(Z_i, Z_j)$ defined exactly the same as for $\mathrm{AUC}_Y$. The measure is closely related to strong demographic parity [39]. For $\mathrm{AUC}_S = 0.5$, corresponding to a strong demographic parity of 0, the split in the node is made regardless of the values of the sensitive attribute, meaning equality of outcome. A value of $\mathrm{AUC}_S = 1$, corresponding to a strong demographic parity of 1, is the worst score possible, since in that case the classifier is able to predict the sensitive attribute perfectly. The orthogonality parameter, $\Theta \in [0, 1]$, allows us to balance the performance-fairness trade-off [14]. The fair random forest classifier optimises for performance when $\Theta = 0$ and thus does not consider fairness. Hence, it corresponds in that case to the ordinary random forest classifier with an information gain criterion. In contrast, when provided a value of $\Theta = 1$, it optimises fairness and neglects performance, resulting in a random classifier.
More details on the fair random forest classifier are given in [37].
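To make the trade-off concrete, the following sketch evaluates such a criterion for a given set of scores. It assumes that the criterion is the weighted difference between the label AUC and the sensitive-attribute AUC, with the orthogonality parameter written as `theta`; this form, the function names, and the toy data are assumptions for illustration, and the reference implementation is the one described in [37].

```python
def sigma(z_i, z_j):
    """Pairwise comparison of two scores; ties count as one half."""
    if z_i > z_j:
        return 1.0
    if z_i == z_j:
        return 0.5
    return 0.0


def auc(scores, labels):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [z for z, label in zip(scores, labels) if label == 1]
    neg = [z for z, label in zip(scores, labels) if label == 0]
    return sum(sigma(p, n) for p in pos for n in neg) / (len(pos) * len(neg))


def scaff(scores, y, s, theta):
    """Assumed form: reward separating labels y, penalise separating flags s."""
    return (1.0 - theta) * auc(scores, y) - theta * auc(scores, s)


# Hypothetical scores for four ships.
scores = [0.9, 0.8, 0.3, 0.2]
y = [1, 1, 0, 0]  # 1 = noncompliant
s = [1, 0, 1, 0]  # 1 = non-white flag
values = [scaff(scores, y, s, theta) for theta in (0.0, 0.5, 1.0)]
```

In this toy example the scores separate the labels perfectly but also partly separate the flags, so the criterion value decreases as `theta` grows.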

Performance measures
The performance of the classifier can be determined both by threshold-dependent and threshold-free metrics. Scores equal to or above the threshold $t \in [0, 1]$ are classified as positive ($\hat{y}_i = 1$) and scores under the threshold are classified as negative ($\hat{y}_i = 0$). Threshold-free metrics have the advantage that they do not require this explicit cut-off point and instead consider the ranking imposed by the scores of the classifier. The three threshold-dependent performance metrics used by us are the precision, the recall, and the harmonic mean of those two, the $F_1$ score. The threshold-free performance metric used in this work is the $\mathrm{AUC}_Y$ (see previous section).
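The threshold-dependent metrics can be illustrated with a short sketch. The function name, the default threshold of 0.5, and the convention of returning zero when a metric is undefined are assumptions made for this example.

```python
def threshold_metrics(scores, labels, threshold=0.5):
    """Precision, recall and F1 for scores thresholded at `threshold`."""
    preds = [1 if z >= threshold else 0 for z in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical scores and ground-truth labels (1 = noncompliant).
precision, recall, f1 = threshold_metrics([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0])
```

Moving the threshold trades precision against recall, which is why a threshold-free metric is reported alongside these three.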

Fairness measures
Similar to the performance measures, fairness with respect to the sensitive group can also be quantified by threshold-dependent and threshold-free metrics. We report the threshold-dependent precision and recall for the two groups, i.e., ships with a white flag and ships with a non-white flag. A large difference between the two groups indicates an unfair outcome of the model. Moreover, we use the threshold-dependent demographic parity and equalised odds [13]. These measures consider the difference in a performance measure between the two groups. The demographic parity measure is the absolute difference between the positive prediction rates of the two groups, i.e.,

$$\left| P(\hat{Y} = 1 \mid S = 1) - P(\hat{Y} = 1 \mid S = 0) \right| \le \epsilon_{\text{parity}}.$$

Lower values of $\epsilon_{\text{parity}}$ indicate more equal outcomes and thus fairer predictions. The equalised odds metric is defined as follows:

$$\left| P(\hat{Y} = 1 \mid S = 1, Y = y) - P(\hat{Y} = 1 \mid S = 0, Y = y) \right| \le \epsilon_{\text{odds}}, \qquad y \in \{0, 1\}.$$

The equalised odds thus measures the equality of opportunity in a supervised setting, with lower values of $\epsilon_{\text{odds}}$ implying more equal opportunity and thus fairer predictions.
The threshold-independent fairness measure used in this work is the aforementioned $\mathrm{AUC}_S$.
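The two threshold-dependent fairness measures can be computed as in the sketch below. The function names and toy data are hypothetical, and reporting equalised odds as the larger of the two gaps (for actual negatives and actual positives) is an assumption made here to obtain a single number.

```python
def positive_rate(preds, mask):
    """Share of positive predictions among the instances selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def demographic_parity(preds, s):
    """Gap in positive prediction rate between the two flag groups."""
    return abs(positive_rate(preds, [v == 1 for v in s])
               - positive_rate(preds, [v == 0 for v in s]))


def equalised_odds(preds, y, s):
    """Largest gap in positive prediction rate, conditioned on the true label."""
    gaps = []
    for label in (0, 1):
        sensitive = [si == 1 and yi == label for si, yi in zip(s, y)]
        other = [si == 0 and yi == label for si, yi in zip(s, y)]
        gaps.append(abs(positive_rate(preds, sensitive)
                        - positive_rate(preds, other)))
    return max(gaps)


# Hypothetical predictions for eight ships (s = 1 marks a non-white flag).
preds = [1, 1, 0, 0, 1, 0, 0, 0]
s = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0, 1, 0]
parity = demographic_parity(preds, s)
odds = equalised_odds(preds, y, s)
```

In this example the two groups have equal true positive rates, so the equalised odds gap is driven entirely by the difference in false positive rates.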

Results
The section starts with our experimental setup. Then, we continue with the analysis of the cargo ship network. Subsequently, we evaluate the performance of the baseline ship risk profile. After that is established, we report on the performance of the (non-fair) random forest classifier. We conclude by reporting the performance of the fair random forest classifiers.

Experimental setup
In this work, we use five-fold nested cross validation with stratified sampling [40]. The inner folds are used to select the best parameter set for that specific outer fold. The considered parameters are combinations of the selected values for the depth of each tree ({1, 2, . . . , 10}) and the number of bins used in the discretisation of the values of the continuous variables (2 or 10). Hence, there are 10 · 2 = 20 candidate sets of parameters in each outer fold. The mean and standard deviation of the performance of the classifier are evaluated on the five outer folds using the selected parameter set. We report the outcome of this cross validation for 11 different values of the orthogonality parameter, $\Theta \in \{0, 0.1, 0.2, \ldots, 1\}$.
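The nested procedure can be sketched with scikit-learn as follows. This is an illustration only: the standard `RandomForestClassifier` stands in for the fair random forest, the data is synthetic, the forest is kept small for speed, and only the tree depth is tuned because the bin parameter belongs to the fair implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the ship feature matrix and noncompliance labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: select the tree depth that maximises the AUC on the inner folds.
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: estimate the performance of the whole selection procedure.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"AUC: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Because the parameter selection is repeated inside every outer fold, the outer scores are not biased by the tuning itself.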
The code used in this research is publicly available [41]. It uses several open source Python packages. Specifically, scikit-learn [42], SciPy [43], and Pandas [44] are used for feature engineering and for measuring the performance of the baseline ship risk profile and the proposed classifier. The fair random forest is also open source [45], making extensive use of the CVXpy package for optimising SCAFF [46]. For the analysis of the cargo ship network we used the NetworkX package [47]. The C++ library teexGraph was used to determine the diameter of the network [48]. The packages used for visualisation, and all other dependencies and supporting software versions, can be found at [41].

Cargo ship network
A quite 'overwhelming' visualisation of the cargo ship network is shown in Fig. 1. We only show ports in Europe, because we are interested in predicting risk for ships that arrive in Europe. From the figure, we can learn the following. First, we see a large component connecting virtually all ports. Second, we observe that only a few ports have a high strength, as indicated by the yellow colour, of which (1) Puttgarden (Germany), (2) Rotterdam (Netherlands), and (3) Algeciras (Spain) have the highest strength. Third, two different types of ports can be distinguished: (1) ports that are well-connected (e.g. ports in Germany, the Netherlands, and Belgium), and (2) ports that are more in the periphery of the network (e.g. Iceland and the Azores). Fourth, we see that some ports are connected by thick lines, indicating an edge with a high weight. The nodes that are connected by these edges are likely to have a high weighted betweenness centrality, because failure of such a node would force shortest paths to run through edges with less weight.
In Table 1 we provide numeric information on sizes, relations, and distances. We show nine common properties of the cargo ship network of our work in the second column. In the third through sixth column we provide values for the properties of four similar cargo networks observed in literature [28-31]. We compare these properties in an attempt to better understand whether our 10% sample used to compute port features is representative. From Table 1 we see that even though very different numbers of nodes and edges are reported in these works, measures such as density, diameter, average distance, and clustering coefficient are similar. Hence, we may conclude that the constructed cargo ship network can be used to sensibly extract mobility patterns for our ship compliance classifier.
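Structural properties of this kind can be obtained with NetworkX as sketched below on a hypothetical four-port network. Computing the clustering coefficient on the undirected projection of the network is an assumption made for this example; the diameter and average distance as computed here require a strongly connected network.

```python
import networkx as nx

# Hypothetical strongly connected toy network of four ports.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 3), ("B", "C", 1), ("C", "A", 2),
    ("A", "C", 1), ("C", "D", 1), ("D", "A", 1),
])

properties = {
    "nodes": G.number_of_nodes(),
    "edges": G.number_of_edges(),
    "density": nx.density(G),
    "diameter": nx.diameter(G),  # longest shortest path (in hops)
    "average_distance": nx.average_shortest_path_length(G),
    # Assumption: clustering is measured on the undirected projection.
    "clustering": nx.average_clustering(G.to_undirected()),
}
```

For a network that is not strongly connected, these distance-based properties would have to be computed on its largest strongly connected component instead.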

Performance of the baseline ship risk profile
The confusion matrices for ships flying a white and a non-white flag are shown separately in Fig. 2. Together with Table 2, where we show the calculated performance and fairness measures, they provide information on the performance of the baseline ship risk profile. Ships having a medium risk are predicted as compliant. We observe that virtually no ship flying a non-white flag gets a low risk profile. The majority of the ships (90%) are classified as medium risk. Of these ships, only a small fraction (22%) is compliant. Of the ships with a high risk profile, only a tiny fraction (4%) is compliant, resulting in a high precision for the baseline model. However, the recall is quite low, as many ships with a medium ship risk profile are noncompliant. Interestingly, ships with a white flag having a low or medium risk profile are noncompliant more frequently than ships with a non-white flag. This also results in a low AUC Y value of only 0.543 ± 0.006. Hence, we may conclude that, at least using the data from 2014-2018, we cannot predict compliance with the baseline ship risk profile. The model is also quite unfair. In particular, we observe a large difference in the F 1 metric between the white and non-white group, resulting in high values for the demographic parity and equalised odds measures. There is a strong correlation between the sensitive attribute, i.e., the ship flag, and the scores of the model, with AUC S = 0.672 ± 0.010.
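The following minimal sketch shows how the threshold-free measures discussed above could be computed with scikit-learn. It assumes that AUC Y is the ROC AUC of the scores against the compliance label and that AUC S is the ROC AUC of the same scores against the sensitive attribute, so that a value near 0.5 means the scores carry no information about the flag. The data, the variable names, and the cut-off of 0.5 are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical toy data: y = 1 means noncompliant, s = 1 means non-white flag.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
s = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.8, 0.4, 0.2, 0.3, 0.6, 0.1])

auc_y = roc_auc_score(y, scores)  # how well the scores rank noncompliance
auc_s = roc_auc_score(s, scores)  # how well the same scores reveal the flag

# Threshold-dependent view: F1 per flag group at an illustrative cut-off.
y_pred = (scores >= 0.5).astype(int)
f1_by_flag = {
    group: f1_score(y[s == group], y_pred[s == group], zero_division=0)
    for group in (0, 1)
}

print(f"AUC Y = {auc_y:.3f}, AUC S = {auc_s:.3f}, F1 per group = {f1_by_flag}")
```

In this toy data every ship with a non-white flag scores higher than every ship with a white flag, so the scores reveal the flag perfectly even though the flag is not an explicit input to the metric.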

Random forest classifier
The confusion matrices of the random forest classifier are shown in Fig. 2, and in Table 2 we report the performance and fairness metrics. We observe that more ships are predicted correctly compared to the baseline model. The recall is higher, meaning that many actual positives are predicted as such. This comes with decreased precision, indicating that more compliant ships are predicted as noncompliant. However, the harmonic mean of the two, i.e., the F 1 measure, is higher than that of the baseline model, indicating that the random forest classifier outperforms the baseline model. This finding is supported by the AUC Y measure, showing a value of 0.814 ± 0.004. This implies that we can accurately assess ship noncompliance in an automated fashion with a random forest classifier, using behavioural data. This answers our first research question. The confusion matrices show that ships with a white flag are predicted to be noncompliant more often than ships with a non-white flag. The difference in frequency also results in a higher recall for ships with a white flag. All in all, the prediction is much fairer than that of the baseline model. Since the random forest classifier does not use the flag as a feature, this means that using only behavioural data makes the model fairer.

Fair random forest classifier
Below we list our observations. First, from the confusion matrices in Fig. 2 and the performance and fairness metrics in Table 2 we observe that the fair random forest classifier has more comparable true positive and true negative rates amongst ships flying a white and a non-white flag, with only a slight cost in predictive performance on the target. Second, in terms of the F 1 measure, we observe that the performance drops only for the ships flying a white flag, such that the difference between the two groups becomes very small. Third, we observe that the demographic parity and equalised odds measures decrease when using a fair random forest classifier, suggesting that the classifier was able to improve fairness.
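The group-wise comparison above can be sketched as follows. The exact definitions of the fairness measures are not restated in this section, so the sketch assumes common difference-based forms: demographic parity as the absolute gap in the fraction of ships predicted noncompliant, and equalised odds as the larger of the gaps in true positive rate and false positive rate. All data and function names are hypothetical.

```python
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Selection rate, true positive rate and false positive rate in one group."""
    y_true, y_pred = y_true[mask], y_pred[mask]
    selection = y_pred.mean()
    tpr = y_pred[y_true == 1].mean()
    fpr = y_pred[y_true == 0].mean()
    return selection, tpr, fpr

def fairness_gaps(y_true, y_pred, s):
    """Assumed definitions: absolute differences between the two flag groups."""
    sel0, tpr0, fpr0 = group_rates(y_true, y_pred, s == 0)
    sel1, tpr1, fpr1 = group_rates(y_true, y_pred, s == 1)
    demographic_parity = abs(sel0 - sel1)
    equalised_odds = max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))
    return demographic_parity, equalised_odds

# Hypothetical toy data: y = 1 noncompliant, s = 1 non-white flag.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

dp, eo = fairness_gaps(y_true, y_pred, s)
print(f"demographic parity gap = {dp:.2f}, equalised odds gap = {eo:.2f}")
```

Under these assumed definitions both measures are zero when the two groups are treated identically, so lower values indicate a fairer prediction.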
Before drawing any conclusion, we show the effect of the orthogonality parameter in more detail (see Fig. 3). The top left figure shows that the AUC Y measure is only weakly influenced over a broad range of values of the orthogonality parameter, meaning that overall, we can reliably ensure equality of outcome while maintaining acceptable performance. An orthogonality value of 0.7 appears to give the best trade-off between performance and fairness in our work, with a performance of AUC Y = 0.776 ± 0.008 and fairness of AUC S = 0.538 ± 0.011. The performance can be improved further (although only slightly, to AUC Y = 0.814), but only at the cost of decreased equality of outcome, and vice versa.
Next, we take a closer look at Fig. 3(B), where the two fairness measures decrease monotonically with increasing orthogonality. At the extreme orthogonality value of 1 they are zero, but at this value the predictive performance is also very low, as can be observed in Fig. 3(A).
Subsequently, in Figs. 3(C) and 3(D) we observe that the precision and recall for ships flying a white and non-white flag have only small differences for larger values of the orthogonality. The precision of the ships flying a non-white flag increases slightly at higher values of the orthogonality, at the cost of precision for ships with a white flag. To calculate these values, the threshold was set to t = 0.34 in such a way that P(Z ≥ t) is equal to P(Y = 1). This threshold is also used to calculate the confusion matrix shown in Fig. 2.
Here we remark that the threshold t is important, as it determines how many ships are selected as being noncompliant. Higher values of the threshold result in fewer ships that are predicted as being noncompliant. Therefore, we define the threshold quantile Q t in such a way that P(Z ≥ t) equals the threshold quantile.
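The threshold rule described above, where t is chosen such that P(Z ≥ t) matches a target fraction, can be sketched with NumPy as follows. The scores, the labels, and the helper name `threshold_for_fraction` are hypothetical; with a finite sample and tied scores the match is only approximate.

```python
import numpy as np

def threshold_for_fraction(scores, fraction):
    """Score threshold t such that about `fraction` of the scores are >= t."""
    return np.quantile(scores, 1.0 - fraction)

# Hypothetical classifier scores Z and labels Y with P(Y = 1) = 0.3.
scores = np.linspace(0.05, 0.95, 10)
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

# Select as many ships as there are noncompliant ships: P(Z >= t) = P(Y = 1).
t = threshold_for_fraction(scores, y.mean())
predicted_noncompliant = scores >= t

print(f"t = {t:.2f}, fraction selected = {predicted_noncompliant.mean():.2f}")
```

Expressing the threshold through a fraction rather than a raw score makes different classifiers comparable, since each then selects the same number of ships for inspection.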
Finally, in Fig. 4 we show the effect of the orthogonality and the threshold quantile on the selected threshold-dependent fairness measures. We observe that high values of the orthogonality yield a fair prediction for all values of the threshold, even when the threshold quantile is set to a high value, such that most ships are predicted to be compliant. For lower values of the orthogonality, we observe that the fairness of the model is worst when the threshold quantile is near 0.5. This result is expected, as at other values of the threshold quantile the performance for both groups is low, leading to a small difference between the groups. Even at these 'bad' choices for the orthogonality and threshold quantile, the values of the demographic parity measure and the equalised odds measure are still lower than observed for the baseline ship risk profile.
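To illustrate why a threshold-dependent fairness measure tends to be worst for intermediate thresholds, the following sketch sweeps the fraction of selected ships on hypothetical scores that are strongly correlated with the flag. The selection-rate gap used here is an assumed, difference-based form of demographic parity, and the data are made up for the example.

```python
import numpy as np

# Hypothetical scores: ships in group 1 (non-white flag) receive higher scores.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def parity_gap(scores, s, selected_fraction):
    """Gap in selection rate between the groups when a given fraction is selected."""
    t = np.quantile(scores, 1.0 - selected_fraction)
    selected = scores >= t
    return abs(selected[s == 1].mean() - selected[s == 0].mean())

gaps = {q: parity_gap(scores, s, q) for q in (0.125, 0.5, 0.875)}

for q, gap in gaps.items():
    print(f"selected fraction {q:.3f}: selection-rate gap = {gap:.2f}")
```

When almost no ship or almost every ship is selected, both groups are treated nearly alike and the gap is small; the gap peaks when about half of the ships are selected.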
From all these observations and results we may conclude that the fair random forest classifier is effective in reducing bias towards the flag of a ship, for wide ranges of the used threshold and orthogonality. This answers the second research question.

Discussion
In this section we discuss two limitations of our proposed classifier. The first limitation concerns the ground truth. It might be biased towards the flag, as well as towards the inspector's background [12]. The problem is that different inspectorates judge compliance differently for similar ships. This difference in judgement leads to inequality between ports and to so-called port-shopping. Port-shopping denotes a situation in which a noncompliant ship decides to go to another port solely because the inspection regime there is more favourable to noncompliant ships. In this way the ship obtains a lower risk profile. Port-shopping influences our model, since the ground truth data is unjustly positive for such noncompliant ships. One of the reasons for the existence of the Paris MoU is to avoid this kind of competition between ports [2]. Hence, in future work, the country of inspection could also be added as a sensitive attribute, which can reduce the correlation between the inspectorate and the inspection outcome.
The second limitation of this work is conceptually related to Goodhart's law, commonly formulated as: "When a measure becomes a target, it ceases to be a good measure" [49]. This is applicable to any ship risk model, because ships have an incentive to obtain a low-risk profile. In the baseline ship risk model, a better risk profile could be achieved by changing an administrative property of the ship. In our proposed classifier, ships would need to change their behaviour to get a better score, which is substantially harder to achieve than merely changing administrative properties.

Conclusions
The aim of the present research was to devise an automated, accurate and fair ship risk classification approach. The study has led to two conclusions. First, by using a fair random forest classifier, we can offset the confirmation bias present in historical inspection data. Experimental results indicate that the demographic parity and equalised odds measures improve significantly, regardless of the chosen parameters, meaning that the constructed classifier works well. Second, we may conclude that the performance of our approach provided with behavioural data is AUC Y = 0.776 ± 0.008, clearly improving on the AUC Y = 0.543 ± 0.006 of the ship risk profile currently in use. Hence, our work supports global efforts to minimise the risks associated with maritime transport by conducting more targeted inspections. More generally, we have shown how ubiquitous mobility information can be used to perform inspections in a better and fairer way. The devised approach may also be applicable to inspection settings beyond port state control.
A natural progression of this work is to determine with domain experts what behaviour is often associated with high risk and subsequently reduce this risky behaviour. A second possible direction for future work is to consider higher-order effects in the cargo ship network [50]. The construction of a higher-order network allows for a more accurate representation of the complex underlying system, which in turn may enable more accurate network analysis results. It has been shown that relations up to the fifth order may be relevant in cargo shipping networks [50]. Finally, the temporal aspect of the network can be exploited to obtain a better, more accurate centrality measure of the true, time-aware structural importance of the ports [51], thereby potentially resulting in an even better performing classifier for the task at hand.

Appendix
This appendix consists of three figures providing more insight into the data used in this work. First, Fig. A1 shows, for each country, the fraction of visiting ships that are noncompliant. We remind the reader that a ship is noncompliant if at least one deficiency has been found during 2014-2018 (see Data section). We observe that this fraction differs strongly across countries in Europe.
Second, in Fig. A2 the number of ships registered to each country is shown. Although difficult to observe, most ships are registered to Panama (2,904), the Marshall Islands (2,153), and Liberia (2,119). These flag states are typically known as 'flags of convenience', which is explained in the Data section.
Finally, in Fig. A3, for each flag, the fraction of noncompliant ships is shown. Some non-white flags are indeed associated with a large fraction of noncompliant ships. Conversely, some of the white flag states, such as the United States of America, also have many noncompliant ships. In our online repository [41], these figures can be downloaded at a higher resolution.