Characterizing key agents in the cryptocurrency economy through blockchain transaction analysis

The cryptocurrency economy provides a comprehensive digital trace of human economic behavior: almost all cryptocurrency users’ activities are faithfully recorded in transactions on public blockchains. However, the user identifiers in the transaction records, i.e., blockchain addresses, are anonymous. That is, they cannot be associated with any real “off-chain” identity of actual users. Nonetheless, identifying the economic roles of the addresses from their past behaviors is still feasible. This paper analyzes Ethereum token transactions, characterizes key economic agents’ behavior from their transaction patterns, and explores their identifiability through interpretable machine learning models. Specifically, six types of the most active economic agents are considered: centralized cryptocurrency exchanges, decentralized exchanges, cryptocurrency wallets, token issuers, airdrop services, and gaming services. Transaction patterns such as trading volume, transaction tempo, and structural properties of transaction networks are defined for individual blockchain addresses. The results show that cryptocurrency exchanges and online wallets have signature behavior patterns and hence can be accurately distinguished from other agents. Token issuers, airdrop services, and gaming services can sometimes be confused with one another. Moreover, transaction network features provide the richest information for identifying the economic agents.


Introduction
The cryptocurrency economy is a complex yet transparent socioeconomic system. Bitcoin, Ethereum, and more than 270,000 other cryptocurrencies and tokens have been issued on dedicated or host blockchains as of June 2020 [1]. The most common ways for users to obtain cryptocurrencies are through coin mining and trading in cryptocurrency exchanges or over the counter. Individual investors and venture capital institutions can also purchase tokens from business teams in exchange for shares of their projects or companies. The cryptocurrencies obtained by users can be further used as money, merchandise, equity, and gaming tokens in various economic activities.

Related work
Early efforts at cryptocurrency address de-anonymization were mainly based on heuristic address clustering on blockchains that use the Unspent Transaction Output (UTXO) data model, e.g., Bitcoin. Two typical examples are the multiple input and change address heuristics [7]. The multiple input heuristic assumes that in a Bitcoin transaction with more than one input address, the input addresses are highly likely to belong to the same user. The change address heuristic assumes that when a transaction has multiple outputs, one of the outputs could be a change address, which belongs to the initiator of the transaction. Once the addresses are clustered, the addresses inside the same cluster can be considered to bear the same identity [7]. Heuristic methods are useful but prone to error. For example, 156,722 addresses were successfully associated with the largest cryptocurrency exchange, Mt. Gox, using Bitcoin transactions up to 2012 [5]. However, only approximately 69% of the addresses can be correctly associated with individual end users [8].
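As an illustration of the multiple input heuristic described above, the following minimal sketch merges addresses that appear together as inputs of one transaction, using a union-find structure. It is not the implementation used in the cited works; the function name `cluster_by_multiple_input` and the toy addresses are hypothetical.

```python
from collections import defaultdict


def cluster_by_multiple_input(transactions):
    """Group addresses that co-occur as inputs of the same transaction.

    `transactions` is an iterable of input-address lists, one per transaction.
    Returns a list of sets; each set is one cluster of addresses.
    """
    parent = {}

    def find(addr):
        # Walk up to the cluster representative, shortening the path on the way.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # register single-input addresses as their own cluster
        for other in inputs[1:]:
            union(first, other)

    clusters = defaultdict(set)
    for addr in list(parent):
        clusters[find(addr)].add(addr)
    return list(clusters.values())


# Toy data (hypothetical addresses): tx1 spends from a1 and a2, tx2 from a2 and a3.
toy_inputs = [["a1", "a2"], ["a2", "a3"], ["b1"]]
clusters = cluster_by_multiple_input(toy_inputs)
```

Because a2 appears in both toy transactions, a1, a2, and a3 end up in one cluster, which also shows how a single wrongly merged transaction can propagate errors through the clustering.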
Another line of effort toward cryptocurrency address de-anonymization treats address identification as a classification problem. Machine learning algorithms are used to derive computational models from patterns extracted from transaction records. Transactional patterns used to describe a blockchain address include the amount, time, and frequency of its transactions; the cryptocurrency/token balance; and the active days [9][10][11]. Transaction networks can also be constructed among blockchain addresses, in which nodes are individual addresses or sets of addresses clustered by heuristics, and edges are transactions between addresses. The structural features of nodes in the networks include various centrality measures [10], motifs [12], and embeddings in vector spaces derived by network representation learning [13,14]. For smart contracts in Ethereum-like blockchains, their source code and bytecode are also useful features [15,16]. The above-mentioned features are effective in binary classification tasks, e.g., determining whether an address is a Ponzi scheme [15,17,18], phishing address [14], or other kind of scam [19,20], and in multi-class classification tasks, e.g., differentiating between cryptocurrency exchanges, gambling services, mining pools, and darknet markets [9][10][11][21]. This paper follows the latter research direction: we systematically define a spectrum of features in transaction patterns and explore the identifiability of several key agents in the cryptocurrency economy. In contrast to previous multi-class classification work, we not only report the precision of classification but also elaborate on the transactional patterns of the blockchain addresses with regard to their economic roles and explain in depth the reasons for their identifiability or lack thereof.

Blockchain data
As of June 2020, the Ethereum blockchain network stored the largest number of cryptocurrency transactions among all public blockchains. These transactions can be broadly classified into three types: ether (the native Ethereum currency) transfers, token transfers, and smart contract calls. The transactions of the more than 270,000 tokens account for 56% (414 million) of the total transactions (745 million). As many cryptocurrency economic activities, such as fundraising, deal only with tokens rather than ether, we use token transactions to study economic agent behaviors in this research.
A token transaction from one address to another is accomplished by invoking the transfer() function in the token smart contract with three parameters, namely from_address, to_address, and value, which stand for the sender, receiver, and amount of the transaction, respectively (Fig. 1). The token sender and receiver can each be either a user-owned address (Externally Owned Account, EOA) or a contract address (CoA). ERC20 is the most common standard for creating customized tokens on Ethereum. ERC20 tokens are fungible; that is, a token can be divided into small portions, which can circulate in the economy independently. All ERC20 token transactions up to June 2019 were obtained using an Ethereum blockchain client.

Figure 1
A schematic of an ERC20 token transaction. An invoker, i.e., the token sender, calls the transfer function in a token contract and passes three parameters, i.e., from_address, to_address, and value. This transaction is recorded on the blockchain if valid. The timestamp of the transaction is the height of the block in which it is logged. Addresses A and C are usually identical.

Known key agent identities
Although cryptocurrency addresses are technically anonymous, their identities are sometimes publicly disclosed online. For example, some users of forums such as Reddit and Bitcointalk post personal Bitcoin or Ethereum addresses in their forum profiles. Addresses owned by cryptocurrency exchanges, wallet services, and gambling services can be identified by proactively trading or interacting with them [7]. Online intelligence platforms, such as Walletexplorer.com [22] and Etherscan.io [1], post known labels for Bitcoin and Ethereum addresses and allow users to tag addresses that they can recognize. We collected 3364 labels from Etherscan.io and retained, as the study samples, addresses that belong to one of six key agent roles (centralized cryptocurrency exchange, decentralized exchange, wallet, token issuer, airdrop service, and gaming service) and that had participated in more than 100 transactions as of June 2019.
Centralized and decentralized exchanges are both cryptocurrency exchanges in which users can buy and sell different types of cryptocurrencies with fiat money or other cryptocurrencies. However, they differ significantly. In a centralized exchange, a seller first deposits tokens into the exchange's addresses and opens a sell order. The sell order is then matched with a buy order, either by the exchange or by the users themselves (over the counter). After clearing and settlement, the buyer can withdraw the tokens from the exchange. In this case, the exchange address serves as an escrow between the buyer and seller. Typical examples of centralized exchanges are Binance and Kraken. In contrast, users of a decentralized exchange deal with the exchange directly. A decentralized exchange maintains a pool of different cryptocurrencies and sets the listing prices algorithmically. Buyers buy tokens from the pool, and sellers sell tokens to the pool. Typical examples of decentralized exchanges are Bancor, KyberNetwork, and Uniswap.
For the remaining types, wallet stands for online cryptocurrency banking services in which users deposit their cryptocurrencies and tokens; token issuer stands for the addresses that were used to sell tokens to investors through fundraising activities, e.g., Initial Coin Offering, Initial Exchange Offering, and Security Token Offering; airdrop service stands for the addresses that disseminate tokens freely to cryptocurrency users for advertisement purposes; and gaming service stands for the addresses used by gambling or recreational gaming organizers.
As shown in Table 1, the transactions of the selected addresses span three years and have exchanged billions of USD worth of tokens. Therefore, we believe that these addresses with disclosed identities can be considered representative of the Ethereum ecosystem. Evidently, these six types are a non-exhaustive list of the key economic roles in the cryptocurrency economy. Some other major roles are also of interest. For example, mining pools mint all of the new native cryptocurrency in the blockchain system. However, they are not included in the current study because they are explicitly marked, i.e., their addresses are stated in each mined block and thus do not need further characterization.

Transaction feature extraction
Four groups of transaction features (Table 2) are considered when characterizing blockchain addresses. Volume and temporal features capture the patterns of the transactions in which the addresses directly participate. The transaction network's structural features capture the higher-order interaction patterns between the address and its counterparties.
Volume features include the mean, maximum, minimum, and total value of token transactions initiated and received by the node, respectively (giving eight variables), as well as the balance on an address. Token values are measured in US dollars using their daily exchange rates published on the online cryptocurrency market intelligence platform Coinmarketcap.com. If a token is not listed at the time of the transaction, its price is treated as 0. Temporal features include the mean, maximum, minimum, and standard deviation of the time intervals between the consecutive edges connecting to an address, i.e., the transactions initiated and received, giving another eight variables.
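The volume and temporal features described above can be sketched as follows. This is only an illustration under stated assumptions: transfers are represented as hypothetical `(sender, receiver, usd_value, block_height)` tuples, intervals are measured in block heights, the standard deviation is the population standard deviation, and the balance feature is not covered. The names `summarize`, `intervals`, and `volume_and_temporal_features` are introduced here for illustration.

```python
from statistics import mean, pstdev


def summarize(values, with_std=False):
    """Return mean/max/min plus either the total or the standard deviation."""
    if not values:
        return None
    stats = {"mean": mean(values), "max": max(values), "min": min(values)}
    if with_std:
        stats["std"] = pstdev(values)  # population std: an assumption
    else:
        stats["sum"] = sum(values)
    return stats


def intervals(heights):
    """Gaps between consecutive transactions, measured in block heights."""
    ordered = sorted(heights)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def volume_and_temporal_features(address, transfers):
    """`transfers` holds (sender, receiver, usd_value, block_height) tuples."""
    sent = [(v, h) for s, r, v, h in transfers if s == address]
    received = [(v, h) for s, r, v, h in transfers if r == address]
    return {
        "M_out": summarize([v for v, _ in sent]),
        "M_in": summarize([v for v, _ in received]),
        "I_out": summarize(intervals([h for _, h in sent]), with_std=True),
        "I_in": summarize(intervals([h for _, h in received]), with_std=True),
    }


# Toy transfers (hypothetical values) for address "a".
toy_transfers = [
    ("x", "a", 10.0, 100),
    ("y", "a", 30.0, 110),
    ("a", "z", 5.0, 120),
    ("a", "z", 0.0, 150),  # unlisted token, priced at 0 as described in the text
    ("a", "y", 25.0, 160),
]
features = volume_and_temporal_features("a", toy_transfers)
```

Note that pricing unlisted tokens at 0 pulls the minimum and mean of the volume features toward 0, which matters for agents that mostly handle unlisted tokens.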
We use a directed network $G = (V, E)$ to denote the transaction network constructed from token transfers. The set of nodes $V$ represents blockchain addresses. $E = \{(V_s, V_t, h)\}$ is the set of directed edges, in which each $e \in E$ represents an ERC20 token transfer from address $V_s$ to address $V_t$ in a block with height $h$. The block height can also be considered as the timestamp when the transaction occurs.
Network size features of a node include the numbers of incoming and outgoing edges, i.e., transactions, $T^{in}$ and $T^{out}$, the in-degree $N^{in}$, the out-degree $N^{out}$, and the size $N^{ego}$ of its 2-depth ego network. For each node $v$, its 2-depth ego network $G^{ego}_v$ is defined as the collection of nodes $V^{ego}_v$, including the center node $v$ and its direct and indirect neighbors that can be reached within a distance of 2, connected by the edge set $E^{ego}_v$. Duplicated edges between nodes are combined, i.e., there is at most one edge connecting a pair of nodes in the ego networks.
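A minimal sketch of the network size features, assuming edges are given as hypothetical `(sender, receiver, block_height)` tuples. Three choices here are assumptions not fixed by the text: in- and out-degree count distinct counterparties, the distance of 2 ignores edge direction, and the ego network size is its number of nodes. The function name `network_size_features` is introduced for illustration.

```python
from collections import defaultdict


def network_size_features(center, edges):
    """`edges` holds (sender, receiver, block_height) tuples, one per transfer."""
    t_in = sum(1 for s, t, _ in edges if t == center)   # received transactions
    t_out = sum(1 for s, t, _ in edges if s == center)  # initiated transactions
    n_in = len({s for s, t, _ in edges if t == center})   # distinct senders
    n_out = len({t for s, t, _ in edges if s == center})  # distinct receivers

    # Undirected adjacency, used only to find nodes within distance 2 (assumption).
    neighbors = defaultdict(set)
    for s, t, _ in edges:
        neighbors[s].add(t)
        neighbors[t].add(s)

    ego_nodes = {center}
    frontier = {center}
    for _ in range(2):  # two breadth-first steps give the 2-depth neighborhood
        frontier = {n for node in frontier for n in neighbors[node]} - ego_nodes
        ego_nodes |= frontier

    # Combine duplicated edges: keep one directed edge per ordered node pair.
    ego_edges = {(s, t) for s, t, _ in edges if s in ego_nodes and t in ego_nodes}

    return {
        "T_in": t_in,
        "T_out": t_out,
        "N_in": n_in,
        "N_out": n_out,
        "N_ego": len(ego_nodes),
        "ego_edges": len(ego_edges),
    }


# Toy chain u -> c -> w -> x -> y, with a duplicated transfer from u to c.
toy_edges = [("u", "c", 1), ("u", "c", 2), ("c", "w", 3), ("w", "x", 4), ("x", "y", 5)]
size_features = network_size_features("c", toy_edges)
```

In the toy chain, node y lies at distance 3 from c and is therefore excluded, and the two transfers from u to c collapse into one ego-network edge while still counting twice toward `T_in`.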
Network structural features of a node $v$ include the reciprocity $R$ between the node and its neighbors, i.e., the existence of bi-directional edges between two adjacent nodes; the clustering coefficient, i.e., the existence of edges between the node's neighbors, $C(v) = T(v) / \left(2\left[deg^{tot}(v)\left(deg^{tot}(v) - 1\right) - 2\,deg^{\leftrightarrow}(v)\right]\right)$, where $T(v)$ is the number of triangles that contain node $v$, $deg^{tot}(v)$ is the sum of its in-degree and out-degree, and $deg^{\leftrightarrow}(v)$ is its reciprocal degree; the density of its 2-depth ego network, $D^{ego} = m / \left(n(n-1)\right)$, where $n$ is the number of nodes in the ego network and $m$ is the number of edges in the ego network; and the reciprocity $R^{ego}$ of its 2-depth ego network.

Training process of machine learning models
We train each model five times and use the average accuracy, macro-precision, macro-recall, and macro-F1 as the final results. In each iteration, 100 addresses are randomly selected from each of the centralized exchange, decentralized exchange, wallet, token issuer, and airdrop service types, and all 25 gaming service addresses are used, to form the sample. The 2-depth ego networks are constructed using the transactions in the active period of the center node $v$. The samples are further divided into an 80% training set and a 20% testing set. Models are trained using stratified four-fold cross validation on the training set, i.e., 60% training and 20% validation, and further tested on the test set.
Considering that the number of gaming service nodes is far smaller than that of the other types of key agents, cost-sensitive learning is used to address the class imbalance problem. Specifically, we calculate the weight of each node as $w_i = k/p_i$, where $k = 1/(\text{number of types})$ and $p_i$ is the proportion, in the entire sample set used to train the model, of samples of the type to which sample $i$ belongs, e.g., 15/315 = 1/21 for gaming services.
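The weighting rule w_i = k/p_i can be worked through with a short sketch. The class sizes below (60 samples for each of the five larger types and 15 gaming services, 315 in total) are an assumption chosen to be consistent with the 15/315 example in the text; the function name `class_weights` is hypothetical. Exact fractions are used so the arithmetic is easy to check.

```python
from collections import Counter
from fractions import Fraction


def class_weights(labels):
    """Weight per class: w = k / p, with k = 1 / (number of classes)
    and p = the class's share of the training samples."""
    counts = Counter(labels)
    k = Fraction(1, len(counts))
    total = len(labels)
    return {label: k / Fraction(n, total) for label, n in counts.items()}


# Assumed training-fold composition: 5 types x 60 samples + 15 gaming services = 315.
train_labels = (
    ["centralized exchange"] * 60
    + ["decentralized exchange"] * 60
    + ["wallet"] * 60
    + ["token issuer"] * 60
    + ["airdrop service"] * 60
    + ["gaming service"] * 15
)
weights = class_weights(train_labels)
```

Under these assumed counts, each gaming service sample receives weight (1/6)/(1/21) = 3.5 and every other sample receives (1/6)/(4/21) = 0.875, so every class contributes the same total weight (52.5) to the training loss.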

Results
In this section, we first explore the signature features of different types of key economic agents and then use machine learning models to test the identifiability of the agents using their transactional behavior. Finally, we explore the importance of different features in identifying the agents.

Key agents' transaction patterns
The comparison of feature differentiability is shown in Fig. 2. Each data point (marked with a ·) is the min-max normalized, log-transformed median feature value for one type of economic agent. Plots of the feature value distributions in their original scales are provided in Additional file 1, Sect. 3. Some obvious differences between the types of agents are discussed in the following paragraphs; more subtle differences can be seen by inspecting the figure.
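One possible reading of the normalization used for the plotted data points is sketched below for a single feature. The details are assumptions, not taken from the paper: the median is taken per agent type, `log1p` is used so that zero medians stay finite, and min-max scaling is applied across agent types. The function name `normalized_log_medians` and the toy values are hypothetical.

```python
from math import log1p
from statistics import median


def normalized_log_medians(values_by_type):
    """Map each agent type to a value in [0, 1] for one feature.

    Steps: median per type, log(1 + x) transform, then min-max scaling
    across the agent types.
    """
    logged = {agent: log1p(median(vals)) for agent, vals in values_by_type.items()}
    low, high = min(logged.values()), max(logged.values())
    span = high - low
    return {
        agent: (value - low) / span if span else 0.0
        for agent, value in logged.items()
    }


# Toy feature values for three agent types; the medians are 0, 9, and 99.
toy_values = {
    "token issuer": [0, 0, 0],
    "wallet": [1, 9, 99],
    "centralized exchange": [99, 99, 5],
}
points = normalized_log_medians(toy_values)
```

With medians of 0, 9, and 99, the log-transformed values are 0, ln 10, and ln 100, so the normalized points fall at 0, 0.5, and 1; on the raw scale the middle value would sit at about 0.09, which illustrates why a log transform helps for heavy-tailed transaction features.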
Centralized exchanges (red lines) are distinguished by the volume of tokens transferred into them, e.g., $\tilde{M}^{in}_{max}$, $\tilde{M}^{in}_{std}$, and $\tilde{M}^{in}_{sum}$, their balances $\tilde{M}_{balance}$, the maximum time intervals between transactions $\tilde{I}^{in}_{max}$ and $\tilde{I}^{out}_{max}$, and the number of incoming edges $\tilde{N}^{in}$, i.e., the number of received transactions. These patterns indicate that centralized exchanges accept many incoming transactions from many users, and the received tokens tend to stay in the exchanges' addresses. However, the deposits to and withdrawals from the centralized exchanges are distributed unevenly over time, implying that exchange users' activities might be driven by rare market events.
Decentralized exchanges (orange lines) have a large number of incoming transactions $\tilde{T}^{in}$ and a large total volume of withdrawals $\tilde{M}^{out}_{sum}$. These features indicate that the users of decentralized exchanges tend to sell their tokens in many small transactions and buy in bulk. Decentralized exchanges are particularly distinguishable in terms of their network structural features: the reciprocity $\tilde{R}$, and the density $\tilde{D}^{ego}$ and reciprocity $\tilde{R}^{ego}$ of their 2-depth ego networks, are significantly larger than those of other types of agents. These features indicate that decentralized exchanges are more popular among sophisticated users who are likely to store tokens in their own blockchain addresses and regularly transfer tokens among themselves.
Token issuers (cyan lines) and airdrop services (blue lines) are both economic agents that disseminate tokens to investors. However, they exhibit different characteristics in their transaction behaviors. For token issuers, the level of activity, even in their most active period, is low. For airdrop services, since they give out tokens for free to a larger user group rather than sell tokens to investors, the standard deviation of received transaction intervals $\tilde{I}^{in}_{std}$, the number of initiated transactions $\tilde{T}^{out}$, and the out-degree $\tilde{N}^{out}$, i.e., the number of transaction recipients, are significantly larger than those of other key agents. Notably, the median values of the volume features of both the token issuer and airdrop addresses collected in our dataset are close to 0, largely because most of the tokens that had been disseminated did not reach cryptocurrency exchanges and hence were never priced.
Gaming addresses (purple lines) do not show any features that distinguish them from other key agents. Their network size features are similar to those of decentralized exchanges, while their network structural features and volume features are close to those of airdrop services and token issuers.

Model classification
The five models achieve considerably high predictive power on the collected dataset; see Table 3. The random forest classifier achieved the highest scores in accuracy (89.3%), macro precision (88.8%), macro recall (86.2%), and macro F1 (86.5%).
For each type of key agent, Table 4 shows the precision, recall, and F1 of the random forest classifier. Centralized exchange, decentralized exchange, and wallet addresses can all be accurately distinguished from other types of key agents, with precisions >90%. Airdrop services, which retain some characteristic transactional features, can be identified with >80% probability. However, token issuers and gaming services can only be identified with 70% precision due to the lack of distinguishable transactional features.
More specifically, Fig. 3 shows the confusion matrix of the random forest classifier predictions. Token issuer addresses are confused with airdrop services with 15% probability, while gaming services are misclassified as token issuers 36% of the time.
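For readers unfamiliar with macro-averaged scores, the sketch below derives accuracy, macro precision, macro recall, and macro F1 from a confusion matrix whose rows are true labels and whose columns are predictions. The matrix is a hypothetical three-class toy example, not data from this study, and macro F1 is taken here as the unweighted mean of the per-class F1 scores, which is one common convention.

```python
def macro_scores(confusion):
    """`confusion[i][j]` counts samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        true_pos = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n_classes))  # column sum
        actual = sum(confusion[c])                                   # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(confusion[c][c] for c in range(n_classes)) / total,
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1s) / n_classes,
    }


# Hypothetical confusion matrix for three classes.
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 4],
]
scores = macro_scores(toy_confusion)
```

Because macro averaging gives every class the same weight, a small class such as gaming services influences the macro scores as much as a large one, which is why they are informative for imbalanced samples.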

Analysis of informing features
Interpretable models such as the random forest (RF) provide quantitative descriptions of the importance of features, i.e., their predictive power, in classification tasks. Figure 4 shows the feature importance in the random forest classifier based on permutation importance. The permutation importance of a feature is defined as the decrease in the model score when the values of that single feature are randomly shuffled [24]. The larger the decrease, the higher the predictive power of the feature.
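The definition of permutation importance can be illustrated with a small, library-free sketch. The model here is a hand-written threshold rule standing in for a trained classifier, not a random forest, and the data are hypothetical; the function names `accuracy` and `permutation_importance` are introduced for illustration only.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    predictions = [predict(row) for row in rows]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[col] for row in rows]
            rng.shuffle(column)  # break the link between this feature and the label
            shuffled = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, column)
            ]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in classifier: uses feature 0 only and ignores feature 1."""
    return int(row[0] > 0.5)


toy_rows = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 7], [0.7, 2]]
toy_labels = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_model, toy_rows, toy_labels)
```

Shuffling feature 1 cannot change the toy model's predictions, so its importance is exactly zero, whereas shuffling feature 0 lowers the accuracy. Averaging over several shuffles reduces the noise introduced by any single permutation.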
Network size features, especially the out-degree $N^{out}$ and the size of the ego network $N^{ego}$, are ranked highest. The temporal feature $I^{out}_{mean}$, i.e., the average interval of sent transactions, is also ranked high. Moreover, transaction network structural features, such as the density $D^{ego}$ and reciprocity $R^{ego}$ of the 2-depth ego networks and the reciprocity $R$ of the target addresses, are also ranked high. Volume features did not show high feature importance in the model.
Following the same logic as permutation feature importance, we adopt a process similar to forward and backward feature selection to investigate the importance of groups of features. Table 5 shows the prediction results of the random forest classifier using different combinations of feature groups, with the same hyperparameter settings. It can be seen that using all groups of features achieves the highest macro F1 score. When using a single group of transaction features, network size features have the highest predictive power. When using two groups, the combination of network size and temporal features achieves the highest identifiability. When using three groups, that is, omitting one group of features, the combination that leaves out temporal features gives the lowest prediction performance, which indicates that the temporal features provide the information that is least correlated with the other groups of features in identifying key agents in the cryptocurrency economy.

Figure 3
The confusion matrix of the random forest classifier predictions. Rows are the labels of samples and columns are the predicted identities. Exchanges and wallets can be perfectly distinguished from other agents, while token issuers may be confused with airdrop services, and gaming services may be confused with token issuers.

Conclusion and discussion
Key agents are the most significant parties in the cryptocurrency economy. These very few addresses deal with most of the transactions stored in the blockchains. A full understanding of these entry points could lay a solid ground for future exploration of the behavior of other economic agents, such as marketplaces, merchandisers, and various illicit activities.
Cryptocurrency transactions that are publicly stored in blockchains offer a unique data source for the study of user behaviors in the cryptocurrency economy. In this article, we have extracted transaction patterns, e.g., transaction volumes and transaction time intervals, and transaction network structural features, e.g., the connectivity among blockchain addresses, to characterize and identify six types of key economic agents in the cryptocurrency economy, namely, centralized exchanges, decentralized exchanges, cryptocurrency wallets, token issuers, airdrop services, and gaming services.
Centralized exchanges, decentralized exchanges, and online cryptocurrency wallets all show distinctive features. Centralized exchanges act as escrow between buyers and sellers, and hence receive large amounts of deposits and hold large balances. Decentralized exchanges trade with users automatically and, therefore, show significantly higher reciprocity in their transaction network structure. Online wallet services can be considered cryptocurrency banks and, therefore, have a higher minimum value of withdrawal transactions. Token issuers and airdrop services both disseminate tokens to investors. However, since airdrop services give out tokens to a larger user group, they have a much larger number of outgoing transactions than token issuers. Gaming services typically receive many incoming transactions but do not show distinctive features compared to the other types of key agents. Machine learning algorithms trained on the extracted features showed strong predictive power for the six types of key agents, e.g., macro F1 = 0.865. The prediction results are robust to different sampling criteria and model hyperparameter settings. However, even though the exchanges and wallet services can be differentiated accurately from other types of key agents, token issuers, airdrop services, and gaming services can sometimes be confused with each other. Feature importance analysis has indicated that network size and structural features possess the highest predictive power for the key agents, while transaction temporal features provide the information that is most independent of all other groups of features.
However, the categorization of key cryptocurrency economic agents into six types can be further discussed. For example, decentralized exchanges and online wallets can be easily divided further among themselves, probably corresponding to different business models that the services adopt (see Sect. 4 in Additional file 1 for an exploratory plot).
The significance of blockchain technology is that all user activities are faithfully stored and accessible to the public, enabling any illicit activities, such as market manipulation in cryptocurrency exchanges, the hacking of online wallets, and cheating in games, to be immediately exposed to the public. Though many newly developed cryptocurrencies, e.g., Zcash and Monero, see this property as a weak link in the original Bitcoin design and try to conceal the traceability of transactions through cryptographic designs, we argue that the identifiability of cryptocurrency economy agent roles does not jeopardize the privacy and security features of cryptocurrency but rather reinforces the trustworthiness of the entire cryptocurrency economy. Understanding the economic roles associated with each blockchain address promotes users' confidence in their transaction counterparts and is thus the first step toward creating a fully transparent and self-regulated decentralized economy.

Funding
This work is supported by City University of Hong Kong (grant no. 7200649 and 6354050).

Availability of data and materials
Blockchain data are publicly available in the Ethereum database. Label data are publicly available on Etherscan.io and in our GitHub repository https://github.com/abcdefg3381/cryptocurrency_analysis. Price data are publicly available on coinmarketcap.com.

Competing interests
The authors declare that they have no competing interests.