This study utilizes a Bayesian approach to identify Bitcoin users by analyzing transaction data and propagating messages through the network. It explores the behavior-based clustering of Bitcoin addresses and connects users to IP addresses to uncover the identities behind Bitcoin transactions.
A Bayesian Approach to Identify Bitcoin Users Péter L. Juhász, József Stéger, Dániel Kondor, Gábor Vattay
Author Affiliations • Péter L. Juhász • Dept. of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary • Business Unit IT & Cloud Products, Ericsson Telecommunications, Budapest, Hungary • József Stéger • Dept. of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary • Dániel Kondor • Dept. of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary • SENSEable City Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA • Gábor Vattay • Dept. of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary
Complex Systems is a new field of science studying how parts of a system give rise to the collective behaviors of the system, and how the system interacts with its environment. Social systems (people), the brain (neurons), molecules (atoms), and the weather (air flows) are all examples of complex systems. The field of complex systems cuts across all traditional disciplines of science, as well as engineering, management, and medicine. It focuses on certain questions about parts, wholes and relationships.
Introduction: Blockchain • Originated for digital currency • 2008: “Bitcoin: A Peer-to-Peer Electronic Cash System” • Idealized as decentralized exchange system • Replace third party with public ledger system • Distributed system • Cryptographic • Proof of work by miners • Double spending paradigm • Immutable
Introduction: Blockchain (cont.) • Block added to chain after verification from nodes in network • Transaction data • Hash for the block • Hash from the previous block • Multistep transaction verification process • Establish private key (to send) • Establish public key (to receive) • Initiate transaction • Send transaction to blockchain network • Miners verify transaction • Miners add new block (record) to network
https://medium.com/@matteozago/50-examples-of-how-blockchains-are-taking-over-the-world-4276bf488a4b
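The block structure listed above (transaction data, the block's own hash, and the previous block's hash) can be illustrated with a toy hash chain. This is a minimal sketch, not Bitcoin's actual block format: the JSON serialization, the single SHA-256 pass, and the names `block_hash`, `append_block`, and `is_valid` are assumptions made for illustration.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(block):
    """Hash the block contents (toy serialization, not Bitcoin's header format)."""
    payload = json.dumps(
        {"transactions": block["transactions"], "prev_hash": block["prev_hash"]},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions):
    """Link a new block to the current tip of the chain via the tip's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {"transactions": transactions, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain):
    """Check every stored hash and every link to the previous block."""
    prev_hash = GENESIS_PREV_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash or block["hash"] != block_hash(block):
            return False
        prev_hash = block["hash"]
    return True
```

Changing any transaction in an earlier block changes that block's hash, so the stored hash and every later link stop matching. This is the sense in which the ledger is described as immutable.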
Introduction: Bitcoin • 1 BTC = 10,378 USD (as of 9/7/2019) • Not regulated • Not universally recognized as legal tender • Block reward • Started at 50 in 2009 • Halved every 210,000 blocks (roughly every 4 years) • Was 25 in 2013, by 2020 will be 6.25 • Upper limit of 21 million in circulation • Proof of work = find 64 digit hexadecimal number (hash) that is less than or equal to the target hash • Difficulty level as of 9/2019 is roughly 1 in 6 trillion • Difficulty increases as more miners join network, level is adjusted every 2 weeks • Application Specific Integrated Circuits (ASIC) https://www.investopedia.com/terms/b/bitcoin-mining.asp
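The halving schedule and the 21 million cap above are linked by simple arithmetic, which can be sketched as follows. The constant and function names are illustrative; amounts are counted in satoshi (1 BTC = 100,000,000 satoshi), and the halving is modeled as an integer right shift, which is how the reference client is understood to implement it.

```python
COIN = 100_000_000            # satoshi per bitcoin
HALVING_INTERVAL = 210_000    # blocks between halvings
INITIAL_SUBSIDY = 50 * COIN   # block reward at height 0, in satoshi


def block_subsidy(height):
    """Block reward in satoshi at a given block height."""
    halvings = height // HALVING_INTERVAL
    if halvings >= 64:
        return 0
    return INITIAL_SUBSIDY >> halvings  # integer halving, rounds down


def total_supply():
    """Sum the rewards of all blocks until the subsidy reaches zero."""
    total = 0
    height = 0
    while block_subsidy(height) > 0:
        total += HALVING_INTERVAL * block_subsidy(height)
        height += HALVING_INTERVAL
    return total


if __name__ == "__main__":
    print(block_subsidy(630_000) / COIN)  # reward after the third halving
    print(total_supply() / COIN)          # stays just below 21 million
```

Because the reward is a geometric series starting at 50 BTC per block over 210,000 blocks, the total converges to just under 2 × 50 × 210,000 = 21,000,000 BTC.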
Background • Bitcoin addresses vs. bank account numbers • Numerous Bitcoin addresses • Multiple source/destination addresses in single transaction • Source Bitcoin address, destination address, timestamp, and volume of Bitcoin are available in each block • Anonymity allows both legitimate and illegitimate uses for Bitcoin transactions
Related Work • Grouping Bitcoin addresses using behavior-based clustering (k-means, hierarchical agglomerative clustering) to bind Bitcoin addresses to real users • Analysis of transaction graphs, identifying clusters and components, distribution of user network • Connecting users to IP addresses by connecting to all publicly available Bitcoin nodes (servers) and noting the messages they were relaying; Bitcoin’s peer discovery mechanism was used to link transactions to their originators • Empirical studies of the peer-to-peer network that enables Bitcoin to operate, characterizing some of the important properties of information propagation • A method for inferring the topology of the peer-to-peer network based on the observation of the message propagation process
Research Methodology: Bayesian Method Overview • Propagating messages were recorded by several monitoring clients in order to cover as large a part of the network as possible. • For each transaction, the list of clients that relayed the transaction in the first time segment (the likely originators of the transaction) was recorded. • Each client was assigned a probability of being the originator, separately for each transaction. • The blockchain is then used to group the Bitcoin addresses owned by the same user. • When the same Bitcoin address appears in several transactions, grouping addresses by user allows measurements from multiple transactions to be combined, identifying users with higher confidence. • By combining the probabilities from the first step, users (and their balances) are paired with the clients that are most likely the originators of their transactions. • The clients can be geographically localized through their IP addresses, which allows the determination of the geographical distribution and flow of Bitcoins.
Research Methodology: Individual Probabilities • The probability that the originator relays the message to the monitoring client in one specific iteration: if the originator has c_orig clients connected to it (one of which is the monitoring client), then in every iteration it relays to the monitoring client with probability 1/c_orig. • A mediator client relays the transaction to the monitoring client with probability 1/c_med through trickling, and with probability 1/4 if the other relay mechanism is used. • The probability that a specific client relays the transaction in the k-th iteration follows a geometric distribution.
Research Methodology: Individual Probabilities (cont.) • To calculate the distribution of the number of iterations for the route through the mediator client, the sum of the two random variables has to be considered. This can be derived from the discrete convolution of the above two distributions. • As every ordinary client initiates 8 outgoing connections when connecting to the network, the number of connections is estimated to be 16 (taking into account the incoming connections). • If the two routes are considered independent of each other, the probability of the direct route being shorter (i.e. requiring fewer iterations) is 0.5785. • This estimate does not take into account the network delay, nor the fact that multiple indirect routes, possibly consisting of more steps, can exist from the originator to the monitoring client. • The goal is to determine a time frame that the monitoring client has to wait, after first receiving the message, until it receives the message directly from the originator (if directly connected). • For example, if this waiting time is defined to be 2 sec, the model gives a probability of 0.8841 that the direct route arrives within that window. This is considered the first time segment (t1 = 2 s).
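The two probabilities quoted above can be reproduced numerically with a short sketch of the two-route model. Several details are assumptions not stated in the slides: the mediator's two relay mechanisms are combined per iteration as 1/c_med + (1 − 1/c_med)·1/4, the originator reaches the mediator with probability 1/c_orig per iteration, one iteration is taken to last 100 ms (so a 2 s window is 20 iterations), and the geometric distributions are truncated at `n_max` iterations. All function names are illustrative.

```python
def geometric_pmf(p, n_max):
    """pmf[k] = P(first relay happens in iteration k) for k = 1..n_max; pmf[0] = 0."""
    return [0.0] + [p * (1.0 - p) ** (k - 1) for k in range(1, n_max + 1)]


def convolve(a, b, n_max):
    """Discrete convolution: pmf of the sum of two independent iteration counts."""
    out = [0.0] * (n_max + 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b[: n_max - i + 1]):
            out[i + j] += a_i * b_j
    return out


def prob_direct_first(c_orig=16, c_med=16, wait_iterations=0, n_max=600):
    """P(direct route needs fewer iterations than the mediated route + wait)."""
    p_orig = 1.0 / c_orig
    # ASSUMPTION: trickling (1/c_med) or, failing that, the 1/4 mechanism.
    p_med = 1.0 / c_med + (1.0 - 1.0 / c_med) * 0.25

    direct = geometric_pmf(p_orig, n_max)
    # Mediated route: originator -> mediator, then mediator -> monitor.
    indirect = convolve(
        geometric_pmf(p_orig, n_max), geometric_pmf(p_med, n_max), n_max
    )

    # Cumulative distribution of the direct route: direct_cdf[k] = P(X <= k).
    direct_cdf, running = [], 0.0
    for value in direct:
        running += value
        direct_cdf.append(running)

    total = 0.0
    for y, p_y in enumerate(indirect):
        last_ok = min(max(y + wait_iterations - 1, 0), n_max)
        total += p_y * direct_cdf[last_ok]  # P(X < y + wait_iterations)
    return total


if __name__ == "__main__":
    print(round(prob_direct_first(), 4))                    # no waiting time
    print(round(prob_direct_first(wait_iterations=20), 4))  # 2 s at 100 ms/iteration
```

Under these assumptions the sketch returns values that agree with the 0.5785 and 0.8841 figures to four decimal places, which suggests (but does not prove) that the model here matches the one used by the authors.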
Research Methodology: Grouping Transactions • Every transaction can be assigned to a user by looking at the source Bitcoin addresses of the transaction. • To group addresses, consider that Bitcoin addresses appearing on the input side of the same transaction typically belong to the same user. • This can be used for grouping individual Bitcoin addresses. • When a Bitcoin address appears on the input side of several transactions, all input addresses of those transactions can be merged and assigned to the same user.
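The merging rule above amounts to finding connected components over addresses that share an input side. One common way to do this is a union-find (disjoint-set) structure, sketched below. The slides do not say which data structure the authors used, so the class `AddressGroups` and its methods are illustrative assumptions.

```python
class AddressGroups:
    """Union-find over Bitcoin addresses; each set approximates one user."""

    def __init__(self):
        self.parent = {}

    def find(self, addr):
        """Return the representative address of addr's group."""
        self.parent.setdefault(addr, addr)
        root = addr
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited address directly at the root.
        while self.parent[addr] != root:
            self.parent[addr], addr = root, self.parent[addr]
        return root

    def union(self, a, b):
        """Merge the groups containing addresses a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add_transaction(self, input_addresses):
        """All input addresses of one transaction are assumed to share an owner."""
        inputs = list(input_addresses)
        for addr in inputs:
            self.find(addr)  # register addresses, including lone inputs
        for other in inputs[1:]:
            self.union(inputs[0], other)

    def groups(self):
        """Return the current address groups as a list of sets."""
        by_root = {}
        for addr in list(self.parent):
            by_root.setdefault(self.find(addr), set()).add(addr)
        return list(by_root.values())
```

For example, after `add_transaction(["A", "B"])` and `add_transaction(["B", "C"])`, the addresses A, B, and C end up in one group because B appears on the input side of both transactions.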
Research Methodology: Combining Probabilities • The originator clients can be identified more reliably by combining the probabilities belonging to a user's transactions, thus obtaining a more decisive result. • This can be calculated with the naive Bayes classifier method. • The transactions, denoted by tx, assign probabilities to the clients (IP addresses), which indicate the likelihood that the client is the originator of the transaction. • The probabilities of an IP address related to the different transactions can be combined by naive Bayes classification, resulting in a row of combined probabilities. • The IP addresses are divided into two classes: “originator” and “non-originator”. • For each transaction, there can be at most one IP address in the originator class. • If a user used multiple IP addresses to create Bitcoin transactions, then after combining multiple transactions more than one IP address can be in the originator class in the final result.
Research Methodology: Combining Probabilities (cont.) • By applying the naive Bayes classifier, the combined probability of an IP address (IP_i) belonging to the originator class C_o is expressed in terms of the following quantities (the equation itself is not preserved in this transcript): • tx denotes the vector of all considered transactions • |A| is the average number of active clients over the considered transactions • m is the number of transactions • The naive Bayes classification can only be applied if the transactions provide conditionally independent probabilities. • It is assumed that Bitcoin users can be identified by a limited number of IP addresses they use when connected to the Bitcoin network.
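A generic way to combine per-transaction probabilities under the conditional independence assumption is sketched below. It is the textbook naive Bayes combination of posteriors that share a common prior, with the prior taken as 1/|A|. It is not necessarily the authors' exact expression. The function name, the clipping constant `eps`, and the choice of prior are assumptions made for illustration.

```python
import math


def combine_originator_probability(per_tx_probs, avg_active_clients, eps=1e-9):
    """Combine P(originator | tx_j) over m transactions for one IP address.

    ASSUMPTION: every transaction shares the prior P(C_o) = 1 / |A|, and the
    transactions are conditionally independent given the class.
    """
    prior = 1.0 / avg_active_clients
    m = len(per_tx_probs)

    # Unnormalized log-scores: P(C)^(1 - m) * prod_j P(C | tx_j) for each class.
    log_orig = (1 - m) * math.log(prior)
    log_non = (1 - m) * math.log(1.0 - prior)
    for p in per_tx_probs:
        p = min(max(p, eps), 1.0 - eps)  # ASSUMPTION: clip to avoid log(0)
        log_orig += math.log(p)
        log_non += math.log(1.0 - p)

    # Normalize over the two classes in a numerically stable way.
    top = max(log_orig, log_non)
    w_orig = math.exp(log_orig - top)
    w_non = math.exp(log_non - top)
    return w_orig / (w_orig + w_non)
```

With a single transaction the function returns the input probability unchanged. With two transactions that each give an IP address probability 0.5, and |A| = 1000, the combined value rises to 0.999, since each observation is far above the 0.001 prior.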
Research Methodology: Data Collection • The Bitnodes.io database provides the number of active IP addresses of the Bitcoin clients as a function of time. • Bitcoin clients were modified to connect to the network and monitor information about transactions relayed by connected clients. • The client's program code is open source, which made it possible to implement the monitoring client. • Monitoring clients logged the incoming messages along with the IP address of the sender client and the time of reception. • These messages contained the 256-bit hash code of the transactions that were relayed. Using this hash code, the Bitcoin addresses, the amount of Bitcoin sent and other information of interest can then be looked up in the blockchain.
Research Methodology: Data Collection (cont.) • Modified Bitcoin clients were installed simultaneously on 140 computers located in different parts of the world to monitor as large a part of the Bitcoin network as possible. • The monitoring clients were installed on computers integrated into PlanetLab, a system maintained for network communication research. • The data collection campaign took slightly more than two months, between 10/14/2013 and 12/20/2013. During this period 300 million records were obtained, in which 4,155,387 transactions and 124,498 IP addresses were identified. • The collected data was imported into a SQL database server. • All data used in the analysis is made publicly available by the Bitcoin users, as required by the Bitcoin protocol.
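The logged records (transaction hash, sender IP address, time of reception) can be reduced to the per-transaction lists of likely originators described in the method overview. The sketch below assumes a simple tuple layout, reception times in seconds on a common clock across monitoring clients, and the 2 s first time segment from the model. The function name and record format are illustrative and do not reflect the authors' database schema.

```python
from collections import defaultdict


def first_segment_relayers(records, t1=2.0):
    """Map each transaction hash to the IPs that relayed it within t1 seconds
    of its first observed reception.

    records: iterable of (tx_hash, sender_ip, reception_time_seconds).
    """
    events_by_tx = defaultdict(list)
    for tx_hash, sender_ip, reception_time in records:
        events_by_tx[tx_hash].append((reception_time, sender_ip))

    candidates = {}
    for tx_hash, events in events_by_tx.items():
        first_seen = min(time for time, _ in events)
        candidates[tx_hash] = sorted(
            {ip for time, ip in events if time - first_seen <= t1}
        )
    return candidates


if __name__ == "__main__":
    # Toy records using documentation-only IP addresses (192.0.2.0/24).
    toy_records = [
        ("tx1", "192.0.2.1", 10.0),
        ("tx1", "192.0.2.2", 11.5),
        ("tx1", "192.0.2.3", 14.0),  # outside the 2 s window
        ("tx2", "192.0.2.2", 20.0),
    ]
    print(first_segment_relayers(toy_records))
```

The output only lists candidate originators. Assigning a probability to each candidate is a separate step that this sketch does not attempt.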
Research Findings: Identification Metrics • Figure: Distribution of the probabilities assigned to the accepted user-IP address pairings • The majority of the probabilities are above 0.9; the average value of the pairings is 95.52% • Two peaks, at 0.952 and close to 1: • Clients that initiate relatively few transactions • Servers offering wallet services • The more transactions that can be taken into account, the higher the probability that can be assigned to a pairing. • 22,363 users were identified, and 1,797 IP addresses were assigned to them.
Research Findings: Calculating the Balances • Figure: Total balance of all identified users versus time • The steep drop in the total balance is possibly due to the significant increase in the exchange rate • −0.91 linear correlation coefficient between the total amount of Bitcoin owned by the identified users and the exchange rate • At the time of the measurement, 13,500,000 Bitcoin were in circulation • Figure: Exchange rate of Bitcoin versus time
Research Findings: Geographical Distribution of Bitcoin Distribution of Bitcoin clients
Recommendations for Future Work • Expand the number of monitored systems (those with the installed client program) from ~100 to several times more • Evaluate the methodology with other digital currencies • Evaluate the methodology with other blockchain systems outside of digital currency • Evaluate the methodology while intentionally introducing Bitcoin obfuscation tools (e.g. tumblers, coinjoins) for researcher-owned Bitcoin • Correlate the location of Bitcoin transactions with Tor nodes
Conclusions • Installed a modified Bitcoin client program on over a hundred computers, which recorded the propagating messages on the network that announced new transactions. • Based on the information propagation properties of these messages, developed a mathematical model using the naive Bayes classifier method to assign Bitcoin addresses to the clients that most likely control them. • As a result, mappings from Bitcoin addresses to IP addresses were identified. • Through the IP addresses of the clients, their geographical location could be determined, which enabled the spatial analysis of the distribution and flow of Bitcoin. • The method is cheap in terms of resources; the algorithms are easy to implement and can be combined with other Bitcoin-transaction-related information. • The monitoring clients do not need to be connected to other Bitcoin users in any detectable way (i.e. communication among them is trivially achieved outside the Bitcoin protocol), making it virtually impossible to reveal their monitoring activity.