The Ethereum blockchain has a ton of data. Every transaction that has ever happened is available for all to see. But the way this data is organized often makes it hard to find the information you want. With a standard Ethereum client, there are two main ways to approach getting the data you want.
If you want to know an account’s balance of an ERC20 token, you can quickly do a state lookup of token.balanceOf(account). You specify the token you’re interested in and the account you want the balance of, and it returns the numeric balance of that account. That’s pretty easy.
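Under the hood, that state lookup is an eth_call JSON-RPC request against the token contract. As an illustrative sketch (the token and account addresses here are made up, and the helper name is ours), the payload looks like this:

```python
import json

# The 4-byte function selector for the standard ERC20 balanceOf(address).
BALANCE_OF_SELECTOR = "0x70a08231"

def balance_of_call(token, account):
    """Build a JSON-RPC eth_call payload asking `token` for `account`'s balance."""
    # ABI encoding: selector, then the 20-byte address left-padded to 32 bytes.
    data = BALANCE_OF_SELECTOR + account.lower().replace("0x", "").rjust(64, "0")
    return {
        "jsonrpc": "2.0",
        "method": "eth_call",
        "params": [{"to": token, "data": data}, "latest"],
        "id": 1,
    }

payload = balance_of_call(
    "0xa1c7d450130bb77c6a23ddfaecbc4a060215e5ce",  # hypothetical token address
    "0x1111111111111111111111111111111111111111",  # hypothetical account
)
print(json.dumps(payload))
```

A node answers with a single 32-byte word: the balance. The key property is that you must already know which token contract to ask.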
But what if you want to know all the tokens an account holds, and its balance for each of those tokens? Here, things get tricky. An account’s balance isn’t really a property of the account; it’s a property of the token contract. If you wanted to know all the tokens a user holds, you would have to do thousands of state lookups, one per token contract, just to see if they hold each specific token.
An alternative approach is to use Event Logs. When certain actions happen on the Ethereum blockchain, events get fired. Events are indexed with the block they were included in, making it very easy to check whether a particular block includes an event that matches certain parameters. So if, for example, you wanted to know whether a given account transferred any tokens in a given block or range of blocks, you can easily construct a query that will check each block for ERC20 transfers in that block.
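A sketch of such a query, using the standard ERC20 Transfer event signature as topic0 and the indexed recipient in topic position 2 (the helper and addresses are ours, for illustration):

```python
# topic0 for the ERC20 Transfer(address,address,uint256) event.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def pad_address(address):
    """Left-pad a 20-byte address to the 32-byte form used in log topics."""
    return "0x" + address.lower().replace("0x", "").rjust(64, "0")

def transfers_to(account, from_block, to_block):
    """Build an eth_getLogs filter for Transfer events whose recipient is `account`."""
    return {
        "fromBlock": hex(from_block),
        "toBlock": hex(to_block),
        # topics[0] is the event signature; topics[2] is the indexed recipient.
        # None in position 1 matches transfers from any sender.
        "topics": [TRANSFER_TOPIC, None, pad_address(account)],
    }

f = transfers_to("0x1111111111111111111111111111111111111111", 7078200, 7078246)
print(f["topics"][2])
```

Passing this filter to eth_getLogs returns every matching Transfer event in the range, regardless of which token contract emitted it.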
But again, if you want to know all the tokens a particular account holds, things get messy. You can scan the blockchain for events indicating that the user in question received tokens, but the chain now contains over 7,100,000 blocks, and scanning all of them is very time-consuming.
Simple questions like “What tokens does my account have?” end up becoming incredibly complicated to answer with just an Ethereum client. So we started working on OpenRelay Lens as a new way to look at blockchain data.
Lens indexes Ethereum event logs into a conventional relational database. This allows us to query the data far more efficiently than doing state lookups or trawling through millions of blocks of events every time we want to answer a question.
So far we’ve indexed all ERC20 transfer events from blocks 0 - 7,078,246. Now we can run efficiently indexed queries to answer lots of questions about ERC20. With Lens, we can answer the question “What tokens does this account have?” in just a few milliseconds with a simple SQL query.
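Against the erc20 table schema described below, that query is essentially a DISTINCT lookup on the recipient index. A self-contained sketch, with SQLite standing in for our MySQL database and made-up sample rows:

```python
import sqlite3

# A miniature version of the Lens erc20 table (SQLite standing in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE erc20 (
    address TEXT, sender TEXT, recipient TEXT,
    quantity TEXT, txhash TEXT, blockhash TEXT)""")
conn.execute("CREATE INDEX recipient_idx ON erc20 (recipient)")

# A few sample transfer rows (all values are made up for illustration).
rows = [
    ("0xtokenA", "0xalice", "0xbob", "100", "0xt1", "0xb1"),
    ("0xtokenB", "0xalice", "0xbob", "5",   "0xt2", "0xb1"),
    ("0xtokenA", "0xcarol", "0xbob", "7",   "0xt3", "0xb2"),
    ("0xtokenC", "0xbob",   "0xdan", "1",   "0xt4", "0xb2"),
]
conn.executemany("INSERT INTO erc20 VALUES (?, ?, ?, ?, ?, ?)", rows)

# "What tokens has this account received?" hits the recipient index directly.
tokens = [r[0] for r in conn.execute(
    "SELECT DISTINCT address FROM erc20 WHERE recipient = ?", ("0xbob",))]
print(sorted(tokens))  # ['0xtokenA', '0xtokenB']
```

The same shape of query, run against the full 166-million-row table with its indexes, is what turns a blockchain-wide scan into a millisecond lookup.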
From our initial findings, we’ve discovered the following (all as of block 7,078,246).
- 84,363 unique tokens have had at least 1 transfer event
- 32,201 tokens have had only one recipient
- 65,617 tokens have had ten or fewer recipients
- 3 tokens have had more than a million recipients
- 139 tokens have had more than 100,000 recipients
- OpenRelay’s Embiggen Token is in the top 20% of tokens by recipient count
- 28,840,270 Ethereum accounts have received at least one ERC20 token
- 19,850,842 Ethereum accounts have received only one distinct ERC20 token
- 27,438,863 Ethereum accounts have received fewer than 10 distinct ERC20 tokens
- EtherDelta’s deposit contract has received 8,509 distinct tokens, giving it the record for the most distinct tokens
- 0x0000000000000000000000000000000000000000, which is often used as the “recipient” when a token is being burned, comes in at #2 with 2,129 distinct tokens.
If you want access to the full raw dataset, the methods we used to build it are described below. If you’re interested in aggregates, you can grab the following CSVs from us:
erc20_token_analysis.csv.gz aggregates data by token address. It provides:
- Number of distinct recipients by token
- Number of distinct senders by token
- Total number of transactions by token
erc20_recipient_analysis.csv.gz aggregates data by recipient address. It provides the number of tokens received by each address. The provided file is substantially truncated. It includes the 1.4 million recipients that have received 10 or more distinct tokens.
The full aggregated dataset is available on IPFS at
Be advised that it is a CSV with over 28 million records, weighing in at about 2 GB.
Indexed event data still doesn’t give the whole picture. It can tell us who has had a given token, for example, but it can be hard to determine a balance that way. With many tokens, the balance is simply sum(tokens_received) - sum(tokens_sent), but our own Embiggen token is an example of a token where the balance can change without any events to indicate the change in balance. If you wanted to accurately report all the tokens a given account holds, the best approach would be to use an indexed database like Lens to find out what tokens a user has, then use an RPC client to do state lookups for the exact balance of those tokens.
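A sketch of that two-step approach, with SQLite again standing in for the Lens database and the state lookups expressed as the eth_call payloads you would send to a node (all addresses and the helper are made up for illustration):

```python
import sqlite3

ACCOUNT = "0x1111111111111111111111111111111111111111"  # account of interest (made up)

# Step 1: ask the indexed database which tokens the account has ever received.
conn = sqlite3.connect(":memory:")  # stand-in for the Lens MySQL database
conn.execute("CREATE TABLE erc20 (address TEXT, sender TEXT, recipient TEXT)")
conn.executemany("INSERT INTO erc20 VALUES (?, ?, ?)", [
    ("0xtokenA", "0xalice", ACCOUNT),
    ("0xtokenB", "0xcarol", ACCOUNT),
])
candidates = [r[0] for r in conn.execute(
    "SELECT DISTINCT address FROM erc20 WHERE recipient = ?", (ACCOUNT,))]

# Step 2: for each candidate token, build the eth_call that asks the token
# contract itself for the exact, current balance (balanceOf(address) selector).
def balance_call(token, account):
    data = "0x70a08231" + account.lower().replace("0x", "").rjust(64, "0")
    return {"method": "eth_call",
            "params": [{"to": token, "data": data}, "latest"]}

calls = [balance_call(t, ACCOUNT) for t in candidates]
print(len(calls))  # one state lookup per candidate token, instead of thousands
```

The indexed query shrinks the candidate set, and the RPC calls give exact balances even for tokens (like Embiggen) whose balances change without events.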
As of right now, OpenRelay Lens is essentially just a database. We’re planning to build some APIs to make it available as a service as time permits, but those are a ways down the road. If you are interested in getting insights from OpenRelay Lens, please contact us, and we will be happy to get you a quote on doing some analysis.
As we build out tooling to support Lens, we will make it open source, allowing people to aggregate data from the blockchain on their own. As it stands, here is the process we used to load up the Lens database:
We made a branch of massive with a subcommand for pulling ETH logs. With that, we ran:
massive eth getLogs --fromBlock 0 --toBlock 7078246 --topic0 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef ~/.ethereum/geth.ipc | gzip > erc20_logs_0_7078246.gz
After several hours, this gave us a gzipped file with every ERC20 log from block 0 through 7,078,246 in a single file (about 166 million records).
From there we used this simple Python script to convert the JSON data into a CSV. Note that there will be some events that match the topic we filtered for in Massive but that don’t follow the ERC20 transfer event schema; those are dropped by this script.
import sys
import json

for line in sys.stdin:
    item = json.loads(line)
    try:
        sys.stdout.write("%s,%s,%s,%s,%s,%s\n" % (
            item["address"],         # token contract
            item["topics"][1],       # indexed sender
            item["topics"][2],       # indexed recipient
            int(item["data"], 16),   # transfer quantity
            item["transactionHash"],
            item["blockHash"],
        ))
    except (KeyError, IndexError, ValueError):
        pass
It took a couple of hours for a single Python process to churn through all 166 million JSON records and output the CSV.
Then we created a MySQL database with the following schema:
CREATE TABLE erc20 (
    address varchar(42),
    sender varchar(66),
    recipient varchar(66),
    quantity varchar(78),
    txhash varchar(66),
    blockhash varchar(66)
);
CREATE INDEX address ON erc20 (address);
CREATE INDEX sender ON erc20 (sender);
CREATE INDEX recipient ON erc20 (recipient);
CREATE INDEX address_sender ON erc20 (address, sender);
CREATE INDEX address_recipient ON erc20 (address, recipient);
And then used a LOAD DATA INFILE query to load the data from the CSV file into the database. It took a couple of days for our little database to ingest all this data, but now we’re ready to query it as much as we like.
As the service evolves, we will likely write a microservice to monitor the blockchain on an ongoing basis to populate the database, but for now we can simply repeat the process with updated block ranges to get up-to-date information.
As we look toward making externally accessible APIs for the Lens database, we’re interested in what views of this data you’d like to have. Let us know what data you’d be interested in, and we’ll see if we can make it easy to access. We will be monitoring interest as a gauge for prioritizing further work on Lens.