Analyzing The Blockchain with OpenRelay Lens

The Ethereum blockchain has a ton of data. Every transaction that has ever happened is available for all to see. But the way this data is organized often makes it hard to find the information you want. With a standard Ethereum client, there are two main ways to approach getting the data you want.

State Lookups

If you want to know an account’s balance of an ERC20 token, you can quickly do a state lookup of token.getBalance(account). You have to specify the token you’re interested in and the account you want the balance of, and it will return the numeric balance of that account. That’s pretty easy.

But what if you want to know all the tokens an holds, and their balance for each of those tokens? Here, things get tricky. An account’s balance isn’t really a property of the account, it’s a property of the token contract. If you wanted to know all the tokens a user holds, you would have to do thousands of state lookups just to see if they hold each specific token.

Event Logs

An alternative approach is to use Event Logs. When certain actions happen on the Ethereum blockchain, events get fired. Events are indexed with the block they were included in, making it very easy to check whether a particular block includes an event that matches certain parameters. So if, for example, you wanted to know whether a given account transferred any tokens in a given block or range of blocks, you can easily construct a query that will check each block for ERC20 transfers in that block.

But again, if you want to know all the tokens a particular account holds, things get messy. You can scan the blockchain for events indicating that the user in question received tokens, but the blockchain has had over 7,100,000 blocks, and scanning all of them is very time consuming.

Introducing Lens

Simple questions like “What tokens does my account have” end up becoming incredibly complicated to answer with just an Ethereum client. So we started working on OpenRelay Lens as a new way to look at blockchain data.

Lens indexes Ethereum event logs into a conventional relational database. This allows us to query the data more efficiently than using state lookups or trolling through millions of blocks for events every time we want to answer a question.

So far we’ve indexed all ERC20 transfer events from blocks 0 - 7,078,246. Now we can run efficiently indexed queries to answer lots of questions about ERC20. With Lens, we can answer the question “What tokens does this account have” in just a few milliseconds with a simple SQL query.

Some Insights

From our initial findings, we’ve discovered the following (all as of block 7,078,246).

84,363 unique tokens have had at least 1 transfer event
32,201 tokens have had only one recipient
65,617 tokens have had ten or fewer recipients
3 tokens have had more than a million recipients
139 tokens have had more than 100,000 recipients
OpenRelay’s Embiggen Token is in the top 20% of tokens by recipient count
28,840,270 Ethereum accounts have received at least one ERC20 token
19,850,842 Ethereum accounts have received only one distinct ERC20 token
27,438,863 Ethereum accounts have received fewer than 10 distinct ERC20 tokens
EtherDelta’s deposit contract has received 8,509 distinct tokens, giving it the record for the most distinct tokens
0x0000000000000000000000000000000000000000, which is often used as the “recipient” when a token is being burned, comes in at #2 with 2,129 distinct tokens.

Aggregate Data

If you want to get access to the full raw dataset, the methods we used are described below. If you’re interested in aggregate, you can grab the following CSVs from us:

Token Aggregates

erc20_token_analysis.csv.gz aggregates data by token address. It provides:

Number of distinct recipients by token
Number of distinct senders by token
Total number transactions by token

Recipient Aggregates

erc20_recipient_analysis.csv.gz aggregates data by recipient address. It provides the number of tokens received by each address. The provided file is substantially truncated. It includes the 1.4 million recipients that have received 10 or more distinct tokens.

The full aggregated dataset is available on IPFS at ipfs://QmVLjN3Jc9xRnEjAquP3tAS9f4R5PRRuPhZEy4EmfzHoWv

Be advised that it is a CSV with over 28 million records, sitting about 2 GB.

Some Shortcomings

Indexed event data still doesn’t give the whole picture. It can tell us who has had a given token, for example, but it can be hard to determine a balance that way. With many tokens, the balance is simply sum(tokens_received) - sum(tokens_sent), but our own Embiggen token is an example of a token where the balance can change without any events to indicate the change in balance. If you wanted to accurately report all the tokens a given account holds, the best approach would be to use an indexed database like Lens to find out what tokens a user has, then use an RPC client to do state lookups for the exact balance of those tokens.

Availability

As of right now, OpenRelay Lens is essentially just a database. We’re planning to build some APIs to make it available as a service as time permits, but those are a ways down the road. If you are interested in getting insights from OpenRelay Lens please contact us, and we will be happy to get you a quote on doing some analysis.

Open Source

As we build out tooling to support Lens, we will make it open source, allowing people to aggregate data from the blockchian on their own. As it stands, here is the process we used to load up the Lens database:

Massive

We made a branch of massive with a subcommand for pulling ETH logs. With that, we ran:

massive eth getLogs --fromBlock 0 --toBlock 7078246 --topic0 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef ~/.ethereum/geth.ipc | gzip > erc20_logs_0_7078246.gz

After several hours, this gave us a gzipped file with every ERC20 log from block 0 through 7,078,246 in a single file (about 166 million records).

From there we used this simple Python script to convert the JSON data into a CSV. Note that there will be some events that match the topic we filtered for that Massive that don’t follow the ERC20 transfer event schema, which will be dropped by this script.

import sys
import json

for line in sys.stdin:
    item = json.loads(line)
    try:
        sys.stdout.write("%s,%s,%s,%s,%s,%s\n", (item["address"], item["topics"][1], item["topics"][2], int(item["data"], 16), item["transactionHash"], item["blockHash"]))
    except (KeyError, IndexError, ValueError):
        pass

It took a couple of hours for a single Python process to churn through all 166 million JSON records and output the CSV.

Then we created a MySQL database with the following schema:

CREATE TABLE erc20 (
  address varchar(42),
  sender  varchar(66),
  recipient varchar(66),
  quantity varchar(78),
  txhash varchar(66),
  blockhash varchar(66)
);
CREATE INDEX address ON erc20 (address);
CREATE INDEX sender ON erc20 (sender);
CREATE INDEX recipient ON erc20 (recipient);
CREATE INDEX address_sender ON erc20 (address, sender);
CREATE INDEX address_recipient ON erc20 (address, recipient);

And then used a LOAD DATA query to load the data from the CSV file into the database. It took a couple of days for our little database to ingest all this data, but now we’re ready to query it as much as we like.

As the service evolves, we will likely write a microservice to monitor the blockchain on an ongoing basis to populate the database, but for now we can simply repeat the process with updated block ranges to get up-to-date information.

Feedback

As we look to making externally accessible APIs to the Lens database, we’re interested in what views of this data you’d like to have. Let us know what data you’d be interested, and we’ll see if we can make it easy to access. We will be monitoring interest as a gauge for prioritizing further work on Lens.