Storing Data in Ethereum

Ethereum clients are designed to balance several goals.

First and foremost, they are participants in a peer-to-peer network. They get information from the network which they must validate according to a set of consensus rules. They also share this information with their peers, who in turn must validate the information they receive. Much of the data storage design of an Ethereum node is geared towards being able to validate the information clients receive from their peers quickly and efficiently, and make sure that all clients are in agreement with respect to the state of the blockchain.

Second, Ethereum clients often act as wallets, securely holding users’ keys. This function is increasingly being pushed up to a higher layer: tools like Geth’s Clef, sites like MyEtherWallet, and hardware devices like the Ledger Nano all serve as wallets that rely on a separate Ethereum client.

And finally, Ethereum clients provide dApps, wallets, traders, and many other use cases with access to information about the blockchain. Unfortunately, these use cases often want to access data in ways that the on-disk layout doesn’t make easy.

Data Types

Ethereum clients deal with several different types of data.

Block Data: The block itself has the core information that Ethereum clients share with each other. This includes the proof-of-work data, as well as Merkle tree roots for the current state, the transactions, and the transaction receipts that carry the log data.
State Data: The state root encoded in each block is essentially a hash of all the Ethereum accounts, their nonces, ETH balances, and contract state if applicable. The contract state holds contract-specific information, essentially as a sort of key-value store for each contract to store its information.
Transactions: Each block encodes a list of the transactions that were included in that block. At this level, these are the transactions as submitted by the users, without any information about what happened when they were executed.
Transaction Receipts: Transaction receipts provide information about what happened when a transaction executed. This includes information such as the amount of gas used by a transaction, whether the transaction succeeded or failed, and any logs emitted during the transaction.
Log Data: Aside from storing data, contracts can emit events to help external systems see what happened during the transaction. Logs are very open ended in terms of what they can be used for, but common use cases include things like token transfers.
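
To make the relationships between these data types concrete, here is a minimal sketch in Python. The class and field names are illustrative assumptions, not the structures any particular client uses, and most real header fields (including the proof-of-work data) are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Log:
    address: str       # contract that emitted the event
    topics: List[str]  # event signature plus indexed arguments
    data: str          # non-indexed event payload


@dataclass
class Receipt:
    gas_used: int
    succeeded: bool
    logs: List[Log] = field(default_factory=list)  # logs live inside receipts


@dataclass
class Transaction:
    sender: str
    to: str
    nonce: int
    value: int


@dataclass
class Account:
    nonce: int
    balance: int
    # Simplified stand-in for a contract's key-value storage.
    storage: Dict[str, str] = field(default_factory=dict)


@dataclass
class Block:
    number: int
    state_root: str         # commits to every Account
    transactions_root: str  # commits to the transaction list
    receipts_root: str      # commits to the receipts, and so to the logs
    transactions: List[Transaction] = field(default_factory=list)
```

Note that the block only commits to state and receipts through their roots. The accounts and receipts themselves are stored separately, which is part of why some queries are awkward.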

The challenges of accessing data

When you’re acting as a peer in the peer-to-peer network, the information as stored above makes it easy to validate information and share data with peers. But if you’re trying to run a dApp, some basic information can be very difficult to retrieve.

For example, if you want to know all the unique tokens a particular address has received, how do you get it? It seems like a simple query, but an address’s token balance is not a property of the address; it is data stored on potentially thousands of individual contracts. If you want to know a user’s balance on every token, you need to get it from each individual contract (either by making calls to each contract, or by making calls to a contract that calls each contract). This approach requires you to know which token contracts you’re interested in: if a user has received tokens you don’t know about, you won’t find out about them this way. For a more comprehensive list of tokens an address has received, you can query the logs, but that requires analyzing every single block to determine which blocks have pertinent logs.
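
As a rough illustration of the log-based approach, the sketch below builds the JSON-RPC request body for an `eth_getLogs` query that matches Transfer events sent to one address, then reduces the returned logs to a list of token contracts. The helper names are hypothetical, and sending the request to a node is left out. Note that ERC721 Transfer events share the same signature topic, so a filter like this may return those as well.

```python
# keccak256("Transfer(address,address,uint256)"): topic 0 of every Transfer event
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def address_to_topic(address):
    """Left-pad a 20-byte hex address to the 32-byte form used in log topics."""
    hex_part = address.lower()
    if hex_part.startswith("0x"):
        hex_part = hex_part[2:]
    if len(hex_part) != 40:
        raise ValueError("expected a 20-byte hex address")
    return "0x" + hex_part.rjust(64, "0")


def incoming_transfer_request(recipient, from_block, to_block, request_id=1):
    """Build an eth_getLogs request for Transfer events sent to `recipient`."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getLogs",
        "params": [{
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
            # No "address" key: match any contract, including unknown tokens.
            # topics[0] = event signature, topics[1] = any sender,
            # topics[2] = the recipient we care about.
            "topics": [TRANSFER_TOPIC, None, address_to_topic(recipient)],
        }],
    }


def tokens_received(logs):
    """Collect the distinct token contract addresses from eth_getLogs results."""
    return sorted({log["address"].lower() for log in logs})
```

Even with a filter this specific, the node still has to work through every block in the requested range, so a query covering the whole chain can be slow or rejected outright by a hosted provider.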

Another type of data that can be difficult to determine is the list of ERC721 tokens held by a particular user. When a user receives an ERC721 token (like a CryptoKitty), a log is emitted indicating the token transfer, and if you check the owner of the token the contract will tell you, but without the ERC721 enumerable extension you can’t get a list of the tokens owned by a particular address. The ERC721 enumerable extension has the contract do some double bookkeeping, using extra on-chain storage in order to be able to list the tokens owned by a given address. Again, determining the list of assets owned by a given user becomes a much more complicated process than it seems like it should be.
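
Without the enumerable extension, one way to recover holdings off-chain is to replay the contract's Transfer logs in order. The sketch below assumes the logs have already been fetched and decoded into `(from, to, token_id)` tuples. That tuple shape and the function name are illustrative assumptions.

```python
from collections import defaultdict


def tokens_by_owner(transfers):
    """Replay decoded Transfer events, in emission order, into current holdings.

    `transfers` is an iterable of (from_address, to_address, token_id) tuples.
    Returns a dict mapping each address to the set of token IDs it now holds.
    """
    holdings = defaultdict(set)
    for sender, recipient, token_id in transfers:
        holdings[sender].discard(token_id)  # no-op when minting from the zero address
        holdings[recipient].add(token_id)
    # Drop addresses that no longer hold anything.
    return {owner: ids for owner, ids in holdings.items() if ids}
```

This only gives the right answer if every Transfer event for the contract is replayed from its deployment onward, which is exactly the full-history scan that makes the query expensive.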

Different solutions

For many dApps, getting access to the information you need given these constraints is a serious challenge. One solution is to index the data as it hits the blockchain into a separate index. A while back we published an analysis of an index of ERC20 transfers, which built a database of tens of gigabytes in order to be able to query information about ERC20 tokens instantly. Such a database could become a part of an Ethereum client to make log queries more efficient, but it’s not necessary to act as a peer on the peer-to-peer network, so it doesn’t make sense to require the extra storage capacity for something required by a small subset of users.
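
A minimal sketch of such a separate index, using SQLite from the Python standard library, might look like the following. The table layout and function names are assumptions for illustration and do not describe the schema of the index mentioned above.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a small ERC20 transfer index."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS transfers (
               block_number INTEGER NOT NULL,
               log_index    INTEGER NOT NULL,
               token        TEXT NOT NULL,
               sender       TEXT NOT NULL,
               recipient    TEXT NOT NULL,
               amount       TEXT NOT NULL,  -- uint256 does not fit a 64-bit INTEGER
               PRIMARY KEY (block_number, log_index)
           )"""
    )
    # This secondary index is what makes "tokens received by X" a fast lookup.
    db.execute(
        "CREATE INDEX IF NOT EXISTS by_recipient ON transfers (recipient, token)"
    )
    return db


def record_transfer(db, block_number, log_index, token, sender, recipient, amount):
    """Store one decoded Transfer log; re-recording the same log overwrites it."""
    db.execute(
        "INSERT OR REPLACE INTO transfers VALUES (?, ?, ?, ?, ?, ?)",
        (block_number, log_index, token, sender, recipient, str(amount)),
    )


def tokens_received_by(db, address):
    """Return the distinct token contracts that have sent tokens to `address`."""
    rows = db.execute(
        "SELECT DISTINCT token FROM transfers WHERE recipient = ? ORDER BY token",
        (address,),
    )
    return [row[0] for row in rows]
```

The trade-off is the one described above: every transfer is stored a second time outside the client, in exchange for answering the query without scanning any blocks.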

There are other indexing solutions like VulcanizeDB and The Graph Protocol that let users specify mappings from the data they want to capture into a structure that allows them to access it easily. These solutions trade complexity and storage space for easier access to the data required for specific business logic.

In Rivet and EtherCattle

At OpenRelay, we provide easy access to standard web3 APIs, but, as we’ve discussed, that’s not always the easiest way to get the information you need. EtherCattle’s streaming replication approach gives us some novel opportunities to index the data that users need. We’re currently exploring a variety of such offerings, and look forward to offering more ways to access blockchain data for your application in Rivet.
