About 3:30 this morning, friends in the Ethereum ecosystem blew up my phone on Telegram to make sure I knew that Infura was experiencing an outage. I did a quick check of alarms and dashboards from my phone to make sure that Rivet was okay, then tried to go back to sleep. But a nagging tingle in the back of my mind wasn't about to let me rest until I knew why they were affected and we weren't. So I headed to my home office to dig in.

I quickly confirmed that Rivet was responding with blocks bearing recent timestamps, and that our block numbers pretty much matched the ones reported by Etherscan, give or take a second. So Rivet actually was up, and nothing was wrong with our monitoring.
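
For the curious, a check like that boils down to a couple of standard JSON-RPC calls. Here's a rough sketch, with $NODE_URL standing in for whichever endpoint you're checking (the variable name is just illustrative):

$ curl -s $NODE_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}'

$ curl -s $NODE_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 2, "method": "eth_getBlockByNumber", "params": ["latest", false]}'

The first call returns the node's current head as a hex quantity, which you can compare against a block explorer; the second includes the head block's timestamp field, which should be within a minute or so of the current time on a node that's keeping up with the chain.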

So what was going on? Twitter might know.

So evidently the Geth team had fixed a consensus-critical bug without telling anybody, and Infura was running an outdated version of Geth that was vulnerable. This meant that the version of Geth Infura was running was rejecting valid blocks produced by the rest of the network. Infura's mainnet offering was unresponsive, but curling their Ropsten endpoint revealed:

$ curl https://ropsten.infura.io/v3/$API_KEY -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 0, "method": "web3_clientVersion"}'

{"jsonrpc":"2.0","id":0,"result":"Geth/v1.9.9-omnibus-e320ae4c-20191206/linux-amd64/go1.13.4"}

Assuming they were running the same version on Mainnet (which seems quite likely), Infura was running Geth v1.9.9, which was released in December of last year — the first release to support the Muir Glacier hard fork, and the oldest release that can still sync the current chain.

Operations and Applying Updates

Many security experts will tell you that the number one thing you can do to keep your servers from getting hacked is to apply updates. An exceptionally common way for servers to get compromised is through vulnerabilities that have already been patched upstream but not yet applied on the system in question. I personally spent a year in graduate school researching vulnerability life cycles and modeling the risk posed by running unpatched software after an update has been released.

At the same time, new software often introduces new bugs. If you always deploy an update to production the day it comes out, sooner or later you’re going to be bitten by a new bug that snuck its way into that update. Generally the best practice is to deploy an update to a staging environment, run it through a standard set of tests specific to your systems, and deploy it to production after it has proven stable for an acceptable period of time.
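
To make that concrete, even a very small smoke test catches a lot. The checks below are just a sketch (with $STAGING_URL standing in for your staging endpoint), but they cover the basics: that the new build is actually the one serving traffic, that it has finished syncing, and that it's keeping up with the chain.

$ curl -s $STAGING_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 0, "method": "web3_clientVersion"}'

$ curl -s $STAGING_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 1, "method": "eth_syncing"}'

$ curl -s $STAGING_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 2, "method": "eth_blockNumber", "params": []}'

A real test suite specific to your systems would go well beyond this (replaying a representative sample of production traffic, for instance), but the principle is the same: let the update prove itself somewhere that isn't production.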

Running a bit behind is not uncommon in the software industry. Running 11 months behind is far from best practice.

But aside from Infura being pretty far behind on Geth versions, the other issue here is that the Geth team quietly slipped in a consensus-critical bugfix, apparently without telling anybody. According to a member of the team:

Keeping quiet before a security patch exists makes sense. If you tell people about a security vulnerability before there's anything they can do to fix it, it's very likely to be exploited. But once the patch is available, it should be a different story. Dan Kaminsky's handling of a 2008 DNS vulnerability serves as a good example of how things should be done: get the critical stakeholders together, let them know about the issue, and coordinate a rapid update among everyone involved. Ideally, the Geth team would have notified key Geth operators (examples would certainly include Infura, as well as any number of mining pools, and I'd hope to see Rivet on the list), stressed the importance of updating promptly, then publicly disclosed the flaw so that any holdouts knew they needed to update soon. Simply relying on nobody noticing commits in an open source codebase is a recipe for exactly what happened today.

The Rivet Difference

So far I’ve covered what went wrong at Infura, but not why Rivet is able to provide more consistent reliability.

Rivet is based on the Open Source Ether Cattle Initiative, which borrows the age-old database concept of streaming replication and applies it to Ethereum node infrastructure. Rather than just running lots of stock Ethereum nodes, we have redesigned the node itself to be much easier to manage.

With that in mind, how do we avoid issues like the one Infura saw today?

First of all, we're more rigorous about keeping up to date. We are currently running Geth 1.9.18, far more recent than the 1.9.9 release Infura was running, though still a bit behind the latest 1.9.23 release. This is about as far behind as we ever allow ourselves to be; even prior to today, updating to 1.9.23 next week was on our roadmap (a plan that will likely shift to 1.9.24, which is expected to drop tomorrow). Because we rely on our own fork of Geth, we merge upstream Geth changes into our branch on a regular basis even when we're not at a point in our development cycle where we're ready to do an upgrade. At the very least, this means we're ready to upgrade on short notice in the event of an emergency.
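
Keeping an eye on that gap doesn't have to be complicated, either. A rough sketch of the kind of check we're describing (the endpoint and tooling here are illustrative, not our actual monitoring) is to compare the version your node reports against the latest tagged Geth release:

$ curl -s $NODE_URL -H "Content-Type: application/json" --data '{"jsonrpc": "2.0", "id": 0, "method": "web3_clientVersion"}'

$ curl -s https://api.github.com/repos/ethereum/go-ethereum/releases/latest | grep tag_name

If those two drift more than a release or two apart, that's the signal to schedule an upgrade, or at the very least a merge from upstream.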

Second, our streaming replication system means we have a smaller set of servers to upgrade in an emergency. Since we were running a bit behind on updates, it's possible we could have found ourselves on the wrong side of this morning's chainsplit. But most likely, we would simply have updated the Geth version on a single one of our masters, which would have gotten us past the consensus issue and streamed further updates to our replicas, allowing us to recover in minutes rather than hours. We would have upgraded our replicas soon after, but that could have waited until we were back up and running.

We have numerous other failure scenarios covered as well. While we run with a system of redundant master nodes, any of our replicas can also step in to act as a master in the streaming replication system in the event of a catastrophe. And if something fails fundamentally in the streaming replication system itself, each replica is capable of switching over to run as a standard Geth node and continue handling requests.

The only disaster recovery scenarios that don’t have us back online in well under an hour are ones where we have to wait on upstream changes to Geth — and we’re even working on that scenario to be able to use other clients as masters in our streaming replication solution.

But I don’t want to rely on a centralized provider!

That’s fair. It’s easy to understand why someone would take this morning’s events and conclude that they don’t want to depend on a centralized provider, rather than concluding that they should switch to a different centralized provider.

The problem is that running Ethereum node infrastructure with high availability requires considerable expertise and a significant infrastructure investment. When OpenRelay set out to build the EtherCattle initiative, our hope was to build an Open Source alternative to Infura, rather than create another centralized provider. But the base level of infrastructure required for a highly available EtherCattle deployment runs into the thousands of dollars, and for most projects that's a pretty steep expense.

So you have some options.

But ultimately, unless you’re truly willing to commit the time and resources into developing the expertise to operationalize Ethereum nodes, Rivet is the best option for node infrastructure you can count on. You can sign up for free at Rivet.cloud — use the promo code AVAILABILITY to get 5 million free requests when you sign up.

Follow Up

While the above post was going through our editorial process, we got a couple of updates.

The Geth team has posted a post mortem, describing the nature of the bug, the fix they included, and why they included it silently.

Meanwhile, Jing from Optimism admitted that their team had inadvertently triggered the chainsplit, unaware of the potential impact:

We will post a followup tomorrow with an analysis of the new information as it unfolds.