Well Done, Geth Team

August 27th saw an Ethereum chainsplit occur due to a bug in Geth that was patched on August 24th. This post looks at the vulnerability lifecycle through the lens of the framework we laid out in February, and compares it with the chainsplit that occurred on November 11th of last year.

Background

On November 11th 2020, a chainsplit occurred on the Ethereum network as a result of a bug in Geth that had been patched on July 20th, 2020, 114 days prior to the chainsplit. The Go Ethereum team explained that they had not highlighted the security vulnerability out of concern that drawing attention to it could expedite its exploitation, leaving unpatched nodes vulnerable to attack. The counterpoint is that, because node operators were unaware of the security risk, they delayed upgrading their nodes, and that if the vulnerability had been publicized, more nodes would have upgraded sooner.

On August 19th 2021, the Geth team alerted the community that a vulnerability had been discovered in Geth and that a patch would be released on August 24th. Teams at many organizations, including Rivet, set aside time to upgrade quickly on August 24th, minimizing the window of exposure to the vulnerability. On August 27th, just 3 days after the patch was released, a transaction that exploited this vulnerability hit the Ethereum network and caused a chainsplit between older and newer Geth clients.

Discussion

A key question following the recent events is whether the vulnerability notification was a net benefit or a net detriment.

Per the vulnerability lifecycle framework presented in February, we have two key variables to define:

t_exploit: The time it takes an attacker to identify the vulnerability, develop an exploit, and execute it.
t_patch: The time it takes the network to reach a critical mass of nodes that have applied the patch.

t_exploit is fairly easy to measure. It was 114 days from the release of Geth v1.9.17 to the November 11th chainsplit, and 3 days from the release of Geth v1.10.8 to the August 27th chainsplit. It is very likely, though not certain, that the Geth team’s announcement of the security vulnerability fixed in Geth v1.10.8 is responsible for the sharp drop in t_exploit between the two releases.
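Measured this way, t_exploit is simple calendar arithmetic between a patch release and the chainsplit that followed it. A quick Python check of the two figures above, using the dates given in this post (`days_between` is our own illustrative helper, not part of any Geth tooling):

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Whole calendar days from start to end."""
    return (end - start).days


# Release dates of the patched Geth versions and the dates of the
# chainsplits that exploited the corresponding bugs, as given in this post.
t_exploit_v1_9_17 = days_between(date(2020, 7, 20), date(2020, 11, 11))
t_exploit_v1_10_8 = days_between(date(2021, 8, 24), date(2021, 8, 27))

print(f"v1.9.17: {t_exploit_v1_9_17} days")  # v1.9.17: 114 days
print(f"v1.10.8: {t_exploit_v1_10_8} days")  # v1.10.8: 3 days
```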

t_patch is harder to boil down to a single number, because it relies on a loosely defined “critical mass of nodes”.

Critical Mass of Nodes

Ethernodes.org collects information on Ethereum nodes, including the percentage of nodes running each client and version. The following chart was compiled from Archive.org snapshots of the Geth page on Ethernodes.org.¹

Here we can see the percentage of Geth nodes that had applied the patch versus the number of days since the patch was released. By the time the vulnerability was exploited on November 11th, approximately 64% of Geth nodes had applied the patch.

There aren’t enough data points to plot the adoption of the Geth v1.10.8 patch; by the time the vulnerability was exploited on August 27th, approximately 31% of Geth nodes had applied it. It is worth noting that the patch in v1.9.17 reached the same 31% adoption threshold approximately 26 days after its release, whereas the highly publicized patch in v1.10.8 reached it in only 3 days.
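The post does not say exactly how the 26-day figure was read off the chart. With sparse snapshots like the Archive.org ones, one reasonable approach is linear interpolation between adjacent snapshots. A minimal Python sketch, assuming 0% adoption on release day; `days_to_threshold` is hypothetical, and any sample data passed to it is illustrative rather than the Ethernodes figures:

```python
from typing import Optional, Sequence, Tuple

# Each snapshot is (days since the patch release, % of Geth nodes patched).
Snapshot = Tuple[float, float]


def days_to_threshold(snapshots: Sequence[Snapshot], threshold: float) -> Optional[float]:
    """Estimate the day on which adoption first reaches `threshold` percent.

    Linearly interpolates between consecutive snapshots, which must be
    sorted by day. Returns None if the data never reaches the threshold.
    """
    if threshold <= 0:
        return 0.0
    prev_day, prev_pct = 0.0, 0.0  # assumption: nobody is patched on release day
    for day, pct in snapshots:
        if pct >= threshold:
            # prev_pct < threshold <= pct here, so the denominator is positive.
            frac = (threshold - prev_pct) / (pct - prev_pct)
            return prev_day + frac * (day - prev_day)
        prev_day, prev_pct = day, pct
    return None
```

With snapshots as sparse as those listed in footnote 1 (for example, the 35-day gap between Aug 16 and Sep 20), an interpolated figure like this is only as good as the assumption that adoption grew roughly linearly between snapshots.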

In terms of the raw percentage of nodes affected, approximately twice as many nodes were vulnerable to the exploit patched in v1.10.8 as were vulnerable to the exploit patched in v1.9.17.
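The factor of two follows directly from the patched shares reported above: at the moment of each exploit, the vulnerable share is the complement of the patched share.

```latex
\begin{aligned}
\text{unpatched}_{\text{v1.9.17}} &\approx 1 - 0.64 = 0.36 \\
\text{unpatched}_{\text{v1.10.8}} &\approx 1 - 0.31 = 0.69 \\
\frac{0.69}{0.36} &\approx 1.9
\end{aligned}
```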

But raw numbers of nodes aren’t the full picture. As a practical matter, some nodes are more “critical” in terms of “critical mass” than others.

In the case of both exploits, more hashing power was on the right side of the chainsplit than on the wrong side. This is very important, as it means there were few instances of people who believed their transactions had confirmed, only to see them rolled back when those transactions ended up on the other chain.

Additionally, many users connect to the blockchain through nodes operated by a handful of providers. When the exploit patched in v1.9.17 occurred, Infura incurred around 5 hours of downtime, leaving users of MetaMask and many other services unable to get updates from the blockchain for a considerable period. To my knowledge, no major service providers had significant outages as a result of the exploit patched in v1.10.8.

The effects of the announcement on t_exploit and t_patch

We contend that the announcement of the vulnerability patched in Geth v1.10.8 lowered t_exploit about as low as it could get without explicitly disclosing how to exploit the vulnerability.

How the specific attack vector could be quickly identified

The Geth team announced the presence of an EVM vulnerability. While there were 16 pull requests included in Geth v1.10.8, only one of them touched the EVM package.
That pull request was merged only minutes before the Geth v1.10.8 release was made, highlighting it as an area of interest for potential attackers.
That pull request had 11 commits: 10 that predated the Geth team’s vulnerability announcement, and one that was added to the pull request thirty minutes before the release was made, highlighting that specific commit as an area of interest for potential attackers.
Also notably, while the other commits in that pull request were ostensibly optimizations and included benchmarks, the final commit offered minimal optimization value and included no benchmarks.
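The narrowing above amounts to a handful of filters over public repository metadata. As a minimal Python sketch, assuming hypothetical `Commit` and `PullRequest` records (not any real GitHub API) already populated with each pull request’s touched packages and commit timestamps:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Commit:
    sha: str
    committed_at: datetime


@dataclass
class PullRequest:
    number: int
    touched_packages: List[str]  # e.g. directories changed by the PR
    commits: List[Commit] = field(default_factory=list)


def suspicious_commits(
    prs: List[PullRequest],
    package: str,
    announced_at: datetime,
    released_at: datetime,
    window: timedelta = timedelta(hours=1),
) -> List[Tuple[int, str]]:
    """Return (PR number, commit sha) pairs worth a closer look.

    Mirrors the narrowing described above:
      1. only PRs that touch the package named in the announcement,
      2. only commits made after the announcement and before the release,
      3. only commits landing within `window` of the release.
    """
    hits = []
    for pr in prs:
        if package not in pr.touched_packages:
            continue  # step 1: wrong component
        for c in pr.commits:
            after_announcement = announced_at < c.committed_at <= released_at  # step 2
            close_to_release = released_at - c.committed_at <= window  # step 3
            if after_announcement and close_to_release:
                hits.append((pr.number, c.sha))
    return hits
```

Without an announcement, step 2 simply drops out. This is the point footnote 2 makes: an observer can still watch for commits that land just before a release.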

However, we also contend that there are additional measures that could considerably lower t_patch. Between Geth v1.10.6 (the earliest release to support the London hard fork) and Geth v1.10.8, Geth made several breaking changes to its APIs:

Issues #23239 and #23363 both made changes to data types in the eth_feeHistory API.
Issue #23199 disallowed running eth_call or eth_estimateGas to simulate calls from smart contracts.
Between Geth v1.10.7 and v1.10.8, Issue #23424 partially reverted the change from issue #23119, but this was not noted in the release notes.
Support for the Calaveras testnet was removed from Geth v1.10.8.
Internal API changes required some Geth forks to make code changes before they could merge v1.10.8.

Application teams with dependencies on these features would have been hard pressed to upgrade immediately from v1.10.6 or v1.10.7 to Geth v1.10.8; in many cases, it would have required changes to their own applications to support the new version of Geth.

If the Geth team offered a Long Term Support (LTS) release or another form of stable branch that received only critical updates, those teams would have been able to apply the critical security patch immediately without having to reconcile Geth’s other changes. The Geth team has historically contended that a stable branch would highlight potential exploits to attackers (lowering t_exploit), but the Geth v1.10.8 vulnerability announcement had a similar effect on t_exploit without the benefit an LTS release would have had on t_patch (lowering the barriers to applying the patch).

Long Term Ramifications

Some in the community consider the announcement of the vulnerability patched in Geth v1.10.8 to have been a failed experiment, and that the team should return to old practices of silently slipping security patches into the codebase. I don’t believe that conclusion is supported by the data from this release.

As discussed previously, as the Ethereum blockchain grows to support more value, well-resourced attackers will have considerable motivation to seek out exploits against Geth. The market cap of ETH alone increased 6.88x between the November 11th exploit and the August 27th exploit (to say nothing of the value held in other tokens on Ethereum), which creates correspondingly greater incentives for attackers to spend resources searching for attack vectors.

While there’s room for debate over whether the November 11th attack or the August 27th attack was worse, it’s not hard to imagine an attack with significantly worse outcomes.

If the Geth team returns to quietly slipping security changes into pull requests disguised as optimizations, a well-resourced team following Geth’s development could potentially identify a vulnerability before the fix has ever been included in a release.² This could be extremely damaging to the network, as most hashing power would produce invalid blocks and most service providers would accept those invalid blocks. That would spark considerable debate over whether to retroactively hard fork to accept the blocks most people already believed to be confirmed, or to roll back to the corrected consensus rules, reverting many hours’ worth of blocks that users believed had been confirmed.

The August 27th attack demonstrates that, with advance notification, three days of patching was sufficient to avoid such catastrophic outcomes. We would again encourage the Geth team to take this a step further by making an LTS release to lower the bar for applying updates, and to focus on lowering the time needed to reach a critical mass of patched nodes (lowering t_patch) rather than relying on obscurity in hopes of delaying exploits (raising t_exploit).

¹ Archive.org snapshot dates used: Aug 6 2020, Aug 11 2020, Aug 16 2020, Sep 20 2020, Sep 28 2020, Oct 21 2020, Oct 27 2020. We reached out to Ethernodes to see if more granular data was available, but had received no response by the time of publication.
² Adding the commit just before the release doesn’t solve this either, as prospective attackers could easily learn to watch commits added just before a release as candidate security patches.
