Back in November I wrote about an unintentional chain split on the Ethereum network caused by a collection of clients that had not applied critical security updates. We followed this up with a call for a Long Term Support (LTS) release of Geth, even offering to maintain the branch ourselves. Unfortunately, the Geth team has refused to support any kind of LTS release.

The Geth team’s position comes from a concern that announcing security vulnerabilities highlights them for attackers, potentially inviting attackers to exploit those vulnerabilities faster than nodes on the network can upgrade to mitigate the risk. They worry that even by merging security fixes into a support branch, they would be highlighting potential attack vectors that a savvy attacker could exploit before most of the network has a chance to apply the patch.

Vulnerability Lifecycle Modeling

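In 2010, I wrote my master’s thesis on vulnerability life-cycle modeling, exploring in great detail the concepts that are relevant to this question today. There are two key variables that are fundamental to the decision of how to handle patch distribution: texploit, the time it takes an attacker to develop and launch an exploit for a vulnerability, and tpatch, the time it takes the network to reach a critical mass of nodes that have applied the patch.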

Defining "Critical Mass of Nodes"

Above, we defined tpatch as the time it takes the network to reach a critical mass of nodes that have applied the patch, but “critical mass of nodes” is not an easily defined term. When Infura had a nearly five-hour outage and a large number of applications became unavailable, much of the community would have argued that texploit < tpatch. When major service providers are affected, the health of the community suffers even if the network of nodes is mostly unaffected.

On the other hand, if your primary concern is avoiding major chain splits, then one can argue that having a majority of mining power and an abundance of nodes that have applied the patch constitutes a critical mass, even while major applications remain unpatched.
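To make the ambiguity concrete, here is a minimal sketch of how the same rollout yields very different values of tpatch depending on what “critical mass” is measured against. The operator names, shares, and patch days are hypothetical numbers chosen for illustration, not measurements of the real network.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Operator:
    name: str
    patch_day: float   # days after the release when this operator applied the patch
    hash_share: float  # hypothetical fraction of mining power
    app_share: float   # hypothetical fraction of application traffic served


def t_patch(operators: List[Operator],
            weight: Callable[[Operator], float],
            threshold: float) -> Optional[float]:
    """Earliest day on which patched operators hold at least `threshold`
    of the chosen weight, or None if the threshold is never reached."""
    patched = 0.0
    for op in sorted(operators, key=lambda o: o.patch_day):
        patched += weight(op)
        if patched >= threshold:
            return op.patch_day
    return None


# Hypothetical rollout: miners patch fast, a large API provider patches late.
rollout = [
    Operator("mining pools", patch_day=1, hash_share=0.70, app_share=0.05),
    Operator("hobbyist nodes", patch_day=3, hash_share=0.20, app_share=0.15),
    Operator("api provider", patch_day=90, hash_share=0.10, app_share=0.80),
]

by_hash_power = t_patch(rollout, lambda o: o.hash_share, 0.5)
by_app_traffic = t_patch(rollout, lambda o: o.app_share, 0.5)
print(by_hash_power, by_app_traffic)
```

Under these assumed numbers, a chain-split-focused definition reaches critical mass on day 1, while an application-focused definition does not reach it until the API provider patches on day 90.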

While we agree that avoiding major chain splits makes sense as a significant priority, as Ethereum reaches increasing levels of mass adoption, billions of dollars may be on the line for even relatively small segments of the network. Impacts of that magnitude are likely to draw the attention of government regulators, who will be more interested in financial impacts than technical arguments over the definition of “critical mass of nodes.”

When texploit < tpatch, the attacker is able to wreak havoc on the network by exploiting a vulnerability that is predominantly unpatched. When tpatch < texploit, the attack is mostly ineffective, because most of the nodes it targets have already been patched.

From here, decisions about release processes can be couched in terms of their impact on these two variables. Ideal actions would increase texploit while decreasing tpatch. Actions that decrease texploit while increasing tpatch pose serious risks. For actions that move texploit and tpatch in the same direction, the relative scale of each move has to be weighed.
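The decision rule above can be written down as a small sketch. The function names, the enum, and the example deltas are illustrative assumptions; a positive delta means the corresponding time gets longer.

```python
from enum import Enum


class Assessment(Enum):
    IDEAL = "ideal"
    SERIOUS_RISK = "serious risk"
    WEIGH_MAGNITUDES = "compare the scale of each move"
    NO_CHANGE = "no change"


def attacker_wins(t_exploit: float, t_patch: float) -> bool:
    """The attacker succeeds when an exploit exists before critical mass is patched."""
    return t_exploit < t_patch


def assess_action(delta_exploit: float, delta_patch: float) -> Assessment:
    """Classify a release-process change by its effect on texploit and tpatch."""
    if delta_exploit == 0 and delta_patch == 0:
        return Assessment.NO_CHANGE
    if delta_exploit >= 0 and delta_patch <= 0:
        return Assessment.IDEAL          # attackers slower, patching faster
    if delta_exploit <= 0 and delta_patch >= 0:
        return Assessment.SERIOUS_RISK   # attackers faster, patching slower
    return Assessment.WEIGH_MAGNITUDES   # both moved in the same direction


# Hypothetical deltas in days, for illustration only.
print(assess_action(delta_exploit=5, delta_patch=-3))
print(assess_action(delta_exploit=-5, delta_patch=3))
print(assess_action(delta_exploit=-2, delta_patch=-30))
```

The last case, where both variables fall, is the one that requires judgment: the classification alone does not say whether the change is a net win, only that the two magnitudes have to be compared.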

Geth’s Current Approach

Considered in the framework of this vulnerability lifecycle model, the Geth team’s concerns are not entirely without merit. Certainly, making a full vulnerability disclosure the instant an update is made available will exert more downward pressure on texploit than it does on tpatch. Even if node operators are going to apply updates “quickly”, Ethereum operates around the planet, and it’s likely that ⅓ of node operators are asleep at the moment of the release. The vulnerability disclosure might spur some node operators to apply updates on an accelerated schedule, but if a vulnerability disclosure spells out an attack in detail, an attack could quite plausibly be executed before many node operators are even awake to react.

The Geth team’s current approach of not announcing vulnerabilities immediately, combined with silently mixing security fixes in with new features and optimizations, succeeds at putting upward pressure on texploit—as potential attackers have to dig through large code updates (which may or may not include security fixes at all) before they can even begin to develop an exploit.

Unfortunately it also puts significant upward pressure on tpatch, as organizations that run nodes must go through rigorous testing cycles (and sometimes development cycles) to make sure the new features don’t break anything in the downstream application (and update the application when they do). Depending on how you view the “critical mass” component of tpatch, the November Infura outage is a datapoint that puts tpatch on a scale of many months.

The LTS Approach

Having an LTS branch of Geth would put downward pressure on both texploit (which is bad) and tpatch (which is good).

The downward pressure on texploit comes from a reduction in the number of commits an attacker would need to sift through to identify potential security issues. Rather than having to sift through a mix of features, optimizations, bug fixes, and critical security fixes, attackers following the LTS branch would only have to sift through bug fixes (which may not always be security related) and security fixes. They would still need to take time to determine whether a vulnerability exists and how to exploit it, which leaves more time for users to upgrade than making a vulnerability disclosure along with the release, but plausibly less time than if attackers must sift through a larger volume of code changes.

The real value of an LTS branch is the downward pressure on tpatch. After the November Infura outage, Infura went on record saying:

In the early days of Infura we would upgrade nodes as soon as the Geth or Parity teams cut a new release. We wanted the latest performance improvements, the latest API methods, and of course bug fixes. We stopped doing that when these changes sometimes brought instability or critical breaking issues which negatively impacted our users.

It follows that they would have updated much sooner if they could have had confidence that an update would bring bug fixes without instability and critical breaking issues that negatively impact their users.

LTS releases or stable branches are standard practice throughout the software industry for exactly this reason.

If updates are very reliably stable, businesses will generally apply those changes quickly, on a regular schedule. For a piece of software with very reliably stable updates, an update that comes out on a Monday may be applied by Tuesday or Wednesday, especially if it is widely understood that updates on the stable branch are usually critical bug fixes.

If updates tend to cause large, breaking changes, organizations will schedule an upgrade process every couple of months, to allow for engineering resources to make the necessary changes to accommodate the change. If such an update comes out on a Monday, businesses are unlikely to pull resources from planned activities to evaluate and make any necessary changes until the start of the next sprint, likely weeks later at the soonest (especially if the organization has no reason to believe the release contains security updates).

One last factor that works to the benefit of tpatch from an LTS is that it creates a level of node diversity. If a bug is introduced, caught, and fixed between LTS releases, users running the LTS version of software were never exposed to the bug in the first place. Similarly, it is also possible that an undiscovered security bug is fixed inadvertently during a refactor to support a new feature, in which case the security fix may only be applicable to LTS users. In either of these cases, it is easier to reach “critical mass” adoption when only a subset of users were exposed to the vulnerability at the time of its discovery.

The impact of client diversity in blockchain networks

One of the reasons that the Geth team is particularly concerned about vulnerabilities being discovered in their code is that the Geth client comprises a majority of the nodes on the network, and likely a majority of the network’s hashing power. Thus, a flaw in Geth could lead the majority of nodes on the network to follow an invalid chain. If the majority of nodes are accepting blocks that well behaved clients consider invalid, it could cause serious issues as the community struggles to reconcile the transactions that finalized on the larger invalid chain with the transactions that finalized on the smaller, valid chain.

If, however, no single Ethereum client served a majority of the nodes on the network, then a failure in Geth would only impact the nodes running Geth, and the majority of other nodes would still be following the correct chain. While this could still result in some frustration and economic turbulence as nodes on the smaller Geth chain were rolled back and replaced with the transactions on the correct chain, that would likely be much easier to reconcile than if the incorrect chain were also longer.

Running more nodes (especially mining nodes) on clients other than Geth would help secure the network against flaws in any single client. Other options include:

If the hashing power and majority of nodes on the network could be split among these three clients plus Geth, no bug in a single client could result in a longer chain that violated the network’s consensus rules.
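As a rough sketch of that condition, the check below flags any client whose hashing power alone exceeds that of the rest of the network combined. The client names and percentages are hypothetical, and the model is a simplification: it treats "can build the longer chain" as "holds more than half of the hashing power."

```python
from typing import Dict, List


def clients_able_to_outrun_network(hash_share: Dict[str, float]) -> List[str]:
    """Return clients that hold more than half of the total hashing power.

    A consensus bug in one of these clients could produce a chain longer
    than the one followed by every other client combined.
    """
    total = sum(hash_share.values())
    return [client for client, share in hash_share.items() if share > total / 2]


# Hypothetical distributions, in percent of total hashing power.
dominated = {"geth": 75, "client_a": 10, "client_b": 10, "client_c": 5}
diverse = {"geth": 40, "client_a": 25, "client_b": 20, "client_c": 15}

print(clients_able_to_outrun_network(dominated))
print(clients_able_to_outrun_network(diverse))
```

In the dominated distribution only the majority client is flagged; in the diverse one the list is empty, which is the property described above.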

The Impact of Network Value on tpatch and texploit

Here, we find a factor that impacts both tpatch and texploit which is largely outside the control of any release process decisions a software team can make.

At the time of this writing, ETH alone has a market cap of over $195 billion. When you account for other tokens on the network, you easily get into a valuation in the range of several hundred billion dollars. So how does that impact our vulnerability life cycle model?

First, let’s look at texploit. The more value a network holds, the more value can be derived from causing it to falter. In recent weeks we’ve seen financial exploits outside the crypto space. Vulnerabilities in Ethereum clients open up opportunities for similar financial exploits.

Imagine a hedge fund or other financial institution spending a million dollars a year employing a team of security researchers to track Geth and watch for fixed vulnerabilities. When their team finally finds something, they take out a massive short position on ETH and related tokens, then exploit the vulnerability. Even today, if they could use such an approach to extract just 0.5% of the network’s value, it could yield a roughly 1000x return on the security research team’s salaries. As the value of the network increases, so does the potential return from such an attack. Employing a team capable of closely tracking an LTS release might be 10x cheaper than employing a team capable of tracking every commit to Geth itself, but even if that’s not cost effective today, it will be if the value of the network scales the way many of the network’s believers are confident it will.
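The arithmetic behind that estimate is easy to check. The sketch below uses the figures from the paragraph above (a $195 billion ETH market cap, 0.5% extracted, a $1 million annual research budget); the function name and the basis-point parameter are illustrative choices, and the LTS figure simply applies the hypothetical 10x cost reduction.

```python
def exploit_return_multiple(network_value_usd: int,
                            extracted_basis_points: int,
                            research_cost_usd: int) -> float:
    """Return on research spend if `extracted_basis_points` (1 bp = 0.01%)
    of the network's value can be captured by an attack."""
    extracted_usd = network_value_usd * extracted_basis_points / 10_000
    return extracted_usd / research_cost_usd


eth_market_cap = 195_000_000_000  # USD, ETH alone, per the figure above

# 0.5% = 50 basis points, against a $1M/year team tracking every Geth commit.
full_tracking = exploit_return_multiple(eth_market_cap, 50, 1_000_000)

# Same attack, assuming a team tracking only an LTS branch costs 10x less.
lts_tracking = exploit_return_multiple(eth_market_cap, 50, 100_000)

print(f"{full_tracking:.0f}x vs {lts_tracking:.0f}x")
```

Counting ETH alone gives a multiple just under 1000x; including other tokens on the network would push it higher. Either way, the attack is already wildly profitable at the higher research cost, which is why the cheaper LTS-tracking team changes the attacker's economics less than it might first appear.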

Even without a single organization that invests heavily in finding Geth bugs for the purpose of exploitation, Linus’ Law states:

Given enough eyeballs, all bugs are shallow

Applying this to Geth, as the Ethereum ecosystem continues to grow, Geth will attract more eyeballs and more opportunities for people to find exploitable bugs—especially by tracking patches.

Then juxtapose that to the impact that an increasing network value has on tpatch. As the value on the network increases, we can expect more enterprise adoption — more businesses that will need to incorporate client upgrades into their business cycle, taking longer than existing users to apply updates, especially if those updates have a tendency to come with breaking changes.

So as the Ethereum network increases in value, we can reasonably expect texploit to fall while tpatch can be expected to rise — a recipe for successful attacks. Given the value at stake, such an attack would be sure to attract the attention of regulators who are unlikely to be placated by claims that “most nodes have indeed updated and were not affected” while substantial losses were incurred by major institutions on a small segment of the network that had not updated.

Comparing with non-blockchain applications

This general model of comparing the effects of an action on texploit and tpatch is useful across all types of software. The definition of “critical mass” for tpatch and the impact of an exploit will differ across projects, but the impact that given factors will have on tpatch and texploit will generally be the same.

While not necessarily framed in these terms, LTS releases and stable branches are common across the software industry because ensuring businesses can apply security updates without having to adapt their software for other changes helps ensure that tpatch occurs sooner than texploit.

It’s also worth noting that there are tools outside of release management that can help lower tpatch. Whether you’re looking at Dan Kaminsky’s DNS Vulnerability, Heartbleed, Meltdown and Spectre, or a handful of other examples, massive coordination efforts were used to make sure that various vendors were prepared for a rapid rollout of updates as soon as a public disclosure was made. In many of these cases, vendors even released updates for software they had officially dropped support for, so that their users had a drop-in replacement that would allow them to upgrade rapidly without engineering efforts that would delay the rollout.

The Geth team did something similar when Go 1.15.5 was released. Without disclosing the specifics of the vulnerability, they notified the community that the Go release would be published on a particular day, and that teams should be prepared to upgrade quickly. They provided a release that resolved the issue, but also provided instructions for teams who still had reasons to run older versions of Geth. While it was a significant coordination effort, the results were quite effective.

Conclusions

As the value of the Ethereum network continues on an exciting upward trajectory, client teams need to think critically about both tpatch and texploit. For sufficiently well resourced attackers, the difference between texploit under Geth’s current strategy and texploit under an LTS strategy becomes negligible, while tpatch improves considerably under an LTS strategy compared with the current one.

Our team is still amenable to helping the Geth team maintain an LTS release, but even if they aren’t interested in our help, we hope they will reach the conclusion that an LTS is necessary before the lack of one leads to a major disaster for the ecosystem.