The November 11th Ethereum chainsplit happened because several node operators, miners, and exchanges were running outdated versions of Geth. Many people jumped on this point: why were major blockchain companies running outdated versions of Geth? Shouldn’t they be up to date?
There are two primary reasons that updates happen: bug fixes and new features.
Many projects separate these out into multiple channels; there’s even a
widely agreed specification called Semantic Versioning,
which defines version numbers along these lines. When you see a version number
of the form MAJOR.MINOR.PATCH (e.g. 1.9.17), this generally means that:
- PATCH: Only includes bugfixes / security updates. Any tools interacting with this software should expect it to behave exactly as it was intended to before; any changes address unintended behaviors. Generally, if the PATCH version goes up, it should be very safe to apply an update, and one can expect the software to become more stable.
- MINOR: Includes backward compatible new features. Any tools interacting with this software should still work without having to make any changes, but there are also new features. New features often involve bigger changes, which are more likely to introduce new, unintended behaviors. Ideally MINOR versions are safe to apply, but they are more likely to introduce unintended instabilities than PATCH versions.
- MAJOR: Includes backwards incompatible changes. Tools interacting with this software will probably have to make changes to incorporate the updates.
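As a rough illustration of the rules above, the sketch below names which component changed between two version strings. The function names are made up for this example, and the parsing is deliberately simplified: it assumes plain MAJOR.MINOR.PATCH strings with an optional leading "v" and ignores pre-release or build suffixes.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v1.9.17' or '1.9.17' into (major, minor, patch).

    Simplified on purpose: pre-release and build suffixes are not handled.
    """
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that changed between two versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


# What a reader should be able to assume from each kind of bump,
# paraphrasing the three bullets above.
EXPECTATION = {
    "PATCH": "bug fixes and security updates only; very safe to apply",
    "MINOR": "backward compatible new features; somewhat more risk",
    "MAJOR": "backwards incompatible changes; dependent tools may need changes",
}

bump = classify_bump("1.9.16", "1.9.17")
print(bump, "->", EXPECTATION[bump])
```

Under Semantic Versioning, a tool like this would be enough to decide how much testing an update deserves before it is applied.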
This makes it possible for v1.3.7 of a project to come out after v1.4.0,
so that projects that aren’t ready to make the leap from v1.3.x to v1.4.x
can apply security updates without having to worry about the
ramifications of new features.
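To make the idea of parallel release lines concrete, here is a minimal sketch that picks the newest PATCH release within a single MAJOR.MINOR line. The release history is hypothetical and the helper names are invented for this example.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'v1.3.7' into (1, 3, 7); suffixes are not handled."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def latest_in_line(released: Iterable[str], major: int, minor: int) -> Optional[str]:
    """Return the highest PATCH release within one MAJOR.MINOR line, if any."""
    line = [v for v in released if parse_version(v)[:2] == (major, minor)]
    return max(line, key=parse_version, default=None)


# Hypothetical history, listed in the order the releases came out.
# Note that v1.3.7 ships after v1.4.0.
release_order = ["v1.3.5", "v1.3.6", "v1.4.0", "v1.4.1", "v1.3.7"]

print(latest_in_line(release_order, 1, 3))  # v1.3.7
print(latest_in_line(release_order, 1, 4))  # v1.4.1
```

A project pinned to the v1.3.x line only ever asks for the first result, so it picks up v1.3.7 without being exposed to anything introduced in v1.4.x.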
While Geth uses three-part X.Y.Z version numbers, the Geth team doesn’t comply with
Semantic Versioning. If they did, you would expect v1.9.23 to be a highly
stabilized version of v1.9.0, serving exactly the same function with no new
features and much greater stability. Instead, nearly every PATCH update includes
new features. For example, v1.9.7 added support for the Istanbul hard fork.
Hard forks seem like the definition of a backwards incompatible change,
but Geth increased the PATCH version instead of the MAJOR version. Then
v1.9.9 added support for Muir Glacier, another hard fork. v1.9.12
changed the default sender for eth_call from Geth’s default account to the
0x00...000 account; this was not a problem for Rivet (since we don’t have a
default account) but was a breaking change for some projects. v1.9.13 changed how
much ETH the caller of an eth_call or eth_estimateGas had; this was listed
as a bugfix, but was a breaking change for at least one Rivet customer.
v1.9.14 changed the error messages of eth_call and eth_estimateGas to
include the reason transactions were reverted. This is very useful, but
if people were relying on the old error messages, it is a breaking change.
Complexity In Evaluating Breaking Changes
It’s usually pretty easy to delineate a new feature from a bugfix, but breaking changes can sometimes be harder to define.
For example, Rivet uses Geth essentially as a Go library. We were
probably the only project downstream from Geth that considered it a breaking
change when Geth started writing the latest block hash to LevelDB in a batch
write operation instead of an individual put operation, but for us that
required a significant refactor. Still, we wouldn’t have faulted the Geth team
for putting that in a MINOR
version instead of making a MAJOR
version for
it.
When Geth changed its behavior of allocating an ETH supply for eth_call
invocations, it made the call align more correctly with what an equivalent
transaction would do, but at the same time made it impossible to simulate two
or three transactions ahead to see what a particular call would do once a
particular address was sent the ETH necessary to execute the call. For some
projects this was a breaking change that should have bumped the MAJOR
version; for others it was a bugfix that justified only an increase to the
PATCH version.
It is not our intent to rile up debate and criticism for each decision of which version to bump; that can quickly turn into bikeshedding and get in the way of progress. But being able to extrapolate meaning from the version number can be quite useful, and distinguishing big new features from critical security updates can be invaluable.
So if you’re a business running Geth, moving from v1.9.9 to v1.9.17 to apply a
security patch means taking on a whole host of potentially breaking changes,
both for your own internal systems and your customers’ systems. Since there’s no
separation of critical bug fixes / security fixes from new features / breaking
changes, it’s all or nothing. When major security patches are
slipped in secretly,
it’s no wonder that even responsible companies choose to stay behind.
At Rivet, every time a Geth release comes out we go over the release milestone, looking at at least the title of each pull request to evaluate:
- Is this a breaking change for our streaming replication system?
- Is this likely to be a breaking change for our users?
- If this is a security update, how likely is it to impact our systems?
On occasion we have backported bugfixes into our fork because we weren’t ready to handle other changes that would impact our streaming replication system, but we can’t do that if critical security fixes are mislabeled as optimizations.
Rivet’s Proposal
We propose to help the Geth team adopt Semantic Versioning. From the Geth
team’s perspective, not a lot has to change — just be more willing to
bump version numbers according to the rules of Semantic Versioning. This may
mean Geth goes from v1.9.24
to v13.2.0
by the end of next year, but users
will be able to more readily evaluate the magnitude of the changes in the
update they are applying.
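To show how quickly version numbers can move under these rules, the sketch below bumps a version according to the most significant kind of change in each release. The change labels ("fix", "feature", "breaking") and the release sequence are assumptions made for this example, not a prediction of actual Geth releases.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]

# Assumed labels, ordered from least to most significant.
SEVERITY = {"fix": 0, "feature": 1, "breaking": 2}


def next_version(version: Version, changes: Iterable[str]) -> Version:
    """Bump according to the most significant change in a release."""
    changes = list(changes)
    if not changes:
        raise ValueError("a release needs at least one change")
    worst = max(changes, key=lambda c: SEVERITY[c])
    major, minor, patch = version
    if worst == "breaking":
        return (major + 1, 0, 0)
    if worst == "feature":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


# Hypothetical sequence of releases following v1.9.24.
version = (1, 9, 24)
releases = [["fix"], ["fix", "feature"], ["feature", "breaking"], ["fix"]]
for changes in releases:
    version = next_version(version, changes)
    print("v" + ".".join(str(part) for part in version))
```

A single breaking change anywhere in a release is enough to bump MAJOR, which is why a project that ships hard fork support and RPC behavior changes regularly would see its MAJOR number climb fast.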
The Rivet team is then offering to maintain long-term support (LTS) releases, backporting critical bug fixes and security updates into the LTS release. Companies that want to make sure they stay up to date on critical updates but aren’t ready to deal with breaking changes can apply the LTS update and only get critical updates. The Rivet team is happy to maintain these support branches — we already do much of the work for our own fork, and the additional effort should be relatively minimal. But we can’t do it without support from the Geth team, highlighting (at least to us) which updates are critical and should be backported vs which are optimizations.
Now, any LTS release would necessarily end with a hard fork; we’re not proposing to backport hard fork functionality into the LTS (as hard forks are the definition of a breaking change). When preparing for a hard fork, businesses would need to upgrade to the next LTS release, which will likely be the first Geth release to fully support the pending fork. In a case where a year goes by between hard forks, we may have an intermediate LTS to come up to speed with new features, with a defined transition period during which both LTS releases are supported to allow businesses time to test and transition between versions.
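The backport policy described above can be sketched as a simple filter. Everything here is hypothetical: the Change record, its labels, and the sample titles are invented for illustration, and the "critical" flag stands in for the information we would need from the Geth team.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Change:
    title: str      # hypothetical pull request title
    kind: str       # assumed labels: "security", "bugfix", "feature", "hardfork"
    critical: bool  # assumed flag supplied by the upstream maintainers


def lts_backports(changes: List[Change]) -> List[Change]:
    """Keep only critical security and bug fixes.

    New features and hard fork support are never backported.
    """
    return [c for c in changes if c.critical and c.kind in {"security", "bugfix"}]


# Invented sample data for one upstream release.
upstream = [
    Change("fix consensus flaw in precompile", "security", True),
    Change("add new RPC method", "feature", False),
    Change("enable next hard fork", "hardfork", True),
    Change("fix typo in log message", "bugfix", False),
]

for change in lts_backports(upstream):
    print(change.title)
```

The filter is only as good as its inputs: if a critical security fix is labeled as an optimization, it never reaches the LTS branch.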
The LTS model is one adopted by many projects with similar complexity to Geth. You see it in operating systems such as Ubuntu and Red Hat, where certain versions of the OS get critical fixes for years, while the bleeding edge versions are supported for less than a year. You see it in many software frameworks such as Node.js, Python, and Django, where some releases are supported for short periods of time, while others are supported for 30 months. LTS releases are the industry standard way to balance the need for a project to move forward quickly with the need for businesses with dependencies on that software to have operationally stable systems.
The November 11 chainsplit highlighted the need for a similar level of rigor in Ethereum clients. We recognize that the Geth team is stretched very thin, and are happy to pick up the slack to make the LTS model work so long as they’re willing to cooperate by getting us the information we need to know which updates should be backported to the LTS release.
There is some risk that highlighting a bugfix by including it in an LTS release may draw a potential attacker’s attention to that fix, prompting the exploitation of the vulnerability. But we believe that risk is more than offset by the value of teams being able to apply critical updates quickly, without having to untangle security updates from cool new features that introduce unexpected behaviors.