This is part 3 of an 8-part series. If you haven’t read from the beginning, we suggest starting with the intro.
Tuesday, April 21, 2020 - 9:41 PM
I was winding down for the evening when alarms started going off again, following the same patterns we’d seen Sunday night. In this one respect, the COVID-19 lockdown probably spared us a headache: before the pandemic, you’d have found much of our engineering team at a bowling alley on Tuesday nights. Instead of troubleshooting from a laptop in a noisy bowling alley, we had our full setups to work with.
Once again, we ran through our investigation process. This time, things looked a bit worse than Sunday night. Response times were higher, and we seemed to be serving more errors. When we tried running the offending query against our testing environment, we were horrified: running the query just once was enough to cause our Ethereum client software to run out of memory and crash.
Why We Survived
Here’s where Rivet’s innovations make a huge difference. If a normal Ethereum client crashes, it takes minutes (if not hours) to come back online. Ethereum clients keep a lot of data in memory and flush to disk only periodically, so crashing means recent data is lost. When a client restarts, it must first find peers on the Internet, redownload any data since it last committed to disk, and reverify that data before it’s ready to serve again. In the more extreme cases, the database can become corrupted, requiring the node to be resynced from the Ethereum network (which can take days) or recovered from backups (which can take hours).
At Rivet, we’ve implemented streaming replication. We have our “master” servers that connect to peers on the Internet, and stream changes to our “replica” servers, which commit to disk immediately. We can scale up our replica servers to handle increased load from customers without requiring new masters. Because we are running replicas instead of normal Ethereum clients, nothing is lost when a crash happens, there is no real risk of database corruption, replicas do not have to reestablish connections with peers, and they are able to resume serving our customers in under half a second.
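To make the replica idea concrete, here is a minimal sketch of a replica that persists every streamed change before treating it as applied and rebuilds its state from local disk on restart. It is a toy illustration only, not Rivet’s actual implementation: the `Replica` class, the JSON-lines log, and the `seq`/`key`/`value` change format are all assumptions made for the example.

```python
import json
import os
import tempfile


class Replica:
    """Toy replica: applies streamed changes and persists each one immediately.

    Hypothetical design for illustration; the change format is an assumption.
    """

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self.last_seq = 0
        self._recover()
        self._log = open(log_path, "a", encoding="utf-8")

    def _recover(self):
        # On startup, rebuild state from the local log. No peers are needed.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                change = json.loads(line)
                self.state[change["key"]] = change["value"]
                self.last_seq = change["seq"]

    def apply(self, change):
        # Skip changes we already have (e.g. re-sent after a restart).
        if change["seq"] <= self.last_seq:
            return
        self._log.write(json.dumps(change) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # on disk before we count it as applied
        self.state[change["key"]] = change["value"]
        self.last_seq = change["seq"]

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "replica.log")
    first = Replica(path)
    first.apply({"seq": 1, "key": "head", "value": "0xabc"})
    first.apply({"seq": 2, "key": "head", "value": "0xdef"})
    # Simulate a crash: abandon `first` and start a fresh process on the same log.
    restarted = Replica(path)
    result = (restarted.state, restarted.last_seq)
    first.close()
    restarted.close()
    return result
```

Because every change is synced to disk as it arrives, the restarted instance picks up exactly where the crashed one stopped. A real system would also need to handle a torn final write and resume the stream from `last_seq`, which this sketch leaves out.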
Obviously, crashing is still not good, and we want to prevent it from happening, but it’s far less catastrophic than it would be for a normal Ethereum client.
We immediately blocked the query to stabilize our servers. We waited a few minutes to see that things did in fact stabilize, and they did. While we were waiting, we started a root cause analysis; we prefer not to make major changes to our production software in the middle of the night — when possible we’ll make the changes the next day — but if these attacks persisted we might not have a choice.
Details of the Attacks
Transactions on the Ethereum blockchain are limited by “gas”. Every operation a smart contract performs consumes a certain amount of gas. If a transaction exceeds its gas limit, it stops executing and ends with an “out-of-gas” status. Allocating memory requires a very small amount of gas, but within the gas limits of a transaction, you’ll run out of gas long before the server runs out of memory.
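To see why a gas limit bounds memory, it helps to work through the numbers. The sketch below uses the memory cost formula from the Ethereum Yellow Paper (3 gas per 32-byte word plus a quadratic term, words² / 512) and searches for the largest memory size a given gas budget can pay for. The 10 million gas budget in the usage line is an illustrative assumption, not a figure from this incident.

```python
def memory_gas_cost(words):
    """Total gas charged for EVM memory that has grown to `words` 32-byte words."""
    return 3 * words + words * words // 512


def max_memory_bytes(gas_budget):
    """Largest memory size, in bytes, whose total cost fits within gas_budget."""
    lo, hi = 0, 1
    # Grow the upper bound until it is unaffordable.
    while memory_gas_cost(hi) <= gas_budget:
        hi *= 2
    # Binary search, keeping cost(lo) <= gas_budget < cost(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if memory_gas_cost(mid) <= gas_budget:
            lo = mid
        else:
            hi = mid
    return lo * 32


if __name__ == "__main__":
    # Illustrative budget only; pick whatever limit you want to reason about.
    print(max_memory_bytes(10_000_000))
```

Even if a call spent its whole 10 million gas on memory and nothing else, it could reach only a little over 2 MB, because the quadratic term dominates quickly. Remove the budget and that ceiling disappears.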
Outside of transactions that make changes to the blockchain, some kinds of read operations also execute within smart contracts. By default (prior to Geth 1.9.16), read operations are allowed essentially unlimited gas. This means that they can take a very long time to execute, and tie up a lot of server resources while they do. Unlimited gas for read operations is okay for a privately operated node that only has trusted applications running on it, but causes problems for services like Rivet that serve unknown third parties. We used a 5-second timeout to protect our servers from abuse; queries could do as much computation as they liked in 5 seconds, but if they went over, they were cut off.
The problem with this attack was that the smart contract spent all of its time allocating memory. It wasn’t doing any complex calculations; it wasn’t reading a bunch of stuff out of the database; it was just allocating RAM. And with no gas limit, you can allocate a lot of RAM within a 5-second timeout: way more RAM than our servers actually had.
Having spent about an hour identifying the problem and not seeing any new attacks, we decided that we could implement a fix for the problem in the morning. We all made sure our phones would wake us up if anything went wrong [1], and we went to bed.
Wednesday, April 22, 2020 - 12:38 AM
Well, that didn’t last long. Our phones woke us up and we hurried back to our computers. Running through our process again, we identified a new smart contract address to block, and things quickly stabilized. This time the attack just seemed to be slowing down our servers; it wasn’t crashing them. But the attackers were coming after us in the middle of the night, and we knew they had a trick up their sleeve that could crash our service, however briefly. No more waiting until morning; the fix needed to go out immediately.
The basic fix for our problem was easy. We needed to limit how much gas [2] a single query could use. Standard Ethereum clients have a flag to do just that, but our replicas did not. Adding one took only a few minutes.
The problem wasn’t the fix; the problem was knowing the impact it would have on our legitimate customers. The last thing we wanted to do was break one of our customers’ applications [3], so we needed more data. Thus, at around 1:00 AM we were making code changes to our production servers to gather more data.
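The idea of a per-query gas cap can be sketched in a few lines. This is a simplified model, not the code we shipped: `GasMeter`, `run_query`, and the representation of a query as a stream of per-operation gas costs are all assumptions for illustration. (In Geth, the corresponding setting is the `--rpc.gascap` flag.)

```python
import itertools


class OutOfGas(Exception):
    """Raised when a query tries to use more gas than its cap allows."""


class GasMeter:
    def __init__(self, cap=None):
        self.cap = cap  # None models the old "essentially unlimited" default
        self.used = 0

    def charge(self, amount):
        self.used += amount
        if self.cap is not None and self.used > self.cap:
            raise OutOfGas(f"used {self.used} gas, cap is {self.cap}")


def run_query(op_costs, gas_cap=None):
    """Run a read-only query, modeled as an iterable of per-operation gas costs.

    Returns the total gas used, or raises OutOfGas once the cap is exceeded.
    """
    meter = GasMeter(gas_cap)
    for cost in op_costs:
        # A real client charges gas before performing each operation,
        # so an over-budget operation never takes effect.
        meter.charge(cost)
    return meter.used


if __name__ == "__main__":
    expensive_query = itertools.repeat(100, 1_000_000)  # hypothetical workload
    try:
        run_query(expensive_query, gas_cap=25_000_000)  # illustrative cap
    except OutOfGas as err:
        print("query rejected:", err)
```

Unlike a wall-clock timeout, the cap cuts a query off after a fixed amount of work regardless of how fast the server happens to be, which is what bounds the memory a single request can demand.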
The Random Query Log
To get the data we needed, we adapted our slow query log into a “random” query log. We captured a small percentage of the requests on a small number of our servers. Once we had a good sample of requests, we could test those requests against both the old and new versions of our code to get a good idea which customers would be impacted and which wouldn’t.
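The sample-then-replay approach described above might look roughly like the following. The function names, the request format, and the two handlers are hypothetical stand-ins; the real log captured live traffic on production servers rather than filtering a list.

```python
import random


def sample_requests(requests, sample_rate, seed=None):
    """Keep each request independently with probability `sample_rate`."""
    rng = random.Random(seed)
    return [req for req in requests if rng.random() < sample_rate]


def find_impacted(sampled, old_handler, new_handler):
    """Replay sampled requests against both versions; return those that differ."""
    return [req for req in sampled if old_handler(req) != new_handler(req)]


# Hypothetical handlers: the old code serves everything, the new code
# rejects queries that need more gas than an illustrative cap.
ILLUSTRATIVE_GAS_CAP = 25_000_000


def old_handler(req):
    return "ok"


def new_handler(req):
    return "ok" if req["gas_needed"] <= ILLUSTRATIVE_GAS_CAP else "out of gas"


if __name__ == "__main__":
    traffic = [
        {"customer": "a", "gas_needed": 50_000},
        {"customer": "b", "gas_needed": 30_000_000},
    ]
    sampled = sample_requests(traffic, sample_rate=1.0)
    for req in find_impacted(sampled, old_handler, new_handler):
        print("would be affected:", req["customer"])
```

Sampling a small fraction keeps the logging overhead low on servers that are already under stress, at the cost of possibly missing queries that customers run only rarely.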
So we had a fix for the problem, and we were gathering data to determine what the impact would be. We tested our solution in our testing environment and confirmed that we could no longer crash our servers with a single request. But until we had enough data to adequately determine the impact to customers, we weren’t eager to push out our change. We agreed that if we saw any more attacks that night we’d go ahead and push the change, and we went back to bed.
Continue the story on Wednesday Morning
[1] We usually have two people from our team designated to wake up if an alarm goes off, and they can call the rest of the team if they need help, but given the attacks we’d just seen we wanted all hands on deck the moment there was a problem.
[2] Gas is a measure of how complicated an Ethereum transaction is.
[3] We knew about one query one of our customers runs regularly that uses a lot of gas, and we knew our fix wouldn’t interfere with that query, but if there was another high-gas query that other customers ran less frequently, we could end up blocking it along with the attacks.