This is part 4 of an 8 part series. If you haven't read from the beginning, we suggest starting with the intro.
Wednesday, April 22, 2020 - 1:00 PM
We made it through the night with no more attacks. In the morning, we tested the collection of queries we had gathered against our new restrictions. We identified one query run infrequently by one customer that was likely to be impacted by the change. It was a customer we had a close relationship with, so we contacted them, explained the situation, and got the green light to go ahead with the change.
Comparing the query log
To identify queries impacted by our changes, we took the data recorded by our random query log and fed it through a tool called Versus, which runs each query against two different servers and compares the results. We compared our testing environment (which had the change) to our production environment (which did not), and identified the small handful of queries that behaved differently after the change.
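To give a feel for the idea, here is a minimal sketch of that kind of comparison. It is not the Versus tool itself: the function names, the shape of the logged queries (plain JSON-RPC request objects), and the choice to compare only the result or the error code are all assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable, Iterable, List, Tuple

# A "caller" takes a JSON-RPC request dict and returns the response dict.
Caller = Callable[[dict], dict]


def make_http_caller(url: str, timeout: float = 10.0) -> Caller:
    """Build a caller that POSTs JSON-RPC requests to `url` (hypothetical endpoint)."""
    def call(request: dict) -> dict:
        body = json.dumps(request).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    return call


def outcome(response: dict) -> Tuple[str, object]:
    """Reduce a response to the part worth comparing: the result or the error code."""
    if "error" in response:
        return ("error", response["error"].get("code"))
    return ("result", response.get("result"))


def find_differences(
    queries: Iterable[dict], call_a: Caller, call_b: Caller
) -> List[Tuple[dict, tuple, tuple]]:
    """Replay each logged query against both servers and keep the ones that disagree."""
    diffs = []
    for query in queries:
        a = outcome(call_a(query))
        b = outcome(call_b(query))
        if a != b:
            diffs.append((query, a, b))
    return diffs
```

In practice you would pass `make_http_caller(...)` for the testing and production endpoints. Because the callers are plain functions, the comparison logic can also be exercised with stubs, without touching a real server.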
By 1:00 PM we had everything ready to go. Since we hadn’t seen any attacks all morning, we took the time to go through our usual predeployment checklist, then started our deployment. Just a few minutes into the process, the next wave of attacks began.
How the Deployment Exacerbated the Attack
This complicated things.
At OpenRelay we subscribe to the philosophy of immutable infrastructure. That is to say, we almost never update a server; we replace it with an updated server.
Our servers are behind a load balancer that only routes to healthy servers. During a deployment, new servers start up but are not considered healthy until they have pulled down all the latest data and loaded enough of it into memory to answer queries quickly. Under normal circumstances, this means we can start new servers, which receive no queries until they are able to handle them, then shut down the old servers so that all requests get routed to the new ones.
When we’re under attack, this approach becomes a problem. We’re rather conservative about when we consider a server to be healthy. If response times start to creep up on a server or too many requests get backlogged, a server will mark itself as unhealthy to give it a chance to catch back up. Usually this ensures a high level of performance. During a DoS attack in the middle of a deployment, it works against us.
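A self-reported health check of this kind can be sketched roughly as follows. This is an illustration only: the class name, the rolling-average approach, and every threshold below are assumptions, not our actual implementation or values.

```python
from collections import deque


class HealthMonitor:
    """Tracks recent latencies and queued requests to decide if a server is healthy."""

    def __init__(self, max_avg_latency=0.5, max_backlog=100, window=50):
        # All three limits are hypothetical values chosen for the example.
        self.max_avg_latency = max_avg_latency  # seconds
        self.max_backlog = max_backlog          # requests waiting or in flight
        self.latencies = deque(maxlen=window)   # rolling window of response times
        self.backlog = 0

    def request_queued(self):
        self.backlog += 1

    def request_finished(self, latency_seconds):
        self.backlog = max(0, self.backlog - 1)
        self.latencies.append(latency_seconds)

    def is_healthy(self):
        """Report unhealthy if requests are piling up or responses have slowed down."""
        if self.backlog > self.max_backlog:
            return False
        if not self.latencies:
            return True
        average = sum(self.latencies) / len(self.latencies)
        return average <= self.max_avg_latency
```

The conservative part is the choice of thresholds: the tighter they are, the sooner a struggling server steps out of rotation, and the sooner an attack can push every server over the line at once.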
The DoS attack pushed response times on all of our old servers past their healthy threshold, so they marked themselves as unhealthy. When all of the servers are unhealthy, the load balancer routes requests to all of them. If we hadn’t been in the middle of a deployment, this would have meant that all of our servers shared the load. Response times would be high, but the error rate would remain low.
But since we were in the middle of a deployment, the existing servers getting marked as unhealthy meant that the new servers, which had yet to come online, started receiving requests. Instead of just having high response times, these servers returned errors. We manually removed the new servers from the load balancer until they were ready to serve requests.
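The routing rule behind this failure mode is simple enough to sketch. The `Server` type and `routable` function below are hypothetical names for illustration; a real load balancer applies this logic internally rather than through code like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool


def routable(servers: List[Server]) -> List[Server]:
    """Route to healthy servers; if none are healthy, fall back to every registered server."""
    healthy = [s for s in servers if s.healthy]
    return healthy if healthy else list(servers)


# Mid-deployment: old servers are serving, the new one is still loading data.
old = [Server("old-1", True), Server("old-2", True)]
new = [Server("new-1", False)]
registered = old + new

before_attack = [s.name for s in routable(registered)]

# The attack pushes the old servers over their thresholds.
for server in old:
    server.healthy = False

# Nothing is healthy, so the not-yet-ready server now receives traffic too.
during_attack = [s.name for s in routable(registered)]

# Removing the new server by hand restores "slow but working" behaviour.
after_removal = [s.name for s in routable(old)]
```

The fallback to all servers is a sensible default when every server is merely slow. It only hurts when some registered servers cannot answer at all, which is exactly the state a deployment puts you in.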
Aside from the extra steps this forced us to take in our deployment process, we needed to block the incoming attack until the new servers were ready to take the load. The fix we had put in place meant that the new servers could handle these requests much more readily than the old servers, but the new servers weren’t online yet, so we needed to block these requests as we had the ones the night before.
Now, normally when our servers are underperforming we go through the six-step process described in Sunday’s post, which takes about 20 minutes. But we had noticed a pattern in the attacks. At this point, we had been attacked using three different Ethereum smart contracts, but every single one of them had been deployed by 0x9d9014c8b1fbe19cc0cd6f371a0ae304f46a0ff7 1. So rather than going through our whole 20-minute process, we checked to see what contracts had been deployed by the address we’d established as our attacker. They had deployed a new contract about ten minutes prior to the attack; we blocked it.
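One way to run that kind of check is to scan recent blocks for contract-creation transactions sent by the address. The sketch below assumes blocks shaped like the output of `eth_getBlockByNumber` with full transaction objects, where a creation transaction has a null `to` field, and receipts shaped like the output of `eth_getTransactionReceipt`. The function names and the `get_receipt` callable are hypothetical, and this is not necessarily how we did the lookup at the time.

```python
from typing import Callable, Iterable, List, Optional

ATTACKER = "0x9d9014c8b1fbe19cc0cd6f371a0ae304f46a0ff7"


def creation_tx_hashes(block: dict, deployer: str) -> List[str]:
    """Hashes of transactions in `block` that create a contract and come from `deployer`."""
    deployer = deployer.lower()
    return [
        tx["hash"]
        for tx in block.get("transactions", [])
        if tx.get("to") is None and tx.get("from", "").lower() == deployer
    ]


def deployed_contracts(
    blocks: Iterable[dict],
    deployer: str,
    get_receipt: Callable[[str], Optional[dict]],
) -> List[str]:
    """Addresses of contracts that `deployer` created directly in the given blocks."""
    addresses = []
    for block in blocks:
        for tx_hash in creation_tx_hashes(block, deployer):
            receipt = get_receipt(tx_hash)  # e.g. a wrapper around eth_getTransactionReceipt
            address = receipt.get("contractAddress") if receipt else None
            if address:
                addresses.append(address.lower())
    return addresses
```

Note that this only finds contracts created directly by the address. A contract created by another contract on the attacker's behalf would not show up as a top-level creation transaction.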
Just as the new servers were starting to come online, attacks commenced again. The attackers on the other end were persistent; they saw that we had blocked their last contract, deployed a new one, and were already using it in an attack. We blocked this one too. As we finished our deployment, this process repeated two more times — they would deploy a new contract, we would block it. By the end of the process we were watching their address carefully and blocking any contracts they deployed before they had a chance to use them to attack us. Just when we finished the deployment of servers that would have no trouble handling these attacks, the attacker threw in the towel and the attacks stopped.
A Stroke of Luck
We got pretty lucky that our attacker was just using a single Ethereum address for all of their attacks. Had they switched up what addresses they deployed contracts from, we would have been ten minutes into an attack before we had the contract blocked. They also could have executed the attack in the constructor function of a smart contract deployed with eth_call; the contract doesn’t actually get deployed to the network, but our node would still execute its code, which would be much harder to block.
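To make the eth_call variant concrete: such a request carries the contract's creation code in `data` and leaves out the `to` field, so there is no contract address to put on a block list. The sketch below builds a request of that shape and shows one way a filter might recognise it. Both function names are hypothetical, and whether a given node runs a `to`-less eth_call as a contract creation is an assumption to verify against the node software you operate.

```python
def constructor_call_request(init_code_hex: str, request_id: int = 1) -> dict:
    """Build an eth_call whose call object has creation code in `data` and no `to`."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_call",
        "params": [{"data": init_code_hex}, "latest"],
    }


def is_creation_style_call(request: dict) -> bool:
    """True for an eth_call whose call object has no target address."""
    if request.get("method") != "eth_call":
        return False
    params = request.get("params") or []
    if not params or not isinstance(params[0], dict):
        return False
    return params[0].get("to") is None
```

A filter like this can flag the request shape, but it cannot tell a malicious constructor from a legitimate one, which is part of why this variant would have been harder to block than a deployed contract.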
Finally, we could breathe a sigh of relief. The attacker seemed to be gone for now, and even if they came back, they’d be coming back to servers that enforced gas limits 2, rendering their attacks completely ineffective.
Continue the story on Thursday
Ethereum accounts can be smart contracts, which are controlled by programs stored on the blockchain, or they can be “external accounts” controlled by a person or computer in control of a cryptographic private key. Smart contract accounts must be deployed to the network by external accounts, and we had identified a single external account deploying all of the contract accounts used in attacks against us.