This is part 1 of an 8-part series. If you haven't read from the beginning, we suggest starting with the intro.
Sunday, April 19, 2020 - 4:57 PM
I was getting ready to cook dinner for my family when an alarm went off on my phone 1 — All of our production servers were reporting that they were unhealthy, and our average response time had crept above an acceptable threshold. I set dinner aside and headed to my home office to investigate.
These sorts of events are uncommon, but not super rare. Our servers are set to alert us before things get bad for our customers. Usually the cause is some customer running a bunch of queries to gather information about something on the Ethereum blockchain; perhaps a wallet 2 is indexing the transactions of a new token they’re adding support for, or maybe a customer is seeing a surge in usage beyond what they’re used to. Less often, the cause is someone with malicious intent trying to hurt us or our customers. Usually the solution is to spin up new servers (which happens automatically) and the extra capacity is online within a few minutes even if our team does nothing. But ensuring a high level of service is critical to our business, so when these alarms go off we investigate just to be sure.
When we see these events happen, we have a process that we go through:
- Check the servers for responsiveness. Just because they’re reporting unhealthy doesn’t mean they’re down, just that they’re not operating at full capacity.
- If the autoscaler 3 hasn’t already done so, bump up the number of servers we’re running. If it turns out we don’t need them, we can always spin them back down, but we’d rather have extra capacity to make sure our users aren’t impacted while we investigate.
- Depending on the level of impact to our service, we notify customers who may need to make adjustments to their services.
- Pull down the slow query log 4. We deal with tens of thousands of requests a minute. A tiny fraction of those take more than a second to run. We filter out the slow queries we’ve come to expect, and look at what’s left (there’s a rough sketch of this kind of filtering just after this list).
- Generally, the slow query log shows some new query or set of queries that’s causing a problem. We run that query against our testing environment 5 to see what it’s doing. Depending on the nature of the query, we evaluate whether or not the query is serving some legitimate function for a customer. If it looks valid, we look for ways we can better handle that query (caching, optimizations, etc.). If it looks malicious in nature, we write a filter to block that query from running.
- We create a ticket to address the underlying problem more holistically — Blocking a query is generally a short-term fix, and there’s some underlying issue the query was exploiting that requires a more robust solution. We schedule those tickets into our workload based on our assessment of the threat.
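To make that slow-query step a little more concrete, here is a rough sketch of the kind of filtering involved. The log format, the field names, the one-second threshold, and the list of "expected" slow methods are all assumptions for illustration, not our actual tooling.

package main

import (
	"fmt"
	"time"
)

// entry is a simplified slow-query log record, invented for this example.
type entry struct {
	Method   string
	Params   string
	Duration time.Duration
}

// knownSlow lists query types we already expect to run slowly and have accounted for.
var knownSlow = map[string]bool{
	"eth_getLogs":            true,
	"debug_traceTransaction": true,
}

// unexpected returns the entries that exceed the threshold and don't match a known-slow pattern.
func unexpected(log []entry, threshold time.Duration) []entry {
	var out []entry
	for _, e := range log {
		if e.Duration >= threshold && !knownSlow[e.Method] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	log := []entry{
		{"eth_getLogs", `{"fromBlock":"0x0","toBlock":"latest"}`, 3 * time.Second},
		{"eth_call", `{"to":"0x01E11a017c18551863F244203f1aDCd50DA43c3a"}`, 5 * time.Second},
	}
	for _, e := range unexpected(log, time.Second) {
		fmt.Printf("%s took %s: %s\n", e.Method, e.Duration, e.Params)
	}
}

The interesting part in practice is the allowlist of slow queries we already expect; anything that falls outside it gets a closer look.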
In general, the process described above takes between 10 and 20 minutes. On the evening of April 19, the slow query log showed us hundreds of occurrences of a particular query.
The query
If you’re a curious Ethereum enthusiast, here is the query we encountered that night:
{
  "jsonrpc": "2.0",
  "method": "eth_call",
  "params": [
    {
      "from": "0x2D2af99CCeBc1Bf26849732f3a0ac2DAfa982421",
      "gas": "0x5f5e100",
      "gasPrice": "0x174876e800",
      "to": "0x01E11a017c18551863F244203f1aDCd50DA43c3a",
      "data": "0xcefe0e210000000000000000000000000000000000000000000000000000000005f5e100"
    },
    "latest"
  ],
  "id": 417
}
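If you want to replay it yourself, something along these lines will send the request to a node you control. The endpoint below is a placeholder; point it at your own node or a test environment, not at anyone's production service.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder endpoint: substitute your own Ethereum node or test environment.
	const endpoint = "http://localhost:8545"

	// The request body, exactly as it appeared in our slow query log.
	body := []byte(`{"jsonrpc":"2.0","method":"eth_call","params":[{"from":"0x2D2af99CCeBc1Bf26849732f3a0ac2DAfa982421","gas":"0x5f5e100","gasPrice":"0x174876e800","to":"0x01E11a017c18551863F244203f1aDCd50DA43c3a","data":"0xcefe0e210000000000000000000000000000000000000000000000000000000005f5e100"},"latest"],"id":417}`)

	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	reply, _ := io.ReadAll(resp.Body)
	fmt.Println(string(reply))
}

(On our systems, as the next paragraph explains, this call was timing out rather than returning a useful answer.)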
Whoever was bogging down our systems was running the exact same query dozens of times per second. This is a tell-tale sign of a Denial of Service (DoS) attack. What’s more, the request was timing out 6, so whoever was running it wasn’t getting a useful answer, and blocking it wasn’t going to cause them any new problems. We set up a filter to block this request and I went back to making dinner.
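For the curious, that kind of filter can be sketched as a small piece of middleware sitting in front of the nodes. Everything below (the handler shape, the map of blocked addresses, the response code) is an assumption for illustration, not our production code.

package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"strings"
)

// blockedContracts holds contract addresses whose eth_call requests we refuse.
// Addresses are stored lowercased so lookups are case-insensitive.
var blockedContracts = map[string]bool{
	"0x01e11a017c18551863f244203f1adcd50da43c3a": true,
}

// rpcRequest captures only the fields the filter needs from a JSON-RPC payload.
type rpcRequest struct {
	Method string            `json:"method"`
	Params []json.RawMessage `json:"params"`
}

// blockFilter rejects eth_call requests aimed at a blocked contract and passes
// everything else through to the next handler.
func blockFilter(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// Restore the body so the next handler can still read it.
		r.Body = io.NopCloser(bytes.NewReader(body))

		var req rpcRequest
		if json.Unmarshal(body, &req) == nil && req.Method == "eth_call" && len(req.Params) > 0 {
			var call struct {
				To string `json:"to"`
			}
			if json.Unmarshal(req.Params[0], &call) == nil && blockedContracts[strings.ToLower(call.To)] {
				http.Error(w, "query blocked", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Wire the filter in front of whatever actually proxies requests to the Ethereum nodes.
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Forwarding to a node would happen here.
	})
	http.ListenAndServe(":8080", blockFilter(upstream))
}

Keying the block on the contract's "to" address is exactly the quick fix described above, and exactly why it's fragile: the moment the attacker redeploys the same contract at a new address, the list is out of date.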
It wasn’t long before I was disrupted again. One of the problems with blocking specific queries is that it can be trivially easy for a determined attacker to circumvent. In this case, we had blocked calls dealing with a particular Ethereum smart contract 7, so all the attacker had to do was redeploy the contract at a different address to resume their attacks.
We ran through our process again, and determined that the culprit this time was the smart contract
0x8eec0acff407e82531031a10c4673a08a057d844
which had been deployed just minutes after we blocked the previous contract, by the same address that deployed the original. But this round of attacks only lasted a few minutes, and we neglected to block the offending deployer address (which, of course, we would come to regret).
1. When your job is running servers that need to be available 24/7, you set up software to monitor the servers and let you know at the first sign of trouble. At Rivet, we have someone ready to respond when something goes wrong at all hours of the day and night. ↩
2. Wallets are tools that help people keep track of their cryptocurrencies. They can take many forms, from websites to mobile applications, to hardware devices you keep on your keychain. ↩
3. In cloud computing, autoscalers are programs that monitor the load on a group of servers and add or reduce capacity to ensure demand is met. ↩
4. Slow query logs are an IT concept conventionally associated with databases. At times, Rivet processes thousands of requests per second, and we don’t want to record the vast majority of those requests, both to protect our users’ privacy and because doing so would be expensive. The slow query log captures requests that take longer than normal to run. This allows us to see what queries may be causing problems on our systems. ↩
5. In IT it is common practice to have a “production environment”, which is a set of servers with important data that everyone relies on, and also one or more testing or staging environments, where changes and risky actions can be tested without impact to the production environment. Larger organizations tend to have more environments, so different groups can test different things without impacting each other. ↩
6. Some types of queries against Ethereum nodes can take a long time to run. If we let them go without limits, some would literally take forever, so we put a 5-second cap on the runtime of individual queries. ↩
7. Smart contracts are essentially small computer programs that run on the blockchain. They can do any computations their authors want them to do and store any data their authors want them to store, within some limitations. When a smart contract is published to the blockchain it is given an address that looks like “0x01E11a017c18551863F244203f1aDCd50DA43c3a”. ↩