Zaddy: Why and How We Built Our Own Proxy
Since launching Zid in 2017, we have always relied on off-the-shelf infrastructure components. We did most of the heavy lifting in the application layer: tenant isolation, routing, rate limiting, and caching were all handled in our applications. Over time, new challenges started to arise. Some were a result of our architecture's evolution (many services written in many languages), and others were simply scaling problems (more customers, more load). It became more difficult to standardize caching mechanisms, logging, and rate limiting across the services. On top of this, even when rate limiting was handled correctly, a large enough burst of requests could still cause an outage. In fact, we had our fair share of service degradations caused by DDoS attacks targeting some of our customers.
To tackle these challenges, we decided to look into a proxy layer that would receive all of our traffic and filter it. As an MVP of this component, we wanted to implement rate limiting first. The rate limiter needed to support setting limits dynamically for individual hosts. In other words, different customers (i.e., domains) would have different rate limits, set according to what we consider normal traffic for them. This meant the proxy had to be programmable: a server we could extend with code, not just configuration. Programmability would also give us the flexibility to build more features on top of the proxy without being limited to whatever comes included.
Our Options
Given this, we started surveying our options. The first obvious one was Nginx, which can be extended with Lua via OpenResty. The second was AWS API Gateway, given that we were already running on AWS. The last option was Caddy, which came to us by accident: Mohammed from the Caddy team reached out to see if we were interested in Caddy around the time we started researching this.
We quickly ruled out AWS API Gateway because it was inflexible and offered no way for us to implement per-host rate limiting. It would also have locked us into AWS. So the discussion came down to Nginx + Lua vs. Caddy (written in, and extensible with, Go).
Nginx vs. Caddy
Nginx came on strong at first because, well, it's Nginx. The main, and probably only, drawback was that it is programmable only in Lua, which we did not have much experience with on the team. Lua seems to be an easy language to learn, but it does not have the strong community and package ecosystem of Go. Also, as discussed in a Cloudflare post, Lua can be less efficient at accessing Nginx's data structures because data has to be allocated and copied between Nginx's C side and the Lua VM.
Caddy, on the other hand, is written in Go, and we already had a few engineers on the team who had used it elsewhere. Caddy's module system was interesting and directly addressed our needs, although it was confusing at first, as there is no other system (that we are aware of) that follows their approach. Caddy's modular architecture lets you write new extensions (called modules) and compile them directly into a new Caddy binary. No dynamic linking, no wrestling with version compatibility at run time; it is all self-contained. On top of that, a good number of "standard" and "non-standard" (aka community) modules are already available on their website.
So, Caddy was our front-runner, and we needed to implement a proof of concept. Before adding rate limiting, we needed to count page visits per minute (which Cloudflare does not give us). So, we decided to implement a visit counter module that counted both unique and repeated visits per host. I will not get into the details of the implementation, as it is beyond the scope of this writeup, but it was ~70 lines of code that took two days of work (mostly spent figuring out how to properly compile Caddy with the module). Then, we spent another day properly setting up a Caddyfile to proxy our traffic, and that was it! When it was ready and running in a testing environment, I told the team this was suspiciously easy!
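To give a sense of the scale involved, here is a minimal sketch of what such a handler module can look like. This is not our actual module; the names, the in-memory counter, and the module ID are illustrative, but the CaddyModule/Provision/ServeHTTP structure is what Caddy's module system expects.

```go
package visitcounter

import (
	"net/http"
	"sync"

	"github.com/caddyserver/caddy/v2"
	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
)

func init() {
	// Registering the module makes it available once it is
	// compiled into the Caddy binary (e.g., with xcaddy).
	caddy.RegisterModule(&VisitCounter{})
}

// VisitCounter is a hypothetical middleware that counts
// requests per host in memory.
type VisitCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// CaddyModule tells Caddy the module's ID and how to create it.
func (*VisitCounter) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "http.handlers.visit_counter",
		New: func() caddy.Module { return new(VisitCounter) },
	}
}

// Provision initializes the module when the config is loaded.
func (vc *VisitCounter) Provision(ctx caddy.Context) error {
	vc.counts = make(map[string]uint64)
	return nil
}

// ServeHTTP counts the visit, then hands the request to the
// next handler in the chain.
func (vc *VisitCounter) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
	vc.mu.Lock()
	vc.counts[r.Host]++
	vc.mu.Unlock()
	return next.ServeHTTP(w, r)
}

// Interface guards: fail at compile time if the module stops
// satisfying the interfaces Caddy expects.
var (
	_ caddy.Provisioner           = (*VisitCounter)(nil)
	_ caddyhttp.MiddlewareHandler = (*VisitCounter)(nil)
)
```

A module like this gets compiled into a custom Caddy binary (the xcaddy tool automates that) and is then attached to routes in the config. Our real module also distinguishes unique from repeated visits, which this sketch leaves out.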
We then pushed it to production to see how it affects the performance of the platform and to start counting visits per host with our module. The results were promising: no noticeable added latency, and the counting worked perfectly. A week later, we implemented our rate limiting module based on a leaky bucket package in Go, which, again, took only a few days of work.
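The post does not name the package, and our module wraps an existing one, but to make the mechanism concrete, here is a hand-rolled sketch of a per-host leaky bucket (the "bucket as a meter" variant). In a real setup, the capacity and leak rate would come from the dynamic per-host configuration mentioned earlier rather than being fixed.

```go
package ratelimit

import (
	"sync"
	"time"
)

// bucket tracks the current fill level for one host.
type bucket struct {
	level float64   // how "full" the bucket is, in requests
	last  time.Time // when we last drained it
}

// LeakyLimiter applies a leaky-bucket limit per host.
type LeakyLimiter struct {
	mu       sync.Mutex
	buckets  map[string]*bucket
	capacity float64 // max burst per host
	leakRate float64 // requests drained per second
}

func NewLeakyLimiter(capacity, leakRate float64) *LeakyLimiter {
	return &LeakyLimiter{
		buckets:  make(map[string]*bucket),
		capacity: capacity,
		leakRate: leakRate,
	}
}

// Allow reports whether a request for the given host fits in the
// bucket. A proxy handler would respond with 429 when it is false.
func (l *LeakyLimiter) Allow(host string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.buckets[host]
	if !ok {
		b = &bucket{last: now}
		l.buckets[host] = b
	}

	// Drain the bucket for the time elapsed since the last request.
	b.level -= now.Sub(b.last).Seconds() * l.leakRate
	if b.level < 0 {
		b.level = 0
	}
	b.last = now

	// Reject if adding this request would overflow the bucket.
	if b.level+1 > l.capacity {
		return false
	}
	b.level++
	return true
}
```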
With this success, Caddy was the winner of this debate. We even gave our build that contains Zid's custom modules a name: Zaddy (that is Zid+Caddy).
In the following sections, we will discuss areas where Caddy has helped us and share an incident that was caused by our Caddy setup.
Saving Us From DDoS Attacks
This topic could be a post on its own, but it was the main reason we started this endeavor. Rate limiting helps us defend against two types of abuse: unfair (yet innocent1) usage of the platform and malicious attacks (e.g., denial of service). Either one can turn a single noisy neighbor into a disruption for the rest of the merchants on the platform.
We wrote a custom rate limiter that first detects abuse and then triggers an attack protection rule on Cloudflare so that malicious traffic does not even hit us. For the merchant being attacked, this buys time for Cloudflare's protection to kick in and block the malicious traffic. For the other merchants on the platform, it ensures that they stay up even during an attack.
For merchants receiving legitimately high traffic, we adjust their rate limits accordingly so their stores are not cut off for real buyers.
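In outline, the escalation works like this: the limiter counts rejected requests per host, and once a host crosses a threshold, a hook enables the protection rule. The sketch below shows the shape of that hook; the Protector interface, the function, and the threshold logic are all illustrative rather than our production code, and the actual Cloudflare API call is omitted.

```go
package protect

import (
	"context"
	"log"
)

// Protector abstracts "turn on upstream protection for a host".
// In our case the implementation would call the Cloudflare API
// to enable an attack protection rule (details omitted).
type Protector interface {
	Protect(ctx context.Context, host string) error
}

// escalate is a hypothetical hook run on each rejected request.
// Once a host accumulates enough rejections, we ask the
// Protector to take over so the traffic never reaches us.
func escalate(ctx context.Context, p Protector, rejections map[string]int, host string, threshold int) {
	rejections[host]++
	if rejections[host] == threshold {
		if err := p.Protect(ctx, host); err != nil {
			log.Printf("failed to enable protection for %s: %v", host, err)
		}
	}
}
```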
Caching
Following the success of the rate limiter, we started looking into performance gains. Cloudflare already does caching, but it is not as flexible as we would like. More importantly, it highly depends on which edge you are routed to: each edge warms its own cache independently, so if our customers are widely distributed and/or our TTLs are short, we do not benefit as much. That was exactly our case.
So, we used Souin to cache various pages (and APIs) with staleness rules that let us serve a portion of traffic even when some of the upstream servers are down. This single-handedly took our top page loads from 4s down to 0.6s without touching application code. The cherry on top is that Souin uses Go's singleflight mechanism, which means identical concurrent requests reuse the same response instead of bombarding the origin servers with duplicate requests, mitigating the thundering herd problem.
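Since singleflight does a lot of quiet work here, a standalone illustration may help. This sketch uses the golang.org/x/sync/singleflight package directly (it is not Souin's internal code): five concurrent requests for the same key result in a single origin fetch whose result they all share.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchOrigin simulates an expensive upstream page render.
func fetchOrigin(key string) (string, error) {
	time.Sleep(100 * time.Millisecond)
	return "rendered page for " + key, nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// All five goroutines ask for the same key while the
			// first fetch is in flight, so fetchOrigin runs once
			// and everyone shares its result.
			v, _, shared := group.Do("store-home", func() (any, error) {
				return fetchOrigin("store-home")
			})
			fmt.Println(v, "shared:", shared)
		}()
	}
	wg.Wait()
}
```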
Then, It Caused Us an Outage
Unfortunately, it was not all sunshine and roses. We ran into two bugs, one in Caddy and one in Souin, that combined to cause an outage and multiple incidents of service degradation. The issue in Souin was related to Redis connection management, and the issue in Caddy was a race condition that caused Caddy to crash. When Caddy crashed, it restarted and, given the connection management bug, reconnected to Redis with too many connections. That caused Redis's CPU utilization to spike until it stopped accepting new connections, at which point Caddy crashed and tried to reconnect again. You get where this is going.
Fortunately, both projects have great maintainers. Mohammed Alsahaf and Darkweak were very proactive and resourceful in tracing the problems and solving them. Both issues were quickly patched2, and we were then able to push a build to a portion of the traffic to verify that the issues were solved.
Future of Caddy at Zid
We are looking to expand our use of Caddy as both a proxy and an application server (something we have not discussed in this post). On the proxy side, we are experimenting with optimizations like minifying all content at the proxy3 and dynamically compressing and thumbnailing images.
On the application server side, there are two projects we are looking into. The first is migrating our php-fpm applications to FrankenPHP, a modern PHP application server that runs on Caddy. Its worker mode should let us improve the throughput and efficiency of our applications.
The second is our storefront component, which is responsible for rendering every store's theme. We are experimenting with running it entirely as a Caddy app instead of keeping it a separate application. The idea is that the fewer hops and components in a request path, the faster the response reaches the end customer, and the higher the chance they stay on the site.