A Rewrite Story (and How It Didn’t Go As Planned)

In a previous article, Hussam Almarzooq briefly described how we rewrote one of our main components, the Products Component, in a language different from the one used in our primary system. In this article, I will tell the story of that adventure from my perspective as the main maintainer of the component, covering the challenges we went through, from the reasoning behind each decision to its consequences.

How it all started

At one point, only one person at Zid understood what every component did, simply because they were the one who had implemented it all in the first place. Everybody depended on that one person, and it was impossible to replace them or to reduce the pressure on them while keeping the system running. With the level of growth the company was going through, having other people take the time to learn all the tech debt was a luxury it could not afford.

A consultant was brought in to advise on how to resolve this dependency. They suggested that the whole team start learning a new programming language and rewrite some components in it; that way, all team members would progress in parallel from the same level, and the dependency on one person would eventually disappear.

I joined Zid in the middle of 2019 as a trainee. Because I was a trainee, someone thought that having me work solely on fixing bugs was an excellent way to explore the system and discover its dungeons (and it was). We had a backlog of reported bugs, and I was allowed to pick whatever I felt like fixing. For some reported bugs, however, one of the seniors repeatedly told me to be cautious. I eventually took one of them out of curiosity, and I did regret it in the end.

Most of the "to be cautious of" bugs were issues in the products component. Entering that area usually meant getting stuck for a week or two until you finally managed to fix the problem, only to discover that your fix had broken ten other things. There were neither tests nor documentation. Nobody fully understood the component. Most of it was inherited from OpenCart and thrown into our Laravel system with tons of spaghetti here and there.

At that time, the team had already started rewriting the products component in Python/Django, as advised by the consultant. When I first discovered that, I didn't know the story behind it. I thought it was a bizarre decision for a team that only knew PHP, except for one person, the Pythonista chosen to start implementing the rewrite.

The Implementation

When we implemented our Django version of Zid's products component, the project was named core-api with the intention that all main Zid components be moved there eventually.

Django was an excellent choice for a rewrite: it allowed rapid development and scaled incredibly well. The modularity of its apps, Celery's ability to handle almost all of our workloads, and the abundance of plug-and-play packages let the features of our products component grow and thrive in a short period.

What had been a nest of bugs in PHP became a safe place to deliver high-quality code without looking back. Tests were written from the beginning. Coverage was not great, but a little over 60% was enough to give us the confidence to add features and move fast with our changes.

Everybody on the team was supposed to be learning Python, in line with the plan to end the one-person dependency. However, they were too busy fixing issues and delivering new features under the high pressure of the fast-changing eCommerce field.

Since everyone else was too busy, the new trainee (me) was the one who could afford the luxury of learning and becoming the next Pythonista on the team. The others were supposed to follow whenever they found some free time, which basically never happened. Fixing the organizational issues could always be rescheduled to another day.

The Consequences

For all the excellence Django delivered, it wasn't long before we realized we had made a seemingly small mistake that would have a significant impact on our system. Even though we rebuilt a whole component in another language, we didn't consider how important it was for other components, such as orders and carts, to communicate directly with products. For example, the carts component checks the quantity of a product and increments or decrements it based on incoming purchases.

Relying on HTTP requests wasn't an option here. We needed ACID transactions and race-condition-safe operations to cover that area, but we didn't have the time and resources to manage distributed transactions (for example, with the saga pattern) across our cloud infrastructure. So we reluctantly decided to give the primary system direct access to the products database. As a result, changing a column in any table created by core-api could break the primary system, so we had to be very careful with our data schema, since it was used in multiple places.
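To make the race-condition concern concrete, here is a minimal sketch of a stock decrement that stays safe under concurrent purchases because the check and the update happen in a single SQL statement. It uses Python's built-in sqlite3 module only so that it runs on its own; the table layout and the reserve_stock helper are assumptions made for illustration, not our actual schema or code.

```python
import sqlite3


def reserve_stock(conn, product_id, amount):
    """Decrement stock only if enough is left; return True on success.

    The availability check lives in the WHERE clause, so there is no gap
    between reading the quantity and writing the new value.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "UPDATE products SET quantity = quantity - ? "
            "WHERE id = ? AND quantity >= ?",
            (amount, product_id, amount),
        )
    # One affected row means the reservation went through.
    return cursor.rowcount == 1


# Hypothetical demo data: a single product with three units in stock.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO products (id, quantity) VALUES (1, 3)")
conn.commit()
```

A statement like this only works when the caller can talk to the database directly, which is the trade-off described above: doing the same over HTTP would split the check and the update across two systems.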

It Keeps Getting Better

Products are the essential part of any visit to one of our stores. Since we deal with very spiky loads driven by merchants' marketing campaigns with influencers, performance and uptime were fundamental concerns for the newly rewritten component. After reaching an adequate level of feature parity, we started monitoring its performance and improving it. We used ScoutAPM as our application performance monitoring tool, which gave us some helpful insights into the bottlenecks of what we built.

Most of the bottlenecks we discovered were fixed by either prefetching relations or caching responses. For caching, we used Django's built-in cache_page decorator for around a year. It is a nifty helper that adds caching to an endpoint just by applying the @cache_page decorator to its view definition.

The decorator served its purpose, but it was very limiting. It gave us no way to build custom keys for cached views, which made it nearly impossible to invalidate view caches correctly. Data inconsistencies appeared everywhere, and we had to build a customizable solution for caching.

In June 2020, we published an open-source package called django-custom-cache-page, designed to replace Django's default decorator and provide flexibility and easier cache invalidation at scale.
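The idea of custom cache keys can be sketched in plain Python, without Django. This is not the API of django-custom-cache-page; the cache_view decorator, the key function, and the in-memory dictionary below are all hypothetical names chosen for illustration. The point is only that once the caller controls the key, invalidating one cached view becomes a single lookup.

```python
import functools

_cache = {}  # stand-in for a real cache backend


def cache_view(key_func):
    """Cache a function's result under a key built by key_func(*args, **kwargs)."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            key = key_func(*args, **kwargs)
            if key not in _cache:
                _cache[key] = view(*args, **kwargs)
            return _cache[key]
        return wrapper
    return decorator


def product_key(product_id):
    """Build a predictable key so other code can find the cached entry."""
    return f"product-detail:{product_id}"


def invalidate_product(product_id):
    """Drop the cached response for one product, if there is one."""
    _cache.pop(product_key(product_id), None)


STOCK = {42: 5}  # hypothetical data store


@cache_view(product_key)
def product_detail(product_id):
    return {"id": product_id, "in_stock": STOCK[product_id] > 0}
```

With this shape, whatever code changes a product can call invalidate_product with the same id, and the next request rebuilds the response from fresh data.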

After releasing our solution to cache everything, we thought we had closed the door on that area. Yet the good old issue of communication between backend and core-api came up again. The carts component, which had direct access to the products database to update a product's stock, didn't know anything about our cache implementation. Because the cache was never invalidated when quantities were updated, people still saw a product as available after it went out of stock. The number of people hitting Out Of Stock errors during checkout was troubling.

To fix this last issue, we started using a message broker between the two systems so that each could notify the other of events and trigger actions such as cache invalidations and background aggregations. It worked, though not perfectly, especially at peak times when our queues got choked: messages were processed late, and the differences could be noticeable. Dealing with cache invalidations in a distributed architecture is definitely fun.
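A rough sketch of that flow, assuming a "stock_changed" event and a per-product cache key (both names are made up here), looks like the following. Python's queue.Queue stands in for the real broker so the example runs in one process; the actual setup involved two separate systems and a real messaging service.

```python
import json
import queue

broker = queue.Queue()  # in-process stand-in for a real message broker
cache = {"product-detail:42": {"id": 42, "in_stock": True}}


def publish_stock_changed(product_id, quantity):
    """Called by the system that writes to the products database."""
    event = {"event": "stock_changed", "product_id": product_id, "quantity": quantity}
    broker.put(json.dumps(event))


def consume_pending():
    """Drain queued events, dropping the cached entry for each changed product.

    Returns the number of messages handled.
    """
    handled = 0
    while True:
        try:
            message = json.loads(broker.get_nowait())
        except queue.Empty:
            return handled
        if message["event"] == "stock_changed":
            cache.pop(f"product-detail:{message['product_id']}", None)
        handled += 1
```

Between publish_stock_changed and the next consume_pending call, the cached entry is still served. That window is exactly the delay that becomes noticeable when the queue backs up.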

The State We're In

Aside from the technical challenges that emerged from rewriting a single component, we never got the time to breathe and catch up with our plans to fix the organizational issues thanks to the market pressure. It's funny that we did manage to resolve some of the dependency on one person by creating a new dependency, but at least this time, it was for a single component, not a whole system.

The Benefits

As Hussam described in the previous article, the rewrite improved performance by up to 6000% on some endpoints. Currently, the products component holds more than 3 million products in production, and most of the products endpoints respond in under 70 ms, which is impressive given the complexity.

Yes, there were some challenges and odd design choices. However, I still believe that it had a great outcome. This little journey provided stability for the end-users and allowed our business to grow. That's the important part when it comes to any software project in a startup.

Conclusions

To wrap up this journey, here are some of our key takeaways:

  • A rewrite in the form of extracting a component into a separate service entails tons of challenges.
  • If you manage to deal with such challenges with your available time and resources, the level of freedom from such a rewrite is going to take you places.
  • Resolving problematic dependencies requires not only bold decisions and proper planning but also strict execution.
  • When doing a rewrite, start with a fair amount of written test cases: you'll be surprised by the quality and speed of development.

If you want to learn more about Zid or are interested in joining our team, please visit our careers page, and keep an eye on this blog for articles like this.