Serving The First Million Orders - Part 2
In the previous part, we talked about how Zid bootstrapped on OpenCart to launch in a very short period of time, and then gradually replaced parts as it grew. In this post, we continue our journey at the point where we were serving thousands of active merchants with over 10M Riyals in transactions.
100K to 1M
In this iteration, we tackled the infrastructure. We understood that Heroku was going to be very costly, and it was missing many of the features we needed at the time. Keep in mind that this was the Heroku of 2017. So, we decided to migrate all of our infrastructure to AWS (first to Elastic Beanstalk and later to EKS). On the database side, we migrated from RDS for MySQL to Aurora Serverless. Aurora Serverless offers automated scaling and elastic storage, which took away the effort of tracking allocated volume size, allocating read replicas, and resizing the master. We also added caches all over the stack, in Redis and Memcached, using ElastiCache.
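To give a flavor of what "caches all over the stack" means in practice, here is a minimal sketch of the cache-aside pattern, not our actual code: the key scheme, TTL, and database loader are hypothetical, and ElastiCache simply exposes a standard Redis endpoint.

```python
import json
import redis  # ElastiCache speaks the standard Redis protocol

# Hostname is illustrative; in production this points at the ElastiCache cluster.
cache = redis.Redis(host="localhost", port=6379)

def get_product(product_id, load_from_db):
    """Cache-aside read: try Redis first, fall back to the database loader."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = load_from_db(product_id)           # e.g., an ORM query
    cache.set(key, json.dumps(product), ex=300)  # expire after five minutes
    return product
```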
Some of the technical debt inherited from OpenCart still existed. In other words, logic originally designed to host a single store, the 2010-era way of handling i18n, and an unnecessarily complicated database schema were still driving our application. Moving forward, we wanted to gradually refactor the codebase, starting with the most critical components. We began with the component that was giving us the most headaches: the products module. After a long discussion, we decided to rewrite this module in Python/Django, a decision whose consequences we still debate. We were able to condense 15 tables into 4 and improve performance by up to 6000% on some endpoints. We were also able to get rid of some of the caches altogether with almost no hit to performance. Deploying the new service required data migration, so we gradually migrated store by store while running both services and keeping both code paths in related services. The migration took three months between pilot testing, migrations, rollbacks, and fixes.
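Keeping both code paths alive essentially means routing each store to whichever service owns its data. A minimal sketch of that idea, with entirely hypothetical names, might look like this:

```python
def fetch_products(store_id, migrated_stores, new_service, legacy_service, **filters):
    """Route product reads store by store during a gradual migration.

    `migrated_stores` grows as each store's data is moved; migrated stores
    hit the new Django service, everything else stays on the legacy path.
    """
    if store_id in migrated_stores:
        return new_service.list_products(store_id, **filters)
    return legacy_service.list_products(store_id, **filters)
```

The useful property is that a rollback for a single store is just removing it from the migrated set.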
Some requirements had changed, and others only needed to be simplified. Understanding each component's case helped us choose the right approach to building and migrating to the new system. So, the second component we rewrote was the order management component. This component was much harder to rewrite, given the lack of any form of automated testing. Additionally, this was the piece that determined how much a user was supposed to pay, including VAT; there was no room for error. This rewrite had to be done in place, meaning that we had to gradually reimplement and deploy while the service was still running.
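One common way to de-risk an in-place rewrite like this (a sketch of the general technique, not our actual code) is a shadow run: compute the total with both implementations, log any mismatch, and keep serving the legacy result until the new path earns trust. All names and the VAT rate below are illustrative.

```python
import logging
from decimal import Decimal
from typing import Callable

logger = logging.getLogger("orders.rewrite")
VAT_RATE = Decimal("0.05")  # illustrative; use whatever rate applies

def order_total_new(subtotal: Decimal) -> Decimal:
    """Reimplemented total calculation, VAT included."""
    return (subtotal * (1 + VAT_RATE)).quantize(Decimal("0.01"))

def order_total(subtotal: Decimal,
                legacy_total: Callable[[Decimal], Decimal]) -> Decimal:
    """Shadow run: compute both totals, log mismatches, serve the legacy one."""
    old = legacy_total(subtotal)     # existing implementation, still authoritative
    new = order_total_new(subtotal)  # new path, observed only
    if old != new:
        logger.warning("order total mismatch: old=%s new=%s", old, new)
    return old  # switch to `new` only once mismatches stop appearing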
We knew flying blind was bad, so we also had to start adding monitoring tools. Infrastructure-level monitoring, application-level monitoring, and uptime monitoring; we needed all the visibility we could get. We tried multiple combinations of tools, including New Relic, Datadog, Prometheus, ScoutAPM, OpsGenie, PagerDuty, and many more, until we finally settled on Prometheus for monitoring Kubernetes, ScoutAPM for application performance monitoring, and OpsGenie for incident management and uptime monitoring.
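For readers unfamiliar with how application-level metrics get exposed to Prometheus, here is a minimal sketch using the official prometheus_client library; the metric names and handler are illustrative, not what we run.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Orders endpoint requests")
LATENCY = Histogram("orders_request_seconds", "Orders endpoint latency")

def handle_order_request():
    REQUESTS.inc()
    with LATENCY.time():  # records how long the block takes
        ...               # actual request handling goes here

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes :8000/metrics
```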
1M and beyond
We believe there will always be room to improve, and there will always be a better approach to solving a problem. It is a matter of balancing the engineering effort against the business value returned by implementing a feature. That is why we sometimes had to compromise when choosing which parts to rebuild and which pieces we would continue to glue features onto.
Not all projects go as planned. One project that did not go as expected was called Raqeeb (رقيب), an automated functional testing toolkit and runner. The idea was to have a testing service separated from the application code and framework. It would have allowed us to mimic how users interact with the browser. It would have been built on top of Puppeteer, and developers would add critical test scenarios as needed. A scenario is a browser journey (e.g., add a product to the cart, check out). The Raqeeb project failed because it lacked ownership; we could not decide who was supposed to implement and maintain the tests. Should the backend developers do it? Should the frontend developers do it as part of feature implementation? Maybe the product managers should do it as part of the acceptance criteria? Who updates the tests when a bug is fixed or an element's selector changes?
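Since Raqeeb never shipped, here is only a hypothetical sketch of what a scenario might have looked like; the selectors and URLs are made up, and we use Playwright's Python API here as a stand-in for Puppeteer.

```python
# A hypothetical Raqeeb-style scenario: add a product to the cart
# and reach checkout. Everything below is illustrative.
from playwright.sync_api import sync_playwright

def add_to_cart_and_checkout(store_url: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(f"{store_url}/products/sample-product")
        page.click("#add-to-cart")
        page.click("#go-to-checkout")
        assert "/checkout" in page.url, "journey did not reach checkout"
        browser.close()
```

Notice how every line touches an element selector: that is exactly the maintenance burden nobody ended up owning.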
Next in line for revamping is our catalog system. We want to make it easier to build new themes, whether by our team or by the customers themselves. That will require us to become more and more API-driven, where we not only dogfood our APIs but also detach API users from API developers, making them work with each other as clients and providers so that API developers can evolve the APIs independently from the frontend's lifecycle. Development is done with very loose, high-level product coordination.
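As a rough illustration of what that decoupling implies, here is a minimal sketch of a versioned, read-only catalog endpoint a theme could build against; the route, fields, and view are hypothetical, not our actual API.

```python
# The JSON shape is the contract the theme depends on, so the
# backend is free to evolve behind it. Fields are illustrative.
from django.http import JsonResponse
from django.urls import path

def product_detail(request, product_id):
    # A real view would query the products service.
    product = {"id": product_id, "name": "Sample", "price": "49.00", "currency": "SAR"}
    return JsonResponse({"api_version": "v1", "data": product})

urlpatterns = [
    path("api/v1/products/<int:product_id>/", product_detail),
]
```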