Netflix's strategy to migrate 232M users over to new systems (3 min)
Topics covered in this newsletter:
- Replay Traffic Testing
- Sticky Canary Deployment
Happy Friday, busy engineers! Netflix’s system feeds the bingeing needs of 232,000,000 users worldwide. Let’s talk about how they migrate systems at scale!
If you find this article interesting, Netflix is hiring!
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7919c44e-11b9-4bd1-91fd-040675f11b0d/Untitled__4_.png)
Credit: Netflix Medium Blog
But first…
Why are system migrations such a huge deal?
At Netflix’s scale, even the tiniest system change can introduce inefficiencies that leave millions of people with a slower, jankier experience. That creates a need for reliable ways to compare the old system against the new one. Let’s talk about that.
Replay Traffic Testing
Replay Testing is Netflix’s way of recording the requests made to their existing system over a period of time and then replaying them against their new system.
Think of it like a sports team rewatching a recorded video of their game. The recorded video helps analyze, reflect on and understand the game's events and outcomes. Similarly, devs can analyze and find optimizations in the new system!
This has three huge benefits:
- Analyzing the performance of the old vs new system with the same requests provides an accurate point of comparison
- Acts as a load test to gauge system performance under different production conditions
- Helps test new APIs against users who somehow manage to create edge cases :)
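To make the idea concrete, here is a minimal sketch of a replay comparison in Python. It is not Netflix's tooling: the `old_system` and `new_system` handlers, the request shape, and the `ReplayResult` type are all hypothetical stand-ins, and a real setup would replay captured production traffic rather than a hard-coded list.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Response = Dict[str, str]
Handler = Callable[[Request], Response]


@dataclass
class ReplayResult:
    request: Request
    matched: bool   # did both systems return the same response?
    old_ms: float   # latency of the existing system
    new_ms: float   # latency of the new system


def timed(handler: Handler, request: Request) -> Tuple[Response, float]:
    """Call a handler and return its response plus elapsed milliseconds."""
    start = time.perf_counter()
    response = handler(request)
    return response, (time.perf_counter() - start) * 1000.0


def replay(recorded: List[Request], old: Handler, new: Handler) -> List[ReplayResult]:
    """Send every recorded request to both systems and compare the outcomes."""
    results = []
    for request in recorded:
        old_resp, old_ms = timed(old, request)
        new_resp, new_ms = timed(new, request)
        results.append(ReplayResult(request, old_resp == new_resp, old_ms, new_ms))
    return results


# Hypothetical handlers standing in for the two systems.
def old_system(request: Request) -> Response:
    return {"title": request["title"].strip().lower()}


def new_system(request: Request) -> Response:
    # Deliberately forgets strip() so the edge case below surfaces as a mismatch.
    return {"title": request["title"].lower()}


recorded = [{"title": "Stranger Things"}, {"title": "  Dark "}]
results = replay(recorded, old_system, new_system)
mismatches = [r.request for r in results if not r.matched]
print(f"{len(mismatches)} of {len(results)} replayed requests differ: {mismatches}")
```

The oddly padded second request plays the role of a user-created edge case: both systems agree on the clean input, and only the replay exposes the difference.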
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3815a5fc-f61b-49d1-a4c6-08e52e4a0e77/Untitled__5_.png)
Credit: Netflix Medium Blog
The authors go into detail on the Replay Testing setup, but let’s move on to the actual migrations!
Canary Deployment but Sticky??
Canary deployment is a widely adopted strategy for releasing a new system: roll it out to a small subset of users first, analyze its performance, and then gradually release it to more users.
Netflix’s Sticky Canary involves creating a special group of customer devices whose traffic is consistently routed to either the new version (canary) or the existing version (baseline) of the service for the duration of the experiment (image below).
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a16dc97b-6716-4d71-9d0a-711beb9a398c/Untitled__6_.png)
Credit: Netflix Medium Blog
For example, video playback requires a series of requests between the client device and various backend services. While traditional canary deployments only measure the performance of individual services, a sticky canary helps measure how the microservices behave together across that whole series of requests.
However, since a sticky canary consistently routes a set of devices to the canary version, those devices keep getting a degraded experience if the new system is faulty. To limit that risk, the canary framework watches for this and can kill the experiment.
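Sticky routing and the kill switch can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Netflix's framework: hashing the device ID is one common way to get a stable assignment, and the `StickyCanary` class, its error-rate threshold, and the experiment name are hypothetical. A real canary analysis would compare canary metrics against the baseline rather than use a fixed threshold.

```python
import hashlib


def assign_group(device_id: str, experiment: str, sticky_percent: float) -> str:
    """Deterministically map a device to 'canary', 'baseline', or 'production'.

    The same device and experiment always hash to the same bucket,
    which is what makes the assignment sticky.
    """
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # bucket in 0..9999
    cutoff = int(sticky_percent * 100)      # percent converted to basis points
    if bucket >= cutoff:
        return "production"                 # device is not part of the experiment
    return "canary" if bucket % 2 == 0 else "baseline"


class StickyCanary:
    """Hypothetical experiment wrapper with a simple error-rate kill switch."""

    def __init__(self, experiment: str, sticky_percent: float = 1.0,
                 max_error_rate: float = 0.05, min_requests: int = 100):
        self.experiment = experiment
        self.sticky_percent = sticky_percent
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests
        self.active = True
        self.canary_requests = 0
        self.canary_errors = 0

    def route(self, device_id: str) -> str:
        # Once the experiment is killed, every device falls back to production.
        if not self.active:
            return "production"
        return assign_group(device_id, self.experiment, self.sticky_percent)

    def record(self, group: str, ok: bool) -> None:
        """Track canary outcomes and kill the experiment if errors spike."""
        if not self.active or group != "canary":
            return
        self.canary_requests += 1
        if not ok:
            self.canary_errors += 1
        enough_data = self.canary_requests >= self.min_requests
        if enough_data and self.canary_errors / self.canary_requests > self.max_error_rate:
            self.active = False


experiment = StickyCanary("playback-migration", sticky_percent=1.0)
first = experiment.route("device-123")
second = experiment.route("device-123")
print(first == second)  # the same device always lands in the same group
```

Because the assignment depends only on the device ID and the experiment name, every request in a multi-step flow such as playback hits the same version, and killing the experiment immediately sends all devices back to production.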
Shoutout to Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah for writing the original blog.
Stay tuned for future newsletters where we cover more about their migration process!