We have been very busy, and quieter than we would like to be. Talking to our community is something we greatly enjoy, because community input is what we shape our service around. We hope to soon incorporate newsletters and regular blog posts to keep our users informed of changes, both mundane and exciting!
In this post we are going to recap the past few weeks: a few updates, and some issues that users experienced. Most importantly, this post outlines how we addressed those issues and our plans moving forward. It will be long, so if you want something shorter, you can find a quick TL;DR at the end.
We noticed that quite often, new users didn’t get a chance to try the site because of heavy traffic. As a new user, being hit with a wall saying there is no space for you is just no fun. With this in mind, we built a queue system that gives priority to free users who haven't used the site in the past day over free users who have used it a lot on that same day. Subscribers still get the highest priority as always, and bypass this queue system entirely (big shout-out to our subscribers who make Tutturu possible)!
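For the curious, the priority logic can be sketched roughly like this. This is a simplified illustration in Python, not our actual code: all class names and fields (like `minutes_used_today`) are made up for the example.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

# Hypothetical user record; the field names are illustrative only.
@dataclass
class User:
    name: str
    is_subscriber: bool
    minutes_used_today: int

class AccessQueue:
    """Subscribers bypass the queue entirely; among free users, those who
    have used the site less today are admitted first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: preserves arrival order

    def request_access(self, user: User) -> str:
        if user.is_subscriber:
            return "admitted"  # highest priority: skip the queue entirely
        # Lower usage today = lower sort key = admitted sooner.
        heapq.heappush(self._heap, (user.minutes_used_today, next(self._counter), user))
        return "queued"

    def admit_next(self) -> Optional[User]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

So a free user who hasn't watched anything today gets in ahead of a free user who has been streaming for two hours, while a subscriber never waits at all.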
The system worked great, and we released it after quite a bit of testing. The queue itself worked flawlessly in our tests, but we didn’t anticipate the sheer number of issues the rest of the website's code would hit at scale. Tutturu had some less-than-ideal code from its early stages that was just waiting for the right moment to crack. Little did we know, each update was bringing us closer to that point. The moment we released the queue system, we passed it, and the site suffered full crashes every few minutes. Shortly after release, we rolled back the update and began to fix the issue.
Full Code Refactor
The only fix was to rewrite all of the old code, a task that took quite a while and put all other updates essentially on hold. Once everything was ready, after three weeks of testing, we finally decided to bite the bullet and release the update. We anticipated that some issues might arise, because the code had never been put under the stress of traffic from our main website. But the problems with our old code were not tied only to releasing new features; they were triggered by people accessing the website at all. With our ever-expanding growth, getting the update out became increasingly urgent: without it, we would eventually hit a point where our natural traffic alone could take the website down every few minutes.
Upon release, the new code had its fair share of issues. We addressed many the same day, but others took some investigation and planning to fix. We made an effort to provide nearly 24/7 support to any user experiencing issues, often fixing them manually at all hours of the day. Each complaint about something broken gave us more information to track the problem down and fix it for good. Because of this, we owe a lot of thanks to our community for helping us find the causes of so many problems, proving our issue reporting system really is an asset to our ability to address issues with the website.
So what is next?
It has been a bumpy few weeks, but now that the rewrite is complete, we should never need to do such a large-scale rewrite again, so future issues should never be as widespread. We learned many lessons that will shape our next important tasks.
One thing became clear in all of this: interrupting a viewing session is the absolute worst thing we could possibly do, and unplanned full crashes do that for everyone at once. As bad as that is, we also need to take the website down to apply new code. We originally used low-traffic moments in the early morning to send out update alerts, to interrupt as few people as possible, but emergency updates could ruin the experience for many more people than we anticipated.
Our next goal is to make changes that will allow us to update code on the website without interrupting any ongoing sessions. There may be some types of changes, particularly code related to the VMs, where this will not be possible, but overall the goal is to apply any updates we can without taking the entire site down.
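The general technique here is often called a "drain and swap" (or rolling) update: point new traffic at the new version, let sessions already running on the old version finish, and only then shut the old version down. A minimal, purely illustrative Python sketch of the idea follows; none of these names come from our codebase:

```python
class Worker:
    """Hypothetical app-server worker that tracks live viewing sessions,
    so a deploy can drain it instead of killing sessions mid-stream."""

    def __init__(self, version: str):
        self.version = version
        self.draining = False
        self.active_sessions = set()

    def start_session(self, session_id: str) -> bool:
        if self.draining:
            return False  # a draining worker accepts no new sessions
        self.active_sessions.add(session_id)
        return True

    def end_session(self, session_id: str) -> None:
        self.active_sessions.discard(session_id)

    def can_shut_down(self) -> bool:
        # Safe to stop only once draining AND every session has ended.
        return self.draining and not self.active_sessions

def rolling_update(old: Worker, new: Worker, route: dict) -> None:
    """Send all new sessions to the new worker; existing sessions keep
    running on the old worker until they end on their own."""
    old.draining = True
    route["target"] = new
```

Under this scheme, an update only interrupts viewers when the change touches something every session depends on (in our case, likely the VM-related code), which is exactly the exception mentioned above.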
TL;DR
- We introduced the queue system.
- The queue worked, but it revealed underlying issues with the website's code, causing it to crash repeatedly.
- We rolled back the queue update and began rewriting all of the website's code to address the crashes.
- After weeks of testing, we released the new code alongside the queue system.
- Three weeks of testing was clearly not enough; there were many site stability issues.
- We provided nearly 24/7 support, manually fixing problems for users, but the experience was not ideal for quite a few days.
- In spite of daily crashes we managed to quickly fix many issues with the new code, and have fully patched each issue we discovered at long last!
- Moving forward, we will no longer need such large, drastic rewrites, so large-scale issues like these should be a thing of the past!
- Our last major design flaw requires us to take the website down for any moderately sized update.
- Our next task is to change this so that only very specific types of updates require taking the site down. Minor updates will be applied without interrupting a viewing session.