What is the Weekly Reports feature?
The Weekly Reports feature in Insping sends users a weekly email summarizing their checks. With this feature, every Insping user with at least one active check receives an email summarizing the previous week’s checks.
What went wrong with this feature?
We designed and developed it, then deployed it prematurely by mistake. Yes, we deployed it accidentally. Because of this, all active Insping users received the report email with some incorrect data in it.
How it happened (The Big Issue because of a small old bug)
Actually, the story started a few months ago. We made a UI change to the ‘Manage’ section that introduced a bug. Because of this bug, admins and owners could not add users to organizations, specifically when they had more than one organization.
We failed to notice this issue because it is a fairly rare edge case. While we were developing the reports feature and running a quick test, we unfortunately ended up in exactly that edge case. We found the bug and fixed it instantly. But we did not release just the UI fix; we released the weekly reports as well. Yes, instead of deploying just the front-end application, we rolled out everything by running similar commands from the bash history.
What is the invalid data?
- We generated reports for 8 days instead of 7
- We showed the uptime values for the downtime as well, so both uptime and downtime showed the same values.
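Both problems are easy to make and easy to test for. The following is a minimal sketch, not the actual Insping code: the function names `last_week_window` and `uptime_downtime` are hypothetical, and we assume a report that covers the seven full days ending yesterday. It shows one way to compute the window so that it spans exactly 7 days, and to derive downtime from uptime instead of reusing the same value.

```python
from datetime import date, timedelta
from typing import Tuple


def last_week_window(today: date, days: int = 7) -> Tuple[date, date]:
    """Return the inclusive (start, end) dates of the last `days` full days.

    Assumption: the report ends yesterday. Because both ends are inclusive,
    the start is `days - 1` days before the end; subtracting `days` here
    would produce an 8-day window.
    """
    end = today - timedelta(days=1)
    start = end - timedelta(days=days - 1)
    return start, end


def uptime_downtime(up_seconds: float, total_seconds: float) -> Tuple[float, float]:
    """Return (uptime %, downtime %) for a monitored period."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    uptime = 100.0 * up_seconds / total_seconds
    # Downtime is the complement of uptime, never a copy of it.
    downtime = 100.0 - uptime
    return uptime, downtime
```

A small automated test that checks the number of days in the window, and that uptime and downtime add up to 100, would catch both mistakes before an email goes out.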
How did we correct it?
Actually, we didn’t. We had already sent the email to all the active users and we don’t know any magic spell to bring those emails back. So, we just made the fixes and kept the app ready. From next week onwards, your reports will be correct. After releasing the fix, we sent an apology email to all the users explaining the situation.
What we could have done to avoid this mistake
- Release Pipelining with CI CD
- Avoid on-demand releases
- Focus on one thing at a time
Release Pipelining with CI CD
This process turns code (a commit revision in a version control system like git) into a release. Every feature (i.e. every revision) should be put into the pipeline, and the pipeline should take care of the release. Typically, each release goes through multiple pipeline stages, and each stage performs certain sanity checks. For example, stage 1 runs the automated test cases, stage 2 places the release in an internal beta testing environment and waits for approval from QA members, stage 3 brings it to alpha testing, and finally stage 4 releases it to production.
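The staged flow described above can be sketched in a few lines. This is only an illustration under our own assumptions, not a real CI/CD tool: `run_pipeline` and the stage names are hypothetical, and each check is a plain function that returns True when the revision may move on.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that takes a revision id and
# returns True if the revision may proceed to the next stage.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(revision: str, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run the stages in order and stop at the first one that fails.

    Returns (released, names of the stages that passed).
    """
    passed: List[str] = []
    for name, check in stages:
        if not check(revision):
            # A failed stage blocks every later stage, including production.
            return False, passed
        passed.append(name)
    return True, passed


# Hypothetical stages mirroring the example in the text. The lambdas are
# placeholders for real test runs, QA approvals, and deploy steps.
example_stages: List[Stage] = [
    ("automated tests", lambda rev: True),
    ("internal beta + QA approval", lambda rev: False),  # QA has not approved
    ("alpha testing", lambda rev: True),
    ("production release", lambda rev: True),
]

released, passed = run_pipeline("abc123", example_stages)
```

In this example the revision stops at the QA stage and never reaches production, which is exactly the gate we were missing when we ran the deploy commands by hand.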
What is this CI and CD?
Continuous Integration (CI) and Continuous Delivery (CD) can be part of the release pipelining. With continuous integration, every branch of code (any source code change can be a feature under development) can be tested as soon as possible. Continuous delivery will give the ability to release the code seamlessly, typically by a merge operation of version control systems.
A bunch of ready-made solutions are available for CI and CD. Here are a few
Avoid on-demand releases
The whole mistake happened because we wanted to deploy the UI bug fix as soon as it was ready, without a QA or pipeline process. Quick on-demand releases always involve risks like this, so we should have avoided one here.
But then again, we should still be able to do on-demand releases when there is an urgent requirement. In cases like security bugs, we cannot wait for the pipeline process.
Focus on one thing at a time
Parallel processing always brings added errors. Even in the current advanced computing era, there are lots of issues related to concurrency, like race conditions and deadlocks. The main cause of the weekly reports error was that we deviated from our actual work and started on something else (the UI bug fix).
Our biggest learning has been that we should always try to focus on one prioritized thing. And we certainly promise to do so.