Fix it twice - and eat your own dogfood
How do you get to five nines reliability, meaning 99.999% uptime, or a little over five minutes of downtime a year?
It comes down to methodically fixing the root causes of systematic unreliability: identify the worst problems and solve those first. This is classic bottleneck theory.
We had a busy morning on Wednesday, November 15th. First, an anonymous macOS user hit a crash at 9:30 am. It turned out to be a member of our own staff, who wasn’t aware there was a problem.
The good news was that we knew about it immediately because we got a crash notification (many of them).
The cause was a boundary problem in our code that caches the list of workspaces: their Bitbucket account had no workspaces set up. The user had access to zero workspaces because they were a sales account manager, which explains this unusual case. (We’ve been encouraging our salespeople to install and learn IguanaX.)
We managed to patch the issue by 11 am. We could have patched it within 30 minutes, but the problem was compounded by another issue.
Our Windows signing certificate had just expired, which made patching quickly more difficult: it took us an hour and a half when it should have taken about 30 minutes. We’re modifying our procedures so that we schedule getting new signing certificates three months ahead of time.
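As a rough illustration of that three-month rule, a scheduled check could flag a certificate once it enters the renewal window. This is only a sketch: `needsRenewal` and its 90-day default are hypothetical names and values chosen for the example, not part of our actual tooling, and the expiry time would have to be read from the certificate by some other means.

```cpp
#include <chrono>

// Hypothetical helper for illustration only.
// Returns true once `now` is within `lead` of the certificate's expiry,
// or the certificate has already expired.
bool needsRenewal(std::chrono::system_clock::time_point now,
                  std::chrono::system_clock::time_point expiry,
                  std::chrono::hours lead = std::chrono::hours(24 * 90)) {
    // A negative difference (already expired) also satisfies the check.
    return expiry - now <= lead;
}
```

Run daily, a check like this would raise the alarm roughly 90 days before expiry instead of on the morning of an urgent patch.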
We are also looking at the core problem: our C++ data structure for reading JSON throws an exception if one attempts to cast, say, a ‘null’ to an array or map and then iterate over it. So we’re doing a code review to look for instances of this problem and considering changing the behaviour.
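To make the failure mode concrete, here is a minimal sketch of the kind of behaviour change under consideration. `JsonValue`, `asArray`, `asArrayOrEmpty` and `countWorkspaces` are hypothetical stand-ins written for this post, not our real JSON classes: the strict accessor throws on a null, while the lenient one treats null as an empty array so the loop simply runs zero times.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, heavily simplified JSON value (null, string or array only).
struct JsonValue {
    enum class Kind { Null, String, Array };
    Kind kind = Kind::Null;
    std::string text;              // used when kind == String
    std::vector<JsonValue> items;  // used when kind == Array
};

// Strict accessor: throws if the value is not an array,
// which is the behaviour that caused the crash.
const std::vector<JsonValue>& asArray(const JsonValue& v) {
    if (v.kind != JsonValue::Kind::Array) {
        throw std::runtime_error("JSON value is not an array");
    }
    return v.items;
}

// Lenient accessor: a null is treated as an empty array.
// Any other non-array value still throws.
const std::vector<JsonValue>& asArrayOrEmpty(const JsonValue& v) {
    static const std::vector<JsonValue> kEmpty;
    if (v.kind == JsonValue::Kind::Null) {
        return kEmpty;
    }
    return asArray(v);
}

// Example caller: counting workspaces no longer throws for an
// account whose workspace list comes back as null.
std::size_t countWorkspaces(const JsonValue& workspaces) {
    std::size_t count = 0;
    for (const JsonValue& workspace : asArrayOrEmpty(workspaces)) {
        (void)workspace;  // only the number of entries matters here
        ++count;
    }
    return count;
}
```

The trade-off is that a lenient accessor can hide genuinely malformed data, which is why it only relaxes the null case here and still throws for anything else.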
And we will make it difficult not to put credentials into Iguana, so that we know who is experiencing an issue and can communicate with them faster. We’ll also encourage users to put their contact details into Iguana so we can reach them more quickly.
So we are just doing what we can, methodically, to look at root problems and make sure they do not happen again. The good news is that no customers were affected and we are a few more steps along the road to where we need to get to with IguanaX.