Sunday, October 13, 2013

Deployment woes


We are still having deployment woes when we push our applications onto live servers. What really irks me is that these errors are very amateurish and keep giving our whole team a bad image. This is a waste, as clients keep focusing on what is wrong instead of what was done right. It also puts them into a fault-finding mode, which does not help the adoption of new functionality.

What do I mean by "silly mistakes"? I mean things like major services not being enabled after a server upgrade or deployment. Some errors I have found are:
  • Misconfigured WSGI files, usually something to do with paths or virtualenvs.
  • Core services that errored out, usually the web server or the cron server. Our application depends largely on scheduled jobs, so cron jobs erroring out and not running can prove disastrous.
  • The wrong people seeing the wrong things on the server, or silly test strings appearing on the live system.

These kinds of errors are very hard to recover from, cover up, or explain, because they look amateurish, as though we don't know what we are doing. Depending on our own people to eyeball the application after an upgrade is not workable for simple reasons: no one could ever take leave, and people tend to make mistakes. The other challenge is, of course, finding the midpoint between resources that are stretched to the breaking point and putting in some form of testing as a stopgap for these silly mistakes. I still feel that asking developers to check the work they have been doing day in, day out for most of the week or month is not really a good idea. I am constantly surprised by how our clients use our application, usually in ways I had never considered.

I saw that we already had really good notification checking and tests: good Selenium tests and good Nagios scripts. Our notification scripts are actually customized scripts written for Nagios. I thought about incorporating them into our deployment script, so that the deployment script would call the Nagios scripts to check our core services. That would give us a basic check script for our application and a head start with minimal effort. If we ran them after an upgrade or deployment of our sites, they could warn us if we had fudged up and certain services were indeed dead.
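To make the idea concrete, here is a minimal sketch of how a deployment script might call Nagios-style checks once the upgrade is done. It relies only on the standard Nagios plugin convention that a check exits with 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The script paths and the names `run_check`, `post_deploy_checks` and `EXAMPLE_CHECKS` are made up for illustration and are not our actual scripts.

```python
import subprocess

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical check scripts; substitute the real customized Nagios scripts.
EXAMPLE_CHECKS = {
    "web server": ["/opt/checks/check_webserver.sh"],
    "cron": ["/opt/checks/check_cron.sh"],
}


def run_check(command, timeout=30):
    """Run one Nagios-style check; return (state, first line of its output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot start or hangs is reported as UNKNOWN.
        return "UNKNOWN", str(exc)
    state = NAGIOS_STATES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    return state, (lines[0] if lines else "")


def post_deploy_checks(checks):
    """Run every check and return the names of those that are not OK."""
    failed = []
    for name, command in checks.items():
        state, message = run_check(command)
        print(f"{name}: {state} {message}")
        if state != "OK":
            failed.append(name)
    return failed
```

The last step of the deployment script could then call `post_deploy_checks(EXAMPLE_CHECKS)` and raise an alarm, or refuse to finish, when the returned list is not empty.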

On the Selenium side, we could extract some basic tests from our existing scripts to run after an upgrade or deployment is done. These would be just a subset of the full test suite and should only exercise the basic functionality of each component that makes up the whole application.
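One way to carve out such a subset, sketched here with Python's standard `unittest` module, is to mark the basic tests and filter the full suite down to the marked ones. The `smoke` decorator, the helper names and the `SiteTests` class are all hypothetical, and the placeholder test bodies stand in for real Selenium steps.

```python
import unittest


def smoke(test_func):
    """Mark a test method as part of the post-deployment subset."""
    test_func.smoke = True
    return test_func


def iter_tests(suite):
    """Yield individual test cases from a possibly nested suite."""
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from iter_tests(item)
        else:
            yield item


def smoke_subset(suite):
    """Return a new suite holding only the tests marked with @smoke."""
    subset = unittest.TestSuite()
    for test in iter_tests(suite):
        # test.id() looks like "module.Class.method"; keep the method name.
        method_name = test.id().rsplit(".", 1)[-1]
        method = getattr(test, method_name)
        if getattr(method, "smoke", False):
            subset.addTest(test)
    return subset


class SiteTests(unittest.TestCase):
    # Placeholder bodies; real tests would drive Selenium here.
    @smoke
    def test_login_page_loads(self):
        self.assertTrue(True)

    def test_full_report_export(self):
        self.assertTrue(True)
```

After a deployment, only the marked tests would run, for example with `unittest.TextTestRunner().run(smoke_subset(unittest.defaultTestLoader.loadTestsFromTestCase(SiteTests)))`, while the full suite stays available for the test environment.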

Ideally, the expected behavior for a given upgrade or deployment should come from the project team or the stakeholders in the form of BDD scripts. This flies in the face of what I have been told about running tests on live systems, but having your clients doubt your system as buggy because you failed to check your application at the most basic level is just not fun. It does not foster good faith in your system, nor does it help job security. I am still at the drawing board on this, but it would work something like this: the post-installation step would require us to run either a Selenium test or some other black-box test, based on BDD scripts that describe the core or important functionality that should be present in that particular version of the deployment or installation.

Finally, and this is in no way a conclusion, we have decided to go the route of running some pre-flight tests before we release a deployment into the wild. These pre-flight tests should also include BDD scripts passed down to us by the stakeholders in the project. The other thing I found is that, as much as we want convenience, access to live servers should be locked down so that as few people as possible have administrator rights. As much as it pains to tune ACLs for every specific action, it pays off in the end when you want to track who did what to screw up another live server deployment.