As usual I am just thinking aloud in this post about ways to solve a problem we are having. I haven't really got the answer yet; I'm just sharing in the hope that some other smart soul out there might offer me their 2 cents. Our apps currently have pretty good test coverage, or so I feel. I would say our coverage is about 95% of our application functionality. We have unit tests, functional tests and even selenium tests (currently trying out
splinter to replace selenium). We run them all against our test fixtures and use Jenkins to visualize the results and notify us whenever the tests fail. While I think there is still room for improvement, on the whole I am pretty proud of our test architecture.
We freeze our code and only branch it once all of its tests hit blue on Jenkins. That's all fine. Hunky dory. That's when the iceberg reared its ugly head in front of our sailing Titanic. The problem is that our code failed twice while we were trying to deploy the app (a wsgi app), and we only detected it through manual means, eyeballing it. Once it was because our ssl certs failed because of this particular bug:
http://code.google.com/p/httplib2/issues/detail?id=202#c2. This bug occurred with httplib2 when using wildcard certs, which we don't use in our testing. The whole wsgi app came crumbling down and none of our tests were any the wiser. I would be okay if these were obscure bugs caused by some strange use case we had never considered, but these are big-pie-on-your-face, embarrassing show stoppers. What I want is a way to catch show stopper bugs before deploying our code on production servers.
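For what it's worth, even a tiny post-deploy check along these lines, pointed at the host that actually serves the wildcard cert, would have surfaced that httplib2 failure for us. This is only a sketch: the URL is a placeholder, and the exact exception httplib2 raises for SSL trouble varies by version, so I just catch broadly and report.

import sys
import httplib2

STAGING_URL = "https://staging.example.com/"  # placeholder: the host with the wildcard cert

def check_ssl_endpoint(url=STAGING_URL):
    # Use the same client library the app uses, with certificate validation
    # left on (newer httplib2 validates by default), so a cert problem fails
    # here instead of in production.
    http = httplib2.Http(timeout=10)
    try:
        resp, _content = http.request(url, "GET")
    except Exception as exc:
        # httplib2 surfaces SSL failures as different exception types
        # depending on the version, so catch broadly and report.
        print("SSL/connection check failed for %s: %r" % (url, exc))
        return False
    if resp.status >= 500:
        print("%s returned HTTP %s" % (url, resp.status))
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if check_ssl_endpoint() else 1)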
What I am talking about here is almost like a full dress rehearsal before deployment. Setting up a dedicated machine for testing that mirrors the live data seems the most obvious way, but in some ways it is too much of a resource hog in my opinion. From what I gather from doing a bit of quick 2-minute research, I am not alone:
This seems like an idea too:
http://xunitpatterns.com/Test%20Logic%20in%20Production.html. I think overall the main thing to move towards is to streamline the test machines to be as close to the live machines as possible, to lessen the 'magic' we do on test servers to make the tests work, and to determine which environment variables are crucial to the application being deployed well and replicate those in our tests.
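As a rough sketch of what I mean by pinning down the crucial environment variables (the variable names below are made up; ours would obviously differ): keep an explicit list of the settings the app actually depends on, dump them on both the test box and the live box, and diff the two dumps before trusting a green build.

import json
import os
import sys

CRUCIAL_VARS = [
    "PYTHONPATH",
    "LD_LIBRARY_PATH",
    "APP_SSL_CERT_PATH",    # hypothetical
    "APP_SETTINGS_MODULE",  # hypothetical
]

def snapshot(path):
    # Write the crucial environment variables on this machine to a JSON file.
    data = {name: os.environ.get(name) for name in CRUCIAL_VARS}
    with open(path, "w") as fh:
        json.dump(data, fh, indent=2, sort_keys=True)

def diff(test_path, live_path):
    # Return the variables that differ between the two snapshots.
    with open(test_path) as a, open(live_path) as b:
        test_env, live_env = json.load(a), json.load(b)
    return {k: (test_env.get(k), live_env.get(k))
            for k in CRUCIAL_VARS if test_env.get(k) != live_env.get(k)}

if __name__ == "__main__":
    if len(sys.argv) == 2:
        # e.g. python env_check.py test_env.json   (run once on each machine)
        snapshot(sys.argv[1])
    elif len(sys.argv) == 3:
        # e.g. python env_check.py test_env.json live_env.json
        mismatches = diff(sys.argv[1], sys.argv[2])
        for name, (test_val, live_val) in sorted(mismatches.items()):
            print("%s: test=%r live=%r" % (name, test_val, live_val))
        sys.exit(1 if mismatches else 0)

Run it once on each machine to produce a snapshot, then once more with both files to see exactly where the test box and the live box disagree.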
What I don't like about this approach is that for n live sites with different environments you need n replicated live environments. Another idea that I am playing around with is to create a suite of 'critical tests' that is run after an upgrade just to ensure the most critical services are running fine. Ideally these tests should be short, but they should give you a pretty good idea whether the main services will still be running when you go home after an upgrade is performed on the server. The tests should just be a subset of the main battery of tests and should complete in a fraction of the time taken to run the full suite, just like one of those old Windows installers that run a 'self diagnosis' after installation.
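Something along these lines is what I have in mind. It assumes a hypothetical live host and a /health endpoint that we would still have to add; the point is that it runs against the freshly upgraded machine rather than against fixtures, and it finishes in seconds.

import unittest
import httplib2

BASE_URL = "https://www.example.com"  # placeholder for the real live host

class CriticalSmokeTests(unittest.TestCase):

    def setUp(self):
        # Same client the app itself uses, with cert validation left on,
        # so problems like the wildcard-cert bug would show up here too.
        self.http = httplib2.Http(timeout=10)

    def test_front_page_responds(self):
        # The wsgi app is up and serving something at all.
        resp, _ = self.http.request(BASE_URL + "/", "GET")
        self.assertEqual(resp.status, 200)

    def test_login_page_renders(self):
        # A slightly deeper path that exercises templates and sessions.
        resp, _ = self.http.request(BASE_URL + "/login", "GET")
        self.assertEqual(resp.status, 200)

    def test_health_endpoint_reports_ok(self):
        # Assumes a /health endpoint that pings the database and caches;
        # we would need to add one if it doesn't already exist.
        resp, content = self.http.request(BASE_URL + "/health", "GET")
        self.assertEqual(resp.status, 200)
        self.assertIn("ok", content.lower())

if __name__ == "__main__":
    unittest.main()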