Most of the desktop computers in the UK's Department for Work and Pensions were paralysed for four days from Monday, when a failed upgrade took them offline. The outage, covering 75-80 per cent of the DWP's 80,000 PCs, is one of the largest in the UK Government's not entirely impressive IT history.

The news coming out on this seems to suggest that they were attempting to upgrade a bunch of test machines, and accidentally pushed their test image out to the entire production environment. Ouch.

Though one can only speculate about how this happened, I suspect that either a lack of operating procedures, or a failure to follow them, may be to blame.

I am not at all surprised by these kinds of news stories. I have seen many, many examples where work being done on the backend by techs, work that could severely impact the production environment, was not adequately isolated, either by technology or by procedures. Often the live production environment is used to test and experiment, either because there is no test environment, or because the test environment does not actually represent the infrastructure you want to test on.

In this case, even with a test environment, the distribution system was not isolated from the production environment. It doesn’t necessarily have to be, if procedures are in place and strictly adhered to. If there are no procedures or safeguards, which is usually the case, it's up to the administrator himself to check what he is doing, and to remain alert and aware of what the consequences are. Are you willing to take that risk in a 60k-machine environment?
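
To make the idea of a technical safeguard a bit more concrete, here is a minimal sketch in Python of a pre-flight check that a distribution tool could run before pushing an image. Everything in it is hypothetical: the names (Image, check_rollout, DeploymentBlocked), the approval flag and the 50-machine threshold are my own assumptions for illustration, and say nothing about how the DWP's tooling actually works.

```python
from dataclasses import dataclass


class DeploymentBlocked(Exception):
    """Raised when a rollout fails a safety check (hypothetical)."""


@dataclass(frozen=True)
class Image:
    name: str
    approved_for_production: bool  # assumed to be set by a change-approval step


def check_rollout(image, targets, test_group, max_unconfirmed=50, confirmed=False):
    """Return the target list if the rollout looks safe, otherwise raise.

    Rule 1: an unapproved image may only go to machines in the test group.
    Rule 2: any rollout larger than max_unconfirmed needs explicit confirmation.
    """
    outside_test = [t for t in targets if t not in test_group]
    if outside_test and not image.approved_for_production:
        raise DeploymentBlocked(
            f"{image.name} is not approved for production, but "
            f"{len(outside_test)} target(s) are outside the test group"
        )
    if len(targets) > max_unconfirmed and not confirmed:
        raise DeploymentBlocked(
            f"rollout to {len(targets)} machines requires explicit confirmation"
        )
    return list(targets)
```

With a check like this in the path, pushing a test image to a target list that includes production machines fails loudly instead of silently succeeding, and even an approved image cannot reach thousands of desktops without someone deliberately confirming it. It does not replace procedures, but it means one tired administrator is no longer the only safeguard.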



Posted on Monday, November 29, 2004 10:01 AM

