2008-07-02: Reason For Outage


In relation to the data centre power outage on Sunday morning, which affected all of our customers, we now have a full explanation from BlueSquare:

This is a Reason for Outage Report with details regarding the power outage in BlueSquare 2&3.

Power Loss to equipment hosted in BS2 & 3

Power was restored within 25 minutes from initial power off. Fault occurred on 29/6/2008 at 04:40.

At approximately 4:40am on Sunday morning, BlueSquare 2 & 3 lost mains power due to an under voltage condition from the National Grid, which set off our low voltage alarms. This caused the automatic systems to start the generator, which ran as expected. The system is then designed to open the Air Circuit Breaker (ACB) on the mains feed and close the ACB on the generator, thus supplying the UPS with generator power. This worked as expected and the generator took the load. Approximately 2 minutes later, the low voltage condition ended and mains power was restored. The generator was shut down and the ACBs operated to switch back to mains, all of which worked as planned.

Shortly after this there was a further under voltage condition, which restarted the above sequence: the generator started (successfully), the mains ACB opened (successfully) and the signal was sent to close the generator ACB. The signal was sent, but the generator ACB failed to close, meaning that the generator could not supply the UPS with power during the brownout. The UPS worked as expected and took the load. During this time the mains came back to the correct voltage, which would normally mean the system falls back to the mains power feed. However, the ACBs have a physical and electrical interlocking system, which prevents both ACBs from being closed at the same time, thus preventing the possibility of both mains and generator feeding the load, which would result in a severe failure.

Because the signals were sent to the generator ACB to close, but it never did, the interlocking systems got into a state of deadlock in which both ACBs were stuck in an 'open' position. This left the UPS with no feed, resulting in the batteries draining down after 10 minutes and the system losing the critical load.
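
To make the deadlock described above easier to follow, here is a minimal Python sketch of two interlocked breakers. It is an illustration only and not the actual control logic at BlueSquare: the class names, the "faulty" flag and the "close_pending" field are all hypothetical, and the model simply assumes that the interlock keeps treating an unanswered close command as outstanding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Breaker:
    name: str
    closed: bool
    faulty: bool = False  # hypothetical: a faulty breaker ignores close commands

    def open(self) -> None:
        self.closed = False

    def close(self) -> bool:
        # A healthy breaker closes; a faulty one stays open.
        if not self.faulty:
            self.closed = True
        return self.closed


class TransferSwitch:
    """Simplified mains/generator changeover with an interlock (assumed model)."""

    def __init__(self, generator_acb_faulty: bool = False) -> None:
        self.mains = Breaker("mains", closed=True)
        self.generator = Breaker("generator", closed=False,
                                 faulty=generator_acb_faulty)
        # Breaker with an outstanding close command, as seen by the interlock.
        self.close_pending: Optional[Breaker] = None

    def _interlocked_close(self, target: Breaker, other: Breaker) -> bool:
        # The interlock refuses to close one breaker while the other is
        # closed, or while the other still has a close command outstanding.
        if other.closed or self.close_pending is other:
            return False
        self.close_pending = target
        if target.close():
            self.close_pending = None
            return True
        return False  # the command stays outstanding and the interlock stays held

    def mains_under_voltage(self) -> bool:
        self.mains.open()
        return self._interlocked_close(self.generator, self.mains)

    def mains_restored(self) -> bool:
        self.generator.open()
        return self._interlocked_close(self.mains, self.generator)

    def ups_feed(self) -> Optional[str]:
        if self.mains.closed:
            return "mains"
        if self.generator.closed:
            return "generator"
        return None  # neither feed available: the UPS runs on battery


if __name__ == "__main__":
    switch = TransferSwitch(generator_acb_faulty=True)
    switch.mains_under_voltage()
    switch.mains_restored()
    print(switch.ups_feed())
```

With a healthy generator breaker, the same two calls move the feed to the generator and then back to mains. With the faulty breaker, both breakers end up open and the feed is None, which in this simplified model corresponds to the UPS running on battery until it is exhausted.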

Work started on Sunday and continued yesterday to look at the electrical circuitry that controls the electrical side of the interlocks, as well as the mains phase failure relay, which detects a mains failure and low voltage. The relay tested OK, but it was decided to replace this part with a spare to rule out any issues. After this was completed, we conducted a mains failure test, which failed in the same way as on Sunday morning. We restored mains manually at this point. Work then commenced to look at a possible failure of the manual interlock system, which could cause the same issue. Work continued to check and replace certain parts of this system before we re-ran a mains failure test. This test passed and the system worked as expected. We then decided to re-run the test to ensure the issue had been fixed. This further generator test was completed, but it failed with the same result as the first failure. Mains was again restored manually.

As all electrical circuitry was testing and operating OK, and all manual interlocks were working OK, our board vendor then started to look at a possible fault with the generator ACB. Firing pins in the ACB were tested and passed, which led the board manufacturer to suspect an intermittent issue with the generator ACB. This ACB is manufactured by APC/Schneider Electric/Merlin Gerin (now all the same company). As this ACB is under warranty, our board vendor did not want to strip the ACB and look for issues, preferring that Merlin Gerin engineers look at this component directly.

Merlin Gerin were contacted and provided telephone support to the board vendor, but this was unsuccessful. Merlin Gerin then agreed that an emergency support engineer needed to look at the unit in situ, with the hope of swapping the failed component and re-testing.

The Merlin Gerin engineers arrived on site this morning and started testing the ACB, and after two mains failure tests a possible fault was identified with the breaker on the manual interlock system. The suspect interlock component was removed, and a further generator test was performed which proved successful. After replacing this interlock component, two more mains failure tests were performed, both of which proved successful. Merlin Gerin, C&N Controls and BlueSquare Data engineers are now happy that this fault has been fully resolved and tested to satisfactory standards, and that in the event of a real mains failure, or the recurrence of an under voltage situation, the generator, ACBs and control equipment will all work as expected.

This fault has now been resolved and full N+1 redundancy has been restored to BlueSquare 2&3 customers.

Please note that all the time stamps given above are in GMT unless otherwise stated. We sincerely regret the inconvenience this service outage may have caused you.