2010-02-28: End of month report

We are closing down - please check your email for details.

99.804% Uptime

This month we had 100% uptime for power.
We had network interruptions on the 1st lasting no more than 71 minutes, and 4-minute interruptions on the 2nd and on the 17th.
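
As a rough check on the headline figure, the sketch below recomputes the uptime from the interruptions listed above. It assumes the worst case of a full 71 minutes on the 1st and 4 minutes on each of the 2nd and 17th, measured against every minute of February 2010. The function name uptime_percent is illustrative only and is not part of any tooling we run.

```python
from calendar import monthrange


def uptime_percent(year, month, outage_minutes):
    """Return uptime for a calendar month as a percentage.

    outage_minutes is a list of outage durations in minutes.
    """
    days_in_month = monthrange(year, month)[1]   # 28 for February 2010
    total_minutes = days_in_month * 24 * 60      # minutes in the month
    downtime = sum(outage_minutes)
    return 100.0 * (total_minutes - downtime) / total_minutes


# Assumed worst case: 71 min on the 1st, 4 min on the 2nd, 4 min on the 17th.
february_2010 = uptime_percent(2010, 2, [71, 4, 4])
print(f"{february_2010:.3f}% uptime")
```

Under those assumptions, 79 minutes of downtime out of 40,320 minutes in the month is consistent with the 99.804% reported.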

Refunds

As per our SLA, we will automatically issue a 10% refund to all affected customers. Refunds will be processed in the next few days.

Reason for Outage

The data centre's RFO is as follows:

The outage experienced on 1st February 2010 at approximately 16.35 was the result of excessive flapping (where a BGP session with a carrier goes from an "Up" state to a "Down" state and back again in very quick succession) with one of our upstream carriers (Cable & Wireless), which caused a knock-on effect for the routes available over other carriers. Cable & Wireless also continued to announce our IP prefixes after the BGP session had been shut down due to the flapping, which caused some clients further routing issues in reaching their IP ranges hosted with 4D.

The flapping BGP session was shut down to allow the IP routes to settle on the available upstream carriers. At the same time, we logged a ticket with Cable & Wireless about the BGP session flapping. The Cable & Wireless engineers advised us that they were not aware of any problems on their network, so our engineers proceeded to bring the BGP session to Cable & Wireless back online. As soon as the session was brought back online, we saw further BGP flapping, which caused the previous issues to recur. The session was immediately shut down again and has been left shut down until it can be tested separately, to ensure there are no further BGP issues with Cable & Wireless before it is brought back online.

Changes Implemented/Being Implemented

  • Further monitoring will be implemented on BGP peers and BGP packets in order to help detect flapping BGP sessions earlier.
  • We are investigating ways of improving the BGP and routing resilience of the network by off-loading each BGP peer onto a separate router. This would give much more granular control, as well as the ability to completely remove the router for a given carrier from service should it start affecting other upstream carriers and network availability.
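
To illustrate the kind of monitoring described in the first point, here is a minimal sketch of flap detection: it counts Up/Down state changes for one BGP session inside a sliding time window and flags the session once the count reaches a threshold. This is not the data centre's actual monitoring. The class name FlapDetector, the threshold of 4 transitions, and the 60-second window are all assumptions chosen for the example, and real state samples would have to come from the routers themselves.

```python
from collections import deque


class FlapDetector:
    """Flag a session as flapping when it changes state too often.

    Hypothetical helper: thresholds are illustrative, not recommended values.
    """

    def __init__(self, max_transitions=4, window_seconds=60.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._last_state = None
        self._transitions = deque()  # timestamps of recent state changes

    def observe(self, timestamp, state):
        """Record one state sample; return True if the session is flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = state
        # Discard state changes that have fallen out of the sliding window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


# Made-up samples: (seconds since start, session state).
samples = [(0, "Up"), (5, "Down"), (8, "Up"), (12, "Down"), (15, "Up")]
detector = FlapDetector(max_transitions=4, window_seconds=60.0)
alerts = [detector.observe(t, s) for t, s in samples]
print(alerts)
```

With these made-up samples the fourth state change arrives 15 seconds in, so only the last observation is flagged. A check like this could prompt an engineer to shut down the affected session sooner, before the instability spreads to routes over other carriers.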