Did a Hitachi Data Systems USP-V array controller failure cause the Barclays ATM outage yesterday?
Yesterday, to its great embarrassment, Barclays' ATM network in the south of England crashed at 1pm, along with many of its online banking facilities. Service was not restored until 4.30pm or later, and thousands of customers were left unable to withdraw cash or manage their accounts online.
Barclays said it was due to a hardware failure at its data centre in Gloucester, which serves its ATM network south of the Wash. Various reports on the BBC, The Sun, The Mail and elsewhere said that a hardware component of a drive array had failed and that engineers were replacing cards.
What drive array was this? Was it one that stored data relevant to cash machine operations and online banking? And, given that the Gloucester data centre has a history of computing system failures (see here, here, and here), why wasn't there an adequate fallback mechanism in place?
We know that, in 2008, Barclays ordered a large, high-end USP-V storage array from Hitachi Data Systems, as part of a 4-year storage-on-demand contract for its Gloucester data centre. It was to provide storage for mainframe and Unix systems. The capacity would rise to 1PB and would start coming online in February this year. There was a separate mid-range AMS storage array supplied by HDS which provided file-based storage for Windows servers through a NetApp NAS head.
Apparently HDS had a similar USP-V contract in a separate part of the data centre.
Under the new contract there was a penalty clause for downtime, with the penalty increasing the longer the downtime lasted.
Some accounts of the Gloucester data centre's history of ATM crashes show that the mainframe system is involved in ATM operations. Since the USP-V provides storage for the mainframe, this suggests that it could be the failed drive array in yesterday's outage.
This was confirmed by a source at another IT supplier who is familiar with the situation, and who also said that HP/EDS has the maintenance contract for the affected system.
HDS recently announced failover clustering facilities for the USP-V. If a USP-V controller in a cluster fails then operations are automatically picked up by a second USP-V controller. Without such a High Availability Manager arrangement, a failed USP-V controller can cause the storage array behind it to be inaccessible until the controller is repaired.
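The difference between the two arrangements can be illustrated with a toy model. This is only a sketch of the general failover idea described above, not HDS's implementation: the `Controller` and `StorageArray` classes and the controller names are hypothetical, and a real array involves far more than a health flag.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Controller:
    """Hypothetical stand-in for an array controller."""
    name: str
    healthy: bool = True


@dataclass
class StorageArray:
    """Toy array reachable only through its controllers."""
    controllers: List[Controller] = field(default_factory=list)

    def serving_controller(self) -> Optional[Controller]:
        # Pick the first healthy controller; in a cluster this is
        # where the second controller takes over from a failed one.
        for controller in self.controllers:
            if controller.healthy:
                return controller
        return None

    def read(self, block: int) -> str:
        controller = self.serving_controller()
        if controller is None:
            # No surviving controller: the storage behind it is unreachable.
            raise RuntimeError("array inaccessible: no healthy controller")
        return f"block {block} served by {controller.name}"


# Clustered pair: the second controller picks up after the first fails.
clustered = StorageArray([Controller("controller-a"), Controller("controller-b")])
clustered.controllers[0].healthy = False
print(clustered.read(7))

# Single controller: the same failure makes the array inaccessible.
single = StorageArray([Controller("controller-a")])
single.controllers[0].healthy = False
try:
    single.read(7)
except RuntimeError as err:
    print(err)
```

In the clustered case the read still succeeds through the surviving controller; in the single-controller case it fails until the controller is marked healthy again, which corresponds to the repair window described above.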
Bastiaan van Amstel, the senior EMEA PR manager for HDS, said, regarding the outage: "A lot of due diligence is happening at the moment and, before it is completed nothing can be said." He added: "Many vendors are involved in the IT at Barclays." ®