By Chris Mellor, 16th January 2012 09:01

Flash drive meltdown fingered in Swedish IT blackout

Tieto's EMC VNX5700 array sparked 5-day disarray - new claim

Tieto's five-day outage disaster started with multiple failures of its EMC VNX5700 array's FAST Cache, according to a Finnish source close to the matter.

Tieto is a major IT services organisation across Scandinavia and the Nordic region – although it also provides services globally – and pulls in net sales of SEK17bn (£1.59bn). Its large customer base in Sweden meant that its five-day outage in November caused chaos for IT services across that country. The stoppage was caused by failures in an EMC storage array and compounded by an inadequate disaster recovery plan involving Networker tape backup files which could not be read. The circumstances are not clear, and seemed to involve a VNX array being upgraded to an NS480 (Celerra) system for flash, which is logical nonsense.

El Reg has been sent a Tieto slide deck (PDF) describing why the service provider migrated from its Celerra NS480 to a VNX5700 and the resulting performance improvements: namely lower latency and more IOPS. This deck is in Swedish but Google Translate gets around that little problem.

Based on the translated slide deck text, the story goes like this: in the 2010/2011 period, with an EMC Celerra NS480 array, Tieto saw its storage challenges as performance, response time, scalability and capacity. So it migrated from RAID (4 + 1) groups to Thick Pools composed of 60 disks and began to segment data types into Fibre Channel and NAS. The next step was to install EMC's FAST Cache with four 200GB SSDs and the cache license, which was beneficial as response times were more than halved to less than 20ms. However, the NS480 CPUs were maxed out.

Tieto upgraded to a VNX5700, but retained the 4 x 200GB SSD capacity, the FAST Cache license and the 60-disk Thick Pool, although the disks changed from 450GB FC to 600GB SAS ones. 14 x 1.04GB chunks were created in each pool and only FC block access was allowed. The outcome was a boost in IOPS and a further reduction in latency, as shown in the chart.

Chart showing IOPS increase and latency decrease with move from NS480 to FAST Cache and then VNX5700

So here we have the basic VNX5700 array setup in which the hardware failures that led to the five-day debacle took place. EMC won't comment on any details, having referred us to the Tieto statement seen in our article yesterday. Our source said, for what it's worth: "What basically happened (in my understanding from Twitter rumours) is that Tieto had multiple SSD failures on [its] VNX5700 array Fast Cache, this resulted in data loss."

What needs to be stressed is that Tieto's DR processes were dreadfully inadequate and obviously untested for the eventuality of such a failure. Lawsuits over data loss and business interruptions at Tieto's affected customers are bound to follow. ®
