Server vendors make a lot of noise about how reliable their systems are, but how do they really stack up?
It's hard to say. Getting qualitative information out of vendors is easy enough - they all seem to have the most reliable machines ever built - but what about some objective quantitative information that puts these claims to the test? This kind of data is hard to come by, but it does exist.
Laura DiDio, a server analyst who was at Yankee Group until she left to start up her own gig over at Information Technology Intelligence Corp, used to do surveys of CIOs around the globe asking them about the amount of downtime their various server platforms experience in a year. DiDio is still doing that work today and is happy to give El Reg some insight into what she learned from customers in ITIC's server hardware and operating system reliability survey.
The study that ITIC has put together is based on a survey of more than 400 C-level executives at companies in a representative distribution of industries and platforms located in 20 countries worldwide. The study looked not only at how many different kinds of outages were reported on the platforms running at these sites, but also at how long the outages lasted and the experience level of the system administrators at the sites. Not surprisingly, the sites with the most sophisticated and rugged platforms and the most seasoned system administrators have the least amount of downtime.
The ITIC server reliability study puts server outages (whether caused by a hardware or a software issue) in one of three buckets. Tier 1 outages are the "stupid stuff," says DiDio, such as someone accidentally powering down a box, and are quickly fixed. Tier 2 outages are trickier and result in system downtime of between 30 minutes and four hours. A crashed application, bollixed-up permissions, or a patch gone awry are typical causes of Tier 2 outages.
These usually require more than one system administrator to figure out and often require at least one administrator to be on site to physically deal with the box. Tier 3 outages are the worst, and most rare among enterprise-class servers. These outages can span more than one box in an n-tier application and database setup, and they take more than four hours to resolve. They also can result in lost data and usually cause irritation to end users who can't get into their applications.
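The three buckets can be roughly expressed as a classification by outage duration. The Python sketch below is illustrative only: the function name `classify_outage` is made up here, and the 30-minute ceiling for Tier 1 is an assumption inferred from where Tier 2 begins, since the Tier 1 definition above says only that such outages are quickly fixed. ITIC's tiers also reflect things like how many administrators are needed and whether data is lost, which duration alone does not capture.

```python
def classify_outage(minutes: float) -> int:
    """Return the tier (1, 2, or 3) for an outage lasting `minutes`.

    Thresholds follow the article's description; the Tier 1 ceiling
    of 30 minutes is an assumption, not a figure stated by ITIC.
    """
    if minutes < 0:
        raise ValueError("outage duration cannot be negative")
    if minutes < 30:      # assumed: anything shorter than Tier 2's floor
        return 1
    if minutes <= 240:    # between 30 minutes and four hours
        return 2
    return 3              # more than four hours


# Example: a ten-minute accidental power-down versus a six-hour outage.
print(classify_outage(10), classify_outage(360))
```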
Among the customers surveyed by ITIC, IBM's Power Systems running AIX (this includes older System p and pSeries iron) experienced the least amount of downtime per year, when averaged across all customers using these platforms. AIX shops reported an average of 0.42 Tier 1 incidents per year and 0.34 Tier 2 incidents, and not one customer reported a Tier 3 outage on their AIX boxes. The Power Systems machines running i (and this includes older System i and iSeries iron) had an average of 0.56 Tier 1 outages per year, 0.44 Tier 2 outages per year, and 0.12 Tier 3 outages. So in 2009 at least, the i platform fared a little worse than the AIX platform running on Power iron.
The numbers for the i platform were pretty similar to the numbers reported to ITIC by shops running HP-UX on PA-RISC or Itanium iron or running Solaris on Sparc iron. HP-UX shops deploying HP-UX 11i v3 on older PA-RISC iron reported an average of 0.60 Tier 1 outages per year, followed by 0.43 Tier 2 outages and 0.10 Tier 3 outages. With HP-UX on Itanium, the numbers were a little higher, with an average of 0.65 Tier 1 outages, 0.48 Tier 2 outages, and 0.14 Tier 3 outages. On Sparc boxes running Solaris, customers reported an average of 0.59 Tier 1 outages per year, 0.49 Tier 2 outages, and 0.10 Tier 3 outages.
When you do the math on the outages tracked by ITIC, the average Power Systems-AIX box had less than 15 minutes of unplanned downtime per year, half of what it was last year. HP-UX boxes averaged just 36 minutes of unplanned downtime on PA-RISC iron and 39 minutes on Itanium iron. Solaris boxes were in the same ballpark, with 35.4 minutes of downtime, but the aging of Sparc iron (caused in part by concerns among customers about Sun's future when it went onto the financial rocks in early 2008) is pushing up the downtime numbers a little bit here in 2009, according to ITIC's survey results.
Interestingly, servers running Mac OS at the shops polled by ITIC had 37.4 minutes of downtime per year.
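For readers more used to availability percentages than minutes of downtime, the conversion is simple arithmetic. The Python sketch below uses the downtime figures quoted above; the names `MINUTES_PER_YEAR`, `DOWNTIME_MINUTES`, and `availability_pct` are made up for illustration, the AIX value is an upper bound (ITIC's figure is "less than 15 minutes"), and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored

# Unplanned downtime in minutes per year, as quoted in the article.
# The AIX figure is an upper bound ("less than 15 minutes").
DOWNTIME_MINUTES = {
    "Power Systems-AIX": 15.0,
    "HP-UX on PA-RISC": 36.0,
    "HP-UX on Itanium": 39.0,
    "Solaris on Sparc": 35.4,
    "Mac OS": 37.4,
}


def availability_pct(downtime_minutes: float) -> float:
    """Percentage of the year a server was up, given its yearly downtime."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


for platform, minutes in DOWNTIME_MINUTES.items():
    print(f"{platform}: {availability_pct(minutes):.4f}% available")
```

On these figures every platform listed comes out above 99.99 per cent, since that threshold allows roughly 52.6 minutes of downtime a year.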
Microsoft's Windows Server 2003 and Windows Server 2008 platforms did not fare as well, but they are improving. In 2008, ITIC's survey respondents reported an average of 3.77 hours of unplanned downtime per year for their Windows boxes, but this has shrunk by 35 per cent in 2009 to 2.42 hours of downtime. While Windows servers have more downtime, the percentage of server incidents that make it to the Tier 2 or Tier 3 level is not appreciably higher, with only 29 per cent of total outages reaching these higher levels this year.
Those using IBM's Power-AIX machines reported that 19 per cent of their incidents rose to the Tier 2 or Tier 3 level, and Power-i shops reported a similar 21 per cent of incidents at that level. Solaris shops said that 25 per cent of their outages came in at these higher levels.
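The year-on-year improvement in Windows downtime can be checked with a line or two of arithmetic. This Python sketch uses the 2008 and 2009 figures quoted above; the helper name `pct_reduction` is made up for illustration. It also converts the 2009 figure to minutes so it can be set beside the Unix numbers.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("the starting value must be positive")
    return 100.0 * (before - after) / before


windows_2008_hours = 3.77  # average unplanned downtime per year, 2008
windows_2009_hours = 2.42  # average unplanned downtime per year, 2009

reduction = pct_reduction(windows_2008_hours, windows_2009_hours)
print(f"Reduction: {reduction:.1f}%")
print(f"2009 downtime: {windows_2009_hours * 60:.1f} minutes")
```

The result is just under 36 per cent, in line with the roughly 35 per cent cited, and the 2009 figure works out to a little over 145 minutes a year.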
The other interesting thing that DiDio tracked in the server reliability study is the experience level of the system administrators and the time it takes to patch a server. The average system admin in a Unix shop has 12.7 years of experience (and the average is 11 years for AS/400-i shops), which compares favorably with the seven years of experience for Windows admins, four years for Linux admins, and three years for Mac OS server admins.
"The experience level for Unix and AS/400 administrators is equivalent to having a master craftsman build something for you or a Grade-A mechanic fixing your car," says DiDio.
She added that commercial Linuxes have improved greatly in terms of documentation and that this is being reflected in the average time it takes a system administrator to patch a server. Linux shops reported that it took them anywhere from 15 to 19 minutes to patch a server, with variation depending on the Linux. (Ubuntu shows the greatest improvement among the Linuxes this year.) The Power-based servers took around 11 minutes to patch, on average, whether they were running AIX or i, while Solaris machines took 31 minutes and HP-UX boxes took 33 minutes. Windows Server 2003 machines took an average of 32 minutes to patch, according to survey respondents, and Windows Server 2008 machines took 38 minutes.
"The lesson to learn from all of this," says DiDio, "is that companies should not skimp on training and certification. That's penny wise, but pound foolish."
You can find out more about the ITIC server reliability survey here. ®