ITIC in the News


For the second year in a row, IBM AIX UNIX running on the Power or “P” series servers scored the highest reliability ratings among 15 different server operating system platforms – including Linux, Mac OS X, UNIX and Windows.

Those are the results of the ITIC 2009 Global Server Hardware and Server OS Reliability Survey, which polled C-level executives and IT managers at 400 corporations in 20 countries worldwide. The results indicate that the IBM AIX operating system running on Big Blue’s Power servers (System p5s) is the clear winner, offering rock-solid reliability. The IBM servers running AIX consistently scored at least 99.99% uptime, or just 15 minutes of unplanned downtime per server, per annum.
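
For readers who want to relate the per server, per annum downtime minutes quoted throughout this summary to availability percentages (the “nines”), the arithmetic is straightforward. The following is a minimal illustrative sketch, not part of ITIC’s methodology; the platform figures are simply the approximate values quoted in this summary.

```python
# Convert annual unplanned downtime into an availability percentage.
# Illustrative only: the figures are approximate values quoted in this summary,
# not ITIC's underlying survey data.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def availability_pct(downtime_minutes_per_year: float) -> float:
    """Availability as a percentage of the year, given unplanned downtime in minutes."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)

for platform, minutes in [("IBM AIX on Power", 15),
                          ("Sun Solaris on SPARC", 35),
                          ("HP-UX / Apple Mac OS X", 40)]:
    print(f"{platform}: {minutes} min/yr -> {availability_pct(minutes):.3f}%")

# 15 min/yr -> ~99.997%; 35 min/yr -> ~99.993%; 40 min/yr -> ~99.992%
# For reference, "four nines" (99.99%) corresponds to roughly 52.6 minutes per year.
```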

Overall, the results showed improvements in reliability and patch management procedures, and an across-the-board reduction in per server, per annum Tier 1, Tier 2 and the most severe Tier 3 outages. Among the other survey highlights:

  • IBM leads all vendors in both server hardware and server OS reliability, and recorded the fewest Tier 1, Tier 2 and Tier 3 unplanned server outages per year. IBM AIX running on the System p5s had less than one unplanned outage incident per server in a 12-month period. More impressively, the IBM servers experienced no Tier 3 outages. Tier 3 outages are the most severe; they usually involve more than four hours, or a half-day’s worth, of downtime and can also result in lost data.
  • HP-UX also performed well, though HP servers notched approximately 25 minutes more downtime than IBM servers, depending on model and configuration – or just under 40 minutes of per server, per annum downtime.
  • IT managers spend approximately 11 minutes to apply patches to IBM servers running the AIX operating system, which is, again, the least amount of time spent patching any server or operating system. The open source Ubuntu distribution is a close second, with IT managers spending 12 minutes to apply patches, while IT managers in Apple Mac OS X 10.x, Novell SuSE and customized Linux distribution environments each spend 15 to 19 minutes applying patches.
  • IBM also took top honors in another important category: IBM Power servers and AIX experienced the lowest combined number of the more severe Tier 2 and Tier 3 outages of any server hardware or server operating system. The combined total of Tier 2 and Tier 3 outages accounted for just 19% of all per server, per annum failures.
  • Microsoft Windows Server 2003 and Windows Server 2008 showed the biggest improvements of any of the vendors. The Windows Server 2003 and 2008 operating systems running on Intel-based platforms saw a 35% reduction in unplanned per server, per annum downtime, from 3.77 hours in 2008 to 2.42 hours in 2009. The number of annual Windows Server Tier 3 outages also decreased by 31% year over year, and the time spent applying patches similarly declined by 35% from last year, to 32 minutes in 2009 (see the sketch after this list).
  • This year’s survey, for the first time, also incorporated reliability results for the Apple Mac and the OS X 10.x platform. The survey respondents indicated that Apple products are extremely competitive in an enterprise setting: IT managers spend approximately 15 minutes per server to apply patches, and Apple Macs recorded just under 40 minutes of per server, per annum downtime.
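
The year-over-year improvements cited above follow the usual percentage-reduction arithmetic. Below is a small sketch using only the Windows Server figures quoted in the bullets; the implied 2008 patch time is derived from the stated 35% decline and 32-minute figure and is not a number published in the survey.

```python
# Percentage-reduction arithmetic behind the year-over-year Windows Server comparison above.
# Figures are the ones quoted in the summary bullets, used purely for illustration.

def reduction_pct(old: float, new: float) -> float:
    """Percentage decrease from old to new."""
    return 100.0 * (old - new) / old

# Unplanned downtime, hours per server per annum: 3.77 (2008) -> 2.42 (2009)
print(f"{reduction_pct(3.77, 2.42):.1f}% reduction")  # ~35.8%, reported as 35%

# Patch time "declined by 35% ... to 32 minutes in 2009"; the implied 2008 value
# (derived here, not published in the summary) would therefore be roughly:
implied_2008_patch_minutes = 32 / (1 - 0.35)
print(f"~{implied_2008_patch_minutes:.0f} minutes in 2008")  # ~49 minutes
```
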
28 Comments:
  • Wallawalla said:

    Hi Laura, great information and an excellent perspective on outages. A couple of points: 1) IBM has unified its Power-based product lines from the old “p” series and “i” series into Power Systems. Any new Power 6 based system would be considered a Power system if it were running AIX, Linux or IBM i (did you survey any IBM i customers?). 2) Did you take into consideration the scalability of Power Systems vs. Intel? Typically you will have 10s or 100s or even 1000s of Intel servers vs. one or 10s of Power servers. If each Intel server suffers a 2-3 hour outage, the impact on a business is significantly higher than if a few Power Systems are down for 15 minutes.

    • Hello,

      Thanks for your feedback. Yes, we did survey IBM i customers. And the full Report does go into detail regarding: 1) the workloads of the IBM (and also HP and Sun SPARC) systems being 35% to 60% heavier than those of a typical Wintel box and OS, and 2) the greater impact and domino effect of downtime when the less powerful, but more numerous, Wintel systems running Linux or Windows are down. I’m happy to discuss this with you in more detail. If you have any further questions, feel free to contact me directly via Email at ldidio@itic-corp.com.

      Best Regards,

      Laura

  • This was a very fascinating study. I’d like to see, if it exists, a breakdown of the root causes of ‘unplanned downtime’ on the various platforms: hardware (CPU, storage, I/O or network), software (systems software, application software, middleware, etc.), environment (power outage or cooling outage), and so on.

    My biggest interest may be hard to study or portray: the relative “serviceability” of the software on the various platforms. Not all problems cause outages, so serviceability is a very different subject. But it is something I’m personally very interested to assess, or see assessed: how well problems can be solved on the various platforms.

    I have some expectations of platform capabilities, but a study would be of interest to me. Even if everyone goes ‘to the cloud’, the serviceability of the platforms that those cloud applications run on is very important. A cloud problem and the ensuing service outage affects many!

    • Hi, Dan:
      Good points, all. Having spent a good part of my career as an investigative reporter, I’m passionately interested in finding out the “whys” and “wherefores” that precipitated problems. This latest ITIC Reliability Study was no different. Every survey I conduct contains an essay question to allow respondents to provide more extensive comments and detailed context on the issues they face. I also conduct two dozen first-person interviews so I can do a deeper dive. One thing I quickly discovered is that the IT manager’s definition of unplanned downtime was different from my definition. In today’s fast-paced, under-staffed and overworked IT departments, unplanned downtime is literally any time they have to take a server offline, fix an application or apply a patch when they hadn’t anticipated doing so. The hardware, software, components and network infrastructure devices are all much more reliable than they were 10 or even five years ago. The latest ITIC survey found that the chief root causes of unplanned downtime were NOT the hardware or software failures of the 1980s or 1990s, and not even the myriad security issues like viruses, Trojans, bots and zombies, but rather integration and interoperability problems; performing the proactive security remediation necessary to keep the network and data secure; applying patches and upgrades; and fixing human errors.
      That’s not to say that hardware doesn’t fail, that applications don’t still freeze or crash, or that connections aren’t lost. Those things still happen, but much less frequently than in years past. Also, the newer the platform, the more challenging it will be to service. In their early years, Linux and open source platforms lacked the breadth and depth of documentation of UNIX, Windows or even the Mac, so that when a problem did occur, IT managers often spent more time searching for fixes — which, in turn, prolonged the service outage. That’s rapidly changing now as the distributions mature and the IT managers gain more experience. The sheer size and complexity of today’s networks and emerging technologies like virtualization and the cloud constitute the biggest challenges to platform serviceability. And of course, this is exacerbated by the ongoing economic crunch, which has left just about every IT department with fewer resources and less budget.

  • Phil said:

    Hello, your statement “The results indicate that the IBM AIX operating system whether running on Big Blue’s Power servers (System p5s) or Intel-based x86, is the clear winner, offering rock solid reliability” is false. AIX does NOT support x86 systems. And when you refer to “Unix”, what do you mean? Unix comes in many different flavors, from Solaris and AIX to HP-UX, all supporting very different architectures with different levels of RAS. It is probably a good idea to use the product names and not pigeonhole Unix. Where can we find which systems were a part of this survey? Clearly, the models, age and OS level will all dramatically impact RAS.

    • Hello, Phil: Thanks for stopping by the site and leaving a message. You are correct: IBM AIX does not run on Intel x86 platforms; that was a copy editing mistake and I have since corrected it. As an FYI, the full Report does distinguish among the different versions of server operating systems that were compared. However, only the Executive Summary and survey highlights are published on the site. If you have specific questions, I would be happy to answer via private Email.

      Best Regards,

      Laura DiDio

  • Victor said:

    It would be interesting to know which version of each OS was compared… I guess such information was not included in the poll, which makes it lose a lot of credibility.

    You are not comparing apples to apples… what are you comparing? The latest AIX against Solaris 9? HP-UX 10 against a tuned SuSE? AIX running on Power against Solaris 10 running on x86? Overall, this is more of a “perception” poll than proper, scientifically obtained real-life results…

  • Do you have updated data, since your Dec. ’08 data (reported in Business Week – http://www.businessweek.com/technology/ByteOfTheApple/blog/archives/2008/12/more_good_news.html) about the growth of the Mac in enterprises?
    Gerrit

  • Hello, Gerrit:

    Thanks for your inquiry. Your timing is good: I expect to have new survey data on the growth of Apple and the Mac in enterprises within the next few weeks. Keep watching this site for updates.

    Best Regards,

    Laura

  • Interesting data. I am surprised there was no mention of Sun or Solaris in the discussion. Was the survey issued only to IBM and HP shops?

    • Hi, Chris

      Thanks for your comments. Sun Microsystems SPARC and Solaris are very prominently featured in the survey, and they performed very well. Check out the graphics on the home page of my Website and you’ll see. Specifically, Sun Solaris running on the SPARC servers averaged a very reliable 35 minutes of per server, per annum downtime, which is the equivalent of 99.99%, or four nines! This is excellent. Similarly, Sun Solaris running on the SPARC servers ranked in the top three (3) of all distributions for the least amount of the most severe Tier 3 outages (lasting more than four hours) — with just 0.10 outages per server, per year. And Sun IT administrators spent about 31 minutes to apply patches/fixes to the servers in 2009; this is a 20% reduction from the 37 minutes they spent patching the Sun SPARC Solaris servers in ITIC’s 2008 poll.

      I hope you find this information useful.

  • Very good and interesting study, but unless I overlooked it, I am missing something: Linux.
    In today’s data center, Linux is becoming a big player and, as far as I know, is still growing.
    But I don’t see any figures on downtime and fixability of Linux x86 systems. Is there a reason why they are not in the summary?

    • Hello, Wim:

      Thanks for stopping by and leaving your comments. You are correct: Linux and open source distributions are crucial components in today’s enterprise networks. Worldwide, Linux and open source distributions running on x86 servers represent about 30% of the installed base. I typically do not post all the results on the Website, but if you click on the graphics you will see that the Linux and open source distributions performed very well. Specifically, among the Linux and open source server operating system distributions, both the Novell SUSE Linux Enterprise 10 and 11 versions consistently achieved superior reliability ratings. In fact, Novell SUSE in a customized implementation had the lowest downtime, approximately 16 minutes per server/server OS, per annum, of any distribution with the exception of IBM’s AIX on the Power Series. Red Hat Enterprise Linux recorded a very respectable 1.12 hours of downtime per server, per annum. Ubuntu, which is quickly growing in popularity and deployments, had 1.69 hours, or approximately 101 minutes, of per server, per annum downtime. IT administrators interviewed by ITIC attributed the higher levels of downtime in their Ubuntu networks to the fact that in many cases these were first-time deployments, and they took more time to familiarize themselves with the servers and the fixes during the deployment, upgrade and patch processes. ITIC fully expects Ubuntu downtime to decrease accordingly as the administrators gain expertise. Overall, though, x86-based Linux and open source OSes are extremely competitive and reliable.

  • J Foot said:

    Is it possible to see some of the underlying data for this study? The conclusions are interesting, but it would be difficult to make decisions without seeing where they came from.

    • Hello, J Foot:

      Thanks for visiting the site. Can you please specify what type of underlying data you’d like to see from the ITIC 2009 Global Server Hardware & Server OS Reliability Survey Results? I will be happy to respond if I know what you’re interested in.

  • Michael said:

    I would love to see how these numbers change with the experience of the sysadmin. How much downtime for a neophyte on all 3 platforms? How much downtime for a 12-year guru on all platforms? I’m pretty sure that the Unix folks get better results over time – but what about Windows & Mac? What do the curves look like?

    The relevance is that you’ll get a fascinating picture of the support implications for respective systems.

  • Hi, Michael:

    Thanks for your comments. Your question is a good one and one that I will certainly raise in my next survey. Certainly, the experience or lack thereof of the IT staff in a specific organization can have an impact on reliability and uptime. That said, only the smallest micro SMB organizations with fewer than 25 end users are likely to be wholly reliant on one or two IT managers. ITIC conducted two dozen first person interviews with IT managers representing just about every server hardware and server OS environment and most of them exhibited a high degree of intelligence and involvement regarding the intricacies of their particular implementations. We found that the reliability of a particular server or server OS is most often undermined or adversely impacted by third party software or driver integration and interoperability issues. Downtime is extended while IT administrators search for the fix or wait for a third party ISV to come up with a patch. Based on the survey responses and interviews, these types of issues which adversely impact reliability appear to be more problematic — with rare exceptions — than inherent flaws in the server hardware and server OS software.

    Stay tuned for future ITIC surveys and updates and I will try and answer your question about system administrator experience and downtime.

  • Timothy said:

    Do you have any IBM System z (mainframe) results, by operating system (z/OS, Linux, etc.)?

    Also, one of the difficulties I find in understanding downtime is the difference between the IT point of view and the end user point of view. In my opinion the end user point of view is the only one that matters — what could also be called “business service delivery.” (It’s the CEO, not the CIO, basically.) The end user would count planned outages (as well as unplanned) and “System is slow/I can’t get my job done” issues as downtime. The IT-centric view often (wrongly, I think) excludes both.

    And then there’s the “if a tree falls in the woods…” problem. That is, if there’s an outage, how quickly would anyone notice? That doesn’t matter as much in *relative* rankings if each type of server is deployed, in equal distributions, to similar critical roles. But of course we know that’s not true: some types of servers are deployed more frequently to roles where outages are more visible and more immediately detected. If server types are deployed correctly, with the most reliable and available servers deployed to the more critical (and visible) roles, then this phenomenon would tend to compress the reported outage times. (Outages among less critical servers would be less quickly reported, on average, so the outage durations would be more under-reported. “When we came into the office in the morning, the server was down” sort of thing.) There is also the likely factor that more critical server types are deployed with more reliable surrounding infrastructure: better and more redundant networks, as a major example.

    I’m wondering if you’ve given some thought to these issues and whether you have any ideas for how to control for them.

    • Hi, Timothy:

      Thanks for visiting the site and posing your question. You are one of several people who are interested in statistics on IBM System z reliability. The answer is: I’m working on it and will conduct a survey on this within the next six (6) weeks. So please be patient.

      Your comments on defining and differentiating downtime based on the IT point of view versus the end user point of view are very insightful. I’ve struggled with and confronted this definition myself. When I first began doing reliability surveys about six years ago, I quickly realized that the definition of downtime varied according to who answered the question. For example, many IT administrators define unplanned downtime as any event that causes them to take a server offline, regardless of the underlying cause. End users, of course, are sometimes oblivious to or uncaring of the reason for downtime; they are only concerned with the duration, the frequency and the severity of an outage in terms of the network, applications and services being unavailable or inaccessible to them.

      Your observations about less critical server outages being under-reported also have a lot of merit and truth. The fact is that IT departments have suffered greatly because of the ongoing economic crunch — staff, resources and training have been cut, in some cases to the bone. So naturally reporting outages and tracking key metrics like reliability, TCO, ROI, the ability to meet SLAs etc. are all suffering.

      So yes, I definitely consider the issues that you’ve raised, and one way to address them is to establish much better communication, collaboration and cooperation among C-level executives, IT departments and the physical plant facilities managers, AND to solicit input from end users. IT departments must also be much more diligent about keeping track of their own performance and reliability metrics, as well as performing post-mortems on the remedial actions they took following a service outage.

      Thanks again for your insightful comments. And keep checking back for future survey results.

  • Timothy said:

    Sounds good, Laura. I’m looking forward to reading more.

    It also occurs to me that “Linux” (or even “Red Hat Linux”) is not descriptive enough these days. There’s “Red Hat Enterprise Linux on X86 servers” and “Red Hat Linux on System z servers,” to pick a couple examples. (Solaris X86 and Solaris SPARC is another likely differentiated pair that comes to mind.) Granted, you can cut the data as fine as you want, with patch levels and specific hardware models and configurations. At some point it’s a judgment call. But it would be interesting in a few OS cases to see what impact hardware has (directly or indirectly). In particular, I think (if there are enough survey responses) it would be interesting to see a couple different flavors of Linux iron (X86 and z would be good) and Solaris broken out.

    You raise another really good point about the economic situation possibly impairing availability reporting. I think that’s quite likely. But I also see some anecdotal evidence that service qualities (including availability) become even more important in an economic downturn. I think the reason is Darwinian: “survival of the fittest,” basically. When producers/suppliers are more desperate for business, the few remaining customers can be more discriminating, more fickle. And there’s also a general “flight to quality,” particularly for capital goods and high relationship services, because customers don’t want to get stuck doing business with a company that’s going out of business. All of that probably conspires to drive IT service quality requirements up, especially in areas that might attract unwelcome public attention and customer flight such as highly visible availability failures or security breaches.

    Thanks for working on this.

  • I looked at the graphics on your site’s home page, but did not see any reference to the one platform that has traditionally been at the top of this list: HP’s OpenVMS and VMSclusters.

    Was it considered and just didn’t show up on the charts? Or was it not even considered?

    • Hello, Aaron:

      I appreciate you taking the time to stop by and leave a comment on the ITIC reliability survey results. In answer to your question, yes, the HP OpenVMS platform was considered. However, in order to include a platform in the final published results, we require 100+ responses so that the sample is statistically valid. We did not receive that many responses for the HP OpenVMS platform. However, the responses that we did get indicated that the platform is extremely reliable and scored very favorably when compared against the other top distributions. Overall, HP OpenVMS servers had less than 30 minutes of per server, per annum downtime. If you want more information, please send me a follow-up Email with specific questions and I’ll be happy to address them.

      Thanks again; your comments are appreciated.

  • Laura,

    When you start looking at IBM mainframes you should also consider looking at HP NonStop servers (formerly known as Tandem systems). These are the most reliable systems in the world as far as I know; we are talking “five nines,” or 99.999% uptime.

    • Hi, Thomas:

      Thanks for taking the time to stop by and leave a comment. The forthcoming ITIC survey on mainframe usage and reliability will definitely include the HP NonStop servers as well as Stratus servers. I concur with your assessment of the very high reliability of the HP platform. Everything I’ve heard from corporate customers over the years echoes what you’ve said, 99.999% is very common among HP and Stratus servers. Stay tuned for future research.

  • One question, please.
    For example, with the Unix Sun Solaris on SPARC server category: the annual average numbers of incidents for Tier 1/2/3 are 0.59/0.49/0.1, so the percentage of incidents for Tier 2 and 3 should be (0.49 + 0.1) / (0.49 + 0.59 + 0.1) = 50%. Why does the chart say that Tier 2/3 combined account for only 25% of all incidents?

  • mike ferrell said:

    Hi Laura – this looks like a well-done and honest survey – anything new on the “mainframe” survey?

  • Hi, Mike:

    Thanks for asking and good timing. ITIC analysts are right now putting together our year-end surveys which will also track 2010 trends. The mainframe survey is among them. We expect to “go live” with the mainframe survey in the next six weeks and publicize results in January 2010. Please feel free to let us know if there are any particular questions you’d like to see included. We’ll do our best to see that ITIC addresses the most pressing and pertinent issues impacting mainframe enterprises.

  • Let me just support the comments posted referencing IBM mainframes and HP NonStop (formerly Tandem), where indeed the degree of availability exceeds what has been covered here – and no, they are not in decline nor losing popularity among those that mandate reliable servers; just ask those depending on Amazon.com about their cloud computing issues.

    As the former Chairman of the International Tandem User Group and a former Marketing Director of the IBM SHARE user group, I have lived with both vendors’ products, and both will find homes at the very heart of clouds – simply because they can run 24 X 7 X forever with levels of “9s” far above anything else.

    What separates the two is that the NonStop works right out of the box whereas you do need sufficiently skilled Systems Programmers to “assemble” a Parallel Sysplex config to do something similar.
