Human Error, IT Staff Shortages and Aging Hardware Undercut Reliability

To reiterate, ITIC’s fifth annual reliability survey results indicate that the inherent reliability and uptime of nearly all of the 14 major server hardware platforms and 18 server operating system distributions continue to improve. At the same time, however, user error is becoming a larger factor in undercutting overall reliability.

These gains stem from technical advances in the underlying processor technology from companies like Intel Corp. and Advanced Micro Devices and in memory and disk technology, as well as improvements to the core server hardware and server OSs that boost performance, scalability, security and the ability to support heavier workloads.

As organizations strive to accomplish more with fewer resources, IT departments must rely even more heavily on their vendors to deliver more reliable servers and server OS platforms, along with top-notch technical support in the form of regular patches and documentation.

For businesses leveraging IT, time is literally money. Even a few minutes of downtime can result in significant costs and cause internal business operations to grind to a halt. Downtime can also adversely affect a company’s relationship with its customers, business suppliers and partners. A lack of reliability can damage a company’s reputation, result in lost business and raise the risk of litigation.

Vendors’ ability to deliver top-notch technical service and support, including a quick response with updates, fixes and patches to known flaws and security vulnerabilities, also figures prominently in reliability. Technical service and support, good and bad, also differentiates vendors from their competitors. How promptly, efficiently and effectively vendors respond to corporate customers when issues arise has a definite impact on customer retention and on customers’ willingness to upgrade, purchase new equipment and software, expand their usage of specific products and renew service contracts. Dell, Fujitsu, HP, IBM, Microsoft and Stratus all scored high marks for technical service, support and responsiveness. Dell, for example, has benefitted greatly from its decision to move its online technical support back onshore in the past several years. Dell support received high praise in anecdotal essay comments and in the first-person customer interviews for fast, friendly and efficient service.

The survey results also showed that the strong technology gains in hardware, software, security and virtualization technologies were undercut by other issues that adversely impacted reliability, including:

  • Human error
  • Understaffed and overworked IT departments
  • Prolonged refresh cycles that leave aging server hardware in place even though it is inadequate for data-intensive workloads

The ITIC 2013 reliability survey marks the first time that respondents had the option of choosing “user error” as negatively impacting reliability, and it shot to number two on the list, with 28% of respondents acknowledging the impact of IT staff mistakes on downtime. In fact, user error was second only to “bugs and flaws in the server operating system” as a cause of downtime. Nearly one-third (31%) of respondents cited bugs/flaws in the operating system as a cause of downtime, while 24% of participants attributed downtime to instability/problems with server hardware. And 22% of respondents indicated that security issues and the fact that their IT departments were understaffed and overworked also negatively impacted network reliability.

There is clearly a direct correlation between the 28% of survey respondents who blamed human error for reliability issues and the 22% of participants who specified understaffed and overworked IT departments and administrators as undermining reliability. Additionally, 47% of respondents indicated that when a significant portion of their main line of business (LOB) server hardware is more than three and a half years old, there has been at least some adverse impact on overall network reliability.

The percentage of businesses with server hardware that is five years old doubled from 6% in 2011 to 12% in 2013. At the same time, the percentage of organizations whose servers are one year old or newer also rose, to 12% from 7% a year ago.

Organizations cited a number of issues related to system patches/patching processes:

  • Some 26% of businesses spend more than one hour applying patches manually, down from 40% in the 2011-2012 survey. And only 11% of respondents spend more than four hours applying patches, versus 29% in the 2011-2012 survey.
  • Over one-third of companies (34%) now use group policy to automate/apply patches
  • Notably, only 4% of IBM server users reported experiencing one to more than four hours of per server/per annum downtime, compared to 6% of Oracle users and 7% of HP server users
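
To put the per server/per annum downtime figures in perspective, annual downtime hours can be converted into an availability percentage. The short Python sketch below shows the arithmetic; the function name and the 8,760-hour year are illustrative assumptions and are not part of ITIC’s survey methodology.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumption: leap years ignored


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return yearly availability (%) for a given amount of annual downtime.

    Hypothetical helper for illustration only.
    """
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and 8,760 hours")
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


# The downtime band cited in the survey starts at one hour and exceeds four.
for hours in (1, 4):
    pct = availability_percent(hours)
    print(f"{hours} h/year of downtime -> {pct:.3f}% availability")
```

Under these assumptions, one hour of downtime per server per year works out to roughly 99.99% availability, and four hours to roughly 99.95%.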

Additionally, the survey data, coupled with anecdotal responses and first-person customer interviews, indicated that the reliability of virtually all server OS distributions and underlying hardware has improved. The overwhelming majority of unplanned Tier 1 and Tier 2 outages and downtime are directly attributable to integration and interoperability issues such as incompatible drivers, trouble applying patches (particularly in highly customized environments), misconfigurations, human error and the lack of a specific component or software fix for a particular platform.
