ITIC Poll: Human Error and Security are Top Issues Negatively Impacting Reliability
Multiple issues influence the reliability ratings of the various server hardware platforms. ITIC’s 2018 Global Server Hardware, Server OS Reliability Mid-Year Update reveals that three issues in particular stand out as positively or negatively impacting reliability: human error, security and increased workloads.
ITIC’s 2018 Global Server Hardware, Server OS Reliability Mid-Year Update polled over 800 customers worldwide from April through mid-July 2018. To obtain the most objective and unbiased results, ITIC accepted no vendor sponsorship for the Web-based survey.
Human Error and Security Are Biggest Reliability Threats
ITIC’s latest 2018 Reliability Mid-Year Update poll also chronicled the strain that external issues place on organizations and their IT departments as they work to ensure that servers and operating systems deliver a high degree of reliability and availability. As Exhibit 1 illustrates, human error and security (encompassing both internal and external hacks) continue to rank as the chief culprits causing unplanned downtime among servers, operating systems and applications for the fourth straight year. After that, there is a drop-off of 22 to 30 percentage points to the remaining issues in the top five downtime causes. Human error and security have had the dubious distinction of ranking as the top two factors precipitating unplanned downtime in each of the past five ITIC reliability polls.
Analysis
Reliability is a two-way street: server hardware, OS and application vendors on one side and corporate users on the other both bear responsibility for the reliability of their systems and networks.
On the vendor side, there are obvious reasons why mission critical servers from hardware makers like HPE, IBM and Lenovo consistently earn top reliability ratings. As ITIC noted in Part 1 of its reliability survey findings, the reliability gap between high end systems and inexpensive commodity servers with basic features continues to grow. The reasons include:
- Research and Development (R&D). Vendors like Cisco, HPE, Huawei, IBM and Lenovo have made an ongoing commitment to R&D and continually refresh and update their solutions.
- RAS 2.0. The higher end servers incorporate the latest Reliability, Availability and Serviceability (RAS) 2.0 features and functions and are fine-tuned for manageability and security.
- Price is not the top consideration. Businesses that purchase higher end mission critical and x86 systems like Fujitsu’s Primergy, HPE’s Integrity, Huawei’s KunLun, IBM Z and Power Systems and Lenovo System x want a best-in-class product offering, first and foremost. These corporations, in verticals like banking/finance, government, healthcare, manufacturing, retail and utilities, are motivated more by the vendor’s historical ability to act as a true, responsive “partner” delivering highly robust, leading edge hardware. They also want top-notch after-market technical service and support, quick response to problems and fast, efficient access to patches and fixes.
- More experienced IT managers. In general, IT managers, application developers, systems engineers and security professionals at corporations that purchase higher end servers from IBM, HPE, Lenovo and Huawei tend to have more experience. The survey found that organizations that buy mission critical servers have IT and technical staff with approximately 12 to 13 years of experience. By contrast, the average experience among IT managers and systems engineers at companies that purchase less expensive commodity-based servers is about six years.
Highly experienced IT managers are more likely to spot problems before they become major issues that lead to downtime. In the event of an outage, they are also more likely to remediate faster, shortening the time it takes to identify the problem and get servers and applications up and running again compared with less experienced peers.
In an era of increasingly connected servers, systems, applications, networks and people, myriad factors can potentially undercut reliability. They include:
- Human error and security. To reiterate, these two factors constitute the top threats to reliability, and ITIC does not anticipate this changing in the foreseeable future. Some 59% of respondents cited human error as their number one issue, followed by 51% who said security problems caused downtime. Nearly two-thirds (62%) of businesses indicated that their security and IT administrators grapple with a near constant deluge of ever more pervasive and pernicious security threats. If the availability, reliability or accessibility of servers, operating systems and mission critical line of business (LOB) applications is compromised or denied, end user productivity and business operations suffer immediate consequences.
- Heavier, more data-intensive workloads. The latest ITIC survey data finds that workloads have increased by 14% to 39% over the past 18 months.
- A 60% majority of respondents say increased workloads negatively impact reliability, up 15 percentage points since 2017. Of that 60%, approximately 80% of the firms experiencing reliability declines run commodity servers: e.g., white box hardware and older Dell, HPE ProLiant and Oracle servers more than 3½ years old that have not been retrofitted or upgraded.
- Provisioning complex new applications that must integrate and interoperate with legacy systems and applications. Some 40% of survey respondents rate application deployment and provisioning among their biggest challenges, and one that can negatively impact reliability.
- IT departments spending more time applying patches. Some 54% of those polled indicated they spend anywhere from one hour to more than four hours applying patches, especially security patches. Users said the security patches are large, time consuming and often complex, necessitating that they test and apply them manually. The percentage of firms automatically applying patches correspondingly decreased from 30% in 2016 to just 9% in the latest 2018 poll (see the patch automation sketch after this list). Overall, the latest ITIC survey shows that as of July 2018, companies are applying 27% more patches than at any time since 2015.
- Deploying new technologies like Artificial Intelligence (AI) and Big Data analytics, which require special expertise from IT managers and application developers as well as a high degree of compatibility and interoperability.
- A rise in Internet of Things (IoT) and edge computing deployments, which in turn increase the number of connections that organizations and their IT departments must oversee and manage.
- Seven in 10 (71%) of survey respondents said aged hardware (3½+ years old) had a negative impact on server uptime and reliability, compared with just 16% who said their older servers had not experienced any declines in reliability or availability. This is an increase of five percentage points from the 66% who responded affirmatively to that question in the ITIC 2017 Reliability Survey, and a 27 percentage point increase from the 44% who said outmoded hardware negatively impacted uptime in the ITIC 2014 Reliability poll.
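The patching burden described above lends itself to at least partial automation. Below is a minimal, hypothetical sketch of a timed patch run on a Debian/Ubuntu-based server. The apt commands and the decision to apply all pending upgrades non-interactively are illustrative assumptions, not an ITIC recommendation; RPM-based or Windows environments would need different tooling, and production fleets would normally stage and test patches first.

```python
#!/usr/bin/env python3
"""Minimal sketch: timing an automated patch run on a Debian/Ubuntu host.

Hypothetical illustration only; requires root privileges and assumes
apt-based package management.
"""
import subprocess
import time


def pending_updates() -> list[str]:
    """Return the names of packages with upgrades available."""
    out = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    ).stdout
    # First line is a header ("Listing..."); the rest are package entries.
    return [line.split("/")[0] for line in out.splitlines()[1:] if line]


def apply_updates() -> float:
    """Apply all pending upgrades non-interactively; return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(["apt-get", "-y", "upgrade"], check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    packages = pending_updates()
    print(f"{len(packages)} packages pending: {packages}")
    elapsed = apply_updates()
    print(f"Patch run completed in {elapsed / 60:.1f} minutes")
```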
Corporations’ Minimum Reliability Requirements Rise
At the same time, corporations now require higher levels of reliability than they did even two or three years ago. The reliability and continuous operation of the core infrastructure and its component parts (server hardware, server operating system software, applications and other devices such as firewalls, unified communications devices and uninterruptible power supplies) are more crucial than ever to the organization’s bottom line.
It is clear that corporations, from the smallest companies with fewer than 25 people to the largest multinational concerns with over one hundred thousand employees, are more risk averse and more concerned about the potential for lawsuits and damage to their reputation in the wake of an outage. ITIC’s survey data indicates that an 84% majority of organizations now require a minimum of “four nines” (99.99%) reliability and uptime.
This equates to a maximum of approximately 52 minutes per year of unplanned downtime for mission critical systems and applications, or just 4.33 minutes of unplanned outage per month for servers, applications and networks.
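For readers who want to translate an availability target into a concrete downtime budget, the short Python sketch below does the arithmetic. The exact figures differ slightly from the rounded numbers above depending on whether a 365- or 365.25-day year is assumed.

```python
# Sketch: converting an availability target ("nines") into a downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes


def downtime_budget(availability: float) -> tuple[float, float]:
    """Return (minutes/year, minutes/month) of allowable unplanned downtime."""
    per_year = (1.0 - availability) * MINUTES_PER_YEAR
    return per_year, per_year / 12


for nines in (0.99, 0.999, 0.9999, 0.99999):
    yearly, monthly = downtime_budget(nines)
    print(f"{nines:.3%} uptime -> {yearly:7.1f} min/year, {monthly:5.2f} min/month")
# "Four nines" (99.99%) yields roughly 52.6 minutes per year,
# or about 4.4 minutes per month.
```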
Conclusions
The vendors are one-half of the equation. Corporate users also bear responsibility for the reliability of their servers and applications based on configuration, utilization, provisioning, management and security.
To minimize downtime and increase system and network availability, it is imperative that corporations work with vendor partners to ensure that reliability and uptime are inherent features of all their servers, network connectivity devices, applications and mobile devices. This requires careful tactical and strategic planning.
Human error and security are, and will continue to be, the greatest threats to the underlying reliability and stability of server hardware, operating systems and applications. A key element of every firm’s reliability strategy is to obtain the necessary training and certification for IT managers, engineers and security professionals, and to ensure that staff complete security awareness training. Engaging third party vendors to conduct security vulnerability testing to identify and eliminate potential vulnerabilities is also highly recommended. Corporations must also deploy the appropriate auditing, BI and network monitoring tools. Every 21st century network environment needs continuous, comprehensive end-to-end monitoring of its complex, distributed applications across physical, virtual and cloud environments.
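As one illustration of what lightweight end-to-end monitoring can look like, the following Python sketch polls a set of HTTP health endpoints on a fixed interval. The endpoint URLs, interval and timeout are hypothetical placeholders; production environments would typically rely on a dedicated monitoring platform (e.g., Nagios, Zabbix or Prometheus) rather than a hand-rolled loop.

```python
"""Minimal sketch of continuous end-to-end service monitoring.

Illustrative only: the endpoints below are placeholders, not real services.
"""
import time
import urllib.error
import urllib.request

ENDPOINTS = {                       # hypothetical services to probe
    "web-frontend": "https://app.example.com/health",
    "api-gateway":  "https://api.example.com/health",
}
CHECK_INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = 5


def probe(name: str, url: str) -> None:
    """Issue one HTTP health check and report latency or failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            latency_ms = (time.monotonic() - start) * 1000
            status = "OK" if resp.status == 200 else f"HTTP {resp.status}"
            print(f"{name}: {status} ({latency_ms:.0f} ms)")
    except (urllib.error.URLError, TimeoutError) as exc:
        # In practice this would alert an on-call engineer, not just print.
        print(f"{name}: DOWN ({exc})")


if __name__ == "__main__":
    while True:
        for name, url in ENDPOINTS.items():
            probe(name, url)
        time.sleep(CHECK_INTERVAL_SECONDS)
```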
Ask yourself: “How much reliability does the infrastructure require and how much risk can the company safely tolerate?”