Amazon Elastic Block Store was one of the first AWS offerings, and it remains a critical component for users that require direct access to storage volumes for databases and other transactional applications. But Amazon EBS performance might be more prone to disruptions than more abstracted AWS offerings.
AWS resources seem to instantaneously appear at the click of a button, but ultimately, hardware provides these virtual resources. While AWS uses multiple levels of redundancy and software automation to protect systems and data, some services tie more closely to individual pieces of equipment than others.
EBS is one of those services, which means its users notice equipment failures more often. That creates a perception that EBS has more glitches than other native cloud services. Let's examine the most common EBS problems, their causes and some troubleshooting advice.
Storage network in the cloud
AWS users complain about a number of Amazon EBS performance issues, and the reasons derive from the nature and delivery of the EBS resource itself, which is effectively a giant, shared storage area network (SAN). EBS storage volumes are conceptually identical to those on an enterprise array that are available to servers over a SAN. When a server attaches to a volume, it acts as local block storage that IT professionals can use just like a local disk for databases or other applications.
AWS automatically replicates EBS data to provide high availability, and users can make volume snapshot copies without unmounting the disk. But each volume can only attach to a single Elastic Compute Cloud (EC2) instance at a time.
Like enterprise SAN volumes, EBS data is persistent, and any instance can use the data. But the shared, yet exclusive, nature of EBS access leads to problems, particularly when ops teams try to build stateful persistence into a container cluster.
Stuck EBS volumes
Stuck or hung EBS volumes are among the most commonly reported problems with the service. The issue occurs when an instance attaches to or detaches from a volume: the attachment process between the EC2 instance and the EBS volume never completes, and the volume gets stuck in the Attaching state.
The issue is so widespread that it warrants a troubleshooting section in the EBS documentation. As an AWS Knowledge Center page points out, the cause can be something simple, such as using the same device name on two different instances. This creates a race condition in EBS, in which the second instance might try to attach the volume before the first instance releases it. Also, as AWS documentation highlights, the EBS host could experience a hardware failure. An AWS employee addressed this in a discussion forum: "Although EBS volumes are designed for reliability, including being backed by multiple physical drives, we are still exposed to durability risks when multiple concurrent component failures occur before we are able to restore redundancy."
It's not ideal, but IT pros can use a quick workaround to improve Amazon EBS performance: use a different device name when attaching the volume from the second instance. If that doesn't work, manually initiate a forced detach of the volume from the AWS Management Console, delete any leftover device names from the original attachment and reattach the volume.
If all else fails, reboot the instances. But before you do anything that risks data loss, make sure there's a current snapshot, which stores a volume image in Simple Storage Service (S3). You can also manually back up the problematic EBS volume, as forced detachment of a stuck volume can damage the file system or the data it contains.
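As a sketch of that "snapshot first, then force-detach" sequence with boto3: this assumes boto3 is installed and AWS credentials are configured, and the region default and function name are illustrative choices, not anything AWS prescribes.

```python
# Hedged sketch: snapshot an EBS volume, then force-detach it. The region
# default is an arbitrary example.
def snapshot_then_force_detach(volume_id, region="us-east-1"):
    import boto3  # imported here so the sketch can be read without AWS set up

    ec2 = boto3.client("ec2", region_name=region)

    # Snapshot first: a forced detach can damage the file system, so make
    # sure a current copy of the volume is in S3 before risking it.
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="Safety snapshot before forced detach",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Only then force the detach and wait for the volume to become available.
    ec2.detach_volume(VolumeId=volume_id, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return snap["SnapshotId"]
```

Called with a volume ID such as `snapshot_then_force_detach("vol-0123456789abcdef0")`, this returns the snapshot ID once the volume is free; if the volume stays stuck even after the forced detach, rebooting the instance remains the fallback.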
Slow or degrading performance
Inconsistent or deteriorating I/O performance also draws regular complaints. These Amazon EBS performance problems often result from the inherent limitations of standard, disk-based volumes, which are rate-limited to 500 IOPS. EBS queues requests above that threshold.
EBS specifications state that Throughput Optimized HDD volumes deliver "a baseline throughput of 40 MBps per TB and a maximum throughput of 500 MBps per volume … [and are] designed to deliver the expected throughput performance 99% of the time … [with] enough I/O credits to support a full-volume scan at the burst rate." Solid-state drive (SSD) volumes, on the other hand, top out at 10,000 to 20,000 IOPS, depending on the type.
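Those Throughput Optimized HDD figures reduce to a simple capacity calculation. A minimal sketch, using only the quoted 40 MBps per TB baseline and 500 MBps per-volume cap (the volume sizes are arbitrary examples):

```python
# Baseline throughput for a Throughput Optimized HDD (st1) volume:
# 40 MBps per TB of provisioned capacity, capped at 500 MBps per volume.
def st1_baseline_mbps(size_tb):
    return min(40 * size_tb, 500)

# A 2 TB volume earns an 80 MBps baseline; a 14 TB volume hits the cap.
print(st1_baseline_mbps(2))   # 80
print(st1_baseline_mbps(14))  # 500
```

The takeaway is that small Throughput Optimized volumes have low baselines, so teams that size volumes only for capacity can starve a sequential workload of throughput.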
The shared network that connects EBS and EC2 systems can create another performance bottleneck, in which storage traffic contends with storage and network traffic from other instances. Working around this AWS design limitation costs money: teams should deploy transactional databases and other applications that require predictable latency, throughput and IOPS on EBS-optimized EC2 instances. These instance types have a dedicated link between EC2 and EBS, with guaranteed throughput ranging from 425 Mbps to 14,000 Mbps, depending on the instance.
Follow these other Amazon EBS performance-tuning recommendations:
- Monitor and maintain adequate volume queue length for an application. AWS recommends a queue length of four or more when performing 1 MiB sequential I/O.
- Increase the OS cache read-ahead for high-throughput and read-heavy workloads. Check and change the parameter via the blockdev command on Linux.
- Use a modern Linux kernel or Windows Server OS with the latest patches.
- Use RAID 0 to stripe data across two volumes.
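The queue-length recommendation above follows from Little's law: average queue depth equals request rate times per-request latency. A minimal sketch of that arithmetic (the IOPS and latency figures are illustrative, not AWS-published targets):

```python
# Little's law applied to an EBS volume's queue:
# in-flight requests ~= IOPS * per-request latency (in seconds).
def target_queue_depth(iops, latency_ms):
    return iops * (latency_ms / 1000.0)

# To sustain 4,000 IOPS at 1 ms average latency, keep ~4 requests in flight.
print(target_queue_depth(4000, 1))  # 4.0
```

Too shallow a queue leaves provisioned IOPS unused; too deep a queue inflates latency without adding throughput, so monitoring both sides of this equation is the point of the first recommendation.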
You might also encounter performance woes when you first use a volume restored from a snapshot. Although new, empty EBS volumes deliver peak performance from the outset, blocks restored from S3 snapshots must be initialized before first access. This initialization significantly increases I/O latency and can reduce overall performance by as much as 50%. It won't be a problem for applications that do infrequent random reads, but an ops team can read every block before first use to initialize the EBS volume and eliminate the performance hit. Depending on the EC2 instance type, the EBS volume type and the size of the volume, the process can take from a few minutes to several hours.
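A back-of-the-envelope estimate shows why that pre-read can take hours: initialization means reading every block once, so the time is simply volume size divided by read throughput. The 100 MBps figure below is an assumed read rate for illustration, not an AWS specification:

```python
# Rough estimate of EBS volume initialization time: read every block once.
# volume_gb is the volume size in GB (1 GB = 1,024 MB here); read_mbps is
# the sustained read throughput the instance can achieve against the volume.
def init_time_hours(volume_gb, read_mbps):
    total_mb = volume_gb * 1024
    return total_mb / read_mbps / 3600  # seconds -> hours

# Reading a 1,024 GB volume at an assumed 100 MBps takes roughly 2.9 hours.
print(round(init_time_hours(1024, 100), 1))  # 2.9
```

This is why teams often schedule the pre-read immediately after restoring from a snapshot, before the volume takes production traffic.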
IT teams that run Kubernetes clusters and use EBS to hold persistent state across containers must be particularly careful with storage architecture. A team can attach a single volume per EC2 instance and then divide it into virtual volumes for the various containers. Whether you deploy container infrastructure or a high-I/O database, use EC2 instances with local SSDs for optimal storage performance and configuration control. The I3 and F1 instance types use Non-Volatile Memory Express (NVMe) drives, which are particularly helpful for these workloads.