The most common problems organizations run into when scaling AWS apps depend on the type of application and, even more so, on the organization, said Sebastian Stadil, founder and CEO at Scalr Inc.
For example, small startups will make short-term time-to-market decisions and postpone scaling considerations, which is generally a good approach for their stage and size. Large organizations often have requirements to tie into existing systems that may not scale as well. Integrating with legacy systems is very difficult, especially when they weren't built to integrate with the new architectures that the cloud mandates.
Stadil noted that an additional factor to consider when deploying on Amazon Web Services rather than on a traditional enterprise server is that the cloud mandates new architectures. Instances are ephemeral: they fail more often, and engineers are expected to account for that by handling failure. AWS instances are also typically smaller than what might be procured from Dell or IBM, so engineers are expected to scale horizontally using distributed systems.
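One common way to account for instances that come and go is to retry a failed call with an increasing delay. The sketch below is only an illustration of that idea, not a method Stadil described; `TransientError`, `call_with_retries` and `make_flaky` are hypothetical names, and no AWS API is involved.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error raised when an instance or dependency is unavailable."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; jitter keeps many clients
            # from retrying in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures):
    """Build a stand-in operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("instance went away")
        return "ok after %d calls" % state["calls"]

    return operation
```

Calling `call_with_retries(make_flaky(2))` succeeds on the third attempt, while an operation that keeps failing raises `TransientError` once the attempts run out, so the caller can still decide what to do about a real outage.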
This leads to new challenges, which enterprises used to traditional hardware architectures are not accustomed to. Stadil said, "Handling failures is hard, especially when you don't have a great understanding of what the failure modes are because you don't know what exactly the underlying infrastructure is."
The problem is compounded because using distributed systems introduces new complexity. Stadil said that it is absurdly difficult to get the system architecture right even though it may seem easy when starting. There is a lot of theory to understand, and even then it's easy to make a big mistake.
Architecting for distributed systems
Owing to the ephemeral nature of AWS instances, Dayal Gaitonde, director of engineering at Appirio Inc., said it is not a good practice to maintain application state on the local file system or in memory on an application server. In distributed Web applications, shared state should be moved off the local file system or application server and into a shared location. For distributed caching, AWS provides ElastiCache (managed memcached), and Redis, an open source distributed cache, can be deployed on top of AWS.
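A minimal sketch of the idea follows, assuming a shopping-cart session as the shared state. `SharedStore` is a hypothetical in-process stand-in for a cache such as memcached or Redis; a real deployment would replace it with a client for the actual service.

```python
class SharedStore:
    """Stand-in for a shared cache such as memcached or Redis (hypothetical, in-process)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class AppServer:
    """An application server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # every server points at the same shared store

    def add_to_cart(self, session_id, item):
        cart = self.store.get(session_id) or []
        self.store.set(session_id, cart + [item])

    def view_cart(self, session_id):
        return self.store.get(session_id) or []


store = SharedStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

server_a.add_to_cart("session-42", "book")
del server_a  # simulate the instance being terminated
cart_seen_by_b = server_b.view_cart("session-42")
```

Because neither server holds the cart itself, the second server can still return it after the first is gone, which is what lets a load balancer send any request to any instance.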
Gaitonde also recommended that organizations think about separating static assets from dynamic applications. Typical Web applications include static assets (images, js, css files) as part of the application. This causes unnecessary load on the application servers and is slower than the alternatives. AWS has CloudFront and S3 as alternatives. Using CloudFront reduces the load on application servers and improves performance for end users.
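In application code, the separation often comes down to generating asset links that point at the content delivery network instead of the application server. The helper below is a sketch under assumptions: `cdn.example.com` is a placeholder hostname, and `asset_url` and the extension list are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical CDN hostname; a real deployment would use its own distribution domain.
CDN_BASE = "https://cdn.example.com/"
STATIC_EXTENSIONS = (".png", ".jpg", ".gif", ".js", ".css")


def asset_url(path, cdn_base=CDN_BASE):
    """Return a CDN URL for static assets and leave dynamic paths on the app server."""
    if path.lower().endswith(STATIC_EXTENSIONS):
        return urljoin(cdn_base, path.lstrip("/"))
    return path
```

With this in place, `asset_url("/img/logo.png")` resolves to the CDN host, while a dynamic path such as `/checkout` is returned unchanged and continues to be served by the application.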
Cloud-based labs reduce the barriers
Many organizations are now turning to cloud-based labs to spin up instances of their software development and testing environments more easily, said Theresa Lanowitz, a senior analyst with voke Inc. Tools from companies like Scalr, Skytap and CA Technologies help address the challenges of quickly configuring and deploying all of the components required for modern AWS applications. She said that while the process can be done without such tools, it tends to take more time, and subtle differences between the testing and deployment instances cause more problems.
Stadil noted that being able to prototype, develop and test on identical systems reduces fatal errors when going to production. Ultimately, nothing beats real-world testing. This is the only way to make sure the applications can scale and fail over without a hitch. It's important to keep in mind that one of the most common hurdles in scaling systems is misconfiguration of the many systems and services that go into an AWS application.
A good practice is to start small, use tooling to find bottlenecks as soon as possible and then develop iteratively. Another good practice is to load test throughout the development lifecycle using tools like SOASTA. Stadil said, "If you don't have that luxury, then the next best thing is investing in a good systems architect."
Architecting for failure
Another AWS scaling challenge comes from failing to architect for failure, said Mark Williams, CTO at Redapt Inc. Architecting for failure means ensuring that the end-to-end provisioning of the infrastructure, the instances and the application itself is automated and properly tested.
Consumers of AWS or any other public cloud that fail to do this risk significantly limiting their application's ability to scale or to recover from AWS infrastructure anomalies or maintenance events. Simply lifting and shifting workloads into AWS without investing in automation is a big risk.
Scaling cost effectively
Williams added that similarly, failing to put in place the proper cost visibility and organizational process and discipline can lead to a failure to scale properly with respect to the fiduciary priorities of the company or organization.
Depending on the public cloud provider, the performance constraints inherent in the specifications and capabilities of the available compute, storage and networking products can often lead to a significant gap between the infrastructure that would best fit the workload and the infrastructure that is actually made available.
Workload demands often change, and having the fungibility within public clouds to relaunch workloads on different offerings has its advantages. That said, in AWS, persistent storage I/O performance has been a recurring bottleneck, said Williams. Historically, the ephemeral storage offerings had weaker performance compared to common server architectures deployed in private clouds and data centers. This often leads to throwing more AWS resources at the same workload compared to common server/storage capabilities in private clouds and data centers.
The effects of inter-node communications
Another inherent performance limitation in AWS is the fact that because of its success, the more populated regions now have extremely large availability zones (AZs), said Williams. For applications that run on a large number of nodes, the comparatively larger inter-node/inter-instance latency within AZs is an additional tax on CPU thread performance for many applications.
Customers whose applications have run in single-tenant dedicated data center environments are accustomed to a rich low-latency environment, and it's often a surprise in production to see comparable node specifications perform so differently when enveloped by much larger inter-node latency. "With rare exceptions, AWS tenants cannot control the proximity of their multi-node application deployments, and realizing the impact of this additional performance overhead is often discovered late," Williams explained.
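A back-of-the-envelope calculation shows why the effect adds up. The figures below are purely hypothetical, not measurements from AWS or from Williams; the sketch only assumes that a request makes its inter-node round trips one after another.

```python
def request_time_ms(compute_ms, round_trips, latency_ms):
    """Total time for a request that makes sequential inter-node round trips."""
    return compute_ms + round_trips * latency_ms


# Hypothetical numbers: identical node specifications and workload,
# differing only in the latency of each inter-node round trip.
dedicated = request_time_ms(compute_ms=5.0, round_trips=20, latency_ms=0.1)
shared = request_time_ms(compute_ms=5.0, round_trips=20, latency_ms=0.5)
```

Under these assumptions the same request takes about 7 ms in the low-latency environment and about 15 ms in the higher-latency one, even though the compute time per node is unchanged.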
Testing the cost of growth
To address these challenges, Redapt has been using the Scalr cloud-testing environment to more closely align its development, testing and deployment infrastructure on AWS.
Investing in highly automated and parameterized deployments makes reacting to performance anomalies and significant maintenance events more reliable and predictable. Fully adopting the "replace versus troubleshoot" approach in responding to anomalous incidents in operations is a key maturity milestone in the transformation of operations and application architecture practices in the cloud.
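The "replace versus troubleshoot" approach can be sketched as a loop that discards unhealthy instances and launches fresh ones from the same parameterized template. This is an illustration only: `launch_instance` and `reconcile` are hypothetical stand-ins, and no cloud or Scalr API is called.

```python
import itertools

_ids = itertools.count(1)


def launch_instance(template):
    """Stand-in for a parameterized, automated launch (hypothetical; no cloud API is called)."""
    return {"id": "i-%04d" % next(_ids), "template": template, "healthy": True}


def reconcile(fleet, template, desired_count):
    """Drop unhealthy instances and launch replacements until the fleet is back to size."""
    survivors = [inst for inst in fleet if inst["healthy"]]
    replacements = [launch_instance(template) for _ in range(desired_count - len(survivors))]
    return survivors + replacements


fleet = [launch_instance("web-v1") for _ in range(3)]
fleet[1]["healthy"] = False  # an instance fails
fleet = reconcile(fleet, "web-v1", desired_count=3)
```

The failed instance is never inspected or repaired; it is simply dropped and a new one is built from the template, which only works if the deployment is automated enough that every launch produces an equivalent instance.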
Williams said, "With Scalr, the repeatability is inherent in the investment around automation of each piece of an application deployment. The ability to treat infrastructure as code and share this transparently between application developers and infrastructure operators is key to identifying performance anomalies quickly and eliminating the blame game."
Good metrics for scalability
A good measure of the impact of different AWS architecture and deployment choices is application uptime, which quantifies application availability and can be adversely affected by hardware, software or integration problems. Williams said the best practice here is to count everything that impacts the user experience, whether it's within the organization's control or not.
Tracking application uptime continuously, and being transparent about how it affects availability, helps drive the right priorities in investments to improve the metrics and outcome. Inevitably, greater investments in automation and self-service show benefits when outages occur. Also, having a platform upon which to replace failed components of the infrastructure rapidly, or to automatically scale additional resources when loads increase, helps shorten the time to first action and the time to recovery and stability, said Williams.
This translates to better predictability in setting expectations with the business about recovery times when significant problems occur. In these cases, cost savings may not be the best metric for showing the benefit. Again, application availability, and therefore minimized revenue impact (assuming the cloud applications drive measurable revenue), is the benefit of investing in this automation.
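As a worked example, availability can be computed as the share of a period during which the application was usable. The outage figures below are hypothetical, and `availability` is an illustrative helper, not a tool Williams referred to.

```python
def availability(period_minutes, outage_minutes):
    """Fraction of the period during which the application was usable.

    outage_minutes lists every user-visible outage, whatever its cause.
    """
    downtime = sum(outage_minutes)
    return (period_minutes - downtime) / period_minutes


# Hypothetical 30-day month with three outages (in minutes): a bad deploy,
# a provider maintenance event and a third-party integration failure.
month = 30 * 24 * 60
uptime = availability(month, [12, 45, 30])
print("%.3f%%" % (uptime * 100))  # prints 99.799%
```

All three outages count against the figure, including the two that were outside the organization's direct control, because users experienced each of them.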