AWS' tools and services cater to most enterprise IT needs. But no matter your enterprise's size, a move to AWS or a hybrid cloud infrastructure comes with new responsibilities and challenges -- most notably, managing disparate data sources.
Even though AWS and its third-party vendors provide proprietary and open source tools to manage and automate the infrastructure, ops teams need to use advanced AWS data management techniques to handle different data formats. Traditional database administrator skills are becoming outdated, but acquiring new capabilities can be tricky and costly.
Enterprise data is inherently diverse and disparate. For example, a retail business will have applications that serve different business areas, such as merchandising apps and warehouse management apps. Each department's system depends on data from other systems, but each also functions separately. In this case, data comes from heterogeneous systems in different formats, making AWS data management and integration a challenge. Also, business analytics systems cannot use raw operational data to support management summary dashboards and reports, so ops teams need to find a way to make this data readable.
AWS complements its services with management tools, but it still lacks a single window for AWS data management and integration. Compliance requirements that surround data storage further complicate the equation. The cloud's flexibility supports not only data growth but also a growing number of data sources, and together, disparate data and compliance mandates make AWS data management more complex. Enterprises can struggle to control and fulfill their side of the AWS shared responsibility model when it comes to storage. The following areas can cause trouble for ops teams.
Data comes in from many sources: users feed data, applications process user requests and devices generate usage logs. These disparate sources use different formats, ranging from flat files, JSON and XML documents to REST API responses and extracts from other databases. Applications and their respective databases serve different business purposes and have different data models, which causes data reconciliation problems. It then becomes difficult for a business user to gain insights on company performance.
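To make the reconciliation problem concrete, here is a minimal Python sketch that reads the same kind of record from a flat file, a JSON document and an XML document, then maps each onto one common shape. The field names (`sku`, `quantity`), the sample payloads and the helper functions are assumptions made up for illustration; real sources will have their own schemas and will need more validation than this.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def from_csv(text):
    """Parse a flat file with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return [dict(item) for item in json.loads(text)]


def from_xml(text):
    """Parse <orders><order>...</order></orders> into a list of dicts."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in record} for record in root]


def normalize(record):
    """Map a source record onto one common shape with typed fields."""
    return {
        "sku": str(record["sku"]).strip(),
        "quantity": int(record["quantity"]),
    }


# Hypothetical sample payloads, one per source format.
csv_data = "sku,quantity\nA100,3\n"
json_data = '[{"sku": "B200", "quantity": 5}]'
xml_data = "<orders><order><sku>C300</sku><quantity>7</quantity></order></orders>"

sources = ((from_csv, csv_data), (from_json, json_data), (from_xml, xml_data))
records = [normalize(r) for parser, payload in sources for r in parser(payload)]
```

Each parser only knows its own format, while `normalize` is the single place that decides what a clean record looks like. Keeping those two concerns apart is one way to stop every new source from adding another one-off transformation.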
Ops teams must manage multiple forms of data, resulting in complexities with storage, data movement, staging, archiving, security and loading -- all of which are likely separate services. They might need to use separate extract, transform and load (ETL) processes for different data sources, which increases the burden of monitoring and managing multiple ETL pipelines.
Data moves between any number of database and storage servers. IT teams need to understand what data needs to move in and out of the cloud, or else the company will incur significant data transfer costs.
For example, you can use Amazon Simple Storage Service (S3) to stage data from multiple sources, such as data centers, internet of things devices and REST APIs. Then teams can load this data -- in its different sizes and formats -- into an Amazon Redshift data warehouse. Application interfaces also write data directly to Redshift. With so many files and data sources, it becomes difficult to manage the files and ETL code. Redshift historically lacked procedural language support (stored procedures were added in 2019), so ETL code often lives in flat script files outside the database, where it is vulnerable to errors and loss.
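As a rough illustration of the S3-to-Redshift load step, the sketch below builds a Redshift COPY statement for a staged S3 prefix. The table name, bucket, IAM role ARN and the `build_copy_statement` helper are all hypothetical, and the function only produces the SQL text; running it against a cluster, and validating the inputs, is left to whatever database client a team already uses.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, data_format="CSV"):
    """Return a Redshift COPY statement for data staged under an S3 prefix.

    Values are interpolated directly for readability; this is an
    illustration, not input-safe SQL generation.
    """
    format_clauses = {
        "CSV": "FORMAT AS CSV IGNOREHEADER 1",
        "JSON": "FORMAT AS JSON 'auto'",
    }
    if data_format not in format_clauses:
        raise ValueError(f"unsupported format: {data_format}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"{format_clauses[data_format]};"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="sales.orders",
    s3_uri="s3://example-staging-bucket/orders/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    data_format="JSON",
)
print(statement)
```

Generating load statements from one function, rather than hand-editing a script per source, keeps the per-format differences in a single reviewable place that can sit in version control.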
The problems don't end there; an end-to-end data movement from S3 to Redshift requires multiple steps. Ops teams might, for example, stage files in S3 and create scripts to run ETL code that moves the data into Redshift. This move would need a scheduler instance that runs the code as a Python, Ruby or shell script. Relying on different interfaces can make the data transfer difficult and costly.
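The multistep nature of that move can be sketched as a small Python runner that a scheduler might invoke. The step names and the placeholder functions are assumptions for illustration; in practice each step would call S3, the ETL code or Redshift. The point of the sketch is the control flow: steps run in order, and a failure stops the run so a later step never loads partial data.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str = ""


def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    results = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record the failure and skip every remaining step.
            results.append(StepResult(name, False, str(exc)))
            break
        results.append(StepResult(name, True))
    return results


# Placeholder steps; real ones would talk to S3 and Redshift.
log = []


def stage_files():
    log.append("staged")


def transform_files():
    log.append("transformed")


def load_warehouse():
    log.append("loaded")


results = run_pipeline([
    ("stage", stage_files),
    ("transform", transform_files),
    ("load", load_warehouse),
])
```

Returning a result per step gives the scheduler something concrete to log or alert on, which helps when the stages are spread across separate services.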
Cloud vs. on-premises networking
Networking differs greatly between AWS and on-premises data centers. AWS uses virtual Ethernets, while on-premises deployments use physical Ethernets. Consequently, AWS networks do not natively support IP multicast and broadcast; they carry only unicast traffic. An enterprise often relies on multicast and broadcast to ensure applications and clustering stacks are highly available. To migrate live VMs, ops teams must send specific Address Resolution Protocol requests to the broadcast address, which is not possible with IP unicast. IP broadcasting is also a necessary component of the Dynamic Host Configuration Protocol setup that assigns IP addresses to computers on a LAN. The end result is that applications cannot broadcast to all connected devices the way they can on a LAN.
Third-party tools, such as Panoply and Stitch, provide a single-window option to integrate disparate data sources, enable a flexible data structure and map data types to ease transformation and movement to target data warehouses. These tools allow ops to manage ETL without having to write ETL code; they provide statistics on costs, compliance, security and use. These tools give IT the ability to gain valuable insights from data that would otherwise be too overwhelming and confusing to understand.