AWS has finally added an important feature previously missing from the Amazon Redshift data warehouse.
Amazon Redshift users have waited for user-defined functions (UDFs) almost as long as the product has been around; posts on the AWS developer forums requesting them date back to 2013, roughly five months after the product was introduced at re:Invent 2012. This month, AWS added support for UDFs written in Python.
UDFs are essentially custom functions that users write when the functions built into the database don't cover their needs. One example given in the Amazon Web Services (AWS) documentation calculates whether a salesperson gets a standard commission amount or 20% of the price the customer paid -- whichever is greater. This type of function, which takes one or more input values and returns a single value, is known as a scalar function.
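The commission logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the AWS documentation: the function name `commission_due`, its parameter names and the `f_commission_due` name in the SQL wrapper are all hypothetical, and the wrapper assumes Redshift's `CREATE FUNCTION ... LANGUAGE plpythonu` form for scalar Python UDFs.

```python
def commission_due(standard_commission, price_paid, rate=0.20):
    """Return the larger of a flat commission and a percentage of the price paid.

    All names here are hypothetical; the 20% default follows the example
    described in the article.
    """
    percentage_commission = rate * price_paid
    if standard_commission > percentage_commission:
        return standard_commission
    return percentage_commission


# Assumed shape of the equivalent Redshift definition: the Python body sits
# between the $$ markers of a CREATE FUNCTION statement. The function and
# argument names are illustrative only.
REDSHIFT_DDL = """
CREATE FUNCTION f_commission_due (standard_commission float, price_paid float)
  RETURNS float
STABLE
AS $$
  percentage_commission = 0.20 * price_paid
  if standard_commission > percentage_commission:
      return standard_commission
  return percentage_commission
$$ LANGUAGE plpythonu;
"""
```

Once a function like this is registered, it would be called from an ordinary query in the same way as a built-in function, with one result returned per input row.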
Amazon acknowledged the long wait in its blog post.
"This announcement is huge for analysts and data scientists," said Daniel Heacock, a consultant with c3/consulting, an IT consulting and managed services firm in Nashville, Tenn.
Analysts frequently work with data extracts to perform the statistical analysis and data manipulation their jobs require, Heacock said. This usually involves an extract, transform, load (ETL) process that gets the data into a format the analyst can work with. If the product of that analysis then becomes the content of a recurring business report or dashboard, it can take a large development effort to translate the data back into an interpretable format.
"This feature enables the analyst or data scientist to enter the development team and bypass the ETL step," Heacock said. "When it's time to productize, there's no need to develop an ETL process or convert the analysis to SQL."
Python libraries such as pandas also cover the kinds of functions analysts would otherwise build in SAS, R or SPSS, and AWS essentially simplifies how they are accessed, Heacock added.
Amazon Redshift has been around since late 2012, and competes against data warehouses that have been available for decades, such as Oracle's data warehousing platform or Microsoft's SQL Server, which offer UDFs through SQL.
"Having the ability to customize is an advantage, and bringing it to Amazon now at least brings them to parity with other data warehouses," said Nik Rouda, senior analyst with Enterprise Strategy Group Inc., an IT consulting firm in Milford, Mass. "In some ways, it's kind of handy in that they don't need to define everything in advance -- if they don't have it, you can add it yourself."
Python is among the most popular languages for data analytics because it is intuitive and easier to use than languages such as R, but those types of specialized languages can also be more powerful, Rouda said. Python is also used by developers to create other types of applications, so it lends itself to the existing IT skill set in many organizations.
One thing traditional data warehouse companies have begun to do that Amazon might consider is connecting to on-premises data repositories, as well as the cloud. Teradata Corp., an on-premises data warehouse maker based in Dayton, Ohio, now has Teradata Cloud, and Oracle also offers data warehousing on-premises and in the cloud.
"Hybrid environments might be something Redshift might want to look into," Rouda said. "Of course, being Amazon, they'd love for everything to go to the cloud. But the reality for many companies is that due to governance and other restrictions, that's just not going to happen."
Today, customers can use third-party tools to connect Redshift with on-premises data warehouses.