BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
The wait for user-defined Redshift functions is over. AWS announced support for Python 2.7-based user-defined functions, allowing developers to deploy custom functions for use in SQL statements.
User-defined functions (UDFs) in databases can streamline data manipulation operations and help enforce standard calculations. Without UDFs, some calculations must be implemented outside the database. This can lead to fragmentation with some code implemented in extract, transform and load (ETL) processes while other calculations are performed in Excel after downloading data to a spreadsheet. UDFs enable Amazon Redshift data warehouse users to call functions and perform calculations on data within the context of database queries.
Python is a logical choice for a UDF language. It is widely used, easy to learn and has a wide array of libraries that implement data manipulation and analysis functions. In addition to using core Python functionality, Amazon Redshift developers can use the Python standard library, which supports strings, data types, compression, XML and internationalization.
Data scientists and other data analysts will welcome support for NumPy, SciPy and Pandas libraries. NumPy and SciPy offer high-performance math routines; Pandas is another high-performance tool that supports data analysis and modeling. Redshift users can import their own custom Python modules as well. The steps are described in the UDF documentation.
Improving Python performance
Python is an interpreted language and performance is slower than compiled code. If an SQL function exists to perform a particular operation, chances are it will be faster than a Python script implementing the same function. An exception to this is that packages like NumPy and SciPy implement optimized, compiled code that is available to Python programs; such optimized packages should yield better performance than interpreted Python. When you need to implement a compute-intensive calculation, check NumPy and SciPy before implementing your own. Even if those libraries do not have exactly what you need, there may be optimized functions to improve the calculation.
One way to improve the performance of your Python code is to specify the lowest level of volatility to functions. Volatile functions can return different results on the same arguments. The compiler can make few optimization assumptions in such cases. The Redshift data warehouse supports three volatility levels: volatile, stable and immutable. Immutable functions always return the same type of results and are the most likely to be optimized. Stable functions will return the same results for all rows processed in a single statement, while volatile functions can return different results across calls even when the same arguments are passed to each call.
Python also has well-known limitations when it comes to concurrent processing, but those are not an issue when running in Redshift. The database manages the Python code and runs it in partitions, just as SQL code runs across partitions. The Redshift data warehouse manages all parallel processing and avoids bottlenecks that can occur in concurrent Python programs.
Redshift vs. Python data mapping and security privileges
Redshift data types map to their obvious Python counterparts. Several Redshift data types, including, "integer," "smallint," "bigint," "short" and "long," map to Python's "int" data type. Redshift's "decimal" and "numeric" data types map to Python's "decimal," while Redshift's "double" and "real" data types map to Python's "float" data type.
Redshift's "boolean" data type maps to "bool," "char" and "varchar" map to "string," and "timestamp" is equivalent to Python's "datetime" data type. The polymorphic data type "ANYELEMENT" is useful when taking advantage of Python's flexible typing system. It's important to keep these mappings in mind because you need to specify parameter and return Redshift value data types, not Python data types.
A couple privileges need to be in place to work with user-defined functions. Users must have the USAGE privilege for the Python language assigned to them to create UDFs. Once created, the owner and super user can execute them. Administrators must explicitly grant permission to other end users to run UDFs they have not created. Only the owner or the super user can update or delete a UDF.
The Redshift data warehouse limits some operations for reasonable security concerns. For example, user-defined functions cannot save files or access the network. In addition to security related constraints, there are more operationally oriented constraints. User-installed libraries are limited to 100MB. UDFs are subject to workload management rules, and, if maximum concurrency is reached, then UDFs are queued to run at a later time.
Amazon recommends prefixing your UDF names with f_. Amazon reserves f_ for UDFs and will not name any future Redshift functions with that prefix. This naming convention is designed to prevent naming conflicts between UDFs and Redshift-defined functions.
Monitor performance with Amazon Redshift
Quiz: Test your Amazon Redshift knowledge
AWS Lambda adds support for Python, VPC