Sep 11, 2015
Amazon Web Services (AWS) recently announced that Redshift, their managed MPP database offering, will support scalar user-defined functions (hereinafter UDFs). Redshift's UDFs rely on Python to define a program that transforms input data. Because Python is such an accessible language with a host of libraries, many users will find writing UDFs in Redshift more approachable than in other SQL dialects, which tend to rely on low-level, compiled languages for UDF construction.
This article is intended to give the technical user insight into how to create, use, and troubleshoot scalar UDFs in Redshift, as well as provide a few recipes to get started.
A scalar function returns a single result value for each input value. Performed over N rows, we'd expect our scalar function to return N values. There are five basic elements required to implement a scalar UDF in Redshift:
1. a function name (AWS recommends an f_ prefix in the name to avoid collisions with built-in function names);
2. input arguments, each with a name and a Redshift data type, e.g. date_column TIMESTAMP or zipcode VARCHAR;
3. the data type of the return value;
4. a volatility category (STABLE, VOLATILE, or IMMUTABLE);
5. the Python program itself.

See the official AWS documentation for specifics regarding function volatility as well as data-type mappings between Redshift and Python. This is the generalized syntax to create a scalar UDF:
CREATE [OR REPLACE] FUNCTION my_udf ( [argument redshift_data_type, ...] )
RETURNS redshift_data_type
{ STABLE | VOLATILE | IMMUTABLE }
AS $$
my_python_program
$$ LANGUAGE plpythonu;
The Python program is placed between the pair of $$ delimiters; the language, for the time being, is always plpythonu.
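For instance, here's a minimal sketch that fills in the skeleton above; f_double is a hypothetical name, chosen to follow the recommended f_ prefix convention:
CREATE OR REPLACE FUNCTION f_double (x float)
RETURNS float
IMMUTABLE
AS $$
# a NULL input arrives in Python as None, so guard against it
if x is None:
    return None
return x * 2
$$ LANGUAGE plpythonu;
Calling SELECT f_double(21.0); should return 42.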
Let's walk through an example of a scalar function that, for any input timestamp, extracts the week starting Sunday in %Y-%m-%d format. (For simplicity, I'm going to assume the input is a timestamp without time zone.)
CREATE FUNCTION week_starting_sunday (date_col TIMESTAMP)
RETURNS VARCHAR
STABLE
AS $$
from datetime import datetime, timedelta
# the TIMESTAMP argument arrives as a Python datetime; round-trip through a string to parse it uniformly
day = str(date_col)
dt = datetime.strptime(day, '%Y-%m-%d %H:%M:%S')
# weekday() runs 0 (Monday) through 6 (Sunday); step back to the most recent Sunday
week = dt - timedelta((dt.weekday() + 1) % 7)
return week.strftime('%Y-%m-%d')
$$ LANGUAGE plpythonu;
We can run a simple test on this function using
SELECT CURRENT_DATE AS today
, week_starting_sunday(CURRENT_DATE);
which returns
today | week_starting_sunday |
---|---|
2015-03-26 | 2015-03-22 |
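In practice, a function like this makes weekly rollups straightforward. A sketch, assuming a hypothetical events table with a created_at timestamp:
SELECT week_starting_sunday(created_at) AS week
, COUNT(*) AS num_events
FROM events
GROUP BY 1
ORDER BY 1;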
We can also write a scalar function that takes multiple columns as input. Take, for instance, a situation where we have latitude and longitude for both origin and destination. Let's write a UDF, called DISTANCE, that uses the haversine formula to calculate the approximate distance in miles between origin and destination.
CREATE FUNCTION DISTANCE (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
import math
r = 3963.1676 # earth's radius, in miles
# convert decimal degrees to radians
phi_orig = math.radians(orig_lat)
phi_dest = math.radians(dest_lat)
delta_lat = math.radians(dest_lat - orig_lat)
delta_long = math.radians(dest_long - orig_long)
# haversine formula
a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
* math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = r * c
return d
$$ LANGUAGE plpythonu;
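Before leaning on a UDF like this, it can help to sanity-check the same formula in a local Python session. A quick sketch, mirroring the UDF body above and using the first row's coordinates from the flights table below:
import math

def distance(orig_lat, orig_long, dest_lat, dest_long):
    # same haversine computation as the UDF body above
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = math.sin(delta_lat / 2) ** 2 \
        + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return r * c

# first row of the flights table below; expect roughly 2581 miles
print(distance(37.775, -122.4183333, 40.643779, -73.782274))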
Let's test this function on a table that captures flight information.
id | origin_latitude | origin_longitude | destination_latitude | destination_longitude |
---|---|---|---|---|
1 | 37.775 | -122.4183333 | 40.64377899999999 | -73.78227400000003 |
2 | 40.64377899999999 | -73.78227400000003 | 37.775 | -122.4183333 |
3 | 37.775 | -122.4183333 | 42.3656132 | -71.00956020000001 |
Running our UDF
SELECT id
, DISTANCE(origin_latitude, origin_longitude, destination_latitude, destination_longitude) AS distance_in_miles
FROM flights
yields
id | distance_in_miles |
---|---|
1 | 2580.96027955641 |
2 | 2580.96027955641 |
3 | 2698.1617901510726 |
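Note that rows 1 and 2 return the same distance, as they should: the haversine formula is symmetric in origin and destination, so the direction of travel doesn't matter.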
Scalar UDFs can be quite useful in certain circumstances. If performance is comparable, a UDF may be an ideal replacement for a difficult-to-follow stored procedure or convoluted SQL transformation, which can make the objective of a SQL query easier to grasp. There's also an opportunity to take slow-running procedures and speed them up with an imperative program written in a language that makes transforming data much simpler.
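For example, stripping the formatting characters out of a phone number takes a tangle of nested REPLACE calls in SQL, but only one line of Python. A sketch, where f_clean_phone is a hypothetical name:
CREATE FUNCTION f_clean_phone (raw VARCHAR)
RETURNS VARCHAR
IMMUTABLE
AS $$
import re
if raw is None:
    return None
# keep digits only, e.g. '(415) 555-0123' -> '4155550123'
return re.sub(r'[^0-9]', '', raw)
$$ LANGUAGE plpythonu;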
The choice to go with Python for UDF construction certainly opens up some doors. Data scientists should be able to move some interesting data exploration techniques typically reserved for Python (perhaps run on sample data or within a MapReduce framework) back into SQL, to be run on the full data set with less tedium.
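For instance, fuzzy string matching is awkward in plain SQL but nearly free with Python's standard library. A sketch, with f_similarity as a hypothetical name:
CREATE FUNCTION f_similarity (a VARCHAR, b VARCHAR)
RETURNS float
IMMUTABLE
AS $$
from difflib import SequenceMatcher
if a is None or b is None:
    return None
# ratio() scores similarity between 0.0 (no match) and 1.0 (identical)
return SequenceMatcher(None, a, b).ratio()
$$ LANGUAGE plpythonu;
A filter like f_similarity(name, 'Jon Smith') > 0.8 then runs over the full table in ordinary SQL.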