Redshift SQL Gotchas
If you're working within the AWS ecosystem, you've probably had to work with AWS Redshift, the de facto solution for data warehousing and for supporting business intelligence, reporting, and analytics tools.
Here are 3 tips to keep in mind when using Redshift as your team's data warehouse:
1) Primary Key and Foreign Key Constraints Are Not Enforced
Like other clustered database systems, AWS Redshift uses primary and foreign keys only as planning hints for certain statistical computations. Your ETL logic or source tables should be enforcing these constraints. Redshift is essentially a distributed column store, and the main reason it's able to batch writes and run massive selects and joins so quickly is that it doesn't need to handle uniqueness constraints from primary keys and indexes.
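For example, Redshift will happily accept duplicate values in a column declared as a PRIMARY KEY. Here's a minimal sketch (the users table and its columns are hypothetical) of what that looks like, along with one common way to dedupe using ROW_NUMBER():
-- The PRIMARY KEY below is informational only; Redshift won't enforce it.
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    email   VARCHAR(256)
);

-- Both inserts succeed, even though they share the same "primary key".
INSERT INTO users VALUES (1, 'a@example.com');
INSERT INTO users VALUES (1, 'b@example.com');

-- Deduplication has to happen in your own ETL or query logic, e.g.:
SELECT user_id, email
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY email) AS rn
    FROM users
) deduped
WHERE rn = 1;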
2) ISO 8601 and Timestamp Precision
An easy mistake to make when storing timestamps is to assume that serializing them as ISO 8601 strings automatically makes them portable across databases, languages, etc.
For example, the Go standard library supports nanosecond precision:
package main

import (
	"fmt"
	"time"
)

func main() {
	// An RFC 3339 timestamp with nanosecond-precision fractional seconds.
	t := "2020-07-01T01:23:45.999999999Z"
	res, err := time.Parse(time.RFC3339Nano, t)
	if err == nil {
		fmt.Println(res)
	}
}
2020-07-01 01:23:45.999999999 +0000 UTC
If you stored this string in Redshift and tried to cast it back to a DATE or TIMESTAMP with something like created_at::date, you'd get the following error:
org.postgresql.util.PSQLException: ERROR: date/time field value out of range
Looking at AWS's documentation, we see that Redshift only supports up to microsecond precision!
To prevent issues like this, teams should ensure that their ETL code understands the source table schemas and performs the appropriate transformations. Specifically for DynamoDB, even though DynamoDB itself only supports microsecond resolution, you should defensively check the string's precision with an rtrim, or parse with an explicit format like to_date(created_at, 'YYYY-MM-DDTHH:MI:SS'), as sketched below.
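As a minimal sketch, assuming the raw strings follow the RFC 3339 layout above and land in a varchar column called created_at in a hypothetical raw_events table, you can truncate the fractional seconds before casting:
-- Keep at most 'YYYY-MM-DDTHH:MI:SS.ssssss' (26 characters), i.e.
-- microsecond precision, before casting to a timestamp.
SELECT LEFT(created_at, 26)::timestamp AS created_at_us
FROM raw_events;

-- Or drop the fractional seconds entirely with an explicit format.
SELECT TO_TIMESTAMP(LEFT(created_at, 19), 'YYYY-MM-DDTHH:MI:SS') AS created_at_sec
FROM raw_events;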
Teams should also assess what level of precision their data actually needs. It's extremely rare for a system to genuinely need to measure time at nanosecond precision; at that level, you're at the mercy of system clock errors, clock drift, etc.
3) SQL Three-Valued Logic
This one isn't specific to Redshift, but is more of a tricky SQL gotcha. Let's say you have this sample query:
SELECT * FROM some_table WHERE name NOT IN ('a', 'b', 'c')
where name is nullable. Would you expect this to return results where name is NULL? The query will not return any rows where name is NULL, due to three-valued logic. NULL is a special marker designating that data is absent or missing. In the context of a boolean expression, consider:
NULL != 'a'
Three-valued logic defines the result of this expression as unknown, since we don't actually know the value of the data, and a WHERE clause only returns rows whose condition evaluates to true.
The correct query would then be:
SELECT * FROM some_table WHERE name IS NULL OR name NOT IN ('a', 'b', 'c')
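You can see the difference with a tiny demo (the table and values here are hypothetical):
CREATE TEMP TABLE some_table (name VARCHAR(10));
INSERT INTO some_table VALUES ('a'), ('d'), (NULL);

-- Returns only 'd': for the NULL row, name NOT IN ('a', 'b', 'c')
-- evaluates to unknown, so the row is filtered out.
SELECT * FROM some_table WHERE name NOT IN ('a', 'b', 'c');

-- Returns 'd' and the NULL row.
SELECT * FROM some_table WHERE name IS NULL OR name NOT IN ('a', 'b', 'c');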