Build JSON API Responses With Postgres CTEs
Pagination is a recurring problem that developers have to deal with when implementing data access layers for APIs. It can be particularly tricky with the more traditional RDBMSs like MySQL or PostgreSQL. For example, let's say we had an API endpoint that allowed consumers to search a database of movies. We could search by title, director, starring actors, etc. Our database has millions of movies, and we know we don't want to return all of the potential matches for every search request.
We only want to return the top 25 or so records and indicate in the response that there are more results to query for:
{
  meta: {
      total: 12000
    , limit: 25
    , next: <URL TO NEXT PAGE>
    , previous: <URL TO PREVIOUS PAGE>
  }
  , data: [ ... ]
}
This sounds fairly simple and straightforward at first glance. However, in order to make pagination possible through the API, we need to indicate how many records the original query would have returned prior to any paging, which effectively means we need to run the query twice, as efficiently as possible. We don't want to do this at the application level - pulling potentially hundreds of thousands of records into memory only to discard most of them risks crashing the application. There are a couple of approaches.
We could simply run two queries in parallel:
SELECT
  count(*)
FROM
  movies
WHERE
  director = 'Michael Bay';

SELECT
  title
  , release_date
  , director
FROM
  movies
WHERE
  director = 'Michael Bay'
ORDER BY title
LIMIT 25
OFFSET 50;
Simple and easy to reason about, but it is two separate queries and two round trips, doubling the load on the DB. We also still have to construct the response manually, combining the result of the count query with the aggregated data array.
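As a rough sketch, issuing both queries in parallel from application code might look like the following. This assumes a Node.js service using the node-postgres (pg) Pool; the pool setup, function name, and parameters are illustrative assumptions, not part of the original example.

const { Pool } = require('pg')

// connection details come from the standard PG* environment variables
const pool = new Pool()

async function searchByDirector (director, limit, offset) {
  // fire both queries at once so we only wait for the slower of the two
  const [countResult, pageResult] = await Promise.all([
    pool.query(
      'SELECT count(*) AS total FROM movies WHERE director = $1'
      , [director]
    )
    , pool.query(
      'SELECT title, release_date, director FROM movies WHERE director = $1 ORDER BY title LIMIT $2 OFFSET $3'
      , [director, limit, offset]
    )
  ])

  // count(*) is a bigint, which pg hands back as a string
  return {
    total: Number(countResult.rows[0].total)
    , data: pageResult.rows
  }
}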
Alternatively, we can push the count query down into a subquery:
SELECT
  title
  , release_date
  , director
  , (
      SELECT
        count(*)
      FROM
        movies
      WHERE
        director = 'Michael Bay'
    ) AS total
FROM
  movies
WHERE
  director = 'Michael Bay'
ORDER BY title
LIMIT 25
OFFSET 50;
This is a bit better: we can send a single query, and every row will have a column called total indicating the total number of records matched by the query prior to pagination. But we still have to check one of the records to grab the count and construct the data array ourselves. We can do better though.
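To make that concrete, here is a minimal sketch of the in-app assembly for the subquery version, assuming the same pg pool as above and a parameterized version of the SQL held in queryText (both names are assumptions):

const { rows } = await pool.query(queryText, [director, limit, offset])

// every row repeats the same total, so peek at the first one (if any)
const total = rows.length ? Number(rows[0].total) : 0

// strip the repeated total column off each record before returning it
const data = rows.map(({ total, ...movie }) => movie)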
Common Table Expressions
Common Table Expressions, or CTEs, provide a way to write auxiliary statements for use in a larger query. They can be thought of as defining temporary tables that exist just for that one query.
Postgres gives us a couple of interesting features that make this easy. We can use an inline Common Table Expression to get our count information, and we can use the JSON_AGG and ROW_TO_JSON functions to compile the data array of JSON-serialized results.
WITH
movie_count AS (
  -- total number of matching rows, before any paging
  SELECT
    count(*) AS count
  FROM
    movies
  WHERE
    director = 'Michael Bay'
)
, paged_movies AS (
  -- the single page of results we actually want to return
  SELECT
    title
    , release_date
    , director
  FROM
    movies
  WHERE
    director = 'Michael Bay'
  ORDER BY title
  LIMIT 25
  OFFSET 50
)
SELECT
  movie_count.count
  , COALESCE(
      JSON_AGG(
        ROW_TO_JSON(paged_movies.*)
      ) FILTER ( WHERE paged_movies.title IS NOT NULL )
    , '[]'::json
  ) AS data
FROM movie_count
LEFT JOIN paged_movies ON TRUE
GROUP BY movie_count.count;
This gives us a single result very close to the shape we want, and makes sure that we still get back an empty array if no rows match the query. This keeps our in-app processing to a minimum.
{
  count: 12000
  , data: [{
      "title": "Transformers"
      , "director": "Michael Bay"
      , "release_date": 2007
    }
    , ... ]
}
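On the application side, consuming this becomes trivial; here is a sketch, again assuming the pg pool from earlier and a parameterized version of the CTE query held in cteText:

const { rows } = await pool.query(cteText, [director, limit, offset])

// exactly one row comes back: the pre-pagination count plus the JSON page
const { count, data } = rows[0]
const total = Number(count)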
The SQL isn't terribly difficult to read or understand, and we can use the original parameters together with the count to determine whether there are any additional pages, which keeps the vast majority of the work at the database layer, and in a single query:
const prefix = '/api/v1/movies'
const next_val = offset + limit
const prev_val = offset - limit
const next = next_val < total
? `${prefix}?limit=${limit}&offset=${next_val}`
: null
const prev = prev_val < 0
? null
: `${prefix}?limit=${limit}&offset=${prev_val}`
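Putting it all together, the response body described at the top of the post can be assembled directly from the single query result (field names follow the example shape above):

const response = {
  meta: {
    total
    , limit
    , next
    , previous: prev
  }
  , data
}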
Not bad! It isn't always immediately obvious, but if you find yourself manipulating data structures in your application, it is probably a job for your database. In this case we were able to condense a tabular data structure spread across 2-3 queries into a single query that returned a nested JSON object with an array of objects. That is pretty cool.
postgres++