Build JSON API Responses With Postgres CTEs

Pagination is a recurring problem that developers have to deal with when implementing data access layers for APIs. It can be particularly tricky with the more traditional relational databases like MySQL or PostgreSQL. For example, let's say we had an API endpoint that allowed consumers to search a database of movies. We could search by title, director, starring actors, etc. Our database has millions of movies, and we know we don't want to return all of the potential matches for every search request.

We only want to return the top 25 or so records and indicate in the response that there are more results to query for:

{
  meta: {
    total: 12000
  , limit: 25
  , next: <URL TO NEXT PAGE>
  , previous: <URL TO PREVIOUS PAGE>
  }
, data: [ ... ]
}

This sounds fairly simple and straightforward at first glance. However, in order to make pagination possible through the API, we need to indicate how many records would have been returned by the original query prior to any paging, which effectively means we need to run the query twice, as efficiently as possible. We don't want to do this at the application level - pulling potentially hundreds of thousands of records into memory only to discard most of them and risk crashing the application. There are a couple of approaches.

We could just run two queries in parallel.

SELECT
  count(*)
FROM
  movies
WHERE
  director = 'Michael Bay';

SELECT
  title
, release_date
, director
FROM
  movies
WHERE
  director = 'Michael Bay'
ORDER BY title
LIMIT 25
OFFSET 50;

Simple, and easy to reason about. But that is two separate queries and two round trips to the database, doubling the load on it. We also still have to construct the response manually, combining the result of the count query with the rows aggregated into a data array.
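
As a rough sketch of that approach - assuming node-postgres (pg) and the table and column names from above - the application side might look something like this:

const { Pool } = require('pg')
const pool = new Pool()

async function searchByDirector (director, limit, offset) {
  // fire the count query and the page query in parallel
  const [countResult, pageResult] = await Promise.all([
    pool.query(
      'SELECT count(*) FROM movies WHERE director = $1'
    , [director]
    )
  , pool.query(
      'SELECT title, release_date, director FROM movies WHERE director = $1 ORDER BY title LIMIT $2 OFFSET $3'
    , [director, limit, offset]
    )
  ])

  // stitch the two result sets back together by hand
  return {
    total: Number(countResult.rows[0].count)
  , data: pageResult.rows
  }
}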

Alternatively, we can push the count down into a subquery.

SELECT
  title
, release_date
, director
, (
    SELECT
      count(*)
    FROM
      movies
    WHERE
      director = 'Michael Bay'
  ) AS total
FROM
  movies
WHERE
  director = 'Michael Bay'
ORDER BY title
LIMIT 25
OFFSET 50;

This is a bit better: we can send a single query, and every row will have a column called total indicating the total number of records matched by the query prior to pagination. But we still have to check one of the rows to grab the count and build the data array ourselves. We can do better though.
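
The application code gets a little simpler, but there is still some clean-up to do - roughly something like this, again assuming node-postgres and a subquerySql string holding the query above:

const { rows } = await pool.query(subquerySql)

// every row repeats the same total, so peek at the first one
const total = rows.length ? Number(rows[0].total) : 0

// drop the repeated total column from each row to build the data array
const data = rows.map(({ total, ...movie }) => movie)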

Common Table Expressions

Common Table Expressions, or CTEs, provide a way to write auxiliary statements for use in a larger query. They can be thought of as defining temporary tables that exist just for the duration of a single query.

Postgres gives us a couple of interesting features that make this easy. We can use an inline Common Table Expression to get our count information, and we can use the JSON_AGG and ROW_TO_JSON functions to compile the data array of JSON-serialized results.

WITH
  movie_count AS (
    SELECT COUNT(*) AS count
    FROM movies
    WHERE director = 'Michael Bay'
  )
, paged_movies AS (
    SELECT
      title
    , release_date
    , director
    FROM
      movies
    WHERE
      director = 'Michael Bay'
    ORDER BY title
    LIMIT 25
    OFFSET 50
  )

SELECT
  movie_count.count
, COALESCE(
    JSON_AGG(
      ROW_TO_JSON(paged_movies.*)
    ) FILTER ( WHERE paged_movies.title IS NOT NULL )
  , '[]'::json
  ) AS data
FROM movie_count
LEFT JOIN paged_movies ON TRUE
GROUP BY movie_count.count;

This gives us a single row very close to the shape we want, and it makes sure that we still get back an empty array if there are no rows that match the query. This keeps our in-app processing down to a minimum.

{
  count: 12000
, data: [{
    "title": "Transformers"
  , "director": "Michael Bay"
  , "release_date": 2007
  }
  , ... ]
}
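
On the application side there is very little left to do - a sketch, once more assuming node-postgres and a cteSql string holding the query above:

const { rows } = await pool.query(cteSql)

// the query always returns exactly one row shaped like { count, data }
const total = Number(rows[0].count)
const data = rows[0].data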

The SQL isn't terribly difficult to read or understand, and we can use the original limit and offset parameters, along with the count, to determine whether there are any additional pages. That keeps the vast majority of the work at the database layer, and in a single query:

const prefix = '/api/v1/movies'
const next_val = offset + limit
const prev_val = offset - limit

const next = next_val < total
  ? `${prefix}?limit=${limit}&offset=${next_val}`
  : null

const prev = prev_val < 0
  ? null
  : `${prefix}?limit=${limit}&offset=${prev_val}`
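
From there, the response shape from the top of the post is just assembly (a sketch; total and data come from the query result above):

const response = {
  meta: {
    total
  , limit
  , next
  , previous: prev
  }
, data
}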

Not bad! It isn't always obvious, but if you find yourself trying to manipulate data structures in your application, it is probably a job for your database. In this case we were able to condense a tabular data structure spread across 2-3 queries into a single query that returns a nested JSON object with an array of objects. That is pretty cool.

postgres++

sql postgres node.js