GraphQL - ORM

GraphQL is the new ORM.

REST and ORMs are both infamous for:

  • over-fetching: fetching more data than is needed per request
  • under-fetching: fetching less data than is needed, requiring multiple requests
  • select N+1 problem: under-fetching applied to multiple associated objects

GraphQL aims to overcome REST's shortcomings through a flexible query language, and succeeds in doing so on the client side. But on the server side, GraphQL resolvers have effectively recreated the same over- and under- fetching problems that have long plagued ORMs. The fact that ORMs remain popular despite of their inefficiency is a testament to the benefits of having in-memory objects behave consistently. There is no such trade-off for server-side GraphQL, where the only point of the objects is to be immediately serialized.

The so-called N+1 problem is generally acknowledged in the GraphQL community, but this article will argue only the symptoms are being addressed with workarounds like dataloader.

Example

The problems can be seen immediately in GraphQL's own introductory resolver example.

Query: {
  human(obj, args, context, info) {
    return context.db.loadHumanByID(args.id).then(
      userData => new Human(userData)
    )
  }
}

Human: {
  name(obj, args, context, info) {
    return obj.name
  }
}

What makes the name resolver trivial? It's pre-fetched by loadHumanByID, whose only parameter is id, so it's clearly unaware of whether name has been requested. What if the requested field was a nested object, or a json field, or just a large text blob? Then one would clearly be directed towards using a non-trivial resolver which fetches the field on demand. Of course, but then whatever work is common in the human id lookup is duplicated.

This is by no means specific to SQL or relational databases, but SQL is a convenient lingua franca of databases to demonstrate the inefficiency. The choices are:

  • SELECT * FROM human WHERE id = ?
  • SELECT field FROM human WHERE id = ? repeated for each "expensive" field

Even in the simplest possible example, over-fetching has already occurred, and the only proposed workaround is under-fetching. The single query we want is:

  • SELECT f1, f2, ... FROM human WHERE id = ? for requested fields

In other words, exactly what happens with ORMs, except even ORMs typically offer an option of requesting a subset of fields. Naturally the problem gets worse with list fields.

Human: {
  appearsIn(obj) {
    return obj.appearsIn // returns [ 4, 5, 6 ]
  }
}

Human: {
  starships(obj, args, context, info) {
    return obj.starshipIDs.map(
      id => context.db.loadStarshipByID(id).then(
        shipData => new Starship(shipData)
      )
    )
  }
}

Now there are two sets of associate keys (.appearsIn and .starshipIDs) that have been over-fetched. Nonetheless starships is under-fetched as well. The starships resolver is neither the most efficient nor the simplest way of resolving this field:

  • fetch all the data by human id in the starships resolver
  • fetch all the data in bulk by starshipIDs
  • push the resolution down to the Starship layer if forced to fetch one at a time

The example implementation seems to be going out of its way to showcase JavaScript promises. And the assumptions being made about the underlying data store are unusual:

  1. The data is relational in nature.
  2. Associative keys have neglible cost to pre-fetch.
  3. But joins are not available.
  4. And neither are primary key in queries.

Workaround

From GraphQL's best practices

GraphQL is designed in a way that allows you to write clean code on the server, where every field on every type has a focused single-purpose function for resolving that value. However without additional consideration, a naive GraphQL service could be very "chatty" or repeatedly load data from your databases.

This is commonly solved by a batching technique, where multiple requests for data from a backend are collected over a short period of time and then dispatched in a single request to an underlying database or microservice by using a tool like Facebook's DataLoader.

That's an understatement. It's not clear how "naive" differs from best practice.

Clearly there is value in transforming multiple primary key = queries into a single primary key in query. As in the starships example however, that can be done more simply in the parent resolver. There is more value in not needing a primary key in query at all. Furthermore adding caching to a dataloader avoids the central issue.

Again reminiscent of ORMS, any data layer can add caching. The point is efficiently resolving a query requires context, which strict adherence to single-purpose resolvers explictly disregards.

Aggregation

The inefficencies becomes even more glaring when moving beyond associative keys. Nearly any aggregation requires knowing what summaries are requested. Such as if the appearsIn field optionally included counts or times. Using SQL as an example again, the query would resemble one of:

  • SELECT distinct field FROM ...
  • SELECT field, count(*) FROM ... GROUP BY field

The conditional logic if "count" in requested_fields must exist in some form in the code, because the alternatives are over-fetching or under-fetching. Both of which are far more inefficient in this scenario than in the "select N+1" problem. A dataloader-esque approach is not going to be applicable to "group by" operations.

Computation

One last generalized example: computed fields. What are typically query flags in REST (and RPC) APIs, become field selections in GraphQL.

For example, computing scores in a search engine interface. Instead of a scores: Boolean! = false input option, the more obvious approach would be to skip score calculation when the score field isn't requested.

As with aggregation, the same pattern recurs. It's unacceptable to over-fetch, i.e., compute the scores when not needed. It's worse still to under-fetch, i.e., run some sort of lean search that will find matches and then go back and compute scores later.

Solution

The inescapable conclusion is that in some cases parent resolvers need to know which of their children are being requested. There's no need to throw away GraphQL's server-side resolver hierarchy. No one is advocating a thousand line root resolver named RunIt that processes the entire query.

All that's needed is a conceptual shift which encourages introspecting the GraphQLResolveInfo object. The requested fields are right there waiting to be useful, but good luck finding documentation and examples to that effect. In every non-trivial GraphQL project, this author has used a utility like:

In [ ]:
def selections(node):
    """Return tree of field name selections."""
    nodes = getattr(node.selection_set, 'selections', [])
    return {node.name.value: selections(node) for node in nodes}

Those fields would be checked or passed to a data layer as needed. For example, in Django's ORM, it could be as simple as appending a query set with .values(*selections(*info.nodes)).

Well, almost. The next issue is that typical GraphQL model validators raise an error if required fields are missing. Thanks validator; the field is missing because the client didn't request it.

This is actually a different age-old problem: equivocating "optional" and "nullable". GraphQL requires populating all requested fields, and specifiying whether they may be null. Server-side implementations understandably, but still incorrectly, interpret that by making nullables optional and non-nullables required at the model layer. So typically it's necessary to pad resolved objects with empty (but not null) data.

Although a minor problem, it reveals the bias related to single-purpose resolvers. The point of GraphQL is to efficiently return only the requested fields, yet standard practice in GraphQL models is to require populating fields that haven't been requested.

Over- and under- fetching can be addressed directly in resolvers, with the data layer's own interface, instead of hidden behind another abstraction layer.