Hunting N+1 errors in our Django code: the implicit queries case

Introduction

Like any company providing on-line services, Dashdoc strives to deliver performance to its beloved customers. If performance drops, everybody suffers, starting from our customers up to our customer success team, our CEO, and of course, us, techies. We are really a bunch of empathic Care Bears at Dashdoc.

Jokes aside, technically, we have several aspects we look into for performance: pure DB optimisation (Use the Index Luke), React UI performance, Django performance, and overall integration sanity (allow the UI to do bulk operations instead of tens or hundreds of similar calls for example).

We already talked about avoiding one aspect of N+1 in Django.

We described in that previous article two variations of this issue in our code:

The typical instance.related.attribute version: if instance.related was not prefetched it would issue a new DB call.
The less obvious instance.related_set.filter(column=value) version.

It gave an overview of how we tackled the second point. So in this article we will explain how we tackled the first issue which is: not doing the correct complete prefetch.

The problem

During 2022 we became obsessed with performance issues. We were aware that our Django code did undesired implicit requests in different places, and although we kept adding unit tests with assertNumQueries we knew we were not always creating the situation where those tools would catch extra queries.

Indeed, to catch N+1s in tests reliably, the set up is often tedious. Of course we will create numbers of first or second level entities - which are those we want to test - but the devil is in the details: we might forget to create enough remotely connected entities to catch redundant SQL queries, those entities that you will access in your business logic and completely forget they need to be prefetched.

The intuition

Eventually, we figured a way to gain some insight on those N+1. Specifically the case where a instance.relation wasn’t loaded. In Django’s ORM, there is one point where we can sit and look at the accesses to the data: FieldCacheMixin.get_cached_value.

Careful, monkey-patching ahead! Monkey-patching is an often frowned-upon technique to dynamically reconfigure code. Among the reasons that make reviewers frown are that it can introduce hard-to-discover behaviours and that it often patches internal APIs. However, sometimes the benefits outweigh the drawbacks so much! Also this is the approach used by APM tools. If you use some APM on your application, you have monkey-patches going on everywhere.

We monkey-patched FieldCacheMixin.get_cached_value with our own version which logged cache misses and a script was written to collect those logs and identify which QuerySet was missing a prefetch_related or a select_related.

This is the original monkey-patch:

def install_missed_prefetched_detector():
    def get_cached_value(self, instance, default=NOT_PROVIDED):
        cache_name = self.get_cache_name()
        try:
            return instance._state.fields_cache[cache_name]
        except KeyError:
            if default is NOT_PROVIDED:
                _REPORTER(cache_name, instance)
                raise
        return default

    FieldCacheMixin.get_cached_value = get_cached_value

We continued this tedious hand-crafted way for a few months, letting the logger log, and running manually a script to analyse our logs every now and then to find the next code paths to target.

The solution

Finally during a cool-down, we used the same monkey-patching approach and built a decorator for our unit tests: @break_on_missed_prefetches.

Its role was to make the test fail as soon as FieldCacheMixin.get_cached_value missed its cache, indicating an N+1.

Concurrently, we had already a helper testing class to centralize all the API calls we used in our tests, like:

class TestClientMixin:
   def api_create_transport(self, *args, **kwargs):
       self.client.post(
           # ...
       )

   def api_assign_trucker(self, *args, **kwargs):
       self.client.patch(
           # ...
       )

And all our tests use these methods. All we needed was to decorate those methods to start guarding our code against missed prefetches!

class TestClientMixin:
   @break_on_missed_prefetches
   def api_create_transport(self, *args, **kwargs):
       self.client.post(
           # ...
       )
   @break_on_missed_prefetches
   def api_assign_trucker(self, *args, **kwargs):
       self.client.patch(
           # ...
       )

Bingo! As soon as a missed prefetch happened, we got an exception and the test failed. No need to set up complex data situations and count the number of queries to detect the issue! The expected related object is not there? Then it is an error.

We later enforced decorating these methods with a Semgrep rule. By using it every day, we iterated the tool quite extensively.

Firstly, we improved this tool to also catch deferred field being fetched: .only(...) and .defer(...) can be used to avoid fetching all fields of a model, and in our case accessing those fields are a sign of a bug.
Secondly, we added the call stack of the location the query was executed, to that end, we monkey-patched QuerySet._fetch_all to assign the call stack to the model that were fetched. This solution being only useful for developers (and being quite expensive) is disabled by default. It is often useful because if you miss a prefetch, you most likely want to add select_related / prefetched_related to the query.
Finally, with a deep hierarchy of prefetch/select_related it can be hard to find the related object with only the message “implicit query accessing <field> on ModelClass”, to helps us we store the parent of children entities in FieldCacheMixin.get_cached_value. With this information we can go as far as constructing a likely answer to fix the query: “Hint: add ‘.select_related(grand_parent__parent__field’)” to the query.

Conclusion

Nowadays, the decorator is called @break_on_implicit_queries and reports a lot more information to make it even easier for us to pinpoint incomplete querysets. We go all the way into giving the faulty queryset and where it is defined, together with where the failing N+1 call happened. We also cross the boundary of Celery tasks to be able to debug them too.

Besides the decorator used in tests, we can turn on the mechanism on our developing environment to catch the mistakes early.

This tool has proved to be a very valuable addition to our performance tuning tool kit. Its integration in our unit testing makes sure we don’t let missed prefetches go live. Although it doesn’t completely avoid false positives (for example situations where doing the correct prefetch explicitly or letting Django do the query lazily implicitly makes no performance difference) we use it as a proactive tool to avoid heavy production performance incidents.

More recently, this tool caught an issue while migrating to Django 5. An extra query appeared in complex situations. The investigation is still ongoing to pin-point what is wrong though.

But we were not done yet, performance is an ever lasting battle and the following year, 2023, was crucial for us in that regard.