Database N+1 performance problems caused by Django ORM

Background description

Recently, when using Django, I found that after calling api, a large number of database query statements appear in the transactions under the same process of the database. After investigation, it was found that it was caused by the mechanism of Django ORM.

Django object relational mapper (ORM), as a popular feature of Django, is widely used in development. We can interact with the database through it to realize DDL and DML operations

Specifically, use QuerySet Object to retrieve data, and QuerySet is essentially through in a predefined model Manager Interact with the database.

Manager is an interface that Django model provides for database query. There is at least one manager object in each Model. But today's protagonist is QuerySet, which is not the key.

In order to express the problem more clearly, it is assumed that there are the following tables in the database:

device table, indicating the physical devices of the current network.

Interface table, which represents the interface owned by the physical device.

interface_ The extension table and the interface table are one-to-one. Because there are too many interface attributes, they are used to store some less commonly used interface attributes.

class Device(models.Model):
    name = models.CharField(max_length=100, unique=True)  # Device name when adding a device
    hostname = models.CharField(max_length=100, null=True)  # The hostname obtained from the device
    ip_address = models.CharField(max_length=100, null=True)  # Device management IP

class Interface(models.Model):
    device = models.ForeignKey(Device, on_delete=models.PROTECT, null=Falseļ¼Œrelated_name='interfaces')) # Which device does it belong to
    name = models.CharField(max_length=100)  # port name
    collect_status = models.CharField(max_length=30, default='active')
    class Meta:
        unique_together = ("device", "name")  # composite keys 
        
class InterfaceExtension(models.Model):
    interface = models.OneToOneField(
        Interface, on_delete=models.PROTECT, null=False, related_name='ex_info')
        
    endpoint_device_id = models.ForeignKey( # Bound terminal device
        Device, db_column='endpoint_device_id',
        on_delete=models.PROTECT, null=True, blank=True)
        
    endpoint_interface_id = models.ForeignKey(
        Interface, db_column='endpoint_interface_id', on_delete=models.PROTECT, # Interface of bound terminal device
        null=True, blank=True)

To put it simply, a device has multiple interfaces, and an interface has an extended attribute.

In the extended attribute of the interface, you can bind the interface on another device, so in the interface_extension also has two reference foreign keys.

In order to better analyze the process of executing SQL by ORM, you need to record the executed SQL in the following ways:

  • Open the log of sql log in django settings
  • Open the log recording sql log in MySQL

django, in settings Py, you can see the SQL execution process on the console:

DEBUG = True

import logging
l = logging.getLogger('django.db.backends')
l.setLevel(logging.DEBUG)
l.addHandler(logging.StreamHandler())

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'filters': {
        'require_debug_false': {
            '()': 'django.utils.log.RequireDebugFalse'
        }
    },
    'handlers': {
        'mail_admins': {
            'level': 'ERROR',
            'filters': ['require_debug_false'],
            'class': 'django.utils.log.AdminEmailHandler'
        },'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        'django.db': {
            'level': 'DEBUG',
            'handlers': ['console'],
        },
    }
}

Or configure it directly in MySQL:

# Check whether the function of recording SQL is on. It is off by default:
SHOW VARIABLES LIKE "general_log%";

# Turn on the recording function, and the specific log path will be displayed through the above command.
SET GLOBAL general_log = 'ON';

QuerySet

If you want to query through QuerySet, the name of the device of all interfaces:

interfaces = Interface.objects.filter()[:5] # hit once database

for interface in interfaces: 
    print('interface_name: ', interface.name,
          'device_name: ', interface.device.name) # hit database again

The first sentence above takes the first five interface records, and the corresponding raw sql is select * from interface limit 5; No problem.

However, when the device name to which the interface belongs is taken below, the database will be called repeatedly: when traversing an interface, it will pass the obtained device_id go to the database to query device_name. The corresponding raw sql is similar to: select name from device where id = {}

In other words, if there are 100000 interfaces, 100000 queries will be executed, and the performance consumption can be imagined. Including the previous query to find all interfaces, it is collectively referred to as the N + 1 query problem.

The solution is also very simple. If native SQL is used, there are usually two solutions:

  • When querying the interface for the first time, use join to associate the interface with the device. This makes only one database call.
  • Or, after the interface is queried, the required device is sent through the code logic_ IDs are collected in the form of a collection, and then queried through the in statement. Similar to SELECT name FROM device WHERE id in (...) Doing so will execute SQL only twice.

The specific choice should be analyzed in combination with specific scenarios, such as whether there is an index and the size of the table.

Back to QuerySet, how can QuerySet solve this problem? There are also two solutions, using the select provided in QuerySet_ Related() or prefetch_ The related() method.

select_related

After calling select_ When using the related () method, Queryset will query the foreign key relationship of the model together. It is equivalent to join in raw sql Query all data at the same time. select_ The main application scenario of related () is: a model is associated with a foreign key (many to one), or there is a 1-to-1 relationship.

Take the device name of the search interface as an example:

interfaces = Interface.objects.select_related('device').filter()[:5] # hit once database

for interface in interfaces:
    print('interface_name: ', interface.name,
         'device_name: ', interface.device.name) # don't need to hit database again 

The above query SQL is similar to: select XX frominterface inner join device on interface device_ id = device. ID limit5. Note that this is inner join because it is a non empty foreign key.

select_related() also supports the case where multiple foreign keys are associated in a model: for example, expand the interface, query the bound device name and interface name:

ex_interfaces = InterfaceExtension.objects.select_related(
    'endpoint_device_id', 'endpoint_interface_id').filter()[:5] 

# or

ex_interfaces = InterfaceExtension.objects.select_related(
    'endpoint_device_id').select_related('endpoint_interface_id').filter()[:5] 

The above SQL is similar to:

SELECT XXX FROM interface_extension LEFT OUTER JOIN device ON (interface_extension.endpoint_device_id=device.id) 
LEFT OUTER JOIN interface ON (interface_extension.endpoint_interface_id=interface.id)
LIMIT 5

Since it is a nullable foreign key, it is left join

If you want to clear the foreign key relationship of QuerySet, you can use: QuerySet select_ Related (none).

prefetch_related

prefetch_related and select_ Like related, it is to avoid database calls when a large number of relationships are queried. Just to avoid the huge result set and efficiency problems caused by multi table join, select_related is more inclined to foreign keys (many to one) and one to one relationships.

Prefetch_ The implementation of related is similar to the second method of raw sql, which separates the relationship between queries, and then combines them through python code. So prefetch_related can well support one to many or many to many relationships.

Take the example of querying the device names of all interfaces:

interfaces = Interface.objects.prefetch_related('device').filter()[:5] # hit twice database

for interface in interfaces:
    print('interface_name: ', interface.name,
         'device_name: ', interface.device.name) # don't need to hit database again

Change to prefetch_ After related, the sql execution logic becomes as follows:

  1. "SELECT * FROM interface "
  2. "SELECT * FROM device where device_id in (.....)"
  3. Then the relationships between them are combined through python code.

The same is true for querying which interfaces all devices have:

devices = Device.objects.prefetch_related('interfaces').filter()[:5] # hit twice database
for device in devices:
    print('device_name: ', device.name,
          'interface_list: ', device.interfaces.all())

The execution logic is also:

  1. "SELECT * FROM device"
  2. "SELECT * FROM interface where device_id in (.....)"
  3. Then the relationships between them are combined through python code.

If the relationship is changed to many to many, it will change to join and then in in the second step. You can try it directly.

However, it should be noted that when the QuerySet used has a new logical query, prefetch_ The related results will not take effect, but the database will be queried:

For example, add a condition to query which interfaces all devices have. The status of the interface is the up interface

devices = Device.objects.prefetch_related('interfaces').filter()[:5] # hit twice database
for device in devices:
    print('device_name: ', device.name,
         'interfaces:', device.interfaces.filter(collect_status='active')) # hit dababase repeatly

The execution logic becomes:

  1. "SELECT * FROM device"
  2. "SELECT * FROM interface where device_id in (.....)"
  3. Repeat the number of device times: "SELECT * FROM interface where device_id = xx and collect_status='up ';"
  4. Finally, it is combined through python.

The reason is: the previous prefetch_ The related query does not include the judgment of collect_status. So for QuerySet, this is a new query. So it will be re executed.

Can use Prefetch object Further control and solve the above problems:

devices = Device.objects.prefetch_related(
    Prefetch('interfaces', queryset=Interface.objects.filter(collect_status='active'))
    ).filter()[:5] # hit twice database
for device in devices:
    print('device_name: ', device.name, 'interfaces:', device.interfaces) 

The execution logic becomes:

  1. "SELECT * FROM device"
  2. "SELECT * FROM interface where device_id in (.....) and collect_status = 'up';"
  3. Finally, it is combined through python.

You can use the to of the Prefetch object_ Attr to change the name of the association relationship between:

devices = Device.objects.prefetch_related(
    Prefetch('interfaces', queryset=Interface.objects.filter(collect_status='active'), to_attr='actived_interfaces')
    ).filter()[:5] # hit twice database
for device in devices:
    print('device_name: ', device.name, 'interfaces:', device.actived_interfaces) 

You can see that Prefetch can control and associate those related objects.

Finally, for some cases with complex association structure, prefetch can be used_ Related and select_related are grouped together to control the logic of querying the database.

For example, you want to query the information of all interfaces and their device names, as well as the information of the bound peer devices and interfaces in the extended interface.

queryset = Interface.objects.select_related('ex_info').prefetch_related(
            'ex_info__endpoint_device_id', 'ex_info__endpoint_interface_id')

The execution logic is as follows:

  1. SELECT XXX FROM interface LEFT OUTER JOIN interface_extension ON (interface.id=interface_extension .interface_id)
  2. SELECT XXX FROM device where id in ()
  3. SELECT XXX FROM interface where id in ()
  4. Finally, it is combined through python.

The first step is due to interface and interface_extension is a 1-to-1 relationship, so select is used_ Related associates them.

Step 2 and 3: Although the interface_extension and endpoint_device_id and endpoint_interface_id is a foreign key relationship. If you continue to use select_related will join four tables consecutively and replace them with select_related, for interface_ The attribute associated with the extension foreign key uses the in query because the interface_ The attributes of the extension table are not often used.

summary

In this article, the causes of Django N +1 problem are introduced. The solution is to call the select of QuerySet_ Related or prefetch_related method.

For select_ For related, the application scenario is mainly in foreign keys and one-to-one relationships. Corresponding to the native SQL is similar to the JOIN operation.

For prefetch_ For related, application scenarios are mainly IN many to one and many to many relationships. The corresponding native SQL is similar to the IN operation.

The Prefetch object allows you to control select_related and prefetch_related is associated with those related objects.

Finally, in each QuerySet, you can combine select_related and prefetch_ Change the logic of querying the database in related mode.

reference resources

https://docs.djangoproject.com/en/3.1/ref/models/querysets/](https://docs.djangoproject.com/en/3.1/ref/models/querysets/)

https://medium.com/better-programming/django-select-related-and-prefetch-related-f23043fd635d

https://stackoverflow.com/questions/39669553/django-rest-framework-setting-up-prefetching-for-nested-serializers

[https://medium.com/@michael_england/debugging-query-performance-issues-when-using-the-django-orm-f05f83041c5f

Tags: Python

Posted by sparc-e on Thu, 12 May 2022 13:49:59 +0300