A 900w data table, SQL optimization executed in 16s to 300ms?

  • 1, Foreword
  • confirm

 

There is a financial flow table without sub database and sub table. The current data volume is 9555695. The limit is used for paging query. The query before optimization takes 16 s 938 ms (execution: 16 s 831 ms, fetching: 107 ms). After adjusting SQL in the following way, it takes 347 ms (execution: 163 ms, fetching: 184 ms);

Operation: put the query criteria into the sub query. The sub query only checks the primary key ID, and then use the primary key determined in the sub query to query other attribute fields;

Principle: reduce meter return operation;

-- Before optimization SQL
SELECT  Various fields
FROM `table_name`
WHERE Various conditions
LIMIT 0,10;

  

 
-- After optimization SQL
SELECT  Various fields
FROM `table_name` main_tale
RIGHT JOIN
(
SELECT  Subquery only checks primary key
FROM `table_name`
WHERE Various conditions
LIMIT 0,10;
) temp_table ON temp_table.Primary key = main_table.Primary key

 

 

Principle analysis found: why does MySQL use limit affect performance?

1, Foreword

First, explain the MySQL version:

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.7.17    |
+-----------+
1 row in set (0.00 sec)

 

 

Table structure:

mysql> desc test;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| val    | int(10) unsigned    | NO   | MUL | 0       |                |
| source | int(10) unsigned    | NO   |     | 0       |                |
+--------+---------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

 

 

id is a self incrementing primary key, and val is a non unique index.

Pour in a large amount of data, a total of 5 million:

mysql> select count(*) from test;
+----------+
| count(*) |
+----------+
|  5242882 |
+----------+
1 row in set (4.25 sec)

 

 

We know that when the offset in limit offset rows is large, efficiency problems will occur:

mysql> select * from test where val=4 limit 300000,5;
+---------+-----+--------+
| id      | val | source |
+---------+-----+--------+
| 3327622 |   4 |      4 |
| 3327632 |   4 |      4 |
| 3327642 |   4 |      4 |
| 3327652 |   4 |      4 |
| 3327662 |   4 |      4 |
+---------+-----+--------+
5 rows in set (15.98 sec)

 

 

In order to achieve the same purpose, we will generally rewrite it into the following sentence:

mysql> select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;
+---------+-----+--------+---------+
| id      | val | source | id      |
+---------+-----+--------+---------+
| 3327622 |   4 |      4 | 3327622 |
| 3327632 |   4 |      4 | 3327632 |
| 3327642 |   4 |      4 | 3327642 |
| 3327652 |   4 |      4 | 3327652 |
| 3327662 |   4 |      4 | 3327662 |
+---------+-----+--------+---------+
5 rows in set (0.38 sec)

 

 

The time difference is obvious.

Why the above results? Let's take a look at select * from test where Val = 4, limit 300000,5; Query process:

Query leaf index node data. According to the primary key value on the leaf node, cluster all the field values required by the query on the index.

Similar to the following figure:

 

 

Like the above, you need to query the index node for 300005 times, query the data of the cluster index for 300005 times, and finally filter out the first 300000 results and take out the last 5. MySQL consumes a lot of random I/O to query the data of cluster index, and the data queried by 300000 random I/O will not appear in the result set.

Some people will ask: since the index was used at the beginning, why not query the last five nodes along the index leaf node, and then query the actual data in the cluster index. In this way, only 5 random I / OS are required, which is similar to the process in the following picture:

 

 

Actually, I also want to ask this question.

confirm

Let's confirm the above inference through practical operation:

In order to confirm that select * from test where val=4 limit 300000,5 is to scan 300005 index nodes and 300005 data nodes on clustered indexes, we need to know whether MySQL can count the times of querying data nodes through index nodes in an sql. I tried handler first_ read_* Series, unfortunately, no variable can meet the condition.

I can only confirm it indirectly:

There is a buffer pool in InnoDB. There are recently accessed data pages, including data pages and index pages. Therefore, we need to run two sql to compare the number of data pages in the buffer pool. The prediction result is to run select * from test a inner join (select id from test where Val = 4, limit 300000,5); After that, the number of data pages in the buffer pool is far less than select * from test where Val = 4, limit 300000,5; The corresponding quantity, because the previous sql only accesses the data page 5 times, and the latter sql accesses the data page 300005 times.

select * from test where val=4 limit 300000,5

 

 
mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('val','primary') and TABLE_NAME like '%test%' group by index_name;Empty set (0.04 sec)

 

 

It can be seen that there is no data page about the test table in the buffer pool at present.

mysql> select * from test where val=4 limit 300000,5;
+---------+-----+--------+
| id      | val | source |
+---------+-----+--------+|
3327622 |   4 |      4 |
| 3327632 |   4 |      4 |
| 3327642 |   4 |      4 |
| 3327652 |   4 |      4 |
| 3327662 |   4 |      4 |
+---------+-----+--------+
5 rows in set (26.19 sec)

mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('val','primary') and TABLE_NAME like '%test%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| PRIMARY    |     4098 |
| val        |      208 |
+------------+----------+2 rows in set (0.04 sec)

 

 

It can be seen that there are 4098 data pages and 208 index pages about the test table in the buffer pool at this time.

select * from test a inner join (select id from test where val=4 limit 300000,5) ;In order to prevent the impact of the last test, we need to empty buffer pool,restart mysql. 

mysqladmin shutdown
/usr/local/bin/mysqld_safe &
mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('val','primary') and TABLE_NAME like '%test%' group by index_name;

Empty set (0.03 sec)

 

 

Run sql:

mysql> select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;
+---------+-----+--------+---------+
| id      | val | source | id      |
+---------+-----+--------+---------+
| 3327622 |   4 |      4 | 3327622 |
| 3327632 |   4 |      4 | 3327632 |
| 3327642 |   4 |      4 | 3327642 |
| 3327652 |   4 |      4 | 3327652 |
| 3327662 |   4 |      4 | 3327662 |
+---------+-----+--------+---------+
5 rows in set (0.09 sec)

mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('val','primary') and TABLE_NAME like '%test%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| PRIMARY    |        5 |
| val        |      390 |
+------------+----------+
2 rows in set (0.03 sec)

 

 

We can see the obvious difference between the two: the first sql loads 4098 data pages into the buffer pool, while the second sql only loads 5 data pages into the buffer pool. In line with our prediction. It also proves why the first sql is slow: it reads a large number of useless data rows (300000) and finally discards them. Moreover, this will cause a problem: loading many data pages with low hot spots into the buffer pool will pollute the buffer pool and occupy the space of the buffer pool. Problems encountered

In order to ensure that the buffer pool is emptied every time we restart, we need to turn off innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. These two options can control the data in the buffer pool when the database is shut down and the data in the backup buffer pool loaded on the disk when the database is turned on.

Java has a wide range of knowledge, and the interview questions cover a wide range. The key points include: Java foundation, Java concurrency, JVM, MySQL, data structure, algorithm, spring, microservice, MQ, etc. the knowledge involved is so huge that we often have no way to start when reviewing. Today's Xiaobian brings you a set of Java interview questions with a very comprehensive question base, including Java foundation, Java collection, JVM, Java concurrency Spring, Redis, MySQL, Dubbo, Netty, MQ, etc., including 2000 Java backend knowledge points+

Data acquisition method: follow the official account: "programmer Bai Nannan" to obtain the above data

Posted by pacuran on Fri, 06 May 2022 21:37:00 +0300