Understanding MySQL index structure and query optimization

Previous article: Learn the explain tool of MySQL

(Note again that these articles on MySQL are based on version 5.7; the conclusions are not necessarily applicable to other versions.)

The official MySQL documentation (https://dev.mysql.com/doc/refman/5.7/en/optimization-indexes.html) contains the following description:

The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. But unnecessary indexes waste space and waste time for MySQL to determine which indexes to use. Indexes also add to the cost of inserts, updates, and deletes because each index must be updated. You must find the right balance to achieve fast queries using the optimal set of indexes.

In other words, the most direct and effective way to improve query performance is to create indexes, but unnecessary indexes waste space and add time for MySQL to decide which index to use. Indexes also increase the cost of inserting, updating and deleting data, because these operations must also maintain (update) the index tree. Therefore, you should learn to find the best set of indexes to optimize queries.

Index structure

References:

  1. Data structure and algorithm principle behind MySQL index http://blog.codinglabs.org/articles/theory-of-mysql-index.html

  2. Detailed explanation of Mysql BTree and B+Tree https://www.cnblogs.com/Transkai/p/11595405.html

  3. Why MySQL uses B + tree https://draveness.me/whys-the-design-mysql-b-plus-tree/

  4. Shallow MySQL and InnoDB https://draveness.me/mysql-innodb/

  5. Cartoon: what is a B tree https://mp.weixin.qq.com/s/rDCEFzoKHIjyHfI_bsz5Rw

What is an index

In MySQL, an index is a data structure that helps retrieve data efficiently. The most commonly used index data structure in MySQL is the B+Tree.

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows.

It is like being given a book and the title of an article. Without a table of contents, you may need to turn from the first page to the last to find the article with this title. With a table of contents, you only need to look up the title there to locate the article quickly.

Here, the book corresponds to a table in MySQL, an article to a row in the table, and the article title to a column of that row. The table of contents is then the index built on the title column. Retrieving an article from the book by its title corresponds to the SQL statement select * from book where title = ?. Accordingly, every time an article is added to the book (i.e. insert into book (title, ...) values ('Huashan Lunjian', ...)), the table of contents has to be maintained so that the new article Huashan Lunjian can be found in it. This corresponds to MySQL maintaining the index tree (B+Tree) of the title column for every inserted record.

Why use B+Tree

First of all, it should be made clear that MySQL itself has no direct relationship with the B+Tree. What is really related to the B+Tree is InnoDB, MySQL's default storage engine. The main job of a storage engine in MySQL is to store and retrieve data. Besides InnoDB, MySQL also supports other storage engines such as MyISAM as the underlying storage engine of tables (for details see https://dev.mysql.com/doc/refman/5.7/en/storage-engine-setting.html).

mysql> show engines;
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                        | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| MRG_MYISAM         | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| CSV                | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| PERFORMANCE_SCHEMA | YES     | Performance Schema                                             | NO           | NO   | NO         |
| BLACKHOLE          | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| InnoDB             | DEFAULT | Supports transactions, row-level locking, and foreign keys     | YES          | YES  | YES        |
| MyISAM             | YES     | MyISAM storage engine                                          | NO           | NO   | NO         |
| ARCHIVE            | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| MEMORY             | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| FEDERATED          | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+

When it comes to implementing an index, the following data structures may immediately come to mind.

(1) Hash table
Although a hash table offers O(1) lookup of a single row, it cannot support range queries or sorting well; such queries still require a full table scan.
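The difference between the two access patterns can be sketched in a few lines of Python. This is only an illustration: the key values are made up, a dict stands in for a hash index and a sorted list stands in for an ordered (tree-like) index.

```python
import bisect

# Hypothetical sample data: hero_id values used as index keys.
keys = [7, 3, 9, 1, 5]

# Hash-style index: key -> row position. Point lookups are O(1) on average.
hash_index = {key: pos for pos, key in enumerate(keys)}

# Ordered index: keys kept sorted, as they would be in a tree-based index.
ordered_index = sorted(keys)


def range_via_hash(low, high):
    """A hash index has no ordering, so every key must be inspected."""
    return sorted(k for k in hash_index if low <= k <= high)


def range_via_ordered(low, high):
    """An ordered index finds both ends of the range by binary search."""
    start = bisect.bisect_left(ordered_index, low)
    end = bisect.bisect_right(ordered_index, high)
    return ordered_index[start:end]


print(hash_index[9])            # point lookup by key
print(range_via_hash(3, 7))     # inspects all keys, then sorts the matches
print(range_via_ordered(3, 7))  # two binary searches plus a slice
```

Both range functions return the same keys; the point is that the hash version has to touch every key, while the ordered version only touches the matching slice.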

(2) Red-black tree
A red-black tree is a self-balancing binary search tree. On insertion and deletion it keeps the tree balanced through specific operations, so as to maintain high search performance.

Generally speaking, the index itself is also large and often cannot be kept entirely in memory, so it is stored on disk in the form of index files. Index lookups therefore incur disk I/O, which is far more expensive than memory access. The most important metric for judging a data structure as an index is thus the number of disk I/O operations during a lookup. In other words, the index should be organized to minimize the number of disk I/O operations in the search process.

Here, the number of disk I/O operations depends on the height of the tree. When the amount of data is large, a red-black tree becomes tall and causes more disk I/O, which hurts query efficiency.

(3) B-Tree
The B in B-Tree is commonly read as "balanced", not "binary". The B-Tree evolved from the balanced binary tree.

To reduce the height of the tree (that is, the number of disk I/O operations) and turn the thin, tall tree into a short, fat one, the B-Tree stores multiple elements in each node (a red-black tree stores only one element per node), and the elements within a node are arranged in increasing order from left to right. As shown in the figure below:

In fact, a B-Tree lookup performs no fewer comparisons than a binary search tree lookup, but in-memory comparisons and binary search within a node cost almost nothing compared with disk I/O. Because the B-Tree greatly reduces the height of the tree, it greatly improves search performance.

(4) B+Tree
The B+Tree is a further optimization of the B-Tree that makes it better suited to implementing on-disk index structures. The InnoDB storage engine uses the B+Tree to implement its indexes.

In the B-Tree structure diagram, you can see that each node contains not only key values but also data values. The storage space of each node is limited, so if the data values are large, each node can hold only a few keys, which increases the height of the B-Tree, increases the number of disk I/O operations per query, and thus hurts query performance. In a B+Tree, all data values are stored in leaf nodes on the same level in key order, while non-leaf nodes store only key information. This increases the number of keys each non-leaf node can hold, reduces the height of the B+Tree and improves efficiency.

Some background knowledge is needed here. A computer usually does not read from disk strictly on demand but reads ahead every time. Even if only one byte is needed, the disk starts from that position and reads a certain length of data sequentially into memory. The theoretical basis for this is the well-known principle of locality in computer science:

When a piece of data is used, the data near it is usually used soon afterwards.

Because sequential disk reads are very efficient (no seek time, only a little rotation time), read-ahead improves I/O efficiency for programs with good locality. The read-ahead length is generally an integer multiple of the page size.

A page is the logical block in which the computer manages storage. The hardware and operating system usually divide main memory and disk storage into consecutive blocks of equal size, each called a page (the default page size in many operating systems is 4KB). Main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The operating system then sends a read request to the disk, the disk locates the starting position of the data and reads one or more consecutive pages into memory, after which the fault handler returns and the program continues to run. (You can view the default page size of the operating system with the following command.)

$ getconf PAGE_SIZE
4096

Database designers make clever use of disk read-ahead by setting the size of a node to an integer multiple of the operating system page size, so that each node can be loaded completely with a single I/O.

InnoDB storage engine also has the concept of page, which is the smallest unit of disk management. The default size of each page in InnoDB storage engine is 16KB.

mysql> show variables like 'innodb_page_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
1 row in set (0.01 sec)

Generally, the primary key type of a table is INT (4 bytes) or BIGINT (8 bytes), and a pointer generally takes 4 or 8 bytes, so a page (a node in the B+Tree) holds roughly 16KB/(8B+8B)=1K key values (since this is an estimate, K is taken as 10^3 for ease of calculation). In other words, a B+Tree index with a depth of 3 can hold about 10^3 * 10^3 * 10^3 = 1 billion records.
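The estimate can be reproduced with a short Python calculation. It is a back-of-the-envelope sketch only: it assumes an 8-byte key and an 8-byte pointer per entry, ignores page headers and fill factor, and treats leaf pages like internal pages even though leaf pages of a clustered index hold full rows. The function names are made up for this illustration.

```python
import math

PAGE_SIZE = 16 * 1024  # InnoDB default page size in bytes
KEY_SIZE = 8           # BIGINT key (assumption)
POINTER_SIZE = 8       # pointer size (assumption)


def fanout(page_size=PAGE_SIZE, key_size=KEY_SIZE, pointer_size=POINTER_SIZE):
    """Approximate number of (key, pointer) entries that fit in one page."""
    return page_size // (key_size + pointer_size)


def max_entries(depth, per_node):
    """Upper bound on entries reachable through a tree of the given depth."""
    return per_node ** depth


def binary_tree_height(n):
    """Approximate height of a perfectly balanced binary tree with n keys."""
    return math.ceil(math.log2(n))


print(fanout())                  # 1024 entries per page
print(max_entries(3, fanout()))  # 1073741824, roughly one billion
print(binary_tree_height(10**9)) # 30 levels for a binary tree
```

With the same billion keys, a binary tree needs about 30 levels while the high-fanout tree needs 3, which is why the number of disk reads per lookup differs so much.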

The height of a B+Tree is generally between 2 and 4 levels. The InnoDB storage engine is designed to keep the root node resident in memory, so only 1 to 3 disk I/O operations are needed to find the row for a given key value.

Random I/O has a large impact on MySQL query performance, whereas reading data sequentially from disk is very fast, so we should also try to reduce the number of random I/O operations to improve performance. In a B-Tree, because any node may contain the target data, we always have to traverse subtrees from the root node to find the rows that meet the conditions, which causes a lot of random I/O. In a B+Tree, all data rows are stored in the leaf nodes, which are linked in order through a doubly linked list. When we traverse the data in the B+Tree (for example in a range query), we can move directly from one leaf node to the next, which ensures the performance of both forward and reverse traversal.

In addition, for those unfamiliar with the data structures mentioned above, an online data structure visualization demonstration tool is recommended to help quickly understand the mechanism of these data structures: https://www.cs.usfca.edu/~galles/visualization/Algorithms.html

Primary key

As mentioned above, in MySQL the index is a storage-engine-level concept, and different storage engines implement indexes differently. Here we mainly look at how MyISAM and InnoDB implement them.

MyISAM index implementation

The MyISAM engine uses the B+Tree as its index structure, and the data field of a leaf node stores the address of the data record. As shown in the figure below:

As can be seen from the figure above, MyISAM's index file and data file are separate, and the index file stores only the addresses of data records. MyISAM's indexing method is therefore called non-clustered, to distinguish it from InnoDB's clustered index.

InnoDB index implementation

InnoDB's primary key index also uses the B+Tree as its index structure, but the implementation is very different from MyISAM's. In InnoDB, the table data file itself is an index structure organized as a B+Tree, and the data field of its leaf nodes holds the complete data records. The key of this index is the primary key of the table, so the InnoDB table data file itself is the primary index.

The primary key index in the InnoDB storage engine is also called the clustered index. Because the InnoDB data file itself is clustered by primary key, InnoDB requires every table to have a primary key (MyISAM does not). If none is explicitly specified, InnoDB uses the first UNIQUE index whose key columns are all NOT NULL as the clustered index. If there is no such index, InnoDB automatically generates a hidden clustered index on a synthetic column holding row IDs; the row ID is a 6-byte field that increases monotonically as new rows are inserted. (See the official documentation for details: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html)

This implementation of the clustered index makes lookups by primary key very efficient, since the whole row is found directly.

In InnoDB, it is not a good idea to use a non-monotonically-increasing field as the primary key, because the InnoDB data file itself is a B+Tree. A non-monotonic primary key causes frequent page splits and adjustments of the data file to maintain the B+Tree properties when new records are inserted, which is very inefficient. Using an increasing field as the primary key is therefore a good choice.

Secondary index

MyISAM index implementation

In MyISAM, the primary key index and non-primary-key indexes (secondary keys, also known as auxiliary indexes) have the same structure; the only difference is that keys of the primary key index must be unique, while keys of an auxiliary index may repeat. This is not discussed further here.

InnoDB index implementation

In InnoDB, the data field of a non-primary-key index stores the primary key value of the corresponding record. In other words, all non-primary-key indexes of InnoDB refer to the primary key value as their data field. As shown in the figure below:

As can be seen from the figure above, a search through a non-primary-key index has to traverse two indexes: first the non-primary-key index to obtain the primary key, and then the primary key index tree to retrieve the complete record by that primary key.
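The two-step lookup can be modelled with two Python dicts. This is a conceptual sketch, not how InnoDB stores data: the dicts stand in for the two B+Trees, the rows are invented sample data, and find_by_name is a hypothetical helper.

```python
# Clustered index: primary key -> complete row (hypothetical sample rows).
clustered_index = {
    1: {"hero_id": 1, "hero_name": "Guo Jing", "book_id": 1},
    2: {"hero_id": 2, "hero_name": "Yang Guo", "book_id": 2},
    3: {"hero_id": 3, "hero_name": "Zhang Wuji", "book_id": 3},
}

# Secondary index on hero_name: indexed value -> primary key only.
secondary_index = {
    row["hero_name"]: pk for pk, row in clustered_index.items()
}


def find_by_name(hero_name):
    """Look up a row by hero_name in two steps."""
    # Step 1: the secondary index yields only the primary key.
    pk = secondary_index.get(hero_name)
    if pk is None:
        return None
    # Step 2: the clustered index yields the complete row ("back to the table").
    return clustered_index[pk]


print(find_by_name("Zhang Wuji"))
print(find_by_name("nobody"))
```

Note that secondary_index holds no row data at all, so updating a non-key column only touches clustered_index.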

So why do the leaf nodes of a non-primary-key index store the primary key value instead of the complete row, as the primary key index does, which would avoid the second lookup back to the table? On the one hand, this saves a lot of storage space. On the other hand, keeping multiple redundant copies of the data would make updates inefficient and make it hard to keep the copies consistent.

This also explains why overly long fields are not recommended as primary keys: all non-primary-key indexes refer to primary key values, so long primary key values make non-primary-key indexes too large.

Joint index

Official documents: https://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html

For example, INDEX idx_book_id_hero_name (book_id, hero_name) USING BTREE creates a joint index on the two columns book_id and hero_name.

A multiple-column index can be considered a sorted array, the rows of which contain values that are created by concatenating the values of the indexed columns.

A joint index compares multiple columns in order. Taking the joint index idx_book_id_hero_name as an example, book_id is compared first: entries with a smaller book_id are on the left and entries with a larger book_id are on the right; entries with the same book_id are then compared by hero_name. As shown in the figure below:

After understanding the structure of the joint index, we can introduce the leftmost prefix rule:

If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

That is, the entries in a joint index are sorted by the columns in their defined order. If a query does not respect this order, for example the where condition lacks col1 = ? and contains only col2 = ? and col3 = ?, the joint index cannot be used for the lookup. This should be obvious from the structure diagram of the joint index above: with only the col2 column, qualifying data cannot be located through the index tree.

According to the leftmost prefix rule, for INDEX idx_book_id_hero_name (book_id, hero_name), a query with where book_id = ? and hero_name = ? can certainly use the index. A query with where hero_name = ? and book_id = ? does not appear to conform to the leftmost prefix rule, but the MySQL optimizer reorders the two columns in the query conditions according to the existing index so that they conform to the rule and the index is used. This also answers why, in the previous article on the explain tool, the order of the two filter conditions hero_name and book_id in the where clause had been changed when viewed with the show warnings command.

For range queries on the columns of a joint index, first think about how the joint index is structured, and then check whether the filter conditions satisfy the leftmost prefix rule. In a range query, the range column can use the index (it must be part of a leftmost prefix), but the columns after the range column cannot. An index can also serve at most one range column, so if the query conditions contain two range columns, the index cannot be used for both.
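The leftmost prefix rule follows directly from the sort order, which a sorted list of tuples in Python can imitate. This is a sketch under assumptions: the entries are invented, the sorted list stands in for the leaf level of a joint index on (book_id, hero_name), and the three lookup functions are hypothetical helpers.

```python
import bisect

# Hypothetical entries of a joint index on (book_id, hero_name), kept sorted
# by book_id first and by hero_name within the same book_id.
entries = sorted([
    (1, "Guo Jing"),
    (1, "Huang Rong"),
    (2, "Yang Guo"),
    (3, "Zhang Wuji"),
    (3, "Zhao Min"),
])


def lookup_by_book_id(book_id):
    """Leftmost column only: binary search finds the contiguous range."""
    start = bisect.bisect_left(entries, (book_id,))
    end = bisect.bisect_left(entries, (book_id + 1,))
    return entries[start:end]


def lookup_by_both(book_id, hero_name):
    """Both columns: binary search on the full key."""
    i = bisect.bisect_left(entries, (book_id, hero_name))
    if i < len(entries) and entries[i] == (book_id, hero_name):
        return [entries[i]]
    return []


def lookup_by_hero_name(hero_name):
    """Second column only: hero_name is ordered only within one book_id,
    so binary search is impossible and every entry has to be checked."""
    return [entry for entry in entries if entry[1] == hero_name]


print(lookup_by_book_id(3))
print(lookup_by_both(1, "Huang Rong"))
print(lookup_by_hero_name("Yang Guo"))
```

The order in which the two conditions are written does not matter to lookup_by_both, which mirrors how the optimizer can reorder equality conditions to match the index.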

Optimization suggestions

Selection of primary key

When using the InnoDB storage engine, unless there is a special need, try to use an increasing field that is independent of the business as the primary key, and do not make the primary key field too long. The reasons were given above in the discussion of the index structure. For example, using the snowflake algorithm to generate a 64-bit integer (8 bytes, BIGINT type) as the primary key is a good choice.

Index selection

(1) When a table has few records, for example only a few hundred, indexing some columns may bring little benefit, so weigh index creation carefully for small tables. However, for a field that is unique in business terms, even if it is a combination of multiple fields, a unique key is recommended.

(2) When the selectivity of an index is very low, the index may bring little benefit. Index selectivity is the ratio of the number of distinct index values (also known as cardinality) to the number of table records, that is, count(distinct column name) / count(*). A common scenario is a status column that marks the state of a row, where the status is either 0 or 1. Out of 1 million rows in total, 500000 have status 0 and 500000 have status 1. Is it necessary to index this column on its own?

An index is best used when you need to select a small number of rows in comparison to the total rows.

This sentence is taken from an answer to the Stack Overflow question "MySQL: low cardinality/selectivity columns = how to index?" (for details, see: https://stackoverflow.com/questions/2386852/mysql-low-cardinality-selectivity-columns-how-to-index)

In the case above, where the status is either 0 or 1 and the two values are evenly distributed, the index may have no practical value. During the actual query, the MySQL optimizer may abandon the index after comparing the cost of a full table scan with that of an index scan, because walking the status index tree to collect primary key values and then looking up the final rows in the primary key index tree may cost more than a full table scan.
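To make the selectivity figure concrete, here is a small Python sketch of the count(distinct column name) / count(*) ratio. The data is synthetic, the function name is made up, and the column is assumed to contain no NULL values (SQL's count(distinct ...) would skip them).

```python
def selectivity(values):
    """count(distinct column) / count(*) for a list of column values."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# The status column from the text: 1 million rows, evenly split between 0 and 1.
status = [0] * 500_000 + [1] * 500_000

# A hypothetical unique column for comparison.
hero_ids = list(range(1, 1001))

print(selectivity(status))    # 2 distinct values over 1,000,000 rows
print(selectivity(hero_ids))  # every value distinct
```

A ratio close to 1 means each index entry narrows the search to very few rows, while a ratio close to 0 means each entry still matches a large share of the table.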

But what if only 10000 rows have status 1 and 990000 rows have status 0? Readers who would like to discuss this are welcome to leave a comment below!

Addendum: you can use the optimizer trace tool provided by MySQL to find out how MySQL decides whether to use an index and which index is best. See the following official documents for details.
https://dev.mysql.com/doc/internals/en/optimizer-tracing.html
https://dev.mysql.com/doc/refman/5.7/en/information-schema-optimizer-trace-table.html

Usage:

mysql> set session optimizer_trace="enabled=on",end_markers_in_json=on;
mysql> select * from tb_hero where hero_id = 1;
mysql> SELECT * FROM information_schema.OPTIMIZER_TRACE;

Note: enabling the trace tool affects MySQL performance, so use it only for temporary analysis of SQL and turn it off immediately afterwards.

mysql> set session optimizer_trace="enabled=off";

(3) When building an index on a varchar field, it is recommended to specify the index length. It is often unnecessary to index the whole field; the index length can be chosen according to how well the text prefix discriminates between rows [Note: index length and discrimination pull against each other. Generally, for string data, an index of length 20 achieves a discrimination above 90%. You can use count(distinct left(column name, index length)) / count(*) to measure the discrimination].

An index with a specified index length is called a prefix index (for details see https://dev.mysql.com/doc/refman/5.7/en/column-indexes.html#column-indexes-prefix).

With col_name(N) syntax in an index specification for a string column, you can create an index that uses only the first N characters of the column. Indexing only a prefix of column values in this way can make the index file much smaller. When you index a BLOB or TEXT column, you must specify a prefix length for the index.

Prefix index syntax is as follows:

mysql> alter table tb_hero add index idx_hero_name_skill2 (hero_name, skill(2));

A prefix index balances index size and query speed, but its drawback is that it cannot be used for group by and order by operations, nor as a covering index (that is, when the index itself contains all the data the query needs, so the data file is not accessed at all).
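Choosing a prefix length comes down to evaluating count(distinct left(column name, index length)) / count(*) for several lengths. A minimal Python sketch of that calculation follows; the names are invented sample values and prefix_selectivity is a hypothetical helper, not a MySQL function.

```python
def prefix_selectivity(values, length):
    """count(distinct left(col, length)) / count(*) for a list of strings."""
    prefixes = {value[:length] for value in values}
    return len(prefixes) / len(values)


# Hypothetical hero_name values.
names = [
    "Zhang Wuji",
    "Zhang Sanfeng",
    "Zhao Min",
    "Zhou Zhiruo",
    "Guo Jing",
    "Guo Xiang",
]

# Try a few candidate prefix lengths and compare their discrimination.
for length in (1, 3, 5, 7):
    print(length, round(prefix_selectivity(names, length), 2))
```

For this sample, a prefix of 7 characters already distinguishes every row, so indexing more characters would only enlarge the index without improving discrimination.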

(4) When the where condition, group by or order by of a query involves multiple columns, consider a multiple-column index first, depending on the actual situation. This reduces the number of single-column indexes and helps queries run efficiently.

If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.

When creating a joint index, pay special attention to the order of the columns, taking into account the leftmost prefix rule described above and the actual filtering, grouping and sorting requirements. It is recommended to put the column with the highest discrimination on the far left.

Notes:

  • An order by field can be made part of the joint index and placed last in it, to avoid a filesort that hurts query performance. Positive example: where a = ? and b = ? order by c can use index idx_a_b_c. However, where a > 10 order by b cannot fully use index idx_a_b: only the first column a of the joint index is used, and the index order cannot be used for sorting.

  • When range and equality conditions are mixed, put the columns with equality conditions first when building the joint index. For example, for where c > ? and d = ?, even if c has higher discrimination, d should be placed first in the index, that is, index idx_d_c.

  • For where a = ? and b = ?, if the value of column a is almost unique, a single-column index idx_a is sufficient.

order by and group by

Try to perform grouping and sorting on indexed columns, following the leftmost prefix rule of the index. If the order by columns are not covered by an index, Using filesort appears in the execution plan, which reduces query performance.

Paging query

Most MySQL paging queries may be written as follows:

mysql> select * from tb_hero limit offset,N;

MySQL does not simply skip offset rows: it fetches offset+N rows, discards the first offset rows and returns the remaining N rows. When the offset is very large, this is very inefficient.

For pages beyond a certain threshold, the SQL can be rewritten as follows:

First locate the range of ids to fetch quickly, and then join on it

mysql> select a.* from tb_hero a, (select hero_id from tb_hero where condition limit 100000,20 ) b where a.hero_id = b.hero_id;

Or this way

mysql> select a.* from tb_hero a inner join (select hero_id from tb_hero where condition limit 100000,20) b on a.hero_id = b.hero_id;
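The rewrite only changes how the rows are located, not which rows come back. The following Python sketch checks that on a small in-memory table. It makes several assumptions: SQLite (via the standard sqlite3 module) stands in for MySQL, the table contents are generated, the where condition is omitted, and an order by hero_id is added so that both queries are deterministic. It says nothing about MySQL's actual performance.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tb_hero ("
    "hero_id integer primary key, hero_name text, book_id integer)"
)
conn.executemany(
    "insert into tb_hero values (?, ?, ?)",
    [(i, f"hero_{i}", i % 10) for i in range(1, 1001)],
)

OFFSET, N = 900, 20

# Plain paging: whole rows are fetched and the first OFFSET rows are discarded.
plain = conn.execute(
    "select * from tb_hero order by hero_id limit ? offset ?",
    (N, OFFSET),
).fetchall()

# Rewritten paging: page through the primary key only, then join back
# to fetch the complete rows for just those N ids.
deferred = conn.execute(
    "select a.* from tb_hero a "
    "inner join (select hero_id from tb_hero "
    "            order by hero_id limit ? offset ?) b "
    "on a.hero_id = b.hero_id "
    "order by a.hero_id",
    (N, OFFSET),
).fetchall()

print(plain == deferred)
print(plain[0], plain[-1])
```

The subquery touches only hero_id, so in MySQL it can be answered from an index without reading full rows for the skipped part of the result.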

Multi table join

(1) The data types of the fields being joined must be exactly the same;
(2) When joining multiple tables, make sure the join fields are indexed

Covering index

Use covering indexes for queries to avoid going back to the table, which would increase disk I/O. In other words, avoid select * as much as possible: select only the necessary columns and drop the useless ones.

An index that includes all the columns retrieved by a query. Instead of using the index values as pointers to find the full table rows, the query returns values from the index structure, saving disk I/O. InnoDB can apply this optimization technique to more indexes than MyISAM can, because InnoDB secondary indexes also include the primary key columns. InnoDB cannot apply this technique for queries against tables modified by a transaction, until that transaction ends.

Any column index or composite index could act as a covering index, given the right query. Design your indexes and queries to take advantage of this optimization technique wherever possible.

When the index itself contains all the columns the query needs, there is no need to go back to the table for the complete row. In InnoDB, a non-primary-key index contains all of its index columns plus the primary key values. Try to take advantage of this when querying to avoid going back to the table. When the amount of data is large, the improvement in query performance is significant.

in and exists

Principle: small tables drive large tables, that is, small data sets drive large data sets

(1) When the data set of table A is larger than that of table B, in is better than exists

mysql> select * from A where id in (select id from B)

(2) When the data set of table A is smaller than that of table B, exists is better than in

mysql> select * from A where exists (select 1 from B where B.id = A.id)

like

An index has the leftmost-prefix matching property of the B+Tree. If the leftmost part of the value is not fixed, the index cannot be used for the lookup, so try to avoid a leading wildcard (i.e. %xxx) or wildcards on both sides (i.e. %xxx%).

mysql> select * from tb_hero where hero_name like '%wu%';
+---------+------------+---------------------------+---------+
| hero_id | hero_name  | skill                     | book_id |
+---------+------------+---------------------------+---------+
|       3 | Zhang Wuji | The Nine Yang Manual      |       3 |
|       5 | Hua Wuque  | Flower transplanting jade |       5 |
+---------+------------+---------------------------+---------+
2 rows in set (0.00 sec)

mysql> explain select * from tb_hero where hero_name like '%wu%';
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | tb_hero | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    6 |    16.67 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

As can be seen, a query with wildcards on both sides scans the whole table. In this case the covering index feature can help: selecting only indexed fields lets the query be answered from the index. As follows:

mysql> explain select book_id, hero_name from tb_hero where hero_name like '%wu%';
+----+-------------+---------+------------+-------+---------------+-----------------------+---------+------+------+----------+--------------------------+
| id | select_type | table   | partitions | type  | possible_keys | key                   | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+---------+------------+-------+---------------+-----------------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | tb_hero | NULL       | index | NULL          | idx_book_id_hero_name | 136     | NULL |    6 |    16.67 | Using where; Using index |
+----+-------------+---------+------------+-------+---------------+-----------------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
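Why a fixed prefix can use the index while a leading wildcard cannot is easy to see on a sorted list of strings. The Python sketch below is an analogy only: the names are invented, the sorted list stands in for an index on hero_name, matching is case-sensitive (unlike MySQL's default collations), and like_prefix and like_contains are hypothetical helpers.

```python
import bisect

# Hypothetical hero_name values, kept sorted like the entries of an index.
names = sorted([
    "Guo Jing",
    "Hua Wuque",
    "Yang Guo",
    "Zhang Sanfeng",
    "Zhang Wuji",
    "Zhao Min",
])


def like_prefix(prefix):
    """Analogue of LIKE 'prefix%': binary search to the first candidate,
    then read forward until the prefix no longer matches."""
    start = bisect.bisect_left(names, prefix)
    result = []
    for name in names[start:]:
        if not name.startswith(prefix):
            break
        result.append(name)
    return result


def like_contains(fragment):
    """Analogue of LIKE '%fragment%': the sort order gives no starting
    point, so every entry has to be checked."""
    return [name for name in names if fragment in name]


print(like_prefix("Zhang"))
print(like_contains("Wu"))
```

Scanning every entry of a small index is still cheaper than scanning every full row, which is the effect the covering index achieves in the explain output above.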

count(*)

Alibaba's Java development manual contains the following rules:

Do not use count(column name) or count(constant) in place of count(*). count(*) is the standard syntax for counting rows defined by SQL92; it is independent of the database and of NULL or non-NULL values [Note: count(*) counts rows whose value is NULL, while count(column name) does not].
count(distinct col) counts the number of distinct non-NULL values of the column. Note that for count(distinct col1, col2), if one of the columns is entirely NULL, the result is 0 even if the other column has distinct values.

Here is an excerpt of the description of count from the official documentation (see: https://dev.mysql.com/doc/refman/5.7/en/aggregate-functions.html#function_count)

COUNT(expr): Returns a count of the number of non-NULL values of expr in the rows. The result is a BIGINT value. If there are no matching rows, COUNT(expr) returns 0.

COUNT(*) is somewhat different in that it returns a count of the number of rows, whether or not they contain NULL values.
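The NULL handling described above can be observed with a tiny table. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for MySQL (an assumption; the NULL semantics of these three forms are standard SQL), with invented rows. The two-column count(distinct col1, col2) form is not shown because SQLite does not accept it.

```python
import sqlite3

# SQLite stands in for MySQL here; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table tb_hero (hero_id integer primary key, skill text)")
conn.executemany(
    "insert into tb_hero values (?, ?)",
    [
        (1, "Nine Yang Manual"),
        (2, None),  # a row whose skill is NULL
        (3, "Nine Yang Manual"),
        (4, "Eighteen Dragon-Subduing Palms"),
    ],
)

count_star, count_col, count_distinct = conn.execute(
    "select count(*), count(skill), count(distinct skill) from tb_hero"
).fetchone()

# count(*) counts every row, count(skill) skips the NULL row,
# count(distinct skill) counts distinct non-NULL values.
print(count_star, count_col, count_distinct)

# With no matching rows, count(expr) returns 0 rather than NULL.
(no_match,) = conn.execute(
    "select count(skill) from tb_hero where hero_id > 100"
).fetchone()
print(no_match)
```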

Prior to MySQL 5.7.18, InnoDB processes SELECT COUNT(*) statements by scanning the clustered index. As of MySQL 5.7.18, InnoDB processes SELECT COUNT(*) statements by traversing the smallest available secondary index unless an index or optimizer hint directs the optimizer to use a different index. If a secondary index is not present, the clustered index is scanned.

As can be seen, before 5.7.18 MySQL processed count(*) by scanning the primary key index; from 5.7.18 on, it picks the smallest suitable non-primary-key index to scan. You can use explain to check the execution plan.

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.7.18    |
+-----------+
1 row in set (0.00 sec)

mysql> explain select count(*) from tb_hero;
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key       | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | tb_hero | NULL       | index | NULL          | idx_skill | 15      | NULL |    6 |   100.00 | Using index |
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select count(1) from tb_hero;
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key       | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | tb_hero | NULL       | index | NULL          | idx_skill | 15      | NULL |    6 |   100.00 | Using index |
+----+-------------+---------+------------+-------+---------------+-----------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

Some people wonder whether count(*) or count(1) is more efficient. Judging from the execution plans above, they are the same. If that is not convincing enough, the official documentation also states explicitly that InnoDB handles count(*) and count(1) in exactly the same way.

InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way. There is no performance difference.

other

Performing any operation (expression, function call, type conversion, etc.) on an indexed column prevents MySQL from using the index on that column, which can result in a full table scan.
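The effect can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's explain (an assumption: the two optimizers differ in detail, but both lose the index lookup when the indexed column is wrapped in a function). The table tb_demo and the index idx_hero_name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_demo (id INTEGER PRIMARY KEY, hero_name TEXT, book_id INTEGER)"
)
conn.execute("CREATE INDEX idx_hero_name ON tb_demo (hero_name)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Comparing the bare column lets the planner search the index.
plain_plan = plan("SELECT * FROM tb_demo WHERE hero_name = 'abc'")

# Wrapping the column in a function hides it from the index, so the table is scanned.
wrapped_plan = plan("SELECT * FROM tb_demo WHERE upper(hero_name) = 'ABC'")

print(plain_plan)
print(wrapped_plan)
```

The usual fix is to move the operation to the constant side of the comparison, so that the indexed column appears bare in the condition.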

Practical cases

A few weeks ago, a test colleague ran a stress test on one of the company's products, writing nearly 200 million rows into a single table. During the test, several of the queries behind the associated reports took too long, so we focused on a few slow SQL statements. To avoid exposing sensitive information, they are extracted and simplified here.

mysql> select count(*) from tb_alert;
+-----------+
| count(*)  |
+-----------+
| 198101877 |
+-----------+

Slow table join

After joining the tables, it takes 15 seconds to fetch the first 10 rows. The SQL and its execution plan are as follows:

mysql> select * from tb_alert left join tb_situation_alert on tb_alert.alert_id = tb_situation_alert.alert_id limit 10;
10 rows in set (15.46 sec)

mysql> explain select * from tb_alert left join tb_situation_alert on tb_alert.alert_id = tb_situation_alert.alert_id limit 10;
+----+-------------+--------------------+------------+------+---------------+------+---------+------+-----------+----------+----------------------------------------------------+
| id | select_type | table              | partitions | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra                                              |
+----+-------------+--------------------+------------+------+---------------+------+---------+------+-----------+----------+----------------------------------------------------+
|  1 | SIMPLE      | tb_alert           | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 190097118 |   100.00 | NULL                                               |
|  1 | SIMPLE      | tb_situation_alert | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   8026988 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+--------------------+------------+------+---------------+------+---------+------+-----------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

It can be seen that no index is used for the join. The composite primary key on the tb_situation_alert table is PRIMARY KEY (situation_id, alert_id), while the join column is alert_id, which does not satisfy the leftmost-prefix rule of the composite index. For this SQL alone there are two solutions. One is to create a separate index on the alert_id column of tb_situation_alert. The other is to change the column order of the composite primary key to PRIMARY KEY (alert_id, situation_id). Of course, it is not reasonable to change the primary key of a table used by other product lines just because one more report was configured, so here the alert_id column is indexed separately.

mysql> create index idx_alert_id on tb_situation_alert (alert_id);

mysql> select * from tb_alert left join tb_situation_alert on tb_alert.alert_id = tb_situation_alert.alert_id limit 100;
100 rows in set (0.01 sec)

mysql> explain select * from tb_alert left join tb_situation_alert on tb_alert.alert_id = tb_situation_alert.alert_id limit 100;
+----+-------------+--------------------+------------+------+---------------+--------------+---------+---------------------------------+-----------+----------+-------+
| id | select_type | table              | partitions | type | possible_keys | key          | key_len | ref                             | rows      | filtered | Extra |
+----+-------------+--------------------+------------+------+---------------+--------------+---------+---------------------------------+-----------+----------+-------+
|  1 | SIMPLE      | tb_alert           | NULL       | ALL  | NULL          | NULL         | NULL    | NULL                            | 190097118 |   100.00 | NULL  |
|  1 | SIMPLE      | tb_situation_alert | NULL       | ref  | idx_alert_id  | idx_alert_id | 8       | tb_alert.alert_id |         2 |   100.00 | NULL  |
+----+-------------+--------------------+------------+------+---------------+--------------+---------+---------------------------------+-----------+----------+-------+
2 rows in set, 1 warning (0.00 sec)

After optimization, the execution plan shows that the index is used for the join, and fetching the first 100 rows takes 0.01 seconds, compared with 15 seconds for the first 10 rows before.
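The leftmost-prefix behaviour behind this fix can be reproduced in a small sketch. It uses Python's built-in sqlite3 module and EXPLAIN QUERY PLAN as a stand-in for MySQL (an assumption), and a single-table lookup instead of the real join. The table tb_situation_alert_demo and its note column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_situation_alert_demo ("
    " situation_id INTEGER NOT NULL,"
    " alert_id INTEGER NOT NULL,"
    " note TEXT,"
    " PRIMARY KEY (situation_id, alert_id))"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


by_alert = "SELECT * FROM tb_situation_alert_demo WHERE alert_id = 42"
by_situation = "SELECT * FROM tb_situation_alert_demo WHERE situation_id = 7"

# situation_id is the leftmost column of the composite key, so it can be searched.
leading_plan = plan(by_situation)

# alert_id is the second column only, so the composite key cannot be used for a lookup.
before_plan = plan(by_alert)

# A separate index on alert_id turns the scan into an index search.
conn.execute("CREATE INDEX idx_alert_id ON tb_situation_alert_demo (alert_id)")
after_plan = plan(by_alert)

print(leading_plan, before_plan, after_plan, sep="\n")
```

This mirrors the change in the MySQL plans above, where tb_situation_alert goes from type ALL to type ref once idx_alert_id exists.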

Slow paging query

When paging from row 10,000,000 onward, the result takes 25 seconds to come back. The paging query optimization technique described earlier applies here. The execution plan was not examined when that technique was introduced, so it is examined here.

mysql> select * from tb_alert limit 10000000, 10;
10 rows in set (25.23 sec)

mysql> explain select * from tb_alert limit 10000000, 10;
+----+-------------+----------+------------+------+---------------+------+---------+------+-----------+----------+-------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-----------+----------+-------+
|  1 | SIMPLE      | tb_alert | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 190097118 |   100.00 | NULL  |
+----+-------------+----------+------------+------+---------------+------+---------+------+-----------+----------+-------+
1 row in set, 1 warning (0.00 sec)

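Now look at the same page fetched with the paging query optimization technique, where the subquery reads only alert_id and the outer query joins back on the primary key. The time drops from 25 seconds to about 2.3 seconds: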

mysql> select * from tb_alert a inner join (select alert_id from tb_alert limit 10000000, 10) b on a.alert_id = b.alert_id;
10 rows in set (2.29 sec)

mysql> explain select * from tb_alert a inner join (select alert_id from tb_alert a2 limit 10000000, 10) b on a.alert_id = b.alert_id;
+----+-------------+------------+------------+--------+---------------+---------------+---------+-----------+-----------+----------+-------------+
| id | select_type | table      | partitions | type   | possible_keys | key           | key_len | ref       | rows      | filtered | Extra       |
+----+-------------+------------+------------+--------+---------------+---------------+---------+-----------+-----------+----------+-------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL          | NULL          | NULL    | NULL      |  10000010 |   100.00 | NULL        |
|  1 | PRIMARY     | a          | NULL       | eq_ref | PRIMARY       | PRIMARY       | 8       | b.alert_id |         1 |   100.00 | NULL        |
|  2 | DERIVED     | a2         | NULL       | index  | NULL          | idx_processed | 5       | NULL      | 190097118 |   100.00 | Using index |
+----+-------------+------------+------------+--------+---------------+---------------+---------+-----------+-----------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
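The rewrite above can be sketched end to end on a small table. The example uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption), with a hypothetical table tb_alert_demo of 1000 rows. Note that the statements above have no ORDER BY, so MySQL does not guarantee which rows land on a given page (the derived table in the plan reads idx_processed, not the primary key). The sketch adds an explicit ORDER BY alert_id so that the two forms are comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_alert_demo ("
    " alert_id INTEGER PRIMARY KEY, alert_host TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO tb_alert_demo (alert_id, alert_host, payload) VALUES (?, ?, ?)",
    [(i, f"host-{i % 7}", "x" * 50) for i in range(1, 1001)],
)

PAGE_SIZE, OFFSET = 10, 900

# Plain paging: every skipped row is read in full before being discarded.
plain_rows = conn.execute(
    "SELECT * FROM tb_alert_demo ORDER BY alert_id LIMIT ? OFFSET ?",
    (PAGE_SIZE, OFFSET),
).fetchall()

# Deferred join: page over the key column only, then fetch full rows
# for just the keys that belong to the requested page.
deferred_rows = conn.execute(
    "SELECT a.* FROM tb_alert_demo a"
    " INNER JOIN (SELECT alert_id FROM tb_alert_demo"
    "             ORDER BY alert_id LIMIT ? OFFSET ?) b"
    " ON a.alert_id = b.alert_id"
    " ORDER BY a.alert_id",
    (PAGE_SIZE, OFFSET),
).fetchall()

page_ids = [row[0] for row in deferred_rows]
print(page_ids)
```

Both queries return the same page here; the difference on a large table is only in how much data is read while skipping the offset.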

Slow grouping aggregation

After analyzing the SQL, it turned out that the grouping aggregation itself was not slow. The poor performance came from going back to the table to fetch full rows after scanning the composite index. Removing unnecessary columns lets the query use a covering index.

To avoid exposing sensitive information, only the part of the SQL that precedes the aggregation, where the main problem lies, is shown here.
There is a composite index KEY idx_alert_start_host_template_id (alert_start, alert_host, template_id) on the table. The SQL before optimization is:

mysql> select alert_start, alert_host, template_id, alert_service from tb_alert where alert_start > {ts '2019-06-05 00:00:10.0'} limit 10000;
10000 rows in set (1 min 5.22 sec)

Removing the alert_service column, which is not part of the index, lets the query use a covering index and avoids going back to the table. The query time drops from over 1 minute to 0.03 seconds, as follows:

mysql> select alert_start, alert_host, template_id from tb_alert where alert_start > {ts '2019-06-05 00:00:10.0'} limit 10000;
10000 rows in set (0.03 sec)

mysql> explain select alert_start, alert_host, template_id from tb_alert where alert_start > {ts '2019-06-05 00:00:10.0'} limit 10000;
+----+-------------+----------+------------+-------+------------------------------------+------------------------------------+---------+------+----------+----------+--------------------------+
| id | select_type | table    | partitions | type  | possible_keys                      | key                                | key_len | ref  | rows     | filtered | Extra                    |
+----+-------------+----------+------------+-------+------------------------------------+------------------------------------+---------+------+----------+----------+--------------------------+
|  1 | SIMPLE      | tb_alert | NULL       | range | idx_alert_start_host_template_id   | idx_alert_start_host_template_id   | 9       | NULL | 95048559 |   100.00 | Using where; Using index |
+----+-------------+----------+------------+-------+------------------------------------+------------------------------------+---------+------+----------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)
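The difference between the two statements can be seen in a minimal sketch. It uses Python's built-in sqlite3 module and EXPLAIN QUERY PLAN as a stand-in for MySQL (an assumption: SQLite reports "COVERING INDEX" where MySQL reports "Using index" in Extra). The table tb_alert_demo is hypothetical; the column and index names follow the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_alert_demo ("
    " alert_id INTEGER PRIMARY KEY,"
    " alert_start TEXT, alert_host TEXT,"
    " template_id INTEGER, alert_service TEXT)"
)
conn.execute(
    "CREATE INDEX idx_alert_start_host_template_id"
    " ON tb_alert_demo (alert_start, alert_host, template_id)"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


condition = " FROM tb_alert_demo WHERE alert_start > '2019-06-05 00:00:10'"

# All selected columns live in the index: no lookup into the table is needed.
covered_plan = plan("SELECT alert_start, alert_host, template_id" + condition)

# alert_service is not in the index, so each match needs an extra table lookup.
uncovered_plan = plan(
    "SELECT alert_start, alert_host, template_id, alert_service" + condition
)

print(covered_plan)
print(uncovered_plan)
```

If alert_service really is needed by the report, the alternative would be to add it to the index, at the cost of a larger index and slower writes.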

summary

Any design that ignores the application scenario is not the best design. For example, table structure design and index creation should weigh data volume, query requirements, data update frequency, and so on.
In addition, the index conventions in Alibaba's Java Development Manual (for details, see "Exception handling, MySQL database" in the Java Development Manual) advise avoiding the following extreme misconceptions when creating indexes:

1) Better too many than too few: believing that every query needs its own index.
2) Better too few than too many: believing that indexes consume space and seriously slow down updates and inserts.

Tags: MySQL Optimize

Posted by jamkelvl on Mon, 16 May 2022 11:45:34 +0300