In this blog post, I’ll look at whether optimizing indexing is always the key to improving MySQL query performance (spoiler, it isn’t).
As we look at MySQL query performance, our first concern is often whether a query is using the right indexes to retrieve the data. This is based on the assumption that finding the data is the most expensive operation – and the one you should focus on for MySQL query optimization. However, this is not always the case.
Let’s look at this query for illustration:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
|
mysql> show create table tbl G
*************************** 1. row ***************************
Table: tbl
Create Table: CREATE TABLE `tbl` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`k` int(11) NOT NULL DEFAULT ‘0’,
`g` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql> explain select g,count(*) c from tbl where k<1000000 group by g having c>7 G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tbl
partitions: NULL
type: ALL
possible_keys: k_1
key: NULL
key_len: NULL
ref: NULL
rows: 998490
filtered: 50.00
Extra: Using where; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
mysql> select g,count(*) c from tbl where k<1000000 group by g having c>7;
+————+——+
| g | c |
+————+——+
| 28846 | 8 |
| 139660 | 8 |
| 153286 | 8 |
...
| 934984 | 8 |
+————+——+
22 rows in set (6.80 sec)
|
Looking at this query, many might assume the main problem is that this query is doing a full table scan. One could wonder then, “Why does the MySQL optimizer not use index (k)?” (It is because the clause is not selective enough, by the way.) This thought might cause someone to force using the index, and get even worse performance:
1
2
3
4
5
6
7
8
9
10
|
mysql> select g,count(*) c from tbl force index(k) where k<1000000 group by g having c>7;
+————+——+
| g | c |
+————+——+
| 28846 | 8 |
| 139660 | 8 |
...
| 934984 | 8 |
+————+——+
22 rows in set (9.37 sec)
|
Or someone might extend the index on (k) to (k,g) to be a covering index for this query. This won’t improve performance either:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
mysql> alter table tbl drop key k_1, add key(k,g);
Query OK, 0 rows affected (5.35 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> explain select g,count(*) c from tbl where k<1000000 group by g having c>7 G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tbl
partitions: NULL
type: range
possible_keys: k
key: k
key_len: 4
ref: NULL
rows: 499245
filtered: 100.00
Extra: Using where; Using index; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
mysql> select g,count(*) c from tbl where k<1000000 group by g having c>7;
+————+——+
| g | c |
+————+——+
| 28846 | 8 |
| 139660 | 8 |
...
| 915436 | 8 |
| 934984 | 8 |
+————+——+
22 rows in set (6.80 sec)
|
This wasted effort is all due to focusing on the wrong thing: figuring out how can we find all the rows that match k<1000000 as soon as possible. This is not the problem in this case. In fact, the query that touches all the same columns but doesn’t use GROUP BY runs 10 times as fast:
1
2
3
4
5
6
7
|
mysql> select sum(g) from tbl where k<1000000;
+———————+
| sum(g) |
+———————+
| 500383719481 |
+———————+
1 row in set (0.68 sec)
|
For this particular query, whether or not it is using the index for lookup should not be the main question. Instead, we should look at how to optimize GROUP BY – which is responsible for some 90% of the query response time.
In my next blog post, I will write about how MySQL performs the GROUP BY operation to provide further help on optimizing these queries.