MySQL Indexes
It dawned on me this morning, following a conversation with Casey, that I don’t know nearly enough about MySQL database administration. The vast majority of my DB experience is with Informix and Oracle. So, I did a little reading up and performed some experiments.
First question that I had was how to analyze the queries, and the tables, to determine when/which indexes should be applied. To begin I created a test database called monkey with just a few tables:
CREATE TABLE `order` ( `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY , `name` VARCHAR( 20 ) NOT NULL ) ENGINE = INNODB ; CREATE TABLE `suborder` ( `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY , `order_id` INT NOT NULL, `name` VARCHAR( 20 ) NOT NULL ) ENGINE = INNODB ; CREATE TABLE `infraorder` ( `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY , `suborder_id` INT NOT NULL, `name` VARCHAR( 20 ) NOT NULL ) ENGINE = INNODB ; insert into `order` ( name ) values ( 'Primates' ); insert into `suborder` ( order_id, name ) values ( 1, 'Strepsirrhini' ); insert into `suborder` ( order_id, name ) values ( 1, 'Haplorrhini' ); insert into `infraorder` ( suborder_id, name ) values ( 1, 'Lemuriformes' ); insert into `infraorder` ( suborder_id, name ) values ( 1, 'Chiromyiformes' ); insert into `infraorder` ( suborder_id, name ) values ( 1, 'Lorisiformes' ); insert into `infraorder` ( suborder_id, name ) values ( 2, 'Tarsiiformes' ); insert into `infraorder` ( suborder_id, name ) values ( 2, 'Simiiformes' );
Query analysis is done using the EXPLAIN statement. Here’s an example of how to use it:
explain extended select o.name, s.name, i.name from `order` o, suborder s, infraorder i where o.id = s.order_id and s.id = i.suborder_id and i.name = 'Simiiformes';
The output looks like this:
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------+---------+------+------+-------------+ | 1 | SIMPLE | o | ALL | PRIMARY | NULL | NULL | NULL | 1 | | | 1 | SIMPLE | s | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using where | | 1 | SIMPLE | i | ALL | NULL | NULL | NULL | NULL | 5 | Using where | +----+-------------+-------+------+---------------+------+---------+------+------+-------------+ 3 rows in set, 1 warning (0.01 sec)
It lists the tables in the order that they are joined in. The letters in the table column refer to the aliases that I declared in the query. It reads a row from the first table, then recursively finds a matching row in the second table, the third table, etc … In my example here it reads the one row from the order table first, then joins to the suborder table, and finally the infraorder table.
The ‘type’ column is probably the most important piece of data that is returned. The possible values, from best to worst, are system, const, eq_ref, ref, range, index, all. As you can see, our test query has returned nothing but ‘ALL’, the worst possible type of join in MySQL. ALL is basically a sequential scan, the optimizer is planning on reading every row in each table.
Let’s add a few indexes and see if we can’t improve this a little bit:
create index ix_subordid on infraorder ( suborder_id ); create index ix_ordid on suborder ( order_id );
+----+-------------+-------+------+------------------+----------+---------+-------------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+------------------+----------+---------+-------------+------+-------------+ | 1 | SIMPLE | o | ALL | PRIMARY | NULL | NULL | NULL | 1 | | | 1 | SIMPLE | s | ref | PRIMARY,ix_ordid | ix_ordid | 4 | monkey.o.id | 1 | | | 1 | SIMPLE | i | ALL | ix_subordid | NULL | NULL | NULL | 4 | Using where | +----+-------------+-------+------+------------------+----------+---------+-------------+------+-------------+ 3 rows in set, 1 warning (0.00 sec)
The join on the suborder table is now type ‘ref’. This is a good thing, it means that the query optimizer has chosen to use our index on the suborder table for every single row that gets returned from the query.
However, our index on the infraorder table isn’t being used. Why? I’m not entirely certain. I assume that it has to do with how little data there is in the table. If that’s the case, then the query optimizer assumes that returning all rows in order to satisfy the “i.name = ‘Simiiformes’” part of the where clause is more efficient than going to ix_subordid, and then only filtering on the names of those rows returned.
Let’s drop that index and create a multiple-column index instead:
drop index ix_subordid on infraorder; create index ix_infra on infraorder ( name, suborder_id );
+----+-------------+-------+------+------------------+----------+---------+-------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+------------------+----------+---------+-------------------+------+--------------------------+ | 1 | SIMPLE | o | ALL | PRIMARY | NULL | NULL | NULL | 1 | | | 1 | SIMPLE | s | ref | PRIMARY,ix_ordid | ix_ordid | 4 | monkey.o.id | 1 | | | 1 | SIMPLE | i | ref | ix_infra | ix_infra | 26 | const,monkey.s.id | 2 | Using where; Using index | +----+-------------+-------+------+------------------+----------+---------+-------------------+------+--------------------------+ 3 rows in set, 1 warning (0.00 sec)
Now we’re using both the indexes in both of those tables. There’s an important caveat to using a multiple-column index though. In the example that I’ve created here, if I were to construct and execute a query on the infraorder table that referenced the name column but not the suborder_id, it would still use the index. The opposite is not true.
If you have a multiple-column index that has 3 columns, let’s call them A, B, and C. If your query uses all three, A and B, or just A, it can still utilize the index. If it uses just B, or just C, or B and C, it can’t. If it refers to A and C, then it will use the index for column A, but it will have to filter out the results that don’t match column C.
I hope that makes sense. I can’t think of any clearer way to explain the concept.
There are a few rules of thumb when it comes to indexes. These are almost universal, they don’t just apply to MySQL. First of all, you want an index that has short keys. This refers to the key_len column when we do an explain. The shorter the keys, the higher the number that will be read from the disk in each block. Another benefit is it means more of the index will be cached in memory.
Another rule of thumb is that you want your indexes to be as unique as possible. If you are indexing a column, you have a 100,000 rows in the table, and the column is comprised of four unique values, then your index isn’t much use.
MySQL supports indexing only part of the field. So, if you have a varchar(200), and the majority of the entries in the column are unique across the first 10 characters, you can index just those first 10 characters. Now, instead of using 200 characters for each key in the index, it only uses 10.
There is overhead associated with indexes. As a result, you should add them sparingly. This is especially true if you require any sort of performance when it comes to data modification queries (insert, update, delete). You don’t just have to modify the data, the indexes have to be modified as well.
What do you gain from using multiple-column indexes? They are more unique, and MySQL will favor the index that the query plan assumes will return the fewest number of rows. Another really cool thing, and most databases do this, if an index has all of the data necessary to return the answer without consulting the data, then the engine will use just the index. Examples are queries where you are trying to find a count of rows, a min or a max, etc … These are called covered queries and there are more possibilities if your index refers to more columns. Don’t go listing all the columns though. Then you just end up duplicating the table in the index space. Remember, small keys.
Two other things before I go to bed. I keep mentioning the query optimizer. Optimizer’s are a fascinating subject. They rely on statistical information about the database in order to make an educated guess about how to execute the query. The ‘ANALYZE TABLE‘ command analyzes the table and stores the key distribution information that the optimizer uses to make some of those decisions. It needs to be run periodically, in some cases frequently, in order for the optimizer to function properly.
Another important statement is ‘OPTIMIZE TABLE‘. If you use varchar, or any other variable length data type, then deleting data fragments your table. The OPTIMIZE command is a defragmenter for your database. It also sorts the index keys, which can improve performance.
Wow … sleepy now. If you are in the market for an effective sedative, then I recommend reading the MySQL 5.0 Reference Manual.

March 27th, 2007 14:48
I wanted to make sure that I left a reply because this is really good information. I have to process it in small chunks because most of it is over my head. It’s a nice little resource.