Compound Indexes
Important rule, MySQL only supports using one index per join.
If a multiple-column index exists on col1 and col2, the appropriate rows can be fetched directly. If separate single-column indexes exist on col1 and col2, the optimizer tries to find the most restrictive index by deciding which index finds fewer rows and using that index to fetch the rows.
To demonstrate this I went and got some historical stock price data from this site. After that, I created a test table and added some indexes in order to test some concepts.
CREATE TABLE stockdata ( sdate DATE NOT NULL , ticker VARCHAR( 4 ) NOT NULL , open DECIMAL( 8,2 ) NOT NULL , high DECIMAL( 8,2 ) NOT NULL , low DECIMAL( 8,2 ) NOT NULL , close DECIMAL( 8,2 ) NOT NULL , volume INT NOT NULL ) ENGINE = MYISAM ; load data infile 'sp500hst.txt' into table stockdata fields terminated by ','; create index ix_sdate on stockdata ( sdate ); create index ix_ticker on stockdata ( ticker );
With the data loaded and indexes created on the ticker field and the sdate field, we’re ready to take a look at why this is important. Here’s the query that we’re going to use to test with:
explain extended select * from stockdata where ticker = 'DIS' and sdate between '2007-02-01' and '2007-02-28';
+----+-------------+-----------+------+--------------------+-----------+---------+-------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+------+--------------------+-----------+---------+-------+------+-------------+ | 1 | SIMPLE | stockdata | ref | ix_sdate,ix_ticker | ix_ticker | 6 | const | 245 | Using where | +----+-------------+-----------+------+--------------------+-----------+---------+-------+------+-------------+
According to the results of our EXPLAIN, the optimizer chose to use the index on the ticker field, but not the index on the sdate field. It chose that index because there are only 253 entries in the data with a ticker of ‘DIS’, but there are 9501 entries with an sdate in Feb 2007. Basically, the plan is for it to grab all 253 ‘DIS’ entries, and then it will apply a filter to them in order to find the ones that are within the date range that we are looking for. Note that under ‘rows’, the optimizer thinks that this index will return 245 rows instead of 253.
This isn’t exactly ideal. It would be nice to have the database utilize indexes on both those fields. Since the database can only use one index per join, we will have to rely on a multiple-column ( aka compound ) index to get the desired performance.
drop index ix_sdate on stockdata; drop index ix_ticker on stockdata; create index ix_compound on stockdata( sdate, ticker );
+----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+ | 1 | SIMPLE | stockdata | range | ix_compound | ix_compound | 9 | NULL | 8872 | Using where | +----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+
According to the optimizer it’s going to have to filter through 8872 rows. This appears to be less efficient than just having the index on the ticker field. Why? Because the order of the columns in a multiple-column index is very important and should be from most selective to least. In this example, we’ve ordered them just the opposite. Instead, let’s drop that index and recreate it with the fields in the opposite order:
drop index ix_compound on stockdata; create index ix_compound on stockdata( ticker, sdate );
+----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+ | 1 | SIMPLE | stockdata | range | ix_compound | ix_compound | 9 | NULL | 14 | Using where | +----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+
Down to an estimated 14 rows. The query actually returns 19. By utilizing a multiple-column index and carefully selecting the order of the columns in the index, I think this is about the best performance that we can get out of this query by just adjusting the indexes. I’m certain that there are other optimizations that could be performed. For instance, I could have done a better job structuring the table. Different data types would probably occupy less space on disk, require fewer IO reads, etc … That said, playing with these indexes probably gets us the biggest bang for the buck.
