Christoph Schiessl's Blog

Benchmark: Single-Column Indexes vs. Compound Indexes

I’ve known about compound indexes for many years but never really thought about them in detail. That doesn’t mean, I’ve never used them – quite on the contrary. At some point in the past I’ve made up some assumptions about when to use them and when to avoid them. However, until now I’ve never taken the time to put my assumptions to the test.

Benchmark Setup

First off, we need a table with records to run the benchmark against:

1
2
3
4
5
6
DROP TABLE IF EXISTS t;
CREATE TABLE t (a INTEGER NOT NULL, b INTEGER NOT NULL);
-- populate with one million records:
INSERT INTO t (a, b)
    SELECT * FROM GENERATE_SERIES(1,1000) AS a
       CROSS JOIN GENERATE_SERIES(1,1000) AS b;

Next, we need some random numbers to query for. I decided to compute these numbers in advance to avoid the random number generator during benchmark execution:

1
2
3
DROP TABLE IF EXISTS randx, randy;
SELECT TRUNC( 1 + (1000 * RANDOM()))::INTEGER AS x INTO randx FROM GENERATE_SERIES(1,50);
SELECT TRUNC( 1 + (1000 * RANDOM()))::INTEGER AS y INTO randy FROM GENERATE_SERIES(1,50);

Now, we have two new relations (named randx and randy), each containing 50 random numbers between 1 and 1000. Finally, we are defining three functions to repeatedly perform SELECT queries with different WHERE clauses. To get comparable results, these functions must all perform the same number of queries.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
-- performs 5000 `SELECT * FROM t WHERE a = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_a ()
    RETURNS VOID LANGUAGE plpgsql AS $$
        DECLARE
            x INTEGER;
            y INTEGER;
        BEGIN
            FOR x IN SELECT * FROM randx LOOP
                FOR y IN SELECT * FROM randy LOOP
                    PERFORM COUNT(*) FROM t WHERE t.a = x;
                    PERFORM COUNT(*) FROM t WHERE t.a = y;
                END LOOP;
            END LOOP;
        END;
    $$;

-- performs 5000 `SELECT * FROM t WHERE b = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_b ()
    RETURNS VOID LANGUAGE plpgsql AS $$
        DECLARE
            x INTEGER;
            y INTEGER;
        BEGIN
            FOR x IN SELECT * FROM randx LOOP
                FOR y IN SELECT * FROM randy LOOP
                    PERFORM COUNT(*) FROM t WHERE t.b = x;
                    PERFORM COUNT(*) FROM t WHERE t.b = y;
                END LOOP;
            END LOOP;
        END;
    $$;

-- performs 5000 `SELECT * FROM t WHERE a = ? AND b = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_ab ()
    RETURNS VOID LANGUAGE plpgsql AS $$
        DECLARE
            x INTEGER;
            y INTEGER;
        BEGIN
            FOR x IN SELECT * FROM randx LOOP
                FOR y IN SELECT * FROM randy LOOP
                    PERFORM COUNT(*) FROM t WHERE t.a = x AND t.b = y;
                    PERFORM COUNT(*) FROM t WHERE t.a = y AND t.b = x;
                END LOOP;
            END LOOP;
        END;
    $$;

Benchmark Execution

Ensure that the query optimizer uses all available indexes and doesn’t opt to use sequential scans:

1
SET enable_seqscan = off;

Benchmark without indexes:

1
2
3
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();

Benchmark with index on (a):

1
2
3
4
5
CREATE INDEX t_a_idx ON t (a);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_a_idx;

Benchmark with index on (b):

1
2
3
4
5
CREATE INDEX t_b_idx ON t (b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_b_idx;

Benchmark with indexes on (a) and (b):

1
2
3
4
5
6
CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_a_idx, t_b_idx;

Benchmark with compound index on (a,b):

1
2
3
4
5
CREATE INDEX t_ab_idx ON t (a,b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_ab_idx;

Benchmark with compound index on (b,a):

1
2
3
4
5
CREATE INDEX t_ba_idx ON t (b,a);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_ba_idx;

Benchmark Results

Index WHERE a=? WHERE b=? WHERE a=? AND b=?
None 406007 ms 426321 ms 400189 ms
(a) 731 ms 427019 ms 813 ms
(b) 402615 ms 3259 ms 3277 ms
(a), (b) 716 ms 3364 ms 1161 ms
(a,b) 725 ms 85071 ms 57 ms
(b,a) 80956 ms 3441 ms 55 ms

Bold text indicates, that no sequential table scans were required.

Conclusion

A compound index on (a,b)

  1. performs much better than two separate indexes on (a) and (b) in WHERE a=? AND b=? queries.
  2. performs mostly identical to an index on (a) in WHERE a=? queries.
  3. is useless in WHERE b=? queries. Therefore, the order of columns in the compound index matters: (a,b) does not equal (b,a).

Strangely, queries with WHERE a=? are consistently faster than queries with WHERE b=?. I’m not sure why this is the case – might have something to do with a being the table’s first column.


Notes
Software: PostgreSQL v9.3.1
Hardware: MacBook Pro with 2.7 GHz and 16 GB RAM

comments powered by Disqus