Skip to content

Read the plan before you add an index

6 min read

Slow queries usually don't look wrong. That's the trap.

The SQL returns the right rows. It's just getting them the expensive way. You can't see that by staring at the query, so you have to ask Postgres what it actually did. That's EXPLAIN, and it's the first thing I reach for before adding an index or rewriting anything.

I used a fake shop database for the examples: 100k customers and 2 million orders. The setup script is at the end. These plans are real output from one run, so your timings and even some plan choices will differ, but the shapes should be close. You can use Pagila if you want something ready-made, but it is small enough that Postgres scans everything and most plans look boring.

There are three versions of EXPLAIN people mix together.

Plain EXPLAIN shows the plan Postgres expects to use. It does not run the query, so every number is a guess. EXPLAIN ANALYZE runs the query and shows what happened: real timing, real row counts, real loops. That's the one you want most of the time, with one catch: it really runs the query. Don't ANALYZE an UPDATE or DELETE unless you mean it. Wrap it in a transaction and roll back:

BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE created_at < '2020-01-01';
ROLLBACK;

EXPLAIN (ANALYZE, BUFFERS) adds the part I care about most: how many pages each step touched, and whether those pages came from memory or disk. Postgres 18 turns BUFFERS on by default. Until then, type it yourself.

Start with a query that has no useful index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
Gather  (cost=1000.00..28085.77 rows=21 width=38) (actual time=0.890..55.408 rows=32 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=14412 read=2255
  ->  Parallel Seq Scan on orders  (cost=0.00..27083.67 rows=9 width=38) (actual time=1.184..19.962 rows=11 loops=3)
        Filter: (customer_id = 42)
        Rows Removed by Filter: 666656
        Buffers: shared hit=14412 read=2255
Planning:
  Buffers: shared hit=14
Planning Time: 0.120 ms
Execution Time: 55.444 ms

This plan has two nodes because Postgres ran the scan in parallel. Read it inside-out. The Parallel Seq Scan at the bottom does the work. The Gather at the top collects the results.

Workers Planned: 2 / Workers Launched: 2 means it split the table across two worker processes. Add the leader process and you get three processes scanning, which is why the inner node says loops=3. There is no index on customer_id, so every row has to be read. Parallelism just makes that less awful.

A few fields carry most of the meaning. cost=1000.00..28085.77 is the planner's guess in arbitrary units: startup cost, then total cost. It is not milliseconds. The number only matters next to another plan. rows=21 is what Postgres expected back, and actual ... rows=32 is what came back. Close enough. width=38 is the estimated row size in bytes.

The useful line is Rows Removed by Filter: 666656, and that count is per worker. Each process threw away about 667k rows. Between the three of them, Postgres read all 2 million rows to keep 32. That is the wasted work, and it's why this is one of the rare times an index is the obvious fix.

Buffers: shared hit=14412 read=2255 says the same thing another way: about 16,700 pages touched, 2,255 read from disk. Most of the 55ms is Postgres walking through pages it did not need.

Add the index and run the same query:

CREATE INDEX orders_customer_id_idx ON orders (customer_id);
Bitmap Heap Scan on orders  (cost=4.59..86.62 rows=21 width=38) (actual time=0.096..0.525 rows=32 loops=1)
  Recheck Cond: (customer_id = 42)
  Heap Blocks: exact=32
  Buffers: shared hit=29 read=9
  ->  Bitmap Index Scan on orders_customer_id_idx  (cost=0.00..4.58 rows=21 width=0) (actual time=0.080..0.080 rows=32 loops=1)
        Index Cond: (customer_id = 42)
        Buffers: shared hit=3 read=3
Planning:
  Buffers: shared hit=16 read=1
Planning Time: 0.413 ms
Execution Time: 0.586 ms

55ms down to 0.6ms. Same 32 rows, almost none of the work: about 44 pages instead of 16,700.

The plan is a bitmap scan, which is two steps. The Bitmap Index Scan walks the index and builds a bitmap of every place matching customer_id = 42. Then the Bitmap Heap Scan reads those pages from the table in physical order. Postgres picked this instead of a plain index scan because the rows are spread across the table. In this sample data, customers were assigned to orders at random, so reading the pages in order is cheaper than jumping around one row at a time.

Heap Blocks: exact=32 says 32 table pages held the rows. Rows Removed by Filter is gone. That's the part you want.

That first plan was already a small tree. Bigger plans are just deeper trees. Same rule: read inside-out.

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, o.amount_cents
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'BR';
Hash Join  (cost=2264.46..44181.68 rows=290340 width=18) (actual time=11.764..194.220 rows=289764 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=15277 read=2223
  ->  Seq Scan on orders o  (cost=0.00..36667.00 rows=2000000 width=12) (actual time=0.012..85.300 rows=2000000 loops=1)
        Buffers: shared hit=14444 read=2223
  ->  Hash  (cost=2083.00..2083.00 rows=14517 width=22) (actual time=11.688..11.689 rows=14522 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 923kB
        Buffers: shared hit=833
        ->  Seq Scan on customers c  (cost=0.00..2083.00 rows=14517 width=22) (actual time=0.014..10.004 rows=14522 loops=1)
              Filter: (country = 'BR'::text)
              Rows Removed by Filter: 85478
              Buffers: shared hit=833

Bottom-up, Postgres scanned customers, kept the ~14k from Brazil, built a hash table from them, then scanned orders and checked each order against that hash table. That's a hash join. It works well when one side is small enough to hash and the other side is large.

The estimate was almost exact here: 290,340 expected rows against 289,764 actual rows. That's why Postgres could pick the join confidently.

The other joins you'll run into are nested loops and merge joins. A nested loop is good when the outer side is tiny and the inner side is indexed. A merge join works when both inputs are already sorted on the join key. Postgres chooses between them based on row estimates, which is why estimates matter more than almost anything else in the plan.

When the estimate is close, Postgres has a decent chance of choosing well. When it is off by 100x or 1000x, every decision built on top of it is suspect.

This is the classic failure mode:

Nested Loop  (cost=... rows=1 ...) (actual time=... rows=48211 loops=1)
  ->  Index Scan ...  (rows=1) (actual rows=48211 loops=1)
  ->  Index Scan ...  (loops=48211)

Postgres picked a nested loop because looping once over a single row is cheap. But the outer side actually had 48,211 rows, so the inner scan ran 48,211 times. The query crawls, and nothing in the SQL looks suspicious. The plan is bad because the estimate was bad.

loops= is easy to misread. When a node runs in a loop, actual rows is the average per loop. Multiply it by loops to get the real total. A node showing rows=3 loops=48211 did not return 3 rows. It returned about 144,000.

So the move is simple: find the deepest node where the estimate and the actual split apart, then start there. That's usually where the plan first went wrong.

The fix is often not an index. A bad estimate usually means the stats are stale or too coarse. First, refresh them:

ANALYZE orders;

If one column is skewed, tell Postgres to keep more detail on it. In this sample data, status is 90% completed:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

People skip this and jump to indexes. Sometimes the index never gets used because the planner still believes a seq scan is cheaper. The row count is wrong, so the choice is wrong.

You can't out-index a bad estimate.

One more thing on BUFFERS, because EXPLAIN ANALYZE alone is not enough. shared hit means the page was already in memory. shared read means Postgres had to fetch it from disk. temp read / temp written means a sort or hash spilled to disk because it did not fit in work_mem.

That matters because timing lies. A query can be slow on a cold cache and fast once everything is warm. Same plan, different timing. Buffers show the I/O, which is usually the thing you are trying to reduce. Two plans can look nearly identical and touch wildly different numbers of pages.

For anything bigger than a few nodes, don't pretend you're going to read it nicely in the terminal. Paste it into a visualizer. explain.dalibo.com is my default. explain.depesz.com is older, but still good at pointing at the expensive row.

The loop is boring, which is why it works: run EXPLAIN (ANALYZE, BUFFERS), compare estimate to actual, follow the buffers, fix the first place the plan lies to you. Do that before touching indexes.

Try it yourself

Build the dataset. The script is here.

createdb shop_demo
psql shop_demo -f sample-shop.sql
-- 1. Watch Postgres read all 2M rows to find a handful
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
 
-- 2. Add the index and run it again — compare buffers and time
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
 
-- 3. Read a join tree inside-out
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, o.amount_cents
FROM customers c JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'BR';

Further reading