Debugging Slow PostgreSQL Queries in Production
How to track down and fix multi-second query delays when your API starts timing out.

It was a Thursday. I remember that because I'd been telling myself all week that Thursday would be chill: we'd shipped a feature on Monday, the deploy went fine, nothing was on fire. I was going to leave on time for once. Then at 3:47 PM the Slack alerts started. API response times through the roof. P95 latency at twelve seconds and climbing. The dashboard looked like a heart rate monitor during a panic attack, except nobody's heart rate should be shaped like a hockey stick going straight up.
I pulled up Grafana and the Postgres metrics panel was just... red. Active connections maxed out. Query durations in the tens of thousands of milliseconds. The app servers were all waiting on the database, and the database was choking.
The first thing I did, which in hindsight should have been instinctive but actually took a couple of minutes of staring at dashboards, was check pg_stat_statements. If you don't have this extension turned on in your production Postgres, stop reading this and go enable it. Seriously. It's the single most useful thing for understanding what your database is actually spending time on.
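If it isn't on yet, enabling it is one config change plus an extension; the library has to be preloaded, which means a restart:
-- careful: this overwrites shared_preload_libraries, so append instead if you
-- already preload other libraries; the change only takes effect after a restart
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- then, after the restart, in the database you care about:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
With it in place, this is the query I ran to find the worst offenders: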
SELECT
    query,
    calls,
    round(mean_exec_time::numeric, 2) AS avg_latency_ms,
    round(total_exec_time::numeric, 2) AS total_time_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
The top offender was averaging 47 seconds per call. Forty-seven seconds. And it was being called hundreds of times per minute because the frontend had a retry loop, so every time the request timed out at 30 seconds, it would try again, adding another 47-second query to the pile. The database was drowning in its own backlog of the same terrible query.
The query itself looked innocent enough. A join between users and posts, counting how many posts each user had, filtered by signup date. The kind of thing you'd write in a tutorial and not think twice about.
SELECT u.name, COUNT(p.id)
FROM users u
LEFT JOIN posts p ON p.author_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;
I ran EXPLAIN ANALYZE on it against a read replica first (never run expensive diagnostic queries on your primary when it's already struggling; that's a lesson I learned the hard way on a different occasion), and the execution plan told the whole story.
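The invocation is nothing special, just the query with EXPLAIN in front of it; BUFFERS isn't strictly necessary, but it's usually worth having:
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(p.id)
FROM users u
LEFT JOIN posts p ON p.author_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name;
The damning part of the output: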
Seq Scan on posts p (actual time=0.012..234.567 rows=85000 loops=1)
Filter: (author_id = u.id)
Rows Removed by Filter: 2847000
Sequential scan on the posts table. Postgres was reading every single row. All 2.9 million of them. For every iteration of the join. No index on author_id.
Now here's the embarrassing part. The posts table had been around for over a year. The author_id column had existed since day one. And nobody, not me, not any other developer who'd touched this code, had thought to put an index on it. The table started small, the query was fine, and it grew slowly enough that nobody noticed the degradation until it suddenly wasn't slow, it was broken. That's how these things go. There's no dramatic moment where the query goes from fast to slow. It degrades by milliseconds per week and then one day the table crosses some size threshold and the query planner's strategy flips from something reasonable to a full sequential scan and everything falls over at once.
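One cheap way to catch that drift before it becomes an incident, which we did not have in place at the time, is to glance at the table-level scan statistics now and then. A big, frequently queried table whose sequential-scan numbers keep climbing is usually a query that wants an index:
-- tables doing the most sequential-scan reads; large numbers on large,
-- hot tables usually point at a missing index
SELECT relname, seq_scan, seq_tup_read, idx_scan, n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
Anyway, back to the incident.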
I created the index:
CREATE INDEX CONCURRENTLY idx_posts_author_id ON posts (author_id);
The CONCURRENTLY keyword is non-negotiable here. Without it, CREATE INDEX takes a lock that blocks every write to posts (reads still work, but every INSERT, UPDATE, and DELETE waits) until the index build finishes. On a table with nearly 3 million rows, that would have been several minutes of complete write lockout. In production. While the app is already on fire. So yeah, CONCURRENTLY: it's slower, it uses more resources, and it can fail if there are conflicting transactions, but it doesn't block writes while it builds.
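That failure mode deserves a footnote. If a concurrent build errors out or gets cancelled, it doesn't clean up after itself; it leaves behind an index marked invalid, which queries ignore but which still adds write overhead until you drop it and retry. Worth checking for once the dust settles:
-- a failed CREATE INDEX CONCURRENTLY leaves the index behind, flagged as invalid
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;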
The problem is that building an index concurrently on a big table takes time. I sat there refreshing pg_stat_activity every few seconds to watch the build inch along, while Slack messages piled up asking if the site was fixed yet. It took about eleven minutes. A long eleven minutes.
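If you're on Postgres 12 or newer, the cleaner way to watch a build is the dedicated progress view:
-- rough percentage of the index build completed (Postgres 12+)
SELECT phase, blocks_done, blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;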
While that was building, I looked at what else was going on. The WHERE u.created_at > '2025-01-01' filter was also not well-served by any existing index. The users table had a primary key index on id and an index on email for login lookups, but nothing on created_at. So I kicked off another concurrent index build:
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at);
I debated making this a composite index with name included, since the query also selects u.name. Something like CREATE INDEX CONCURRENTLY idx_users_created_name ON users (created_at, name), the idea being Postgres could satisfy the query entirely from the index without going back to the heap. A covering index. I decided against it in the moment because I wasn't sure the column order was right and I didn't want to build the wrong index and waste another ten minutes. In retrospect, the composite index probably would have been worth it. But triage mode isn't the time for perfect decisions.
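Looking at it now, if I were going to build that covering index I'd probably reach for INCLUDE rather than a true composite, and it would need id in there too, since the join and the GROUP BY both use it. I haven't benchmarked this against our data, so treat it as a sketch:
-- hypothetical covering index (Postgres 11+): created_at is the search key,
-- id and name ride along as payload so the planner can skip the heap entirely
CREATE INDEX CONCURRENTLY idx_users_created_at_covering
    ON users (created_at) INCLUDE (id, name);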
The author_id index finished. I ran EXPLAIN ANALYZE again.
Index Scan using idx_posts_author_id on posts p
(actual time=0.023..0.089 rows=12 loops=4521)
Index scan. 0.089 milliseconds per loop instead of a 234-millisecond sequential scan over the whole posts table. The query went from 47 seconds to about 180 milliseconds. Relief.
But the alerts didn't fully clear. Latency came down dramatically (P95 dropped from 12 seconds to about 900 milliseconds), but it was still way above normal. Something else was going on.
I went back to pg_stat_statements and the second-worst query was a paginated listing endpoint. The application was fetching page 5000+ of a listing with OFFSET:
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 20 OFFSET 100000;
Here's the thing about OFFSET that's not obvious until you think about it: Postgres still has to compute and sort all 100,000+ rows before it can skip them. It doesn't have some magic way to jump ahead. It builds the entire result set up to the offset point, then throws away everything except the last 20 rows you asked for. So OFFSET 100000 on a large table is almost as expensive as selecting all 100,000 rows. The higher the page number, the worse it gets.
This is a known problem and the known solution is keyset pagination, sometimes called cursor-based pagination. Instead of saying "skip N rows," you say "give me rows after this specific value." If you're ordering by created_at, you pass the created_at value of the last item on the previous page:
SELECT * FROM posts
WHERE created_at < '2025-06-15T10:30:00Z'
ORDER BY created_at DESC
LIMIT 20;
With an index on created_at, Postgres can jump directly to the right spot in the index and read exactly 20 rows. No scanning, no throwing away results. It's consistently fast regardless of how deep into the dataset you are. Page 1 and page 5000 take the same amount of time.
The downside is you can't jump to an arbitrary page number anymore. You can't say "go to page 347." You can only go forward and back. For most applications (infinite scroll, feeds, API pagination) that's fine. For the few places where random page access matters, you're stuck with OFFSET and you just have to accept that deep pages will be slow, or find a different UX.
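One wrinkle if you're copying this pattern: created_at usually isn't unique, and a plain timestamp cursor can skip or repeat rows when two posts share a timestamp at a page boundary. The usual fix is to make the cursor a (created_at, id) pair and compare row-wise; the id below is a placeholder, not anything from our data:
-- tie-safe keyset pagination: id breaks ties between rows with the same timestamp;
-- pairs naturally with an index on (created_at, id)
SELECT * FROM posts
WHERE (created_at, id) < ('2025-06-15T10:30:00Z', 91042)
ORDER BY created_at DESC, id DESC
LIMIT 20;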
We changed the endpoint to cursor-based pagination. That fixed the second query. P95 latency dropped to about 200 milliseconds.
Still not great. Normal was under 50ms.
I looked at the connection metrics more carefully. During the incident, we'd hit the Postgres max_connections limit (set to 100) and connections were queuing up in the application's connection pool waiting for a free database connection. Even though the individual queries were now fast, the connection bottleneck was still causing latency.
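From the database side, the quickest way to see what those connections are actually doing is to group pg_stat_activity by state:
-- how many backends exist and whether they're working or just sitting idle
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;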
Our setup was three app servers, each running a connection pool of 40 connections. That's 120 potential connections against a limit of 100. Under normal load it worked because not all connections were active simultaneously. Under this burst of retried requests, every pool tried to claim its full 40 at once, and there simply weren't enough Postgres connections to go around.
The real fix here is PgBouncer. It sits between your application and Postgres and multiplexes connections. Your app servers can open hundreds of connections to PgBouncer, and PgBouncer maintains a small pool of actual Postgres connections (we set it to 50) and routes transactions through them. We ran it in transaction pooling mode, which means a Postgres connection is assigned to a client only for the duration of a transaction, then released back to the pool.
[pgbouncer]
pool_mode = transaction
default_pool_size = 50
max_client_conn = 500
Setting up PgBouncer wasn't part of the immediate incident response; we did that the next morning when things were calmer. But the effect was stark. CPU utilization on the Postgres server, which had been hovering around 85-90% even after the query fixes, dropped to about 25%. Connection-related wait times essentially disappeared.
I should mention something about PgBouncer in transaction mode that bit us later: server-side prepared statements don't work. When server connections are shared between clients at the transaction level, a statement prepared on one connection simply isn't there when your next transaction lands on a different one. Our ORM was using prepared statements by default, and after we deployed PgBouncer we got a wave of "prepared statement does not exist" errors. Had to add ?prepared_statements=false to the connection string. One of those things you learn by getting hit with it. The PgBouncer docs mention it, but honestly, who reads docs cover to cover during an incident?
There's a follow-up thing I did the next week that I think is worth mentioning. I set up a slow query log:
ALTER SYSTEM SET log_min_duration_statement = 500;
SELECT pg_reload_conf();
This logs any query that takes longer than 500 milliseconds. It's not a replacement for pg_stat_statements, but it gives you a real-time feed of problematic queries that you can pipe into your log aggregation system and alert on. If we'd had this set up before the incident, we probably would have caught the degradation weeks earlier when the query was taking 5 seconds instead of waiting until it hit 47.
I also set up auto_explain to automatically log the execution plan for slow queries:
-- LOAD and SET only affect your own session; to capture plans from every backend,
-- add auto_explain to session_preload_libraries (or shared_preload_libraries)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
That way, when a slow query shows up in the logs, you don't have to reproduce it to see the execution plan; it's right there in the log entry. Incredibly useful for intermittent issues where a query is usually fast but occasionally slow due to plan changes or data distribution shifts.
The whole incident, from first alert to fully stable, was about three and a half hours. Most of that time was waiting for index builds and then deploying the pagination change. The actual diagnosis took maybe twenty minutes. Which is the frustrating thing: the fixes were all straightforward. Missing index, bad pagination pattern, no connection pooling. None of it was subtle or clever. It was just stuff that hadn't been set up correctly from the start, and the data grew until it couldn't be ignored anymore.
I've been thinking about this since then. There's a pattern where a database starts out fine because the data is small, and all the missing indexes and bad query patterns are invisible because everything fits in memory and sequential scans are fast when the table is tiny. Then growth happens gradually and nobody notices the queries getting a little slower each month because there's no baseline alerting, no slow query log, no periodic EXPLAIN ANALYZE review. And then one day there's a threshold (maybe the table gets too big for the shared buffer cache, maybe the planner switches strategies) and performance falls off a cliff.
I don't have a great answer for preventing this. You could review execution plans regularly, but in practice nobody does that. You could set up latency monitoring with percentile-based alerts, which we now have, but that only catches things after they've started degrading. Maybe the answer is just to assume that every foreign key column needs an index from day one, and that every listing endpoint should use cursor pagination from the start, and that PgBouncer should be in your stack before you think you need it.
Though honestly I'm not even sure about all of that. There are cases where indexes on foreign keys aren't worth the write overhead, and cursor pagination makes some UX patterns harder. I keep going back and forth on whether you should optimize proactively or wait until there's a measured problem. The textbook answer is "don't prematurely optimize." The lived experience answer is "that missing index will cost you a Thursday afternoon."
One thing I still wonder about is whether we should have looked at partitioning the posts table instead of (or in addition to) the index approach. The table is big and getting bigger, and at some point even good indexes start to show their age on a table with tens of millions of rows. Range partitioning by created_at might make sense for our access patterns; most queries only look at recent data. But I haven't had time to properly test it and I've heard mixed things about partition pruning performance in practice versus the theory. Something for a future Thursday, I guess. Hopefully a less eventful one.
Written by
Anurag Sinha
Developer who writes about the stuff I actually use day-to-day. If I got something wrong, let me know.