OpenAI outlined how it scaled PostgreSQL to handle millions of queries per second for ChatGPT and its API platform, serving hundreds of millions of users globally. The effort highlights how far a single-primary PostgreSQL instance can be pushed before write-intensive workloads require distributed alternatives, and the design trade-offs and operational guardrails needed for a low-latency, globally available service.
As PostgreSQL load grew more than tenfold in the past year, OpenAI worked with Azure to optimize its deployment on Azure Database for PostgreSQL, enabling the system to serve 800 million ChatGPT users while maintaining a single-primary instance with sufficient headroom. Optimizations spanned both the application and database layers, including scaling up instance size, refining query patterns, and scaling out with additional read replicas. Redundant writes were reduced through application-level tuning, and new write-heavy workloads were directed to sharded systems such as Azure Cosmos DB, reserving PostgreSQL for relational workloads requiring strong consistency.
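The blog post does not publish code, but one common application-level tactic for trimming redundant writes is to filter out no-op UPDATEs in SQL, so PostgreSQL never creates a new row version when nothing actually changed. The following is a minimal sketch of that idea; the table and column names are hypothetical, not OpenAI's schema.

```python
import psycopg2

# A minimal sketch of one application-level tactic for reducing redundant
# writes: make no-op UPDATEs cheap by filtering them out in SQL, so MVCC
# does not create a new row version (and later vacuum work) when the stored
# value already matches. Table and column names are hypothetical.
conn = psycopg2.connect("dbname=app user=app")  # connection details are placeholders

def save_setting(user_id: int, key: str, value: str) -> None:
    with conn.cursor() as cur:
        # IS DISTINCT FROM skips the write entirely when the value is
        # unchanged, including correct handling of NULLs.
        cur.execute(
            """
            UPDATE user_settings
               SET value = %s, updated_at = now()
             WHERE user_id = %s AND key = %s
               AND value IS DISTINCT FROM %s
            """,
            (value, user_id, key, value),
        )
    conn.commit()
```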
The primary PostgreSQL instance is supported by nearly 50 geo-distributed read replicas on Azure Database for PostgreSQL. Reads are distributed across replicas to maintain p99 latency in the low double-digit milliseconds, while writes remain centralized with measures to limit unnecessary load. Lazy writes and application-level optimizations further reduce pressure on the primary instance, ensuring consistent performance even under global traffic spikes.
PostgreSQL cascading replication (Source: OpenAI Blog Post)
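A simplified sketch of this read/write split might look like the following. The hostnames are placeholders for a geo-distributed replica fleet, and a production router would also weigh replication lag and client locality rather than choosing a replica at random.

```python
import random
import psycopg2

# A simplified sketch of read/write routing: writes always go to the single
# primary, while reads fan out across geo-distributed replicas to keep tail
# latency low. Hostnames below are illustrative placeholders.
PRIMARY_DSN = "host=pg-primary.example.com dbname=app"
REPLICA_DSNS = [
    "host=pg-replica-eastus.example.com dbname=app",
    "host=pg-replica-westeu.example.com dbname=app",
    "host=pg-replica-asia.example.com dbname=app",
]

def connect_for(query_is_read_only: bool):
    # Reads spread across replicas; writes stay centralized on the primary.
    dsn = random.choice(REPLICA_DSNS) if query_is_read_only else PRIMARY_DSN
    return psycopg2.connect(dsn)
```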
Operational challenges emerged as traffic scaled. Cache-miss storms, multi-table join patterns often generated by ORMs, and service-wide retry loops were identified as common failure modes. To address these, OpenAI moved some computation to the application layer, enforced stricter timeouts on idle and long-running transactions, and refined query structures to reduce interference with autovacuum processes.
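Guardrails of this kind map onto standard PostgreSQL session settings. The sketch below uses illustrative threshold values, not OpenAI's actual configuration, to cap statement runtime and idle-in-transaction time so stalled sessions cannot hold back autovacuum.

```python
import psycopg2

# A sketch of enforcing transaction guardrails at the session level: abort
# slow statements, kill idle-in-transaction sessions, and fail fast on lock
# waits, so long-lived transactions cannot block autovacuum from reclaiming
# dead rows. The specific values are illustrative only.
conn = psycopg2.connect("dbname=app user=app")
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '5s'")                     # abort slow queries
    cur.execute("SET idle_in_transaction_session_timeout = '10s'")  # drop stalled transactions
    cur.execute("SET lock_timeout = '2s'")                          # fail fast on lock waits
conn.commit()
```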
Reducing write pressure was a key strategy. PostgreSQL’s MVCC model increases CPU and storage overhead under heavy updates due to version churn and vacuum costs. OpenAI mitigated this by migrating shardable workloads to distributed systems, rate-limiting backfills and high-volume updates, and maintaining disciplined operational policies to avoid cascading overloads.
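A rate-limited backfill along these lines can be expressed as small batches with short transactions and a pause between them, giving vacuum time to keep up with version churn. The table, columns, and batch parameters below are hypothetical.

```python
import time
import psycopg2

# A minimal sketch of a rate-limited backfill: update rows in small batches,
# commit between batches to keep transactions short, and sleep so vacuum can
# keep pace with MVCC version churn. Schema and tuning values are hypothetical.
BATCH_SIZE = 5_000
PAUSE_SECONDS = 0.5

conn = psycopg2.connect("dbname=app user=app")
while True:
    with conn.cursor() as cur:
        cur.execute(
            """
            WITH batch AS (
                SELECT id FROM documents
                 WHERE embedding_version < 2
                 LIMIT %s
                 FOR UPDATE SKIP LOCKED
            )
            UPDATE documents d
               SET embedding_version = 2
              FROM batch
             WHERE d.id = batch.id
            """,
            (BATCH_SIZE,),
        )
        updated = cur.rowcount
    conn.commit()              # short transactions keep MVCC bloat bounded
    if updated == 0:
        break                  # backfill complete
    time.sleep(PAUSE_SECONDS)  # throttle write pressure on the primary
```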
In a LinkedIn post, Microsoft Corporate Vice President Shireesh Thota noted that "every database is optimized differently and needs the right tuning to get it to work at scale."
Connection pooling and workload isolation were also critical. PostgreSQL’s connection limits were managed by PgBouncer in transaction-pooling mode, reducing connection setup latency and preventing spikes in client connections. Critical and non-critical workloads were isolated to avoid noisy neighbor effects during peak demand.

Kubernetes deployment running multiple PgBouncer pods (Source: OpenAI Blog Post)
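From the application's perspective, connecting through PgBouncer is an ordinary PostgreSQL connection pointed at the pooler (conventionally on port 6432; the hostname below is a placeholder). The key caveat of transaction pooling is that session-level state does not survive across transactions.

```python
import psycopg2

# Illustrative only: clients connect to PgBouncer instead of PostgreSQL
# directly. In transaction-pooling mode a server connection is assigned only
# for the duration of each transaction, so session-level state (SET commands,
# prepared statements, advisory locks) must be avoided or reapplied per
# transaction. Host and credentials are placeholders.
conn = psycopg2.connect(
    host="pgbouncer.internal.example.com",
    port=6432,
    dbname="app",
    user="app",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")  # served by whichever pooled backend is free
    cur.fetchone()
conn.commit()
```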
Scalability constraints also arise from read replication. As the number of replicas increases, the primary must stream the WAL to each replica, adding CPU and network overhead. OpenAI is experimenting with cascading replication, where intermediate replicas relay WAL downstream, reducing load on the primary while supporting future growth. These strategies allow PostgreSQL to sustain extremely large-scale, read-heavy AI workloads across geo-distributed regions, while sharded systems handle write-intensive operations to maintain stability and performance.
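The fan-out from the primary is visible in pg_stat_replication, which lists each directly attached standby; under cascading replication the primary streams only to the intermediate relays, and each relay exposes its own pg_stat_replication view for the replicas downstream of it. A minimal monitoring sketch, with a placeholder hostname:

```python
import psycopg2

# A sketch of observing WAL fan-out: pg_stat_replication on any node lists
# the standbys streaming directly from it. On a cascading setup, running
# this against the primary shows only the intermediate relays.
conn = psycopg2.connect("host=pg-primary.example.com dbname=app")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT application_name, state, sent_lsn, replay_lsn
          FROM pg_stat_replication
        """
    )
    for name, state, sent, replayed in cur.fetchall():
        print(f"{name}: {state}, sent={sent}, replayed={replayed}")
```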
OpenAI has indicated it continues to evaluate ways to extend PostgreSQL’s scalability envelope, including sharded PostgreSQL deployments and alternative distributed systems, to balance strong consistency guarantees with rising global traffic and increasingly diverse workloads as the platform grows.
