Stop Guessing Batch Sizes: High-Performance Bulk Inserts in Spring Boot with PostgreSQL COPY
DEV Community

Stop Guessing Batch Sizes: High-Performance Bulk Inserts in Spring Boot with PostgreSQL COPY

Why Not JDBC Batch?

The standard Spring Boot approach for bulk inserts is JDBC batch with JdbcTemplate:

jdbcTemplate.batchUpdate(
    "INSERT INTO settlement_order (id, amount, status, ...) VALUES (?, ?, ?, ...)",
    orders,
    500, // batch size
    (ps, order) -> {
        ps.setObject(1, order.getId());
        ps.setBigDecimal(2, order.getAmount());
        ps.setString(3, order.getStatus().name());
        // ...
    }
);

This works. For predictable volumes, it works well. The problem is the number 500. That batch size is a tuning decision. Too small, and you make too many roundtrips to the database. Too large, and you risk memory pressure or statement timeouts under high load. Getting it right requires knowing your volume - and knowing your volume requires data you may not have yet.

In a financial system with variable daily throughput, you are essentially betting that your batch size stays optimal across all load scenarios. Some days you win that bet. Others you don't, and you find out in production. I didn't want to pay for that lesson.

The COPY Protocol

PostgreSQL's COPY command was designed for bulk data transfer. It bypasses the standard row-by-row insert path entirely, streaming data directly into the table in a single operation. No per-row parsing overhead, no per-row plan execution, no per-row roundtrip.

The Java PostgreSQL driver exposes this through CopyManager, which accepts an InputStream and streams it into the database using the COPY FROM STDIN syntax. The key difference from JDBC batch: you don't choose a batch size. You stream the entire dataset in one pass, and PostgreSQL handles the ingestion. The throughput ceiling is determined by the database and the network, not by a tuning parameter in your application code.

Implementation in Spring Boot

The main challenge in a Spring Boot application is that CopyManager requires a native PostgreSQL connection, not the wrapped connection that HikariCP hands to your code by default. You need to unwrap it:

@Component
public class SettlementOrderCopyBulkAdapter {
    private final JdbcTemplate jdbcTemplate;

    public SettlementOrderCopyBulkAdapter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional
    public void bulkInsert(List<SettlementOrder> orders, UUID batchId) {
        if (orders.isEmpty()) return;

        jdbcTemplate.execute((Connection connection) -> {
            // Unwrap HikariCP's proxy to reach the real PostgreSQL driver connection
            BaseConnection baseConnection = connection.unwrap(BaseConnection.class);
            CopyManager copyManager = new CopyManager(baseConnection);

            String copySql = """
                COPY settlement_order (
                    id, originator_id, destination_id, batch_id,
                    order_type, amount, currency, settlement_date,
                    status, created_at
                ) FROM STDIN WITH (FORMAT csv, DELIMITER ',', QUOTE '"', ESCAPE '"')
                """;

            StringBuilder csv = new StringBuilder();
            for (SettlementOrder order : orders) {
                csv.append(order.getId()).append(",")
                   .append(order.getOriginator().getId()).append(",")
                   .append(order.getDestination().getId()).append(",")
                   .append(batchId).append(",")
                   .append(order.getType().getCode()).append(",")
                   .append(order.getAmount()).append(",")
                   .append("BRL").append(",")
                   .append(order.getSettlementDate()).append(",")
                   .append(order.getStatus().name()).append(",")
                   .append(order.getCreatedAt()).append("\n");
            }

            byte[] bytes = csv.toString().getBytes(StandardCharsets.UTF_8);
            try (InputStream inputStream = new ByteArrayInputStream(bytes)) {
                copyManager.copyIn(copySql, inputStream, 65536);
            }
            return null;
        });
    }
}

A Note on Memory Usage

In the implementation above, I used a StringBuilder to serialize the data into CSV format in memory. This is concise and works perfectly for moderate volumes. However, for extremely large datasets, building a massive string in memory can lead to OutOfMemoryError or heavy Garbage Collection pressure. For high-throughput systems, consider streaming the data directly to the InputStream using a custom implementation or a library like Apache Commons CSV or Jackson CSV to write to a Writer or OutputStream directly. This keeps the memory footprint constant regardless of the batch size.

Three Details Worth Explaining

connection.unwrap(BaseConnection.class)

HikariCP wraps the real JDBC connection in a proxy for connection pool management. CopyManager is a PostgreSQL-specific class that needs the actual driver connection, not the wrapper. unwrap() pierces through the proxy and returns the underlying org.postgresql.core.BaseConnection. Without this, you get a ClassCastException at runtime.

FROM STDIN WITH (FORMAT csv, DELIMITER ',', QUOTE '"', ESCAPE '"')

This is the COPY format declaration. FORMAT csv tells PostgreSQL to parse the stream as CSV. The QUOTE and ESCAPE settings ensure that values containing commas or quotes are handled correctly. For a financial system where amounts and identifiers have predictable formats, CSV is the right choice - it's compact and fast to serialize.

65536 as the buffer size

The third argument to copyIn() is the read buffer size in bytes - 64KB here. This controls how much data is read from the InputStream per internal read cycle during the stream transfer. It is not a batch size in the JDBC sense; the entire dataset is still transferred in a single COPY operation. The buffer size affects internal I/O efficiency, not correctness. 64KB is a reasonable default for most workloads.

Transactional Participation

One concern with native PostgreSQL features is whether they play nicely with Spring's transaction management. CopyManager does, because the connection it uses is the same connection that Spring's transaction manager already enrolled in the active transaction. The @Transactional annotation on the adapter method ensures that the entire COPY operation participates in the surrounding transaction. If anything fails after the bulk insert - a downstream status update, a Kafka publish, anything - the COPY rolls back along with everything else. No partial state reaches the database.

@Transactional
  ├─ Spring opens transaction
  ├─ bulkInsert() ← COPY streams into PostgreSQL
  ├─ updateStatus() ← JPQL update on same connection
  ├─ publishToKafka() ← fires only after commit (afterCommit)
  └─ commit or rollback applies to all

The Risk Argument

The choice between JDBC batch and CopyManager is not purely technical. It is also a risk decision. JDBC batch with a fixed batch size is a bet that your volume estimate stays valid. In a system where daily throughput varies significantly, and where outlier days are expected, that bet has a real probability of failing in production. When it fails, you find out under load, at the worst possible time.

CopyManager removes that bet. The application serializes the dataset and streams it. PostgreSQL ingests it. The throughput ceiling is the database, not a tuning parameter. On a normal day the difference is negligible. On an outlier day it is the difference between a pipeline that holds and one that doesn't. For a settlement system processing interbank transactions under regulatory time constraints, that margin matters.

When Not to Use CopyManager

CopyManager is the right tool when you need to insert a large, complete dataset in one operation. It is not always the right tool:

  • Small, frequent inserts - for single-row or low-volume writes, regular JPA or JdbcTemplate is simpler and sufficient
  • Upsert semantics - COPY is insert-only; if you need ON CONFLICT DO UPDATE, you need a different approach (a staging table + merge, or INSERT ... ON CONFLICT)
  • Complex validation per row - if each row needs application-level validation before insert, building that into the CSV serialization loop adds complexity that JDBC batch handles more naturally

Takeaway

If your system has unpredictable or variable bulk insert volume, tuning a JDBC batch size is an ongoing maintenance burden with production risk attached. PostgreSQL's COPY protocol delegates that responsibility to the database, where it belongs. The implementation in Spring Boot requires one non-obvious step - unwrapping HikariCP's connection proxy - but once that is in place, the rest is straightforward: serialize your dataset to CSV, stream it in, let PostgreSQL do what it was designed to do. Don't tune your way around the database. Use what the database gives you.

This is part of a series on the STR-XML-Pipeline, a high-throughput interbank settlement system built with Spring Boot 3.5, Java 21, Apache Kafka, PostgreSQL 16, Redis 7, and AWS Fargate. The previous article covered how distributed locks were moved off the API hot path and onto the scheduler.

Comments

No comments yet. Start the discussion.