How we found a bug in the hyper HTTP library
By rearchitecting the Images binding, we accidentally uncovered a bug that existed in the open-source hyper library across multiple major versions.
The Images service, built in Rust on Workers, runs on every machine in Cloudflare's edge network. To handle client connections, we use hyper, an open-source HTTP library for Rust. Last year, we introduced the Images binding to enable custom, programmatic workflows for processing remote images in Workers.
At the end of 2025, we rearchitected the binding to provide a more direct, local connection between the Workers runtime and the Images service. Shortly after rollout, we received reports that transformation requests from the binding were failing - but only intermittently and only for larger images. Even stranger, the responses for these requests returned a 200 status without any errors logged. The image data was simply cut short: A response that should have been two megabytes might arrive with a few hundred kilobytes instead.
We spent six weeks chasing a nearly invisible bug - a race condition that occurred only under specific conditions - in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.
Hops, handoffs, and hyper
When developers build on Cloudflare, they compose full-stack applications from a set of platform services that are accessible to Workers through bindings. Bindings provide direct APIs to resources on the Developer Platform like compute, storage, AI inference, and media processing.
The Images binding decouples image optimization from delivery; you can transcode, composite, or manipulate images without needing to return the output as an HTTP response. It also lets you apply optimization parameters in any order, rather than following the fixed sequence imposed by the URL interface.
Here, a worker can pass image data directly to the Images API, chain operations together, and get the processed result back as a stream:
const result = await env.IMAGES
  .input(image)
  .transform({ width: 800, rotate: 90 })
  .output({ format: "image/avif" });
return result.response();
At a high level, this is how image data moves through our various services: The pipe represents a socket connection between the intermediary and Images, where data is handed off from one process to the next through the kernel's buffer.
The binding communicates with Images through a socket connection managed by the Workers runtime. A socket connection is a communication channel between two processes. Each end of the socket has buffers that are managed by the operating system's kernel; these buffers are temporary holding areas where data sits after one side writes it but before the other side reads it.
Hyper manages the connection on the Images service's side, reading incoming requests from the socket and writing responses back to it. When a request uses the Images binding, the Images service reads the input, performs the requested optimization operations, and encodes the result. It then passes the entire encoded image to hyper as a single in-memory block.
Hyper writes this response data into its own internal buffer. At this point, hyper considers the encoding work as complete, since it has all the bytes that it needs to send. The next step is to flush its internal buffer to the socket's outbound buffer, moving the data from the Images service to the intermediary on the other end.
If the reader on the other end is fast, then hyper can flush everything in one pass - the outbound buffer will have room because the reader is consuming data as quickly as it arrives. Once all data is sent, hyper issues a shutdown on the socket, signaling that the connection is finished and no more data will be written.
But if the reader is slower (even by a few milliseconds), then the outbound buffer fills up, and hyper needs to wait until there's room to continue writing.
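The effect of a reader that falls behind can be seen with nothing but the Rust standard library. The sketch below is not hyper's code; it is a minimal illustration, assuming a Unix-like system, that opens a socket pair, never reads from one end, and counts how many bytes a non-blocking writer can hand to the kernel before the outbound buffer is full. The function name `write_until_blocked` is hypothetical.

```rust
use std::io::{self, ErrorKind, Write};
use std::os::unix::net::UnixStream;

/// Writes up to `total` bytes into one end of a Unix socket pair whose
/// other end is never read, and returns how many bytes the kernel accepted
/// before the socket buffer filled up. (Hypothetical helper, for illustration.)
fn write_until_blocked(total: usize) -> io::Result<usize> {
    // `_idle_reader` stays alive for the whole function but never reads,
    // which stands in for a reader that has fallen behind.
    let (mut writer, _idle_reader) = UnixStream::pair()?;
    writer.set_nonblocking(true)?;

    let payload = vec![0u8; total];
    let mut accepted = 0;
    while accepted < payload.len() {
        match writer.write(&payload[accepted..]) {
            Ok(0) => break,
            Ok(n) => accepted += n,
            // The outbound buffer is full: a real server would now wait
            // for the socket to become writable again before continuing.
            Err(e) if e.kind() == ErrorKind::WouldBlock => break,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(accepted)
}
```

Calling `write_until_blocked(16 * 1024 * 1024)` should return a count below 16 MiB on typical systems; the exact figure depends on the kernel's socket buffer settings. Everything past that count is data the writer still holds and must send later.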
All incoming traffic on Cloudflare's network passes through FL, an internal intermediary service that runs security and performance features and routes requests to the appropriate backend. When we first launched the binding, image data flowed from the Workers runtime, through FL, to the Images service. This path was a natural fit for our initial release and follows the same architecture as our URL interface.
Over time, though, this coupling with FL became a constraint: Every change to the binding had to follow FL's release cycle. In December 2025, the Images team replaced FL with a new intermediary service, an internal worker binding that runs on the same machine.
In the original architecture, data moved through FL over network sockets; this path carried the overhead of FL's full processing pipeline, such as DNS lookups and routing. The internal binding replaced these with Unix sockets to directly connect the services on the same machine, bypassing FL and the overhead of the network stack. This made the request path to Images faster and gave the team independent control over binding releases.
Within days of the rollout, we received our first customer report.
The first sign of trouble
The first sign of trouble came from a customer with a non-standard setup: two layers of image processing, where one pipeline was nested inside another.
- First, their worker used the Images binding to composite multiple large source images from R2 - a JPEG background plus PNG overlay layers - into a single combined JPEG.
- Second, they further compressed, transcoded, and resized the result through the URL interface.
The bug originated in the inner pipeline's return path, where the response was truncated before reaching the outer pipeline. The inner pipeline (transformation binding) handled compositing. The outer pipeline (transformation URL) handled delivery optimizations like scaling and format conversion.
This layered approach meant that when the inner pipeline silently returned a truncated response, the only visible error appeared one level up:
error reading a body from connection: end of file before message length reached
The outer pipeline received HTTP 200 from the inner one, with a Content-Length header that promised several megabytes. The actual body was only a fraction of that: In one request, only ~200 KB arrived out of an expected 3.3 MB.
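This kind of mismatch is straightforward to check for on the receiving side. As a minimal sketch (the `check_body` function and `BodyCheck` type are hypothetical names, not part of hyper or the Images service), a reader can count the bytes that actually arrive and compare the total with the advertised Content-Length:

```rust
use std::io::{self, Read};

/// Outcome of comparing a body against its advertised length.
#[derive(Debug, PartialEq)]
enum BodyCheck {
    Complete,
    Truncated { expected: u64, received: u64 },
}

/// Reads `body` to the end (or up to `content_length` bytes) and reports
/// whether the promised number of bytes arrived.
fn check_body<R: Read>(body: R, content_length: u64) -> io::Result<BodyCheck> {
    // Discard the bytes as they are read; only the count matters here.
    let received = io::copy(&mut body.take(content_length), &mut io::sink())?;
    if received == content_length {
        Ok(BodyCheck::Complete)
    } else {
        Ok(BodyCheck::Truncated { expected: content_length, received })
    }
}
```

For a 200,000-byte body that was announced as 3,300,000 bytes, `check_body` returns `Truncated { expected: 3300000, received: 200000 }`, which is the condition behind the "end of file before message length reached" error.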
The error surfaced in the outer pipeline, but the truncation could have originated in the binding, the intermediary service, the Images service, or somewhere in between. When a browser receives a truncated image, the result is visible. Depending on the format, the image either renders partially (e.g., with the bottom half missing or gray) or fails to decode entirely, instead displaying a broken image.
Tracing the truncation
From here, we worked inward through the request path, testing each layer to isolate where the truncation was happening. Some of these efforts hit dead ends; others left breadcrumbs that narrowed the search:
Building a reproduction. We built a worker that mimicked the customer's nested setup, then stripped away layers until we could trigger the bug with the binding alone. A small script let us fire requests in batches. In one early run, 19 out of 25 requests failed. The amount of data that did arrive - roughly 200 KB - was suspiciously close to the size of the socket buffer in production. This confirmed that the problem wasn't tied to the customer's configuration and gave us a reliable way to trigger the bug on demand.
Investigating timeouts. Early on, we suspected the truncation might be related to timeout behavior (i.e., the connection was being closed after a time limit). This theory didn't hold, as the truncation wasn't correlated with request duration.
Updating the hyper version. When the bug was first reported, we were running 0.14.x, while the latest hyper version was around 1.8.x. We tested across hyper versions 0.14, 1.7, and 1.8, just in case the most obvious answer was the correct (and easiest) one. But the bug appeared in every version, which meant that no upstream fix existed yet.
Reproducing locally. We ran local integration tests on macOS and a Debian VM. Even under considerable load, our local requests never triggered any failure. Making direct curl requests to the binding socket and replaying captured requests always seemed to work. The bug only appeared on the full production path when there was real concurrency and a real Workers runtime client on the other end of the socket. This led us to suspect the runtime itself.
Ruling out the Workers runtime. We examined the HTTP client that the Workers runtime uses to communicate with Images through the binding socket. None of the traces from either side of the connection showed any syscalls that indicated an unexpected close or early termination. We observed that the client behaved correctly and multiple other services used the same client without issues.
Distributed tracing. By inspecting request traces end-to-end, we confirmed that the truncated body was already present before it reached the outer transformation layer in the customer's setup. That narrowed the problem to the inner pipeline - the binding path through the Images service.
Instrumenting the intermediary service. We added instrumentation to the intermediary service to measure body sizes before forwarding the response data. The bodies were already truncated by the time they left the Images service, so the intermediary was ruled out.
Deeper tracing within the Images service. At the service level, the request was processed, the image was properly encoded, and the response was sent with HTTP 200. The only consistent signal was that the bug was timing-dependent: It appeared only on the production path, with real concurrency, and only for larger images.
Application-level debugging tools told us only what the system thought it was doing. And according to the system, everything was fine: Tracing said the response was sent, logging reported no errors, and the Images service returned 200 on every request.
What strace revealed
To see what the system was actually doing, we attached strace to the Images service. strace records the syscalls that a process makes to the kernel, which could show us exactly which bytes were written, when a shutdown was called, and whether the client sent any termination signal.
Setting up the trace was delicate. strace works by intercepting syscalls as they happen, which adds a small amount of timing overhead to each one. Filtering for a narrow set of syscalls kept that overhead minimal. Broadening the filter, however, slowed the process just enough to shift the timing between the flush and the shutdown check - and make the bug disappear entirely. That alone reinforced our theory that the issue was timing-sensitive.
Using a reproduction worker, we triggered the bug and compared the syscall output between successful and failing requests.
In a successful request, the response is written in chunks as the socket buffer allows, with shutdown called only after all the data is sent. For example, this may look like:
sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
sendto(42, "\xff\xd8\xff\xe0...", 292352) = 292352
// ... keeps writing until buffer drains ...
sendto(42, "...", 292352) = 292352
shutdown(42, SHUT_WR) = 0
When we reproduced the bug, a failing request looked like:
sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
shutdown(42, SHUT_WR) = 0
Here, there is only one write - just enough for the headers and a sliver of the body - before the shutdown is immediately called. Out of a 14.9 MB response, only about 219 KB was sent. The remaining ~14.8 MB of image data never left hyper's internal buffer, nor was there any termination signal from the client between the write and the shutdown. Instead, the Images service prematurely shut down the connection on its own, genuinely believing it was finished.
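The same syscall pattern can be imitated outside of hyper. The following sketch, assuming a Unix-like system and using a hypothetical helper named `write_once_then_shutdown`, performs a single non-blocking write on a socket pair, shuts down the write side immediately, and then reads everything the other end received. It only mirrors the failing trace; it does not involve hyper or the Images service.

```rust
use std::io::{self, ErrorKind, Read, Write};
use std::net::Shutdown;
use std::os::unix::net::UnixStream;

/// Performs one non-blocking write of `total` bytes, shuts down the write
/// side right away, and returns (bytes_written, bytes_received).
fn write_once_then_shutdown(total: usize) -> io::Result<(usize, usize)> {
    let (mut writer, mut reader) = UnixStream::pair()?;
    writer.set_nonblocking(true)?;

    let payload = vec![0u8; total];
    // One write attempt only: the kernel takes what fits in its buffer.
    let written = match writer.write(&payload) {
        Ok(n) => n,
        Err(e) if e.kind() == ErrorKind::WouldBlock => 0,
        Err(e) => return Err(e),
    };

    // Counterpart of shutdown(fd, SHUT_WR): no more data will follow,
    // even though most of `payload` was never handed to the kernel.
    writer.shutdown(Shutdown::Write)?;

    // The reader drains what was buffered and then sees end-of-file.
    let mut received = Vec::new();
    reader.read_to_end(&mut received)?;
    Ok((written, received.len()))
}
```

With a payload larger than the socket buffer, the reader gets exactly the bytes accepted by that one write and then a clean end-of-file. From the reader's point of view, nothing distinguishes this from a sender that simply had less data to send.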
The race condition
The failing requests confirmed that the bug was a race condition that triggered intermittently. Whether a request succeeded or failed depended on whether the flush and shutdown operations overlapped, which changed from request to request.
When the buffer was still full at the exact moment that hyper decided the connection was finished, data was lost. When the reader consumes slower than hyper writes, the outbound buffer fills up. If hyper shuts down the connection before the buffer drains, then only a fraction of the response makes it to the intermediary; this incomplete data gets forwarded back to the Workers runtime and the client.
The December rearchitecture didn't introduce this bug, which had been present in hyper for years across multiple major versions. But the new intermediary changed who was reading on the response side of the socket. Our working theory is that FL, the previous intermediary, consumed data fast enough that the socket buffer rarely filled during a response. The new reader read at a pace that occasionally let the buffer fill during larger responses. These few milliseconds of backpressure, introduced by an improvement that made everything else faster, were all it took to surface a flaw that had been hiding in plain sight.
The bug in dispatch.rs
Hyper's HTTP/1 connection lifecycle is driven by a state machine in a file called dispatch.rs. It runs a loop that reads requests, writes responses, flushes the write buffer to the socket, and decides when to shut down. In simplified form:
fn poll_loop(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), ()>> {
    loop {
        let _ = self.poll_read(cx)?;
        let _ = self.poll_write(cx)?;
        let _ = self.poll_flush(cx)?;
        if !self.conn.wants_read_again() {
            return Poll::Ready(Ok(()));
        }
    }
}
The let _ before poll_flush is where the bug lives. In Rust, let _ = expr discards the expression's result. The ? still propagates errors, but whatever remains is thrown away, including Poll::Pending, the signal that the flush isn't done yet. The flush might still have megabytes sitting in its buffer, but the loop never finds out.
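To make the effect of the discarded value concrete, here is a toy model of the pattern. It is a sketch under stated assumptions, not hyper's implementation and not the actual upstream fix: `ToyConn`, its fields, and both `finish_*` methods are hypothetical, and the socket is reduced to a counter that accepts a fixed number of bytes per flush attempt.

```rust
use std::task::Poll;

/// Toy stand-in for a connection (hypothetical, not hyper's type).
struct ToyConn {
    pending: usize,     // bytes still in the internal write buffer
    socket_room: usize, // bytes the socket accepts per flush attempt
    delivered: usize,   // bytes handed to the socket so far
    shut_down: bool,    // whether shutdown has been issued
}

impl ToyConn {
    fn new(pending: usize, socket_room: usize) -> Self {
        ToyConn { pending, socket_room, delivered: 0, shut_down: false }
    }

    /// Moves as much as the socket accepts; Pending means bytes remain.
    fn poll_flush(&mut self) -> Poll<Result<(), ()>> {
        let n = self.pending.min(self.socket_room);
        self.pending -= n;
        self.delivered += n;
        if self.pending == 0 { Poll::Ready(Ok(())) } else { Poll::Pending }
    }

    /// Same shape as the simplified loop: the flush result is discarded,
    /// so shutdown happens even when bytes are still buffered.
    fn finish_discarding(&mut self) -> Poll<Result<(), ()>> {
        let _ = self.poll_flush()?; // Poll::Pending is thrown away here
        self.shut_down = true;
        Poll::Ready(Ok(()))
    }

    /// One way to honor the signal in this toy: return Pending and let
    /// the caller poll again before shutting down.
    fn finish_checked(&mut self) -> Poll<Result<(), ()>> {
        if self.poll_flush()?.is_pending() {
            return Poll::Pending;
        }
        self.shut_down = true;
        Poll::Ready(Ok(()))
    }
}
```

Starting from `ToyConn::new(14_991_808, 219_264)`, a single call to `finish_discarding` marks the connection as shut down with only 219,264 bytes delivered. Calling `finish_checked` repeatedly until it stops returning `Poll::Pending` delivers all 14,991,808 bytes before the shutdown.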
When a request fails, this is the exact sequence of events: The Images service finishes encoding the image and hands the entire response to hyper.