Coherent vs Streaming DMA: A Deep Dive into the Linux DMA Mapping API
The Linux DMA mapping API exists to solve two problems for driver authors: translating a CPU buffer into a bus address the device can use, and inserting the cache maintenance operations needed on non-coherent architectures.
Coherent mappings (dma_alloc_coherent) are for small, long-lived control structures accessed by both CPU and device without explicit syncing, while streaming mappings (dma_map_single, dma_map_page, dma_map_sg) are for bulk data mapped just before a transfer and unmapped after, with explicit sync calls if the CPU touches the buffer in between.
In the Linux 7.x source, every mapping with no custom DMA ops and no IOMMU takes the dma-direct fast path through dma_map_phys(), and CONFIG_DMA_API_DEBUG can validate that a driver's map and unmap calls are used correctly.
If you write drivers for embedded Linux, the DMA mapping API is one of the interfaces you cannot avoid for long. The moment a device moves data into or out of memory on its own, without the CPU copying each byte, your driver has to tell the kernel how that memory should be prepared. Get it wrong and the symptoms are some of the hardest to debug in kernel work: data that is correct on a desktop x86 board but corrupted on an ARM target, or a buffer that reads back stale values only under load.
This Deep Dive comes in two parts. First it covers the concepts and rules every driver author needs: coherent versus streaming mappings, the DMA mask, directions, and syncing. Then it goes one layer down and traces the kernel source that implements them, from the dispatch in kernel/dma/mapping.c to the arm64 cache hooks and the CONFIG_DMA_API_DEBUG facility. The source shown is from Linux 7.1; the 7.x series reworked the DMA core to be physical-address based.
Three Kinds of Addresses
The first source of confusion is that DMA involves three different address spaces, and they are not interchangeable. The kernel works with virtual addresses, the kind returned by kmalloc() and stored in a void *. The memory management unit translates those into CPU physical addresses, the values you see in /proc/iomem. A device, however, sees a third kind of address called a bus address.
On simple systems the bus address equals the physical address, but when an IOMMU or a host bridge sits between the device and memory, the two diverge. This matters because a device performing DMA uses bus addresses, and it has no access to the CPU's virtual memory system. You cannot hand a device a pointer from kmalloc() and expect it to work.
The job of the DMA mapping API is to take a buffer the CPU can see and return a dma_addr_t value the device can use, setting up any IOMMU translation along the way. Every driver that touches DMA must include the header that defines this type:
#include <linux/dma-mapping.h>
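To make the difference between a physical address and a bus address concrete, here is a small userspace sketch (not kernel code). It assumes a hypothetical host bridge that exposes one window of CPU physical memory to the device at a fixed offset. The names struct toy_dma_window and toy_phys_to_bus() are invented for illustration; a real driver never does this arithmetic itself, because the DMA mapping API does it on the driver's behalf.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one window a host bridge exposes to a device. */
struct toy_dma_window {
    uint64_t cpu_start; /* first CPU physical address in the window */
    uint64_t bus_start; /* address the device uses for that same byte */
    uint64_t size;      /* window length in bytes */
};

/*
 * Translate a CPU physical address into the bus address the device would
 * use. Returns false when the address lies outside the window, meaning
 * the device cannot reach it at all.
 */
static bool toy_phys_to_bus(const struct toy_dma_window *w, uint64_t phys,
                            uint64_t *bus)
{
    if (phys < w->cpu_start || phys - w->cpu_start >= w->size)
        return false;
    *bus = w->bus_start + (phys - w->cpu_start);
    return true;
}
```

With a window that maps physical 0x80000000 to bus address 0, physical 0x80001000 becomes bus address 0x1000: the same byte, but a different number on each side of the bridge.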
Why the DMA Mapping API Exists
Beyond address translation, the API solves a second problem: cache coherency. Many embedded SoCs have CPU caches that are not kept coherent with DMA traffic. If the CPU writes a buffer, the data may still be sitting in the cache when the device reads main memory, so the device sees old contents. In the other direction, the device writes main memory while the CPU still holds a cached copy, so the CPU reads stale data.
The DMA mapping API is the single place where the kernel inserts the cache maintenance operations needed to handle this, in an architecture-independent way. On a fully coherent platform those operations compile down to almost nothing; on a non-coherent ARM board they become real cache flushes and invalidations. Your driver code stays the same either way.
Tell the Kernel Your Addressing Limits
Before mapping anything, a driver must declare how many address bits the device can drive. By default the kernel assumes 32-bit DMA addressing. You change that with a single call that covers both the streaming and coherent interfaces:
if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) {
    dev_warn(dev, "No suitable 64-bit DMA available\n");
    /* fall back or refuse to probe */
}
The kernel saves this mask and uses it later when it allocates DMA addresses, so it never hands the device an address it cannot reach. Note that dma_set_mask_and_coherent() will not fail for masks of 32 bits or larger, so the check above is defensive rather than a branch you should expect to take. The common pattern is to set 64 bits when the device supports it and 32 bits otherwise, rather than trying a 64-bit call and falling back to 32.
If the device has different limits for descriptors and for data, you can set the streaming and coherent masks separately with dma_set_mask() and dma_set_coherent_mask().
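As a rough illustration of what a mask means, the userspace sketch below reuses the kernel's DMA_BIT_MASK() definition and adds a hypothetical helper, toy_mask_covers(), that checks whether the last byte of a buffer still fits under the mask. The helper is an assumption made for this example and is not a kernel function; the kernel performs its own version of this check internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same definition as the kernel's DMA_BIT_MASK() in <linux/dma-mapping.h>. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: true if every byte of [addr, addr + size) can be
 * addressed by a device limited to `mask`.
 */
static bool toy_mask_covers(uint64_t mask, uint64_t addr, uint64_t size)
{
    uint64_t last;

    if (size == 0)
        return false;
    last = addr + size - 1;
    if (last < addr)        /* address range wrapped around */
        return false;
    return last <= mask;
}
```

A 4 KiB buffer ending exactly at 0xffffffff fits a 32-bit mask, while the same buffer one byte longer does not; that boundary is what an honest mask communicates to the kernel.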
Coherent Mappings: Allocate Once, Keep for the Device's Lifetime
A coherent mapping is memory for which a write by either the CPU or the device is immediately visible to the other, with no explicit flushing in your driver. Think of it as synchronous. You allocate it once, usually at probe time, and free it at removal. The classic uses are control structures the device polls continuously: network card ring descriptors, command mailboxes, or firmware microcode run out of main memory.
dma_addr_t dma_handle;
void *cpu_addr;
cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cpu_addr)
    return -ENOMEM;
The call returns two things: a CPU virtual address you use to read and write the buffer, and a dma_handle of type dma_addr_t that you program into the device. When you are done, release both with the matching free call:
dma_free_coherent(dev, size, cpu_addr, dma_handle);
One subtlety that surprises people: coherent does not mean the CPU stops reordering writes. If the device must see word zero of a descriptor updated before word one, you still need a write memory barrier between the two stores:
desc->word0 = address;
wmb();
desc->word1 = DESC_VALID;
dma_alloc_coherent() works in units of at least a page, so calling it for many small allocations is wasteful. The dma_pool interface acts like a kmem_cache built on top of dma_alloc_coherent(), and it understands alignment and boundary constraints that hardware queues often require.
Streaming Mappings: Map for One Transfer, Then Unmap
A streaming mapping is for memory the CPU already owns, which you want to hand to the device for a single transfer and then take back. Think of it as asynchronous, outside the coherency domain. Network packets being transmitted or received, and filesystem buffers, are the standard examples.
You map a buffer just before the transfer and unmap it as soon as the device signals completion:
dma_addr_t dma_handle;
dma_handle = dma_map_single(dev, addr, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma_handle))
    goto map_error;
/* program dma_handle into the device, start the transfer */
dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);
The direction argument is not optional decoration. DMA_TO_DEVICE means memory to device, DMA_FROM_DEVICE means device to memory, and DMA_BIDIRECTIONAL covers both at a possible performance cost. The kernel uses the direction to decide which cache operations to perform, so specify it as precisely as you can. DMA_NONE exists only as a debugging placeholder.
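One way to think about the direction is as a statement of which side produces the data. The userspace sketch below copies the enum names and values from the kernel's <linux/dma-direction.h> and adds two hypothetical helpers. They express the conceptual requirement only (data the device reads must have left the CPU cache; data the device writes must not be hidden behind a stale cache line) and do not describe what any particular architecture actually executes.

```c
#include <stdbool.h>

/* Same names and values as enum dma_data_direction in the kernel. */
enum dma_data_direction {
    DMA_BIDIRECTIONAL = 0,
    DMA_TO_DEVICE = 1,
    DMA_FROM_DEVICE = 2,
    DMA_NONE = 3,
};

/*
 * Hypothetical: the device will read memory, so CPU writes still sitting
 * in the cache must reach memory before the transfer starts.
 */
static bool toy_device_reads(enum dma_data_direction dir)
{
    return dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL;
}

/*
 * Hypothetical: the device will write memory, so the CPU must not read
 * an old cached copy once the transfer has finished.
 */
static bool toy_device_writes(enum dma_data_direction dir)
{
    return dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL;
}
```

DMA_BIDIRECTIONAL answers yes to both questions, which is where its possible performance cost comes from.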
Two rules are easy to miss. First, always check dma_mapping_error() on the returned address; mapping can fail when DMA address space is exhausted or an IOMMU mapping cannot be created, and using an unchecked address can lead to silent corruption. Second, never use the CPU buffer while it is mapped for the device. The buffer belongs to the device between map and unmap.
The same applies to dma_map_page(), which takes a page and offset instead of a CPU pointer so it can map HIGHMEM memory, and to dma_map_sg() for scatter-gather lists.
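The ownership rule can be pictured as a tiny state machine. The following userspace sketch is a bookkeeping model only: every name in it (toy_mapping, toy_map(), toy_cpu_access(), toy_unmap()) is hypothetical, and it is not how the kernel or CONFIG_DMA_API_DEBUG is implemented. It counts two kinds of mistakes: the CPU touching a buffer that currently belongs to the device, and an unmap whose size does not match the map.

```c
#include <stddef.h>

enum toy_owner { TOY_OWNER_CPU, TOY_OWNER_DEVICE };

/* Hypothetical per-buffer bookkeeping. */
struct toy_mapping {
    enum toy_owner owner;    /* who may touch the buffer right now */
    size_t mapped_size;      /* size passed to toy_map() */
    unsigned int violations; /* number of rule breaks observed */
};

/* Models dma_map_single(): ownership passes to the device. */
static void toy_map(struct toy_mapping *m, size_t size)
{
    m->owner = TOY_OWNER_DEVICE;
    m->mapped_size = size;
}

/* Models the CPU reading or writing the buffer. */
static void toy_cpu_access(struct toy_mapping *m)
{
    if (m->owner == TOY_OWNER_DEVICE)
        m->violations++;     /* buffer belongs to the device */
}

/* Models dma_unmap_single(): size must match, ownership returns to the CPU. */
static void toy_unmap(struct toy_mapping *m, size_t size)
{
    if (m->owner != TOY_OWNER_DEVICE || size != m->mapped_size)
        m->violations++;     /* never mapped, or mismatched size */
    m->owner = TOY_OWNER_CPU;
    m->mapped_size = 0;
}
```

A correct driver walks through map, device transfer, unmap, CPU access in that order and leaves the violation counter at zero.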
Synchronising a Buffer You Reuse
Sometimes you need the CPU to look at a streaming buffer between transfers without fully unmapping it. That is what the sync calls are for. Before the CPU reads a buffer the device just wrote, give ownership back to the CPU; before handing it to the device again, return ownership to the device:
dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
/* CPU may now safely read the buffer */
dma_sync_single_for_device(dev, dma_handle, size, DMA_FROM_DEVICE);
/* device may now use the buffer again */
If you never touch the data between dma_map_*() and dma_unmap_*(), you do not need the sync calls at all. They exist precisely for the reuse case, and skipping them on a non-coherent platform is a frequent cause of intermittent corruption.
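To see why a missing sync produces stale data, consider the toy model below. It is a userspace sketch, not kernel code: it assumes a single word of memory, a write-back CPU cache that the device bypasses, and invented toy_* functions that stand in for the effect of the sync calls on a non-coherent platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy one-word system: RAM plus a CPU cache the device cannot see. */
struct toy_system {
    uint32_t ram;     /* what the device reads and writes */
    uint32_t cache;   /* the CPU's cached copy */
    bool cache_valid; /* the CPU holds a copy */
    bool cache_dirty; /* that copy is newer than RAM */
};

static uint32_t toy_cpu_read(struct toy_system *s)
{
    if (!s->cache_valid) {      /* miss: fill the cache from RAM */
        s->cache = s->ram;
        s->cache_valid = true;
        s->cache_dirty = false;
    }
    return s->cache;            /* hit: RAM is not consulted */
}

static void toy_cpu_write(struct toy_system *s, uint32_t v)
{
    s->cache = v;               /* write-back: RAM is not updated yet */
    s->cache_valid = true;
    s->cache_dirty = true;
}

static void toy_device_write(struct toy_system *s, uint32_t v)
{
    s->ram = v;                 /* DMA goes straight to RAM */
}

static uint32_t toy_device_read(const struct toy_system *s)
{
    return s->ram;
}

/* Stand-in for the effect of syncing for the device: write back dirty data. */
static void toy_sync_for_device(struct toy_system *s)
{
    if (s->cache_valid && s->cache_dirty) {
        s->ram = s->cache;
        s->cache_dirty = false;
    }
}

/* Stand-in for the effect of syncing for the CPU: drop the cached copy. */
static void toy_sync_for_cpu(struct toy_system *s)
{
    s->cache_valid = false;
    s->cache_dirty = false;
}
```

In this model, a CPU read after toy_device_write() keeps returning the old value until toy_sync_for_cpu() runs, and the device keeps reading the old value after toy_cpu_write() until toy_sync_for_device() runs. Those are the two failure modes the real sync calls exist to prevent.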
Alignment and Cache Lines
One rule deserves special attention on embedded targets. You may DMA to memory from kmalloc() or the page allocator, but not from vmalloc() memory, kernel stack, or static (data, text, bss) addresses. On a CPU with DMA-incoherent caches, a DMA buffer must also not share a cache line with other data, or a CPU write to one word and a DMA write to a neighbouring word in the same line can overwrite each other.
Architectures set ARCH_DMA_MINALIGN so that kmalloc() buffers are aligned safely, but if you embed a DMA buffer inside a larger structure next to fields the CPU writes, you are responsible for keeping them on separate cache lines.
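The layout problem is easy to check with offsetof(). The userspace sketch below assumes a 128-byte minimum alignment (TOY_DMA_MINALIGN is a made-up stand-in; the real value comes from ARCH_DMA_MINALIGN on your architecture) and uses C11 alignas where kernel code would typically write __aligned(ARCH_DMA_MINALIGN). The struct and the toy_share_line() helper are hypothetical.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration only. */
#define TOY_DMA_MINALIGN 128

/* A DMA buffer embedded between two fields the CPU writes. */
struct toy_rx_ctx {
    uint32_t cpu_counter;                          /* CPU-written */
    alignas(TOY_DMA_MINALIGN) uint8_t dma_buf[64]; /* device-written */
    alignas(TOY_DMA_MINALIGN) uint32_t cpu_flags;  /* CPU-written */
};

/* True if byte ranges [a, a + a_len) and [b, b + b_len) touch a common line. */
static bool toy_share_line(size_t a, size_t a_len, size_t b, size_t b_len)
{
    size_t a_first = a / TOY_DMA_MINALIGN;
    size_t a_last = (a + a_len - 1) / TOY_DMA_MINALIGN;
    size_t b_first = b / TOY_DMA_MINALIGN;
    size_t b_last = (b + b_len - 1) / TOY_DMA_MINALIGN;

    return a_first <= b_last && b_first <= a_last;
}
```

Aligning both the buffer and the field that follows it matters: aligning only dma_buf would keep cpu_counter away from it, but cpu_flags would still land in the buffer's final cache line.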
Inside the DMA Mapping API: Three Back Ends
Everything above is the contract your driver works to, and it is stable across kernel versions. The implementation underneath is not. The 7.x series reworked the DMA core to be physical-address based: the internal entry point that performs the dispatch is now dma_map_phys() in kernel/dma/mapping.c, the dma_map_ops operation map_page was renamed map_phys, and dma_direct_map_page() became dma_direct_map_phys().
The dma_map_single() and dma_map_page() you call are unchanged; they convert your buffer to a physical address and feed dma_map_phys() underneath. Trimmed to the decision that matters, the dispatch looks like this:
dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
        enum dma_data_direction dir, unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);
    dma_addr_t addr = DMA_MAPPING_ERROR;

    if (dma_map_direct(dev, ops))
        addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
    else if (use_dma_iommu(dev))
        addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
    else if (ops->map_phys)
        addr = ops->map_phys(dev, phys, size, dir, attrs);

    debug_dma_map_phys(dev, phys, size, dir, addr, attrs);
    return addr;
}
There are still exactly three paths. The dma-direct path (dma_direct_map_phys) is the common one on most modern arm64 and x86 systems. The IOMMU path (iommu_dma_map_phys) runs when an IOMMU is managing the device. The legacy ops path (ops->map_phys) is for buses that install their own struct dma_map_ops.
The selector is dma_map_direct(), which calls a small helper:
static bool dma_go_direct(struct device *dev, dma_addr_t mask,
        const struct dma_map_ops *ops)
{
    if (use_dma_iommu(dev))
        return false;
    if (likely(!ops))
        return true;
    /* CONFIG_DMA_OPS_BYPASS mask check omitted */
    return false;
}
The key line is the if (likely(!ops)) return true; check. When a device has no custom DMA ops, the kernel takes the direct path. And whether a device has ops is decided by get_dma_ops():
static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
{
    if (dev->dma_ops)
        return dev->dma_ops;
    return get_arch_dma_ops();
}
That body is the variant compiled when CONFIG_ARCH_HAS_DMA_OPS is set. On kernels built without it (which includes typical arm64 and x86 configurations today), get_dma_ops() is a stub that simply returns NULL. A NULL ops pointer is precisely what makes dma_go_direct() return true. So on a typical embedded arm64 board with no IOMMU in the path, every mapping you make goes straight through the dma-direct layer. That is the code worth understanding well.
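The three-way decision can be condensed into a small truth table. The userspace sketch below models only the branches quoted above; the toy_dev fields and toy_pick_path() are hypothetical stand-ins for use_dma_iommu(), get_dma_ops() and ops->map_phys, and the CONFIG_DMA_OPS_BYPASS case is left out just as it is in the trimmed listing.

```c
#include <stdbool.h>

enum toy_path {
    TOY_PATH_DIRECT, /* dma_direct_map_phys() */
    TOY_PATH_IOMMU,  /* iommu_dma_map_phys() */
    TOY_PATH_OPS,    /* ops->map_phys() */
    TOY_PATH_NONE,   /* no branch taken: addr stays DMA_MAPPING_ERROR */
};

/* Hypothetical summary of the device state the dispatch looks at. */
struct toy_dev {
    bool uses_iommu;       /* stands in for use_dma_iommu(dev) */
    bool has_ops;          /* stands in for get_dma_ops(dev) != NULL */
    bool ops_has_map_phys; /* stands in for ops->map_phys != NULL */
};

static enum toy_path toy_pick_path(const struct toy_dev *d)
{
    /* dma_go_direct(): an IOMMU rules out direct, then "no ops" selects it. */
    if (!d->uses_iommu && !d->has_ops)
        return TOY_PATH_DIRECT;
    if (d->uses_iommu)
        return TOY_PATH_IOMMU;
    if (d->ops_has_map_phys)
        return TOY_PATH_OPS;
    return TOY_PATH_NONE;
}
```

The all-false case, a device with no IOMMU and no ops, is the one that reaches the dma-direct path.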
The dma-direct Fast Path
The dma-direct implementation lives in kernel/dma/direct.h and kernel/dma/direct.c. The single-buffer map is a static inline in the header, and in 7.x it takes a physical address directly rather than a page and offset. Trimmed to the normal-memory path (the source also handles MMIO and confidential-computing buffers), it is short and revealing:
static inline dma_addr_t dma_direct_map_phys(struct device *dev,
        phys_addr_t phys, size_t size, enum dma_data_direction dir,
        unsigned long attrs, bool flush)
{
    dma_addr_t dma_addr = phys_to_dma(dev, phys);

    if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
        if (is_swiotlb_active(dev))
            return swiotlb_map(dev, phys, size, dir, attrs);
        return DMA_MAPPING_ERROR;
    }

    if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
        arch_sync_dma_for_device(phys, size, dir);
        if (flush)
            arch_sync_dma_flush();
    }
    return dma_addr;
}
Read it from the top. The page-to-physical conversion that older kernels did here is gone; the caller already passes a phys_addr_t. The function turns it into a bus address with phys_to_dma(), then asks dma_capable() whether the device, given its DMA mask, can reach that address. If it cannot and swiotlb (the kernel's software bounce buffer) is active, the transfer is bounced through swiotlb_map(); otherwise the function returns DMA_MAPPING_ERROR, the value dma_mapping_error() tests for.
This closes the loop on the mask you set in part one: the mask is the input to dma_capable(), and an honest mask is what triggers bouncing instead of silent corruption when a 32-bit device is handed a high buffer.
The last lines are the cache story. If the device is not cache-coherent and the caller did not set DMA_ATTR_SKIP_CPU_SYNC, the code calls arch_sync_dma_for_device(), then, when flush is set, arch_sync_dma_flush(). That second call is new in the 7.x series: the cache maintenance and its memory barrier were split apart, so a batch of mappings can issue one barrier at the end instead of one per buffer.
On a coherent platform dev_is_dma_coherent(dev) is true and nothing happens. That single branch is the difference between a desktop x86 board where DMA "just works" and an embedded arm64 target where forgetting a sync corrupts data.
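The outcomes of the fast path can be summarised in a few lines. This userspace sketch models only the decisions in the trimmed listing above; struct toy_req and toy_direct_map() are hypothetical, each boolean stands in for the kernel predicate named in its comment, and the cache handling that the bounce path does on its own is not modelled.

```c
#include <stdbool.h>

enum toy_result { TOY_MAP_ERROR, TOY_MAP_BOUNCED, TOY_MAP_DIRECT };

/* Hypothetical inputs, one per predicate in the listing. */
struct toy_req {
    bool reachable;      /* dma_capable() */
    bool swiotlb_active; /* is_swiotlb_active() */
    bool coherent;       /* dev_is_dma_coherent() */
    bool skip_cpu_sync;  /* attrs & DMA_ATTR_SKIP_CPU_SYNC */
};

/*
 * Returns how the mapping would be satisfied and reports, through
 * did_cache_sync, whether the direct path performed cache maintenance.
 */
static enum toy_result toy_direct_map(const struct toy_req *r,
                                      bool *did_cache_sync)
{
    *did_cache_sync = false;

    if (!r->reachable)
        return r->swiotlb_active ? TOY_MAP_BOUNCED : TOY_MAP_ERROR;

    if (!r->coherent && !r->skip_cpu_sync)
        *did_cache_sync = true;
    return TOY_MAP_DIRECT;
}
```

Flipping only the coherent flag changes whether cache maintenance happens, which is the x86-versus-embedded difference described above expressed as a single boolean.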