Intel and AMD's new ACE CPU extensions bring an efficient AI-oriented instruction set to x86 - a new design makes matrix multiplication more power- and density-efficient
Running AI models on x86 CPUs is becoming easier and faster. Most of what you hear about "running an AI model" involves a GPU of some sort, but not every AI task is suited to that hardware. Smaller models or single-user, latency-sensitive operations can benefit from running on the CPU instead, as that avoids the overhead of shuffling data to and from the GPU. There are also many situations where no GPU is available to begin with, or it's a meek integrated affair with limited capabilities.
Intel and AMD have recently released the full specification for the ACE CPU extensions, which make it easier and more power-efficient to run those AI tasks on x86 processors. ACE is a technical standard that reuses the existing AVX10 registers but adds silicon dedicated to matrix multiplication. This brings multiple benefits, chief among them better power efficiency, easier development and optimization, and reuse of AVX's 512-bit inputs. That last point makes for easy integration with existing designs, since there's no need for ACE-specific inputs.
How ACE improves matrix multiplication
Matrix multiplication is the cornerstone of AI workloads: take tables of numbers and run a multiply-add loop over the whole thing. This has always been possible on almost any CPU, though at limited speed. Even today, running these loops uses a lot of power, even when leveraging x86's AVX10 multiply-accumulate instructions - something that's technically a hack, as AVX wasn't designed with 2D matrix multiplication in mind.
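To make the "multiply-add loop" concrete, here is a minimal plain-Python sketch of it. It is only an illustration of the arithmetic, not how any CPU, AVX10, or ACE implementation actually schedules the work; the function name `matmul` is ours.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m) with plain loops."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate step
            c[i][j] = acc
    return c


result = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# result == [[19.0, 22.0], [43.0, 50.0]]
```

Every output element costs k multiply-accumulates, so an n x k by k x m product needs n * k * m of them in total. That count is why hardware that does more of them per instruction matters.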
For the same number of input vectors, ACE can perform 16x as many operations as AVX10. Note that this doesn't necessarily mean a 16x speedup, as that will depend on each individual implementation, but it's reasonable to expect that Intel and AMD will dedicate more silicon to this task in future designs to improve performance. Plus, as each ACE instruction performs more work than its equivalent AVX10 loop, there's less CPU instruction overhead and potentially better RAM bandwidth usage right off the bat.
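One way to arrive at a 16x figure is sketched below. This is our assumption, not something the article spells out: we assume FP32 data (16 values per 512-bit register) and that the matrix step behaves like an outer product of two input vectors, whereas a vector multiply-accumulate works lane by lane. The function names are hypothetical.

```python
LANES = 512 // 32  # FP32 values per 512-bit register (assumed data type)


def fma_elementwise(acc, x, y):
    """Vector-style step: one multiply-accumulate per lane."""
    return [c + a * b for c, a, b in zip(acc, x, y)]


def outer_product_accumulate(tile, x, y):
    """Matrix-style step (assumed): every element of x meets every element of y."""
    return [[tile[i][j] + x[i] * y[j] for j in range(len(y))]
            for i in range(len(x))]


x = [float(i) for i in range(LANES)]
y = [1.0] * LANES

vec = fma_elementwise([0.0] * LANES, x, y)
tile = outer_product_accumulate([[0.0] * LANES for _ in range(LANES)], x, y)

vector_macs = len(vec)                     # 16 multiply-accumulates
tile_macs = sum(len(row) for row in tile)  # 256 multiply-accumulates
ratio = tile_macs // vector_macs           # 16
```

Under these assumptions, the same two input vectors feed 256 multiply-accumulates instead of 16. Narrower data types would change the lane count, so the ratio here should be read as an illustration rather than a statement about the specification.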
Benefits for developers and frameworks
The benefits go far beyond just using fewer instructions for the same thing. ACE is intended to be implementation-agnostic, meaning that ML frameworks (PyTorch, TensorFlow) and their underlying libraries can just write one code path instead of having multiple variations depending on the underlying hardware and its degree of AVX support.
ACE natively supports almost every data type used in ML operations (including but not limited to INT8, INT32, FP8, FP16, FP32, BF16), and it can also use the Open Compute Project's MX block-scaled formats natively, something that AVX10 does not provide. Developers will also be able to move some NPU-specific workloads back to the CPU when they need something done now and fast. In those situations, not having to deal with the fact that each NPU is different is a huge boon, too, as ACE offers a consistent target across x86 hardware.
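For readers unfamiliar with block-scaled formats, the idea is that a small block of values (32 elements in the OCP MX formats) shares one power-of-two scale, while each element is stored in a narrow type. The sketch below is a simplified illustration of that idea using 8-bit integer elements. It is not the exact MX encoding, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

BLOCK_SIZE = 32  # elements per shared scale in the OCP MX formats


def quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers sharing a power-of-two scale.

    Simplified: real MX formats define specific scale and element encodings.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # 127 for 8-bit elements
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent such that peak / 2**shared_exp still fits in qmax.
    shared_exp = math.ceil(math.log2(peak / qmax))
    scale = 2.0 ** shared_exp
    elems = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, elems


def dequantize_block(shared_exp, elems):
    """Recover approximate values from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]


block = [i / 10 for i in range(-16, 16)]  # 32 sample values
shared_exp, elems = quantize_block(block)
restored = dequantize_block(shared_exp, elems)
```

The block stores one exponent plus 32 small integers instead of 32 full-width floats, and each restored value is off by at most half of the shared scale. Native support would mean the hardware applies the shared scale itself, rather than software unpacking the block first.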
Community discussion
usertests asked: What is the point? Does it make sense to do this on the CPU from a TOPS/mm^2 or TOPS/Watt perspective instead of a GPU or NPU? Is it meant to squeeze out a little more performance by enlisting any unused CPU cores? Or is it complementary in another way that isn't obvious?
Reply: A CPU is a general-purpose compute unit. Over time, it has taken on tasks that were previously assigned to external units, like math processing, graphics, I/O, etc. Given the impending rise of local AI, incorporating AI-specific compute functions is inevitable, especially if x86 wants to fend off the encroachment of rival ISAs like ARM. I expect ACE to be the first of multiple iterative additions.
Inference engines like llama.cpp and ONNX Runtime (what Microsoft uses for Windows) leverage both CPU and GPU (and NPU in the case of ONNX) for AI use. "AI" is not exclusively GPU-bound. The NPU will stick around because of Microsoft's Copilot+ mandate - to wit, the recent rumor of Zen 6 + NPU - iGPU. Also, an NPU is suitable for steady-state background tasks, like eye-tracking, especially when the CPU/GPU are busy with other uses, like gaming.
usertests said: I guess if they pull it off, they can kill off the dedicated NPUs and add more regular CPU cores instead. If that isn't power efficient enough, maybe they can create ACE-optimized cores that are at least vendor-agnostic.
Tech0000 asked: How do the ACE extensions relate to the already established AMX extension standard? I see a lot of overlap. Why create a new standard instead of building on and extending AMX and adding more capabilities to it? Is ACE just a way to include AMX capabilities in the AVX 10.x standard road map? Also, the article fails to mention that the very important FP4 and FP6 formats are also included in the ACE extension - that makes this even more useful and more complete.
Reply: AMX is proprietary to Intel and not used by AMD. ACE is a joint adoption.