ACE: A Shared Path to Faster Matrix Math on x86

Matrix multiplication (multiplying large grids of numbers) is the workhorse math behind neural networks and large language models (LLMs).

As AI becomes more common across PCs, workstations, and servers, the x86 ecosystem benefits when these core operations are faster, more efficient, and easier for developers to target consistently.

That’s why AMD and Intel, along with partners across the ecosystem, are collaborating through the x86 Ecosystem Advisory Group (EAG) to standardize key architectural capabilities for x86, including matrix acceleration.

At a glance

  • ACE (AI Computation Extensions) proposes new x86 instructions to accelerate matrix multiplication while integrating seamlessly with AVX10.
  • ACE uses “outer product” operations and tile registers to do more matrix math per instruction and keep intermediate results close to compute.
  • ACE supports popular AI data formats, including INT8, BF16, OCP FP8 and OCP MX formats (MXFP8 and MXINT8) with inline block scaling.
  • The x86 EAG is aligning ACE as a standardized approach to matrix multiplication capabilities across devices, from laptops to data center servers.

Why CPUs need better “matrix math” building blocks

CPUs already accelerate math using SIMD (Single Instruction, Multiple Data) vector instructions. AVX10 is the next-generation direction for x86 vectors, and it can be used for matrix multiplication, but its scaling and compute density can be limited for today’s AI workloads. ACE is designed to raise that ceiling while working seamlessly alongside AVX10.

ACE in plain English: build a 2D patch of results at once

Traditional vector approaches often compute matrix results in one-dimensional chunks, such as one row of dot products at a time. ACE introduces an outer-product operation that accumulates into a two-dimensional tile, effectively building a patch of the output matrix with each instruction. That 2D accumulation is where the compute density improves.
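To make that concrete, here is a minimal scalar sketch in C (illustrative only: ACE’s actual instructions and encodings have not been published, and the tile dimensions below are placeholders):

```c
#include <stddef.h>

/* Illustrative only: a scalar model of one outer-product step.
 * One operation takes a column slice of A and a row slice of B and
 * accumulates their outer product into a 2D tile of C. */
#define TILE_M 16
#define TILE_N 16

void outer_product_accumulate(float c[TILE_M][TILE_N],
                              const float a_col[TILE_M],
                              const float b_row[TILE_N])
{
    /* Every (i, j) element of the tile gets a_col[i] * b_row[j] added,
     * so a single operation updates a whole 2D patch of the result. */
    for (size_t i = 0; i < TILE_M; i++)
        for (size_t j = 0; j < TILE_N; j++)
            c[i][j] += a_col[i] * b_row[j];
}
```

A full matrix multiply loops this rank-1 update over the shared K dimension. Contrast that with a dot-product formulation, which fills one output element, or one row of them, per pass over the inputs.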

ACE defines eight tile registers (each 16 rows of 512 bits, or 1 KB per tile) plus a block scale register to support block scaling with certain low-precision formats.

If you’ve used Intel AMX, the tile concept will feel familiar: AMX introduced a tile-based programming model for accelerating matrix operations.

To reduce platform friction, ACE is designed to be exposed to software as a new “palette” under the AMX accelerator framework, helping reuse much of the system programming model and operating system support for tile state.
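That reuse is already visible in today’s software stack. On Linux, for example, an application must request permission to use AMX tile state before executing tile instructions. The sketch below shows that existing flow; whether an ACE palette would ride on this exact mechanism is an assumption here, not a published detail:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Today's Linux enablement flow for Intel AMX tile state. Whether an
 * ACE palette would reuse this exact mechanism is an assumption based
 * on ACE being exposed under the AMX framework. */
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

int request_tile_permission(void)
{
    /* Ask the kernel for permission to use tile data state; the kernel
     * then sizes the thread's XSAVE area to hold the large tile registers. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
        perror("arch_prctl(ARCH_REQ_XCOMP_PERM)");
        return -1;
    }
    return 0;
}
```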

Why ACE pairs well with AVX10

ACE intentionally uses AVX10 vector registers as inputs. That means software can use AVX10 to prep and format data “just in time” before matrix operations, and then efficiently move data between vector and tile domains for surrounding work (layout transforms, conversions, and post-processing).
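As a concrete example of that “just in time” prep, the existing AVX-512 BF16 conversion intrinsics (part of the converged feature set that AVX10 builds on) can down-convert FP32 operands right before they are handed to a matrix unit. A small sketch:

```c
#include <immintrin.h>

/* Down-convert 32 FP32 values to 32 BF16 values with AVX-512 BF16
 * (compile with -mavx512bf16). This is the kind of "just in time"
 * formatting the vector side can do before feeding tile operations. */
void fp32_to_bf16_x32(const float *src, __m512bh *dst)
{
    __m512 lo = _mm512_loadu_ps(src);       /* elements 0..15  */
    __m512 hi = _mm512_loadu_ps(src + 16);  /* elements 16..31 */
    /* Packs two FP32 vectors into one vector of 32 BF16 values;
     * the second argument fills the low half of the result. */
    *dst = _mm512_cvtne2ps_pbh(hi, lo);
}
```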

ACE’s eight tiles also enable blocked kernels that keep multiple output tiles live at once for better input reuse and reduced bandwidth pressure.
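Here is a scalar model of what keeping multiple output tiles live buys (placeholder tile sizes; a real kernel would use tile instructions rather than C loops):

```c
#include <stddef.h>

#define TM 16  /* placeholder tile rows    */
#define TN 16  /* placeholder tile columns */

/* Scalar model of register blocking: keep two output tiles live and
 * reuse one loaded slice of A against two slices of B. With eight
 * architectural tiles, real kernels can hold several accumulators at
 * once; this sketch only shows why that reduces input traffic. */
static void two_tile_step(float c0[TM][TN], float c1[TM][TN],
                          const float a_col[TM],
                          const float b0_row[TN], const float b1_row[TN])
{
    for (size_t i = 0; i < TM; i++) {
        float a = a_col[i];            /* loaded once ...           */
        for (size_t j = 0; j < TN; j++) {
            c0[i][j] += a * b0_row[j]; /* ... used for tile 0       */
            c1[i][j] += a * b1_row[j]; /* ... and reused for tile 1 */
        }
    }
}
```

Loading the A slice once and using it against two B slices halves the A traffic for those two tiles; with more live accumulators, the reuse grows further.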

Supporting the formats AI uses today (and the efficiency they demand)

AI performance and efficiency increasingly depend on low-precision formats. ACE v1 supports INT8 and BF16 and, notably, adds native support for OCP FP8 and the OCP MX formats (MXFP8 and MXINT8) with inline block scaling.

In broad terms, block scaling lets groups of small values share a scale factor, helping preserve useful numeric range while reducing memory and bandwidth demands.
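As a rough numeric sketch of the decode side, following the OCP MX layout of 32 elements per block with a shared power-of-two (E8M0) scale (element-format details simplified):

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define MX_BLOCK 32  /* OCP MX: 32 elements share one scale factor */

/* Decode one MX block: a shared 8-bit power-of-two scale (E8M0)
 * applied to 32 narrow elements. Element payloads are assumed to be
 * pre-expanded to float, since that step depends on the element
 * format (FP8 vs. INT8). */
void mx_block_decode(float out[MX_BLOCK],
                     const float elems[MX_BLOCK],
                     uint8_t scale_e8m0)
{
    /* E8M0 encodes 2^(e - 127); the all-ones encoding is reserved. */
    float scale = ldexpf(1.0f, (int)scale_e8m0 - 127);
    for (size_t i = 0; i < MX_BLOCK; i++)
        out[i] = scale * elems[i];  /* one multiplier per block, not per value */
}
```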

Standardizing through the x86 Ecosystem Advisory Group

The x86 EAG was launched in October 2024 to improve compatibility, predictability, and consistency across x86 processor-based products through standardized, developer-friendly features.

In its first year, the EAG highlighted ACE as a standardized approach to matrix multiplication capabilities across devices ranging from laptops to data center servers.

What this means for the enterprise: portability and operational simplicity

For CIOs and enterprise IT leaders, ACE opens a path to portable AI performance on CPUs across your fleet. By aligning matrix acceleration through the x86 Ecosystem Advisory Group, AMD and Intel are working toward consistent, standardized capabilities that software can depend on from laptops to data center servers. That helps ISVs and internal teams reduce vendor-specific code forks, simplify validation, and keep a single deployment playbook across on-prem and cloud environments.

In practice, this can mean faster rollouts of AI-enabled features, fewer “special hardware” exceptions, and clearer lifecycle planning as platforms refresh.

When combined with mainstream toolchains and libraries, standardized acceleration supports predictable performance improvements while preserving x86 compatibility and operational stability. This reduces risk and helps control cost as AI scales.

What comes next

ACE is a hardware + software story. Enablement work is underway across compilers, debuggers, and profilers, with planned integration into optimized kernels, libraries, and ML frameworks such as PyTorch and TensorFlow.

The goal is straightforward: deliver faster, more efficient matrix math on CPUs—while keeping the x86 developer experience consistent across the ecosystem.