APX – Revitalizing the x86 General-Purpose Instruction Set

The x86 architecture powers data centers and personal computers around the world. Since its introduction in 1978, x86 has continuously evolved to take advantage of emerging workloads and the relentless pace of Moore’s law (the observation that the number of transistors in an integrated circuit doubles roughly every two years). The original instruction set defined eight 16-bit general-purpose registers, which doubled in number and quadrupled in size over time. The AVX extensions subsequently added thirty-two 512-bit vector registers for SIMD routines, and most recently, the x86 Ecosystem Advisory Group has been actively defining a unified instruction set specification, AI Compute Extensions (ACE), for two-dimensional matrix registers to enable the acceleration of AI workloads.

Intel and AMD are now working together on another major step in the evolution of x86. Advanced Performance Extensions (APX) expands the entire x86 instruction set with access to more registers and adds new features that improve general-purpose performance, further strengthening x86’s standing as a compiler target. The extensions provide efficient performance gains across a variety of workloads without significantly increasing the silicon area or power consumption of the core.

APX doubles the number of general-purpose registers (GPRs) from 16 to 32, allowing the compiler to keep more values in registers. As a result, code compiled with APX executes 10% fewer loads and 20% fewer stores than the same code compiled for an x86-64 target [1]. Register accesses are not only faster, but they also consume significantly less dynamic power than load and store operations. Compiler enabling is straightforward: legacy integer and AVX instructions gain access to the new registers through extended instruction encodings. The legacy integer instructions additionally gain non-destructive forms that reduce the need for compilers to make copies of source values.
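As a rough sketch (Intel syntax; the three-operand mnemonic and the r16–r31 register names follow Intel's published APX specification rather than anything shown in this article), the new registers and non-destructive forms eliminate the copy that legacy two-operand encodings force:

```asm
; Legacy x86-64: ADD overwrites its first operand, so preserving
; both source values requires an extra register-to-register copy.
mov  rax, rcx          ; copy rcx, since the ADD below will clobber it
add  rax, rdx          ; rax = rcx + rdx  (two instructions)

; APX: a non-destructive, three-operand form names a separate
; destination, and the extended GPRs r16-r31 give the register
; allocator sixteen more places to keep values live.
add  r16, rcx, rdx     ; r16 = rcx + rdx  (one instruction, sources intact)
```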
Code density is expected to be similar to existing binaries, as any increase in average instruction length is balanced by the significantly lower instruction count of APX-compiled code. APX was designed to be sensitive to existing OS and application binary interface (ABI) expectations, minimizing deployment challenges. The new GPRs are defined as caller-saved (volatile) in ABIs, facilitating interoperability with legacy binaries, and optimized calling conventions can be introduced where legacy compatibility requirements are relaxed. Generally, more register state will need to be managed at function boundaries. To reduce the associated overhead, APX adds PUSH2/POP2 instructions that transfer two register values within a single memory operation; implementations may accelerate short stack sequences by matching PUSH2 and POP2 instructions and bypassing memory.

Beyond the larger architectural register file, APX introduces features to mitigate hard-to-predict, data-dependent branches. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads, and branch-predictor improvements can help only to a limited extent because data-dependent branches are fundamentally hard to predict. To address this growing performance issue, APX significantly expands the conditional instruction set of x86, first introduced in 1995 in the form of the CMOV/SET instructions. Today’s compilers use these instructions quite extensively, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions). APX adds conditional forms of load, store, and compare/test instructions and includes an option for the compiler to suppress the status-flag writes of common instructions.
These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties.

Application developers can take advantage of APX by simple recompilation; source code changes are not expected to be needed. Software migration can be done progressively, and existing applications, libraries, and functions will interoperate. Workloads written in dynamic languages will automatically benefit as soon as the underlying runtime system has been enabled.

APX demonstrates the advantage of the variable-length instruction encodings of x86: new features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware. This flexibility has allowed the x86 architecture to adapt and flourish over four decades of rapid advances in computing, and it enables the innovations that will keep it thriving into the future.
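To make the if-conversion idea concrete, here is a hedged sketch (the CFCMOVcc conditional-load mnemonic and the {nf} no-flags prefix come from Intel's APX documentation, not from this article, and the exact semantics may differ) of a data-dependent branch rewritten without a branch:

```asm
; Branchy original:  if (x < limit) total += *p;
; The jump is taken or not depending on data, so it can mispredict.
        cmp     rdi, rsi        ; x < limit ?
        jge     done
        add     rax, [rbx]      ; total += *p
done:

; If-converted with APX conditional instructions: no branch at all.
        xor     r16d, r16d      ; default addend of 0
        cmp     rdi, rsi
        cfcmovl r16, [rbx]      ; conditional load: memory is accessed (and
                                ; may fault) only when the condition holds
        {nf} add rax, r16       ; {nf} suppresses ADD's flag writes, keeping
                                ; the CMP result live for later conditional use
```

The conditional load is the key enabler here: a plain CMOV from memory would dereference the pointer even on the not-taken path, which is exactly what prevents compilers from if-converting such code today.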

[1] This projection is based on a prototype simulation of the SPEC CPU® 2017 Integer benchmark suite. SPEC®, SPECrate®, and SPEC CPU® are registered trademarks of the Standard Performance Evaluation Corporation.