Last week, Arm showed off the inner workings of its new Machine Learning Processor design, but that is not all the company had waiting in the wings. Arm is also detailing more cutting-edge technology in the form of its Cortex-A76 CPU, Mali-G76 GPU, and Mali-V76 VPU designs. All three designs are slated to be heavy hitters in their respective categories, and thanks to some on-campus, in-depth briefings earlier this month, we have all of the details to share…
The Cortex-A76 represents Arm’s most radical overhaul over previous designs. The A76’s brand-new architecture brings decisive improvements to power and efficiency. Arm’s processor engineers worked with a design goal of outperforming competing designs, but with half the area and power. This philosophy is critical in the mobile space, where both power budgets and physical space are highly constrained.
For the A76’s architecture revamp, careful attention was given to reducing latency and eliminating bandwidth bottlenecks wherever possible. One of Arm’s largest targets is eliminating spare or wasted cycles. Spare cycles most often creep in when the processor is unable to retrieve the correct data from memory quickly enough.
One method Arm is employing to combat spare cycles is to decouple the A76’s branch prediction from the instruction fetcher. Branch predictors “read ahead” and attempt to guess which way a code path will jump at a conditional. The branch predictor is now tuned to operate at double the fetch unit’s rate. This may seem like a peculiar arrangement, but Arm reasons that this helps disguise prediction misses by ensuring the fetch unit always has its queue filled. It is more power efficient to burn cycles at the branch predictor when instructions are being fed correctly than to lose cycles through the entire core when a miss occurs.
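To make the run-ahead idea concrete, here is a toy queue model. It is only a sketch and not Arm's actual design: the function name, the queue depth, and the periodic "bubble" (a cycle in which the predictor produces nothing, standing in for any hiccup on the prediction side) are all our own assumptions. The fetch side drains one block per cycle and we count how often it finds the queue empty.

```python
from collections import deque


def fetch_stall_cycles(total_cycles, predictor_rate, bubble_every, queue_capacity=16):
    """Count cycles in which the fetch side finds its queue empty.

    Hypothetical toy model, not Arm's implementation:
      predictor_rate -- fetch blocks the predictor can queue per cycle
      bubble_every   -- the predictor produces nothing on every Nth cycle
      queue_capacity -- assumed depth of the queue between predictor and fetch
    """
    queue = deque()
    stalls = 0
    next_block = 0
    for cycle in range(1, total_cycles + 1):
        # Predictor side: skip production on bubble cycles.
        if cycle % bubble_every != 0:
            for _ in range(predictor_rate):
                if len(queue) < queue_capacity:
                    queue.append(next_block)
                    next_block += 1
        # Fetch side: consumes one block per cycle if one is available.
        if queue:
            queue.popleft()
        else:
            stalls += 1
    return stalls


# Predictor matched to fetch: every bubble starves the fetch side.
matched = fetch_stall_cycles(total_cycles=100, predictor_rate=1, bubble_every=4)
# Predictor at double the fetch rate: the queue absorbs the bubbles.
doubled = fetch_stall_cycles(total_cycles=100, predictor_rate=2, bubble_every=4)
print(matched, doubled)
```

In this simplified model the matched-rate predictor leaves fetch idle on every bubble, while the double-rate predictor builds up enough slack in the queue that fetch never waits.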
The A76’s execution core consists of a branch unit, two simple ALUs, and a combination simple and multi-cycle ALU for integer workloads. The execution core is upgraded to dual 128-bit ASIMD/FP pipelines to provide double the bandwidth of previous Arm CPUs. This ASIMD bump contributes significantly towards the A76’s near-4x uplift in machine learning performance over the previous-gen A75.
Arm also provided some interesting cache metrics. Arm’s goal here is to deliver a perfect cache-hit ratio, as cache misses incur latency penalties. The A76 can sustain up to 20 outstanding L1 misses, up to 46 L2 misses, and up to 94 L3 misses. The A76 offers 64KB of L1 cache, in both I-Cache and D-Cache flavors, 256-512KB of private L2 cache, and 2-4MB of shared L3 cache. In terms of latency, the L1 cache has a 4-cycle load-to-use (LD-use) latency, the L2 cache is 9 cycles LD-use, and the L3 cache is on the order of 26-31 cycles LD-use, so prefetcher accuracy is vital to smooth operation.
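To see why hit rates matter so much with these latencies, a back-of-the-envelope average load latency can be sketched as follows. The cycle counts come from the figures above (using the pessimistic 31-cycle end of the L3 range); the hit rates and the DRAM latency are made-up inputs, and the function itself is our own simplified model rather than anything Arm presented.

```python
# Load-to-use latencies in cycles, taken from the figures quoted in the article.
L1_LATENCY = 4
L2_LATENCY = 9
L3_LATENCY = 31  # pessimistic end of the quoted 26-31 cycle range


def average_load_latency(l1_hit, l2_hit, l3_hit, dram_latency):
    """Expected load-to-use latency in cycles for a simple cache hierarchy.

    Each *_hit argument is a local hit rate: the fraction of accesses that
    reach that level and hit there. dram_latency is an assumed value in
    cycles; the article does not quote one.
    """
    for rate in (l1_hit, l2_hit, l3_hit):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    miss_l1 = 1.0 - l1_hit
    miss_l2 = 1.0 - l2_hit
    miss_l3 = 1.0 - l3_hit
    return (
        l1_hit * L1_LATENCY
        + miss_l1 * l2_hit * L2_LATENCY
        + miss_l1 * miss_l2 * l3_hit * L3_LATENCY
        + miss_l1 * miss_l2 * miss_l3 * dram_latency
    )


# Illustrative inputs only: 95% L1, 80% L2, 50% L3 hit rates, 200-cycle DRAM.
print(average_load_latency(0.95, 0.80, 0.50, dram_latency=200))
```

Even with these fairly generous hypothetical hit rates, the small fraction of loads that fall through to memory accounts for a noticeable share of the average, which is the case the prefetchers are meant to head off.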
Arm projects 35% more performance than a Cortex-A75 core while also maintaining 40% better power efficiency. We will note this comparison pits a 7nm A76 clocked at 3.0GHz against a 10nm A75 at 2.8GHz, but even so, the die shrink and frequency bump alone do not explain all of the A76’s gains. In iso-process and frequency comparisons, the A76 still affords a 25% uptick in integer IPC (SPECINT), a 35% improvement in ASIMD/FP performance (SPECFP), and a 90% increase in memory bandwidth (LMBench).
Arm claims laptop-class performance with the A76. While many may take this to mean something on the level of an Intel Atom core, Arm believes its A76 core can perform within 10 percent of a Skylake core with the same thermal constraints, but with approximately half the footprint. This has promising implications for the future of Arm-powered Windows notebooks, provided the cost can be kept in line. There is also the issue of translating x86 instructions for legacy applications, but Microsoft already provides pretty good development tools for Arm-native compiling, so common software can run natively.
According to Arm, process node shrinks below 16nm have not yielded significant clock speed increases. Rather, smaller process nodes primarily benefit from reduced power consumption and thermal output. This is still important for performance considerations, however, because a cooler chip can have improved sustained performance. Arm expects A76 cores to enter the market on the 7nm process for performance use cases and the 12nm process for lower cost implementations, with the possibility of 5nm process variants in the future. Target TDPs would be the same across these process nodes.
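As a quick sanity check on how the clock bump and the iso-frequency gains fit together, the arithmetic can be sketched as below. This assumes performance scales as IPC times clock frequency, which is a simplification of ours (real workloads rarely scale perfectly with clock), and it uses only the integer IPC figure.

```python
def combined_speedup(ipc_gain, old_clock_ghz, new_clock_ghz):
    """Estimated speedup assuming performance = IPC x clock (a simplification)."""
    return (1.0 + ipc_gain) * (new_clock_ghz / old_clock_ghz)


# 25% integer IPC gain, A75 at 2.8GHz versus A76 at 3.0GHz (figures from the article).
speedup = combined_speedup(0.25, old_clock_ghz=2.8, new_clock_ghz=3.0)
print(f"{(speedup - 1) * 100:.1f}% estimated uplift")  # prints "33.9% estimated uplift"
```

Under that assumption the clock increase contributes only about 7 percent, and the result lands close to Arm's 35% projection, which suggests most of the projected gain is architectural.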
The A76 core is designed for use as the “big” core(s) in Arm’s DynamIQ clusters with the venerable Cortex-A55 comprising its “LITTLE” counterparts. As with the A75 before it, DynamIQ configurations can support up to four A76 cores with up to eight A55 cores, with a total combined maximum of eight cores. While Arm anticipates high-end processors will incorporate full 4x A76 + 4x A55 configurations, many mid-range and budget designs will utilize a 1x A76 + 7x A55 or a 2x A76 + 6x A55 layout with more emphasis on power efficiency.
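The cluster limits described above can be expressed as a small check. The constants mirror the limits as stated in this article, while the function names are our own and purely illustrative.

```python
# Cluster limits as described in the article.
MAX_BIG = 4      # A76 "big" cores
MAX_LITTLE = 8   # A55 "LITTLE" cores
MAX_TOTAL = 8    # combined core count per cluster


def is_valid_cluster(big, little):
    """Return True if a big + LITTLE core mix fits within the stated limits."""
    return (
        0 <= big <= MAX_BIG
        and 0 <= little <= MAX_LITTLE
        and 1 <= big + little <= MAX_TOTAL
    )


def valid_clusters():
    """List every (big, little) combination that satisfies the limits."""
    return [
        (big, little)
        for big in range(MAX_BIG + 1)
        for little in range(MAX_LITTLE + 1)
        if is_valid_cluster(big, little)
    ]


# The three layouts mentioned in the article all pass; 4 + 5 exceeds the total.
for config in [(4, 4), (1, 7), (2, 6), (4, 5)]:
    print(config, is_valid_cluster(*config))
```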
Arm notes that the A55 core is unlikely to be replaced by a newer variant any time soon. It can scale down as smaller nodes are perfected to further improve efficiency while its role as the LITTLE core does not demand significant performance gains. That said, Arm has increased the amount of L2 cache in the A55 core when used with A76 cores.
Arm also detailed its new Mali-G76 GPU and V76 VPU designs, which we will explore on the next page…