TensorPool: A 3D-Stacked 8.4TFLOPS/4.3W Many-Core Domain-Specific Processor for AI-Native Radio Access Networks
2026-04-02 • Hardware Architecture
AI summary
The authors explore how AI can improve the physical layer of 6G networks, but note that this requires substantial computing power that is hard to deliver within tight latency and energy budgets. They propose TensorPool, a specialized processor cluster combining many small programmable cores with powerful tensor engines designed to handle AI workloads efficiently. Their design achieves much higher performance and energy efficiency than using the programmable cores alone. Additionally, they show that stacking the processor's computing blocks in 3D saves chip area without slowing down the clock. This work focuses on making AI processing in 6G faster and more energy-efficient within hardware constraints.
Keywords: 6G networks, physical layer (PHY), AI-native PHY, tensor processing, RISC-V cores, FP16 MAC units, low-latency memory, TSMC N7 process, 3D-stacked chips, energy efficiency
Authors
Marco Bertuletti, Yichao Zhang, Diyou Shen, Alessandro Vanelli-Coralli, Frank K. Gürkaynak, Luca Benini
Abstract
The upcoming integration of AI in the physical layer (PHY) of 6G radio access networks (RAN) will enable a higher quality of service in challenging transmission scenarios. However, deeply optimized AI-Native PHY models impose higher computational complexity than conventional baseband processing, challenging deployment under the sub-millisecond real-time constraints typical of modern PHYs. Additionally, with the extension to terahertz carriers, the upcoming densification of 6G cell-sites further limits the power consumption of base stations, constraining the budget available for compute ($\leq$ 100W). The flexibility needed for long-term sustainability and the imperative energy-efficiency gains on the high-throughput tensor computations that dominate AI-Native PHYs can both be achieved by domain-specialization of many-core programmable baseband processors. Following this domain-specialization strategy, we present TensorPool, a cluster of 256 RISC-V RV32IMAF programmable cores, accelerated by 16 tensor engines of 256 MACs/cycle (FP16) each, with low-latency access to 4 MiB of L1 scratchpad for maximal data reuse. Implemented in TSMC's N7 technology, TensorPool achieves 3643 MACs/cycle (89% tensor-unit utilization) on tensor operations for AI-RAN, 6$\times$ more than a core-only cluster without tensor acceleration, while simultaneously improving GOPS/W/mm$^2$ efficiency by 9.1$\times$. Furthermore, we show that 3D-stacking the computing blocks of TensorPool to better unfold the tensor-engine-to-L1-memory routing yields a 2.32$\times$ footprint improvement with no frequency degradation compared to a 2D implementation.
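A quick sanity check of how the headline figures fit together. This is a minimal sketch under two assumptions not stated explicitly in the abstract: peak throughput is 16 engines $\times$ 256 MACs/cycle, and the 8.4 TFLOPS figure counts each FP16 MAC as two FLOPs (multiply plus add); the clock frequency below is inferred, not quoted.

```python
# Back-of-the-envelope check of TensorPool's headline numbers.
# Constants are taken from the abstract; the clock is derived, not stated.
N_ENGINES = 16            # tensor engines per cluster
MACS_PER_ENGINE = 256     # FP16 MACs/cycle per engine
PEAK_MACS = N_ENGINES * MACS_PER_ENGINE   # 4096 MACs/cycle peak
ACHIEVED_MACS = 3643                      # measured on AI-RAN tensor ops

# 3643 / 4096 ~= 0.89, matching the quoted 89% tensor-unit utilization.
utilization = ACHIEVED_MACS / PEAK_MACS
print(f"tensor-unit utilization: {utilization:.0%}")

# Counting one MAC as two FLOPs, 8.4 TFLOPS implies a clock of roughly
# 8.4e12 / (2 * 4096) ~= 1.03 GHz.
TFLOPS = 8.4e12
freq_hz = TFLOPS / (2 * PEAK_MACS)
print(f"implied clock: {freq_hz / 1e9:.2f} GHz")

# At the quoted 4.3 W, the cluster lands near 1.95 TFLOPS/W.
print(f"energy efficiency: {TFLOPS / 4.3 / 1e12:.2f} TFLOPS/W")
```

Under these assumptions the title's 8.4 TFLOPS/4.3 W and the abstract's 89% utilization figure are mutually consistent.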