Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration

2026-04-08 · Hardware Architecture

Hardware Architecture, Emerging Technologies, Neural and Evolutionary Computing
AI summary

The authors present TrilinearCIM, a compute-in-memory (CIM) architecture that performs Transformer self-attention inside non-volatile memory. Unlike conventional CIM designs, it avoids reprogramming memory devices at run time, which is slow and wears out the devices, by using back-gate modulation in Double-Gate FeFETs (DG-FeFETs) to realize a three-operand multiply-accumulate. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), it outperforms conventional CIM on seven of nine GLUE tasks, with up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at a 37.3% area overhead. The authors state that, to their knowledge, this is the first architecture to perform complete Transformer attention computation in NVM cores without runtime reprogramming.

Transformer, Self-attention, Compute-in-Memory (CIM), Non-volatile memory (NVM), FeFET, Double-Gate FeFET, Multiply-accumulate, BERT, ViT, Back-gate modulation
Authors
Md Zesun Ahmed Mia, Jiahui Duan, Kai Ni, Abhronil Sengupta
Abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce but retain NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at 37.3% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
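
As a purely numerical illustration of why a three-operand multiply-accumulate can remove the need to write dynamic operands into memory, the sketch below models an idealized trilinear MAC in plain Python. It is an assumption-laden toy, not the paper's circuit or mapping: the function name trilinear_mac, the roles "front", "stored", and "back", and the choice to store the precomputed product W_q W_k^T are all hypothetical. The only fact it relies on is the algebraic identity q_m · k_n = x_m^T (W_q W_k^T) x_n, which lets an attention score be formed from one static matrix and two dynamic vectors.

```python
import random


def trilinear_mac(front, stored, back):
    """Idealized three-operand MAC: sum over i, j of front[i] * stored[i][j] * back[j].

    Hypothetical roles: `stored` stands in for a static matrix programmed once,
    `front` and `back` stand in for two dynamic operands applied at run time.
    """
    return sum(
        front[i] * stored[i][j] * back[j]
        for i in range(len(front))
        for j in range(len(back))
    )


def matmul(a, b):
    """Plain-list matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Plain-list matrix transpose."""
    return [list(row) for row in zip(*a)]


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_head = 4, 3  # toy sizes, chosen only for illustration

W_q = rand_matrix(d_model, d_head)  # static projection weights
W_k = rand_matrix(d_model, d_head)

# Static operand, computed once offline: W_qk = W_q @ W_k^T (d_model x d_model).
W_qk = matmul(W_q, transpose(W_k))

# Two dynamic token embeddings that change with every input sequence.
x_m = [random.uniform(-1.0, 1.0) for _ in range(d_model)]
x_n = [random.uniform(-1.0, 1.0) for _ in range(d_model)]

# Reference path: project to q and k explicitly, then take the dot product.
q = [sum(x_m[i] * W_q[i][h] for i in range(d_model)) for h in range(d_head)]
k = [sum(x_n[i] * W_k[i][h] for i in range(d_model)) for h in range(d_head)]
reference = sum(qh * kh for qh, kh in zip(q, k))

# Trilinear path: neither q nor k is ever written into the stored operand.
score = trilinear_mac(x_m, W_qk, x_n)

assert abs(score - reference) < 1e-9
```

In this toy model the stored matrix never changes between inputs, which mirrors the abstract's goal of avoiding runtime reprogramming. It ignores everything device-specific, including quantization, nonlinearity of back-gate modulation, and how softmax and the value projection are handled.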