Avey-B

2026-02-17

Computation and Language · Artificial Intelligence
AI summary

The authors study Avey, a recent language model that replaces the usual attention mechanism yet can still play the role of BERT-style models, which build context-aware representations of text. They adapt Avey to work as a standalone encoder and introduce several improvements, including separating its static and dynamic components, adding normalization for training stability, and compressing parts of the model for efficiency. The resulting model outperforms four widely used transformer-based encoders on tasks such as labeling words in text and retrieving relevant documents, and it handles long texts more efficiently. This makes the approach a promising alternative for settings with limited computing power.

pretrained bidirectional encoders · self-attention · BERT · autoregressive models · encoder-only models · normalization · neural compression · token classification · information retrieval · transformers
Authors
Devang Acharya, Mohammad Hammoud
Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
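
To make the abstract's architectural ingredients concrete, below is a minimal, hypothetical PyTorch sketch of an attention-free, bidirectional encoder block that combines a decoupled static/dynamic parameterization, an additional normalization for stability, and a learned compression bottleneck. The paper does not describe its actual design here, so every component (the RMSNorm choice, the depthwise-convolution token mixer, the sigmoid gating, and the names AttentionFreeEncoderBlock and compressed_dim) is an illustrative assumption rather than Avey-B's real architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMS normalization, a common stability-oriented choice (assumed, not confirmed)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class AttentionFreeEncoderBlock(nn.Module):
    """Hypothetical encoder block: bidirectional contextualization without self-attention."""
    def __init__(self, dim: int, compressed_dim: int, kernel_size: int = 7):
        super().__init__()
        self.norm_mix = RMSNorm(dim)
        # Stand-in token mixer: a depthwise convolution whose kernel spans both
        # directions of the sequence, so every token sees left and right context.
        self.mixer = nn.Conv1d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.static_proj = nn.Linear(dim, dim)    # "static": input-independent weights
        self.dynamic_gate = nn.Linear(dim, dim)   # "dynamic": token-dependent gates
        self.norm_ff = RMSNorm(dim)               # extra normalization for stability
        self.compress = nn.Linear(dim, compressed_dim)  # learned compression bottleneck
        self.expand = nn.Linear(compressed_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, dim). Token-mixing sublayer, no attention.
        h = self.norm_mix(x)
        mixed = self.mixer(h.transpose(1, 2)).transpose(1, 2)
        x = x + self.static_proj(mixed) * torch.sigmoid(self.dynamic_gate(h))
        # Compress-then-expand channel sublayer.
        h = self.norm_ff(x)
        x = x + self.expand(F.gelu(self.compress(h)))
        return x


if __name__ == "__main__":
    block = AttentionFreeEncoderBlock(dim=256, compressed_dim=64)
    tokens = torch.randn(2, 128, 256)       # (batch, sequence, hidden)
    print(block(tokens).shape)              # torch.Size([2, 128, 256])
```

In a BERT-style setup, stacking such blocks over token embeddings and training with a masked-token objective would give the kind of compact bidirectional encoder the abstract targets; the convolutional mixer here is only a placeholder for whatever attention-free mechanism Avey actually uses.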