RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
The long-context problem is getting a lot of attention. While attention achieves strong performance, its quadratic complexity slows things down as sequences grow. Many efficient alternatives have been proposed, including linear attention (or state-space models) and sliding-window attention [1-3]. In this post, we explore a different approach: an intermediate design between RNNs and attention that combines their respective strengths in speed and accuracy.
Attention and RNN Recap
Attention has full-token access, but quadratic computational complexity. For the $t$-th token, the output is formed as a weighted sum over all previous tokens.
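Concretely, this is standard causal softmax attention; writing $q$, $k$, $v$ for the query, key, and value projections and $d$ for the head dimension (notation here is ours):

$$
o_t = \sum_{i=1}^{t} \frac{\exp\!\big(q_t^{\top} k_i / \sqrt{d}\big)}{\sum_{j=1}^{t} \exp\!\big(q_t^{\top} k_j / \sqrt{d}\big)}\, v_i .
$$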
As a mechanism for mixing information between tokens, we view attention as a slow but lossless module.
For the RNN, we adopt a simple design used in [4,5]. Using diagonal matrices and removing nonlinearities, the recurrence reduces to EMA-like gating, which is highly training-efficient with the help of a parallel scan algorithm.
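A minimal sketch of such a gated recurrence, with hidden state $h_t$ and an input projection $W_x$ (our notation; the exact parameterization in the paper may differ), is

$$
h_t = g_t \odot h_{t-1} + (1 - g_t) \odot \big(W_x x_t\big), \qquad o_t = z_t \odot h_t,
$$

where $\odot$ denotes element-wise multiplication.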
Here, $g_t$ and $z_t \in \mathbb{R}^{D}$ are the forget and output gates, computed by linear projections followed by sigmoid activations, where $D$ is the model dimension. While effective on short sequences, this recurrence compresses all history into the current state, which degrades memory and limits long-range retention. We therefore consider it a fast but lossy token-mixing module.
RAT
Our starting point was the observation that using full attention for short-range context wastes its potential: local patterns can be captured more efficiently by lightweight modules such as recurrence. This motivates RAT, an intermediate design that splits sequences into chunks, applies recurrence locally within each chunk, and uses attention across chunks for long-range context.
Each token $t$ is re-indexed as $(c, l)$, denoting the chunk index and its position within the chunk. Gated recurrence first aggregates keys and values over time within each chunk. Each query then attends to the final key vectors $\tilde{K}_{:,-1}$ of preceding chunks and the current chunk's key $\tilde{k}_{c,l}$. The causal mask $f(\cdot)$ ensures no future chunks are visible. Finally, an output gate is applied.
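In symbols, a sketch consistent with this description, with chunk length $L$, initial states $\tilde{k}_{c,0} = \tilde{v}_{c,0} = 0$, and head dimension $d$ (see the preprint for the exact formulation), is

$$
\tilde{k}_{c,l} = g_{c,l} \odot \tilde{k}_{c,l-1} + (1 - g_{c,l}) \odot k_{c,l}, \qquad
\tilde{v}_{c,l} = g_{c,l} \odot \tilde{v}_{c,l-1} + (1 - g_{c,l}) \odot v_{c,l},
$$

$$
\alpha_{c,l} = \operatorname{softmax}\!\Big(\big[\, q_{c,l}^{\top}\tilde{k}_{1,L},\ \dots,\ q_{c,l}^{\top}\tilde{k}_{c-1,L},\ q_{c,l}^{\top}\tilde{k}_{c,l} \,\big] / \sqrt{d}\Big),
$$

$$
o_{c,l} = z_{c,l} \odot \Big( \textstyle\sum_{c'=1}^{c-1} \alpha_{c,l,c'}\, \tilde{v}_{c',L} \;+\; \alpha_{c,l,c}\, \tilde{v}_{c,l} \Big).
$$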
Overall, RAT is
- an intermediate design that can interpolate between attention ($L=1$) and recurrence ($L=T$) simply by adjusting the chunk size $L$. It benefits from the effectiveness of RNNs on short contexts and enables attention to access distant information over reduced sequence lengths.
- simple, without extra convolution or normalization.
- easy to implement: we rely only on PyTorch higher-order ops such as FlexAttention, without any custom CUDA or Triton kernels. A naive reference sketch is shown below.
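To make the mechanism concrete, here is a naive single-head PyTorch sketch of RAT token mixing, written from the description above. It is for illustration only: it uses plain loops and dense tensors rather than FlexAttention or parallel scans, and the function and variable names are ours, not the official implementation's.

```python
import torch

def rat_mixing(q, k, v, g, z, chunk_len):
    """Naive single-head RAT token mixing (illustrative sketch, not the official code).

    q, k, v: queries / keys / values, shape [T, D].
    g, z:    forget / output gates in (0, 1), e.g. sigmoids of linear projections, shape [T, D].
    """
    T, D = q.shape
    L = chunk_len
    C = T // L  # assume T is a multiple of chunk_len for simplicity
    q, k, v, g, z = (x.view(C, L, D) for x in (q, k, v, g, z))

    # 1) Gated recurrence within each chunk: EMA-like aggregation of keys and values.
    k_tilde, v_tilde = torch.empty_like(k), torch.empty_like(v)
    k_state = torch.zeros(C, D)
    v_state = torch.zeros(C, D)
    for l in range(L):
        k_state = g[:, l] * k_state + (1 - g[:, l]) * k[:, l]
        v_state = g[:, l] * v_state + (1 - g[:, l]) * v[:, l]
        k_tilde[:, l], v_tilde[:, l] = k_state, v_state

    # 2) Cross-chunk attention: token (c, l) attends to the final aggregated
    #    key/value of every preceding chunk and to its own running aggregate.
    keys = k_tilde[:, -1].expand(C, L, C, D).clone()  # keys[c, l, j] = last key of chunk j
    vals = v_tilde[:, -1].expand(C, L, C, D).clone()
    diag = torch.arange(C)
    keys[diag, :, diag] = k_tilde                     # own chunk: use the running aggregate
    vals[diag, :, diag] = v_tilde

    scores = torch.einsum('cld,cljd->clj', q, keys) / D ** 0.5
    causal = torch.arange(C).view(1, 1, C) <= torch.arange(C).view(C, 1, 1)  # hide future chunks
    attn = scores.masked_fill(~causal, float('-inf')).softmax(dim=-1)
    out = torch.einsum('clj,cljd->cld', attn, vals)

    # 3) Output gate, then flatten back to [T, D].
    return (z * out).reshape(T, D)


# Tiny usage example with random inputs and projections.
T, D, L = 32, 16, 8
x = torch.randn(T, D)
proj = lambda: torch.nn.Linear(D, D, bias=False)
q, k, v = proj()(x), proj()(x), proj()(x)
g, z = torch.sigmoid(proj()(x)), torch.sigmoid(proj()(x))
print(rat_mixing(q, k, v, g, z, chunk_len=L).shape)  # torch.Size([32, 16])
```

Each query here scores only $c$ chunk-level keys rather than all $t$ previous tokens, which is where the efficiency gain discussed next comes from.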
Efficiency
Given sequence length $T$, model dimension $D$, number of chunks $C$, and chunk length $L$ (with $T = C \cdot L$), the per-token FLOPs are $\mathcal{O}(T\cdot D)$ for attention, $\mathcal{O}(D)$ for RNNs, and $\mathcal{O}(C\cdot D)$ for RAT. RAT reduces FLOPs and KV cache by a factor of $L$, enabling faster training and generation.
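As an illustrative calculation (numbers chosen for exposition, not taken from the paper): with $T = 4096$ and chunk length $L = 16$, each RAT query attends over only $C = T / L = 256$ positions, so attention FLOPs and KV-cache size drop by roughly $16\times$ compared with full attention, while the gated recurrence adds only $\mathcal{O}(D)$ work per token.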
Main accuracy results
We trained 1.3B-parameter models and evaluated them on six short-context reasoning tasks, fourteen long-context tasks (LongBench [6]), and four long-context SFT tasks. We also propose a hybrid model that interleaves RAT and local-attention layers, effectively combining efficient long-range modeling with strong local interactions. As shown below, this design achieves top throughput and accuracy.
For example, compared with full attention or its local variant, it improves accuracy by +1 point on commonsense reasoning, +4 on code completion, +4 on a challenging QA SFT task, and +1 on summarization, while also achieving at least a 3–4× gain in maximum throughput.
Other results
We also explored different ways to allocate parameters and encode positions. Additionally, we ran experiments with NoPE [7] to evaluate RAT's length generalization, as well as tests on the RULER benchmark [8] to assess retrieval performance. Some of these results are shown below.
For more discussion and results, please see our
Preprint: https://arxiv.org/abs/2507.04416
Code: https://github.com/CLAIRE-Labo/RAT
Website: https://claire-labo.github.io/RAT
This is a collaboration with Anunay Yadav, Razvan Pascanu, and Caglar Gulcehre. Many thanks to them!
References
[2]. Griffin: Mixing gated linear recurrences with local attention for efficient language models.
[3]. Longformer: The long-document transformer.
[4]. Were RNNs All We Needed?
[5]. Simplified State Space Layers for Sequence Modeling.
[6]. LongBench: A bilingual, multitask benchmark for long context understanding.
[7]. The impact of positional encoding on length generalization in transformers.
[8]. RULER: What's the Real Context Size of Your Long-Context Language Models?