RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
The long-context problem is getting a lot of attention. While attention achieves strong performance, its quadratic complexity slows things down as sequences grow. Many efficient alternatives have been proposed, including linear attention (or state-space models) and sliding-window attention [1-3]. In this post, we explore a different approach: an intermediate design between RNNs and attention that combines their respective strengths in speed and accuracy.
Attention and RNN Recap
Attention has full-token access, but quadratic computational complexity. For the $t$-th token, the output is formed as a weighted sum over all previous tokens.
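Concretely, this is standard causal softmax attention; writing $q$, $k$, $v$ for the query, key, and value projections and $d$ for the head dimension (notation here is ours):

$$
o_t = \sum_{i=1}^{t} \frac{\exp\!\big(q_t^{\top} k_i / \sqrt{d}\big)}{\sum_{j=1}^{t} \exp\!\big(q_t^{\top} k_j / \sqrt{d}\big)}\, v_i .
$$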
As a mechanism for mixing information between tokens, we view attention as a slow but lossless module.
For the RNN, we adopt a simple design used in [4,5]. Using diagonal matrices and removing nonlinearities, the recurrence reduces to EMA-like gating, which is highly training-efficient with the help of a parallel scan algorithm.
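A minimal sketch of such a gated recurrence, with hidden state $h_t$ and an input projection $W_x$ (our notation; the exact parameterization in the paper may differ), is

$$
h_t = g_t \odot h_{t-1} + (1 - g_t) \odot \big(W_x x_t\big), \qquad o_t = z_t \odot h_t,
$$

where $\odot$ denotes element-wise multiplication.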
Here, $g_t$ and $z_t \in \mathbb{R}^{D}$ are the forget and output gates, computed by linear projections followed by sigmoid activations, where $D$ is the model dimension. While effective on short sequences, this recurrence compresses all history into the current state, which degrades memory and limits long-range retention. We therefore consider it a fast but lossy token-mixing module.
RAT
Our starting point was the observation that using full attention for short-range context wastes its potential: local patterns can be captured more efficiently by lightweight modules such as recurrence. This motivates RAT, an intermediate design that splits sequences into chunks, applies recurrence locally within each chunk, and uses attention across chunks for long-range context.
Each token $t$ is re-indexed as $(c, l)$, denoting the chunk index and its position within the chunk. Gated recurrence first aggregates keys and values over time within each chunk. Each query then attends to the final key vectors $\tilde{K}_{:,-1}$ of preceding chunks and the current chunk's key $\tilde{k}_{c,l}$. The causal mask $f(\cdot)$ ensures no future chunks are visible. Finally, an output gate is applied.
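In symbols, a sketch consistent with this description, with chunk length $L$, initial states $\tilde{k}_{c,0} = \tilde{v}_{c,0} = 0$, and head dimension $d$ (see the preprint for the exact formulation), is

$$
\tilde{k}_{c,l} = g_{c,l} \odot \tilde{k}_{c,l-1} + (1 - g_{c,l}) \odot k_{c,l}, \qquad
\tilde{v}_{c,l} = g_{c,l} \odot \tilde{v}_{c,l-1} + (1 - g_{c,l}) \odot v_{c,l},
$$

$$
\alpha_{c,l} = \operatorname{softmax}\!\Big(\big[\, q_{c,l}^{\top}\tilde{k}_{1,L},\ \dots,\ q_{c,l}^{\top}\tilde{k}_{c-1,L},\ q_{c,l}^{\top}\tilde{k}_{c,l} \,\big] / \sqrt{d}\Big),
$$

$$
o_{c,l} = z_{c,l} \odot \Big( \textstyle\sum_{c'=1}^{c-1} \alpha_{c,l,c'}\, \tilde{v}_{c',L} \;+\; \alpha_{c,l,c}\, \tilde{v}_{c,l} \Big).
$$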
Overall, RAT is
- an intermediate design that can interpolate between attention ($L=1$) and recurrence ($L=T$) simply by adjusting the chunk size $L$. It benefits from the effectiveness of RNNs on short contexts and enables attention to access distant information over reduced sequence lengths.
- simple, without extra convolution or normalization.
- easy to implement: we rely only on PyTorch higher-order ops such as FlexAttention, without any custom CUDA or Triton kernels. A naive reference sketch is shown below.
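To make the mechanism concrete, here is a naive single-head PyTorch sketch of RAT token mixing, written from the description above. It is for illustration only: it uses plain loops and dense tensors rather than FlexAttention or parallel scans, and the function and variable names are ours, not the official implementation's.

```python
import torch

def rat_mixing(q, k, v, g, z, chunk_len):
    """Naive single-head RAT token mixing (illustrative sketch, not the official code).

    q, k, v: queries / keys / values, shape [T, D].
    g, z:    forget / output gates in (0, 1), e.g. sigmoids of linear projections, shape [T, D].
    """
    T, D = q.shape
    L = chunk_len
    C = T // L  # assume T is a multiple of chunk_len for simplicity
    q, k, v, g, z = (x.view(C, L, D) for x in (q, k, v, g, z))

    # 1) Gated recurrence within each chunk: EMA-like aggregation of keys and values.
    k_tilde, v_tilde = torch.empty_like(k), torch.empty_like(v)
    k_state = torch.zeros(C, D)
    v_state = torch.zeros(C, D)
    for l in range(L):
        k_state = g[:, l] * k_state + (1 - g[:, l]) * k[:, l]
        v_state = g[:, l] * v_state + (1 - g[:, l]) * v[:, l]
        k_tilde[:, l], v_tilde[:, l] = k_state, v_state

    # 2) Cross-chunk attention: token (c, l) attends to the final aggregated
    #    key/value of every preceding chunk and to its own running aggregate.
    keys = k_tilde[:, -1].expand(C, L, C, D).clone()  # keys[c, l, j] = last key of chunk j
    vals = v_tilde[:, -1].expand(C, L, C, D).clone()
    diag = torch.arange(C)
    keys[diag, :, diag] = k_tilde                     # own chunk: use the running aggregate
    vals[diag, :, diag] = v_tilde

    scores = torch.einsum('cld,cljd->clj', q, keys) / D ** 0.5
    causal = torch.arange(C).view(1, 1, C) <= torch.arange(C).view(C, 1, 1)  # hide future chunks
    attn = scores.masked_fill(~causal, float('-inf')).softmax(dim=-1)
    out = torch.einsum('clj,cljd->cld', attn, vals)

    # 3) Output gate, then flatten back to [T, D].
    return (z * out).reshape(T, D)


# Tiny usage example with random inputs and projections.
T, D, L = 32, 16, 8
x = torch.randn(T, D)
proj = lambda: torch.nn.Linear(D, D, bias=False)
q, k, v = proj()(x), proj()(x), proj()(x)
g, z = torch.sigmoid(proj()(x)), torch.sigmoid(proj()(x))
print(rat_mixing(q, k, v, g, z, chunk_len=L).shape)  # torch.Size([32, 16])
```

Each query here scores only $c$ chunk-level keys rather than all $t$ previous tokens, which is where the efficiency gain discussed next comes from.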
Efficiency
Given sequence length $T$, model dimension $D$, number of chunks $C$, and chunk length $L$ (with $T = C \cdot L$), the per-token FLOPs are $\mathcal{O}(T\cdot D)$ for attention, $\mathcal{O}(D)$ for RNNs, and $\mathcal{O}(C\cdot D)$ for RAT. RAT reduces FLOPs and KV cache by a factor of $L$, enabling faster training and generation.
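As an illustrative calculation (numbers chosen for exposition, not taken from the paper): with $T = 4096$ and chunk length $L = 16$, each RAT query attends over only $C = T / L = 256$ positions, so attention FLOPs and KV-cache size drop by roughly $16\times$ compared with full attention, while the gated recurrence adds only $\mathcal{O}(D)$ work per token.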
Main accuracy results
We trained 1.3B-parameter models and evaluated them on six short-context reasoning tasks, fourteen long-context tasks (LongBench [6]), and four long-context SFT tasks. We also propose a hybrid model that interleaves RAT and local-attention layers, effectively combining efficient long-range modeling with strong local interactions. As shown below, this design achieves top throughput and accuracy.
For example, compared with full attention or its local variant, it improves accuracy by +1 point on commonsense reasoning, +4 on code completion, +4 on a challenging QA SFT task, and +1 on summarization, while also achieving at least a 3–4× gain in maximum throughput.
Other results
We also explored different ways to allocate parameters and encode positions. Additionally, we ran experiments with NoPE [7] to evaluate RAT's length generalization, as well as tests on the RULER benchmark [8] to assess retrieval performance. Some of these results are shown below.
For more discussion and results, please see our
Preprint: https://arxiv.org/abs/2507.04416
Code: https://github.com/CLAIRE-Labo/RAT
Website: https://claire-labo.github.io/RAT
This is a collaboration with Anunay Yadav, Razvan Pascanu, and Caglar Gulcehre. Many thanks to them!
References
[2]. Griffin: Mixing gated linear recurrences with local attention for efficient language models.
[3]. Longformer: The long-document transformer.
[4]. Were RNNs All We Needed?
[5]. Simplified State Space Layers for Sequence Modeling.
[6]. LongBench: A bilingual, multitask benchmark for long context understanding.
[7]. The impact of positional encoding on length generalization in transformers.
[8]. RULER: What's the Real Context Size of Your Long-Context Language Models?