MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang*, Juncheng Mu*, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu
The Chinese University of Hong Kong, Tsinghua University, Zhejiang University
(* equal contribution)

🔥 Highlight

MemoryWAM is a world action model with efficient persistent memory for robotic manipulation.

Its hybrid memory design combines recent frames, event-boundary anchors, and compact gist tokens, reducing inference complexity from O(N) to O(N/d) while preserving long-range context.

MemoryWAM outperforms strong VLA and WAM baselines on long-horizon, memory-dependent tasks in both simulation and the real world.

🛠️ Method

Model Architecture. Observations are first encoded into compact video latents by a causal video VAE. A video DiT processes visual dynamics and maintains a temporal KV cache, while an action DiT generates action chunks conditioned on the cached video representations. The two branches are organized in a mixture-of-transformers (MoT) architecture. During inference, the clean latent of the current observation is forwarded through the video DiT only once to update the KV cache; the action DiT then predicts the action chunk by denoising action tokens while attending to the cached video representations. Video generation is bypassed at inference time.
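The inference flow described above can be illustrated with a toy control loop. This is only a sketch of the call pattern, not the released implementation: `ToyVideoDiT`, `ToyActionDiT`, `control_step`, the chunk length, and the number of denoising steps are all hypothetical stand-ins, and the "denoising" update is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVideoDiT:
    """Hypothetical stand-in for the video branch; it only tracks the KV cache."""
    kv_cache: list = field(default_factory=list)
    forward_calls: int = 0

    def update_cache(self, clean_latent):
        # One forward pass on the clean latent of the current observation.
        self.forward_calls += 1
        self.kv_cache.append(clean_latent)


@dataclass
class ToyActionDiT:
    """Hypothetical stand-in for the action branch."""
    denoise_calls: int = 0

    def denoise_step(self, noisy_actions, kv_cache):
        # Placeholder update. A real model would attend to kv_cache here.
        self.denoise_calls += 1
        return [a * 0.5 for a in noisy_actions]


def control_step(video_dit, action_dit, latent, chunk_len=4, num_steps=3):
    """One closed-loop step: update the cache once, then denoise an action chunk."""
    video_dit.update_cache(latent)
    actions = [1.0] * chunk_len  # stands in for sampled noise
    for _ in range(num_steps):
        actions = action_dit.denoise_step(actions, video_dit.kv_cache)
    return actions  # no video frames are generated in this loop


video_dit, action_dit = ToyVideoDiT(), ToyActionDiT()
for t in range(5):
    chunk = control_step(video_dit, action_dit, latent=f"latent_{t}")
print(video_dit.forward_calls, action_dit.denoise_calls)
```

In this toy setup the video branch runs once per observation, while the action branch runs once per denoising step.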

Hybrid Memory. Inspired by complementary forms of human memory, MemoryWAM maintains a compact temporal cache comprising three components: (1) short-term memory—a sliding-window cache over the most recent frames, preserving high-fidelity local context for immediate closed-loop control; (2) event-boundary memory—a small set of anchor frames at task onset with full visual tokens, grounding key information in the instruction; and (3) gist memory—M learnable gist tokens per frame (M ≪ L, the number of visual tokens) that compress long-range history, reducing the KV cache by L/M× compared with full-history attention.
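To make the cache saving concrete, the sketch below counts cached tokens under full-history attention and under the hybrid scheme. It is a rough illustration under stated assumptions: anchors are taken to be the first frames, the sliding window is taken to cover the last frames including the current one, every frame is assumed to keep its M gist tokens, and the example sizes (256 visual tokens, 4 gist tokens, 2 anchors, window of 8) are hypothetical rather than values from the paper.

```python
def full_history_tokens(num_frames: int, visual_tokens: int) -> int:
    """Cached tokens when every frame keeps all L visual tokens."""
    return num_frames * visual_tokens


def hybrid_tokens(num_frames: int, visual_tokens: int, gist_tokens: int,
                  num_anchors: int, window: int) -> int:
    """Cached tokens under the assumed hybrid layout.

    Anchor and recent frames keep their visual tokens; every frame keeps
    its gist tokens (an assumption of this sketch).
    """
    anchors = set(range(min(num_anchors, num_frames)))
    recent = set(range(max(0, num_frames - window), num_frames))
    full_frames = len(anchors | recent)  # a frame may be both anchor and recent
    return full_frames * visual_tokens + num_frames * gist_tokens


# Hypothetical sizes, chosen only for illustration.
N, L, M = 1600, 256, 4
full = full_history_tokens(N, L)
hybrid = hybrid_tokens(N, L, M, num_anchors=2, window=8)
print(full, hybrid, round(full / hybrid, 1))
```

Under these assumptions the ratio approaches L/M as the trajectory grows, because the fixed cost of the anchor and recent frames is amortized over more frames.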

Attention mask of MemoryWAM. Each frame's gist tokens attend to both the frame's visual tokens and its historical context, thereby distilling long-range information into a compact representation. For a video frame that is neither an anchor frame nor a recent frame, subsequent video and action tokens do not attend to it directly; instead, they attend to the corresponding gist tokens. During inference, MemoryWAM evicts the KV cache of such frames while preserving the KV cache of their gist tokens, so that long-range history is retained as a compact persistent memory rather than as a costly full-token KV cache.
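The masking rule above can be sketched for the video tokens as a boolean matrix. This is a minimal illustration, not the released mask: the token layout (L visual tokens followed by M gist tokens per frame), the choice of the first frames as anchors, the window counted to include the current frame, and the within-frame rule are assumptions of this sketch, and action tokens are omitted.

```python
def build_attention_mask(num_frames, visual_tokens, gist_tokens,
                         num_anchors, window):
    """Return mask[q][k] == True where query token q may attend to key token k."""
    stride = visual_tokens + gist_tokens
    n = num_frames * stride
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_frame, q_off = divmod(q, stride)
        q_is_gist = q_off >= visual_tokens
        for k in range(n):
            k_frame, k_off = divmod(k, stride)
            k_is_gist = k_off >= visual_tokens
            if k_frame > q_frame:
                continue  # causal across frames: no attention to the future
            if k_frame == q_frame:
                # Assumed: visual queries see the frame's visual tokens;
                # gist queries see the frame's visual and gist tokens.
                mask[q][k] = q_is_gist or not k_is_gist
            elif k_is_gist:
                mask[q][k] = True  # gist tokens of past frames stay visible
            else:
                # Past visual tokens are visible only for anchor or recent frames.
                is_anchor = k_frame < num_anchors
                is_recent = q_frame - k_frame < window
                mask[q][k] = is_anchor or is_recent
    return mask


mask = build_attention_mask(num_frames=5, visual_tokens=2, gist_tokens=1,
                            num_anchors=1, window=2)
# Keys visible to the first visual token of the last frame (token index 12).
print([k for k, allowed in enumerate(mask[12]) if allowed])
```

Columns that are False for every later query correspond to the key/value entries that can be evicted from the cache at inference time, while the gist columns remain.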

📊 Results

Results on RMBench. We report the success rates over 100 rollouts.

Task	π0.5	FastWAM	LingBot-VA	MemoryWAM (Ours)
Observe and Pick Up	9%	0%	13%	27%
Rearrange Blocks	13%	0%	100%	100%
Put Back Block	11%	0%	100%	100%
Swap Blocks	24%	0%	99%	100%
Swap T	15%	7%	88%	94%
Battery Try	16%	20%	41%	41%
Blocks Ranking Try	6%	26%	100%	100%
Cover Blocks	0%	0%	79%	98%
Press Button	0%	0%	84%	87%
Average	10.4%	5.9%	78.2%	83.0%

We evaluate MemoryWAM on RMBench, a simulation benchmark for long-horizon, memory-dependent robotic manipulation across nine dual-arm tasks. Baselines with bounded observation windows (π0.5, FastWAM) fail on most tasks, achieving only 10.4% and 5.9% average success. LingBot-VA, which retains the full historical KV cache, achieves strong performance at 78.2%. MemoryWAM improves on this by 4.8 percentage points to 83.0%, matching or exceeding every baseline on each task (strictly better than LingBot-VA on five tasks and tied on four). This indicates that the proposed hybrid memory mechanism is both efficient and effective for robotic manipulation.
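The averages and the 4.8-point gap quoted above can be recomputed directly from the RMBench table. The short script below is only a convenience check; the numbers are copied from the table, and the variable names are illustrative.

```python
# Success rates (%) copied from the RMBench table:
# (π0.5, FastWAM, LingBot-VA, MemoryWAM)
results = {
    "Observe and Pick Up": (9, 0, 13, 27),
    "Rearrange Blocks": (13, 0, 100, 100),
    "Put Back Block": (11, 0, 100, 100),
    "Swap Blocks": (24, 0, 99, 100),
    "Swap T": (15, 7, 88, 94),
    "Battery Try": (16, 20, 41, 41),
    "Blocks Ranking Try": (6, 26, 100, 100),
    "Cover Blocks": (0, 0, 79, 98),
    "Press Button": (0, 0, 84, 87),
}
methods = ("π0.5", "FastWAM", "LingBot-VA", "MemoryWAM")

# Per-method mean over the nine tasks.
averages = {
    name: sum(row[i] for row in results.values()) / len(results)
    for i, name in enumerate(methods)
}
gap = averages["MemoryWAM"] - averages["LingBot-VA"]

# Tasks where MemoryWAM is strictly better than, or tied with, LingBot-VA.
wins = sum(row[3] > row[2] for row in results.values())
ties = sum(row[3] == row[2] for row in results.values())

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}%")
print(f"gap: {gap:.1f} points, wins: {wins}, ties: {ties}")
```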

We compare MemoryWAM's hybrid memory against full attention, TTT, and RNN-based mechanisms in terms of inference latency, GPU memory usage, and task performance. TTT and RNN maintain constant complexity but introduce overhead even for short trajectories. Full attention scales poorly with trajectory length. Hybrid memory matches full attention's 87% success rate on the challenging Press Button task while being substantially more efficient, and it even outperforms the RNN- and TTT-based alternatives at 1,600 frames.

📌 TODO

Release the paper, the project page, and the demo.
Release the training and inference code.

🔗 Citation

If you find our work helpful, please cite:

@article{yang2026memorywam,
  title={MemoryWAM: Efficient World Action Modeling with Persistent Memory},
  author={Yang, Sizhe and Mu, Juncheng and Wei, Tianming and Lu, Chenhao and Li, Xiaofan and Xu, Linning and Xue, Zhengrong and Yuan, Zhecheng and Lin, Dahua and Pang, Jiangmiao and Xu, Huazhe},
  journal={arXiv preprint arXiv:2606.20562},
  year={2026}
}

📄 License

This repository is released under the Apache 2.0 license.
