yangsizhe/MemoryWAM

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang*, Juncheng Mu*, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu
The Chinese University of Hong Kong, Tsinghua University, Zhejiang University
(* equal contribution)

🔥 Highlight

MemoryWAM is a world action model with efficient persistent memory for robotic manipulation.

Its hybrid memory design combines recent frames, event-boundary anchors, and compact gist tokens, reducing inference complexity from O(N) to O(N/d) while preserving long-range context.

MemoryWAM outperforms strong VLA and WAM baselines on long-horizon, memory-dependent tasks in both simulation and the real world.

🛠️ Method

Pipeline

Model Architecture. Observations are first encoded into compact video latents by a causal video VAE. A video DiT processes visual dynamics and maintains a temporal KV cache, while an action DiT generates action chunks conditioned on the cached video representations. The two branches are organized in a mixture-of-transformers (MoT) architecture. During inference, the clean latent of the current observation is forwarded through the video DiT only once to update the KV cache; the action DiT then predicts the action chunk by denoising action tokens while attending to the cached video representations. Video generation is bypassed at inference time.
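The inference procedure described above can be sketched as a control loop. The toy example below is an illustration only, not the released implementation: `ToyPolicy`, its methods, and the scalar arithmetic standing in for the video VAE, the video DiT, and the action DiT are all hypothetical. It shows only the call pattern: one video-branch pass per observation to extend the cache, followed by several denoising steps on the action tokens that read from the cache, with no video generation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the two-branch model; all math is placeholder."""
    num_denoise_steps: int = 4
    chunk_len: int = 3
    kv_cache: List[float] = field(default_factory=list)
    video_forward_calls: int = 0

    def encode(self, observation: float) -> float:
        # Placeholder for the causal video VAE: observation -> clean latent.
        return observation * 0.5

    def update_cache(self, latent: float) -> None:
        # Placeholder for the single video-DiT forward pass per observation.
        self.video_forward_calls += 1
        self.kv_cache.append(latent)

    def predict_actions(self) -> List[float]:
        # Placeholder for the action DiT: iteratively refine action tokens
        # while reading (not modifying) the cached video representations.
        context = sum(self.kv_cache) / len(self.kv_cache)
        actions = [1.0] * self.chunk_len  # stands in for initial noise
        for _ in range(self.num_denoise_steps):
            actions = [a + 0.5 * (context - a) for a in actions]
        return actions

    def step(self, observation: float) -> List[float]:
        self.update_cache(self.encode(observation))
        return self.predict_actions()


policy = ToyPolicy()
for obs in [2.0, 4.0]:
    chunk = policy.step(obs)
```

After two control steps the video branch has run exactly twice, regardless of how many denoising steps the action branch takes.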

Hybrid Memory. Inspired by complementary forms of human memory, MemoryWAM maintains a compact temporal cache comprising three components: (1) short-term memory, a sliding-window cache over the most recent frames that preserves high-fidelity local context for immediate closed-loop control; (2) event-boundary memory, a small set of anchor frames at task onset kept with full visual tokens, grounding key information in the instruction; and (3) gist memory, M learnable gist tokens per frame (M ≪ L, where L is the number of visual tokens per frame) that compress long-range history, reducing the KV cache by a factor of L/M compared with full-history attention.
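The cache reduction can be made concrete with a small token-count sketch. The function names and the example sizes (L = 256, M = 8, one anchor frame, a four-frame window) are illustrative assumptions, not values from the paper. The sketch also assumes every frame contributes M gist tokens, while anchor and recent frames additionally keep their L visual tokens.

```python
def full_cache_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Tokens held when every frame keeps all of its visual tokens."""
    return num_frames * tokens_per_frame


def hybrid_cache_tokens(
    num_frames: int,
    tokens_per_frame: int,   # L in the text
    gist_per_frame: int,     # M in the text
    num_anchor: int,         # assumed anchor-frame count
    window: int,             # assumed sliding-window length
) -> int:
    """Tokens held under the assumed hybrid layout."""
    anchors = min(num_anchor, num_frames)
    recent = min(window, num_frames - anchors)
    gist_tokens = num_frames * gist_per_frame
    full_tokens = (anchors + recent) * tokens_per_frame
    return gist_tokens + full_tokens


# Illustrative sizes only.
full = full_cache_tokens(1600, 256)
hybrid = hybrid_cache_tokens(1600, 256, 8, num_anchor=1, window=4)
ratio = full / hybrid
```

With these assumed sizes the full cache holds 409,600 tokens and the hybrid cache 14,080, a ratio of about 29. The ratio approaches L/M = 32 as the trajectory grows, because the fixed cost of the anchor and recent frames is amortized.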

Attention Mask

Attention mask of MemoryWAM. Each frame's gist tokens attend to both the frame's visual tokens and its historical context, thereby distilling long-range information into a compact representation. For a video frame that is neither an anchor frame nor a recent frame, subsequent video and action tokens do not attend to it directly; instead, they attend to the corresponding gist tokens. During inference, MemoryWAM evicts the KV cache of such frames while preserving the KV cache of their gist tokens, so that long-range history is retained as a compact persistent memory rather than as a costly full-token KV cache.
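The masking rule above can be sketched at frame granularity. This is a simplified reading, not the authors' code: `visible_keys` and `evictable_frames` are hypothetical names, the window is assumed to include the current frame, and queries are assumed to read gist tokens only for frames whose visual tokens are hidden.

```python
from typing import List, Tuple


def visible_keys(t: int, num_anchor: int, window: int) -> List[Tuple[int, str]]:
    """For a query at frame t, list which token group of each past frame is visible.

    Anchor frames and frames inside the recent window expose their visual
    tokens; every other past frame is reachable only through its gist tokens.
    """
    keys = []
    for s in range(t + 1):  # causal: only frames up to and including t
        is_anchor = s < num_anchor
        is_recent = (t - s) < window
        keys.append((s, "visual" if is_anchor or is_recent else "gist"))
    return keys


def evictable_frames(t: int, num_anchor: int, window: int) -> List[int]:
    """Frames whose visual-token KV entries can be dropped at frame t."""
    return [s for s, kind in visible_keys(t, num_anchor, window) if kind == "gist"]


keys_at_5 = visible_keys(5, num_anchor=1, window=2)
dropped_at_5 = evictable_frames(5, num_anchor=1, window=2)
```

With one anchor frame and a two-frame window, a query at frame 5 sees the visual tokens of frames 0, 4, and 5 and only the gist tokens of frames 1 to 3, so the visual-token KV entries of frames 1 to 3 are the ones that can be evicted.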

📊 Results

Results on RMBench. We report the success rates over 100 rollouts.

| Task | π0.5 | FastWAM | LingBot-VA | MemoryWAM (Ours) |
| --- | --- | --- | --- | --- |
| Observe and Pick Up | 9% | 0% | 13% | 27% |
| Rearrange Blocks | 13% | 0% | 100% | 100% |
| Put Back Block | 11% | 0% | 100% | 100% |
| Swap Blocks | 24% | 0% | 99% | 100% |
| Swap T | 15% | 7% | 88% | 94% |
| Battery Try | 16% | 20% | 41% | 41% |
| Blocks Ranking Try | 6% | 26% | 100% | 100% |
| Cover Blocks | 0% | 0% | 79% | 98% |
| Press Button | 0% | 0% | 84% | 87% |
| Average | 10.4% | 5.9% | 78.2% | 83.0% |

We evaluate MemoryWAM on RMBench, a simulation benchmark for long-horizon, memory-dependent robotic manipulation across nine dual-arm tasks. Baselines with bounded observation windows (π0.5, FastWAM) fail on most tasks, achieving only 10.4% and 5.9% average success. LingBot-VA, which retains the full historical KV cache, achieves strong performance at 78.2%. MemoryWAM further improves by 4.8 percentage points to 83.0%, matching or exceeding the strongest baseline on every task. This indicates that the proposed hybrid memory mechanism is both efficient and effective for robotic manipulation.
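The averages quoted above can be rechecked directly from the per-task success rates in the table. The snippet below is a small arithmetic check, with the numbers copied from the table; the variable names are illustrative.

```python
# Per-task success rates (%) in the order: pi0.5, FastWAM, LingBot-VA, MemoryWAM.
METHODS = ("pi0.5", "FastWAM", "LingBot-VA", "MemoryWAM")
RESULTS = {
    "Observe and Pick Up": (9, 0, 13, 27),
    "Rearrange Blocks": (13, 0, 100, 100),
    "Put Back Block": (11, 0, 100, 100),
    "Swap Blocks": (24, 0, 99, 100),
    "Swap T": (15, 7, 88, 94),
    "Battery Try": (16, 20, 41, 41),
    "Blocks Ranking Try": (6, 26, 100, 100),
    "Cover Blocks": (0, 0, 79, 98),
    "Press Button": (0, 0, 84, 87),
}


def column_average(index: int) -> float:
    """Mean success rate of one method across all tasks, to one decimal."""
    values = [row[index] for row in RESULTS.values()]
    return round(sum(values) / len(values), 1)


averages = {name: column_average(i) for i, name in enumerate(METHODS)}

# Tasks where MemoryWAM strictly beats, or only ties, the best baseline.
strict_wins = [task for task, row in RESULTS.items() if row[3] > max(row[:3])]
ties = [task for task, row in RESULTS.items() if row[3] == max(row[:3])]
```

The computed averages reproduce the table's last row (10.4, 5.9, 78.2, 83.0). MemoryWAM is strictly ahead of every baseline on five tasks and tied with the best baseline on the remaining four.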

Efficiency

We compare MemoryWAM's hybrid memory against full attention, TTT, and RNN-based mechanisms in terms of inference latency, GPU memory usage, and task performance. TTT and RNN maintain constant complexity but introduce overhead even for short trajectories. Full attention scales poorly with trajectory length. Hybrid memory achieves the same 87% success rate as full attention on the challenging Press Button task, while being substantially more efficient — even outperforming RNN- and TTT-based alternatives at 1,600 frames.

📌 TODO

  • Release the paper, the project page, and the demo.
  • Release the training and inference code.

🔗 Citation

If you find our work helpful, please cite:

@article{yang2026memorywam,
  title={MemoryWAM: Efficient World Action Modeling with Persistent Memory},
  author={Yang, Sizhe and Mu, Juncheng and Wei, Tianming and Lu, Chenhao and Li, Xiaofan and Xu, Linning and Xue, Zhengrong and Yuan, Zhecheng and Lin, Dahua and Pang, Jiangmiao and Xu, Huazhe},
  journal={arXiv preprint arXiv:2606.20562},
  year={2026}
}

📄 License

This repository is released under the Apache 2.0 license.
