MobileForge

Core Idea

Annotation-free adaptation from real app interaction.

MobileForge targets two gaps in mobile GUI learning. First, generated tasks and feedback are often detached from the target app. Second, sparse rollout outcomes are hard to turn into reusable step-level policy updates.

Figure: MobileForge teaser showing annotation-free adaptation for mobile GUI agents. MobileForge grounds generated tasks in target-app interaction and converts hierarchical feedback into reusable policy-improvement signals.

Results

Main performance of MobileForge.

Table: MobileForge main performance. Main results from the paper: scaling with generated MobileForge tasks, in-domain AndroidWorld adaptation, and out-of-domain MobileWorld GUI-only generalization.

67.24% ForgeOwl-8B Pass@1

AndroidWorld in-domain adaptation with the MobileForge-adapted GUI-Owl-1.5-8B model.

77.59% ForgeOwl-8B Pass@3

Strong AndroidWorld multi-attempt success after annotation-free adaptation.

41.03% MobileWorld SR

Out-of-domain GUI-only success with no MobileWorld rollout used for training.

67.24% ForgeQwen3-8B Pass@3

Qwen3-VL-8B after MobileForge adaptation, close to the closed-data GUI-specialized base model.
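
The Pass@1 and Pass@3 figures above are multi-attempt success rates. As a minimal sketch, assuming Pass@k here means the fraction of tasks solved in at least one of the first k attempts (the page does not spell out the exact protocol), the metric can be computed as follows. The task names and outcomes are toy values for illustration, not benchmark data.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts.

    Assumption: this empirical definition is for illustration only; the
    benchmark's own evaluation protocol may differ.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"{task_id} has fewer than {k} attempts")
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)


# Toy outcomes (hypothetical): one list of per-attempt successes per task.
toy = {
    "task_a": [True, False, True],
    "task_b": [False, False, True],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}

print(pass_at_k(toy, 1), pass_at_k(toy, 3))  # 0.25 0.75
```

Under this definition Pass@3 can never be lower than Pass@1 for the same set of rollouts, since every task solved on the first attempt is also solved within three.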

Method

MobileGym grounds experience; HiFPO turns feedback into updates.

Figure: MobileForge overview pipeline. MobileForge links target-app exploration, curriculum mining, rollout execution, hierarchical evaluation, and hint-contextualized policy optimization in one annotation-free adaptation loop.

MobileGym

A unified mobile substrate that mines executable tasks from app interaction traces and evaluates completed rollouts with outcome labels, process feedback, and corrective hints.

HiFPO

A feedback-guided optimization loop that reuses hints across attempts, filters mastered tasks and noisy steps, and trains on hint-contextualized step-level GRPO samples.
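
To make the HiFPO description more concrete, here is a rough sketch of how step-level training samples might be assembled from one group of rollouts on the same task. It is not the authors' implementation. The `Step` and `Rollout` classes, the `noisy` flag, the "all attempts succeeded" test for mastery, and the way the hint is appended to the prompt are all assumptions made for illustration. Only the group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class Step:
    prompt: str          # observation plus instruction shown to the policy
    action: str          # action the policy produced at this step
    noisy: bool = False  # hypothetical flag set by a critic for unreliable steps


@dataclass
class Rollout:
    reward: float        # outcome reward for the whole attempt (1.0 = success)
    steps: List[Step]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_mastered(group: List[Rollout]) -> bool:
    """Assumed mastery rule: every attempt in the group succeeded."""
    return all(r.reward >= 1.0 for r in group)


def build_samples(group: List[Rollout], hint: str = "") -> List[Tuple[str, str, float]]:
    """Return (prompt, action, advantage) triples for one task's rollout group."""
    if is_mastered(group):
        return []  # mastered tasks carry no learning signal, so skip them
    advantages = group_relative_advantages([r.reward for r in group])
    samples = []
    for rollout, adv in zip(group, advantages):
        for step in rollout.steps:
            if step.noisy:
                continue  # drop steps flagged as noisy
            prompt = f"{step.prompt}\nHint: {hint}" if hint else step.prompt
            samples.append((prompt, step.action, adv))
    return samples
```

Skipping mastered groups also avoids a degenerate case in the normalization: when all rewards in a group are equal, every advantage is zero and the samples contribute nothing to the update.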

Open Release

Artifacts for reproducing annotation-free adaptation.

Code

Exploration, rollout, training, evaluation, and release manifests for the MobileForge pipeline.

Models

ForgeQwen3 and ForgeOwl checkpoints adapted from automatically generated MobileForge tasks.

Datasets

Generated tasks, exploration trajectories, and training data grouped in the MobileForge dataset collection.

Benchmark Results

AndroidWorld and MobileWorld evaluation archives with public model-to-result mappings.

Figure: MobileForge corrective hint case study. Corrective hints from MobileGym-Critic are reused by HiFPO to guide later attempts and training prompts.

Hierarchical Feedback

Making rollout feedback reusable.

MobileGym-Critic separates final task outcome, step-level process quality, and corrective hints. HiFPO uses these signals to keep informative experience, discard mastered tasks, and condition GRPO on feedback from earlier attempts.
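
The three feedback levels and the reuse of hints across attempts can be pictured with a small data structure. This is an illustrative sketch only: `CriticFeedback`, `HintMemory`, their fields, and the prompt layout are hypothetical names chosen here, not the released MobileGym-Critic interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CriticFeedback:
    success: bool             # outcome level: did the rollout complete the task?
    step_scores: List[float]  # process level: one quality score per step
    hint: str = ""            # corrective hint for the next attempt


@dataclass
class HintMemory:
    """Stores hints per task so later attempts can be conditioned on them."""
    hints: Dict[str, List[str]] = field(default_factory=dict)

    def update(self, task_id: str, feedback: CriticFeedback) -> None:
        # Keep hints only from failed attempts, and avoid duplicates.
        if not feedback.success and feedback.hint:
            bucket = self.hints.setdefault(task_id, [])
            if feedback.hint not in bucket:
                bucket.append(feedback.hint)

    def contextualize(self, task_id: str, instruction: str) -> str:
        # Prepend nothing when no hints exist, so first attempts are unchanged.
        stored = self.hints.get(task_id, [])
        if not stored:
            return instruction
        lines = "\n".join(f"- {h}" for h in stored)
        return f"{instruction}\nHints from earlier attempts:\n{lines}"


memory = HintMemory()
memory.update("set_alarm", CriticFeedback(False, [0.9, 0.2], "Tap the + button before editing the time."))
print(memory.contextualize("set_alarm", "Set an alarm for 7:00."))
```

Keeping the outcome, step scores, and hint as separate fields mirrors the separation described above: the outcome can drive task filtering, the step scores can drive step filtering, and the hint can be fed back into prompts.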

Citation

Cite MobileForge

@article{liu2026mobileforge,
  title={MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization},
  author={Liu, Guangyi and Zhao, Pengxiang and Wu, Gao and Yin, Yiwen and Li, Mading and Liu, Liang and Liu, Congxiao and Qi, Zhang and Wang, Mengyan and Guo, Liang and Liu, Yong},
  journal={arXiv preprint arXiv:2606.19930},
  year={2026}
}