A Unified VLA Foundation Model for Versatile Embodied Navigation — Achieving Grand Unification Across 5 Core Tasks with a Hierarchical Brain-Action Architecture
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation.
To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
Point-Goal: Reach precise metric coordinates, serving as the foundational primitive for robust locomotion and obstacle avoidance.
Object-Goal: Actively search for and navigate to specific object categories in unseen environments with semantic reasoning.
Instruction-Following: Execute long-horizon, complex natural language paths with rigorous linguistic-action alignment.
POI-Goal: Identify Points of Interest and navigate to their physical entrances, bridging outdoor-indoor environments.
Person-Following: Real-time tracking of dynamic human targets — a critical social capability for human-robot interaction.
A unified VLA architecture that combines high-level cognitive reasoning with low-level motion planning, seamlessly generalizing across five core navigation tasks.
Unifies heterogeneous inputs — panoramic RGB, episodic visual memory, text goals, and geometric coordinates — into a shared latent space through flexible token-based encoding.
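A minimal sketch of what such token-based encoding could look like, assuming illustrative module names and placeholder dimensions (none of these are the actual ABot-N0 interfaces): each modality is projected into a shared token space and concatenated into one sequence for the Brain.

```python
# Illustrative sketch: fuse panoramic RGB, episodic memory, text, and goal
# coordinates into one token sequence. Sizes and names are placeholders.
import torch
import torch.nn as nn

class MultimodalTokenizer(nn.Module):
    def __init__(self, d_model=2560, img_feat_dim=1024, vocab_size=32_000):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, d_model)     # panoramic RGB / memory features
        self.text_embed = nn.Embedding(vocab_size, d_model)  # goal / instruction tokens
        self.coord_proj = nn.Linear(3, d_model)              # (x, y, theta) goal coordinates

    def forward(self, img_feats, memory_feats, text_ids, goal_xyt):
        tokens = [
            self.img_proj(img_feats),                 # [B, N_img, d_model]
            self.img_proj(memory_feats),              # [B, N_mem, d_model] episodic memory frames
            self.text_embed(text_ids),                # [B, N_txt, d_model]
            self.coord_proj(goal_xyt).unsqueeze(1),   # [B, 1, d_model]
        ]
        return torch.cat(tokens, dim=1)               # shared latent token sequence for the LLM
```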
Built on Qwen3-4B LLM backbone. Features task-conditional dual heads: a Reasoning Head for scene analysis and spatial reasoning, and an Action Head for navigation decisions.
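A rough sketch of the task-conditional dual-head idea, assuming a Hugging Face-style backbone that accepts `inputs_embeds`; the head names and sizes are placeholders, not the released model:

```python
# Illustrative dual-head wrapper: a shared LLM trunk feeds either a Reasoning
# Head (next-token logits) or an Action Head (latent conditioning for the
# Action Expert), selected per task.
import torch.nn as nn

class DualHeadBrain(nn.Module):
    def __init__(self, llm, d_model=2560, vocab_size=32_000, action_latent=512):
        super().__init__()
        self.llm = llm                                          # e.g. a causal LLM backbone
        self.reasoning_head = nn.Linear(d_model, vocab_size)    # scene analysis / spatial reasoning
        self.action_head = nn.Linear(d_model, action_latent)    # navigation decision latent

    def forward(self, tokens, mode="action"):
        hidden = self.llm(inputs_embeds=tokens).last_hidden_state
        if mode == "reasoning":
            return self.reasoning_head(hidden)                  # logits over the text vocabulary
        return self.action_head(hidden[:, -1])                  # condition vector from the last token
```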
Employs Flow Matching to generate precise, multi-modal trajectory distributions — 5 waypoints with position (x,y) and heading (θ) for continuous robot control.
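The sketch below illustrates generic flow matching for waypoint generation (linear interpolation path, Euler integration from Gaussian noise); `velocity_net` and all hyperparameters are assumptions, not the paper's exact formulation:

```python
# Flow-matching sketch for a 5-waypoint (x, y, theta) trajectory.
import torch

@torch.no_grad()
def sample_waypoints(velocity_net, cond, n_steps=10, horizon=5, dim=3):
    # Start from Gaussian noise and integrate the learned velocity field from t=0 to t=1.
    a = torch.randn(cond.shape[0], horizon, dim, device=cond.device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((cond.shape[0],), i * dt, device=cond.device)
        a = a + dt * velocity_net(a, t, cond)         # Euler step along the flow
    return a                                          # [B, 5, (x, y, theta)]

def flow_matching_loss(velocity_net, expert_traj, cond):
    # Linear path a_t = (1 - t) * noise + t * expert; target velocity = expert - noise.
    noise = torch.randn_like(expert_traj)
    t = torch.rand(expert_traj.shape[0], 1, 1, device=expert_traj.device)
    a_t = (1 - t) * noise + t * expert_traj
    target_v = expert_traj - noise
    pred_v = velocity_net(a_t, t.view(-1), cond)
    return ((pred_v - target_v) ** 2).mean()
```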
A unified synthesis pipeline integrating high-fidelity 3D scenes, expert trajectories, and cognitive reasoning samples at unprecedented scale.
Point-Goal: 2.0M pseudo-trajectories from internet videos, 1.7M synthetic from 3D scenes, 340K real-world robot demonstrations.
Object-Goal: Semantic search and discovery of specific object categories in unseen environments.
Instruction-Following: VLN-CE R2R/RxR navigation, door-traversal, language-guided person search, and short-horizon atomic movement primitives.
POI-Goal: Outdoor-to-indoor navigation via streetview OCR, trajectory-instruction alignment, and video generation.
Person-Following: 3 proximity configs × 3 challenge categories (STT, DT, AT), plus 400K target-absent cases.
A progressive training pipeline ensuring the model first understands the world before learning to act within it.
Before learning "how to move", the agent learns "what to see" and "how to reason." We freeze the Vision Encoder and fine-tune the LLM Brain using Next Token Prediction loss on diverse reasoning tasks. The Action Expert remains frozen, ensuring gradients focus purely on optimizing visual-linguistic representations.
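As a sketch of this stage-1 recipe, assuming hypothetical submodule names (`vision_encoder`, `llm_brain`, `action_expert`), the setup amounts to toggling `requires_grad` and training with a standard next-token cross-entropy loss:

```python
# Stage-1 sketch: freeze Vision Encoder and Action Expert, train only the Brain
# with Next Token Prediction. Module names are illustrative placeholders.
import torch.nn.functional as F

def configure_stage1(model):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False          # frozen: visual features stay fixed
    for p in model.action_expert.parameters():
        p.requires_grad = False          # frozen: no action learning yet
    for p in model.llm_brain.parameters():
        p.requires_grad = True           # only the Brain receives gradients

def ntp_loss(logits, target_ids):
    # Shift so each position predicts the next token.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )
```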
All five navigation tasks are unified into a single multi-task training regime. A mixed-training strategy (20% reasoning replay) prevents catastrophic forgetting. Dual-Head Optimization jointly trains the AR Head and Action Expert.
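A toy sketch of such a mixed sampler, with placeholder dataset handles and the 20% replay ratio from the description:

```python
# Mixed-training sketch: ~20% reasoning replay, ~80% navigation samples drawn
# uniformly across the five tasks. Dataset objects are illustrative.
import random

def sample_batch(nav_datasets, reasoning_dataset, batch_size=64, replay_ratio=0.2):
    batch = []
    for _ in range(batch_size):
        if random.random() < replay_ratio:
            batch.append(reasoning_dataset.sample())      # replay guards against catastrophic forgetting
        else:
            task = random.choice(list(nav_datasets))      # pick one of the five navigation tasks
            batch.append(nav_datasets[task].sample())
    return batch
```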
A flow-based reinforcement learning framework that explicitly enforces social compliance. The Brain is frozen while the Action Expert is fine-tuned to maximize a composite reward balancing social compliance, expert similarity, smoothness, and efficiency.
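A hedged sketch of what a composite reward of this form might look like; the individual terms and weights below are illustrative assumptions, not the values used in training:

```python
# Illustrative composite reward: social compliance, expert similarity,
# smoothness, and efficiency terms combined with assumed weights.
import numpy as np

def composite_reward(traj, expert_traj, min_person_dist, path_len, goal_dist,
                     w_social=1.0, w_expert=1.0, w_smooth=0.5, w_eff=0.5):
    # traj, expert_traj: [5, 3] arrays of (x, y, theta) waypoints.
    r_social = -max(0.0, 1.0 - min_person_dist)                        # penalize entering personal space (<1 m, assumed)
    r_expert = -np.mean(np.linalg.norm(traj - expert_traj, axis=-1))   # stay close to expert waypoints
    r_smooth = -np.mean(np.abs(np.diff(traj[:, 2])))                   # small heading changes (no angle wrapping in this sketch)
    r_eff = -(path_len + goal_dist)                                    # short path, end near the goal
    return w_social * r_social + w_expert * r_expert + w_smooth * r_smooth + w_eff * r_eff
```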
Comprehensive evaluation demonstrating superior performance across all five navigation paradigms.
| Method | Mean↓ | Turn↓ | Crossing↓ | Detour↓ | Proximity↓ | Crowd↓ | Other↓ | All↓ |
|---|---|---|---|---|---|---|---|---|
| GNM | 16.2 | 31.1 | 14.8 | 12.5 | 14.7 | 12.8 | 11.0 | 12.1 |
| ViNT | 16.5 | 31.1 | 15.4 | 12.9 | 14.8 | 13.3 | 11.6 | 12.6 |
| NoMaD | 19.1 | 35.1 | 18.5 | 15.6 | 18.1 | 14.3 | 12.8 | 12.1 |
| CityWalker | 15.2 | 26.6 | 14.1 | 13.9 | 14.3 | 12.0 | 10.4 | 11.5 |
| ABot-N0 | 11.2 | 21.3 | 9.8 | 12.8 | 8.1 | 8.8 | 6.3 | 7.6 |
| Method | SR↑ | RC↑ | SPL↑ | DCR↑ | TCR↑ |
|---|---|---|---|---|---|
| GNM* | 43.3 | 62.4 | 37.0 | 26.5 | 28.7 |
| ViNT* | 45.6 | 66.2 | 39.5 | 31.4 | 33.8 |
| NoMaD* | 41.1 | 60.5 | 35.4 | 29.5 | 31.6 |
| CityWalker | 47.8 | 64.7 | 44.7 | 36.1 | 36.6 |
| ABot-N0 | 88.3 | 92.1 | 79.2 | 85.1 | 85.4 |
| Method | Val-Seen SR↑ | Val-Seen SPL↑ | Val-Seen-Synonyms SR↑ | Val-Seen-Synonyms SPL↑ | Val-Unseen SR↑ | Val-Unseen SPL↑ |
|---|---|---|---|---|---|---|
| DAgRL+OD | 38.5 | 21.1 | 39.0 | 21.4 | 37.1 | 19.8 |
| Uni-NaVid | 41.3 | 21.1 | 43.9 | 21.8 | 39.5 | 19.8 |
| MTU3D | 55.0 | 23.6 | 45.0 | 14.7 | 40.8 | 12.1 |
| NavFoM (Four views) | 40.1 | 27.1 | 45.4 | 32.6 | 45.2 | 31.9 |
| ABot-N0 | 55.3 | 32.1 | 55.4 | 33.2 | 54.0 | 30.5 |
| Method | NE↓ | OS↑ | SR↑ | SPL↑ |
|---|---|---|---|---|
| NaVILA | 5.22 | 62.5 | 54.0 | 49.0 |
| StreamVLN | 4.98 | 64.2 | 56.9 | 51.9 |
| InternVLA-N1 (S1+S2) | 4.83 | 63.3 | 58.2 | 54.0 |
| NavFoM (Four views) | 4.61 | 72.1 | 61.7 | 55.3 |
| ABot-N0 | 3.78 | 70.8 | 66.4 | 63.9 |
| Method | NE↓ | SR↑ | SPL↑ |
|---|---|---|---|
| NaVILA | 6.77 | 49.3 | 44.0 |
| StreamVLN | 6.22 | 52.9 | 46.0 |
| InternVLA-N1 (S1+S2) | 5.91 | 53.5 | 46.1 |
| NavFoM (Four views) | 4.74 | 64.4 | 56.2 |
| ABot-N0 | 3.83 | 69.3 | 60.0 |
| Method | SR (0.1m)↑ | SR (0.2m)↑ | SR (0.3m)↑ | TR (mean)↓ | TR (best)↓ | TR (worst)↓ |
|---|---|---|---|---|---|---|
| NoMaD | 4.13 | 15.07 | 29.20 | 31.35 | 5.45 | 85.91 |
| CityWalker | 13.79 | 41.02 | 65.96 | 15.58 | 0.76 | 56.47 |
| OmniNav | 18.78 | 46.99 | 72.39 | 14.16 | 0.99 | 53.79 |
| ABot-N0 | 32.14 | 71.50 | 88.68 | 9.84 | 0.44 | 51.38 |
Single-Target (STT), Distracted (DT), and Ambiguity (AT) settings:
| Method | STT SR↑ | STT TR↑ | STT CR↓ | DT SR↑ | DT TR↑ | DT CR↓ | AT SR↑ | AT TR↑ | AT CR↓ |
|---|---|---|---|---|---|---|---|---|---|
| TrackVLA | 85.1 | 78.6 | 1.65 | 57.6 | 63.2 | 5.80 | 50.2 | 63.7 | 17.1 |
| NavFoM | 85.0 | 80.5 | - | 61.4 | 68.2 | - | - | - | - |
| TrackVLA++ | 86.0 | 81.0 | 2.10 | 66.5 | 68.8 | 4.71 | 51.2 | 63.4 | 15.9 |
| ABot-N0 | 86.9 | 87.6 | 8.54 | 66.7 | 75.4 | 11.6 | 67.3 | 79.5 | 7.05 |
A deployable framework that augments ABot-N0 with planning, topological memory, and self-reflection for robust long-horizon real-world missions.
A hierarchical topological memory with 4 layers (Block, Road, Function, Object/POI) that enables cross-scale spatial knowledge deposition and dynamic updating — from residential interiors to urban environments.
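One way such a memory could be represented, as a sketch using `networkx` with the four layer names above; node and edge attributes are illustrative assumptions:

```python
# Hierarchical topological memory sketch: nodes carry a layer tag and pose,
# intra-layer edges are traversable connections, inter-layer edges are containment.
import networkx as nx

LAYERS = ("block", "road", "function", "object_poi")

class TopoMemory:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_node(self, node_id, layer, pose, descriptor=None):
        assert layer in LAYERS
        self.graph.add_node(node_id, layer=layer, pose=pose, descriptor=descriptor)

    def connect(self, a, b, relation="traversable", cost=1.0):
        # relation: "traversable" (same layer) or "contains" (parent -> child layer)
        self.graph.add_edge(a, b, relation=relation, cost=cost)

    def route(self, start, goal):
        # Plan over traversable edges only, ignoring containment links.
        nav = nx.subgraph_view(
            self.graph,
            filter_edge=lambda u, v: self.graph[u][v]["relation"] == "traversable",
        )
        return nx.shortest_path(nav, start, goal, weight="cost")
```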
Leverages VLM reasoning to decompose ambiguous user instructions into executable sub-tasks via Chain-of-Thought. Implements a "Coarse-to-Fine" strategy: Point-Goal for long-horizon approach, then Object/POI-Goal for precise local reaching.
A VLM-based Self-Reflector assesses sub-task completion. On failure, it diagnoses the cause and triggers re-planning — emulating human-like self-correction for robust long-horizon autonomy.
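A compact sketch of this plan-execute-reflect loop, with hypothetical `planner`, `navigator`, and `reflector` interfaces standing in for the actual system components:

```python
# Plan-execute-reflect sketch: decompose the instruction, execute each sub-task
# coarse-to-fine, and let the reflector trigger refinement on failure.
def run_mission(instruction, planner, navigator, reflector, max_retries=3):
    subtasks = planner.decompose(instruction)            # VLM Chain-of-Thought -> executable sub-tasks
    for subtask in subtasks:
        for _ in range(max_retries):
            navigator.point_goal(subtask.approx_location)  # coarse: long-horizon approach
            navigator.local_goal(subtask.target)           # fine: Object-Goal or POI-Goal reaching
            verdict = reflector.assess(subtask)            # VLM judges sub-task completion
            if verdict.success:
                break
            subtask = planner.refine(subtask, verdict.diagnosis)  # diagnose the failure, adjust the sub-task
        else:
            return False                                   # retries exhausted on this sub-task
    return True
```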
A high-speed reactive layer operating at 10Hz+ on edge devices. Translates abstract VLA waypoints into precise velocity commands (vx, vy, vyaw) using LiDAR-based occupancy mapping.
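As a rough sketch of the waypoint-to-velocity step, a simple proportional tracker with an occupancy veto; gains, limits, and the `occupancy_is_free` check are assumptions, not the deployed controller:

```python
# Convert a VLA waypoint (x, y, theta) in the robot frame into (vx, vy, vyaw).
import numpy as np

def waypoint_to_cmd(wp_xyt, occupancy_is_free, k_lin=1.0, k_ang=1.5,
                    v_max=1.0, w_max=1.2):
    x, y, theta = wp_xyt
    if not occupancy_is_free(x, y):            # LiDAR occupancy veto: stop if the waypoint is blocked
        return 0.0, 0.0, 0.0
    vx = float(np.clip(k_lin * x, -v_max, v_max))
    vy = float(np.clip(k_lin * y, -v_max, v_max))
    vyaw = float(np.clip(k_ang * theta, -w_max, w_max))
    return vx, vy, vyaw
```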
Successfully deployed on Unitree Go2 quadrupedal robot with 2Hz VLA inference and 10Hz closed-loop control.
Quadrupedal robot with 12 actuated DOF and dynamically stable locomotion across diverse terrains.
Three monocular RGB cameras (H120°×V90° each) providing comprehensive panoramic perception.
Unitree 4D LiDAR L2 for occupancy mapping and RTK-GNSS for global localization.
157 TOPS, 16GB RAM — all VLA inference runs onboard at 2Hz with only 3% performance reduction.
Planner runs on the cloud (RTX 4090); the VLA and Controller run on the edge, ensuring autonomous operation even without network connectivity.
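A minimal sketch of how such a cloud/edge split can degrade gracefully when the network drops; all interfaces here are hypothetical:

```python
# Cloud/edge fallback sketch: the cloud planner is optional, while VLA inference
# and control always run on-device.
def plan_step(instruction, cloud_planner, edge_vla):
    try:
        subgoal = cloud_planner.next_subgoal(instruction)   # remote call to the cloud planner
    except (TimeoutError, ConnectionError):
        subgoal = edge_vla.default_subgoal(instruction)     # degrade gracefully, stay autonomous
    return edge_vla.act(subgoal)                            # waypoints are always produced on the edge
```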
Watch ABot-N0 navigate complex real-world environments — from single-task atomic skills to long-horizon agentic missions.
Real-world navigation trajectories and application scenarios across diverse environments.
ArXiv preprint coming soon. Stay tuned!