Technical Report

ABot-N0

A Unified VLA Foundation Model for Versatile Embodied Navigation — Achieving Grand Unification Across 5 Core Tasks with a Hierarchical Brain-Action Architecture

5
Unified Tasks
7
SOTA Benchmarks
16.9M
Expert Trajectories
7,802
3D Scenes
5.0M
Reasoning Samples
10.7
km² Coverage
ABot-N0 Overview

Grand Unification of Embodied Navigation

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation.


To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.

Point-Goal

Reach precise metric coordinates, serving as the foundational primitive for robust locomotion and obstacle avoidance.

Object-Goal

Actively search for and navigate to specific object categories in unseen environments with semantic reasoning.

Instruction-Following

Execute long-horizon paths specified by complex natural-language instructions, with rigorous alignment between language and action.

POI-Goal

Identify Points of Interest and navigate to their physical entrances, bridging outdoor-indoor environments.

Person-Following

Real-time tracking of dynamic human targets — a critical social capability for human-robot interaction.

Hierarchical Brain-Action Design

A unified VLA architecture that combines high-level cognitive reasoning with low-level motion planning, seamlessly generalizing across five core navigation tasks.

ABot-N0 Architecture
The Architecture of ABot-N0. The model adopts a hierarchical "Brain-Action" design. The Universal Multi-Modal Encoder unifies heterogeneous inputs (RGB observations, visual history, and goal specifications) into a shared token sequence. The Cognitive Brain (LLM) supports dual-mode operation: a Reasoning Head for semantic understanding and an Action Head for motion planning. The Action Expert employs Flow Matching to generate trajectory distributions.
🔍 Universal Multi-Modal Encoder

Unifies heterogeneous inputs — panoramic RGB, episodic visual memory, text goals, and geometric coordinates — into a shared latent space through flexible token-based encoding.

🧠 Cognitive Brain

Built on Qwen3-4B LLM backbone. Features task-conditional dual heads: a Reasoning Head for scene analysis and spatial reasoning, and an Action Head for navigation decisions.

Action Expert

Employs Flow Matching to generate precise, multi-modal trajectory distributions — 5 waypoints with position (x,y) and heading (θ) for continuous robot control.
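
To make the Flow Matching step concrete, here is a minimal sampling sketch: starting from Gaussian noise, an Euler integrator follows a learned velocity field conditioned on the Brain's latents to produce a 5×3 waypoint chunk. The `velocity_net` interface, step count, and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch

# Assumed dimensions: 5 waypoints x (x, y, theta); the real network,
# conditioning scheme, and integration step count are not specified here.
NUM_WAYPOINTS, WP_DIM, NUM_STEPS = 5, 3, 10

@torch.no_grad()
def sample_waypoints(velocity_net, brain_latents):
    """Integrate a learned velocity field from noise to a trajectory.

    velocity_net(x_t, t, cond) -> velocity is the core of flow matching:
    starting from Gaussian noise at t = 0, Euler steps along the
    predicted velocity move the sample toward the data distribution at t = 1.
    """
    x = torch.randn(1, NUM_WAYPOINTS, WP_DIM)   # x_0 ~ N(0, I)
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = torch.full((1,), i * dt)
        v = velocity_net(x, t, brain_latents)   # predicted velocity at time t
        x = x + dt * v                          # one Euler integration step
    return x.squeeze(0)                         # (5, 3): x, y, theta per waypoint
```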

The Largest Embodied Navigation Data Engine

A unified synthesis pipeline integrating high-fidelity 3D scenes, expert trajectories, and cognitive reasoning samples at unprecedented scale.

7,802
3D Scenes
10.7 km²
Total Area Coverage
16.9M
Expert Trajectories
5.0M
Reasoning Samples
01

High-Fidelity 3D Scene Ecosystem

7,802 Scenes · 10.7 km² · 384,754 m of Nav Graphs

3D Scene Ecosystem & Data Sources
3D Scene Ecosystem & Statistics. 7,802 high-fidelity 3D scenes covering 6.25 km² indoor (homes, offices, malls, stations) and 4.42 km² outdoor (intersections, parks, SocCity) environments. All scenes are annotated with traversable navigation graphs for collision-free trajectory synthesis.
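
As a rough illustration of how annotated navigation graphs support collision-free trajectory synthesis, the sketch below samples one expert path with networkx shortest paths. The node and edge attributes (`pos`, `length`) are assumptions about the annotation format, and the graph is assumed connected.

```python
import random
import networkx as nx

def synthesize_trajectory(nav_graph: nx.Graph) -> list:
    """Sample one collision-free expert trajectory from a nav graph.

    Nodes are traversable waypoints with 'pos' = (x, y) attributes and
    edges carry metric 'length' weights; both are assumptions here.
    """
    start, goal = random.sample(list(nav_graph.nodes), 2)
    path = nx.shortest_path(nav_graph, start, goal, weight="length")
    # The node sequence doubles as the supervision signal: positions
    # along the path become the expert waypoints for training.
    return [nav_graph.nodes[n]["pos"] for n in path]
```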
02

Universal Trajectories Dataset

16.9M Trajectories · 5 Navigation Tasks

Point-Goal
4.0M Trajectories

2.0M pseudo-trajectories from internet videos, 1.7M synthetic trajectories from 3D scenes, and 340K real-world robot demonstrations.

Object-Goal
3.2M Trajectories

Semantic search and discovery of specific object categories in unseen environments.

Instruction-Following
2.8M Trajectories

VLN-CE R2R/RxR navigation, door traversal, language-guided person search, and short-horizon atomic movement primitives.

POI-Goal
2.5M Trajectories

Outdoor-to-indoor navigation built from street-view OCR, trajectory-instruction alignment, and video generation.

Person-Following
4.0M Trajectories

3 proximity configurations × 3 challenge categories (STT, DT, AT), plus 400K target-absent cases.

03

Cognitive Reasoning Dataset

5.0M Samples · 6 Reasoning Tasks

Reasoning Dataset
Cognitive Reasoning Dataset. 5.0M samples spanning Navigable Areas Analysis (1.2M), Social Navigation CoT (0.8M), Instruction-Following Reasoning (1.3M), Object-Goal Reasoning (0.1M), POI Grounding (0.5M), and General VQA (1.1M) — grounding decision-making in explicit spatial-social logic.

Three-Stage Curriculum Learning

A progressive training pipeline ensuring the model first understands the world before learning to act within it.

Phase 1

Cognitive Warm-up

Before learning "how to move", the agent learns "what to see" and "how to reason." We freeze the Vision Encoder and fine-tune the LLM Brain using Next Token Prediction loss on diverse reasoning tasks. The Action Expert remains frozen, ensuring gradients focus purely on optimizing visual-linguistic representations.
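
A minimal PyTorch-style sketch of this freezing scheme, assuming illustrative module names (`vision_encoder`, `llm_brain`, `action_expert`) and an arbitrary learning rate:

```python
import torch

def configure_phase1(model):
    """Phase 1 (Cognitive Warm-up): train only the LLM Brain.

    Freezing the vision encoder and action expert ensures that the
    next-token-prediction gradients optimize visual-linguistic
    representations alone. Module names here are assumptions.
    """
    for p in model.vision_encoder.parameters():
        p.requires_grad = False       # frozen: "what to see" stays fixed
    for p in model.action_expert.parameters():
        p.requires_grad = False       # frozen: no motion learning yet
    for p in model.llm_brain.parameters():
        p.requires_grad = True        # fine-tuned with the NTP loss
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```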

Phase 2

Unified Sensorimotor SFT

All five navigation tasks are unified into a single multi-task training regime. A mixed-training strategy (20% reasoning replay) prevents catastrophic forgetting. Dual-Head Optimization jointly trains the AR Head and Action Expert.

\mathcal{L}_{\text{Phase2}} = \lambda_{\text{txt}} \cdot \mathcal{L}_{\text{NTP}}(\theta_{\text{brain}}) + \lambda_{\text{flow}} \cdot \mathcal{L}_{\text{CFM}}(\theta_{\text{action}} \mid \theta_{\text{brain}})
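
In code, the dual-head objective could look like the following sketch, where the NTP term is next-token cross-entropy on the AR head and the CFM term regresses the Action Expert's predicted velocity onto the conditional flow-matching target; tensor shapes and loss weights are assumptions.

```python
import torch.nn.functional as F

def phase2_loss(logits, target_tokens, velocity_pred, velocity_target,
                lambda_txt: float = 1.0, lambda_flow: float = 1.0):
    """L_Phase2 = lambda_txt * L_NTP(theta_brain)
                + lambda_flow * L_CFM(theta_action | theta_brain).

    logits: (B, T, V) AR-head outputs; target_tokens: (B, T) labels.
    velocity_target is the conditional flow-matching regression target
    (e.g. x_1 - x_0 for a linear interpolation path).
    """
    l_ntp = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    l_cfm = F.mse_loss(velocity_pred, velocity_target)
    return lambda_txt * l_ntp + lambda_flow * l_cfm
```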
Phase 3

Post-Training Value Alignment via SAFE-GRPO

A flow-based reinforcement learning framework that explicitly enforces social compliance. The Brain is frozen while the Action Expert is fine-tuned to maximize a composite reward balancing social compliance, expert similarity, smoothness, and efficiency.

\mathcal{R} = w_{\text{soc}} \cdot \mathcal{R}_{\text{social}} + w_{\text{exp}} \cdot \mathcal{R}_{\text{expert}} + w_{\text{sm}} \cdot \mathcal{R}_{\text{smooth}} + w_{\text{eff}} \cdot \mathcal{R}_{\text{eff}}
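
The sketch below shows one plausible instantiation of this composite reward; the concrete term definitions and weights used in SAFE-GRPO are not given here, so every formula inside is an assumption.

```python
import numpy as np

def composite_reward(traj, expert, social_violations, path_time,
                     w_soc=1.0, w_exp=1.0, w_sm=0.5, w_eff=0.5):
    """R = w_soc*R_social + w_exp*R_expert + w_sm*R_smooth + w_eff*R_eff.

    traj and expert are (N, 2) arrays of (x, y) waypoints; all four
    term definitions below are illustrative stand-ins.
    """
    r_social = -float(social_violations)                     # penalize social non-compliance
    r_expert = -np.linalg.norm(traj - expert, axis=1).mean() # distance to the expert path
    r_smooth = -np.abs(np.diff(traj, n=2, axis=0)).sum()     # penalize jerky curvature
    r_eff = -path_time                                       # faster completion is better
    return (w_soc * r_social + w_exp * r_expert
            + w_sm * r_smooth + w_eff * r_eff)
```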

State-of-the-Art Across 7 Benchmarks

Comprehensive evaluation demonstrating superior performance across all five navigation paradigms.

Method      Mean  Turn  Crossing  Detour  Proximity  Crowd  Other  All
GNM         16.2  31.1  14.8      12.5    14.7       12.8   11.0   12.1
ViNT        16.5  31.1  15.4      12.9    14.8       13.3   11.6   12.6
NoMaD       19.1  35.1  18.5      15.6    18.1       14.3   12.8   12.1
CityWalker  15.2  26.6  14.1      13.9    14.3       12.0   10.4   11.5
ABot-N0     11.2  21.3   9.8      12.8     8.1        8.8    6.3    7.6
CityWalker Benchmark (Open-Loop): MAOE metric (↓ lower is better). ABot-N0 achieves mean MAOE of 11.2, significantly outperforming the previous SOTA CityWalker (15.2) — a 26.3% improvement.
Method      SR↑   RC↑   SPL↑  DCR↑  TCR↑
GNM*        43.3  62.4  37.0  26.5  28.7
ViNT*       45.6  66.2  39.5  31.4  33.8
NoMaD*      41.1  60.5  35.4  29.5  31.6
CityWalker  47.8  64.7  44.7  36.1  36.6
ABot-N0     88.3  92.1  79.2  85.1  85.4
SocNav Benchmark (Closed-Loop): ABot-N0 achieves 88.3% Success Rate, nearly doubling the baseline (47.8%). Social compliance DCR reaches 85.1% vs. 36.1% for the best baseline.
Method               Val-Seen      Val-Seen-Synonyms  Val-Unseen
                     SR↑   SPL↑    SR↑   SPL↑         SR↑   SPL↑
DAgRL+OD             38.5  21.1    39.0  21.4         37.1  19.8
Uni-NaVid            41.3  21.1    43.9  21.8         39.5  19.8
MTU3D                55.0  23.6    45.0  14.7         40.8  12.1
NavFoM (Four views)  40.1  27.1    45.4  32.6         45.2  31.9
ABot-N0              55.3  32.1    55.4  33.2         54.0  30.5
HM3D-OVON Benchmark: ABot-N0 surpasses MTU3D by 13.2% in SR on the challenging Val-Unseen split. While MTU3D suffers a 14.2% drop from Val-Seen to Val-Unseen, ABot-N0 exhibits only 1.3% decline — demonstrating exceptional open-vocabulary generalization.
Method                NE↓   OS↑   SR↑   SPL↑
NaVILA                5.22  62.5  54.0  49.0
StreamVLN             4.98  64.2  56.9  51.9
InternVLA-N1 (S1+S2)  4.83  63.3  58.2  54.0
NavFoM (Four views)   4.61  72.1  61.7  55.3
ABot-N0               3.78  70.8  66.4  63.9
VLN-CE R2R Val-Unseen: ABot-N0 achieves 66.4% SR and 63.9% SPL, surpassing NavFoM by 4.7% SR and 8.6% SPL — using only panoramic RGB without depth or odometry.
Method                NE↓   SR↑   SPL↑
NaVILA                6.77  49.3  44.0
StreamVLN             6.22  52.9  46.0
InternVLA-N1 (S1+S2)  5.91  53.5  46.1
NavFoM (Four views)   4.74  64.4  56.2
ABot-N0               3.83  69.3  60.0
VLN-CE RxR Val-Unseen: ABot-N0 achieves 69.3% SR and 60.0% SPL, outperforming NavFoM by 4.9% SR and 3.8% SPL — demonstrating strong cross-lingual generalization.
Method      SR (0.1m)↑  SR (0.2m)↑  SR (0.3m)↑  TR (mean)↓  TR (best)↓  TR (worst)↓
NoMaD        4.13       15.07       29.20       31.35       5.45        85.91
CityWalker  13.79       41.02       65.96       15.58       0.76        56.47
OmniNav     18.78       46.99       72.39       14.16       0.99        53.79
ABot-N0     32.14       71.50       88.68        9.84       0.44        51.38
BridgeNav Dataset: ABot-N0 achieves 70.1% improvement at the strictest 0.1m threshold and reduces average trajectory deviation by 30.5%.
Method       Single-Target (STT)   Distracted (DT)       Ambiguity (AT)
             SR↑   TR↑   CR↓       SR↑   TR↑   CR↓       SR↑   TR↑   CR↓
TrackVLA     85.1  78.6  1.65      57.6  63.2  5.80      50.2  63.7  17.1
NavFoM       85.0  80.5  -         61.4  68.2  -         -     -     -
TrackVLA++   86.0  81.0  2.10      66.5  68.8  4.71      51.2  63.4  15.9
ABot-N0      86.9  87.6  8.54      66.7  75.4  11.6      67.3  79.5  7.05
EVT-Bench: ABot-N0 achieves 16.1% improvement in both SR and TR on the challenging Ambiguity Tracking task, even surpassing multi-view methods with single-view input.

Agentic Navigation System

A deployable framework that augments ABot-N0 with planning, topological memory, and self-reflection for robust long-horizon real-world missions.

Agentic Navigation System
Agentic Navigation System Overview. The system integrates ABot-N0 with an agentic framework comprising Agentic Planner, Actor, short-term Episodic Memory, and long-term Topo-Memory to handle complex real-world navigation tasks.
Task Execution Pipeline
Task Execution Pipeline. (1) Global Navigation employs Approaching (Point-Goal) to traverse known spaces using topological memory, (2) Local Navigation utilizes Reaching (Object-Goal/POI-Goal) for precise target discovery and Interaction (Instruction-Following, Person-Following) for dynamic engagement, and (3) Neural Controller executes low-level velocity-based motion control.

🗺️ Map-as-Memory (Topo-Memory)

A hierarchical topological memory with 4 layers (Block, Road, Function, Object/POI) that enables cross-scale spatial knowledge deposition and dynamic updating — from residential interiors to urban environments.
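
A minimal sketch of such a 4-layer memory as a directed graph; the layer names follow the text, while the node attributes and the "contains" relation are illustrative assumptions.

```python
import networkx as nx

# Layer names from the text; everything else is an assumption.
LAYERS = ("block", "road", "function", "object_poi")

class TopoMemory:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_node(self, node_id, layer, pos, parent=None):
        """Deposit spatial knowledge at one layer; a 'parent' edge
        links it to the coarser layer above (e.g. an object to its room)."""
        assert layer in LAYERS
        self.graph.add_node(node_id, layer=layer, pos=pos)
        if parent is not None:
            self.graph.add_edge(parent, node_id, relation="contains")

    def update(self, node_id, **attrs):
        """Dynamic updating: refresh attributes as the scene changes."""
        self.graph.nodes[node_id].update(attrs)
```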

🧭 Agentic Planner

Leverages VLM reasoning to decompose ambiguous user instructions into executable sub-tasks via Chain-of-Thought. Implements a "Coarse-to-Fine" strategy: Point-Goal for long-horizon approach, then Object/POI-Goal for precise local reaching.

🔄 Closed-loop Self-Reflection

A VLM-based Self-Reflector assesses sub-task completion. On failure, it diagnoses the cause and triggers re-planning — emulating human-like self-correction for robust long-horizon autonomy.
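
Putting the Planner and Self-Reflector together, the agentic loop could be sketched as follows; the `decompose`/`assess`/`replan` interfaces stand in for VLM calls and are assumptions, not the system's actual API.

```python
def run_mission(planner_vlm, reflector_vlm, actor, instruction, max_replans=3):
    """Coarse-to-fine agentic loop: plan, act, reflect, re-plan on failure."""
    sub_tasks = planner_vlm.decompose(instruction)    # CoT decomposition
    for task in sub_tasks:
        for attempt in range(max_replans):
            observation = actor.execute(task)         # e.g. Point-Goal, then Object-Goal
            verdict = reflector_vlm.assess(task, observation)
            if verdict.success:
                break                                 # sub-task done, move on
            # Failure: diagnose the cause and re-plan this sub-task.
            task = planner_vlm.replan(task, verdict.diagnosis)
        else:
            return False                              # re-planning budget exhausted
    return True
```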

🎮 Neural Controller

A high-speed reactive layer operating at 10Hz+ on edge devices. Translates abstract VLA waypoints into precise velocity commands (vx, vy, vyaw) using LiDAR-based occupancy mapping.
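
A toy proportional sketch of the waypoint-to-velocity mapping; the deployed controller additionally consults the LiDAR occupancy map for obstacle avoidance, which is omitted here, and the gains and limits are assumptions.

```python
import math

def waypoint_to_velocity(wp_x, wp_y, wp_theta, k_lin=1.0, k_ang=1.5,
                         v_max=1.0, w_max=1.5):
    """Map one VLA waypoint (robot frame) to (vx, vy, vyaw) commands."""
    vx = max(-v_max, min(v_max, k_lin * wp_x))     # drive toward the waypoint
    vy = max(-v_max, min(v_max, k_lin * wp_y))     # lateral correction
    # Wrap the heading error to [-pi, pi] before applying the angular gain.
    heading_err = math.atan2(math.sin(wp_theta), math.cos(wp_theta))
    vyaw = max(-w_max, min(w_max, k_ang * heading_err))
    return vx, vy, vyaw
```

Called in a 10Hz loop, this re-queries the most recent VLA waypoint each tick while the 2Hz model inference proceeds asynchronously.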

From Model to Robot

Successfully deployed on Unitree Go2 quadrupedal robot with 2Hz VLA inference and 10Hz closed-loop control.

Unitree Go2

Quadrupedal robot with 12 actuated DOF, dynamically stable locomotion across diverse terrains.

270° Vision Coverage

Three monocular RGB cameras (H120°×V90° each) providing comprehensive panoramic perception.

4D LiDAR + RTK-GNSS

Unitree 4D LiDAR L2 for occupancy mapping and RTK-GNSS for global localization.

NVIDIA Jetson Orin NX

157 TOPS, 16GB RAM — all VLA inference runs onboard at 2Hz with only 3% performance reduction.

Hybrid Cloud-Edge Architecture

Planner runs in the cloud (RTX 4090); the VLA model and Controller run on the edge, ensuring autonomous operation even without a network connection.

Hardware Platform

Real-World Demo Videos

Watch ABot-N0 navigate complex real-world environments — from single-task atomic skills to long-horizon agentic missions.

Long-Horizon Agentic Missions

Multi-stage missions with planning, re-planning, and cross-environment transitions

OUTDOOR Outdoor Long-Horizon Mission
INDOOR Indoor Long-Horizon Mission

Real-World Applications

Practical application scenarios: AI Companion, Guide Dog Assistance — powered by ABot-N0

COMPANION Interactive Companion
GUIDE DOG Guide Dog Assistance

Single-Task Capabilities

Atomic navigation skills: Point-Goal, POI-Goal, Object-Goal, Instruction-Following, Person-Following

POINT-GOAL Point-Goal Navigation
POI-GOAL POI-Goal Navigation
OBJ-GOAL Object-Goal Navigation
INS-FOLLOW Instruction-Following
PERSON-FOLLOW Person-Following

Deployment Visualization

Real-world navigation trajectories and application scenarios across diverse environments.

Short-Horizon Navigation Tasks

Single-skill atomic tasks: Object-Goal, Instruction-Following, Point-Goal, POI-Goal, Person-Following

Object-Goal Visualization
OBJ-GOAL Object-Goal navigation: find target objects in real environments
Instruction-Following Visualization
INS-FOLLOW Instruction-Following: execute natural language navigation commands
Point-Goal Visualization
POINT-GOAL Point-Goal navigation: reach precise metric coordinates in real environments
POI-Goal Visualization
POI-GOAL POI-Goal navigation: navigate to Points of Interest entrances
Person-Following Visualization
PERSON-FOLLOW Person-Following: real-time tracking of dynamic human targets

Long-Horizon Agentic Navigation

Multi-stage missions requiring planning, re-planning, and cross-environment transitions

Indoor Long-Horizon Navigation
INDOOR Indoor agentic mission with multi-skill orchestration
Outdoor Long-Horizon Navigation 1
OUTDOOR Outdoor long-horizon navigation to the closest park
Outdoor Long-Horizon Navigation 2
CROSS-ENV Cross-environment: outdoor-to-indoor transition mission

Real-World Applications

Smart Follow & Load Carry, Guiding Assistance, AI Companion — powered by ABot-N0

Real-World Applications
APPLICATIONS Three application scenarios: Smart Follow & Load Carry, Guiding Assistance, and AI Companion with VQA capabilities

BibTeX

@misc{chu2026abotn0technicalreportvla,
    title={ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation}, 
    author={Zedong Chu and Shichao Xie and Xiaolong Wu and Yanfen Shen and Minghua Luo and Zhengbo Wang and Fei Liu and Xiaoxu Leng and Junjun Hu and Mingyang Yin and Jia Lu and Yingnan Guo and Kai Yang and Jiawei Han and Xu Chen and Yanqing Zhu and Yuxiang Zhao and Xin Liu and Yirong Yang and Ye He and Jiahang Wang and Yang Cai and Tianlin Zhang and Li Gao and Liu Liu and Mingchao Sun and Fan Jiang and Chiyu Wang and Zhicheng Liu and Hongyu Pan and Honglin Han and Zhining Gu and Kuan Yang and Jianfang Zhang and Di Jing and Zihao Guan and Wei Guo and Guoqing Liu and Di Yang and Xiangpo Yang and Menglin Yang and Hongguang Xing and Weiguo Li and Mu Xu},
    year={2026},
    eprint={2602.11598},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2602.11598}, 
}