Technical Report

ABot-N0

A Unified VLA Foundation Model for Versatile Embodied Navigation — Achieving Grand Unification Across 5 Core Tasks with a Hierarchical Brain-Action Architecture

5
Unified Tasks
7
SOTA Benchmarks
16.9M
Expert Trajectories
7,802
3D Scenes
5.0M
Reasoning Samples
10.7
km² Coverage
ABot-N0 Overview

Grand Unification of Embodied Navigation

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation.


To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.

Point-Goal

Reach precise metric coordinates, serving as the foundational primitive for robust locomotion and obstacle avoidance.

Object-Goal

Actively search for and navigate to specific object categories in unseen environments with semantic reasoning.

Instruction-Following

Execute long-horizon paths specified by complex natural-language instructions, with rigorous alignment between language and action.

POI-Goal

Identify Points of Interest and navigate to their physical entrances, bridging outdoor-indoor environments.

Person-Following

Real-time tracking of dynamic human targets — a critical social capability for human-robot interaction.

Hierarchical Brain-Action Design

A unified VLA architecture that combines high-level cognitive reasoning with low-level motion planning, seamlessly generalizing across five core navigation tasks.

ABot-N0 Architecture
The Architecture of ABot-N0. The model adopts a hierarchical "Brain-Action" design. The Universal Multi-Modal Encoder unifies heterogeneous inputs (RGB observations, visual history, and goal specifications) into a shared token sequence. The Cognitive Brain (LLM) supports dual-mode operation: a Reasoning Head for semantic understanding and an Action Head for motion planning. The Action Expert employs Flow Matching to generate trajectory distributions.
🔍 Universal Multi-Modal Encoder

Unifies heterogeneous inputs — panoramic RGB, episodic visual memory, text goals, and geometric coordinates — into a shared latent space through flexible token-based encoding.

🧠 Cognitive Brain

Built on Qwen3-4B LLM backbone. Features task-conditional dual heads: a Reasoning Head for scene analysis and spatial reasoning, and an Action Head for navigation decisions.

Action Expert

Employs Flow Matching to generate precise, multi-modal trajectory distributions — 5 waypoints with position (x,y) and heading (θ) for continuous robot control.
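
To make the Flow Matching step concrete, here is a minimal sampling sketch: starting from Gaussian noise, an Euler integrator follows a learned velocity field conditioned on the Brain's latents to produce a 5×3 waypoint chunk. The `velocity_net` interface, step count, and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch

# Assumed dimensions: 5 waypoints x (x, y, theta); the real network,
# conditioning scheme, and integration step count are not specified here.
NUM_WAYPOINTS, WP_DIM, NUM_STEPS = 5, 3, 10

@torch.no_grad()
def sample_waypoints(velocity_net, brain_latents):
    """Integrate a learned velocity field from noise to a trajectory.

    velocity_net(x_t, t, cond) -> velocity is the core of flow matching:
    starting from Gaussian noise at t = 0, Euler steps along the
    predicted velocity move the sample toward the data distribution at t = 1.
    """
    x = torch.randn(1, NUM_WAYPOINTS, WP_DIM)   # x_0 ~ N(0, I)
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = torch.full((1,), i * dt)
        v = velocity_net(x, t, brain_latents)   # predicted velocity at time t
        x = x + dt * v                          # one Euler integration step
    return x.squeeze(0)                         # (5, 3): x, y, theta per waypoint
```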

The Largest Embodied Navigation Data Engine

A unified synthesis pipeline integrating high-fidelity 3D scenes, expert trajectories, and cognitive reasoning samples at unprecedented scale.

7,802
3D Scenes
10.7 km²
Total Area Coverage
16.9M
Expert Trajectories
5.0M
Reasoning Samples
01

High-Fidelity 3D Scene Ecosystem

7,802 Scenes · 10.7 km² · 384,754 m of Nav Graphs

3D Scene Ecosystem & Data Sources
3D Scene Ecosystem & Statistics. 7,802 high-fidelity 3D scenes covering 6.25 km² indoor (homes, offices, malls, stations) and 4.42 km² outdoor (intersections, parks, SocCity) environments. All scenes are annotated with traversable navigation graphs for collision-free trajectory synthesis.
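
As a rough illustration of how annotated navigation graphs support collision-free trajectory synthesis, the sketch below samples one expert path with networkx shortest paths. The node and edge attributes (`pos`, `length`) are assumptions about the annotation format, and the graph is assumed connected.

```python
import random
import networkx as nx

def synthesize_trajectory(nav_graph: nx.Graph) -> list:
    """Sample one collision-free expert trajectory from a nav graph.

    Nodes are traversable waypoints with 'pos' = (x, y) attributes and
    edges carry metric 'length' weights; both are assumptions here.
    """
    start, goal = random.sample(list(nav_graph.nodes), 2)
    path = nx.shortest_path(nav_graph, start, goal, weight="length")
    # The node sequence doubles as the supervision signal: positions
    # along the path become the expert waypoints for training.
    return [nav_graph.nodes[n]["pos"] for n in path]
```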
02

Universal Trajectories Dataset

16.9M Trajectories · 5 Navigation Tasks

Point-Goal
4.0M Trajectories

2.0M pseudo-trajectories from internet videos, 1.7M synthetic trajectories from 3D scenes, and 340K real-world robot demonstrations.

Object-Goal
3.2M Trajectories

Semantic search and discovery of specific object categories in unseen environments.

Instruction-Following
2.8M Trajectories

VLN-CE R2R/RxR navigation, door traversal, language-guided person search, and short-horizon atomic movement primitives.

POI-Goal
2.5M Trajectories

Outdoor-to-indoor navigation built from street-view OCR, trajectory-instruction alignment, and video generation.

Person-Following
4.0M Trajectories

3 proximity configurations × 3 challenge categories (STT, DT, AT), plus 400K target-absent cases.

03

Cognitive Reasoning Dataset

5.0M Samples · 6 Reasoning Tasks

Reasoning Dataset
Cognitive Reasoning Dataset. 5.0M samples spanning Navigable Areas Analysis (1.2M), Social Navigation CoT (0.8M), Instruction-Following Reasoning (1.3M), Object-Goal Reasoning (0.1M), POI Grounding (0.5M), and General VQA (1.1M) — grounding decision-making in explicit spatial-social logic.

Three-Stage Curriculum Learning

A progressive training pipeline ensuring the model first understands the world before learning to act within it.

Phase 1

Cognitive Warm-up

Before learning "how to move", the agent learns "what to see" and "how to reason." We freeze the Vision Encoder and fine-tune the LLM Brain using Next Token Prediction loss on diverse reasoning tasks. The Action Expert remains frozen, ensuring gradients focus purely on optimizing visual-linguistic representations.
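
A minimal PyTorch-style sketch of this freezing scheme, assuming illustrative module names (`vision_encoder`, `llm_brain`, `action_expert`) and an arbitrary learning rate:

```python
import torch

def configure_phase1(model):
    """Phase 1 (Cognitive Warm-up): train only the LLM Brain.

    Freezing the vision encoder and action expert ensures that the
    next-token-prediction gradients optimize visual-linguistic
    representations alone. Module names here are assumptions.
    """
    for p in model.vision_encoder.parameters():
        p.requires_grad = False       # frozen: "what to see" stays fixed
    for p in model.action_expert.parameters():
        p.requires_grad = False       # frozen: no motion learning yet
    for p in model.llm_brain.parameters():
        p.requires_grad = True        # fine-tuned with the NTP loss
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```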

Phase 2

Unified Sensorimotor SFT

All five navigation tasks are unified into a single multi-task training regime. A mixed-training strategy (20% reasoning replay) prevents catastrophic forgetting. Dual-Head Optimization jointly trains the AR Head and Action Expert.

\mathcal{L}_{\text{Phase2}} = \lambda_{\text{txt}} \cdot \mathcal{L}_{\text{NTP}}(\theta_{\text{brain}}) + \lambda_{\text{flow}} \cdot \mathcal{L}_{\text{CFM}}(\theta_{\text{action}} \mid \theta_{\text{brain}})
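
In code, the dual-head objective could look like the following sketch, where the NTP term is next-token cross-entropy on the AR head and the CFM term regresses the Action Expert's predicted velocity onto the conditional flow-matching target; tensor shapes and loss weights are assumptions.

```python
import torch.nn.functional as F

def phase2_loss(logits, target_tokens, velocity_pred, velocity_target,
                lambda_txt: float = 1.0, lambda_flow: float = 1.0):
    """L_Phase2 = lambda_txt * L_NTP(theta_brain)
                + lambda_flow * L_CFM(theta_action | theta_brain).

    logits: (B, T, V) AR-head outputs; target_tokens: (B, T) labels.
    velocity_target is the conditional flow-matching regression target
    (e.g. x_1 - x_0 for a linear interpolation path).
    """
    l_ntp = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    l_cfm = F.mse_loss(velocity_pred, velocity_target)
    return lambda_txt * l_ntp + lambda_flow * l_cfm
```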
Phase 3

Post-Training Value Alignment via SAFE-GRPO

A flow-based reinforcement learning framework that explicitly enforces social compliance. The Brain is frozen while the Action Expert is fine-tuned to maximize a composite reward balancing social compliance, expert similarity, smoothness, and efficiency.

\mathcal{R} = w_{\text{soc}} \cdot \mathcal{R}_{\text{social}} + w_{\text{exp}} \cdot \mathcal{R}_{\text{expert}} + w_{\text{sm}} \cdot \mathcal{R}_{\text{smooth}} + w_{\text{eff}} \cdot \mathcal{R}_{\text{eff}}
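
The sketch below shows one plausible instantiation of this composite reward; the concrete term definitions and weights used in SAFE-GRPO are not given here, so every formula inside is an assumption.

```python
import numpy as np

def composite_reward(traj, expert, social_violations, path_time,
                     w_soc=1.0, w_exp=1.0, w_sm=0.5, w_eff=0.5):
    """R = w_soc*R_social + w_exp*R_expert + w_sm*R_smooth + w_eff*R_eff.

    traj and expert are (N, 2) arrays of (x, y) waypoints; all four
    term definitions below are illustrative stand-ins.
    """
    r_social = -float(social_violations)                     # penalize social non-compliance
    r_expert = -np.linalg.norm(traj - expert, axis=1).mean() # distance to the expert path
    r_smooth = -np.abs(np.diff(traj, n=2, axis=0)).sum()     # penalize jerky curvature
    r_eff = -path_time                                       # faster completion is better
    return (w_soc * r_social + w_exp * r_expert
            + w_sm * r_smooth + w_eff * r_eff)
```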

State-of-the-Art Across 7 Benchmarks

Comprehensive evaluation demonstrating superior performance across all five navigation paradigms.

Method      Mean  Turn  Crossing  Detour  Proximity  Crowd  Other  All
GNM         16.2  31.1  14.8      12.5    14.7       12.8   11.0   12.1
ViNT        16.5  31.1  15.4      12.9    14.8       13.3   11.6   12.6
NoMaD       19.1  35.1  18.5      15.6    18.1       14.3   12.8   12.1
CityWalker  15.2  26.6  14.1      13.9    14.3       12.0   10.4   11.5
ABot-N0     11.2  21.3   9.8      12.8     8.1        8.8    6.3    7.6
CityWalker Benchmark (Open-Loop): MAOE metric (↓ lower is better). ABot-N0 achieves mean MAOE of 11.2, significantly outperforming the previous SOTA CityWalker (15.2) — a 26.3% improvement.
Method      SR↑   RC↑   SPL↑  DCR↑  TCR↑
GNM*        43.3  62.4  37.0  26.5  28.7
ViNT*       45.6  66.2  39.5  31.4  33.8
NoMaD*      41.1  60.5  35.4  29.5  31.6
CityWalker  47.8  64.7  44.7  36.1  36.6
ABot-N0     88.3  92.1  79.2  85.1  85.4
SocNav Benchmark (Closed-Loop): ABot-N0 achieves 88.3% Success Rate, nearly doubling the baseline (47.8%). Social compliance DCR reaches 85.1% vs. 36.1% for the best baseline.
Method               Val-Seen      Val-Seen-Synonyms  Val-Unseen
                     SR↑   SPL↑    SR↑   SPL↑         SR↑   SPL↑
DAgRL+OD             38.5  21.1    39.0  21.4         37.1  19.8
Uni-NaVid            41.3  21.1    43.9  21.8         39.5  19.8
MTU3D                55.0  23.6    45.0  14.7         40.8  12.1
NavFoM (Four views)  40.1  27.1    45.4  32.6         45.2  31.9
ABot-N0              55.3  32.1    55.4  33.2         54.0  30.5
HM3D-OVON Benchmark: ABot-N0 surpasses MTU3D by 13.2% in SR on the challenging Val-Unseen split. While MTU3D suffers a 14.2% drop from Val-Seen to Val-Unseen, ABot-N0 exhibits only 1.3% decline — demonstrating exceptional open-vocabulary generalization.
Method                NE↓   OS↑   SR↑   SPL↑
NaVILA                5.22  62.5  54.0  49.0
StreamVLN             4.98  64.2  56.9  51.9
InternVLA-N1 (S1+S2)  4.83  63.3  58.2  54.0
NavFoM (Four views)   4.61  72.1  61.7  55.3
ABot-N0               3.78  70.8  66.4  63.9
VLN-CE R2R Val-Unseen: ABot-N0 achieves 66.4% SR and 63.9% SPL, surpassing NavFoM by 4.7% SR and 8.6% SPL — using only panoramic RGB without depth or odometry.
Method                NE↓   SR↑   SPL↑
NaVILA                6.77  49.3  44.0
StreamVLN             6.22  52.9  46.0
InternVLA-N1 (S1+S2)  5.91  53.5  46.1
NavFoM (Four views)   4.74  64.4  56.2
ABot-N0               3.83  69.3  60.0
VLN-CE RxR Val-Unseen: ABot-N0 achieves 69.3% SR and 60.0% SPL, outperforming NavFoM by 4.9% SR and 3.8% SPL — demonstrating strong cross-lingual generalization.
Method      SR (0.1m)↑  SR (0.2m)↑  SR (0.3m)↑  TR (mean)↓  TR (best)↓  TR (worst)↓
NoMaD        4.13       15.07       29.20       31.35       5.45        85.91
CityWalker  13.79       41.02       65.96       15.58       0.76        56.47
OmniNav     18.78       46.99       72.39       14.16       0.99        53.79
ABot-N0     32.14       71.50       88.68        9.84       0.44        51.38
BridgeNav Dataset: ABot-N0 achieves 70.1% improvement at the strictest 0.1m threshold and reduces average trajectory deviation by 30.5%.
Method       Single-Target (STT)   Distracted (DT)       Ambiguity (AT)
             SR↑   TR↑   CR↓       SR↑   TR↑   CR↓       SR↑   TR↑   CR↓
TrackVLA     85.1  78.6  1.65      57.6  63.2  5.80      50.2  63.7  17.1
NavFoM       85.0  80.5  -         61.4  68.2  -         -     -     -
TrackVLA++   86.0  81.0  2.10      66.5  68.8  4.71      51.2  63.4  15.9
ABot-N0      86.9  87.6  8.54      66.7  75.4  11.6      67.3  79.5  7.05
EVT-Bench: ABot-N0 achieves 16.1% improvement in both SR and TR on the challenging Ambiguity Tracking task, even surpassing multi-view methods with single-view input.

Agentic Navigation System

A deployable framework that augments ABot-N0 with planning, topological memory, and self-reflection for robust long-horizon real-world missions.

Agentic Navigation System
Agentic Navigation System Overview. The system integrates ABot-N0 with an agentic framework comprising Agentic Planner, Actor, short-term Episodic Memory, and long-term Topo-Memory to handle complex real-world navigation tasks.
Task Execution Pipeline
Task Execution Pipeline. (1) Global Navigation employs Approaching (Point-Goal) to traverse known spaces using topological memory, (2) Local Navigation utilizes Reaching (Object-Goal/POI-Goal) for precise target discovery and Interaction (Instruction-Following, Person-Following) for dynamic engagement, and (3) Neural Controller executes low-level velocity-based motion control.

🗺️ Map-as-Memory (Topo-Memory)

A hierarchical topological memory with 4 layers (Block, Road, Function, Object/POI) that enables cross-scale spatial knowledge deposition and dynamic updating — from residential interiors to urban environments.
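
A minimal sketch of such a 4-layer memory as a directed graph; the layer names follow the text, while the node attributes and the "contains" relation are illustrative assumptions.

```python
import networkx as nx

# Layer names from the text; everything else is an assumption.
LAYERS = ("block", "road", "function", "object_poi")

class TopoMemory:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_node(self, node_id, layer, pos, parent=None):
        """Deposit spatial knowledge at one layer; a 'parent' edge
        links it to the coarser layer above (e.g. an object to its room)."""
        assert layer in LAYERS
        self.graph.add_node(node_id, layer=layer, pos=pos)
        if parent is not None:
            self.graph.add_edge(parent, node_id, relation="contains")

    def update(self, node_id, **attrs):
        """Dynamic updating: refresh attributes as the scene changes."""
        self.graph.nodes[node_id].update(attrs)
```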

🧭 Agentic Planner

Leverages VLM reasoning to decompose ambiguous user instructions into executable sub-tasks via Chain-of-Thought. Implements a "Coarse-to-Fine" strategy: Point-Goal for long-horizon approach, then Object/POI-Goal for precise local reaching.

🔄 Closed-loop Self-Reflection

A VLM-based Self-Reflector assesses sub-task completion. On failure, it diagnoses the cause and triggers re-planning — emulating human-like self-correction for robust long-horizon autonomy.
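
Putting the Planner and Self-Reflector together, the agentic loop could be sketched as follows; the `decompose`/`assess`/`replan` interfaces stand in for VLM calls and are assumptions, not the system's actual API.

```python
def run_mission(planner_vlm, reflector_vlm, actor, instruction, max_replans=3):
    """Coarse-to-fine agentic loop: plan, act, reflect, re-plan on failure."""
    sub_tasks = planner_vlm.decompose(instruction)    # CoT decomposition
    for task in sub_tasks:
        for attempt in range(max_replans):
            observation = actor.execute(task)         # e.g. Point-Goal, then Object-Goal
            verdict = reflector_vlm.assess(task, observation)
            if verdict.success:
                break                                 # sub-task done, move on
            # Failure: diagnose the cause and re-plan this sub-task.
            task = planner_vlm.replan(task, verdict.diagnosis)
        else:
            return False                              # re-planning budget exhausted
    return True
```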

🎮 Neural Controller

A high-speed reactive layer operating at 10Hz+ on edge devices. Translates abstract VLA waypoints into precise velocity commands (vx, vy, vyaw) using LiDAR-based occupancy mapping.
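
A toy proportional sketch of the waypoint-to-velocity mapping; the deployed controller additionally consults the LiDAR occupancy map for obstacle avoidance, which is omitted here, and the gains and limits are assumptions.

```python
import math

def waypoint_to_velocity(wp_x, wp_y, wp_theta, k_lin=1.0, k_ang=1.5,
                         v_max=1.0, w_max=1.5):
    """Map one VLA waypoint (robot frame) to (vx, vy, vyaw) commands."""
    vx = max(-v_max, min(v_max, k_lin * wp_x))     # drive toward the waypoint
    vy = max(-v_max, min(v_max, k_lin * wp_y))     # lateral correction
    # Wrap the heading error to [-pi, pi] before applying the angular gain.
    heading_err = math.atan2(math.sin(wp_theta), math.cos(wp_theta))
    vyaw = max(-w_max, min(w_max, k_ang * heading_err))
    return vx, vy, vyaw
```

Called in a 10Hz loop, this re-queries the most recent VLA waypoint each tick while the 2Hz model inference proceeds asynchronously.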

From Model to Robot

Successfully deployed on Unitree Go2 quadrupedal robot with 2Hz VLA inference and 10Hz closed-loop control.

Unitree Go2

Quadrupedal robot with 12 actuated DOF, dynamically stable locomotion across diverse terrains.

270° Vision Coverage

Three monocular RGB cameras (H120°×V90° each) providing comprehensive panoramic perception.

4D LiDAR + RTK-GNSS

Unitree 4D LiDAR L2 for occupancy mapping and RTK-GNSS for global localization.

NVIDIA Jetson Orin NX

157 TOPS, 16GB RAM — all VLA inference runs onboard at 2Hz with only 3% performance reduction.

Hybrid Cloud-Edge Architecture

Planner runs in the cloud (RTX 4090); the VLA model and Controller run on the edge, ensuring autonomous operation even without a network connection.

Hardware Platform

Real-World Demo Videos

Watch ABot-N0 navigate complex real-world environments — from single-task atomic skills to long-horizon agentic missions.

Long-Horizon Agentic Missions

Multi-stage missions with planning, re-planning, and cross-environment transitions

OUTDOOR Outdoor Long-Horizon Mission
INDOOR Indoor Long-Horizon Mission

Real-World Applications

Practical application scenarios: AI Companion, Guide Dog Assistance — powered by ABot-N0

COMPANION Interactive Companion
GUIDE DOG Guide Dog Assistance

Single-Task Capabilities

Atomic navigation skills: Point-Goal, POI-Goal, Object-Goal, Instruction-Following, Person-Following

POINT-GOAL Point-Goal Navigation
POI-GOAL POI-Goal Navigation
OBJ-GOAL Object-Goal Navigation
INS-FOLLOW Instruction-Following
PERSON-FOLLOW Person-Following

Deployment Visualization

Real-world navigation trajectories and application scenarios across diverse environments.

Short-Horizon Navigation Tasks

Single-skill atomic tasks: Object-Goal, Instruction-Following, Point-Goal, POI-Goal, Person-Following

Object-Goal Visualization
OBJ-GOAL Object-Goal navigation: find target objects in real environments
Instruction-Following Visualization
INS-FOLLOW Instruction-Following: execute natural language navigation commands
Point-Goal Visualization
POINT-GOAL Point-Goal navigation: reach precise metric coordinates in real environments
POI-Goal Visualization
POI-GOAL POI-Goal navigation: navigate to Points of Interest entrances
Person-Following Visualization
PERSON-FOLLOW Person-Following: real-time tracking of dynamic human targets

Long-Horizon Agentic Navigation

Multi-stage missions requiring planning, re-planning, and cross-environment transitions

Indoor Long-Horizon Navigation
INDOOR Indoor agentic mission with multi-skill orchestration
Outdoor Long-Horizon Navigation 1
OUTDOOR Outdoor long-horizon navigation to the closest park
Outdoor Long-Horizon Navigation 2
CROSS-ENV Cross-environment: outdoor-to-indoor transition mission

Real-World Applications

Smart Follow & Load Carry, Guiding Assistance, AI Companion — powered by ABot-N0

Real-World Applications
APPLICATIONS Three application scenarios: Smart Follow & Load Carry, Guiding Assistance, and AI Companion with VQA capabilities

BibTeX

@misc{chu2026abotn0technicalreportvla,
    title={ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation}, 
    author={Zedong Chu and Shichao Xie and Xiaolong Wu and Yanfen Shen and Minghua Luo and Zhengbo Wang and Fei Liu and Xiaoxu Leng and Junjun Hu and Mingyang Yin and Jia Lu and Yingnan Guo and Kai Yang and Jiawei Han and Xu Chen and Yanqing Zhu and Yuxiang Zhao and Xin Liu and Yirong Yang and Ye He and Jiahang Wang and Yang Cai and Tianlin Zhang and Li Gao and Liu Liu and Mingchao Sun and Fan Jiang and Chiyu Wang and Zhicheng Liu and Hongyu Pan and Honglin Han and Zhining Gu and Kuan Yang and Jianfang Zhang and Di Jing and Zihao Guan and Wei Guo and Guoqing Liu and Di Yang and Xiangpo Yang and Menglin Yang and Hongguang Xing and Weiguo Li and Mu Xu},
    year={2026},
    eprint={2602.11598},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2602.11598}, 
}