Technical Report

ABot-N0

A Unified VLA Foundation Model for Versatile Embodied Navigation — Achieving Grand Unification Across 5 Core Tasks with a Hierarchical Brain-Action Architecture

5
Unified Tasks
7
SOTA Benchmarks
16.9M
Expert Trajectories
7,802
3D Scenes
5.0M
Reasoning Samples
10.7
km² Coverage
ABot-N0 Overview

Grand Unification of Embodied Navigation

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation.


To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.

Point-Goal

Reach precise metric coordinates, serving as the foundational primitive for robust locomotion and obstacle avoidance.

Object-Goal

Actively search for and navigate to specific object categories in unseen environments with semantic reasoning.

Instruction-Following

Execute long-horizon, complex natural language paths with rigorous linguistic-action alignment.

POI-Goal

Identify Points of Interest and navigate to their physical entrances, bridging outdoor-indoor environments.

Person-Following

Real-time tracking of dynamic human targets — a critical social capability for human-robot interaction.

Hierarchical Brain-Action Design

A unified VLA architecture that combines high-level cognitive reasoning with low-level motion planning, seamlessly generalizing across five core navigation tasks.

ABot-N0 Architecture
The Architecture of ABot-N0. The model adopts a hierarchical "Brain-Action" design. The Universal Multi-Modal Encoder unifies heterogeneous inputs (RGB observations, visual history, and goal specifications) into a shared token sequence. The Cognitive Brain (LLM) supports dual-mode operation: a Reasoning Head for semantic understanding and an Action Head for motion planning. The Action Expert employs Flow Matching to generate trajectory distributions.
🔍 Universal Multi-Modal Encoder

Unifies heterogeneous inputs — panoramic RGB, episodic visual memory, text goals, and geometric coordinates — into a shared latent space through flexible token-based encoding.

🧠 Cognitive Brain

Built on a Qwen3-4B LLM backbone. Features task-conditional dual heads: a Reasoning Head for scene analysis and spatial reasoning, and an Action Head for navigation decisions.

Action Expert

Employs Flow Matching to generate precise, multi-modal trajectory distributions — 5 waypoints with position (x,y) and heading (θ) for continuous robot control.
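The page does not spell out the sampling procedure, but a flow-matching policy of this kind typically generates an action chunk by integrating a learned velocity field from Gaussian noise to a trajectory. Below is a minimal NumPy sketch under that assumption; `velocity_field`, `context`, and the step count are illustrative placeholders, not the released interface.

```python
import numpy as np

NUM_WAYPOINTS, ACTION_DIM = 5, 3  # five waypoints of (x, y, theta)

def velocity_field(a, t, context):
    """Placeholder for the Action Expert's learned field v(a, t | context).

    In the real model this would be conditioned on the Cognitive Brain's
    hidden states; here it simply drifts samples toward the context."""
    return context - a

def sample_trajectory(context, num_steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t=0) to actions (t=1)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((NUM_WAYPOINTS, ACTION_DIM))  # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity_field(a, k * dt, context)
    return a  # (5, 3) waypoint chunk handed to the controller

print(sample_trajectory(np.zeros((NUM_WAYPOINTS, ACTION_DIM))).shape)  # (5, 3)
```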

The Largest Embodied Navigation Data Engine

A unified synthesis pipeline integrating high-fidelity 3D scenes, expert trajectories, and cognitive reasoning samples at unprecedented scale.

7,802
3D Scenes
10.7 km²
Total Area Coverage
16.9M
Expert Trajectories
5.0M
Reasoning Samples
01

High-Fidelity 3D Scene Ecosystem

7,802 Scenes · 10.7 km² · 384,754 m of Nav Graphs

3D Scene Ecosystem & Data Sources
3D Scene Ecosystem & Statistics. 7,802 high-fidelity 3D scenes covering 6.25 km² indoor (homes, offices, malls, stations) and 4.42 km² outdoor (intersections, parks, SocCity) environments. All scenes are annotated with traversable navigation graphs for collision-free trajectory synthesis.
02

Universal Trajectories Dataset

16.9M Trajectories · 5 Navigation Tasks

Point-Goal Data Pipeline
4.0M Trajectories

Point-Goal

2.0M pseudo-trajectories from internet videos, 1.7M synthetic trajectories from 3D scenes, and 340K real-world robot demonstrations.

Object-Goal
3.2M Trajectories

Object-Goal

Semantic search and discovery of specific object categories in unseen environments.

Instruction-Following
2.8M Trajectories

Instruction-Following

VLN-CE R2R/RxR navigation, door-traversal, language-guided person search, and short-horizon atomic movement primitives.

POI-Goal
2.5M Trajectories

POI-Goal

Outdoor-to-indoor navigation via streetview OCR, trajectory-instruction alignment, and video generation.

Person-Following
4.0M Trajectories

Person-Following

3 proximity configs × 3 challenge categories (STT, DT, AT), plus 400K target-absent cases.

03

Cognitive Reasoning Dataset

5.0M Samples · 6 Reasoning Tasks

Reasoning Dataset
Cognitive Reasoning Dataset. 5.0M samples spanning Navigable Areas Analysis (1.2M), Social Navigation CoT (0.8M), Instruction-Following Reasoning (1.3M), Object-Goal Reasoning (0.1M), POI Grounding (0.5M), and General VQA (1.1M) — grounding decision-making in explicit spatial-social logic.

Three-Stage Curriculum Learning

A progressive training pipeline ensuring the model first understands the world before learning to act within it.

Phase 1

Cognitive Warm-up

Before learning "how to move", the agent learns "what to see" and "how to reason." We freeze the Vision Encoder and fine-tune the LLM Brain using Next Token Prediction loss on diverse reasoning tasks. The Action Expert remains frozen, ensuring gradients focus purely on optimizing visual-linguistic representations.
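A minimal PyTorch-style sketch of the freezing pattern this phase describes; the submodule names (`vision_encoder`, `brain`, `action_expert`) and the toy layers are illustrative stand-ins, not the actual implementation.

```python
import torch.nn as nn

class ABotSketch(nn.Module):
    """Toy stand-in with the three modules named in the text."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.brain = nn.Linear(8, 8)          # LLM Cognitive Brain
        self.action_expert = nn.Linear(8, 8)  # Flow Matching Action Expert

def configure_phase1(model: nn.Module):
    """Phase 1: freeze everything, then unfreeze only the Brain so the
    Next-Token-Prediction loss shapes visual-linguistic reasoning alone."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.brain.parameters():
        p.requires_grad_(True)
    return [p for p in model.parameters() if p.requires_grad]

trainable = configure_phase1(ABotSketch())
print(len(trainable))  # 2: only the Brain's weight and bias receive gradients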

Phase 2

Unified Sensorimotor SFT

All five navigation tasks are unified into a single multi-task training regime. A mixed-training strategy (20% reasoning replay) prevents catastrophic forgetting. Dual-Head Optimization jointly trains the AR Head and Action Expert.

$\mathcal{L}_{\mathrm{Phase2}} = \lambda_{\mathrm{txt}} \cdot \mathcal{L}_{\mathrm{NTP}}(\theta_{\mathrm{brain}}) + \lambda_{\mathrm{flow}} \cdot \mathcal{L}_{\mathrm{CFM}}(\theta_{\mathrm{action}} \mid \theta_{\mathrm{brain}})$
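The per-term definitions are not spelled out on this page; the sketch below assembles the weighted objective above under common assumptions: the NTP term is token-level cross-entropy from the AR head, the conditional flow-matching target is the straight-line velocity between noise a0 and the expert action chunk a1, and the λ weights are tunable hyperparameters.

```python
import torch
import torch.nn.functional as F

def phase2_loss(token_logits, token_targets, pred_velocity, a0, a1,
                lambda_txt=1.0, lambda_flow=1.0):
    """Weighted Phase 2 objective (illustrative assembly of the formula above)."""
    loss_ntp = F.cross_entropy(token_logits, token_targets)  # Reasoning / AR head
    loss_cfm = F.mse_loss(pred_velocity, a1 - a0)            # Action Expert target
    return lambda_txt * loss_ntp + lambda_flow * loss_cfm

# Toy shapes: 4 tokens over a 10-word vocab, 5 waypoints of (x, y, theta).
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
a0, a1, pred_v = torch.randn(5, 3), torch.randn(5, 3), torch.randn(5, 3)
print(phase2_loss(logits, targets, pred_v, a0, a1).item())
```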
Phase 3

Post-Training Value Alignment via SAFE-GRPO

A flow-based reinforcement learning framework that explicitly enforces social compliance. The Brain is frozen while the Action Expert is fine-tuned to maximize a composite reward balancing social compliance, expert similarity, smoothness, and efficiency.

$\mathcal{R} = w_{\mathrm{soc}} \cdot \mathcal{R}_{\mathrm{social}} + w_{\mathrm{exp}} \cdot \mathcal{R}_{\mathrm{expert}} + w_{\mathrm{sm}} \cdot \mathcal{R}_{\mathrm{smooth}} + w_{\mathrm{eff}} \cdot \mathcal{R}_{\mathrm{eff}}$
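The four reward terms are named but not defined here; the sketch below uses illustrative proxies (clearance to pedestrians, distance to the expert trajectory, second-difference smoothness, and path length) rather than the paper's exact formulations, and the weights are placeholders.

```python
import numpy as np

def composite_reward(traj, expert_traj, pedestrians,
                     w_soc=1.0, w_exp=1.0, w_sm=0.1, w_eff=0.1,
                     safe_radius=0.5):
    """R = w_soc*R_social + w_exp*R_expert + w_sm*R_smooth + w_eff*R_eff.

    traj / expert_traj are (T, 2) waypoint arrays, pedestrians is (P, 2);
    each term below is a stand-in proxy, not the paper's exact definition."""
    # Social compliance: penalize waypoints that intrude on personal space.
    dists = np.linalg.norm(traj[:, None, :] - pedestrians[None, :, :], axis=-1)
    r_social = -np.mean(np.clip(safe_radius - dists.min(axis=1), 0.0, None))
    # Expert similarity: stay close to the expert trajectory.
    r_expert = -np.mean(np.linalg.norm(traj - expert_traj, axis=-1))
    # Smoothness: penalize sharp accelerations (second differences).
    r_smooth = -np.mean(np.sum(np.diff(traj, n=2, axis=0) ** 2, axis=-1))
    # Efficiency: shorter paths score higher.
    r_eff = -np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=-1))
    return w_soc * r_social + w_exp * r_expert + w_sm * r_smooth + w_eff * r_eff

traj = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.1], [1.5, 0.1]])
expert = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.1], [1.5, 0.2]])
people = np.array([[1.0, 0.6]])
print(composite_reward(traj, expert, people))
```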

State-of-the-Art Across 7 Benchmarks

Comprehensive evaluation demonstrating superior performance across all five navigation paradigms.
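For reference when reading the tables: SR is the fraction of successful episodes, and SPL follows the standard path-length-weighted definition used across embodied navigation benchmarks,

```latex
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i,\, \ell_i)}
```

where S_i indicates success on episode i, ℓ_i is the shortest-path distance from start to goal, and p_i is the length of the path the agent actually took.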

Method       Mean   Turn   Crossing   Detour   Proximity   Crowd   Other   All
GNM          16.2   31.1   14.8       12.5     14.7        12.8    11.0    12.1
ViNT         16.5   31.1   15.4       12.9     14.8        13.3    11.6    12.6
NoMaD        19.1   35.1   18.5       15.6     18.1        14.3    12.8    12.1
CityWalker   15.2   26.6   14.1       13.9     14.3        12.0    10.4    11.5
ABot-N0      11.2   21.3   9.8        12.8     8.1         8.8     6.3     7.6
CityWalker Benchmark (Open-Loop): MAOE metric (↓ lower is better). ABot-N0 achieves mean MAOE of 11.2, significantly outperforming the previous SOTA CityWalker (15.2) — a 26.3% improvement.
Method       SR↑    RC↑    SPL↑   DCR↑   TCR↑
GNM*         43.3   62.4   37.0   26.5   28.7
ViNT*        45.6   66.2   39.5   31.4   33.8
NoMaD*       41.1   60.5   35.4   29.5   31.6
CityWalker   47.8   64.7   44.7   36.1   36.6
ABot-N0      88.3   92.1   79.2   85.1   85.4
SocNav Benchmark (Closed-Loop): ABot-N0 achieves 88.3% Success Rate, nearly doubling the baseline (47.8%). Social compliance DCR reaches 85.1% vs. 36.1% for the best baseline.
Method                 Val-Seen        Val-Seen-Synonyms   Val-Unseen
                       SR↑    SPL↑     SR↑    SPL↑         SR↑    SPL↑
DAgRL+OD               38.5   21.1     39.0   21.4         37.1   19.8
Uni-NaVid              41.3   21.1     43.9   21.8         39.5   19.8
MTU3D                  55.0   23.6     45.0   14.7         40.8   12.1
NavFoM (Four views)    40.1   27.1     45.4   32.6         45.2   31.9
ABot-N0                55.3   32.1     55.4   33.2         54.0   30.5
HM3D-OVON Benchmark: ABot-N0 surpasses MTU3D by 13.2% in SR on the challenging Val-Unseen split. While MTU3D suffers a 14.2% drop from Val-Seen to Val-Unseen, ABot-N0 exhibits only 1.3% decline — demonstrating exceptional open-vocabulary generalization.
Method                 NE↓    OS↑    SR↑    SPL↑
NaVILA                 5.22   62.5   54.0   49.0
StreamVLN              4.98   64.2   56.9   51.9
InternVLA-N1 (S1+S2)   4.83   63.3   58.2   54.0
NavFoM (Four views)    4.61   72.1   61.7   55.3
ABot-N0                3.78   70.8   66.4   63.9
VLN-CE R2R Val-Unseen: ABot-N0 achieves 66.4% SR and 63.9% SPL, surpassing NavFoM by 4.7% SR and 8.6% SPL — using only panoramic RGB without depth or odometry.
Method                 NE↓    SR↑    SPL↑
NaVILA                 6.77   49.3   44.0
StreamVLN              6.22   52.9   46.0
InternVLA-N1 (S1+S2)   5.91   53.5   46.1
NavFoM (Four views)    4.74   64.4   56.2
ABot-N0                3.83   69.3   60.0
VLN-CE RxR Val-Unseen: ABot-N0 achieves 69.3% SR and 60.0% SPL, outperforming NavFoM by 4.9% SR and 3.8% SPL — demonstrating strong cross-lingual generalization.
Method       SR (0.1m)↑   SR (0.2m)↑   SR (0.3m)↑   TR (mean)↓   TR (best)↓   TR (worst)↓
NoMaD        4.13         15.07        29.20        31.35        5.45         85.91
CityWalker   13.79        41.02        65.96        15.58        0.76         56.47
OmniNav      18.78        46.99        72.39        14.16        0.99         53.79
ABot-N0      32.14        71.50        88.68        9.84         0.44         51.38
BridgeNav Dataset: ABot-N0 achieves 70.1% improvement at the strictest 0.1m threshold and reduces average trajectory deviation by 30.5%.
Method       Single-Target (STT)      Distracted (DT)          Ambiguity (AT)
             SR↑    TR↑    CR↓        SR↑    TR↑    CR↓        SR↑    TR↑    CR↓
TrackVLA     85.1   78.6   1.65       57.6   63.2   5.80       50.2   63.7   17.1
NavFoM       85.0   80.5   -          61.4   68.2   -          -      -      -
TrackVLA++   86.0   81.0   2.10       66.5   68.8   4.71       51.2   63.4   15.9
ABot-N0      86.9   87.6   8.54       66.7   75.4   11.6       67.3   79.5   7.05
EVT-Bench: ABot-N0 achieves 16.1% improvement in both SR and TR on the challenging Ambiguity Tracking task, even surpassing multi-view methods with single-view input.

Agentic Navigation System

A deployable framework that augments ABot-N0 with planning, topological memory, and self-reflection for robust long-horizon real-world missions.

Agentic Navigation System
Agentic Navigation System Overview. The system integrates ABot-N0 with an agentic framework comprising Agentic Planner, Actor, short-term Episodic Memory, and long-term Topo-Memory to handle complex real-world navigation tasks.
Task Execution Pipeline
Task Execution Pipeline. (1) Global Navigation employs Approaching (Point-Goal) to traverse known spaces using topological memory, (2) Local Navigation utilizes Reaching (Object-Goal/POI-Goal) for precise target discovery and Interaction (Instruction-Following, Person-Following) for dynamic engagement, and (3) Neural Controller executes low-level velocity-based motion control.

🗺️ Map-as-Memory (Topo-Memory)

A hierarchical topological memory with 4 layers (Block, Road, Function, Object/POI) that enables cross-scale spatial knowledge deposition and dynamic updating — from residential interiors to urban environments.
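A minimal data-structure sketch of such a four-layer memory; the node schema, layer names as keys, and the `localize` query are assumptions for illustration, not the released interface.

```python
from dataclasses import dataclass, field

LAYERS = ("block", "road", "function", "object_poi")  # coarse -> fine

@dataclass
class TopoNode:
    layer: str                                     # one of LAYERS
    name: str
    position: tuple                                # (x, y) in a global frame
    parent: "TopoNode | None" = None               # link to the coarser layer
    neighbors: list = field(default_factory=list)  # traversable edges

class TopoMemory:
    """Cross-scale spatial knowledge stored as a layered graph."""
    def __init__(self):
        self.nodes = {layer: [] for layer in LAYERS}

    def add(self, node):
        self.nodes[node.layer].append(node)
        return node

    def localize(self, position, layer="object_poi"):
        """Nearest known node on a layer; dynamic updating would insert a
        new node when nothing lies close enough."""
        return min(self.nodes[layer],
                   key=lambda n: (n.position[0] - position[0]) ** 2
                               + (n.position[1] - position[1]) ** 2,
                   default=None)

mem = TopoMemory()
block = mem.add(TopoNode("block", "campus", (0.0, 0.0)))
mem.add(TopoNode("object_poi", "cafe entrance", (12.0, 4.5), parent=block))
print(mem.localize((11.0, 5.0)).name)  # "cafe entrance"
```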

🧭 Agentic Planner

Leverages VLM reasoning to decompose ambiguous user instructions into executable sub-tasks via Chain-of-Thought. Implements a "Coarse-to-Fine" strategy: Point-Goal for long-horizon approach, then Object/POI-Goal for precise local reaching.

🔄 Closed-loop Self-Reflection

A VLM-based Self-Reflector assesses sub-task completion. On failure, it diagnoses the cause and triggers re-planning — emulating human-like self-correction for robust long-horizon autonomy.
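Taken together with the planner above, the closed loop could look like the following sketch, where `planner`, `actor`, and `reflector` are hypothetical callables standing in for the VLM planner, the ABot-N0 policy, and the VLM Self-Reflector; the sub-task format and re-planning budget are assumptions.

```python
def run_mission(instruction, planner, actor, reflector, max_replans=3):
    """Closed-loop agentic execution sketch.

    Assumed interfaces:
      planner(instruction, memory=None) -> list of (skill, goal) sub-tasks,
        e.g. [("point_goal", (12.0, 4.5)), ("object_goal", "fire extinguisher")]
      actor(skill, goal) -> observation after attempting the sub-task
      reflector(sub_task, observation) -> (success: bool, diagnosis: str)
    """
    plan = planner(instruction)
    for _ in range(max_replans + 1):
        failed = None
        for sub_task in plan:
            obs = actor(*sub_task)                    # execute one atomic skill
            ok, diagnosis = reflector(sub_task, obs)  # self-reflection step
            if not ok:
                failed = (sub_task, diagnosis)
                break
        if failed is None:
            return True                               # mission complete
        plan = planner(instruction, memory=failed)    # re-plan from the diagnosis
    return False
```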

🎮 Neural Controller

A high-speed reactive layer operating at 10Hz+ on edge devices. Translates abstract VLA waypoints into precise velocity commands (vx, vy, vyaw) using LiDAR-based occupancy mapping.
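A proportional-control sketch of the waypoint-to-velocity translation, assuming the waypoint is expressed in the robot's body frame; the gains and limits are illustrative, and the real controller also consults the LiDAR occupancy map before committing to a command.

```python
import numpy as np

def waypoint_to_velocity(waypoint, kp_lin=1.0, kp_ang=1.5,
                         v_max=1.0, yaw_rate_max=1.5):
    """Translate one (x, y, theta) waypoint into a (vx, vy, vyaw) command."""
    x, y, theta = waypoint
    vx = float(np.clip(kp_lin * x, -v_max, v_max))
    vy = float(np.clip(kp_lin * y, -v_max, v_max))
    vyaw = float(np.clip(kp_ang * theta, -yaw_rate_max, yaw_rate_max))
    return vx, vy, vyaw

# Called at 10 Hz on the first waypoint of the chunk emitted by 2 Hz VLA inference.
print(waypoint_to_velocity((0.4, 0.1, 0.2)))
```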

From Model to Robot

Successfully deployed on Unitree Go2 quadrupedal robot with 2Hz VLA inference and 10Hz closed-loop control.

Unitree Go2 X

Quadrupedal robot with 12 actuated DOF, dynamically stable locomotion across diverse terrains.

270° Vision Coverage

Three monocular RGB cameras (H120°×V90° each) providing comprehensive panoramic perception.

4D LiDAR + RTK-GNSS

Unitree 4D LiDAR L2 for occupancy mapping and RTK-GNSS for global localization.

NVIDIA Jetson Orin NX

157 TOPS, 16GB RAM — all VLA inference runs onboard at 2Hz with only 3% performance reduction.

Hybrid Cloud-Edge Architecture

Planner on cloud (RTX 4090), VLA + Controller on edge — ensures autonomous operation even without a network connection.

Hardware Platform

Real-World Demo Videos

Watch ABot-N0 navigate complex real-world environments — from single-task atomic skills to long-horizon agentic missions.

Long-Horizon Agentic Missions

Multi-stage missions with planning, re-planning, and cross-environment transitions

OUTDOOR Outdoor Long-Horizon Mission
INDOOR Indoor Long-Horizon Mission

Real-World Applications

Practical application scenarios: AI Companion, Guide Dog Assistance — powered by ABot-N0

COMPANION Interactive Companion
GUIDE DOG Guide Dog Assistance

Single-Task Capabilities

Atomic navigation skills: Point-Goal, POI-Goal, Object-Goal, Instruction-Following, Person-Following

POINT-GOAL Point-Goal Navigation
POI-GOAL POI-Goal Navigation
OBJ-GOAL Object-Goal Navigation
INS-FOLLOW Instruction-Following
PERSON-FOLLOW Person-Following

Deployment Visualization

Real-world navigation trajectories and application scenarios across diverse environments.

Short-Horizon Navigation Tasks

Single-skill atomic tasks: Object-Goal, Instruction-Following, Point-Goal, POI-Goal, Person-Following

Object-Goal Visualization
OBJ-GOAL Object-Goal navigation: find target objects in real environments
Instruction-Following Visualization
INS-FOLLOW Instruction-Following: execute natural language navigation commands
Point-Goal Visualization
POINT-GOAL Point-Goal navigation: reach precise metric coordinates in real environments
POI-Goal Visualization
POI-GOAL POI-Goal navigation: navigate to Points of Interest entrances
Person-Following Visualization
PERSON-FOLLOW Person-Following: real-time tracking of dynamic human targets

Long-Horizon Agentic Navigation

Multi-stage missions requiring planning, re-planning, and cross-environment transitions

Indoor Long-Horizon Navigation
INDOOR Indoor agentic mission with multi-skill orchestration
Outdoor Long-Horizon Navigation 1
OUTDOOR Outdoor long-horizon navigation to the closest park
Outdoor Long-Horizon Navigation 2
CROSS-ENV Cross-environment: outdoor-to-indoor transition mission

Real-World Applications

Smart Follow & Load Carry, Guiding Assistance, AI Companion — powered by ABot-N0

Real-World Applications
APPLICATIONS Three application scenarios: Smart Follow & Load Carry, Guiding Assistance, and AI Companion with VQA capabilities

BibTeX

ArXiv preprint coming soon. Stay tuned!