Alibaba AMAP CV Lab

🗺️

Map & Autonomous Driving

The core of our research lies in integrating perception, mapping, and decision-making for intelligent transportation. We develop next-generation 3D map engines, traffic rule reasoning, and scene-level behavior modeling, enabling AI to understand spatial context and make interpretable decisions in real-world urban environments.

🛣 Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps

The first benchmark for Online Navigation Refinement, which proposes a path-aware transformer to associate standard maps with online perception and unifies global topology with real-time geometry for low-cost lane-level navigation.

Project ICLR 2026 arXiv Code

🚘 FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

The first VLA for autonomous driving visual reasoning, which proposes spatio-temporal CoT to think visually about trajectory planning and unifies visual generation and understanding with minimal data.

Project NeurIPS 2025 (Spotlight) arXiv Code

🗺 UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data

A generative unified framework that autoregressively generates smooth and topologically consistent vectorized maps from multi-modal inputs, enabling scalable, occlusion-robust city-scale mapping without costly on-site data collection.

Project AAAI 2026 (Oral) arXiv Code

🛣️ PriorDrive: Enhancing Online HD Mapping with Unified Vector Priors

This is the first framework that unifies the encoding and integration of diverse vectorized prior maps (such as SD maps, outdated HD maps, and historical maps) to enhance online HD map construction.

Project AAAI 2026 arXiv Code

🚥 Persistent Autoregressive Mapping with Traffic Rules for Autonomous Driving

Pioneering a generative co-reasoning paradigm in autonomous mapping, this work (PAMR) unifies the autoregressive construction of lane geometry and persistent traffic rules, enabling vehicles to build maps with long-term memory and consistent rule awareness across extended sequences.

Project arXiv Code

📑 SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

A generative framework that reframes lane network learning as a process of incrementally building an adjacency matrix.

ICCV 2025 arXiv

🚗 Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map

Benchmark and multi-modal approach for integrating lane-level traffic sign regulations into vectorized HD maps.

Project CVPR 2025 (Highlight) arXiv

🕺🏻

Human-Centric AI

Centered on generative AI, our digital human research advances from driven generation to autonomous action. Through the Fantasy AIGC Family, we achieve expressive, identity-consistent, and physically realistic video generation via multimodal diffusion and 3D-aware modeling.

🗣️ FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes through dual-stage audio-visual alignment and controllable motion modulation.

Project ACM MM 2025 arXiv Code HuggingFace ModelScope

🎙️ FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

A novel Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) method that enhances the quality of audio-driven avatars along three dimensions: lip-sync, motion naturalness, and visual quality.

Project AAAI 2026 arXiv Coming Soon

🗿 FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework

A graph-based multi-agent framework that grounds video generation within 3D world dynamics, enabling digital humans to perceive, plan, and act autonomously, thus serving as the technical bridge that links human modeling to world modeling through unified perception–action reasoning.

Project AAAI 2026 arXiv Coming Soon

🤡 FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.

Project arXiv Code

🆔 FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.

Project arXiv Code HuggingFace ModelScope

💃🏻 HumanRig: Learning Automatic Rigging for Humanoid Characters in Animation

The first dataset for automatic rigging of 3D generated digital humans and a transformer-based end-to-end automatic rigging algorithm.

Project CVPR 2025 (Highlight) arXiv Code HuggingFace

🧭

Embodied AI

We study perception, reasoning, and action of intelligent agents in both virtual and physical environments. By integrating vision-language models and reinforcement learning, we build embodied agents capable of environmental perception, goal planning, and task execution, forming a unified cognitive foundation for robots and digital humans.

🧠 JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

The first vision-language navigation agent with dual implicit memory, which decouples visual semantics from spatial perception and models each as a compact implicit neural representation.

Project ICLR 2026 arXiv Code ModelScope

CE-Nav: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation

A novel cross-embodiment local navigation framework that serves as a plug-and-play, "one brain, multiple forms" fast system.

Project ICLR 2026 arXiv Code

OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation

OmniNav is a unified embodied navigation framework that combines a lightweight, real-time (up to 5 Hz) continuous waypoint policy with a fast–slow planning architecture and large-scale vision-language multi-task training. It handles instruction-, object-, and point-goal navigation as well as frontier exploration, achieving state-of-the-art performance with real-world validation.

ICLR 2026 arXiv Code

🕵🏻‍♂️ FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation

A unified multimodal Chain-of-Thought (CoT) reasoning framework that internalizes the inference capabilities of world models into the VLN architecture, enabling efficient and precise navigation based on natural language instructions and visual observations.

Project arXiv Code HuggingFace ModelScope

Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

A robust vision-language-action framework with structural perception and explicit dynamics reasoning.

arXiv

🌐

World Modeling

We aim to construct dynamic, interactive world models for understanding, predicting, and generating physically consistent spatiotemporal phenomena. By leveraging multimodal modeling and generative learning, our research enables a perception-to-simulation loop that empowers AI to comprehend and recreate the real world.

🌏 FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

A unified world model integrating video priors and geometric grounding for synthesizing explorable and geometrically consistent 3D scenes.

Project ICLR 2026 arXiv Coming Soon

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

A novel framework that leverages a world model as a virtual environment for VLA post-training.

arXiv

🧊

3D Generation & Reconstruction

Our research in 3D generation and reconstruction covers Gaussian Splatting, NeRF, and 3D-aware diffusion, aiming for real-time rendering, continuous level-of-detail control, and semantically consistent 3D scene synthesis.

🛰 Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

A feed-forward generative framework that synthesizes street-view-level 3D content from a single satellite image using a geometry-first strategy, without requiring 3D annotations.

ICLR 2026

💠 CLoD-GS: Continuous Level-of-Detail Gaussian Splatting for Real-Time Rendering

CLoD-GS equips 3D Gaussian Splatting with learnable distance-adaptive opacity, enabling smooth, storage-efficient, artifact-free continuous level-of-detail rendering from a single model.

ICLR 2026 arXiv

🧸 G3PT: Unleash the Power of Autoregressive Modeling in 3D Generative Tasks

The first native 3D generation foundation model based on next-scale autoregression.

IJCAI 2025 arXiv

🏙 Global-Guided Focal Neural Radiance Field for Large-Scale Scene Representation

GF-NeRF introduces a global-guided two-stage architecture to achieve consistent and high-fidelity large-scale scene rendering without relying on prior scene knowledge.

Project WACV 2025 arXiv

🎨 MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control

Geometrically controlled multi-view diffusion model for generating high-fidelity, detail-rich, and geometrically consistent 3D textures and PBR materials from a single reference image.

Project arXiv Code

🧠

General Deep Learning

We focus on general representation learning and model optimization as the foundation for multimodal and cross-domain AI systems. Our research includes Transformer architecture optimization, distributed training, model compression, and preference alignment (DPO, RLHF) to enhance generalization and interpretability.

🎙️ A Study on the Adverse Impact of Synthetic Speech on Speech Recognition

Performance analysis and novel solution exploration for speech recognition under synthetic speech interference.

ICASSP 2024

Doubly-Fused ViT: Fuse Information from Dual Vision Transformer Streams

DFvT introduces a doubly-fused Vision Transformer that combines efficient global context modeling with fine-grained spatial detail preservation to achieve high accuracy and efficiency.

ECCV 2022 Code

SCMT: Self-Correction Mean Teacher for Semi-supervised Object Detection

A self-correction mean teacher architecture that mitigates the impact of noisy pseudo-labels in semi-supervised object detection.

IJCAI 2022

DPOSE: Online Keypoint-CAM Guided Inference for Driver Pose Estimation

An optimization scheme for a proprietary human pose estimation (HPE) task in driver monitoring system (DMS) scenarios, combining a pose-wise hard mining strategy for distribution balance with an online keypoint-aligned Grad-CAM loss that constrains activations to semantic regions.

CVPR Workshop 2023
