Alibaba AMAP CV Lab

The Alibaba AMAP CV Lab focuses on cutting-edge research and innovative applications centered around computer vision technology, dedicated to building the core technological capabilities of the spatiotemporal internet. Positioned at the intersection of the physical and digital worlds, we empower smart mobility, daily life, and virtual spaces through AI-driven understanding and generation.

As the core technical driving force behind AMAP, we conduct research spanning the entire chain from perception to generation and from human-centric intelligence to world modeling, organized into six major research domains.

The AMAP CV Lab stands at the forefront of computer vision research and application, serving as a key technology builder for Alibaba’s spatiotemporal internet.
We believe that AI’s ability to understand the world defines the future of intelligent mobility and everyday life.

Public Technologies

🗺️
Map & Autonomous Driving
The core of our research lies in integrating perception, mapping, and decision-making for intelligent transportation. We develop next-generation 3D map engines, traffic rule reasoning, and scene-level behavior modeling, enabling AI to understand spatial context and make interpretable decisions in real-world urban environments.
🛣 Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps
The first benchmark for Online Navigation Refinement, together with a path-aware transformer that associates standard-definition maps with online perception maps and unifies global topology with real-time geometry for low-cost lane-level navigation.
🚘 FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
The first vision-language-action (VLA) model for visual reasoning in autonomous driving, which proposes a spatio-temporal CoT to think visually about trajectory planning and unifies visual generation and understanding with minimal data.
🗺 UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
A generative unified framework that autoregressively generates smooth and topologically consistent vectorized maps from multi-modal inputs, enabling scalable, occlusion-robust city-scale mapping without costly on-site data collection.
🛣️ PriorDrive: Enhancing Online HD Mapping with Unified Vector Priors
This is the first framework that unifies the encoding and integration of diverse vectorized prior maps (such as SD maps, outdated HD maps, and historical maps) to enhance online HD map construction.
🚥 Persistent Autoregressive Mapping with Traffic Rules for Autonomous Driving
Pioneering a generative co-reasoning paradigm in autonomous mapping, this work (PAMR) unifies the autoregressive construction of lane geometry and persistent traffic rules, enabling vehicles to build maps with long-term memory and consistent rule awareness across extended sequences.
📑 SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
A generative framework that reframes lane network learning as a process of incrementally building an adjacency matrix.
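To make the chain-of-expansions idea concrete, here is a minimal sketch of growing a lane graph by expanding an adjacency matrix one node at a time; the function and the edge-predictor interface are illustrative assumptions, not the SeqGrowGraph implementation.

```python
# Minimal sketch: lane-graph growth as a chain of adjacency-matrix expansions.
# Illustrative only; names and interfaces are assumptions, not SeqGrowGraph's code.
import numpy as np

def grow_graph(nodes, edge_predictor):
    """nodes: iterable of node features (e.g., lane keypoints).
    edge_predictor: callable(new_node, placed_nodes) -> 0/1 flags, one per
    already-placed node (a stand-in for a learned decoder)."""
    adj = np.zeros((0, 0), dtype=np.int8)
    placed = []
    for node in nodes:
        n = adj.shape[0]
        # Expand the adjacency matrix by one row and one column for the new node.
        new_adj = np.zeros((n + 1, n + 1), dtype=np.int8)
        new_adj[:n, :n] = adj
        # Predict connections from the new node to every node placed so far.
        for j, connected in enumerate(edge_predictor(node, placed) if placed else []):
            new_adj[n, j] = new_adj[j, n] = connected
        adj = new_adj
        placed.append(node)
    return adj, placed
```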
🚗 Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Benchmark and multi-modal approach for integrating lane-level traffic sign regulations into vectorized HD maps.
🕺🏻
Human-Centric AI
Centered on generative AI, our digital human research advances from signal-driven generation to autonomous action. Through the Fantasy AIGC Family, we achieve expressive, identity-consistent, and physically realistic video generation via multimodal diffusion and 3D-aware modeling.
🗣️ FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes through dual-stage audio-visual alignment and controllable motion modulation.
🎙️ FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
A novel Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) method that enhances audio-driven avatar quality along three dimensions: lip-sync, motion naturalness, and visual quality.
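To give a flavor of timestep- and layer-adaptive preference optimization, the sketch below attaches a learnable positive weight to each (timestep bucket, layer) pair inside a DPO-style loss over preferred and rejected generations; the module names, loss form, and constants are assumptions for exposition, not the TLPO method itself.

```python
# Hedged sketch of a timestep- and layer-adaptive preference loss (not TLPO's code).
import torch
import torch.nn.functional as F

class TimestepLayerWeights(torch.nn.Module):
    def __init__(self, num_timestep_buckets: int, num_layers: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_timestep_buckets, num_layers))

    def forward(self, t_bucket: torch.Tensor, layer: torch.Tensor) -> torch.Tensor:
        # Softplus keeps each (timestep, layer) weight positive.
        return F.softplus(self.logits[t_bucket, layer])

def preference_loss(err_win, err_lose, weights, beta: float = 0.1) -> torch.Tensor:
    """err_win / err_lose: per-sample denoising errors of the preferred and
    rejected clips; weights: per-sample values from TimestepLayerWeights."""
    margin = beta * weights * (err_lose - err_win)
    # Push the preferred clip to a lower weighted error than the rejected one.
    return -F.logsigmoid(margin).mean()
```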
🗿 FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework
A graph-based multi-agent framework that grounds video generation within 3D world dynamics, enabling digital humans to perceive, plan, and act autonomously, thus serving as the technical bridge that links human modeling to world modeling through unified perception–action reasoning.
🤡 FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.
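For intuition on the masked cross-attention component, here is a minimal sketch in which each image token may attend only to the expression tokens of the character whose face region it belongs to; the shapes and wiring are illustrative assumptions, not FantasyPortrait's implementation.

```python
# Hedged sketch of masked cross-attention for multi-character expression control.
import torch

def masked_cross_attention(img_tokens, expr_tokens, face_mask):
    """img_tokens: (B, N, D) image tokens; expr_tokens: (B, M, D) per-character
    expression tokens; face_mask: (B, N, M) bool, True where image token n lies
    inside character m's face region."""
    scale = img_tokens.shape[-1] ** -0.5
    scores = torch.einsum("bnd,bmd->bnm", img_tokens, expr_tokens) * scale
    # Block attention across characters: each token only sees "its" character.
    scores = scores.masked_fill(~face_mask, float("-inf"))
    weights = torch.nan_to_num(scores.softmax(dim=-1), nan=0.0)  # background attends to nothing
    return torch.einsum("bnm,bmd->bnd", weights, expr_tokens)
```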
🆔 FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation
A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.
💃🏻 HumanRig: Learning Automatic Rigging for Humanoid Characters in Animation
The first dataset for automatic rigging of 3D-generated digital humans, together with a transformer-based end-to-end automatic rigging algorithm.
🧭
Embodied AI
We study perception, reasoning, and action of intelligent agents in both virtual and physical environments. By integrating vision-language models and reinforcement learning, we build embodied agents capable of environmental perception, goal planning, and task execution, forming a unified cognitive foundation for robots and digital humans.
🧠 JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
The first vision-language navigation agent with dual implicit memory, which decouples visual semantics from spatial perception and models each as a compact implicit neural representation.
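A minimal sketch of the dual-memory idea: two separate recurrent states, one for visual semantics and one for spatial layout, feed a small policy head; the modules, dimensions, and action set are illustrative assumptions, not the JanusVLN architecture.

```python
# Hedged sketch of a dual implicit memory for navigation (not JanusVLN's code).
import torch
import torch.nn as nn

class DualImplicitMemory(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Two separate recurrent states: one for semantics, one for spatial layout.
        self.semantic_rnn = nn.GRUCell(dim, dim)
        self.spatial_rnn = nn.GRUCell(dim, dim)
        self.policy = nn.Linear(2 * dim, 4)  # e.g., forward / turn-left / turn-right / stop

    def forward(self, semantic_feat, spatial_feat, sem_state, spa_state):
        # Each stream updates its own compact implicit state.
        sem_state = self.semantic_rnn(semantic_feat, sem_state)
        spa_state = self.spatial_rnn(spatial_feat, spa_state)
        # The policy reads both memories to choose the next action.
        logits = self.policy(torch.cat([sem_state, spa_state], dim=-1))
        return logits, sem_state, spa_state
```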
CE-Nav: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation
A novel cross-embodiment local navigation framework that serves as a plug-and-play fast system, following a "one brain, multiple forms" design.
OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation
OmniNav is a unified embodied navigation framework that combines a lightweight, real-time (up to 5 Hz) continuous waypoint policy with a fast–slow planning architecture and large-scale vision-language multi-task training. It robustly handles instruction-, object-, and point-goal navigation as well as frontier exploration, achieving state-of-the-art performance with real-world validation.
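The fast–slow interplay can be pictured as a two-rate loop: a slow planner re-selects a subgoal at a low rate while a fast policy emits a local waypoint every step; all interfaces and rates below are illustrative assumptions, not OmniNav's API.

```python
# Hedged sketch of a fast-slow navigation loop (interfaces are assumptions).
def navigate(env, slow_planner, fast_policy, max_steps=500, replan_every=10):
    obs = env.reset()
    subgoal = slow_planner(obs)                   # slow loop: deliberate subgoal choice
    for step in range(max_steps):
        if step % replan_every == 0:
            subgoal = slow_planner(obs)           # occasional re-planning
        waypoint = fast_policy(obs, subgoal)      # fast loop: real-time local waypoint
        obs, done = env.step(waypoint)
        if done:
            break
    return obs
```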
🕵🏻‍♂️ FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
A unified multimodal Chain-of-Thought (CoT) reasoning framework that internalizes the inference capabilities of world models into the VLN architecture, enabling efficient and precise navigation based on natural language instructions and visual observations.
Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
A robust vision-language-action framework with structural perception and explicit dynamics reasoning.
🌐
World Modeling
We aim to construct dynamic, interactive world models for understanding, predicting, and generating physically consistent spatiotemporal phenomena. By leveraging multimodal modeling and generative learning, our research enables a perception-to-simulation loop that empowers AI to comprehend and recreate the real world.
🌏 FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
A unified world model integrating video priors and geometric grounding for synthesizing explorable and geometrically consistent 3D scenes.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
A novel framework that leverages a world model as a virtual environment for VLA post-training.
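The basic loop can be pictured as imagined rollouts: the VLA policy acts, and the world model, rather than a real robot or simulator, predicts the next observation and a reward signal; every interface below is an illustrative assumption, not the World-Env implementation.

```python
# Hedged sketch of using a learned world model as a virtual rollout environment.
def rollout_in_world_model(world_model, vla_policy, instruction, init_obs, horizon=32):
    """Collect an imagined trajectory without touching a real environment."""
    obs, trajectory = init_obs, []
    for _ in range(horizon):
        action = vla_policy(obs, instruction)            # policy proposes an action
        obs, reward = world_model.predict(obs, action)   # world model imagines the outcome
        trajectory.append((obs, action, reward))
    return trajectory  # later used to update vla_policy (e.g., RL or preference losses)
```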
🧊
3D Generation & Reconstruction
Our research in 3D generation and reconstruction covers Gaussian Splatting, NeRF, and 3D-aware diffusion, aiming for real-time rendering, continuous level-of-detail control, and semantically consistent 3D scene synthesis.
🛰 Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
A feed-forward generative framework for synthesizing street-view-level 3D content from a single satellite image based on a geometry-first strategy, without requiring 3D annotations.
💠 CLoD-GS: Continuous Level-of-Detail Gaussian Splatting for Real-Time Rendering
CLoD-GS equips 3D Gaussian Splatting with learnable distance-adaptive opacity, enabling smooth, storage-efficient, artifact-free continuous level-of-detail rendering from a single model.
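One simple way to picture distance-adaptive opacity: give each Gaussian a base opacity and a learnable decay rate so that its contribution fades continuously with camera distance, yielding a continuous level of detail from a single model; the parameterization below is a sketch of the idea, not the exact CLoD-GS formulation.

```python
# Hedged sketch of per-Gaussian distance-adaptive opacity (not CLoD-GS's exact form).
import torch

def effective_opacity(base_opacity, decay_rate, camera_distance):
    """base_opacity, decay_rate: per-Gaussian learnable tensors (decay_rate >= 0);
    camera_distance: distance from the camera to each Gaussian center."""
    # Opacity falls off continuously with distance, so distant views keep only
    # the Gaussians that still contribute visibly, i.e., a continuous LoD.
    return base_opacity * torch.exp(-decay_rate * camera_distance)
```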
🧸 G3PT: Unleash the Power of Autoregressive Modeling in 3D Generative Tasks
The first native 3D generation foundation model based on next-scale autoregression.
🏙 Global-Guided Focal Neural Radiance Field for Large-Scale Scene Representation
GF-NeRF introduces a global-guided two-stage architecture to achieve consistent and high-fidelity large-scale scene rendering without relying on prior scene knowledge.
🎨 MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control
Geometrically controlled multi-view diffusion model for generating high-fidelity, detail-rich, and geometrically consistent 3D textures and PBR materials from a single reference image.
🧠
General Deep Learning
We focus on general representation learning and model optimization as the foundation for multimodal and cross-domain AI systems. Our research includes Transformer architecture optimization, distributed training, model compression, and preference alignment (DPO, RLHF) to enhance generalization and interpretability.
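For reference, the DPO objective mentioned above contrasts chosen and rejected responses against a frozen reference model; the compact PyTorch sketch below shows its standard form (generic, not tied to any particular model of ours).

```python
# Standard DPO loss in its generic form (a reference sketch, not a specific system).
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """logp_*: summed log-probabilities of the chosen / rejected responses under
    the policy being trained and under the frozen reference model."""
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    # Prefer the chosen response by a wider margin than the reference model does.
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```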
🎙️ A Study on the Adverse Impact of Synthetic Speech on Speech Recognition
Performance analysis and novel solutions for speech recognition under synthetic speech interference.
Doubly-Fused ViT: Fuse Information from Dual Vision Transformer Streams
DFvT introduces a doubly-fused Vision Transformer that combines efficient global context modeling with fine-grained spatial detail preservation to achieve high accuracy and efficiency.
SCMT: Self-Correction Mean Teacher for Semi-supervised Object Detection
A self-correction mean teacher architecture that mitigates the impact of noisy pseudo-labels, offering a novel technological breakthrough in the field of semi-supervised object detection.
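As background for the self-correction idea, here is a minimal sketch of the mean-teacher machinery it builds on: the teacher is an exponential moving average (EMA) of the student, and its confident detections serve as pseudo-labels; the self-correction step itself is omitted, and the signatures are illustrative assumptions rather than SCMT's code.

```python
# Hedged sketch of the mean-teacher backbone behind semi-supervised detection.
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights drift slowly toward the student's, smoothing out noise.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def filter_pseudo_labels(detections, score_threshold=0.7):
    # Keep only high-confidence teacher detections as pseudo-labels for the student.
    return [d for d in detections if d["score"] >= score_threshold]
```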
DPOSE: Online Keypoint-CAM Guided Inference for Driver Pose Estimation
An optimization scheme for a proprietary human pose estimation (HPE) task in driver monitoring system (DMS) scenarios, combining a pose-wise hard-mining strategy for distribution balance with an online keypoint-aligned Grad-CAM loss that constrains activations to semantic regions.