Advancing AI Navigation with Human-Aware Systems

In an article recently submitted to the arXiv* server, researchers introduced a new task for embodied artificial intelligence (AI) called human-aware vision-and-language navigation (HA-VLN). This task aims to bridge the gap between simulation and reality in vision-and-language navigation (VLN). To support this, they developed a realistic simulator named human-aware 3D (HA3D) and created two navigation agents.

Study: Advancing AI Navigation with Human-Aware Systems. Image Credit: sdecoret/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.


VLN is a benchmark for evaluating the simulation-to-real (Sim2Real) transfer capabilities of embodied AI agents, which learn from their environments. In VLN, an agent follows natural language instructions to navigate to a specific location within a three-dimensional (3D) space.

However, most existing VLN frameworks operate under simplifying assumptions such as static environments, optimal expert supervision, and panoramic action spaces. These assumptions limit their applicability and robustness in real-world scenarios. In particular, these frameworks often overlook human activities in populated environments, which makes them less effective where agents must navigate around people.

About the Research

In this paper, the authors developed HA-VLN, a novel task that extends VLN by incorporating human activities. In HA-VLN, an agent guided by natural language instructions navigates environments in which people are carrying out activities. The task adopts an egocentric action space with a 60-degree field of view, mirroring human-like visual perception, in place of the panoramic view assumed by earlier VLN work. It also integrates dynamic environments and 3D human motion models based on the skinned multi-person linear (SMPL) model, which captures realistic human poses and shapes. Furthermore, HA-VLN employs sub-optimal expert supervision: the agent learns from the imperfect demonstrations of an adaptive policy, which better prepares it for real-world tasks with imperfect instructions.
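To illustrate what a 60-degree egocentric view implies compared with a panoramic one, the following is a minimal geometric sketch, not code from the paper or from HA3D. The function names and the convention of measuring headings and bearings in degrees are assumptions made for this example only.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Signed difference b - a in degrees, wrapped to [-180, 180)."""
    return (b_deg - a_deg + 180.0) % 360.0 - 180.0


def in_field_of_view(heading_deg: float, bearing_deg: float,
                     fov_deg: float = 60.0) -> bool:
    """Return True if a target at `bearing_deg` lies inside a view cone
    of width `fov_deg` centred on the agent's `heading_deg`.

    Hypothetical helper: the 60-degree default mirrors the field of view
    reported for HA-VLN, but the check itself is generic geometry.
    """
    return abs(angular_difference(heading_deg, bearing_deg)) <= fov_deg / 2.0


# A person 20 degrees off the agent's heading is visible,
# while one 35 degrees off is outside the 60-degree cone.
visible = in_field_of_view(350.0, 10.0)
hidden = in_field_of_view(350.0, 25.0)
```

Under a panoramic assumption every bearing would count as visible, so an agent restricted to this cone has to turn to notice people approaching from the side or from behind.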

To support HA-VLN, the researchers developed the HA3D simulator, which integrates dynamic human activities from the custom human activity and pose simulation (HAPS) dataset with photorealistic environments from the Matterport3D dataset. The HAPS dataset includes 145 human activities converted into 435 3D human motion models.

HA3D combines these human motion models with Matterport3D to create diverse and challenging navigation scenarios. It features an annotation tool for placing each human model in various indoor regions across 90 building scenes and uses Pyrender to render dynamic human bodies with high visual realism. HA3D also provides interfaces for agent-environment interaction, including first-person RGB-D video observation, navigable viewpoints, and human collision feedback.
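The three kinds of feedback listed above can be pictured as one observation record per step. The sketch below is purely illustrative: the class name, field names, and helper function are hypothetical and do not reflect HA3D's actual API, which the article does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical per-step observation; not HA3D's real data structure."""
    rgb: List[List[List[int]]]       # H x W x 3 colour frame
    depth: List[List[float]]         # H x W depth map
    navigable_viewpoints: List[str]  # ids of viewpoints reachable next
    human_collision: bool            # True if the last move hit a human


def count_collisions(observations: List[Observation]) -> int:
    """Count the steps in an episode that reported a human collision."""
    return sum(1 for obs in observations if obs.human_collision)


# A two-step toy episode with 1x1 frames: the second step reports a collision.
episode = [
    Observation([[[0, 0, 0]]], [[1.5]], ["vp_a", "vp_b"], False),
    Observation([[[0, 0, 0]]], [[0.4]], ["vp_c"], True),
]
total_collisions = count_collisions(episode)
```

Keeping collision feedback alongside the visual frame means an evaluation script can score human avoidance without re-rendering the scene.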

Additionally, the study introduced the human-aware room-to-room (HA-R2R) dataset, an extension of the popular room-to-room (R2R) VLN dataset. HA-R2R incorporates human activity descriptions, resulting in 21,567 instructions with 145 activity types, categorized as start, obstacle, surrounding, and end based on their positions relative to the agent’s starting point. Compared to R2R, HA-R2R features longer average instruction lengths and a larger vocabulary, reflecting the increased complexity and diversity of the task.

Research Findings

To address the challenges of HA-VLN, the study introduced two multimodal agents designed to integrate visual and linguistic information for navigation. The first agent, called expert-supervised cross-modal (VLN-CM), is a sequence-to-sequence model based on long short-term memory (LSTM) networks. It learns by imitating expert demonstrations. The second agent, named non-expert-supervised decision transformer (VLN-DT), is an autoregressive transformer model.
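The article does not detail how VLN-DT is trained. In the general decision transformer formulation, each timestep is conditioned on the return-to-go, the sum of rewards from that step to the end of the trajectory, which is what allows learning from non-expert trajectories. The sketch below shows only that computation. The function name and the reward values are assumptions for illustration, not the paper's reward design.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Return, for each timestep t, the sum of rewards from t to the end.

    Generic decision-transformer preprocessing; the reward scheme used by
    VLN-DT is not given in the article, so the inputs here are made up.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the future total.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Hypothetical trajectory: a penalty at step 1, a goal bonus at step 3.
example_rtg = returns_to_go([0.0, -1.0, 0.0, 2.0])
```

At inference time, a decision transformer is typically prompted with a desired return, so the same model can be asked for better behavior than the trajectories it was trained on.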

The study evaluated agents on the HA-VLN task using metrics that considered both human perception and navigation. The results showed that HA-VLN posed a significant challenge for existing VLN agents. Even after retraining, these agents struggled to match the oracle agent.

Furthermore, VLN-DT, trained only on random data, outperformed the VLN-CM model trained under expert supervision, showcasing VLN-DT's superior generalization ability. Finally, the study validated these agents in the real world using a quadruped robot, which demonstrated human perception and avoidance capabilities. The real-world tests also highlighted the need for continued enhancement to better align the agents with real-world scenarios.


HA-VLN and HA3D have significant implications for embodied AI and Sim2Real transfer research. They can help develop and test navigation agents capable of operating in dynamic, human-populated environments such as homes, offices, hotels, and museums. These tools can also be used to explore human-aware navigation strategies, including adaptive responses and adherence to social norms, and to enhance human-robot collaboration. Additionally, they provide benchmarks and insights for developing more realistic and effective VLN systems.


In summary, HA-VLN represents a significant advancement in embodied AI and Sim2Real research by introducing a task that reflects real-world dynamics. Although current models have limitations in replicating human behavior, HA-VLN provides a critical foundation for future advancements. Future work should focus on enhancing the simulator, integrating more realistic human avatars, and expanding the HA-VLN framework to outdoor environments, paving the way for advanced VLN systems in human-populated settings.

Journal reference:
  • Preliminary scientific report. Li, M., et al. Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions. arXiv, 2024, 2406.19236. DOI: 10.48550/arXiv.2406.19236

Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, July 10). Advancing AI Navigation with Human-Aware Systems. AZoAi. Retrieved on July 17, 2024 from

  • MLA

    Osama, Muhammad. "Advancing AI Navigation with Human-Aware Systems". AZoAi. 17 July 2024. <>.

  • Chicago

    Osama, Muhammad. "Advancing AI Navigation with Human-Aware Systems". AZoAi. (accessed July 17, 2024).

  • Harvard

    Osama, Muhammad. 2024. Advancing AI Navigation with Human-Aware Systems. AZoAi, viewed 17 July 2024,


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.