Abstract: To address the limited reasoning ability of existing vision-and-language navigation methods in continuous environments, a semantic topological map based reasoning method for vision-and-language navigation in continuous environments is proposed. First, regions and objects in the navigation environment are identified through scene-understanding auxiliary tasks, and a spatial proximity knowledge base is constructed. Second, the agent interacts with the environment in real time during navigation, collecting location information, encoding visual features, and predicting semantic labels of regions and objects, thereby incrementally building a semantic topological map. On this basis, an auxiliary reasoning and localization strategy is designed: a self-attention mechanism extracts object and region information from the navigation instructions, and the spatial proximity knowledge base is combined with the semantic topological map to infer the locations of the mentioned objects and regions. This reasoning assists navigation decisions and keeps the agent's trajectory aligned with the instructions. Experimental results on the public R2R-CE and RxR-CE datasets demonstrate that the proposed method achieves a higher navigation success rate.
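To make the map-building and reasoning pipeline described in the abstract concrete, the sketch below implements a toy semantic topological map in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: names such as TopoNode, SemanticTopoMap, add_observation, and localize are hypothetical; the spatial proximity knowledge base is reduced to a dictionary of object-to-region co-occurrences; and the learned components (visual encoding, semantic label prediction, self-attention over instructions) are replaced by their symbolic outputs.

```python
# Minimal sketch of a semantic topological map and the proximity-based
# localization it supports. All class and method names here are illustrative
# assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    node_id: int
    position: tuple          # agent location (x, y) when the node was created
    region_label: str        # predicted region label, e.g. "kitchen"
    object_labels: set = field(default_factory=set)  # predicted object labels

class SemanticTopoMap:
    def __init__(self, proximity_kb):
        # proximity_kb: dict mapping an object label to the set of region
        # labels it frequently co-occurs with (a stand-in for the spatial
        # proximity knowledge base built from scene-understanding tasks)
        self.proximity_kb = proximity_kb
        self.nodes = []
        self.edges = set()   # pairs of node ids connected by traversal

    def add_observation(self, position, region_label, object_labels, prev_id=None):
        """Grow the map as the agent moves: one node per visited location."""
        node = TopoNode(len(self.nodes), position, region_label, set(object_labels))
        self.nodes.append(node)
        if prev_id is not None:
            self.edges.add((prev_id, node.node_id))
        return node.node_id

    def localize(self, instr_objects, instr_regions):
        """Score each mapped node against the objects/regions mentioned in the
        instruction, using direct matches plus knowledge-base proximity."""
        scores = {}
        for n in self.nodes:
            s = float(len(n.object_labels & set(instr_objects)))
            if n.region_label in instr_regions:
                s += 1.0
            # Proximity reasoning: an instruction object whose typical region
            # matches this node's region weakly supports the node.
            for obj in instr_objects:
                if n.region_label in self.proximity_kb.get(obj, set()):
                    s += 0.5
            scores[n.node_id] = s
        return max(scores, key=scores.get) if scores else None

# Usage: build the map during navigation, then query it with instruction cues.
kb = {"oven": {"kitchen"}, "sofa": {"living room"}}
m = SemanticTopoMap(kb)
a = m.add_observation((0.0, 0.0), "hallway", ["door"])
b = m.add_observation((1.5, 0.0), "kitchen", ["oven", "sink"], prev_id=a)
print(m.localize(["oven"], ["kitchen"]))  # -> node b's id
```

In the full method, the per-node evidence would come from learned cross-modal features rather than hand-set weights; the fixed 0.5 proximity bonus above only stands in for that learned term.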