Dual modality prompt learning for visual question-grounded answering in robotic surgery.
Keywords
Grounding-answering
Prompt learning
Textual prompt
Visual prompt
Visual question answering
Journal
Visual computing for industry, biomedicine, and art
ISSN: 2524-4442
Abbreviated title: Vis Comput Ind Biomed Art
Country: Germany
NLM ID: 101759975
Publication information
Publication date:
22 Apr 2024
History:
received: 05 Feb 2024
accepted: 11 Apr 2024
medline: 22 Apr 2024
pubmed: 22 Apr 2024
entrez: 22 Apr 2024
Status:
epublish
Abstract
With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. The visual complementary prompter merges visual prompt knowledge with visual features to guide accurate localization, while the textual complementary prompter aligns visual information with textual prompt knowledge and textual information, steering the model towards more accurate answer inference. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of both textual and grounded answers. Experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
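For illustration only, the dual complementary-prompter idea described in the abstract could be sketched roughly as below. This is a minimal PyTorch sketch under assumed module names, feature dimensions, gating scheme, iteration count, and output heads; it is not the authors' implementation.

# Minimal, illustrative PyTorch sketch of the dual-modality prompting idea
# described in the abstract. NOT the authors' implementation: module names,
# dimensions, and the gated-fusion scheme are assumptions made here purely
# to make the idea concrete.
import torch
import torch.nn as nn


class ComplementaryPrompter(nn.Module):
    """Fuses learnable prompt tokens with one modality's features,
    conditioned on the other (complementary) modality."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feats:   (B, N, D) features of the modality being prompted
        # context: (B, M, D) features of the complementary modality
        b = feats.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        # Prompt tokens absorb complementary-modality information.
        prompt_ctx, _ = self.cross_attn(prompts, context, context)
        # Pooled prompt knowledge is injected back into the features via a gate.
        pooled = prompt_ctx.mean(dim=1, keepdim=True).expand_as(feats)
        g = self.gate(torch.cat([feats, pooled], dim=-1))
        return feats + g * pooled


class DualModalityPromptFusion(nn.Module):
    """Visual and textual complementary prompters applied over several
    fusion iterations, followed by answer and grounding-box heads."""

    def __init__(self, dim: int = 256, num_answers: int = 18, num_iters: int = 2):
        super().__init__()
        self.visual_prompter = ComplementaryPrompter(dim)
        self.textual_prompter = ComplementaryPrompter(dim)
        self.num_iters = num_iters
        self.answer_head = nn.Linear(dim, num_answers)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor):
        # Multiple fusion iterations: each modality is repeatedly refined
        # with prompt knowledge conditioned on the other modality.
        for _ in range(self.num_iters):
            vis_feats = self.visual_prompter(vis_feats, txt_feats)
            txt_feats = self.textual_prompter(txt_feats, vis_feats)
        answer_logits = self.answer_head(txt_feats.mean(dim=1))
        box = self.box_head(vis_feats.mean(dim=1)).sigmoid()
        return answer_logits, box


if __name__ == "__main__":
    model = DualModalityPromptFusion()
    vis = torch.randn(2, 49, 256)   # e.g. patch features from a visual encoder
    txt = torch.randn(2, 20, 256)   # e.g. token embeddings of the question
    logits, box = model(vis, txt)
    print(logits.shape, box.shape)  # torch.Size([2, 18]) torch.Size([2, 4])

In this sketch the gate decides, per feature, how much pooled prompt knowledge to inject, which loosely mirrors the complementary merging of prompt knowledge with modality features described in the abstract; the loop corresponds to the iterative fusion for answer and grounding prediction.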
Identifiers
pubmed: 38647624
doi: 10.1186/s42492-024-00160-z
pii: 10.1186/s42492-024-00160-z
Publication types
Journal Article
Languages
eng
Pagination
9
Grants
Organization: 111 Project
ID: D23006
Organization: National Key Research and Development Program of China
ID: 2021ZD0112400
Copyright information
© 2024. The Author(s).
References
Wu DC, Wang YH, Ma HM, Ai LY, Yang JL, Zhang SJ et al (2023) Adaptive feature extraction method for capsule endoscopy images. Vis Comput Ind Biomed Art 6(1):24. https://doi.org/10.1186/S42492-023-00151-6
doi: 10.1186/S42492-023-00151-6
Pan J, Lv RJ, Wang Q, Zhao XB, Liu JG, Ai L (2023) Discrimination between leucine-rich glioma-inactivated 1 antibody encephalitis and gamma-aminobutyric acid B receptor antibody encephalitis based on ResNet18. Vis Comput Ind Biomed Art 6(1):17. https://doi.org/10.1186/S42492-023-00144-5
doi: 10.1186/S42492-023-00144-5
Sarmah M, Neelima A, Singh HR (2023) Survey of methods and principles in three-dimensional reconstruction from two-dimensional medical images. Vis Comput Ind Biomed Art 6(1):15. https://doi.org/10.1186/S42492-023-00142-7
doi: 10.1186/S42492-023-00142-7
Khan AU, Kuehne H, Duarte K, Gan C, Lobo N, Shah M (2021) Found a reason for me? Weakly-supervised grounded visual question answering using capsules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, Nashville. https://doi.org/10.1109/CVPR46437.2021.00836
Anderson P, He XD, Buehler C, Teney D, Johnson M, Gould S et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the 2018 IEEE/CVF conference on computer vision and pattern recognition. IEEE, Salt Lake City. https://doi.org/10.1109/CVPR.2018.00636
Urooj A, Mazaheri A, Da Vitoria Lobo N, Shah M (2020) MMFT-BERT: multimodal fusion transformer with BERT encodings for visual question answering. In: Proceedings of the association for computational linguistics: EMNLP 2020, Online Event, Association for Computational Linguistics. https://doi.org/10.18653/V1/2020.FINDINGS-EMNLP.417
Hu RH, Rohrbach A, Darrell T, Saenko K (2019) Language-conditioned graph networks for relational reasoning. In: Proceedings of the 2019 IEEE/CVF international conference on computer vision. IEEE, Seoul. https://doi.org/10.1109/ICCV.2019.01039
Jiang Y, Natarajan V, Chen XL, Rohrbach M, Batra D, Parikh D (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956
Zhou Y, Ren T, Zhu C, Sun X, Liu J, Ding X et al (2021) TRAR: routing the attention spans in transformer for visual question answering. In: Proceedings of the 2021 IEEE/CVF international conference on computer vision. IEEE, Montreal. https://doi.org/10.1109/ICCV48922.2021.00208
Reich D, Putze F, Schultz T (2023) Measuring faithful and plausible visual grounding in VQA. In: Findings of the association for computational linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp 3129–3144. https://doi.org/10.18653/v1/2023.findings-emnlp.206
Gan C, Li YD, Li HX, Sun C, Gong BQ (2017) VQS: linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In: Proceedings of the IEEE international conference on computer vision. IEEE, Venice. https://doi.org/10.1109/ICCV.2017.201
Hudson DA, Manning CD (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, Long Beach. https://doi.org/10.1109/CVPR.2019.00686
Chen CY, Anjum S, Gurari D (2022) Grounding answers for visual questions asked by visually impaired people. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, New Orleans. https://doi.org/10.1109/CVPR52688.2022.01851
Bai L, Islam M, Seenivasan L, Ren HL (2023) Surgical-VQLA: transformer with gated vision-language embedding for visual question localized-answering in robotic surgery. In: Proceedings of the IEEE international conference on robotics and automation, IEEE, London, 29 May-2 June 2023. https://doi.org/10.1109/ICRA48891.2023.10160403
Bai L, Islam M, Ren HL (2023) CAT-ViL: co-attention gated vision-language embedding for visual question localized-answering in robotic surgery. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14228. Springer, Cham, pp 397–407. https://doi.org/10.1007/978-3-031-43996-4_38
Bai L, Islam M, Ren HL (2023) Revisiting distillation for continual learning on visual question localized-answering in robotic surgery. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14228. Springer, Cham, pp 68–78. https://doi.org/10.1007/978-3-031-43996-4_7
Tascon-Morales S, Márquez-Neila P, Sznitman R (2023) Localized questions in medical visual question answering. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14221. Springer, Cham, pp 361–370. https://doi.org/10.1007/978-3-031-43895-0_34
Lester B, Al-Rfou R, Constant N (2021) The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. Association for Computational Linguistics, Punta Cana. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.243
Liu PF, Yuan WZ, Fu JL, Jiang ZB, Hayashi H, Neubig G (2023) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):195. https://doi.org/10.1145/3560815
doi: 10.1145/3560815
Jia ML, Tang LM, Chen BC, Cardie C, Belongie S, Hariharan B et al (2022) Visual prompt tuning. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T (eds) Computer vision - ECCV 2022. 17th European conference, Tel Aviv, October 2022. Lecture notes in computer science, vol 13693. Springer, Cham, pp 709–727. https://doi.org/10.1007/978-3-031-19827-4_41
Chen SF, Ge CJ, Tong Z, Wang JL, Song YB, Wang J et al (2022) AdaptFormer: adapting vision transformers for scalable visual recognition. In: Proceedings of the 36th conference on neural information processing systems, New Orleans
Jie SB, Deng ZH (2022) Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv: 2207.07039. https://doi.org/10.48550/ARXIV.2207.07039
Dai YM, Gieseke F, Oehmcke S, Wu YQ, Barnard K (2021) Attentional feature fusion. In: Proceedings of the IEEE winter conference on applications of computer vision, IEEE, Waikoloa. https://doi.org/10.1109/WACV48630.2021.00360
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th international conference on machine learning, ICML, Virtual Event
Allan M, Kondo S, Bodenstedt S, Leger S, Kadkhodamohammadi R, Luengo I et al (2020) 2018 Robotic scene segmentation challenge. arXiv preprint arXiv: 2001.11190
Allan M, Shvets A, Kurmann T, Zhang ZC, Duggal R, Su YH et al (2019) 2017 Robotic instrument segmentation challenge. arXiv preprint arXiv: 1902.06426
Seenivasan L, Mitheran S, Islam M, Ren HL (2022) Global-reasoned multi-task learning model for surgical scene understanding. IEEE Robot Autom Lett 7(2):3858–3865. https://doi.org/10.1109/LRA.2022.3146544
doi: 10.1109/LRA.2022.3146544
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, San Diego
Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv: 1908.03557
Seenivasan L, Islam M, Krishna AK, Ren HL (2022) Surgical-VQA: visual question answering in surgical scenes using transformer. In: Wang LW, Dou Q, Fletcher PT, Speidel S, Li S (eds) Medical image computing and computer assisted intervention – MICCAI 2022. 25th international conference, Singapore, September 2022. Lecture notes in computer science, vol 13437. Springer, Cham, pp 33–43. https://doi.org/10.1007/978-3-031-16449-1_4
Yu Z, Yu J, Cui YH, Tao DC, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, IEEE, Long Beach. https://doi.org/10.1109/CVPR.2019.00644
Ren SQ, He KM, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
doi: 10.1109/TPAMI.2016.2577031
Ben-Younes H, Cadene R, Cord M, Thome N (2017) MUTAN: multimodal tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision. IEEE, Venice. https://doi.org/10.1109/ICCV.2017.285
Yu Z, Yu J, Xiang CC, Fan JP, Tao DC (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959. https://doi.org/10.1109/TNNLS.2018.2817340
doi: 10.1109/TNNLS.2018.2817340
Ben-Younes H, Cadene R, Thome N, Cord M (2019) BLOCK: bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proceedings of the 33rd AAAI conference on artificial intelligence, AAAI Press, Honolulu, 27 January-1 February 2019. https://doi.org/10.1609/AAAI.V33I01.33018102
He KM, Zhang XY, Ren SQ, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition. IEEE, Las Vegas. https://doi.org/10.1109/CVPR.2016.90
Hendrycks D, Dietterich TG (2019) Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the 7th international conference on learning representations, OpenReview.net, New Orleans
Valderrama N, Puentes PR, Hernández I, Ayobi N, Verlyck M, Santander J et al (2022) Towards holistic surgical scene understanding. In: Wang LW, Dou Q, Fletcher PT, Speidel S, Li S (eds) Medical image computing and computer assisted intervention – MICCAI 2022. 25th international conference, Singapore, September 2022. Lecture notes in computer science, vol 13437. Springer, Cham, pp 442–452. https://doi.org/10.1007/978-3-031-16449-1_42
Lu HY, Liu W, Zhang B, Wang BX, Dong K, Liu B et al (2024) DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv: 2403.05525. https://doi.org/10.48550/arXiv.2403.05525
Chen GH, Chen SN, Zhang RF, Chen JY, Wu XB, Zhang ZY et al (2024) ALLaVA: harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv: 2402.11684. https://doi.org/10.48550/ARXIV.2402.11684