Dual modality prompt learning for visual question-grounded answering in robotic surgery.
Keywords
Grounding-answering
Prompt learning
Textual prompt
Visual prompt
Visual question answering
Journal
Visual computing for industry, biomedicine, and art
ISSN: 2524-4442
Abbreviated title: Vis Comput Ind Biomed Art
Country: Germany
NLM ID: 101759975
Publication information
Publication date:
22 Apr 2024
History:
received: 05 Feb 2024
accepted: 11 Apr 2024
medline: 22 Apr 2024
pubmed: 22 Apr 2024
entrez: 22 Apr 2024
Status:
epublish
Abstract
With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. The visual complementary prompter merges visual prompt knowledge with visual features to guide accurate localization, while the textual complementary prompter aligns visual information with textual prompt knowledge and textual information, steering the model towards more accurate answer inference. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of both textual and grounded answers. Experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
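For illustration only, the dual complementary-prompter idea described in the abstract could be sketched roughly as below. This is a minimal PyTorch sketch under assumed module names, feature dimensions, gating scheme, iteration count, and output heads; it is not the authors' implementation.

# Minimal, illustrative PyTorch sketch of the dual-modality prompting idea
# described in the abstract. NOT the authors' implementation: module names,
# dimensions, and the gated-fusion scheme are assumptions made here purely
# to make the idea concrete.
import torch
import torch.nn as nn


class ComplementaryPrompter(nn.Module):
    """Fuses learnable prompt tokens with one modality's features,
    conditioned on the other (complementary) modality."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feats:   (B, N, D) features of the modality being prompted
        # context: (B, M, D) features of the complementary modality
        b = feats.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        # Prompt tokens absorb complementary-modality information.
        prompt_ctx, _ = self.cross_attn(prompts, context, context)
        # Pooled prompt knowledge is injected back into the features via a gate.
        pooled = prompt_ctx.mean(dim=1, keepdim=True).expand_as(feats)
        g = self.gate(torch.cat([feats, pooled], dim=-1))
        return feats + g * pooled


class DualModalityPromptFusion(nn.Module):
    """Visual and textual complementary prompters applied over several
    fusion iterations, followed by answer and grounding-box heads."""

    def __init__(self, dim: int = 256, num_answers: int = 18, num_iters: int = 2):
        super().__init__()
        self.visual_prompter = ComplementaryPrompter(dim)
        self.textual_prompter = ComplementaryPrompter(dim)
        self.num_iters = num_iters
        self.answer_head = nn.Linear(dim, num_answers)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor):
        # Multiple fusion iterations: each modality is repeatedly refined
        # with prompt knowledge conditioned on the other modality.
        for _ in range(self.num_iters):
            vis_feats = self.visual_prompter(vis_feats, txt_feats)
            txt_feats = self.textual_prompter(txt_feats, vis_feats)
        answer_logits = self.answer_head(txt_feats.mean(dim=1))
        box = self.box_head(vis_feats.mean(dim=1)).sigmoid()
        return answer_logits, box


if __name__ == "__main__":
    model = DualModalityPromptFusion()
    vis = torch.randn(2, 49, 256)   # e.g. patch features from a visual encoder
    txt = torch.randn(2, 20, 256)   # e.g. token embeddings of the question
    logits, box = model(vis, txt)
    print(logits.shape, box.shape)  # torch.Size([2, 18]) torch.Size([2, 4])

In this sketch the gate decides, per feature, how much pooled prompt knowledge to inject, which loosely mirrors the complementary merging of prompt knowledge with modality features described in the abstract; the loop corresponds to the iterative fusion for answer and grounding prediction.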
Identifiers
pubmed: 38647624
doi: 10.1186/s42492-024-00160-z
pii: 10.1186/s42492-024-00160-z
Publication types
Journal Article
Languages
eng
Pagination
9
Grants
Organization: 111 Project
ID: D23006
Organization: National Key Research and Development Program of China
ID: 2021ZD0112400
Copyright information
© 2024. The Author(s).
References
Wu DC, Wang YH, Ma HM, Ai LY, Yang JL, Zhang SJ et al (2023) Adaptive feature extraction method for capsule endoscopy images. Vis Comput Ind Biomed Art 6(1):24. https://doi.org/10.1186/S42492-023-00151-6
doi: 10.1186/S42492-023-00151-6
Pan J, Lv RJ, Wang Q, Zhao XB, Liu JG, Ai L (2023) Discrimination between leucine-rich glioma-inactivated 1 antibody encephalitis and gamma-aminobutyric acid B receptor antibody encephalitis based on ResNet18. Vis Comput Ind Biomed Art 6(1):17. https://doi.org/10.1186/S42492-023-00144-5
doi: 10.1186/S42492-023-00144-5
Sarmah M, Neelima A, Singh HR (2023) Survey of methods and principles in three-dimensional reconstruction from two-dimensional medical images. Vis Comput Ind Biomed Art 6(1):15. https://doi.org/10.1186/S42492-023-00142-7
doi: 10.1186/S42492-023-00142-7
Khan AU, Kuehne H, Duarte K, Gan C, Lobo N, Shah M (2021) Found a reason for me? Weakly-supervised grounded visual question answering using capsules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, Nashville. https://doi.org/10.1109/CVPR46437.2021.00836
Anderson P, He XD, Buehler C, Teney D, Johnson M, Gould S et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the 2018 IEEE/CVF conference on computer vision and pattern recognition. IEEE, Salt Lake City. https://doi.org/10.1109/CVPR.2018.00636
Urooj A, Mazaheri A, Da Vitoria Lobo N, Shah M (2020) MMFT-BERT: multimodal fusion transformer with BERT encodings for visual question answering. In: Proceedings of the association for computational linguistics: EMNLP 2020, Online Event, Association for Computational Linguistics. https://doi.org/10.18653/V1/2020.FINDINGS-EMNLP.417
Hu RH, Rohrbach A, Darrell T, Saenko K (2019) Language-conditioned graph networks for relational reasoning. In: Proceedings of the 2019 IEEE/CVF international conference on computer vision. IEEE, Seoul. https://doi.org/10.1109/ICCV.2019.01039
Jiang Y, Natarajan V, Chen XL, Rohrbach M, Batra D, Parikh D (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956
Zhou Y, Ren T, Zhu C, Sun X, Liu J, Ding X et al (2021) TRAR: routing the attention spans in transformer for visual question answering. In: Proceedings of the 2021 IEEE/CVF international conference on computer vision. IEEE, Montreal. https://doi.org/10.1109/ICCV48922.2021.00208
Reich D, Putze F, Schultz T (2023) Measuring faithful and plausible visual grounding in VQA. In: Findings of the association for computational linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp 3129–3144. https://doi.org/10.18653/v1/2023.findings-emnlp.206
Gan C, Li YD, Li HX, Sun C, Gong BQ (2017) VQS: linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In: Proceedings of the IEEE international conference on computer vision. IEEE, Venice. https://doi.org/10.1109/ICCV.2017.201
Hudson DA, Manning CD (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, Long Beach. https://doi.org/10.1109/CVPR.2019.00686
Chen CY, Anjum S, Gurari D (2022) Grounding answers for visual questions asked by visually impaired people. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, New Orleans. https://doi.org/10.1109/CVPR52688.2022.01851
Bai L, Islam M, Seenivasan L, Ren HL (2023) Surgical-VQLA: transformer with gated vision-language embedding for visual question localized-answering in robotic surgery. In: Proceedings of the IEEE international conference on robotics and automation, IEEE, London, 29 May-2 June 2023. https://doi.org/10.1109/ICRA48891.2023.10160403
Bai L, Islam M, Ren HL (2023) CAT-ViL: co-attention gated vision-language embedding for visual question localized-answering in robotic surgery. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14228. Springer, Cham, pp 397–407. https://doi.org/10.1007/978-3-031-43996-4_38
Bai L, Islam M, Ren HL (2023) Revisiting distillation for continual learning on visual question localized-answering in robotic surgery. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14228. Springer, Cham, pp 68–78. https://doi.org/10.1007/978-3-031-43996-4_7
Tascon-Morales S, Márquez-Neila P, Sznitman R (2023) Localized questions in medical visual question answering. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T et al (eds) Medical image computing and computer assisted intervention MICCAI 2023. 26th international conference, Vancouver, October 2023. Lecture notes in computer science, vol 14221. Springer, Cham, pp 361–370. https://doi.org/10.1007/978-3-031-43895-0_34
Lester B, Al-Rfou R, Constant N (2021) The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. Association for Computational Linguistics, Punta Cana. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.243
Liu PF, Yuan WZ, Fu JL, Jiang ZB, Hayashi H, Neubig G (2023) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):195. https://doi.org/10.1145/3560815
doi: 10.1145/3560815
Jia ML, Tang LM, Chen BC, Cardie C, Belongie S, Hariharan B et al (2022) Visual prompt tuning. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T (eds) Computer vision - ECCV 2022. 17th European conference, Tel Aviv, October 2022. Lecture notes in computer science, vol 13693. Springer, Cham, pp 709–727. https://doi.org/10.1007/978-3-031-19827-4_41
Chen SF, Ge CJ, Tong Z, Wang JL, Song YB, Wang J et al (2022) AdaptFormer: adapting vision transformers for scalable visual recognition. In: Proceedings of the 36th conference on neural information processing systems, New Orleans
Jie SB, Deng ZH (2022) Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv: 2207.07039. https://doi.org/10.48550/ARXIV.2207.07039
Dai YM, Gieseke F, Oehmcke S, Wu YQ, Barnard K (2021) Attentional feature fusion. In: Proceedings of the IEEE winter conference on applications of computer vision, IEEE, Waikoloa. https://doi.org/10.1109/WACV48630.2021.00360
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th international conference on machine learning, ICML, Virtual Event
Allan M, Kondo S, Bodenstedt S, Leger S, Kadkhodamohammadi R, Luengo I et al (2020) 2018 Robotic scene segmentation challenge. arXiv preprint arXiv: 2001.11190
Allan M, Shvets A, Kurmann T, Zhang ZC, Duggal R, Su YH et al (2019) 2017 Robotic instrument segmentation challenge. arXiv preprint arXiv: 1902.06426
Seenivasan L, Mitheran S, Islam M, Ren HL (2022) Global-reasoned multi-task learning model for surgical scene understanding. IEEE Robot Autom Lett 7(2):3858–3865. https://doi.org/10.1109/LRA.2022.3146544
doi: 10.1109/LRA.2022.3146544
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, San Diego
Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv: 1908.03557
Seenivasan L, Islam M, Krishna AK, Ren HL (2022) Surgical-VQA: visual question answering in surgical scenes using transformer. In: Wang LW, Dou Q, Fletcher PT, Speidel S, Li S (eds) Medical image computing and computer assisted intervention – MICCAI 2022. 25th international conference, Singapore, September 2022. Lecture notes in computer science, vol 13437. Springer, Cham, pp 33–43. https://doi.org/10.1007/978-3-031-16449-1_4
Yu Z, Yu J, Cui YH, Tao DC, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, IEEE, Long Beach. https://doi.org/10.1109/CVPR.2019.00644
Ren SQ, He KM, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
doi: 10.1109/TPAMI.2016.2577031
Ben-Younes H, Cadene R, Cord M, Thome N (2017) MUTAN: multimodal tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision. IEEE, Venice. https://doi.org/10.1109/ICCV.2017.285
Yu Z, Yu J, Xiang CC, Fan JP, Tao DC (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959. https://doi.org/10.1109/TNNLS.2018.2817340
doi: 10.1109/TNNLS.2018.2817340
Ben-Younes H, Cadene R, Thome N, Cord M (2019) BLOCK: bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proceedings of the 33rd AAAI conference on artificial intelligence, AAAI Press, Honolulu, 27 January-1 February 2019. https://doi.org/10.1609/AAAI.V33I01.33018102
He KM, Zhang XY, Ren SQ, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition. IEEE, Las Vegas. https://doi.org/10.1109/CVPR.2016.90
Hendrycks D, Dietterich TG (2019) Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the 7th international conference on learning representations, OpenReview.net, New Orleans
Valderrama N, Puentes PR, Hernández I, Ayobi N, Verlyck M, Santander J et al (2022) Towards holistic surgical scene understanding. In: Wang LW, Dou Q, Fletcher PT, Speidel S, Li S (eds) Medical image computing and computer assisted intervention – MICCAI 2022. 25th international conference, Singapore, September 2022. Lecture notes in computer science, vol 13437. Springer, Cham, pp 442–452. https://doi.org/10.1007/978-3-031-16449-1_42
Lu HY, Liu W, Zhang B, Wang BX, Dong K, Liu B et al (2024) DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv: 2403.05525. https://doi.org/10.48550/arXiv.2403.05525
Chen GH, Chen SN, Zhang RF, Chen JY, Wu XB, Zhang ZY et al (2024) ALLaVA: harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv: 2402.11684. https://doi.org/10.48550/ARXIV.2402.11684