DOMAS: DATA ORIENTED MEDICAL VISUAL QUESTION ANSWERING USING SWIN TRANSFORMER
DOI:
https://doi.org/10.24193/subbi.2023.1.04
Keywords:
Medical Visual Question Answering, Swin Transformer
Abstract
The Medical Visual Question Answering problem is a joint Computer Vision and Natural Language Processing task that aims to produce a natural-language answer to a question, itself posed in natural language, about an image, where both the image and the question are of a medical nature. In this paper, we introduce DOMAS, a deep learning model that solves this task on the VQA-Med 2019 dataset. The method divides the task into smaller classification problems by using a BERT-based question classifier together with a unique approach that exploits dataset information to select the suitable model. For the image classification problems, transfer learning was performed using a pre-trained Swin Transformer-based architecture. DOMAS combines a question classifier and seven image classifiers with the image classifier selection strategy, achieving a strict accuracy of 0.616 and a BLEU score of 0.654. These results are competitive with other state-of-the-art models, showing that our approach is effective for the presented task.
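To make the routing idea concrete, below is a minimal sketch of a DOMAS-style pipeline: a BERT-based classifier assigns the question to a category, and that category selects the Swin Transformer image classifier whose answer set matches the question type. The category names, checkpoints, and answer-vocabulary sizes are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of a DOMAS-style pipeline (assumed names and sizes).
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# VQA-Med 2019 question categories (the dataset's four task types).
QUESTION_CATEGORIES = ["modality", "plane", "organ", "abnormality"]

# 1) BERT-based question classifier: maps a question to one category.
#    In practice this model would be fine-tuned on the VQA-Med 2019 questions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
question_clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(QUESTION_CATEGORIES)
).eval()

def classify_question(question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = question_clf(**inputs).logits
    return QUESTION_CATEGORIES[logits.argmax(dim=-1).item()]

# 2) One pre-trained Swin Transformer classifier per subproblem.
#    DOMAS uses seven; the one-per-category mapping here is a simplification.
def make_swin_classifier(num_answers: int):
    return timm.create_model(
        "swin_base_patch4_window7_224", pretrained=True, num_classes=num_answers
    ).eval()

image_classifiers = {cat: make_swin_classifier(num_answers=10)
                     for cat in QUESTION_CATEGORIES}

def answer(question: str, image: Image.Image, answer_vocab: dict) -> str:
    category = classify_question(question)   # route by question type
    model = image_classifiers[category]      # classifier selection strategy
    transform = create_transform(**resolve_data_config({}, model=model))
    with torch.no_grad():
        logits = model(transform(image).unsqueeze(0))
    return answer_vocab[category][logits.argmax(dim=-1).item()]
```

In the actual system each image classifier would be fine-tuned on its slice of the VQA-Med 2019 training data; the point of the sketch is only the question-driven selection of a specialized classifier.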
License
Copyright (c) 2023 Studia Universitatis Babeș-Bolyai Informatica
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.