Towards building multilingual language model for medicine.
Journal
Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
27 Sep 2024
History:
received: 27 Feb 2024
accepted: 06 Sep 2024
medline: 28 Sep 2024
pubmed: 28 Sep 2024
entrez: 27 Sep 2024
Status:
epublish
Abstract
The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we make three contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, enabling auto-regressive domain adaptation of general large language models (LLMs). Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source LLMs on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.
Identifiers
pubmed: 39333468
doi: 10.1038/s41467-024-52417-z
pii: 10.1038/s41467-024-52417-z
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
8384
Grants
Agency: Science and Technology Commission of Shanghai Municipality (Shanghai Municipal Science and Technology Commission)
ID: 18DZ2270700
Agency: Science and Technology Commission of Shanghai Municipality (Shanghai Municipal Science and Technology Commission)
ID: 21DZ1100100
Copyright information
© 2024. The Author(s).