Towards building multilingual language model for medicine.

MeSH terms: Multilingualism; Humans; Language; Benchmarking; Natural Language Processing

Journal

Nature Communications

ISSN: 2041-1723

Abbreviated title: Nat Commun

Country: England

NLM ID: 101528555

Publication information

Publication date: 27 Sep 2024

History:

received: 27 Feb 2024

accepted: 06 Sep 2024

medline: 28 Sep 2024

pubmed: 28 Sep 2024

entrez: 27 Sep 2024

Status: epublish

Abstract

The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience across different regions. To advance this area, we make the following contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, which enables auto-regressive domain adaptation of general LLMs. Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.

Identifiers

DOI: 10.1038/s41467-024-52417-z

PMID: 39333468

PII: 10.1038/s41467-024-52417-z

Publication types

Journal Article

Languages

eng

Pagination

8384

Grants

Agency: Science and Technology Commission of Shanghai Municipality (Shanghai Municipal Science and Technology Commission)

ID: 18DZ2270700

Agency: Science and Technology Commission of Shanghai Municipality (Shanghai Municipal Science and Technology Commission)

ID: 21DZ1100100


Authors

Pengcheng Qiu

Chaoyi Wu

Xiaoman Zhang

Weixiong Lin

Haicheng Wang

Ya Zhang

Yanfeng Wang

Weidi Xie
