Vision-Language Models for Vision Tasks: A Survey.


Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence
ISSN: 1939-3539
Abbreviated title: IEEE Trans Pattern Anal Mach Intell
Country: United States
NLM ID: 9885960

Publication information

Publication date:
26 Feb 2024
History:
medline: 26 Feb 2024
pubmed: 26 Feb 2024
entrez: 26 Feb 2024
Status: ahead of print

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs) and usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have recently been investigated intensively: they learn rich vision-language correlations from web-scale image-text pairs, which are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, covering: (1) the background, which introduces the development of visual recognition paradigms; (2) the foundations of VLMs, which summarize the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets for VLM pre-training and evaluation; (4) a review and categorization of existing VLM pre-training, VLM transfer learning, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions for future VLM studies on visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.
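The zero-shot prediction paradigm summarized above can be illustrated with a short sketch. The snippet below is not taken from the survey itself; it is a minimal example assuming the open-source CLIP checkpoint exposed through the Hugging Face transformers library, and the image path and label prompts are hypothetical placeholders. The pre-trained VLM scores an image against arbitrary natural-language label prompts, so no task-specific classifier needs to be trained.

```python
# Minimal sketch of VLM zero-shot recognition (assumptions: the
# "openai/clip-vit-base-patch32" checkpoint, a local "example.jpg",
# and the label prompts below are all illustrative placeholders).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes are expressed as natural-language prompts; the same
# pre-trained VLM can handle an arbitrary label set without retraining.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # hypothetical image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, normalized into per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Swapping in a different label set, or a different recognition task reformulated as image-text matching, requires no retraining, which is the property that lets a single VLM serve many visual recognition tasks.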

Identifiers

pubmed: 38408000
doi: 10.1109/TPAMI.2024.3369699

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Authors

MeSH classifications