VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning.
Journal
IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society
ISSN: 1941-0042
Abbreviated title: IEEE Trans Image Process
Country: United States
NLM ID: 9886191
Publication information
Publication date: 2022
History:
pubmed: 2022-09-15
medline: 2022-09-23
entrez: 2022-09-14
Status: ppublish
Abstract
In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work, which performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning from instructional videos that contain a wealth of detailed information about the physical world. We conceptualize two tasks for this emerging and challenging topic. The primary task is AVR: given the initial configuration and desired goal from an instructional video, the model is expected to infer the most plausible sequence of steps to achieve the goal. To avoid trivial solutions based on appearance information rather than reasoning, we construct a second task, AVR++, which requires the model to explain why the unselected options are less plausible. We introduce a new dataset called VideoABC, which consists of 46,354 unique steps derived from 11,827 instructional videos, formulated as 13,526 abductive reasoning questions with an average reasoning duration of 51 seconds. Through an adversarial hard hypothesis mining algorithm, non-trivial and high-quality problems are generated efficiently and effectively. To achieve human-level reasoning, we propose a Hierarchical Dual Reasoning Network (HDRNet) to capture the long-term dependencies among steps and observations. We establish a benchmark for abductive visual reasoning; our method sets the state of the art on AVR (∼74%) and AVR++ (∼45%), while humans easily achieve over 90% accuracy on both tasks. The large performance gap reveals the limitation of current video understanding models on temporal reasoning and leaves substantial room for future research on this challenging problem. Our dataset and code are available at https://github.com/wl-zhao/VideoABC.
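To make the AVR task formulation concrete, the sketch below shows one way an abductive reasoning question (initial observation, goal observation, and candidate step sequences) could be represented and answered by ranking hypotheses. The field names, the toy lexical-overlap scorer, and the example question are illustrative assumptions only; they are not the actual VideoABC schema, the adversarial hard hypothesis mining procedure, or the HDRNet model.

```python
# Minimal sketch of an abductive visual reasoning (AVR) question as multiple choice.
# All names and the scoring heuristic are assumptions for exposition, not the
# published VideoABC format or HDRNet.
from dataclasses import dataclass
from typing import List


@dataclass
class AVRQuestion:
    """One question: observed start/goal states plus candidate step sequences."""
    initial_obs: str               # textual stand-in for the initial observation
    goal_obs: str                  # textual stand-in for the desired goal
    hypotheses: List[List[str]]    # candidate sequences of intermediate steps
    answer: int                    # index of the most plausible hypothesis


def score_hypothesis(initial_obs: str, goal_obs: str, steps: List[str]) -> float:
    """Toy plausibility score based on word overlap with the observations.
    A real model would instead reason over video features and the long-term
    dependencies among steps."""
    obs_tokens = set((initial_obs + " " + goal_obs).lower().split())
    step_tokens = set(" ".join(steps).lower().split())
    return len(obs_tokens & step_tokens) / max(len(step_tokens), 1)


def predict(question: AVRQuestion) -> int:
    """Select the hypothesis with the highest plausibility score (the AVR task)."""
    scores = [
        score_hypothesis(question.initial_obs, question.goal_obs, hypothesis)
        for hypothesis in question.hypotheses
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    q = AVRQuestion(
        initial_obs="raw egg and flour in a bowl",
        goal_obs="baked cake on a plate",
        hypotheses=[
            ["whisk egg and flour", "pour batter into pan", "bake in oven"],
            ["pour flour into sink", "bake empty pan", "plate the cake"],
        ],
        answer=0,
    )
    print("predicted hypothesis index:", predict(q))
```

AVR++ would additionally require justifying why the rejected hypotheses are less plausible, which this simple ranking sketch does not attempt.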
Identifiers
pubmed: 36103440
doi: 10.1109/TIP.2022.3205207
Publication types
Journal Article
Languages
eng
Citation subsets
IM