MACHINE_LEARN Telegram 3471
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

25 Feb 2025 · Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao ·

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose #ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive #ViDoSeek benchmark.

Paper: https://arxiv.org/pdf/2502.18017v1.pdf

Code: https://github.com/Alibaba-NLP/ViDoRAG

@Machine_learn
5👍1



tgoop.com/Machine_learn/3471
Create:
Last Update:

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

25 Feb 2025 · Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao ·

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose #ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive #ViDoSeek benchmark.

Paper: https://arxiv.org/pdf/2502.18017v1.pdf

Code: https://github.com/Alibaba-NLP/ViDoRAG

@Machine_learn

BY Machine learning books and papers




Share with your friend now:
tgoop.com/Machine_learn/3471

View MORE
Open in Telegram


Telegram News

Date: |

In the “Bear Market Screaming Therapy Group” on Telegram, members are only allowed to post voice notes of themselves screaming. Anything else will result in an instant ban from the group, which currently has about 75 members. Telegram iOS app: In the “Chats” tab, click the new message icon in the right upper corner. Select “New Channel.” The best encrypted messaging apps To upload a logo, click the Menu icon and select “Manage Channel.” In a new window, hit the Camera icon. Hashtags are a fast way to find the correct information on social media. To put your content out there, be sure to add hashtags to each post. We have two intelligent tips to give you:
from us


Telegram Machine learning books and papers
FROM American