𝗣𝗮𝗽𝗲𝗿 𝗔𝗰𝗰𝗲𝗽𝘁𝗮𝗻𝗰𝗲 𝗔𝗻𝗻𝗼𝘂𝗻𝗰𝗲𝗺𝗲𝗻𝘁: Diagnosing Evidence Utilization in Multimodal Document Question Answering has been accepted at ACM SIGKDD 2026

The paper titled “Diagnosing Evidence Utilization in Multimodal Document Question Answering” has been accepted to the ACM SIGKDD 2026 Research Track (an A* venue).

Authors: Debolena Basak, Digbalay Bose, Koustava Goswami, Maunendra Sankar Desarkar

Authors’ Affiliations:

Debolena Basak: Dept. of Artificial Intelligence, IIT Hyderabad (this work was done during an internship at Adobe Research, Bangalore)
Digbalay Bose: Adobe Research, Bangalore
Koustava Goswami: Adobe Research, San Jose
Maunendra Sankar Desarkar: Dept. of CSE, IIT Hyderabad

👏 Congratulations to all the authors!

🔍 Key Highlight / Summary: This paper conducts a comprehensive diagnosis of how effectively 7 popular Multimodal Large Language Models (MLLMs) use relevant evidence for document question answering across text, image, table, chart, and cross-modal inputs.

The results reveal a strong reliance on text-based evidence, with notably weaker performance on image-only and cross-evidence inputs, a gap that supervised fine-tuning fails to close consistently. Attention-based analysis further shows that low attention to image tokens contributes to poor utilization of image evidence, highlighting a key limitation of current MLLMs.