Paper Acceptance Announcement: Diagnosing Evidence Utilization in Multimodal Document Question Answering has been accepted at ACM SIGKDD 2026
The paper titled "Diagnosing Evidence Utilization in Multimodal Document Question Answering" has been accepted at the ACM SIGKDD 2026 Research Track (A* venue).
Authors: Debolena Basak, Digbalay Bose, Koustava Goswami, Maunendra Sankar Desarkar
Authors' Affiliations:
- Debolena Basak: Dept. of Artificial Intelligence, IIT Hyderabad (this work was done during an internship at Adobe Research, Bangalore)
- Digbalay Bose: Adobe Research, Bangalore
- Koustava Goswami: Adobe Research, San Jose
- Maunendra Sankar Desarkar: Dept. of CSE, IIT Hyderabad
Congratulations to all the authors!
Key Highlight / Summary: This paper conducts a comprehensive diagnosis of how effectively seven popular Multimodal Large Language Models (MLLMs) use relevant evidence for document question answering across text, image, table, chart, and cross-modal inputs.
The results reveal a strong reliance on text-based evidence, with notably weaker performance on image-only and cross-evidence inputs, a gap that supervised fine-tuning also fails to close consistently. Attention-based analysis further shows that low attention to image tokens contributes to poor utilization of image evidence, highlighting key limitations of current MLLMs.