
Paper Acceptance Announcement: Diagnosing Evidence Utilization in Multimodal Document Question Answering has been accepted at ACM SIGKDD 2026

The paper titled "Diagnosing Evidence Utilization in Multimodal Document Question Answering" has been accepted at the ACM SIGKDD 2026 Research Track (A* venue).

Authors: Debolena Basak, Digbalay Bose, Koustava Goswami, Maunendra Sankar Desarkar

Authors' Affiliations:

  • Debolena Basak: Dept. of Artificial Intelligence, IIT Hyderabad (this work was done during an internship at Adobe Research, Bangalore)
  • Digbalay Bose: Adobe Research, Bangalore
  • Koustava Goswami: Adobe Research, San Jose
  • Maunendra Sankar Desarkar: Dept. of CSE, IIT Hyderabad

Congratulations to all the authors!

Key Highlight / Summary: This paper conducts a comprehensive diagnosis of how effectively seven popular Multimodal Large Language Models (MLLMs) use relevant evidence for document question answering across text, image, table, chart, and cross-modal inputs.

The results reveal a strong reliance on text-based evidence, with notably weaker performance on image-only and cross-evidence inputs, a gap that supervised fine-tuning also fails to close consistently. Attention-based analysis further shows that low attention to image tokens contributes to poor utilisation of image evidence, highlighting a key limitation of current MLLMs.