Multimodal Retrieval for Image Search and Video Moment Localization
Speaker: Dr. Manish Gupta
Host Faculty: Dr. Sandipan Dandapat
Date: Jan 12, 2026
Time: 10:00 AM
Venue: EE Seminar Hall
Abstract: In this talk, I will present two multimodal retrieval systems that address challenging computer vision problems: image search and video moment localization. I will introduce novel frameworks that leverage diverse input modalities (including text, sketches, and video) to interpret complex user intent and context. I will begin with Composite Sketch+Text Based Image Retrieval, a new paradigm for image search that uses hand-drawn sketches to capture hard-to-name objects and text to describe attributes or interactions that are difficult to sketch. I will then move to the temporal domain with Video-to-Video Moment Retrieval, where a query video is used to precisely localize a semantically corresponding event within a longer target video. Together, these works demonstrate a unified vision: advanced multimodal alignment models are essential for enabling robust, fine-grained retrieval across images and videos, especially when user intent is nuanced, composite, or hard to express through any single modality.
Bio: Manish Gupta is a Principal Applied Researcher at Microsoft, Hyderabad, India. He is also an Adjunct Faculty member at IIIT Hyderabad and a visiting faculty member at ISB Hyderabad. He received his master's degree in Computer Science from IIT Bombay in 2007 and his Ph.D. from UIUC in 2013. His research interests are in the areas of deep learning, natural language processing, and web mining. He has published more than 200 research papers and co-authored two books.