Building Multilingual NLP datasets at scale

Title of the Talk: Building Multilingual NLP datasets at scale
Host Faculty: Dr. Sandipan
Speaker: Dr. Anoop Kunchukuttan
Date: 09th April 2026
Time: 11 am - 12 noon
Venue: Online

Abstract: This talk explores the critical challenge of scaling multilingual Natural Language Processing (NLP) to address the significant data skew that currently favors high-resource languages. While deep learning and large language models have achieved remarkable success, extending these benefits to the diverse linguistic landscape of India requires a multi-pronged approach to dataset creation. The session will detail systematic strategies for large-scale data mining from the web, including the extraction of parallel corpora for translation and transliteration. We will also examine the emerging role of synthetic data generation, which leverages existing models to create high-quality instruction-following and conversational datasets. Beyond automation, the presentation emphasizes the necessity of expert-curated benchmarks and "seed" data to ensure model reliability and cultural relevance. Finally, the discussion will cover the extension of these techniques to the multimodal domain, focusing on large-scale speech collection for low-resource Indic languages.

Bio: Dr. Anoop Kunchukuttan is a Principal Applied Researcher at Microsoft India, where his research focuses on multilingual and multimodal language technologies. He currently works with the Speech team and has been a long-time member of the Machine Translation team. He is a co-founder and co-lead of AI4Bharat, a research center at IIT Madras that drives advances and develops large-scale open-source models, datasets, and tools for Indian languages. He received his Ph.D. from the Indian Institute of Technology Bombay. He is broadly interested in natural language processing and machine learning. His research interests include multilingual learning and LLMs, post-training of LLMs, reasoning and evaluation in LLMs, representation learning, NLP for related languages, machine translation, and transliteration. His work has been published in top-tier Natural Language Processing (NLP) conferences such as ACL, EMNLP, NAACL, and AAAI, as well as in journals such as Transactions of the ACL and ACM Computing Surveys. He is passionate about building software and resources for NLP in Indian languages. He actively develops and maintains the Indic NLP Library and the Indic NLP Catalog, and has driven the development of resources such as the IndicTrans MT system, IndicXlit, IndicLLMSuite, IndicBERT, Indic NLP Suite, the BPCC corpus, and the IIT Bombay parallel corpus, among others.