Data-Centric Machine Learning for Generalizable Phenotyping from Longitudinal Electronic Health Records
Title of the Talk: Data-Centric Machine Learning for Generalizable Phenotyping from Longitudinal Electronic Health Records
Speaker: Dr. Tushar Mungle
Host Faculty: Dr. Sandipan Dandapat
Date: 5th March 2026
Time: 02:30 pm
Abstract: Building machine learning models that generalize beyond a single institution is a central challenge when working with longitudinal electronic health records (EHRs), due to heterogeneous data, class imbalance, and variability in labeling practices. In this talk, I will present data-centric machine learning approaches for robust and generalizable clinical phenotyping from real-world longitudinal EHR data. I will focus on a multimodal ensemble learning framework trained on data from one institution and externally evaluated on an independent site, demonstrating improved robustness and fairness compared to individual model baselines. These results highlight that while algorithmic strategies can address data distribution shift, scaling evaluation to large multi-center settings is fundamentally constrained by label availability and consistency. Motivated by this bottleneck, I will briefly discuss results from a proof-of-concept study in which large language models (LLMs) are explored as scalable tools for label generation and harmonization under weak supervision. The broader goal of this research is to develop principled, reproducible methodologies for learning from longitudinal, heterogeneous data in real-world settings that can be replicated across domains and healthcare institutions.
Bio: Dr. Tushar Mungle is a clinical informatics researcher focused on applied data science and machine learning for real-world healthcare applications involving longitudinal datasets. He recently completed his postdoctoral training at Stanford University, where he developed generalizable and fair machine learning and large language model-based methods for longitudinal electronic health record analysis and phenotype extraction. He earned his Ph.D. from IIT Kharagpur and has led the design, implementation, and deployment of a decision-support system integrated into the clinical workflow at Tata Medical Centre, Kolkata. He specializes in working with heterogeneous healthcare data from electronic health records, including structured data, unstructured text, imaging, and genomics, to generate clinically actionable insights and build scalable, production-grade healthcare applications.