Yifan Peng
Research Scientist, NVIDIA

I am a Research Scientist on the NVIDIA NeMo Speech AI team. I received my Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2025, where I was fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - May 2025) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). Prior to that, I received my bachelor’s degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2020. During my Ph.D., I interned at NVIDIA NeMo (Summer 2024), Meta FAIR (Summer 2023), and ASAPP (Summer 2022), where I conducted research on speech language models and speech recognition.
I am interested in building open multimodal foundation models for speech and language processing. My recent focus at NVIDIA has been on multimodal large language models (LLMs) and full-duplex speech-to-speech dialog systems. I have published papers at top-tier ML/AI/NLP/speech conferences, including ICML, ICLR, ACL, EMNLP, NAACL, AAAI, ICASSP, and INTERSPEECH. I received the INTERSPEECH 2025 Best Student Paper Award (first-authored), the EMNLP 2024 Best Paper Award, the IEEE SLT 2024 Best Paper Award, and ICASSP 2023 Top 3% Paper Recognition (two first-authored and one co-authored), and I was a finalist for the SPIE Medical Imaging 2020 Best Student Paper Award (first-authored).
At CMU WAVLab, I led the Open Whisper-style Speech Models (OWSM) project, developing the first large-scale, fully open speech foundation model from academia. I am also a core contributor to the widely used speech processing toolkit ESPnet. My Ph.D. thesis is titled “Towards Effective and Efficient Open Speech Foundation Models”. Specifically, I was the primary contributor to the following projects:
- Novel speech encoder architectures: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23)
- Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
- Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24), OWSM v4 (INTERSPEECH’25 Best Student Paper)
- Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender (NAACL’25), SLM Survey
News
Mar 26, 2025 | I defended my Ph.D. thesis at CMU.
Select Publications
- [Thesis] [Foundation Model] Towards Effective and Efficient Open Speech Foundation Models. Carnegie Mellon University, May 2025.
- [NAACL] [Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Apr 2025.
- [ICASSP] [Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
- [SLT] [Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023.