Yifan Peng
Research Scientist, NVIDIA

I am a Research Scientist on the NVIDIA NeMo Speech AI team. I received my Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2025, where I was fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - May 2025) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). Prior to that, I received my bachelor’s degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2020. During my Ph.D., I interned at NVIDIA NeMo (Summer 2024), Meta FAIR (Summer 2023), and ASAPP (Summer 2022), where I conducted research on speech language models and speech recognition.
I am interested in building open multimodal foundation models for speech and language processing. My recent focus at NVIDIA has been on multimodal large language models (LLMs) and full-duplex speech-to-speech dialog systems. I have published papers at top-tier ML/AI/NLP/speech conferences, including ICML, ICLR, ACL, EMNLP, NAACL, AAAI, ICASSP, and INTERSPEECH. I received the INTERSPEECH 2025 Best Student Paper Award (first-authored), the EMNLP 2024 Best Paper Award, the IEEE SLT 2024 Best Paper Award, and ICASSP 2023 Top 3% Paper Recognition (two first-authored and one co-authored), and I was a finalist for the SPIE Medical Imaging 2020 Best Student Paper Award (first-authored).
At CMU WAVLab, I led the Open Whisper-style Speech Models (OWSM) project, developing the first large-scale, fully open speech foundation model from academia. I am also a core contributor to the widely used speech processing toolkit ESPnet. My Ph.D. thesis is titled “Towards Effective and Efficient Open Speech Foundation Models”. Specifically, I was the primary contributor to the following projects:
- Novel speech encoder architectures: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23)
- Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
- Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24), OWSM v4 (INTERSPEECH’25 Best Student Paper)
- Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender (NAACL’25), SLM Survey
News
Mar 26, 2025 | I defended my Ph.D. thesis at CMU.
Select Publications
- [Thesis] [Foundation Model] Towards Effective and Efficient Open Speech Foundation Models. Carnegie Mellon University, May 2025.
- [NAACL] [Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Apr 2025.
- [ICASSP] [Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
- [SLT] [Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023.