Yifan Peng
PhD Candidate, Carnegie Mellon University

I am a final-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, USA. I am fortunate to be advised by Prof. Shinji Watanabe (Sep 2021 - present) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor's degree from the Department of Electronic Engineering at Tsinghua University, Beijing, China, in 2020.
In Summer 2024, I was an AI research intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a research scientist intern at Meta AI (FAIR), working on speech language models for voice-preserved textless speech-to-speech translation. In Summer 2022, I was a speech recognition intern at ASAPP, working on speech model compression.
My research area is speech and language processing. My Ph.D. thesis focuses on developing effective and efficient open speech foundation models. I led the Open Whisper-style Speech Models (OWSM) project at CMU WAVLab, which developed the first large-scale, fully open speech foundation model from academia. I am now interested in integrating speech capabilities into large language models. Most of my work is open-sourced in ESPnet, a widely used speech processing toolkit.
I have published first-authored papers at top-tier AI/speech conferences, including ICML, ACL, NAACL, ICASSP, and INTERSPEECH. Several projects I was involved in received notable recognition, including the EMNLP 2024 Best Paper Award, the IEEE SLT 2024 Best Paper Award, ICASSP 2023 Top 3% Paper Recognition (3 papers), and selection as a finalist for the SPIE Medical Imaging 2020 Best Student Paper Award.
I have been the primary contributor to several major projects:
- Novel speech encoder architecture: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23)
- Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
- Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24)
- Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender (NAACL’25), and more to follow
Select Publications
- [Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL, accepted), Apr 2025.
- [Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
- [Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023.