Yifan Peng
Ph.D., Carnegie Mellon University

I received my Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2025, where I was fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - May 2025) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). Prior to that, I received my bachelor’s degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2020.
In Summer 2024, I was an AI research intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a research scientist intern at Meta AI (FAIR), where I worked on speech language models for voice-preserved textless speech-to-speech translation. In Summer 2022, I was a speech recognition intern at ASAPP, working on speech model compression.
My research area is speech and language processing. My Ph.D. thesis is titled “Towards Effective and Efficient Open Speech Foundation Models”. Most of my work has been open-sourced in ESPnet, a widely used speech processing toolkit. During my Ph.D. at CMU WAVLab, I led the Open Whisper-style Speech Models (OWSM) project, developing the first large-scale, fully open speech foundation model from academia. I am now interested in spoken language models.
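As a rough illustration of how the OWSM models released through ESPnet can be used, here is a minimal inference sketch with the espnet2 speech-to-text API. The model tag (espnet/owsm_v3.1_ebf) and keyword arguments follow the pattern on the OWSM model cards and are assumptions here, so please check the current ESPnet documentation for the exact interface.

```python
# Minimal OWSM inference sketch, assuming `pip install espnet espnet_model_zoo`.
# Model tag and arguments follow the OWSM model cards and may change over time.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",  # assumed OWSM v3.1 model tag on Hugging Face
    device="cpu",
    beam_size=5,
    lang_sym="<eng>",        # language token: English
    task_sym="<asr>",        # task token: automatic speech recognition
)

speech, rate = sf.read("speech.wav")  # 16 kHz mono audio expected
text, *_ = s2t(speech)[0]             # best hypothesis is (text, tokens, ...)
print(text)
```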
I have published first-authored papers at top-tier AI/speech conferences such as ICML, ACL, NAACL, ICASSP, and INTERSPEECH. Several projects I was involved in received notable recognition, including the EMNLP 2024 Best Paper Award, the IEEE SLT 2024 Best Paper Award, ICASSP 2023 Top 3% Paper Recognition (3 papers), and being a finalist for the SPIE Medical Imaging 2020 Best Student Paper Award.
I have been the primary contributor to several projects:
- Speech encoder architecture design: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23); see the sketch after this list
- Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
- Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24), OWSM v4 (INTERSPEECH’25)
- Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender (NAACL’25), SLM Survey
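To make the encoder architecture work above concrete, below is a minimal PyTorch sketch of the two-branch idea behind Branchformer: a global self-attention branch and a local cgMLP (convolutional gating) branch run in parallel and are merged, here by concatenation plus a linear projection (E-Branchformer enhances this merge with an additional depthwise convolution). All module names and sizes are illustrative, not the ESPnet implementation, and details such as relative positional attention, macaron feed-forward layers, and dropout are omitted.

```python
import torch
import torch.nn as nn


class BranchformerBlockSketch(nn.Module):
    """Simplified two-branch block in the spirit of Branchformer (illustrative only)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 d_hidden: int = 1024, kernel: int = 31):
        super().__init__()
        # Global branch: multi-head self-attention.
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Local branch: cgMLP-style MLP with convolutional gating.
        self.norm_mlp = nn.LayerNorm(d_model)
        self.channel_proj = nn.Linear(d_model, d_hidden)
        self.gate_norm = nn.LayerNorm(d_hidden // 2)
        self.depthwise = nn.Conv1d(d_hidden // 2, d_hidden // 2, kernel,
                                   padding=kernel // 2, groups=d_hidden // 2)
        self.out_proj = nn.Linear(d_hidden // 2, d_model)
        # Merge: concatenation + linear projection (Branchformer-style).
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        g = self.norm_attn(x)
        g, _ = self.attn(g, g, g, need_weights=False)       # global context
        h = torch.nn.functional.gelu(self.channel_proj(self.norm_mlp(x)))
        a, b = h.chunk(2, dim=-1)                           # split channels
        b = self.depthwise(self.gate_norm(b).transpose(1, 2)).transpose(1, 2)
        local = self.out_proj(a * b)                        # gated local context
        return x + self.merge(torch.cat([g, local], dim=-1))


# Usage: one block applied to a dummy feature sequence.
y = BranchformerBlockSketch()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])
```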
News
Mar 26, 2025 | I defended my Ph.D. thesis at CMU.
Selected Publications
- [Thesis / Foundation Model] Towards Effective and Efficient Open Speech Foundation Models. Carnegie Mellon University, May 2025.
- [NAACL / Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Apr 2025.
- [ICASSP / Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
- [SLT / Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023.