Yifan Peng
PhD Candidate, Carnegie Mellon University

I am a final-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, USA. I am fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - present) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor's degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2020.
In Summer 2024, I was an AI research intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a research scientist intern at Meta AI (FAIR), working on speech language models for voice-preserving textless speech-to-speech translation. In Summer 2022, I was a speech recognition intern at ASAPP, focusing on speech model compression.
My research area is speech and language processing. My Ph.D. thesis focuses on developing effective and efficient open speech foundation models. I led the Open Whisper-style Speech Models (OWSM) project at CMU WAVLab, which produced the first large-scale, fully open speech foundation model from academia. I am now interested in integrating speech capabilities into large language models. Most of my work is open-sourced in ESPnet, a widely used speech processing toolkit.
I have published first-authored papers at top-tier AI/speech conferences such as ICML, ACL, ICASSP, and INTERSPEECH. Several projects I have been involved in received notable recognition, including the Best Paper Award at SLT 2024, the Best Paper Award at EMNLP 2024, Top 3% Paper Recognition at ICASSP 2023 (3 papers), and Best Student Paper Award Finalist at SPIE Medical Imaging 2020. Specifically, I have been the primary contributor to several major projects:
- Novel speech encoder architecture: Branchformer (ICML’22), E-Branchformer vs Conformer (INTERSPEECH’23)
- Speech model compression: I3D (ICASSP’23 Top 3%), HJ-Pruning (ICASSP’23 Top 3%), DPHuBERT (INTERSPEECH’23)
- Open speech foundation models: OWSM (ASRU’23), OWSM v3.1 (INTERSPEECH’24), OWSM-CTC (ACL’24)
- Speech language models: SpeechLM analysis, MSLM-S2ST, VoiceTextBlender (NAACL’25), and more to follow
Select Publications
- [NAACL] [Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) (accepted), Apr 2025.
- [ICASSP] [Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis, and Speech/Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
- [SLT] [Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023.