About me

I am a fourth-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. I am fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - present) and Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in 2020.

I have also interned at FAIR at Meta (2023) and ASAPP (2022).

My research interests include speech processing, speech recognition, spoken language processing, and foundation models. Recently, I have been particularly interested in developing effective and efficient open speech foundation models for a variety of speech tasks. During my Ph.D., I have worked on several major projects; a selection of first-authored papers appears under Select publications below.

Updates

  • Jan 2024: We are hosting a special session at INTERSPEECH 2024 - Spoken Language Models for Universal Speech Processing (Official Site)
  • Jan 2024: I made a webpage for our Open Whisper-style Speech Models (OWSM)
  • Dec 2023: I will join NVIDIA in Santa Clara as an AI Research Intern in Summer 2024
  • Dec 2023: 3 papers were accepted at ICASSP 2024
  • Sep 2023: 2 papers (1 first-authored) were accepted at IEEE ASRU 2023
  • Jun 2023: 3 papers (2 first-authored) were recognized among the top 3% of all papers accepted at ICASSP 2023

Select publications

An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong
Preprint, 2024
[arxiv]


MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong
Preprint, 2024
[arxiv]


OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Preprint, 2024
[arxiv]


OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
Preprint, 2024
[webpage] [demo] [arxiv] [code]


Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2023
[demo] [paper] [arxiv] [code]


DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2023
[paper] [arxiv] [code]


A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe
Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2023
[paper] [arxiv] [code]


Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe
Proceedings of the 48th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
Top 3% of all papers accepted
[paper] [arxiv] [certificate]


I3D: Transformer Architectures with Input-Dependent Dynamic Depth for Speech Recognition

Yifan Peng, Jaesong Lee, Shinji Watanabe
Proceedings of the 48th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
Top 3% of all papers accepted
[paper] [arxiv] [certificate]


A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

Yifan Peng*, Siddhant Arora*, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe
Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2022
[paper] [arxiv]


Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe
Proceedings of the 39th International Conference on Machine Learning (ICML), 2022 (21.9% acceptance rate)
[poster] [video] [slides] [paper] [arxiv] [code]


Microcalcification Localization and Cluster Detection Using Unsupervised Convolutional Autoencoders and Structural Similarity Index

Yifan Peng, Rui Hou, Yinhao Ren, Lars J. Grimm, Jeffrey R. Marks, E. Shelley Hwang, Joseph Y. Lo
Proceedings of SPIE Medical Imaging 2020: Computer-Aided Diagnosis
Robert F. Wagner Best Student Paper Award Finalist
[award] [paper]