Yifan Peng

PhD Candidate, Carnegie Mellon University


Now seeking full-time positions in speech and language processing (expected to start in Summer 2025)

I am a final-year Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. I am fortunate to be supervised by Prof. Shinji Watanabe (Sep 2021 - present) and was previously supervised by Prof. Ian Lane (Aug 2020 - Aug 2021; now at UC Santa Cruz). I received my bachelor's degree from the Department of Electronic Engineering at Tsinghua University in 2020.

In Summer 2024, I was an AI Research Intern at NVIDIA NeMo, where I worked on joint speech-text language models. In Summer 2023, I was a Research Scientist Intern at Meta AI (FAIR), where I worked on speech language models for voice-preserving textless speech-to-speech translation. In Summer 2022, I was a Speech Recognition Intern at ASAPP, where I worked on speech model compression.

My research area is speech and language processing. My Ph.D. thesis focuses on developing effective and efficient open speech foundation models. I have led the Open Whisper-style Speech Models (OWSM) project at CMU WAVLab, which developed the first large-scale, fully open speech foundation model from academia. Recently, I have also become interested in integrating speech capabilities into large language models.

Throughout my Ph.D. program, I have published first-authored papers at top-tier ML/NLP/speech conferences, including ICML, ACL, ICASSP, INTERSPEECH, ASRU, and SLT. I am also a contributor to the widely used speech processing toolkit ESPnet. In particular, I have been the primary contributor to several major projects:


News

Nov 14, 2024 :trophy: A co-authored paper received the Best Paper Award at EMNLP 2024
Aug 30, 2024 :scroll: 3 papers are accepted at IEEE SLT 2024
Jun 04, 2024 :scroll: 4 papers (1 first-authored) are accepted at INTERSPEECH 2024
May 16, 2024 :scroll: 1 first-authored paper, OWSM-CTC, is accepted at ACL 2024 (main conference)
May 13, 2024 :man_office_worker: Joining NVIDIA NeMo Speech in Santa Clara as an AI Research Intern
Jan 01, 2024 :sparkles: We are hosting a special session at INTERSPEECH 2024: Spoken Language Models for Universal Speech Processing
Dec 13, 2023 :spider_web: Check out the webpage for our Open Whisper-style Speech Models (OWSM)
Dec 13, 2023 :scroll: 3 papers are accepted at ICASSP 2024
Sep 22, 2023 :scroll: 2 papers (1 first-authored) are accepted at IEEE ASRU 2023
Jun 04, 2023 :trophy: 3 papers (2 first-authored) are recognized among the top 3% of all papers accepted at ICASSP 2023
May 22, 2023 :man_office_worker: Joining Meta AI (FAIR) in Seattle as a Research Scientist Intern
May 17, 2023 :scroll: 5 papers (2 first-authored) are accepted at INTERSPEECH 2023
Feb 17, 2023 :scroll: 4 research papers (2 first-authored) and 3 co-authored challenge papers are accepted at ICASSP 2023
Sep 30, 2022 :scroll: 2 papers (1 first-authored) are accepted at IEEE SLT 2022
Jul 17, 2022 :flight_departure: Attending ICML 2022 in Baltimore, Maryland, USA
May 31, 2022 :man_office_worker: Joining ASAPP remotely as a Speech Recognition Intern
May 15, 2022 :scroll: 1 first-authored paper is accepted at ICML 2022


Selected Publications

  1. EMNLP
    Foundation Model
    Towards Robust Speech Representation Learning for Thousands of Languages
    William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, and Shinji Watanabe
    In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (Best Paper Award), Nov 2024
  2. arXiv
    Foundation Model
    VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
    Yifan Peng*, Krishna C. Puvvada*, Zhehuai Chen*, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, and Boris Ginsburg
    arXiv, Oct 2024
  3. INTERSPEECH
    Foundation Model
    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
    Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, and Shinji Watanabe
    In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024
  4. ACL
    Foundation Model
    OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
    Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe
    In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Aug 2024
  5. ICASSP
    Foundation Model
    VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks
    Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe
    In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024
  6. arXiv
    Foundation Model
    MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
    Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, and Hongyu Gong
    arXiv, Mar 2024
  7. arXiv
    Foundation Model
    An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis
    Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, and Hongyu Gong
    arXiv, Mar 2024
  8. ASRU
    Foundation Model
    Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data
    Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, and Shinji Watanabe
    In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2023
  9. INTERSPEECH
    Efficient Model
    DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
    Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe
    In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023
  10. INTERSPEECH
    Architecture
    A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
    Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, and Shinji Watanabe
    In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023
  11. ICASSP
    Efficient Model
    I3D: Transformer Architectures with Input-Dependent Dynamic Depth for Speech Recognition
    Yifan Peng, Jaesong Lee, and Shinji Watanabe
    In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Top 3% of all papers accepted), Jun 2023
  12. ICASSP
    Efficient Model
    Structured Pruning of Self-Supervised Pre-Trained Models for Speech Recognition and Understanding
    Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, and Shinji Watanabe
    In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Top 3% of all papers accepted), Jun 2023
  13. SLT
    SLU
    A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding
    Yifan Peng*, Siddhant Arora*, Yosuke Higuchi, Yushi Ueda, Sujay S. Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, and Shinji Watanabe
    In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023
  14. SLT
    Architecture
    E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
    Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, and Shinji Watanabe
    In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023
  15. ICML
    Architecture
    Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
    Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe
    In Proceedings of the International Conference on Machine Learning (ICML), Jul 2022
  16. SPIE
    Others
    Microcalcification localization and cluster detection using unsupervised convolutional autoencoders and structural similarity index
    Yifan Peng, Rui Hou, Yinhao Ren, Lars J. Grimm, Jeffrey R. Marks, E. Shelley Hwang, and Joseph Y. Lo
    In Proceedings of the SPIE Medical Imaging 2020: Computer-Aided Diagnosis (Robert F. Wagner Best Student Paper Award Finalist), May 2020