Publications | Yifan Peng

I have published papers in several research areas:

Speech Foundation Model (SFM)
Speech Model Architecture
Efficient Speech Models
Speech Applications
- Automatic Speech Recognition (ASR)
- Speech Translation (ST)
- Spoken Language Understanding (SLU)

Please check my Google Scholar or Semantic Scholar page for more information.

2025

ASRU
Foundation Model

Open Fully-duplex Voice Agent with Speech-to-Speech Language Model

Zhehuai Chen, Edresson Casanova, Chen Chen, Kevin Hu, Ankita Pasad, Elena Rastorgueva, Seelan Lakshmi Narasimhan, Slyne Deng, Ehsan Hosseini Asl, Piotr Zelasko, Valentin Mendelev, Subhankar Ghosh, Yifan Peng, Jason Li, Jagadeesh Balam, Vitaly Lavrukhin, and Boris Ginsburg

In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop: Demo (ASRU Demo), Dec 2025
ASRU
Foundation Model

Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, and Shinji Watanabe

In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2025
TMLR
Foundation Model

On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora^*, Kai-Wei Chang^*, Chung-Ming Chien^*, Yifan Peng^*, Haibin Wu^*, Yossi Adi, Emmanuel Dupoux, Hung-yi Lee, Karen Livescu, and Shinji Watanabe

Transactions on Machine Learning Research (TMLR), Sep 2025

arXiv PDF
INTERSPEECH
Foundation Model

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH) (Best Student Paper Award), Aug 2025

Awarded arXiv PDF Code Website

ISCA Award for Best Student Paper at INTERSPEECH 2025
INTERSPEECH
ASR

DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition

Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

arXiv PDF
INTERSPEECH
Foundation Model

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, and Boris Ginsburg

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

arXiv PDF
INTERSPEECH
Foundation Model

OpusLM: A Family of Open Unified Speech Language Models

Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Chao-Han Huck Yang, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025
INTERSPEECH
Foundation Model

Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR

Carlos Ferreira Carvalho, Jinchuan Tian, William Chen, Yifan Peng, Alberto Abad, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025
INTERSPEECH
Foundation Model

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC

Qingzheng Wang, Jiancheng Sun, Yifan Peng, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

arXiv PDF
ICML
Foundation Model

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, and Shinji Watanabe

In Proceedings of the International Conference on Machine Learning (ICML), Jul 2025

arXiv
Thesis
Foundation Model

Towards Effective and Efficient Open Speech Foundation Models

Yifan Peng

Carnegie Mellon University, May 2025

Website
NAACL Demo
Foundation Model

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, and Shinji Watanabe

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: System Demonstrations (NAACL Demo), Apr 2025

arXiv PDF Website
NAACL Demo
Foundation Model

ESPnet-SpeechLM: An Open Speech Language Model Toolkit

Jinchuan Tian, Jiatong Shi, William Chen, Siddhant Arora, Yoshiki Masuyama, Takashi Maekaku, Yihan Wu, Junyi Peng, Shikhar Bharadwaj, Yiwen Zhao, Samuele Cornell, Yifan Peng, Xiang Yue, Chao-Han Huck Yang, Graham Neubig, and Shinji Watanabe

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: System Demonstrations (NAACL Demo), Apr 2025

arXiv
NAACL
Foundation Model

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Yifan Peng^*, Krishna C. Puvvada^*, Zhehuai Chen^*, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, and Boris Ginsburg

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Apr 2025

arXiv PDF Code Poster Website
ICLR
Foundation Model

Context-aware Dynamic Pruning for Speech Foundation Models

Masao Someki, Yifan Peng, Siddhant Arora, Markus Müller, Athanasios Mouchtaris, Grant Strimel, Jing Liu, and Shinji Watanabe

In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Apr 2025

PDF Website
AAAI
ASR

Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, and Shinji Watanabe

In Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI), Mar 2025

PDF Website
TASLP
ASR

Joint Beam Search Integrating CTC, Attention, and Transducer Decoders

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, and Shinji Watanabe

IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), Jan 2025

PDF

2024

SLT
ASR

Contextualized Automatic Speech Recognition with Dynamic Vocabulary

Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, and Shinji Watanabe

In Proceedings of the IEEE Spoken Language Technology Workshop (SLT) (Best Paper Award), Dec 2024

Awarded PDF

Best Paper Award
SLT
Others

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, and Shinji Watanabe

In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Dec 2024

PDF
SLT
ASR

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, and Shinji Watanabe

In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Dec 2024

PDF
EMNLP
Foundation Model

Towards Robust Speech Representation Learning for Thousands of Languages

William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, and Shinji Watanabe

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (Best Paper Award), Nov 2024

Awarded PDF

Best Paper Award
INTERSPEECH
Foundation Model

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

arXiv PDF Code Poster Website
INTERSPEECH
ASR

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Muhammad Shakeel, Yui Sudo, Yifan Peng, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

PDF
INTERSPEECH
Foundation Model

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

PDF
INTERSPEECH
Architecture

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

Darshan Prabhu, Yifan Peng, Preethi Jyothi, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

PDF
ACL
Foundation Model

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Aug 2024

Abs arXiv PDF Code Poster Website

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models.
NAACL
SLU

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan S. Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, and Shinji Watanabe

In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Jun 2024

PDF
ICASSPW
ASR

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Muhammad Shakeel, Yui Sudo, Yifan Peng, and Shinji Watanabe

In IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Apr 2024

PDF
ICASSP
Foundation Model

VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks

Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024

PDF
ICASSP
Foundation Model

Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech

Chien-yu Huang, Ke-Han Lu, Shi Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, and Hung-yi Lee

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024

PDF
ICASSP
ASR

Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024

PDF
arXiv
Foundation Model

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, and Hongyu Gong

arXiv, Mar 2024

arXiv PDF
arXiv
Foundation Model

An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, and Hongyu Gong

arXiv, Mar 2024

arXiv PDF
arXiv
Foundation Model

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, and Ruihua Song

arXiv, Jan 2024

PDF

2023

ASRU
Foundation Model

Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data

Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, and Shinji Watanabe

In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2023

arXiv PDF Website
ASRU
Foundation Model

Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning

William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, and Shinji Watanabe

In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2023

PDF
ASR

End-to-end integration of online and offline encoders using auxiliary losses for automatic speech recognition

Muhammad Shakeel, Yui Sudo, Yifan Peng, and Shinji Watanabe

In JSAI Type-2 SIG Technical Reports (人工知能学会第二種研究会資料), Nov 2023

PDF
INTERSPEECH
Efficient Model

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

arXiv PDF Code
INTERSPEECH
Architecture

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

arXiv PDF Code
INTERSPEECH
Foundation Model

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

PDF
INTERSPEECH
SLU

Tensor decomposition for minimization of E2E SLU model toward on-device processing

Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

PDF
INTERSPEECH
ASR

Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training

Yui Sudo, Muhammad Shakeel, Yifan Peng, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

PDF
ACL Demo
ST

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, and Shinji Watanabe

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, Jul 2023

Abs PDF

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) – each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.
IWSLT
ST

CMU’s IWSLT 2023 Simultaneous Speech Translation System

Brian Yan, Jiatong Shi, Soumi Maiti, William Chen, Xinjian Li, Yifan Peng, Siddhant Arora, and Shinji Watanabe

In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Jul 2023

PDF
ICASSP
Efficient Model

I3D: Transformer Architectures with Input-Dependent Dynamic Depth for Speech Recognition

Yifan Peng, Jaesong Lee, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Top 3% of all papers accepted), Jun 2023

Awarded arXiv PDF

Recognized as one of the top 3% of all papers accepted at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
ICASSP
Efficient Model

Structured Pruning of Self-Supervised Pre-Trained Models for Speech Recognition and Understanding

Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Top 3% of all papers accepted), Jun 2023

Awarded arXiv PDF

Recognized as one of the top 3% of all papers accepted at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
ICASSP
SLU

A Study on the Integration of Pipeline and E2E SLU Systems for Spoken Semantic Parsing Toward Stop Quality Challenge

Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

arXiv PDF
ICASSP
ASR

Improving Massively Multilingual ASR with Auxiliary CTC Objectives

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Top 3% of all papers accepted), Jun 2023

Awarded PDF

Recognized as one of the top 3% of all papers accepted at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
ICASSP
SLU

The Pipeline System of ASR and NLU with MLM-based data Augmentation Toward Stop Low-Resource Challenge

Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

PDF
ICASSP
Foundation Model

SpeechLMScore: Evaluating Speech Generation Using Speech Language Model

Soumi Maiti, Yifan Peng, Takaaki Saeki, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

PDF
ICASSP
SLU

E-Branchformer-Based E2E SLU Toward Stop on-Device Challenge

Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023
SLT
SLU

A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

Yifan Peng^*, Siddhant Arora^*, Yosuke Higuchi, Yushi Ueda, Sujay S. Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, and Shinji Watanabe

In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023

arXiv PDF
SLT
Architecture

E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition

Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, and Shinji Watanabe

In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023

PDF

2022

INTERSPEECH
Architecture

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR

Takashi Maekaku, Yuya Fujita, Yifan Peng, and Shinji Watanabe

In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2022

PDF
ICML
Architecture

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe

In Proceedings of the International Conference on Machine Learning (ICML), Jul 2022

Abs PDF Video Code Poster Slides

Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches with or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show various strategies to reduce computation thanks to the two-branch architecture, including the ability to have variable inference complexity in a single trained model. The weights learned for merging branches indicate how local and global dependencies are utilized in different layers, which benefits model designing.
IWSLT
ST

CMU’s IWSLT 2022 Dialect Speech Translation System

Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, and Shinji Watanabe

In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), May 2022

PDF
ICASSP
SLU

ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet

Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay S. Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W. Black, and Shinji Watanabe

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022

PDF

2021

TBME
Others

Anomaly Detection of Calcifications in Mammography Based on 11,000 Negative Cases

Rui Hou, Yifan Peng, Lars J. Grimm, Yinhao Ren, Maciej A. Mazurowski, Jeffrey R. Marks, Lorraine M. King, Carlo C. Maley, Eun-Sil Shelley Hwang, and Joseph Y. Lo

IEEE Transactions on Biomedical Engineering, Nov 2021

Website

2020

SPIE
Others

Microcalcification localization and cluster detection using unsupervised convolutional autoencoders and structural similarity index

Yifan Peng, Rui Hou, Yinhao Ren, Lars J. Grimm, Jeffrey R. Marks, E. Shelley Hwang, and Joseph Y. Lo

In Proceedings of the SPIE Medical Imaging 2020: Computer-Aided Diagnosis (Robert F. Wagner Best Student Paper Award Finalist), May 2020

Awarded HTML

Robert F. Wagner Best Student Paper Award Finalist at SPIE Medical Imaging 2020