Publications
Publications in reverse chronological order.
I have published papers in several research areas:
- Speech Foundation Model (SFM)
- Speech Model Architecture
- Efficient Speech Models
- Speech Applications
  - Automatic Speech Recognition (ASR)
  - Speech Translation (ST)
  - Spoken Language Understanding (SLU)
Please check my Google Scholar or Semantic Scholar page for more information.
2025
- [ASRU] [Foundation Model] Open Fully-duplex Voice Agent with Speech-to-Speech Language Model. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop: Demo (ASRU Demo) (accepted), Dec 2025
- [ASRU] [Foundation Model] Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (accepted), Dec 2025
- [INTERSPEECH] [Foundation Model] OpusLM: A Family of Open Unified Speech Language Models. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025
- [INTERSPEECH] [Foundation Model] Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025
- [ICML] [Foundation Model] OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models. In Proceedings of the International Conference on Machine Learning (ICML), Jul 2025
- [Thesis] [Foundation Model] Towards Effective and Efficient Open Speech Foundation Models. Carnegie Mellon University, May 2025
- [NAACL Demo] [Foundation Model] ESPnet-SpeechLM: An Open Speech Language Model Toolkit. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: System Demonstrations (NAACL Demo), Apr 2025
- [NAACL] [Foundation Model] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Apr 2025
- [TASLP] [ASR] Joint Beam Search Integrating CTC, Attention, and Transducer Decoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), Jan 2025
2024
- [SLT] [Others] ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Dec 2024
- [SLT] [ASR] Robust Audiovisual Speech Recognition Models with Mixture-of-Experts. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Dec 2024
- [INTERSPEECH] [ASR] Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024
- [INTERSPEECH] [Foundation Model] On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024
- [INTERSPEECH] [Architecture] Multi-Convformer: Extending Conformer with Multiple Convolution Kernels. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024
- [NAACL] [SLU] UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Jun 2024
- [ICASSPW] [ASR] Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation. In IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Apr 2024
- [ICASSP] [Foundation Model] VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024
- [ICASSP] [Foundation Model] Dynamic-SUPERB: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024
- [ICASSP] [ASR] Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024
- [arXiv] [Foundation Model]
2023
- [ASRU] [Foundation Model] Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2023
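- [ASR] End-to-end integration of online and offline encoders using auxiliary losses for automatic speech recognition. In Materials of the Japanese Society for Artificial Intelligence (JSAI) Type-2 Special Interest Group Meetings (人工知能学会第二種研究会資料), Nov 2023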
- [INTERSPEECH] [Foundation Model] Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023
- [INTERSPEECH] [SLU] Tensor decomposition for minimization of E2E SLU model toward on-device processing. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023
- [INTERSPEECH] [ASR] Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023
- [IWSLT] [ST] CMU’s IWSLT 2023 Simultaneous Speech Translation System. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Jul 2023
- [ICASSP] [SLU] The Pipeline System of ASR and NLU with MLM-based Data Augmentation Toward STOP Low-Resource Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023
- [ICASSP] [Foundation Model] SpeechLMScore: Evaluating Speech Generation Using Speech Language Model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023
- [ICASSP] [SLU] E-Branchformer-Based E2E SLU Toward STOP On-Device Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023
- [SLT] [Architecture] E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Jan 2023
2022
- [INTERSPEECH] [Architecture] Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2022
- [IWSLT] [ST] CMU’s IWSLT 2022 Dialect Speech Translation System. In International Workshop on Spoken Language Translation (IWSLT), May 2022
- [ICASSP] [SLU] ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022
2021
- [TBME] [Others] Anomaly Detection of Calcifications in Mammography Based on 11,000 Negative Cases. IEEE Transactions on Biomedical Engineering (TBME), Nov 2021