Abstract: Existing reconstruction-based self-supervised speech representation learning methods are trained by restoring and rebuilding speech frames, but the phoneme category information contained in those frames is underutilized. Combining self-supervised learning with noisy student training, a self-supervised speech representation learning method based on clustering and retraining is proposed. Firstly, starting from an initial self-supervised speech representation model (the teacher model), pseudo-labels reflecting phoneme class information are obtained via unsupervised clustering of its frame-level representations. Secondly, the pseudo-label prediction task is combined with the original masked-frame reconstruction task to retrain the speech representation model (the student model). Finally, the new student model is taken as the new teacher model, and the pseudo-labels and the representation model are continually optimized by iterating the whole clustering and retraining process. Experimental results show that the speech representation model obtained after clustering and retraining achieves better performance on downstream phoneme recognition and speaker recognition tasks.
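To make the clustering-and-retraining loop concrete, the following is a minimal Python sketch, not the paper's implementation. All helper names are hypothetical assumptions (teacher.encode, train_student, the cluster count k, the round count, and the loss weight alpha); it uses k-means++ for the unsupervised clustering step and combines an L1 masked-frame reconstruction term with a cross-entropy pseudo-label term, as the abstract describes.

```python
# Illustrative sketch only; hypothetical pieces not taken from the paper:
# teacher.encode(), train_student(), k, rounds, and the weight alpha.
import numpy as np
import torch.nn.functional as F
from sklearn.cluster import KMeans

def pseudo_labels(teacher, utterances, k=64):
    """Cluster frame-level teacher representations into k pseudo phoneme classes."""
    feats = [teacher.encode(u) for u in utterances]    # per-utterance [T_i, D] arrays
    km = KMeans(n_clusters=k, init="k-means++", n_init=10)
    km.fit(np.concatenate(feats))                      # fit on all frames pooled
    return [km.predict(f) for f in feats]              # one cluster id per frame

def student_loss(recon, target, logits, labels, mask, alpha=0.5):
    """Masked-frame reconstruction plus pseudo-label prediction.

    recon, target: [T, D] reconstructed vs. original frames; mask: [T] bool;
    logits: [T, k] class scores; labels: [T] cluster ids (as a LongTensor);
    alpha: assumed weight balancing the two tasks.
    """
    recon_loss = F.l1_loss(recon[mask], target[mask])  # original reconstruction task
    cls_loss = F.cross_entropy(logits, labels)         # added pseudo-label task
    return recon_loss + alpha * cls_loss

def cluster_and_retrain(teacher, corpus, train_student, rounds=3):
    """Iterate: cluster teacher features -> pseudo-labels -> retrain a student."""
    for _ in range(rounds):
        labels = pseudo_labels(teacher, corpus)
        student = train_student(corpus, labels, loss_fn=student_loss)
        teacher = student                              # new student becomes new teacher
    return teacher
```

Each round replaces the teacher with the freshly trained student, so the cluster assignments, and hence the pseudo-labels, are refined as the representations improve.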