Researchers in speech processing have introduced a method that combines Large Language Models (LLMs) with traditional acoustic-based speaker diarization systems. The approach exploits contextual cues in human dialogue to improve the accuracy and efficiency with which machines process multi-speaker speech.
Speaker diarization, the task of distinguishing and labeling individual voices within an audio recording, has traditionally relied on acoustic cues alone, and that reliance imposes well-known limitations. The new research addresses them by incorporating lexical information from LLMs during inference, which both sharpens speaker identification and reduces the word error rate attributed to each speaker.
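To make that metric concrete, here is a minimal sketch of how a speaker-attributed word error rate might be computed; the transcript format and helper functions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a speaker-attributed word error rate (illustrative,
# not the paper's implementation). Each transcript maps a speaker label
# to the words assigned to that speaker by diarization.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete ref word
                                   d[j - 1] + 1,     # insert hyp word
                                   prev + (r != h))  # substitute / match
    return d[-1]

def speaker_attributed_wer(ref: dict[str, list[str]],
                           hyp: dict[str, list[str]]) -> float:
    """Errors are counted within each speaker's channel, so words assigned
    to the wrong speaker are penalized, unlike a plain pooled WER."""
    errors = sum(edit_distance(ref[s], hyp.get(s, [])) for s in ref)
    total = sum(len(words) for words in ref.values())
    return errors / total

ref = {"spk0": "how are you".split(), "spk1": "fine thanks".split()}
hyp = {"spk0": "how are you fine".split(), "spk1": "thanks".split()}
print(f"SA-WER: {speaker_attributed_wer(ref, hyp):.2f}")  # 0.40
```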
Historically, multi-speaker speech recognition efforts have focused either on segmenting speaker-specific regions for separate processing or on performing recognition and diarization jointly. Lexical information has long been used, typically through beam search decoding, to improve the accuracy of single-speaker automatic speech recognition (ASR) systems.
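As background, the usual way lexical information enters single-speaker ASR is shallow fusion during beam search: each hypothesis is scored by a weighted sum of acoustic and language model log-probabilities. The sketch below illustrates the idea; the `acoustic_logp` and `lm_logp` callables and the weight value are assumptions for illustration.

```python
import heapq
from typing import Callable

# Minimal shallow-fusion beam search sketch (illustrative). A hypothesis
# is scored as acoustic log-prob + lm_weight * LM log-prob, so the
# language model re-ranks acoustically plausible word sequences.

def beam_search(vocab: list[str],
                acoustic_logp: Callable[[list[str], str], float],
                lm_logp: Callable[[list[str], str], float],
                steps: int, beam_size: int = 4,
                lm_weight: float = 0.3) -> list[str]:
    beams: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(steps):
        candidates = []
        for score, words in beams:
            for w in vocab:
                new_score = (score
                             + acoustic_logp(words, w)         # audio evidence
                             + lm_weight * lm_logp(words, w))  # lexical prior
                candidates.append((new_score, words + [w]))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams[0][1]
```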
The new technique, called 'contextual beam search,' fuses the audio and text modalities to find the most probable word-to-speaker mapping given context from both sources. Notably, the method is flexible: unlike earlier models that required paired audio-text datasets, it allows acoustic-only diarization models to be trained on mixed audio while language models are trained on large, independent text-only corpora. This sidesteps the data-sparsity problem that has long hampered speaker diarization research and broadens the training foundation.
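The core idea can be pictured as extending each beam hypothesis with a (word, speaker) pair and scoring it with both the diarizer's speaker posterior for that word's time span and the LM's probability of the speaker-tagged word history. The sketch below is a simplified reading of contextual beam search; `diar_logp` and `lm_logp` are assumed interfaces, not the paper's actual code.

```python
import heapq
from typing import Callable

# Simplified sketch of contextual beam search over word-speaker pairs.
# For each recognized word, every candidate speaker label is scored by:
#   diar_logp: acoustic speaker posterior for the word's time span
#   lm_logp:   LM probability of the speaker-tagged history, capturing
#              lexical cues about speaker turns ("How are you?" -> turn change)
# Both callables are assumed interfaces, not the paper's API.

Hyp = tuple[float, list[tuple[str, str]]]  # (score, [(word, speaker), ...])

def contextual_beam_search(words: list[str], speakers: list[str],
                           diar_logp: Callable[[int, str], float],
                           lm_logp: Callable[[list[tuple[str, str]], str, str], float],
                           beam_size: int = 8,
                           lm_weight: float = 0.5) -> list[tuple[str, str]]:
    beams: list[Hyp] = [(0.0, [])]
    for t, word in enumerate(words):
        candidates: list[Hyp] = []
        for score, history in beams:
            for spk in speakers:
                new_score = (score
                             + diar_logp(t, spk)                         # acoustic cue
                             + lm_weight * lm_logp(history, word, spk))  # lexical cue
                candidates.append((new_score, history + [(word, spk)]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams[0][1]  # most probable word-to-speaker mapping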
The method is also not limited to the number of speakers the language model was trained on, which improves scalability. Its modular design means the ASR model or the language model can be modified or replaced without disturbing the underlying diarization system; porting the system to another language, for example, requires only swapping the ASR and language models while keeping the acoustic-only diarization model intact, as sketched below. For empirical validation, the researchers used state-of-the-art components, including the ConformerCTC model for ASR and an advanced iteration of the Multi-scale Diarization Decoder (MSDD) for speaker diarization, together with a Large Language Model trained on an extensive text corpus to underscore the potential of LLMs in this setting.
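The modularity claim is easy to picture as a pipeline whose components share only narrow interfaces. The wiring below is hypothetical, with all class and method names assumed for illustration rather than taken from the paper or from NeMo's API.

```python
from typing import Protocol

# Hypothetical modular wiring (all names are assumptions). The diarizer
# sees only audio, so ASR and LM can be swapped per language while the
# acoustic-only diarization model stays untouched.

class ASRModel(Protocol):
    def transcribe(self, audio: bytes) -> list[str]: ...

class LanguageModel(Protocol):
    def logp(self, history: list[str], word: str) -> float: ...

class Diarizer(Protocol):
    def speaker_logp(self, audio: bytes, word_index: int, speaker: str) -> float: ...

class DiarizationPipeline:
    def __init__(self, asr: ASRModel, lm: LanguageModel, diarizer: Diarizer):
        self.asr, self.lm, self.diarizer = asr, lm, diarizer

# Porting to another language: swap ASR and LM, keep the diarizer.
# pipeline_en = DiarizationPipeline(english_asr, english_lm, msdd_diarizer)
# pipeline_es = DiarizationPipeline(spanish_asr, spanish_lm, msdd_diarizer)
```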
Preliminary results support the method's effectiveness: integrating LLMs into the diarization system yielded a 39.8% relative improvement over the established baseline in speaker-attributed word error rate (SA-WER).
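For reference, a relative improvement in SA-WER is measured against the baseline error rate; the numbers below are illustrative placeholders, not the paper's reported values.

```python
# Relative improvement in SA-WER (numbers are illustrative placeholders).
baseline_sa_wer = 0.30   # hypothetical baseline error rate
improved_sa_wer = 0.18   # hypothetical error rate with LLM integration

relative_improvement = (baseline_sa_wer - improved_sa_wer) / baseline_sa_wer
print(f"{relative_improvement:.1%} relative improvement")  # 40.0%
```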
In sum, this approach to speaker diarization, through its integration of LLMs, opens new avenues in speech processing. By drawing on both acoustic and lexical information, it has the potential to markedly improve the precision and efficiency of multi-speaker speech recognition systems. Future work is expected to further refine and unify these mechanisms, pointing toward a new generation of multi-speaker ASR systems.