Researchers from INESC-ID, Lisbon; the University of Lisbon; Carnegie Mellon University, Pittsburgh; and Phrase, Pittsburgh, have developed a novel approach to address the lack of multilingual data and open-source multilingual dialogue systems, a key obstacle to building robust multilingual open-domain dialogue evaluation metrics. The team leveraged a multilingual pretrained encoder-based language model and augmented existing English dialogue data using Machine Translation (MT). Their findings indicate that simply finetuning a pretrained multilingual encoder model on translated data does not outperform the existing baseline. Instead, a more effective approach is to carefully curate the translated data using MT Quality Estimation (QE) metrics and exclude low-quality translations.
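As an illustration of QE-based curation, the sketch below scores source-translation pairs with a reference-free COMET QE model. The `unbabel-comet` package, the `wmt20-comet-qe-da` checkpoint, and the example sentence pairs are assumptions made for this sketch; the summary does not state which QE metric the authors used.

```python
# Minimal sketch: score MT outputs with a reference-free COMET QE model
# (assumes the `unbabel-comet` package, v2.x; GPU is optional).
from comet import download_model, load_from_checkpoint

# Hypothetical source/translation pairs: English dialogue data
# machine-translated into a target language (German here).
data = [
    {"src": "How was your weekend?",
     "mt": "Wie war dein Wochenende?"},
    {"src": "I went hiking with friends.",
     "mt": "Ich bin mit Freunden wandern gegangen."},
]

# wmt20-comet-qe-da is a commonly available reference-free QE model;
# the paper's exact checkpoint is not specified in this summary.
model_path = download_model("Unbabel/wmt20-comet-qe-da")
qe_model = load_from_checkpoint(model_path)

# One QE score per segment; higher means the translation is judged better.
output = qe_model.predict(data, batch_size=8, gpus=0)
for pair, score in zip(data, output.scores):
    print(f"{score:.3f}  {pair['mt']}")
```

Translations scoring below a chosen cutoff would then be dropped before finetuning.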
The study extends the approach of training on MT-generated data to the development of multilingual models for evaluating open-domain dialogue responses. The authors experimented with several workarounds and ultimately proposed using an MT QE model to rank translations and finetuning models with varying amounts of quality-ranked data. The resulting multilingual dialogue evaluation models exhibited strong correlations with human judgements, indicating that multilingual dialogue evaluation metrics can be leveraged without the constraints associated with Large Language Models (LLMs).
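For concreteness, one plausible finetuning setup is sketched below: a multilingual encoder (here XLM-RoBERTa via Hugging Face `transformers`) with a single regression head, trained on context-response pairs from the quality-ranked translated data. The model choice, hyperparameters, label format, and toy data are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: finetune a multilingual encoder as a dialogue-quality regressor
# on QE-filtered translated data (assumed setup, not the paper's own).
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class DialogueEvalDataset(Dataset):
    """Pairs a dialogue context with a candidate response and a quality label."""
    def __init__(self, contexts, responses, labels, tokenizer, max_len=256):
        self.enc = tokenizer(contexts, responses, truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

# Placeholder data; in practice these would be the curated translations.
train_contexts = ["How was your weekend?"]
train_responses = ["It was great, I went hiking."]
train_labels = [0.9]  # quality score in [0, 1]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1, problem_type="regression")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-dialogue-eval",
                           num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=DialogueEvalDataset(train_contexts, train_responses,
                                      train_labels, tokenizer),
)
trainer.train()
```

Training runs of this kind could be repeated with different fractions of the quality-ranked data (e.g., top 25%, 50%, 75%) to study how much filtering helps.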
Additionally, the authors addressed the noise that low-quality translations introduce into the training data. They proposed ranking responses by QE score separately for each target language, which provides a standardized filtering procedure and improves the method's scalability to new languages.
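A minimal sketch of that per-language filtering step follows; the field names (`tgt_lang`, `qe_score`) and the keep fraction are hypothetical, but the idea matches the text: rank translated examples by QE score within each target language and keep only the top portion.

```python
from collections import defaultdict

def filter_by_qe_per_language(examples, keep_fraction=0.75):
    """Rank translated examples by QE score within each target language
    and keep only the top `keep_fraction`. Computing the cutoff per
    language (rather than globally) keeps the filter comparable across
    languages and makes it straightforward to extend to new ones.

    Each example is assumed to be a dict with keys 'tgt_lang' and 'qe_score'.
    """
    by_lang = defaultdict(list)
    for ex in examples:
        by_lang[ex["tgt_lang"]].append(ex)

    kept = []
    for items in by_lang.values():
        items.sort(key=lambda ex: ex["qe_score"], reverse=True)
        cutoff = max(1, int(len(items) * keep_fraction))
        kept.extend(items[:cutoff])
    return kept

# Example: keep the best 75% of translations for each language.
curated = filter_by_qe_per_language([
    {"tgt_lang": "de", "qe_score": 0.82, "mt": "Wie war dein Wochenende?"},
    {"tgt_lang": "de", "qe_score": 0.31, "mt": "Wochenende wie dein war?"},
    {"tgt_lang": "pt", "qe_score": 0.77, "mt": "Como foi o teu fim de semana?"},
])
print(len(curated), "examples kept")
```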
In conclusion, the study marks a significant advance in automatic multilingual dialogue evaluation by demonstrating that filtering out low-quality translations narrows the performance gap with ChatGPT and even surpasses it on select correlation metrics. The authors suggest that future research could evaluate generative model responses in different languages using annotators familiar with the culture associated with each language, enabling a qualitative analysis of how quality perception differs across languages.