DeepMind recently introduced an algorithm called Reinforced Self-Training (ReST), a technique designed to improve the quality of large language models (LLMs) by aligning their outputs more closely with human preferences while keeping training efficient.
ReST differs from most current alignment methods in how it produces its training data. It first uses the current LLM policy to generate a dataset of candidate outputs (the "Grow" step), then refines the LLM on that dataset using offline reinforcement learning (RL) objectives (the "Improve" step). Whereas many existing approaches rely on online RL from human feedback (RLHF), ReST produces its training dataset offline, which shortens the training cycle and allows the same data to be reused across multiple rounds of fine-tuning.
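The sketch below illustrates how such a Grow/Improve loop could be structured. It is a minimal, illustrative outline rather than DeepMind's implementation: the helpers `sample_from_policy`, `reward_model`, and `fine_tune`, as well as the specific threshold schedule, are assumptions made here for clarity.

```python
# Minimal sketch of a ReST-style Grow/Improve loop (illustrative only).
# `policy`, `reward_model`, `sample_from_policy`, and `fine_tune` are
# hypothetical stand-ins, not APIs from the paper or any library.

def rest_training(policy, reward_model, prompts,
                  num_grow_steps=3, num_improve_steps=4,
                  samples_per_prompt=16, initial_threshold=0.0):
    for _ in range(num_grow_steps):
        # Grow step: sample candidate outputs from the current policy
        # and score each one with the reward model (done offline).
        dataset = []
        for prompt in prompts:
            for output in sample_from_policy(policy, prompt, n=samples_per_prompt):
                score = reward_model(prompt, output)
                dataset.append((prompt, output, score))

        # Improve steps: repeatedly filter the fixed dataset with an
        # increasing reward threshold and fine-tune the policy offline.
        threshold = initial_threshold
        for _ in range(num_improve_steps):
            filtered = [(p, o) for p, o, s in dataset if s >= threshold]
            policy = fine_tune(policy, filtered)  # offline RL / weighted fine-tuning
            threshold += 0.1  # assumed schedule: raise the bar each round
    return policy
```

Raising the filtering threshold on each Improve round means later rounds fine-tune on progressively higher-reward samples, which captures the self-training intuition behind ReST: the model keeps learning from the best of its own outputs without requiring fresh human feedback at every step.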
Although ReST applies to generative learning more broadly, DeepMind's study focused on machine translation. The results are promising: integrating ReST significantly improved translation quality on standard machine translation benchmarks, as measured by both automated metrics and human evaluation.
For those who want to dig into the details of the approach, the full paper is available at arXiv:2308.08998.