Researchers have introduced TEXT2REWARD, a framework aimed at simplifying the design of reward functions in reinforcement learning (RL). Traditionally, this process has been laborious, relying heavily on domain expertise and incurring high development costs.
TEXT2REWARD leverages large language models (LLMs) to automatically generate dense reward functions. Given a goal described in natural language, the framework produces an executable reward program grounded in a compact representation of the environment. This sets it apart from conventional approaches such as inverse RL. Unlike prior methods that produce sparse reward code, TEXT2REWARD generates dense reward code that is easily interpretable, adaptable to a wide variety of tasks, and designed for iterative refinement based on human feedback.
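To make this concrete, the sketch below shows the kind of staged dense reward code such a system might emit for an instruction like "pick up the cube and lift it." The staging (approach, grasp, lift) and the observation names (ee_pos, cube_pos, gripper_openness) are illustrative assumptions for this sketch, not the paper's actual generated output.

```python
# Illustrative sketch: a dense, shaped reward for "pick up the cube and lift it 0.2 m".
# The environment quantities below (end-effector position, cube position, gripper
# openness) are hypothetical stand-ins for the abstracted environment state that a
# framework like TEXT2REWARD would expose to the LLM.
import numpy as np

def compute_dense_reward(ee_pos, cube_pos, gripper_openness, lift_target=0.2):
    """Return a dense reward instead of a sparse success/failure signal."""
    # Stage 1: approach -- reward shrinking distance between end-effector and cube.
    dist = np.linalg.norm(ee_pos - cube_pos)
    reward = 1.0 - np.tanh(5.0 * dist)

    # Stage 2: grasp -- once close enough, encourage closing the gripper.
    if dist < 0.02:
        reward += 1.0 - gripper_openness

    # Stage 3: lift -- reward progress toward the target lift height.
    lift_progress = np.clip(cube_pos[2] / lift_target, 0.0, 1.0)
    reward += lift_progress
    return float(reward)
```

Because the reward is ordinary, readable code rather than a learned black-box model, a practitioner can inspect it, adjust thresholds, or ask the LLM to revise it.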
In evaluations, TEXT2REWARD was tested on the robotic manipulation benchmarks MANISKILL2 and METAWORLD, as well as two locomotion environments in MUJOCO. In 13 out of 17 manipulation tasks, policies trained with TEXT2REWARD-generated reward code matched or surpassed the performance of policies trained with expert-designed rewards. In locomotion, the framework learned six novel behaviors with a success rate of over 94%. Furthermore, policies trained in simulation were successfully deployed on a real robot, demonstrating real-world applicability.
A standout feature of TEXT2REWARD is its capacity for iterative improvement. Recognizing the ambiguity inherent in natural language and the potential failure modes of RL training, the system solicits human feedback after training and uses it to refine the reward functions, keeping them aligned with human intentions and preferences. A minimal sketch of this loop follows.
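The sketch below outlines one plausible shape for this human-in-the-loop cycle. The helpers query_llm, train_policy, and collect_feedback are hypothetical stubs standing in for the LLM call, an RL training run, and a human critique of rollouts; they are not the paper's actual API.

```python
from typing import Optional

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns reward-function source code."""
    return "def compute_dense_reward(obs): return 0.0  # stub"

def train_policy(reward_code: str):
    """Placeholder for an RL training run (e.g., PPO or SAC) using the reward code."""
    return object()

def collect_feedback(policy) -> Optional[str]:
    """Placeholder for a human reviewing rollouts; None means the behavior looks correct."""
    return None

def refine_reward(instruction: str, env_description: str, rounds: int = 3) -> str:
    prompt = (f"Environment:\n{env_description}\n\nGoal: {instruction}\n"
              "Write a dense reward function in Python.")
    reward_code = query_llm(prompt)
    for _ in range(rounds):
        policy = train_policy(reward_code)    # train with the current reward code
        feedback = collect_feedback(policy)   # human critiques the resulting behavior
        if feedback is None:                  # behavior matches intent; stop early
            break
        # Append the critique so the LLM can revise its previous attempt.
        prompt += (f"\n\nPrevious reward code:\n{reward_code}\n"
                   f"Human feedback: {feedback}\nRevise the reward function.")
        reward_code = query_llm(prompt)
    return reward_code
```

The key design choice this illustrates is that feedback is expressed in natural language and folded back into the prompt, so the reward code, not the policy, is what gets corrected.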
Across these tests, RL policies trained with TEXT2REWARD's generated code matched or outperformed those trained with human-designed rewards, hinting at the potential of LLMs in this domain. The framework's adaptability was further highlighted by its performance on locomotion tasks and its real-world deployment on a Franka Panda robot arm.
However, TEXT2REWARD is not without its challenges. A manual error analysis revealed an error rate of around 10%, with a significant portion stemming from syntax errors or shape mismatches in the generated code. Despite this, the results are promising and underscore the potential of LLMs in the realm of RL.
In conclusion, TEXT2REWARD represents a significant step forward in reinforcement learning. By harnessing the power of large language models, it offers an innovative solution to the long-standing challenge of reward function design, and its ability to iterate and refine based on human feedback keeps it adaptable and relevant in real-world scenarios. As the intersection of reinforcement learning and code generation continues to evolve, TEXT2REWARD stands as a testament to the potential of this field.
Check full paper here.