A team of researchers from the Institute of Software, Chinese Academy of Sciences; the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; and the School of Artificial Intelligence, University of Posts and Telecommunications, China, has introduced STEAM, a stage-wise framework designed to simulate the interactive behavior of multiple programmers across the stages of a bug's life cycle. The framework aims to enhance the bug-fixing capabilities of large language models (LLMs) by decomposing the bug-fixing task into four distinct stages: bug reporting, bug diagnosis, patch generation, and patch verification. Inspired by traditional bug management practices, STEAM employs the dialogue-based LLM ChatGPT to carry out each stage.
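The paper itself defines the exact prompts and hand-offs, but a minimal sketch of such a stage-wise decomposition might chain one dialogue-LLM call per stage, with each stage's output feeding the next. The `llm_chat` helper and the prompt wording below are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a four-stage, dialogue-driven bug-fixing pipeline.
# `llm_chat` is a hypothetical stand-in for any chat-completion API (e.g., ChatGPT);
# the prompts here are illustrative, not the prompts used in the STEAM paper.

def llm_chat(system: str, user: str) -> str:
    """Placeholder: replace with a real chat-completion call."""
    return f"[{system}] response to: {user[:40]}..."  # canned reply so the sketch runs

def fix_bug(buggy_code: str, failing_test: str) -> dict:
    # Stage 1: bug reporting -- the simulated tester files a detailed report.
    report = llm_chat(
        "You are a software tester.",
        f"Write a bug report for this failing test.\n\nCode:\n{buggy_code}\n\nTest:\n{failing_test}",
    )
    # Stage 2: bug diagnosis -- the simulated developer locates the root cause.
    diagnosis = llm_chat(
        "You are a developer diagnosing a bug.",
        f"Bug report:\n{report}\n\nCode:\n{buggy_code}\n\nExplain the root cause.",
    )
    # Stage 3: patch generation -- the developer proposes a candidate fix.
    patch = llm_chat(
        "You are a developer writing a fix.",
        f"Diagnosis:\n{diagnosis}\n\nCode:\n{buggy_code}\n\nReturn only the patched code.",
    )
    # Stage 4: patch verification -- the simulated reviewer checks the candidate patch.
    verdict = llm_chat(
        "You are a code reviewer.",
        f"Bug report:\n{report}\n\nPatch:\n{patch}\n\nDoes the patch fix the bug? Explain.",
    )
    return {"report": report, "diagnosis": diagnosis, "patch": patch, "verdict": verdict}
```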
Software systems often contain bugs that can lead to substantial losses, including financial impacts and potential risks to human life. Automatic bug fixing has been proposed as a means to expedite the resolution of software bugs and support timely software maintenance. Recent advances in deep learning and generative artificial intelligence have prompted researchers to apply LLMs to a variety of software engineering tasks. However, existing approaches often treat bug fixing as a single-stage process, neglecting the interactive, collaborative way programmers actually resolve software bugs.
STEAM addresses this limitation by simulating the collaborative problem-solving behavior of programmers (i.e., tester, developer, and reviewer) throughout the entire life cycle of a bug. The framework relies on the tester's comprehensive understanding of the bug to file a detailed bug report, which provides the essential information the developer needs to resolve the bug. The developer then diagnoses the bug and generates a candidate patch, and the reviewer provides review feedback to verify the correctness of the generated patch.
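This hand-off between developer and reviewer can be pictured as a simple feedback loop in which reviewer comments drive further patch revisions. The round limit, approval keyword, and role prompts below are assumptions made for illustration, not the protocol described in the paper.

```python
# Illustrative developer/reviewer feedback loop around a candidate patch.
# `llm_chat` is again a hypothetical stand-in for a chat-completion client.

def llm_chat(system: str, user: str) -> str:
    """Placeholder: replace with a real chat-completion call."""
    return "LGTM"  # canned reply so the sketch runs end to end

def review_loop(bug_report: str, buggy_code: str, max_rounds: int = 3) -> str:
    # Developer produces an initial candidate patch from the tester's report.
    patch = llm_chat(
        "You are a developer.",
        f"Bug report:\n{bug_report}\n\nCode:\n{buggy_code}\n\nReturn a patched version.",
    )
    for _ in range(max_rounds):
        # Reviewer inspects the candidate patch against the bug report.
        feedback = llm_chat(
            "You are a code reviewer.",
            f"Bug report:\n{bug_report}\n\nCandidate patch:\n{patch}\n\n"
            "Approve with 'LGTM' or explain what is wrong.",
        )
        if "LGTM" in feedback:  # reviewer accepts the patch
            break
        # Developer revises the patch in response to the reviewer's feedback.
        patch = llm_chat(
            "You are a developer revising a patch.",
            f"Reviewer feedback:\n{feedback}\n\nPrevious patch:\n{patch}\n\nReturn an improved patch.",
        )
    return patch
```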
The researchers conducted extensive experiments to demonstrate the effectiveness and generalizability of STEAM. The evaluation used the widely adopted bug-fixing benchmark BFP, along with four additional automated program repair (APR) benchmarks to increase evaluation diversity. The results showed that STEAM achieved state-of-the-art bug-fixing performance. The researchers also acknowledged several threats to the validity of their approach, including the quality of the selected experimental subjects, the generalizability of STEAM, and the sensitivity of LLMs to prompts and hyper-parameters.
In conclusion, the researchers believe that aligning the collaborative problem-solving abilities of programmers with LLMs represents a pivotal stride toward intelligent software engineering. They plan to conduct additional human evaluations in the future to further validate the models.
The complete details can be found in the research paper.