The research focused on the application of AI in educational assessment, a domain attracting increasing interest because of its potential to transform large-enrollment courses. The study asked how GPT-4, a general-purpose, pre-trained Large Language Model (LLM), fares against specialized models at grading short-answer responses, a task known as Automated Short Answer Grading (ASAG).
Two benchmark datasets were used: SciEntsBank, which covers general science questions for grades 3 to 6, and Beetle, which covers questions on basic electricity and electronics. The research examined GPT-4’s ability to grade responses against a reference answer and, intriguingly, without one. The latter setup required GPT-4 to draw on its extensive training to judge the correctness of a student’s response independently.
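To make the two grading setups concrete, here is a minimal, hypothetical sketch of how a reference-based and a reference-free prompt might be sent to GPT-4 through the OpenAI Python client. The prompt wording, label set, and the `grade` helper are illustrative assumptions, not the study’s actual protocol.

```python
# Hypothetical sketch of the two grading setups described above, using the
# OpenAI Python client (>=1.0). Prompt wording, model name, and the
# correct/incorrect label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade(question: str, student_answer: str, reference_answer: str | None = None) -> str:
    """Ask the model to label a student answer as 'correct' or 'incorrect'."""
    prompt = f"Question: {question}\nStudent answer: {student_answer}\n"
    if reference_answer is not None:
        # Reference-based setup: grade by alignment with the provided answer.
        prompt += f"Reference answer: {reference_answer}\n"
        prompt += "Does the student answer match the reference answer? Reply 'correct' or 'incorrect'."
    else:
        # Reference-free setup: the model judges correctness on its own.
        prompt += "Is the student answer correct? Reply 'correct' or 'incorrect'."

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()


# Example usage with a made-up Beetle-style item:
# grade("Why does a bulb light in a closed circuit?",
#       "Because current can flow all the way around the loop.",
#       reference_answer="A closed circuit lets current flow through the bulb.")
```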
The findings revealed that GPT-4’s performance was robust. In the SciEntsBank dataset, it achieved its best results on the 2-way task, with an F1 score of 0.744. However, it was the Beetle dataset that presented an unexpected outcome. Here, GPT-4 performed better when the reference answer was withheld, achieving an F1 score of 0.651.
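For readers unfamiliar with the metric, the sketch below shows how an F1 score for a 2-way (correct/incorrect) task could be computed with scikit-learn. The labels and the macro-averaging choice are assumptions for illustration; the benchmark may report a weighted variant.

```python
# Minimal sketch: scoring 2-way (correct/incorrect) predictions with scikit-learn.
# Labels and the 'macro' averaging scheme are illustrative assumptions.
from sklearn.metrics import f1_score

gold = ["correct", "incorrect", "correct", "incorrect", "correct"]
pred = ["correct", "incorrect", "incorrect", "incorrect", "correct"]

print(f1_score(gold, pred, average="macro", labels=["correct", "incorrect"]))
```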
Compared with specialized ASAG models, GPT-4’s performance was on par with hand-engineered systems from about half a decade ago. Models from the BERT family, which undergo both pre-training and task-specific fine-tuning, still outpace GPT-4. The research also highlights the remarkable advances in deep-learning models for ASAG over the last five years. While GPT-4’s capabilities are impressive, especially because it does not require reference answers, the BERT-family models showcase the benefits of task-specific training.
One significant takeaway from Dr. Kortemeyer’s study is GPT-4’s potential in higher education. Preliminary indications suggest that automated grading of comprehensive content, extending beyond short answers, is achievable. However, concerns around data security and privacy with cloud-based models like GPT-4 persist. Alternatives such as Llama 2, which can be installed locally, are being explored, though they currently lag behind GPT-4 in performance.
As AI continues its foray into educational assessment, the balance between performance, adaptability, and data security remains a pivotal point of discussion. Only time will tell which models will emerge as frontrunners in this evolving landscape.