The AGI News

GPT-4’s Performance in Educational Assessment Benchmarked Against Specialized Models

September 19, 2023

The research focused on the application of AI in educational assessment, a domain attracting increasing interest for its potential to revolutionize large-enrollment courses. The study examined how GPT-4, a general-purpose, pre-trained Large Language Model (LLM), fares against specialized models at Automated Short Answer Grading (ASAG), that is, grading students' short-answer responses.

The study used two benchmark datasets: SciEntsBank, which encompasses general science questions for grades 3 to 6, and Beetle, which covers questions on basic electricity and electronics. The research examined GPT-4's ability to grade both with alignment to a reference answer and, intriguingly, without one. The latter setting required GPT-4 to draw upon its extensive training to independently judge the correctness of a student's response.
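
The paper's exact prompts are not reproduced here, but the two grading conditions can be pictured with a minimal sketch such as the one below. The prompt wording and the `build_prompt`, `grade`, and `llm` names are illustrative assumptions, not the study's actual protocol.

```python
from typing import Callable, Optional

def build_prompt(question: str, student_answer: str,
                 reference_answer: Optional[str] = None) -> str:
    """Build a grading prompt, with or without a reference answer (illustrative wording)."""
    parts = [
        "You are grading a student's short answer.",
        f"Question: {question}",
        f"Student answer: {student_answer}",
    ]
    if reference_answer is not None:
        # With-reference condition: judge the student answer against a gold answer.
        parts.append(f"Reference answer: {reference_answer}")
        parts.append("Reply 'correct' or 'incorrect' relative to the reference answer.")
    else:
        # Reference-free condition: the model must rely on its own knowledge.
        parts.append("Reply 'correct' or 'incorrect' based on your own knowledge.")
    return "\n".join(parts)

def grade(question: str, student_answer: str,
          llm: Callable[[str], str],
          reference_answer: Optional[str] = None) -> str:
    """Send the prompt to an LLM client; `llm` is a placeholder callable standing in
    for whatever chat-completion API is actually used."""
    return llm(build_prompt(question, student_answer, reference_answer))
```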

The findings revealed that GPT-4's performance was robust. On the SciEntsBank dataset, it achieved its best results on the 2-way (correct vs. incorrect) task, with an F1 score of 0.744. The Beetle dataset, however, presented an unexpected outcome: GPT-4 performed better when the reference answer was withheld, achieving an F1 score of 0.651.
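
For a sense of the metric being cited, F1 over the 2-way labels can be computed with scikit-learn as below. The toy labels are invented for illustration, and macro averaging is an assumption about how the reported scores are aggregated.

```python
from sklearn.metrics import f1_score

# Toy labels, not benchmark data: 1 = correct, 0 = incorrect.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Macro-averaged F1 weights both classes equally.
print(f1_score(y_true, y_pred, average="macro"))
```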

Compared with specialized ASAG models, GPT-4's performance was reminiscent of hand-engineered systems from half a decade ago. Models from the BERT family, which undergo both pre-training and task-specific training, still outpace GPT-4. The research highlights the phenomenal advances in deep-learning models for ASAG over the last five years: while GPT-4's capabilities are impressive, especially because it needs no reference answers, the BERT family's models showcase the benefits of task-specific training.
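
To make "task-specific training" concrete, a BERT-family grader is typically framed as a sentence-pair classifier and then fine-tuned on labeled ASAG data. The snippet below is a minimal, assumed illustration of that setup; the checkpoint name, example sentences, and pairing scheme are not the paper's, and the classification head only becomes meaningful after fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: a generic BERT checkpoint with a 2-way head (correct / incorrect).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Encode (reference answer, student answer) as one sentence pair.
inputs = tokenizer(
    "A circuit must be closed for current to flow.",   # reference answer (made up)
    "The bulb lights because the loop is complete.",   # student answer (made up)
    return_tensors="pt",
    truncation=True,
)

with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()  # 0 or 1; meaningful only after fine-tuning
```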

One significant takeaway from Dr. Kortemeyer's study is GPT-4's potential in higher education. Preliminary indications suggest that automated grading of comprehensive content, extending beyond short answers, is achievable. However, concerns around data security and privacy with cloud-based models like GPT-4 persist. Alternatives such as Llama 2, which can be run locally, are being explored, though they currently lag behind GPT-4 in performance.

As AI continues its foray into educational assessment, the balance between performance, adaptability, and data security remains a pivotal point of discussion. Only time will tell which models will emerge as frontrunners in this evolving landscape. Read the full paper.


© 2023 AGI News All Rights Reserved.

Contact: community@superagi.com
