In recent explorations, the balance between “speciality” and “generality” during the fine-tuning of foundation models, such as Vision Language Models (VLMs) and Large Language Models (LLMs), has emerged as a focal point. This balance has direct implications for how well these models perform and adapt across diverse tasks and distributions.
Foundation models, recognized for their extensive pre-training datasets, showcase impressive adaptability across varied distributions and tasks. However, while fine-tuning often enhances performance on specific tasks, it may compromise the model’s overarching generality. This phenomenon mirrors the challenge of “catastrophic forgetting” observed in deep learning, wherein models that learn new tasks degrade on previously learned ones.
To illustrate, when VLMs like CLIP are fine-tuned on datasets such as ImageNet, their robustness under distribution shift drops. Similarly, LLMs like Galactica, when fine-tuned for medical-domain tasks, tend to struggle in areas like instruction following and common-sense reasoning.
The study delved into methods to mitigate this trade-off. Among the explored techniques were regularization methods from continual learning, the weight-averaging method Wise-FT, and parameter-efficient techniques like Low-Rank Adaptation (LoRA), sketched below. The findings suggest that while continual-learning methods do mitigate some of the generality loss, Wise-FT stands out, offering the best balance of the tested methods between maintaining generality and achieving task-specific speciality. LoRA’s effectiveness varied based on the complexity and nature of the fine-tuning task.
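To make the two headline techniques concrete, here is a minimal sketch of the weight-averaging idea behind Wise-FT, assuming PyTorch and two models with identical architectures; the function name and the default `alpha` are illustrative, not the paper’s API. Wise-FT simply interpolates, weight by weight, between the zero-shot (pre-trained) model and the fine-tuned one:

```python
import torch

def wise_ft(zero_shot_sd, fine_tuned_sd, alpha=0.5):
    """Linearly interpolate two state dicts (Wise-FT):
    theta = (1 - alpha) * theta_zero_shot + alpha * theta_fine_tuned.
    Assumes all entries are floating-point tensors of matching shapes.
    alpha -> 0 recovers generality; alpha -> 1 recovers speciality."""
    return {
        key: (1 - alpha) * zero_shot_sd[key] + alpha * fine_tuned_sd[key]
        for key in zero_shot_sd
    }

# Usage (illustrative): load the interpolated weights into a fresh copy
# of the same architecture.
# model.load_state_dict(wise_ft(zero_shot_model.state_dict(),
#                               fine_tuned_model.state_dict(), alpha=0.5))
```

LoRA, by contrast, freezes the pre-trained weights and trains only a low-rank correction. The sketch below follows the standard formulation (random-initialized `A`, zero-initialized `B`, scaling `alpha / rank`) and is an assumption-labeled illustration, not the exact implementation used in the study:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights fixed
        # B starts at zero, so the module initially matches the base layer exactly.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because only `A` and `B` receive gradients, LoRA touches a small fraction of the parameters, which is one intuition for why its effect on generality depends so heavily on the fine-tuning task.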
While the research provides valuable insights, it acknowledges that certain methodologies, such as rehearsal methods, remain unexplored. The findings underscore the importance of understanding the dynamics of foundation models, paving the way for further studies that could shape the future of Natural Language Generation.