A new visual language model, IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), has been released to the public. Modeled on DeepMind's proprietary Flamingo, IDEFICS accepts arbitrary sequences of interleaved images and text and generates text outputs, a capability reminiscent of GPT-4.
Designed to promote transparency in the AI landscape, IDEFICS is built entirely from publicly accessible data and models, specifically LLaMa v1 and OpenCLIP. The model comes in two variants, base and instructed, each released at 9 billion and 80 billion parameters, for four checkpoints in total.
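For readers who want to try the model, the released checkpoints can be used through the Hugging Face transformers library. The following is a minimal sketch against the 9-billion-parameter instructed checkpoint (HuggingFaceM4/idefics-9b-instruct), with a hypothetical image URL standing in for a real one; the processor accepts a prompt that freely interleaves text strings and images:

```python
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceM4/idefics-9b-instruct"  # 9B instructed variant

# The processor handles both tokenization and image preprocessing.
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

# A prompt interleaves text with images (passed as URLs or PIL images).
# The URL below is a placeholder; substitute any reachable image.
prompt = [
    "User: What do you see in this image?",
    "https://example.com/photo-of-a-dog.jpg",
    "<end_of_utterance>",
    "\nAssistant:",
]

inputs = processor(prompt, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same interface serves the 80-billion-parameter checkpoints; only the checkpoint name (and the hardware required to host it) changes.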
The overarching objective of IDEFICS is to reproduce, and make available to the AI community, systems that match the capabilities of proprietary models like Flamingo. In keeping with its emphasis on transparent development, the project relies only on publicly sourced data, offers tools for exploring its training datasets, shares technical insights, and documents the challenges encountered along the way. To ensure the model's reliability and safety, it underwent adversarial prompting evaluations prior to release.
IDEFICS is expected to become a cornerstone of open research in multimodal AI, joining models such as OpenFlamingo, another public reproduction of Flamingo at the 9-billion-parameter scale.