Can Smaller AI Models Outperform the Giants?
:::information
Authors:
(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order of authors was chosen randomly);
(2)
(3) Matthieu Cord, Sorbonne Université;
(4) Victor Sanh, Hugging Face.
:::
Table of Links
Abstract and 1 Introduction
2 Terminology
3 Exploring the Design Space of Vision-Language Models and 3.1 Are All Pre-trained Backbones Equivalent for VLMs?
3.2 How Does the Fully Autoregressive Architecture Compare to the Cross-Attention Architecture?
3.3 Where Are the Efficiency Gains?
3.4 How Can One Trade Compute for Performance?
4 IDEFICS2 - An Open State-of-the-Art Vision-Language Foundation Model and 4.1 Multi-stage Pre-training
4.2 Instruction Fine-tuning and 4.3 Optimizing for Chat Scenarios
5 Conclusion, Acknowledgement, and References
\
Appendix
A.1 Further Experimental Details of the Ablations
A.2 Details of the Instruction Fine-tuning
A.3 Details of the Evaluations
A.4 Red-teaming
Abstract
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of IDEFICS2, an efficient foundational VLM of 8 billion parameters. IDEFICS2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat versions) along with the datasets created for its training.
\
1 Introduction
Vision-language models (VLMs) that take images and texts as inputs and output texts are useful for many tasks, from retrieving information in documents to turning screenshots of webpages into code (Laurençon et al., 2024). The development of powerful open large language models and image encoders has enabled researchers to build upon these unimodal pre-trained models to create advanced VLMs. Despite the progress in the field, the literature reveals many disparate design choices which are often not justified experimentally, or only very briefly.
\ This situation makes it challenging to distinguish which decisions truly account for model performance, thereby making it difficult for the community to make meaningful and grounded progress. For instance, some works (Alayrac et al., 2022; Laurençon et al., 2023) use cross-attention layers to fuse the image information into the language model, while others concatenate the sequence of image hidden states with the sequence of text embeddings and feed it directly to the language model. To our knowledge, this choice has not been properly ablated, and the trade-offs in terms of compute, data efficiency, and performance are poorly understood. In this work, we aim to bring experimental clarity to some of these core design choices and ask the question: What matters when building vision-language models?
\ We identify two areas where various works adopt different design choices: (a) model architecture, in particular the connector modules that fuse the vision and text modalities and their impact on inference efficiency, and (b) the multimodal training procedure and its impact on training stability. For each of these areas, we rigorously compare different design choices in a controlled environment and extract experimental findings. Notably, we find that (a) the progress of vision-language models is in large part driven by the progress of the pre-trained unimodal backbones, (b) the more recent fully autoregressive architecture outperforms the cross-attention architecture, although it requires modifications to the optimization procedure to ensure stable training, (c) adaptation of the pre-trained vision backbone and of the modules connecting the text and vision modalities allows for more efficiency at inference time on one side, and for handling images in their original ratio and size without harming downstream performance on the other side, and (d) modifications to the image processing enable trading inference cost for downstream performance.
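The two connector designs contrasted above can be illustrated in a few lines. The sketch below is a minimal PyTorch illustration, not the paper's implementation: the module names, dimensions, and use of a single attention layer are assumptions for clarity.

```python
# Minimal sketch (assumed shapes and names) of the two ways VLMs fuse
# image features into a language model.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
img_feats = torch.randn(1, 16, d_model)   # 16 patch features from a vision encoder
txt_embeds = torch.randn(1, 8, d_model)   # 8 text token embeddings

# (1) Cross-attention fusion: text hidden states attend to image features
#     inside interleaved cross-attention layers; the text sequence length
#     is unchanged.
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
fused, _ = cross_attn(query=txt_embeds, key=img_feats, value=img_feats)

# (2) Fully autoregressive fusion: image features are projected by a
#     connector module and concatenated with the text embeddings, giving
#     the language model one longer input sequence.
projector = nn.Linear(d_model, d_model)   # stand-in for the modality connector
lm_input = torch.cat([projector(img_feats), txt_embeds], dim=1)
```

In the first design the image tokens never enter the language model's sequence, so inference cost grows with the text length only; in the second, the sequence grows by one token per image feature, which is where the efficiency trade-offs discussed in the paper come from.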
\ Our results are complementary to those presented in (Karamcheti et al., 2024). We specifically delve into unexplored aspects such as model architecture, training methods, stability, and efficiency improvements at inference.
\ Learning from these insights, we train IDEFICS2, a foundational VLM with 8 billion parameters. IDEFICS2 achieves state-of-the-art performance within its size category on various benchmarks while being more efficient at inference, for both the base and the fine-tuned versions. It is on par with state-of-the-art models 4 times larger on some vision-language benchmarks and matches the performance of Gemini 1.5 Pro on some challenging benchmarks. We release the base, instructed, and chat versions of IDEFICS2[1] as resources for the VLM community, along with the data created to train the model.
\
:::information
This paper is available on arxiv under CC BY 4.0 DEED license.
:::
[1] https://huggingface.co/collections