
Comparing Chameleon with GPT-4V and Gemini

Abstract and 1 Introduction

2 Pre-Training

2.1 Tokenization

2.2 Pre-Training Data

2.3 Stability

2.4 Inference

3 Alignment and 3.1 Data

3.2 Fine-Tuning Strategy

4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation

4.2 Baselines and Evaluations

4.3 Inter-annotator Agreement

4.4 Safety Testing

4.5 Discussion

5 Benchmark Evaluations and 5.1 Text

5.2 Image-to-Text

6 Related Work

7 Conclusion, Acknowledgements, Contributors, and References

Appendix

A. Samples

B. Additional Information on Human Evaluations

4 Human Evaluations and Safety Testing

Chameleon's mixed-modal understanding and generation capabilities cannot be measured with existing benchmarks. In this section, we detail how we conduct human evaluations of large multi-modal models' responses to a diverse set of prompts that regular users may ask daily. We first describe how the prompts are collected, then introduce the baselines and our evaluation methods, followed by the evaluation results and analysis. A safety study is also included in this section.

4.1 Prompts for Evaluation

We work with a third-party crowdsourcing vendor to collect a set of diverse and natural prompts from human annotators. Specifically, we ask annotators to think creatively about what they would want a multi-modal model to generate for different real-life scenarios. For example, for the scenario "Imagine you are in a kitchen", annotators may come up with prompts such as "How to cook pasta?" or "How should I design the layout of my island? Show me some examples." The prompts can be text-only or text with some images, and the expected responses should be mixed-modal, containing both text and images.

After collecting an initial set of prompts, we ask three random annotators to evaluate whether the prompts are clear and whether they expect the responses to contain images. We use a majority vote to filter out unclear prompts and prompts that do not expect mixed-modal responses. In the end, our final evaluation set contains 1,048 prompts: 441 (42.1%) are mixed-modal (i.e., containing both text and images), and the remaining 607 (57.9%) are text-only.
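To make this filtering step concrete, below is a minimal sketch of majority-vote filtering over three annotator judgments per prompt. The record fields (clarity_votes, expects_image_votes) and the exact aggregation rule are illustrative assumptions, since the paper does not describe the underlying tooling.

```python
from collections import Counter

def majority(votes):
    """Return the label chosen by at least two of the three annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

def filter_prompts(prompts):
    """Keep prompts that a majority of annotators judged to be clear and to
    expect a mixed-modal (text + image) response."""
    kept = []
    for p in prompts:
        is_clear = majority(p["clarity_votes"]) == "clear"
        expects_image = majority(p["expects_image_votes"]) is True
        if is_clear and expects_image:
            kept.append(p)
    return kept

# Example: one prompt record with three annotator judgments for each question.
prompts = [{"text": "How to cook pasta?",
            "clarity_votes": ["clear", "clear", "unclear"],
            "expects_image_votes": [True, True, False]}]
print(len(filter_prompts(prompts)))  # -> 1 (prompt is kept)
```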

To better understand the tasks users want a multi-modal system to perform, we manually examine the prompts and classify them into 12 categories. A description of these task categories,[1] as well as example prompts, can be found in Figure 8.

Figure 9: Performance of Chameleon against the baselines in understanding and generating mixed-modal content, on a set of diverse and natural prompts from human annotators.

4.2 Baselines and Evaluations

We compare Chameleon 34B with OpenAI GPT-4V and Google Gemini Pro by calling their APIs. While these models can take mixed-modal prompts as input, their responses are text-only. To form even stronger baselines, we create additional baselines by augmenting the GPT-4V and Gemini responses with images. Specifically, we instruct these models to generate image captions by adding the following sentence at the end of each original input prompt: "If the question requires an image to be generated, then generate an image caption instead and enclose the caption in a pair of ⟨caption⟩ ⟨/caption⟩ tags." We then use OpenAI DALL-E 3 to generate images conditioned on these captions and replace the captions in the original responses with the generated images. We refer to the augmented responses as GPT-4V+ and Gemini+ in this section. Working with a third-party crowdsourcing vendor, we perform two types of evaluations to measure model performance: absolute and relative.
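A minimal sketch of this augmentation step follows, assuming the baseline's text response contains zero or more ⟨caption⟩…⟨/caption⟩ spans. The regular expression, the generate_image placeholder (standing in for a DALL-E 3 call), and the bracketed image-reference format are illustrative assumptions, not the authors' implementation.

```python
import re

# Appended to each original input prompt when building the GPT-4V+ / Gemini+ baselines.
INSTRUCTION = ("If the question requires an image to be generated, then generate "
               "an image caption instead and enclose the caption in a pair of "
               "⟨caption⟩ ⟨/caption⟩ tags.")

CAPTION_RE = re.compile(r"⟨caption⟩(.*?)⟨/caption⟩", re.DOTALL)

def generate_image(caption: str) -> str:
    """Placeholder for a text-to-image call (the paper uses DALL-E 3 here).
    Returns a reference to the generated image, e.g. a file path or URL."""
    raise NotImplementedError

def augment_response(text_only_response: str) -> str:
    """Replace each ⟨caption⟩...⟨/caption⟩ span with a generated image reference,
    yielding a mixed-modal response in the style of GPT-4V+ / Gemini+."""
    def swap(match: re.Match) -> str:
        caption = match.group(1).strip()
        return f"[image: {generate_image(caption)}]"
    return CAPTION_RE.sub(swap, text_only_response)
```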

4.2.1 Absolute Evaluation

For absolute evaluations, each model's output is judged separately by asking three different annotators a set of questions regarding the relevance and quality of the response. Below, we present detailed results and analysis for the most important question: whether the response fulfills the task described in the prompt.

For task fulfillment, we ask annotators whether the response fulfills, partially fulfills, or does not fulfill the task described in the prompt. As shown in Figure 9a, many more of Chameleon's responses fully fulfill the tasks: 55.2% for Chameleon, compared with 37.6% for Gemini+ and 44.7% for GPT-4V+. When judging the original responses of Gemini and GPT-4V, annotators rate far fewer prompts as fully fulfilled: Gemini fully fulfills 17.6% of the tasks and GPT-4V 23.1%. We suspect that, since all prompts expect a mixed-modal output, the text-only responses from Gemini and GPT-4V may be regarded by annotators as only partially fulfilling the tasks.

Task fulfillment rates for each category and for each input modality can be found in Appendix B. Task categories on which Chameleon performs well include Brainstorming, Comparison, and Hypothetical, while categories where Chameleon has room for improvement include Identification and Reasoning. On the other hand, we do not see the model performance vary greatly when comparing mixed-modal and text-only prompts, although Chameleon appears slightly better on text-only prompts, while Gemini+ and GPT-4V+ are slightly better on mixed-modal prompts. Figure 2 shows an example of Chameleon's response to a brainstorming prompt.

4.2.2 Relative Evaluation

For relative evaluations, we directly compare Chameleon with each baseline model by presenting their responses to the same prompt in random order and asking human annotators which response they prefer. The options are the first response, the second response, and about the same. Figure 9b shows Chameleon's win rates[2] over the baselines. Compared with Gemini+, Chameleon's responses are better in 41.5% of cases, 34.5% are ties, and 24.0% are inferior. Annotators also consider Chameleon's responses slightly better than GPT-4V+, with 35.8% wins, 31.6% ties, and 32.6% losses. Overall, Chameleon's win rates over Gemini+ and GPT-4V+ are 60.4% and 51.6%, respectively. When compared with the original responses of Gemini without augmented images, Chameleon's responses are better in 53.5% of cases, 31.2% are ties, and 15.3% are inferior. Chameleon's responses are also preferred over GPT-4V's more often, with 46.0% wins, 31.4% ties, and 22.6% losses. Chameleon's win rates over Gemini and GPT-4V are 69.1% and 61.7%, respectively.
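For reference, the overall win rates quoted above appear consistent with counting a tie as half a win. Since footnote [2], which defines the metric, is not reproduced here, the formula below is an inference from the reported numbers (small discrepancies may be due to rounding), not a definitive restatement of the paper's definition.

```python
def win_rate(win_pct: float, tie_pct: float) -> float:
    """Win rate under the assumption that a tie counts as half a win."""
    return win_pct + 0.5 * tie_pct

# Chameleon vs. the original text-only baselines, using the percentages above.
print(round(win_rate(53.5, 31.2), 1))  # vs. Gemini  -> 69.1
print(round(win_rate(46.0, 31.4), 1))  # vs. GPT-4V  -> 61.7
```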

Author:

(1) Chameleon Team, FAIR at Meta.


[1] Although not specifically excluded, image-understanding tasks that require text in an image to be read, such as optical character recognition (OCR), do not appear in our evaluation set of prompts.
