How Reliable Are Human Judgments in AI Model Testing?

Table of Links
Abstract and 1 Introduction
2 Pre-Training
2.1 Tokenization
2.2 Pre-Training Data
2.3 Stability
2.4 Inference
3 Alignment and 3.1 Data
3.2 Fine-Tuning Strategy
4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation
4.2 Baselines and Evaluations
4.3 Inter-annotator Agreement
4.4 Safety Testing
4.5 Discussion
5 Benchmark Evaluations and 5.1 Text
5.2 Image-to-Text
6 Related Work
7 Conclusion, Acknowledgements, Contributors, and References
Appendix
A. Samples
B. Additional Information of Human Evaluations
4.3 Inter-annotator Agreement
Each question in our evaluation is answered by three different human annotators, and we take the majority vote as the final answer. To understand the quality of the human annotations and whether the questions we ask are reasonably designed, we examine the level of agreement between different annotators. (A minimal sketch of this aggregation step follows below.)
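The sketch below illustrates the majority-vote aggregation described above; the function name and the tie handling (returning None when no label wins a majority) are our own assumptions, not code from the paper.

```python
from collections import Counter
from typing import Optional

def majority_vote(labels: list[str]) -> Optional[str]:
    """Return the label chosen by more than half the annotators, or None if there is no majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# With three annotators per question, two matching labels decide the answer.
print(majority_vote(["fulfills", "fulfills", "partially fulfills"]))           # fulfills
print(majority_vote(["fulfills", "partially fulfills", "does not fulfill"]))   # None
```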
For questions about simple, objective properties of the responses, we rarely see all three annotators disagree with one another. For example, annotators give unanimous judgments on whether a model response contains objectionable content (e.g., hate speech); in this case, all models produce safe responses. For some questions, such as whether the response fulfills the task or whether the model interprets the prompt correctly, when one annotator's judgment differs from the other two, the disagreement is usually mild (e.g., fulfills vs. partially fulfills) rather than opposite (e.g., fulfills vs. does not fulfill).[3]
For the relative evaluation, Table 4 shows the number of cases where all three annotators agree, where only two annotators agree, and where there is no agreement. For each pair of models, slightly more than 10% of cases have no agreement among the three annotators (these are treated as ties in our evaluation), which can be interpreted as Chameleon performing similarly to the other baselines in many cases, making the relative evaluation challenging.[4]
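Footnotes [3] and [4] report inter-annotator reliability as Krippendorff's alpha with a bootstrapped confidence interval. The sketch below shows one way to compute the nominal-data version of the statistic and a simple 95% percentile bootstrap over items; it is an illustration under our own assumptions (plain-Python implementation, item-level resampling), not the paper's actual procedure.

```python
import random
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels; `units` holds one list of ratings per item."""
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # items with fewer than two ratings are not pairable
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    n = sum(coincidence.values())  # total number of pairable ratings
    marginals = Counter()
    for (a, _), w in coincidence.items():
        marginals[a] += w
    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(marginals[a] * marginals[b]
                   for a in marginals for b in marginals if a != b) / (n - 1)
    return 1.0 if expected == 0 else 1.0 - observed / expected

def bootstrap_ci(units, stat, iters=1000, seed=0):
    """95% percentile bootstrap CI, resampling items with replacement."""
    rng = random.Random(seed)
    values = sorted(stat([rng.choice(units) for _ in units]) for _ in range(iters))
    return values[int(0.025 * iters)], values[int(0.975 * iters)]
```

For example, `bootstrap_ci(units, krippendorff_alpha_nominal)` would yield an interval analogous to the [0.319, 0.356] range in footnote [3], though the exact resampling scheme used by the authors is not specified in this excerpt.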
4.4 Safety Testing
We crowdsource prompts designed to provoke the model into creating unsafe content in predefined categories such as self-harm, violence and hate, and criminal planning. These prompts cover both text and mixed-modal inputs, as well as intents to produce unsafe text, image, or mixed-modal outputs. We generate the model's response to each prompt and ask annotators to label whether the response is safe or unsafe with respect to each category's definition of safety; an unsure option is also provided for borderline responses. Table 5 shows that the vast majority of Chameleon's responses are safe, with 78 (0.39%) unsafe responses for the 7B model and 19 (0.095%) for the 30B model.
We also evaluate the model's ability to withstand an adversarial prober in an interactive session. For this purpose, an internal red team probed the 30B model over 445 prompt-response interactions, including multi-turn interactions. Table 5 shows that of those responses, 7 (1.6%) were considered unsafe and 20 (4.5%) were labeled as unsure. While further safety tuning using RLHF/RLAIF would harden the model further against intentional jailbreaking and malicious attacks, these results show that our current safety tuning approach provides significant protection for reasonable, benign use of this research artifact.
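As a quick sanity check on the reported percentages, the counts above can be converted to rates directly. This is a throwaway sketch; the denominators are not stated in this excerpt and are inferred from the counts and percentages reported above.

```python
# Denominators are assumptions inferred from the text: 445 red-team interactions,
# and roughly 20,000 crowdsourced prompts implied by 78 -> 0.39% and 19 -> 0.095%.
def rate(count: int, total: int) -> str:
    return f"{100 * count / total:.3g}%"

print(rate(78, 20_000))  # 0.39%   unsafe, 7B model, crowdsourced prompts
print(rate(19, 20_000))  # 0.095%  unsafe, 30B model, crowdsourced prompts
print(rate(7, 445))      # 1.57%   unsafe, 30B model, red-team session
print(rate(20, 445))     # 4.49%   unsure, 30B model, red-team session
```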
4.5 Discussion
Compared with Gemini and GPT-4V, Chameleon is very competitive when handling prompts that expect interleaved, mixed-modal responses. The images generated by Chameleon are usually relevant to the context, making documents with interleaved text and images very appealing to users. However, readers should be aware of the limitations of human evaluation. First, the prompts used in the evaluation came from crowdsourcing rather than from real users interacting with a model. While we certainly have a diverse set of prompts, the coverage may still be limited, given the size of the dataset. Second, partly because our prompts focus on natural mixed-modal output, certain visual understanding tasks, such as OCR or infographics (i.e., interpreting a given chart or plot), are naturally excluded from our evaluation. Finally, at this moment, the APIs of the multi-modal LLMs provide only textual responses. Although we strengthen the baselines by augmenting their output with separately generated images, it would still be preferable to compare Chameleon with other native mixed-modal models.
Author:
(1) Chameleon Team, FAIR at Meta.
[3] On the question of task fulfillment, the inter-annotator reliability measured by Krippendorff's alpha (Krippendorff, 2018; Marzi et al., 2024) is 0.338, with a 95% confidence interval of [0.319, 0.356] based on bootstrapping with 1,000 iterations.

[4] When comparing Chameleon with Gemini+ and GPT-4V+, Krippendorff's alpha is 0.337 [0.293, 0.378] and 0.396 [0.353, 0.435], respectively.