How Does Context Change the Way We Evaluate AI Responses?

Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, Netherlands.
Table of Links
Abstract and 1 Introduction
2 Methodology and 2.1 Experimental data and tasks
2.2 Automatic generation of varying dialogue contexts
2.3 Crowdsourcing experiments
2.4 Experimental conditions
2.5 Participants
3 Results and analysis and 3.1 Data statistics
3.2 RQ1: Impact of varying amounts of dialogue context
3.3 RQ2: Effect of automatically generated dialogue context
4 Discussion and implications
5 Related work
6 Conclusion, limitations, and ethical considerations
7 Acknowledgements and references
Appendix
2.3 Crowdsourcing experiments
Following (Kazai, 2011; Kazai et al., 2013; Roitero et al.), we publish HITs under varying conditions to understand how contextual information affects annotators' judgments. Our study has two phases: in Phase 1, we vary the amount of contextual information; in Phase 2, we vary the type of contextual information. Across all phases and conditions, payment was kept the same, since this study does not focus on understanding the impact of incentives on the quality of crowdsourced labels, unlike work such as (Kazai et al., 2013). This helps to prevent potential biases when annotators complete the HITs.
Phase 1. In Phase 1, the focus is on understanding how the amount of dialogue context influences the quality and consistency of relevance and usefulness labels. We vary the length of the dialogue context shown for the rated turn (RQ1). Hence, we design our experiment with three variations: C0, C3, and C7 (see Section 2.4). The HIT consists of a general task description, instructions, examples, and the main task. For each variation, we collect labels for two main dimensions (relevance and usefulness) and include an open question asking for feedback on the task. Each dimension is assessed by 3 annotators in a separate HIT, with the same system response assessed by each of them. This ensures a consistent assessment of both relevance and usefulness.
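As a rough illustration of how the three variations could be realized, the Python sketch below truncates a dialogue to the context an annotator would see before the rated turn. The helper name and the assumption that C3 and C7 correspond to the last three and seven preceding utterances are ours, not the authors' exact implementation.

```python
# Minimal sketch of the three context-size conditions. The condition names
# (C0/C3/C7) come from the paper; the exact turn counts are an assumption.
from typing import Dict, List

CONTEXT_SIZES = {"C0": 0, "C3": 3, "C7": 7}

def build_hit_context(dialogue: List[Dict[str, str]],
                      condition: str) -> List[Dict[str, str]]:
    """Return the preceding utterances shown to the annotator.

    `dialogue` holds the utterances *before* the system response being
    judged, e.g. [{"speaker": "user", "text": "..."}, ...].
    """
    k = CONTEXT_SIZES[condition]
    return dialogue[-k:] if k > 0 else []  # C0 shows no prior context
```

Under this reading, C0 presents the rated turn in isolation, C3 prepends a partial window, and C7 prepends the full prior dialogue.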
Phase 2. In Phase 2, the focus shifts to the type of contextual information provided for the response (RQ2). We follow the same crowdsourcing HIT design as in Phase 1. We base our experiments on the C0 experimental variation (defined below), where no prior dialogue context is available to the annotator. We aim to enhance the quality of the crowdsourced labels for C0 by including additional contextual information alongside the rated turn. Our hypothesis is that, without prior context, annotators may struggle to provide accurate and consistent labels. By providing additional context, such as the user information need or a dialogue summary, we expect an improvement in annotation accuracy. In this way, we aim to approach the performance achieved with the full dialogue context while reducing the required annotation effort. We augment the 40 dialogues from Phase 1 with either the user information need or a dialogue summary, as detailed in Section 2.2. Hence, in Phase 2 we have three experimental settings: C0-LLM, C0-Heu, and C0-Sum. Table 3 in Section A.1 summarizes the settings.
The HIT design mirrors that of Phase 1. The main task remains unchanged, except for the inclusion of the user information need or the dialogue summary. Annotators answer the two questions on relevance and usefulness in separate HITs. While they are not strictly required to rely on the additional information provided, annotators are encouraged to use it when they feel the current response lacks sufficient information for an informed judgment.
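A minimal sketch of how a Phase 2 item could be assembled follows; the field names and the `aux` dictionary are hypothetical, meant only to show that the C0 task stays intact while one piece of auxiliary context is attached.

```python
# Sketch of a Phase-2 HIT payload: identical to C0 except for one
# auxiliary context field. Field names are illustrative, not the
# authors' actual schema.
def build_phase2_hit(rated_turn: dict, condition: str, aux: dict) -> dict:
    extra = {
        "C0-Heu": aux["heuristic_information_need"],
        "C0-LLM": aux["llm_information_need"],
        "C0-Sum": aux["dialogue_summary"],
    }[condition]
    return {
        "context": [],                # as in C0: no prior dialogue turns
        "auxiliary_context": extra,   # optional aid; annotators may ignore it
        "user_utterance": rated_turn["user"],
        "system_response": rated_turn["system"],
    }
```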
2.4 Experimental conditions
We focus on two main factors: the amount and the type of dialogue context. For each factor, we explore three distinct settings, leading to 6 variations for each of relevance and usefulness, all applied to the same 40 dialogues (the sketch after this list enumerates the resulting grid):
• Amount of context. We explore three context variations: no context (C0), partial context (C3), and full context (C7), designed to cover scenarios where no prior dialogue context is accessible to the annotator (C0), where some of the prior dialogue context is available but not all of it (C3), and where the full prior dialogue context is accessible (C7).
• Type of context. Using the contexts generated in Section 2.2, we experiment with three context-type variations: the heuristically generated information need (C0-Heu), the LLM-generated information need (C0-LLM), and the dialogue summary (C0-Sum).
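To make the experimental grid explicit, a short sketch (with illustrative field names) enumerates the twelve annotation batches implied by the six variations and two dimensions:

```python
# Enumerate the full grid implied above: six context variations, each
# labelled for both relevance and usefulness on the same 40 dialogues,
# with 3 annotators per item as described in Section 2.3.
from itertools import product

CONDITIONS = ["C0", "C3", "C7", "C0-Heu", "C0-LLM", "C0-Sum"]
DIMENSIONS = ["relevance", "usefulness"]

batches = [
    {"condition": c, "dimension": d, "dialogues": 40, "annotators_per_item": 3}
    for c, d in product(CONDITIONS, DIMENSIONS)
]
assert len(batches) == 12  # 6 variations x 2 dimensions
```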
Table 3 in Section A.1 of the appendix summarizes the experimental conditions.
2.5 Participants
We recruited Master workers from the United States via Amazon Mechanical Turk (Amazon Mechanical Turk, 2023) to ensure English language proficiency. Annotators were selected based on platform qualifications, requiring a minimum approval rate of 97% across 5,000 HITs. To mitigate any learning bias from the task, each annotator was limited to completing 10 HITs per batch and to participating in at most 3 experimental conditions. A total of 78 unique workers participated in Phases 1 and 2; each worker was paid $0.40 per HIT, at a rate of $14 per hour. The average age was 35-44 years. The gender distribution was 46% female and 54% male. The majority held a four-year Bachelor's degree (48%), followed by a two-year degree and a Master's degree (15% and 14%, respectively).
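For concreteness, the sketch below expresses these worker filters using boto3's Mechanical Turk client. The qualification type IDs are the documented MTurk system qualifications; the HIT title, durations, and question file are placeholders, not the authors' actual task.

```python
# Hedged sketch of the worker filters described above, via the public
# MTurk API (boto3). Concrete HIT parameters here are illustrative.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

qualifications = [
    {   # >= 97% approval rate (PercentAssignmentsApproved)
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [97],
    },
    {   # >= 5,000 approved HITs (NumberHITsApproved)
        "QualificationTypeId": "00000000000000000040",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [5000],
    },
    {   # US-based workers only (Worker_Locale)
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
    {   # Masters qualification (documented production ID; differs in sandbox)
        "QualificationTypeId": "2F1QJWKUDD8XADTFD2Q0G6UTO95ALH",
        "Comparator": "Exists",
    },
]

hit = mturk.create_hit(
    Title="Rate the relevance of a system response",   # placeholder
    Description="Read the dialogue excerpt and rate the response.",
    Reward="0.40",                    # $0.40 per HIT, as reported above
    MaxAssignments=3,                 # 3 annotators per dimension
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=open("hit_question.xml").read(),  # hypothetical question XML
    QualificationRequirements=qualifications,
)
```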
We perform quality control on the crowdsourced labels to ensure reliability, as described in Section A.2 of the appendix.