Evaluating GPT and Open-Source Models on Code Mutation Tasks

Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);
(3) YouFang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);
(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);
(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).
Table of Links
Abstract and 1 Introduction
2 Background and Related Work
3 Study Design
3.1 Overview of Research Questions
3.2 Datasets
3.3 Mutation Generation via LLMs
3.4 Evaluation Metrics
3.5 Experiment Settings
4 Evaluation Results
4.1 RQ1: Performance on Cost and Usability
4.2 RQ2: Behavior Similarity
4.3 RQ3: Impacts of Different Prompts
4.4 RQ4: Impacts of Different LLMs
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
5 Discussion
5.1 Sensitivity to the Chosen Experiment Settings
5.2 Implications
5.3 Threats to Validity
6 Conclusion and References
4.4 RQ4: Impacts of Different LLMs
To answer this RQ, we add two additional LLMs, GPT-4 and StarChat-16B, and compare their results with the two default LLMs, GPT-3.5 and CodeLlama-13B. The right half of Table 7 shows the results of comparing the models under the default prompt. We observe that the closed-source LLMs generally outperform the others on most metrics. GPT-3.5 excels in the number of mutations, the generation cost per 1K mutations, and the average generation time, making it well suited for generating many mutations quickly. GPT-4 leads on all the usability and behavior-similarity metrics, indicating its effectiveness in code-related tasks, although its improvement over GPT-3.5 on the behavior metrics is marginal. Between the two open-source LLMs, CodeLlama-13B outperforms StarChat-16B on all metrics despite having fewer parameters. This indicates that model architecture and training-data quality strongly affect performance, outweighing sheer parameter count.
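The cost-per-1K-mutations metric used above is a simple normalization of total generation cost over mutation count. A minimal sketch; the dollar amount and mutation count below are hypothetical illustrations, not the paper's Table 7 values:

```python
def cost_per_1k_mutations(total_cost_usd: float, num_mutations: int) -> float:
    """Normalize total generation spend to the cost of producing 1,000 mutations."""
    return total_cost_usd / num_mutations * 1_000

# Hypothetical example: $12 of API usage producing 40,000 mutations.
print(cost_per_1k_mutations(12.0, 40_000))  # -> 0.3
```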
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
Non-compilable mutations still require compilation attempts, which wastes computational resources. As mentioned in Section 4.1, LLMs generate a large number of non-compilable mutations. This RQ analyzes the error types and potential root causes of these non-compilable mutations. Following the steps described earlier, we first sampled 384 non-compilable mutations from GPT-3.5's outputs, ensuring a 95% confidence level and a 5% margin of error. From manual analysis of these non-compilable mutations, we identified 9 distinct error types, as shown in Table 8.
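The sample size of 384 follows from the standard Cochran calculation for estimating a proportion at 95% confidence with a 5% margin of error. A minimal sketch (the function name and defaults are ours, not part of the paper's tooling):

```python
def sample_size(confidence_z: float = 1.96, p: float = 0.5,
                margin: float = 0.05) -> int:
    """Cochran's sample-size formula: n = z^2 * p * (1 - p) / e^2.

    confidence_z: z-score for the confidence level (1.96 for 95%)
    p: assumed proportion; 0.5 is the conservative worst case
    margin: acceptable margin of error
    """
    return round(confidence_z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size())  # -> 384
```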
As shown in Table 8, the most common error type, usage of unknown methods, accounts for 27.34% of the total errors [30]. Code structure destruction is the second most common error type, accounting for 22.92%, indicating that ensuring the generated code is syntactically correct remains a challenge for current LLMs. This result suggests that there is still considerable room for improvement in current LLMs.
To analyze which types of code elements lead to non-compilable mutations, we examined the code locations of all the non-compilable mutations generated by GPT-3.5, CodeLlama, LEAM, and 𝜇Bert in Section 4.1, as shown in Figure 3. For all the approaches, the most error-prone code locations are those involving MethodInvocation and MemberReference. In particular, more than 30% of the non-compilable mutations occur at locations with MethodInvocation, and 20% occur at locations with MemberReference. This is likely due to the inherent complexity of these constructs, which often involve multiple dependencies and references; if any required method or member is missing or incorrectly defined, the mutation easily fails to compile. These errors highlight the need for more context-aware mutation generation that ensures method invocations and member references align with the intended program structure. In addition, we examine the compiler-rejected mutations that are deletion mutations and find that for GPT-3.5, CodeLlama, LEAM, 𝜇Bert, and Major, such mutations account for 7.1%, 0.2%, 45.3%, 0.14%, and 14.4% of all their non-compilable mutations, respectively. Thus, for LLMs, deletion is not the main cause of non-compilability.
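The manual error-type labeling could, in principle, be partially automated by bucketing compiler diagnostics. A minimal sketch, assuming javac-style message fragments; the category labels and regexes here are illustrative and are not the paper's exact Table 8 taxonomy:

```python
import re

# Illustrative rules mapping javac-style diagnostics to coarse error
# buckets; the labels only approximate the paper's Table 8 types.
RULES = [
    (re.compile(r"cannot find symbol.*method", re.S), "usage of unknown method"),
    (re.compile(r"cannot find symbol.*variable", re.S), "usage of unknown variable"),
    (re.compile(r"'\}' expected|reached end of file|illegal start", re.S),
     "code structure destruction"),
    (re.compile(r"incompatible types", re.S), "type mismatch"),
]

def classify(diagnostic: str) -> str:
    """Return the first matching error bucket, or 'other'."""
    for pattern, label in RULES:
        if pattern.search(diagnostic):
            return label
    return "other"

print(classify("error: cannot find symbol\n  symbol: method fooBar()"))
# -> usage of unknown method
```

A rule-based pass like this can only triage; ambiguous diagnostics still need the kind of manual inspection the study performed.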