Evaluating GPT and Open-Source Models on Code Mutation Tasks

Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);
(3) YouFang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);
(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);
(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).
Table of Links
Abstract and 1 Introduction
2 Background and Related Work
3 Study Design
3.1 Overview of Research Questions
3.2 Datasets
3.3 Mutation Generation via LLMs
3.4 Evaluation Metrics
3.5 Experiment Settings
4 Evaluation Results
4.1 RQ1: Performance on Cost and Usability
4.2 RQ2: Behavior Similarity
4.3 RQ3: Impacts of Different Prompts
4.4 RQ4: Impacts of Different LLMs
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
5 Discussion
5.1 Sensitivity to the Chosen Experiment Settings
5.2 Implications
5.3 Threats to Validity
6 Conclusion and References
4.4 RQ4: Impacts of Different LLMs
To answer this RQ, we add two additional LLMs, GPT-4 and StarChat-16B, and compare their results with the two default LLMs, GPT-3.5 and CodeLlama-13B. The right half of Table 7 shows the results of comparing the models under the default prompt. We observe that the closed-source LLMs generally outperform the others on most metrics. GPT-3.5 excels in the number of mutations, the generation cost per 1K mutations, and the average generation time, making it well suited for generating many mutations quickly. GPT-4 leads on all the usability and behavior-similarity metrics, indicating its effectiveness in code-related tasks, although its improvement over GPT-3.5 on the behavior metrics is marginal. Between the two open-source LLMs, CodeLlama-13B outperforms StarChat-16B on all metrics despite having fewer parameters. This indicates that model architecture and training-data quality strongly affect performance, outweighing sheer parameter count.
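The cost-per-1K-mutations metric used above is a simple normalization of total generation cost over mutation count. A minimal sketch; the dollar amount and mutation count below are hypothetical illustrations, not the paper's Table 7 values:

```python
def cost_per_1k_mutations(total_cost_usd: float, num_mutations: int) -> float:
    """Normalize total generation spend to the cost of producing 1,000 mutations."""
    return total_cost_usd / num_mutations * 1_000

# Hypothetical example: $12 of API usage producing 40,000 mutations.
print(cost_per_1k_mutations(12.0, 40_000))  # -> 0.3
```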
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
Non-compilable mutations still require compilation attempts, which wastes computational resources. As mentioned in Section 4.1, LLMs generate a large number of non-compilable mutations. This RQ analyzes the error types and potential root causes of these non-compilable mutations. Following the steps described earlier, we first sampled 384 non-compilable mutations from GPT-3.5's outputs, ensuring a 95% confidence level and a 5% margin of error. From manual analysis of these non-compilable mutations, we identified 9 distinct error types, as shown in Table 8.
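The sample size of 384 follows from the standard Cochran calculation for estimating a proportion at 95% confidence with a 5% margin of error. A minimal sketch (the function name and defaults are ours, not part of the paper's tooling):

```python
def sample_size(confidence_z: float = 1.96, p: float = 0.5,
                margin: float = 0.05) -> int:
    """Cochran's sample-size formula: n = z^2 * p * (1 - p) / e^2.

    confidence_z: z-score for the confidence level (1.96 for 95%)
    p: assumed proportion; 0.5 is the conservative worst case
    margin: acceptable margin of error
    """
    return round(confidence_z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size())  # -> 384
```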
As shown in Table 8, the most common error type, usage of unknown methods, accounts for 27.34% of the total errors [30]. Code structure destruction is the second most common error type, accounting for 22.92%, indicating that ensuring the generated code is syntactically correct remains a challenge for current LLMs. This result suggests that there is still considerable room for improvement in current LLMs.
To analyze which types of code elements lead to non-compilable mutations, we examined the code locations of all the non-compilable mutations generated by GPT-3.5, CodeLlama, LEAM, and 𝜇Bert in Section 4.1, as shown in Figure 3. For all the approaches, the most error-prone code locations are those involving MethodInvocation and MemberReference. In particular, more than 30% of the non-compilable mutations occur at locations with MethodInvocation, and 20% occur at locations with MemberReference. This is likely due to the inherent complexity of these constructs, which often involve multiple dependencies and references; if any required method or member is missing or incorrectly defined, the mutation easily fails to compile. These errors highlight the need for more context-aware mutation generation that ensures method invocations and member references align with the intended program structure. In addition, we examine the compiler-rejected mutations that are deletion mutations and find that for GPT-3.5, CodeLlama, LEAM, 𝜇Bert, and Major, such mutations account for 7.1%, 0.2%, 45.3%, 0.14%, and 14.4% of all their non-compilable mutations, respectively. Thus, for LLMs, deletion is not the main cause of non-compilability.
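The manual error-type labeling could, in principle, be partially automated by bucketing compiler diagnostics. A minimal sketch, assuming javac-style message fragments; the category labels and regexes here are illustrative and are not the paper's exact Table 8 taxonomy:

```python
import re

# Illustrative rules mapping javac-style diagnostics to coarse error
# buckets; the labels only approximate the paper's Table 8 types.
RULES = [
    (re.compile(r"cannot find symbol.*method", re.S), "usage of unknown method"),
    (re.compile(r"cannot find symbol.*variable", re.S), "usage of unknown variable"),
    (re.compile(r"'\}' expected|reached end of file|illegal start", re.S),
     "code structure destruction"),
    (re.compile(r"incompatible types", re.S), "type mismatch"),
]

def classify(diagnostic: str) -> str:
    """Return the first matching error bucket, or 'other'."""
    for pattern, label in RULES:
        if pattern.search(diagnostic):
            return label
    return "other"

print(classify("error: cannot find symbol\n  symbol: method fooBar()"))
# -> usage of unknown method
```

A rule-based pass like this can only triage; ambiguous diagnostics still need the kind of manual inspection the study performed.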