Comparing the Cost, Usability, and Result Diversity of Mutation Testing Techniques
Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);
(3) YouFang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);
(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);
(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).
Table of Links
Abstract and 1 Introduction
2 Background and Related Work
3 Study Design
3.1 Overview and Research Questions
3.2 Datasets
3.3 Mutation Generation via LLMs
3.4 Evaluation Metrics
3.5 Experimental Settings
4 Evaluation Results
4.1 RQ1: Performance on Cost and Usability
4.2 RQ2: Behavior Similarity
4.3 RQ3: Impacts of Different Prompts
4.4 RQ4: Impacts of Different LLMs
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
5 Discussion
5.1 Sensitivity to the Chosen Experimental Settings
5.2 Generalizability
5.3 Threats to Validity
6 Conclusion and References
4 Evaluation Results
We conducted our experimental analysis on two cloud servers, each equipped with two NVIDIA GeForce RTX 3090 Ti GPUs, 168 GB of memory, and a 64-core Intel(R) Xeon(R) Platinum 8350C CPU.
4.1 RQ1: Performance on Cost and Usability
4.1.1 Cost. For each technique, we record the generation time and the monetary cost of using APIs or GPU servers, shown as the cost rows of Table 4. In addition, we also report the number of mutations and the mutation score, although these do not fully reflect the overall quality of the mutant set.
From the table, we can see that GPT-3.5 and CodeLlama-13B-Instruct generate 351,557 and 303,166 mutations[1], with mutation scores of 0.731 and 0.734, respectively. The traditional approaches PIT and Major achieve the lowest mutation scores of all methods.
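As a reminder of how the mutation scores above are computed, here is a minimal sketch. The function name and the handling of equivalent mutants are our own assumptions, not the paper's implementation:

```python
def mutation_score(killed: int, total: int, equivalent: int = 0) -> float:
    """Mutation score: fraction of (non-equivalent) mutants killed by the test suite."""
    denominator = total - equivalent
    if denominator <= 0:
        raise ValueError("no killable mutants")
    return killed / denominator

# Hypothetical illustration: 731 of 1,000 mutants killed gives a score
# of 0.731, the order of magnitude reported for GPT-3.5 above.
print(round(mutation_score(731, 1000), 3))  # 0.731
```

Definitions differ on whether known-equivalent mutants are excluded from the denominator; the optional `equivalent` parameter covers the stricter variant.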
In terms of time cost, GPT-3.5 and CodeLlama-13B take 1.79s and 9.06s per mutation, respectively, while LEAM, 𝜇Bert, PIT, and Major take 3.06s, 2.23s, 0.017s, and 0.083s, respectively, indicating that traditional methods generate mutations faster than LLM-based approaches. Regarding monetary cost, we only record the API cost for GPT-3.5 and the cloud-server cost for running CodeLlama and LEAM, because the remaining approaches barely need GPU resources. The results indicate that renting cloud servers to run open-source models costs more than using the API of GPT-3.5.
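To put the per-mutation times in perspective, a quick back-of-the-envelope calculation (the helper name and the hour conversion are ours; the per-mutation seconds come from the figures above):

```python
# Per-mutation generation time in seconds, as reported above.
seconds_per_mutation = {
    "GPT-3.5": 1.79,
    "CodeLlama-13B": 9.06,
    "LEAM": 3.06,
    "muBERT": 2.23,
    "PIT": 0.017,
    "Major": 0.083,
}

def total_hours(tool: str, n_mutations: int) -> float:
    """Total generation time in hours for n_mutations with the given tool."""
    return seconds_per_mutation[tool] * n_mutations / 3600
```

For example, generating GPT-3.5's 351,557 mutations at 1.79s each works out to roughly 175 hours of generation time, versus about 8 hours for the same count with Major.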
4.1.2 Usability of mutations. The usability of mutations (i.e., whether the mutations can be used to compute mutation scores) reflects the LLMs' ability to properly understand the task and generate compilable, non-duplicate, and non-equivalent mutations. The results are shown as the usability rows in Table 4. Because of the potential threat of data leakage, we present the Defects4J and ConDefects data separately. Note that PIT mutates at the Java bytecode level and its mutations cannot be mapped back to the source level, so we skip PIT when answering this RQ.
Compilability rate: We observe that GPT-3.5 achieves a compilability rate of 61.7% on Defects4J and 79.6% on ConDefects, while CodeLlama-13B achieves a higher compilability rate of 75.8% on Defects4J but 73.3% on ConDefects. The rule-based approach Major achieves the highest compilability rate (i.e., 98.3% on Defects4J and 92.8% on ConDefects). Note that although Major works under simple syntactic rules, it can still create non-compilable mutations. For example, the Java parser rejects constructs such as (true)
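The compilability check can be sketched as follows. The study compiles Java mutants with javac; as a self-contained analog we use Python's built-in `compile()`, and the function names are our own:

```python
def compiles(source: str) -> bool:
    """Return True if the mutated source still parses/compiles.
    Python analog of the study's javac-based check on Java mutants."""
    try:
        compile(source, "<mutant>", "exec")
        return True
    except SyntaxError:
        return False

def compilability_rate(mutants: list[str]) -> float:
    """Fraction of mutants that compile."""
    return sum(compiles(m) for m in mutants) / len(mutants)

# Two toy mutants: the second drops an operand and fails to parse.
mutants = ["x = a + b", "x = a + "]
print(compilability_rate(mutants))  # 0.5
```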
Duplicate mutation rate: We also observe that LLMs generate a large portion of duplicate and thus useless mutations. For example, on Defects4J, GPT-3.5 generates 11.6% duplicate mutations, which represent 18.8% of all compilable mutations (i.e., 11.6%/61.7%). In contrast, LEAM, 𝜇Bert, and Major rarely generate duplicate mutations, most likely because these methods are governed by strict generation rules that limit the production of redundant mutations. For example, Major applies a different mutation operator at each application, which does not produce any redundant mutations.
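A minimal sketch of how such duplicates could be counted. The whitespace normalization and function names are our own assumptions; the paper does not spell out its deduplication procedure:

```python
import re

def normalize(mutant: str) -> str:
    """Collapse whitespace so textually identical mutants compare equal."""
    return re.sub(r"\s+", " ", mutant).strip()

def duplicate_rate(mutants: list[str]) -> float:
    """Fraction of mutants that duplicate an earlier one (after normalization)."""
    seen, dups = set(), 0
    for m in mutants:
        key = normalize(m)
        if key in seen:
            dups += 1
        else:
            seen.add(key)
    return dups / len(mutants)

# The second mutant differs only in spacing, so 1 of 4 is a duplicate.
mutants = ["return a + b;", "return  a + b;", "return a - b;", "return a * b;"]
print(duplicate_rate(mutants))  # 0.25
```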
Equivalent mutation rate: Equivalent mutations change the source code syntactically without affecting the program's behavior, and thus harm the accuracy of mutation score computation. Due to the undecidable nature of equivalent-mutant identification, it is impossible to design an algorithm that can detect them completely. Therefore, we sample a subset of mutations for each technique and manually assess whether they are equivalent. To maintain a 95% confidence level and a 5% margin of error, we sampled a fixed number of mutations from each approach's output to calculate the equivalent mutation rate. Following the protocol of existing studies [26, 31], two of the authors, each with more than five years of Java programming experience, first received training on equivalent mutants and then labeled the mutations independently. The final Cohen's Kappa score [75] was above 90%, indicating a high level of agreement. Table 5 shows the number of sampled mutations and the number of identified equivalent mutations.
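The sampling and agreement computations can be sketched as follows. The finite-population correction is a standard formula we assume here, since the paper does not spell out its sampling math, and the function names are ours:

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Sample size for estimating a proportion at ~95% confidence (z=1.96)
    and 5% margin of error, with finite-population correction."""
    n0 = z * z * p * (1 - p) / (margin * margin)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two annotators with binary labels (equivalent / not)."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    pa1, pb1 = sum(labels_a) / n, sum(labels_b) / n            # per-rater positive rates
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)                     # chance agreement
    return (po - pe) / (1 - pe)
```

For a pool the size of GPT-3.5's output (351,557 mutations), this formula gives a sample of 384 mutations, the familiar "about 385" figure for large populations.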
Note again that Major performs best at generating non-equivalent mutations. In particular, as shown in the bottom row of the usability section in Table 4, for GPT-3.5, 2.3% of the mutations on Defects4J are equivalent, while the rate is 2.1% on ConDefects. CodeLlama-13B shows lower rates, with 1.3% and 0.7% on Defects4J and ConDefects, respectively. LEAM generates 1.6% equivalent mutations on Defects4J and 1.0% on ConDefects. 𝜇Bert produces 2.1% on Defects4J and 3.1% on ConDefects. Major achieves the lowest rates, with only 0.5% on Defects4J and 0.8% on ConDefects.
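To make the notion of an equivalent mutant concrete, here is an illustrative Python analog of a classic case (the function names and the `<` to `!=` operator swap are our own example, not one taken from the study's data):

```python
def index_of_original(xs, target):
    i = 0
    while i < len(xs):          # original condition
        if xs[i] == target:
            return i
        i += 1
    return -1

def index_of_mutant(xs, target):
    i = 0
    while i != len(xs):         # mutant: `<` replaced by `!=`
        if xs[i] == target:
            return i
        i += 1
    return -1

# Because i only counts up from 0 in steps of 1, `i != len(xs)` and
# `i < len(xs)` behave identically here: no test input can kill this mutant.
for xs, t in ([1, 2, 3], 2), ([], 5), ([4], 9):
    assert index_of_original(xs, t) == index_of_mutant(xs, t)
```

Such mutants survive every test suite, which is why they inflate the denominator of the mutation score and must be sampled and labeled by hand.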
When comparing the Defects4J and ConDefects datasets with respect to the three usability aspects above, we observe that the metrics differ notably between them. This difference may be attributed to differences in the complexity and structure of the bugs in each dataset. Mutations on Defects4J may involve API invocations and global variables that are not explicitly mentioned in the prompts, which makes the context more complex. This complexity results in each of the metrics being lower on Defects4J than on ConDefects.
Figure 2 shows the results, revealing that the LLM-based mutation approaches introduce more new AST node types than the other approaches. GPT-3.5 exhibits the largest diversity, introducing 45 new AST node types and surpassing all other methods. The traditional approach introduces only two new AST node types, which is not surprising because its mutation operators are a fixed set of simple operations.
We also examine the AST node diversity of the examples that we provide in the LLM prompts (see Table 3), shown as "few-shot examples" in Figure 2. Interestingly, these examples introduce only one new AST node type. This indicates that LLMs are creative in generating mutations, producing mutations far more diverse than the provided examples.
We analyze the distribution of deletions and AST node types across the different approaches, as shown in Table 6, which highlights the most common AST node types. For deletions, GPT-3.5 and CodeLlama-13B generate 5.1% and 0.3% deletions, respectively. Nearly half of LEAM's mutations are deletions, which may be the cause of its many useless mutations. Major generates deletions only at restricted locations, yielding 17.6% deletions. 𝜇Bert replaces masked code elements with BERT predictions, and therefore rarely deletes.
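One way such a deletion classification could be automated is by diffing the original and mutated code and checking that the diff only removes lines. This is a sketch under our own assumptions (the paper does not describe its classifier), using Python's `difflib`:

```python
import difflib

def is_deletion(original_lines: list[str], mutant_lines: list[str]) -> bool:
    """Classify a mutation as a pure deletion: the line diff contains
    only 'delete' operations, no replacements or insertions."""
    ops = difflib.SequenceMatcher(None, original_lines, mutant_lines).get_opcodes()
    tags = {tag for tag, *_ in ops if tag != "equal"}
    return tags == {"delete"}

orig = ["a = 1", "log(a)", "return a"]
assert is_deletion(orig, ["a = 1", "return a"])                 # statement removed
assert not is_deletion(orig, ["a = 2", "log(a)", "return a"])   # replacement
```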
Regarding AST node types, the most common node type for all the learning-based approaches is a literal (e.g., 𝑎 > 𝑏 ↦→ true). However, for the rule-based approaches, the most common node type is a statement (e.g., replacement with an empty statement). Note that although these mutations are not produced by a deletion operator, a large portion of them yield empty statements, which leads to the large percentage.
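The literal-replacement pattern above (𝑎 > 𝑏 ↦→ true) can be sketched as a toy mutation operator. The class and function names are our own, and this is a Python analog of a Java-level operator, not the study's tooling:

```python
import ast

class CompareToTrue(ast.NodeTransformer):
    """Toy literal-replacement operator: rewrite every comparison
    (e.g. a > b) into the literal True."""
    def visit_Compare(self, node: ast.Compare) -> ast.AST:
        return ast.copy_location(ast.Constant(value=True), node)

def mutate(source: str) -> str:
    """Apply the operator and return the mutated source."""
    tree = CompareToTrue().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(mutate("if a > b:\n    m = a"))  # if True:\n    m = a
```

The resulting mutant's most common new node is a literal (`Constant` here), mirroring the pattern Table 6 reports for the learning-based approaches.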
[1] Although GPT-3.5 and CodeLlama-13B are instructed to generate the same number of mutations, they do not always follow the instructions, leading to different numbers of mutations.