Niklas Risse
Software Security Group, MPI-SP
Bochum, Germany

Marcel Böhme
Software Security Group, MPI-SP
Bochum, Germany
Abstract
According to our survey of the machine learning for vulnerability detection (ML4VD) literature published in the top Software Engineering conferences, every paper in the past 5 years defines ML4VD as a binary classification problem:
Given a function, does it contain a security flaw?
In this paper, we ask whether this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. A function is vulnerable if it was involved in a patch of an actual security flaw and confirmed to cause the vulnerability. It is non-vulnerable otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed.
But why do ML4VD techniques perform so well even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high accuracy can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high accuracy without actually detecting any security vulnerabilities.
We conclude that the current problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and program analysis research.
I Introduction
In recent years, the number of papers published on the topic of machine learning for vulnerability detection (ML4VD) has dramatically increased. Because of this rise in popularity, the validity and soundness of the underlying methodologies and datasets become increasingly important. So then, how exactly is the problem of ML4VD defined and thus evaluated?
In our survey of all 22 ML4VD papers published at the Top-4 Software Engineering conferences over the last five years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], we find that state-of-the-art ML4VD techniques exclusively define ML4VD as a binary classification problem: Given an isolated function, decide whether it contains a security vulnerability. The technique with the lowest classification error on the evaluation dataset is considered the best at detecting security vulnerabilities.
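In other words, the de facto problem statement is function-level binary classification. As a notational sketch (the notation is ours, not taken from the surveyed papers), a technique learns a classifier over isolated functions and is ranked by its empirical error on a held-out evaluation split:

    % Function-level vulnerability detection as binary classification.
    % D: benchmark of functions f_i with labels y_i (1 = vulnerable, 0 = secure).
    \[
      D = \{(f_i, y_i)\}_{i=1}^{n}, \quad y_i \in \{0,1\}, \qquad h: \mathcal{F} \to \{0,1\},
    \]
    \[
      \mathrm{err}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left[\, h(f_i) \neq y_i \,\right].
    \]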
However, based on our experience, we hypothesized that it might not always be possible to determine whether a function is vulnerable or not without additional context. We call these vulnerabilities context-dependent. Consider the example in Figure 1. If this function from the DiverseVul benchmark dataset [23] is called with num_splits set to zero, it will crash with a division-by-zero in Line 7. However, without knowing whether this function can ever be called with num_splits set to zero, we cannot reliably decide if the division-by-zero could actually be observed. This function parameter might as well be properly validated by every caller function. This context-dependency problem is well-known in static analysis [24, 25] and software testing [26].
In this paper, we set out to quantify the prevalence of context-dependent vulnerabilities in the most popular datasets and end up revealing a fundamental flaw in the most widely used evaluation methodology that is underpinning the progress of the nascent research area of ML4VD. We find that the vulnerability of a function cannot be decided without further context for more than 90% of functions. This holds for functions with either label, vulnerable or secure: a function labeled as secure would be considered vulnerable if the right context existed, and a function labeled as vulnerable is only considered vulnerable because the right context exists.
Given our findings, we conclude that the current problem statement of ML4VD as a function-level classification problem is inadequate. The reported results in the literature, which are based on this problem statement, may not accurately reflect the true capabilities of the evaluated techniques at the task of vulnerability detection. In other words, there is currently no evidence that ML4VD techniques are actually capable of identifying security vulnerabilities in functions.
But why do ML4VD techniques perform well at this binary classification task when there is demonstrably not enough information in over 90% of samples (even after addressing label inaccuracies)? We identify spuriously correlated features as a potential reason. By training simple models, such as a gradient boosting classifier that uses only word counts and disregards code structure, we achieved results comparable to those of state-of-the-art ML4VD models. This suggests that ML4VD techniques only appear to perform well due to the chosen evaluation methodology. During classification, ML4VD techniques rely on spuriously correlated features to achieve high scores and do not genuinely detect vulnerabilities.
To shift the field towards more context-aware evaluation of vulnerability detection methods, we discuss potential alternative problem statements and suggest ideas for future work. We also examine the broader implications for ML4VD and other fields. By discussing these aspects we aim to foster more valid and reliable research in the area of machine learning for vulnerability detection.
In summary, this paper makes the following contributions:
- We analyze all papers published at the Top-4 Software Engineering conferences over the last five years and find that state-of-the-art ML4VD techniques exclusively define ML4VD as a function-level classification problem.
- We reveal a fundamental flaw of the function-level classification problem: The vulnerability of a function cannot be decided without further context for more than 90% of functions in the top-most widely-used datasets.
- Why do ML4VD techniques still perform well at the function-level classification problem? We demonstrate that they may rely on spurious features to achieve high scores without genuinely detecting vulnerabilities.
- We publish all of our code and results for reproducibility. They are available at https://github.com/niklasrisse/TopScoreWrongExam.
II Background
The problem that the vulnerability of a function may depend on context that is external to the given function (i.e., the context-dependency problem) is well known in static analysis and software testing.
Static Analysis. We distinguish between inter-procedural and intra-procedural analysis, where the former is concerned with the analysis of the entire system and the latter with the analysis of individual functions. Intra-procedural static analysis tools often struggle with false positives due to context-dependency. For example, a static analyzer might flag potential bugs based on the analysis of code patterns but cannot always discern the specific conditions under which a bug manifests [27, 28]. Hence, Le et al. [24] propose to differentiate between manifest bugs, which are context-independent, and latent bugs, which depend on preconditions in the calling context. Manifest bugs can be reported without making strong assumptions about the calling context, whereas latent bugs cannot. To address these challenges, tools like PhASAR [25] leverage an inter-procedural analysis to detect security vulnerabilities that depend on external context.
Software Testing. We distinguish between system-level testing and unit-level testing, where the former is concerned with test cases for the entire system and the latter with test cases for individual system units, like functions. Automated unit test generation often falls short of identifying issues that only occur under specific conditions or in particular environments. As noted by Harrold and Orso [29], unit tests can produce false positives when they are not adequately designed to account for the broader context in which a function operates. This can lead to an inflated number of reported issues that are not actual bugs, thereby complicating the debugging process. Hence, property-based testing [30, 31, 32] requires users to define function preconditions in addition to assertions to ensure that the assertions fire only under valid function parameters (i.e., under valid context). To summarize, both the static analysis and software testing research communities acknowledge the problem of context-dependency and address it through various techniques aimed at reducing the number of false positives.
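To make the role of such preconditions concrete, the following minimal property-based test encodes the valid calling context via an explicit assumption (an illustrative Python sketch using the Hypothesis library; split_into_chunks is a hypothetical function, not one from the surveyed work):

    # Minimal sketch: a property-based test with an explicit precondition.
    # split_into_chunks is a hypothetical function used only for illustration.
    from hypothesis import given, assume, strategies as st

    def split_into_chunks(length: int, num_splits: int) -> int:
        # Chunk size; crashes with a division-by-zero if num_splits == 0.
        return length // num_splits

    @given(st.integers(min_value=0, max_value=10_000), st.integers())
    def test_chunk_size_is_bounded(length, num_splits):
        assume(num_splits > 0)  # precondition: only valid calling contexts
        assert split_into_chunks(length, num_splits) <= length

Without the assume precondition, the test would report the division-by-zero even for calling contexts that the surrounding program may never produce, mirroring the context-dependency problem described above.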
Machine Learning (ML4VD). Sejfia et al. [4] realize that a focus on individual base units (e.g., functions) prevents us from learning to find vulnerabilities that emerge from the interaction of multiple base units (MBU). Given only vulnerabilities that required multiple functions to be fixed (called MBU vulnerabilities), they observe a drastic drop in detection accuracy of the evaluated ML4VD models. Croft et al. [33] study the quality of benchmark datasets in the ML4VD literature and, as part of their manual re-labeling policy, decide that a function is conservatively also considered as vulnerable if it invokes a function that is known to be vulnerable, which raises an interesting context-dependency question for us. They find that a large percentage of functions in the most widely used benchmarks are incorrectly labeled as vulnerable. In this paper, we reproduce their experiment to identify and study actually vulnerable functions in terms of their context-dependence. We study the prevalence of ML4VD as a binary classification problem separately in the next section.
However, to the best of our knowledge, no work exists that studies whether the vulnerability of the base units in the available datasets can be decided without further context in the first place. Addressing this gap, our paper aims to uncover the shortcomings of current evaluation methodologies and emphasizes the urgent need for more context-aware approaches to accurately assess the true capabilities of ML4VD techniques in detecting security vulnerabilities.
Spurious features. Arp et al. [34] provide initial evidence that machine learning techniques can learn to predict labels correctly based on artifacts in code snippets without addressing the actual security task at hand (spurious correlation). Similarly, Risse et al. [35] demonstrate that ML4VD techniques overfit to label-unrelated features by utilizing semantic preserving transformations of code. Building on this, our work further explores the extent to which these techniques depend on spurious features, suggesting that the reliance on such features could provide an alternative explanation for the high accuracy scores reported in the literature.
III Literature Survey
Based on our prior knowledge of the literature, we hypothesized that the majority of recent studies in machine learning for vulnerability detection (ML4VD) define ML4VD as a binary classification problem: given a function, determine whether the function contains a security vulnerability. To determine the prevalence of this approach in the ML4VD literature, we conducted a literature survey. Our goal was to address the following two research questions:
- 1.
Problem Statement: What proportion of ML4VD publications defines ML4VD as deciding whether a given function contains a vulnerability?
- 2.
Datasets: Which datasets do they use to evaluate their techniques empirically?
III-A Methodology
To ensure a thorough and unbiased analysis of the recent literature on machine learning for vulnerability detection (ML4VD) in the field of software engineering, we adopted a systematic methodology based on established guidelines for conducting literature reviews in software engineering research [36, 37]. Our approach encompassed three primary phases: defining the scope, selecting relevant papers, and systematically analyzing the selected papers.
Scope. Our literature survey focuses on the past five years, targeting publications from 2020 to 2024. Given the prominence and impact of certain venues in the field of software engineering, we selected the four top-tier conferences—International Conference on Software Engineering (ICSE), Foundations of Software Engineering (FSE), International Symposium on Software Testing and Analysis (ISSTA), and Automated Software Engineering (ASE). These conferences were chosen due to their reputation for publishing high-quality and influential research.
Criteria for Paper Selection. To identify relevant papers, we established clear selection criteria. The papers must either propose a novel ML4VD technique or evaluate existing ML4VD techniques. To facilitate the identification process, we defined a set of keywords that are indicative of the research focus in this area: vulnerability, vulnerable, detect, detection, discovery, machine, learning, artificial, intelligence, AI, ML, deep, graph, neural, network, large, language, model.
Process. We conducted a systematic search of the accepted papers lists from the targeted conferences. For each conference, we utilized the Safari browser search function to scan the titles of all accepted papers for the presence of our predefined keywords. Any title containing one of the keywords was flagged for further review.
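The keyword screening itself is mechanical and can equivalently be expressed in a few lines of code. The following Python sketch illustrates the filter (the file name and title list are hypothetical; the actual screening was performed manually via the browser search, and flagged titles were always reviewed by hand):

    # Illustrative sketch of the keyword-based title filter.
    KEYWORDS = {
        "vulnerability", "vulnerable", "detect", "detection", "discovery",
        "machine", "learning", "artificial", "intelligence", "ai", "ml",
        "deep", "graph", "neural", "network", "large", "language", "model",
    }

    def flag_title(title: str) -> bool:
        # Substring matching mirrors a browser text search; it over-approximates
        # and therefore requires the manual review described next.
        t = title.lower()
        return any(keyword in t for keyword in KEYWORDS)

    # Hypothetical usage with a plain-text list of accepted-paper titles:
    # with open("icse_2024_accepted_titles.txt") as f:
    #     flagged = [line.strip() for line in f if flag_title(line)]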
In the next step, we filtered the flagged titles to specifically identify those that propose or evaluate machine learning techniques for vulnerability detection (ML4VD). We manually checked each of these titles and, where necessary, reviewed the abstracts and full texts to assess their relevance based on our selection criteria.
To answer research questions 1) and 2), we documented how ML4VD is defined, and the datasets used for empirical evaluation within each identified ML4VD paper.
III-B Results
Our literature survey identified 22 papers that met our selection criteria and were included in our analysis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. Note that accepted papers for ASE 2024 have not been released at the submission time of this paper, which is why they are not included in the results.
The distribution of these papers across the years and conferences, as displayed in Figure 2, indicates a clear trend: the number of papers focused on ML4VD is increasing annually. This upward trajectory underscores the growing interest and significance of this research area within the software engineering community. Additionally, we found a large variety of machine learning techniques being employed, utilizing different data representations (graph-based, token-based), model architectures (e.g., large language models, graph neural networks, convolutional neural networks), and learning algorithms (e.g., contrastive learning).
Problem Statement. To our surprise, all 22 papers define ML4VD as a binary function-level classification problem. This problem statement, which focuses on determining whether a given function contains a security vulnerability, has become the de facto standard in the field.
Datasets. Regarding datasets, the majority of the papers relied on a limited set of popular datasets for empirical evaluation. Figure 3 shows all datasets that were used by the 22 papers and the number of times they were used. Specifically, BigVul [38] was used by 16 papers, Devign [39] was used by 14 papers, and ReVeal [40] by 10 papers. Notably, 21 out of the 22 papers utilized at least one of these three datasets, reflecting their dominance in the field.
IV Empirical Study Design
We study how prevalent context-dependent vulnerabilities are in the top-most widely-used benchmarks for ML4VD and evaluate the degree to which we can correctly classify even when the vulnerable code is hidden, i.e., even when only coarse feature values such as word counts are available to a simple classifier. More generally, we are interested in the threats to validity of the benchmarking methodology that is used most widely in ML4VD research: Are empirical claims from benchmarking results about the real-world performance of ML4VD techniques actually reliable?
Specifically, we ask the following research questions:
RQ.1 Can we cast vulnerability detection as a function-level binary classification problem in ML4VD?
- (a)
Noisy Labels. What proportion of functions labeled as vulnerable actually contain security vulnerabilities? Before we can study the prevalence of context-dependency, we first address the noisy-label problem [33] and identify those functions that are actually vulnerable.
- (b)
Context-dependent Vulnerability. What proportion of vulnerable functions would not be vulnerable if the appropriate external context did not exist? Given a function that is actually vulnerable (because it is later fixed to remove a vulnerability), how often can we decide vulnerability based on the function’s code alone?
- (c)
Context-dependent Security. What proportion of non-vulnerable functions could be vulnerable if an appropriate external context existed? Given a function that is not vulnerable within the context of this program, how often can we find a setting in which this function would be considered to contain a vulnerability?
RQ.2 Can we achieve a high classification performance even when the root cause of the vulnerability is hidden? How can we explain the excellent performance of ML4VD on these widely-used benchmarks despite this severe flaw in the problem statement? Do the popular datasets contain properties that ML4VD techniques can exploit to achieve high accuracy without actually detecting vulnerabilities?
IV-A Methodology
Selection Criteria. In order to select datasets that represent the state-of-the-art of ML4VD benchmarks, we chose the top-most widely-used datasets from the software engineering literature (via our literature survey) for which the authors provide links to patch commits and/or CVE websites (required for manual labeling). Additionally, we reviewed unpublished literature for emerging datasets and included all publicly available datasets with more than 50 citations in 2024.
Selected datasets. Based on our selection criteria, we chose the BigVul [38], Devign [39], and DiverseVul [23] datasets. The selection of BigVul and Devign resulted from our literature survey, as these datasets were the most popular among the 22 ML4VD papers we analyzed. However, we excluded ReVeal [40] from our study since the authors did not publish patch commit IDs or CVEs, which are necessary for determining vulnerability and context dependence. Additionally, we included the DiverseVul dataset [23], a recently published dataset (2023) that has seen significant use in many yet unpublished papers based on our review of unpublished literature. Figure 4 illustrates the citations of the selected datasets measured by Google Scholar on July 31, 2024. All three datasets are becoming increasingly popular, with each having more than 50 citations in 2024 alone.
Sampling. From each of the three datasets, we randomly selected 100 samples labeled as vulnerable, using a sample size inspired by related studies on data quality, such as the one by Croft et al., which used a sample size of 70 [33]. We have published the reproducible script and the sampled functions as part of our artifact, which is available at https://github.com/niklasrisse/TopScoreWrongExam. The resulting 300 functions come from patch commits published between 2010 and 2022, covering 80 unique open-source projects (BigVul: 25, Devign: 2, DiverseVul: 60). Figure 6 illustrates the size of these functions across the three datasets. While we did not formally categorize the vulnerability types, we frequently observed issues such as out-of-bounds writes/reads, improper restriction of operations within memory buffer bounds, improper input validation, and use-after-free vulnerabilities.
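For illustration, a reproducible sampling step of this kind can be expressed as follows (a minimal Python sketch; the column names, file name, and seed are assumptions made for this illustration and may differ from the script in our artifact):

    # Illustrative sketch: reproducibly sample 100 functions labeled as
    # vulnerable. The column names ("func", "target") and the seed are
    # assumptions made only for this illustration.
    import pandas as pd

    def sample_vulnerable(df: pd.DataFrame, n: int = 100, seed: int = 0) -> pd.DataFrame:
        vulnerable = df[df["target"] == 1]          # keep only the vulnerable label
        return vulnerable.sample(n=n, random_state=seed)

    # Hypothetical usage:
    # devign = pd.read_json("devign.jsonl", lines=True)
    # sample = sample_vulnerable(devign, n=100, seed=0)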
Labeling Process. In a time-intensive process (more than 120 hours), one of the authors of our paper (a software engineering researcher) reviewed each of the 300 functions. The process involved opening the corresponding patch commit on GitHub to understand the vulnerability. If available, we also reviewed the CVE (Common Vulnerabilities and Exposures) report for additional context. For full transparency, we release all labels and explanations we generated as part of our artifact available at https://github.com/niklasrisse/TopScoreWrongExam.
Figure 5 visualizes the complete process that we employed for each individual function. In all cases where the patch commit was available and a decision could be made within 15 minutes, we assigned one of the following labels:
Secure (0): The function was not the source of the security vulnerability, or there was no vulnerability addressed by the patch commit.
Vulnerable (1): The function was the source of the security vulnerability.
Additionally, we added a short explanation in natural language to justify the assigned label, which we used to analyze the results more in-depth. To ensure the accuracy of this first labeling step, 200 out of the 300 functions were independently cross-labeled by another software engineering researcher. The two raters achieved a Cohen's kappa value of 0.7 [41] (85% agreement), which implies substantial agreement according to the guidelines provided by Landis and Koch [42]. All cases of disagreement were resolved through discussion and resulted in a single label/explanation for each function.
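For reference, Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance; plugging in the reported values implies a chance agreement of roughly 0.5, consistent with an approximately balanced label distribution between the two raters:

    % Cohen's kappa and the chance agreement implied by the reported values.
    \[
      \kappa = \frac{p_o - p_e}{1 - p_e}
      \qquad\Longrightarrow\qquad
      p_e = \frac{p_o - \kappa}{1 - \kappa} = \frac{0.85 - 0.7}{1 - 0.7} = 0.5 .
    \]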
For functions labeled as vulnerable (indicating they actually were the source of the vulnerability), we conducted a second round of labeling to determine context dependency:
Context-independent (0): The vulnerability could be detected without considering any additional context beyond the function itself. These vulnerabilities are self-contained within the function’s code.
Context-dependent (1): The security vulnerability cannot be accurately identified without considering additional context beyond the function. This category includes vulnerabilities that rely on external functions, global variables, or interactions with other parts of the codebase.
Again, we provided a short explanation in natural language to justify the context label.
Reproducibility. To ensure the reproducibility of our empirical study and to provide transparency in our research, we have made all related scripts and data publicly available. This includes all of the intermediate and final results of our empirical study. All resources can be accessed at https://github.com/niklasrisse/TopScoreWrongExam.
V Results
The goal of our empirical study was to investigate whether vulnerability detection as a function-level binary classification problem is an adequate problem statement for ML4VD.
RQ.1-a Noisy Labels
The first decision for each function included in our empirical study was to determine whether the function, in fact, contains a security vulnerability. This is necessary since one of the main goals of our study is to determine the proportion of actually vulnerable functions that are dependent on additional context. From Croft et al. [33], we know that for popular datasets only a subset of functions labeled as vulnerable actually contains security vulnerabilities. Specifically, they found at least 20% of labels for the Devign dataset and 45.7% of labels for the BigVul dataset to be inaccurate.
Table I: Reasons for label inaccuracy among the functions falsely labeled as vulnerable.

| Dataset    | Patch Commit Identification | Structural Changes | Unrelated Changes |
|------------|-----------------------------|--------------------|-------------------|
| BigVul     | 9 (16%)                     | 29 (51%)           | 19 (33%)          |
| Devign     | 31 (68%)                    | 14 (30%)           | 1 (2%)            |
| DiverseVul | 1 (3%)                      | 20 (67%)           | 9 (30%)           |
Results. Figure 7 shows the results of our empirical study. For RQ.1-a), we are interested in the proportion of functions labeled as vulnerable that actually contain security vulnerabilities, which are displayed in the left part of each subfigure (vulnerable vs. secure). Out of the 100 functions per dataset that were originally labeled as vulnerable, we found only 38%-64% to actually contain security vulnerabilities. 30%-57% do not contain security vulnerabilities and were therefore labeled to be secure. For 2%-4%, we were not able to make a decision after 15 minutes, and for 1%-4%, we could find neither the patch commit nor the CVE report (not available).
Based on our evidence, we can confirm the findings of Croft et al. [33]. The differences in label accuracy, especially for the Devign dataset (80% accurate labels found by them vs. 50% found by us), might be explained by two reasons: First, Croft et al. considered label accuracy for all functions, and secure functions are more likely correctly labeled. In RQ.1-a, we only establish label accuracy for functions labeled as vulnerable. Second, Croft et al. establish label accuracy more conservatively than we do: a function can be considered correctly labeled as vulnerable if it calls a function known to be vulnerable. In this case, we decided that only the actually vulnerable function should be labeled as vulnerable, because the calling function would not be vulnerable if the actually vulnerable function were fixed.
Table II: Most prevalent types of context dependence among the actually vulnerable functions.

| Dataset    | Function Argument | External Function | Type Declaration | Globals | Execution Environment |
|------------|-------------------|-------------------|------------------|---------|-----------------------|
| BigVul     | 16 (42%)          | 19 (50%)          | 1 (3%)           | 2 (5%)  | 0 (0%)                |
| Devign     | 23 (47%)          | 21 (43%)          | 0 (0%)           | 5 (10%) | 0 (0%)                |
| DiverseVul | 25 (39%)          | 34 (53%)          | 2 (3%)           | 1 (2%)  | 2 (3%)                |
Based on our quantitative results and the explanations we generated for each of the functions, we performed an in-depth analysis to determine potential reasons for the label inaccuracy we observed. The results are displayed in Table I.
Patch Commit Identification. The first reason for label inaccuracy is errors during the process of identifying patch commits. From the original papers [39, 23, 38], we know that all three datasets start their data collection process by identifying vulnerability-patching commits in popular open-source software repositories. However, the Devign dataset identifies these commits only by filtering the commit messages for security-related keywords. We observe that 68% of falsely labeled functions in our sample of the Devign dataset originate from this automatic identification process. The triangulation via other data sources seems to address this source of inaccuracy for the other datasets.
Figure 8a: Function mwifiex_update_vs_ie from the BigVul dataset.

     1  static int mwifiex_update_vs_ie(const u8 *ies, int ies_len,
     2                                  struct mwifiex_ie **ie_ptr, u16 mask,
     3                                  unsigned int oui, u8 oui_type)
     4  {
     5      struct ieee_types_header *vs_ie;
     6      struct mwifiex_ie *ie = *ie_ptr;
     7      const u8 *vendor_ie;
     8
     9      vendor_ie = cfg80211_find_vendor_ie(oui, oui_type, ies, ies_len);
    10      if (vendor_ie) {
    11          if (!*ie_ptr) {
    12              *ie_ptr = kzalloc(sizeof(struct mwifiex_ie),
    13                                GFP_KERNEL);
    14              if (!*ie_ptr)
    15                  return -ENOMEM;
    16              ie = *ie_ptr;
    17          }
    18
    19          vs_ie = (struct ieee_types_header *)vendor_ie;
    20          memcpy(ie->ie_buffer + le16_to_cpu(ie->ie_length),
    21                 vs_ie, vs_ie->len + 2);
    22          le16_unaligned_add_cpu(&ie->ie_length, vs_ie->len + 2);
    23          ie->mgmt_subtype_mask = cpu_to_le16(mask);
    24          ie->ie_index = cpu_to_le16(MWIFIEX_AUTO_IDX_MASK);
    25      }
    26
    27      *ie_ptr = ie;
    28      return 0;
    29  }
Figure 8b: Function ReadMenu.

     1  PP_Flash_Menu* ReadMenu(int depth,
     2                          const IPC::Message* m,
     3                          PickleIterator* iter) {
     4    if (depth > kMaxMenuDepth)
     5      return NULL;
     6    ++depth;
     7
     8    PP_Flash_Menu* menu = new PP_Flash_Menu;
     9    menu->items = NULL;
    10
    11    if (!m->ReadUInt32(iter, &menu->count)) {
    12      FreeMenu(menu);
    13      return NULL;
    14    }
    15
    16    if (menu->count == 0)
    17      return menu;
    18
    19    menu->items = new PP_Flash_MenuItem[menu->count];
    20    memset(menu->items, 0, sizeof(PP_Flash_MenuItem) * menu->count);
    21    for (uint32_t i = 0; i < menu->count; ++i) {
    22      if (!ReadMenuItem(depth, m, iter, menu->items + i)) {
    23        FreeMenu(menu);
    24        return NULL;
    25      }
    26    }
    27    return menu;
    28  }
Figure 8c: Function setup_server_realm from the BigVul dataset.

     1  setup_server_realm(krb5_principal sprinc)
     2  {
     3      krb5_error_code kret;
     4      kdc_realm_t *newrealm;
     5
     6      kret = 0;
     7      if (kdc_numrealms > 1) {
     8          if (!(newrealm = find_realm_data(sprinc->realm.data,
     9                                           (krb5_ui_4) sprinc->realm.length)))
    10              kret = ENOENT;
    11          else
    12              kdc_active_realm = newrealm;
    13      }
    14      else
    15          kdc_active_realm = kdc_realmlist[0];
    16      return(kret);
    17  }
Figure 8d: Function jpc_qmfb_join_col from the Devign dataset (abridged).

     1  void jpc_qmfb_join_col(jpc_fix_t *a, int numrows, int stride,
     2                         int parity)
     3  {
     4
     5      int bufsize = JPC_CEILDIVPOW2(numrows, 1);
     6  #if !defined(HAVE_VLA)
     7      jpc_fix_t joinbuf[QMFB_JOINBUFSIZE];
     8  #else
     9      jpc_fix_t joinbuf[bufsize];
    10  #endif
    11      jpc_fix_t *buf = joinbuf;
    12
    13      // [...]
    14
    15  }
Structural Changes. The second reason for label inaccuracy is structural changes. All three datasets included in our study assume that all functions changed by a vulnerability-patching commit were vulnerable before the patch was applied. However, according to our results, only a subset of the functions changed by a patch commit is actually vulnerable, while other functions may only be changed to accommodate structural changes that are a consequence of fixing the actually vulnerable function. For instance, fixing a buffer overflow may require adding a buffer size parameter to the function call wherever the function is called. This can lead to false labels if all functions changed by the patch commit are considered to be vulnerable before the patch was applied. In fact, 30%-67% of falsely labeled functions can be attributed to structural changes.
Unrelated Changes. The third reason for label inaccuracy is other unrelated changes to functions in vulnerability patch commits. These include stylistic changes, e.g., removing whitespace or adding comments. According to our study, 2%-33% of falsely labeled functions can be attributed to such unrelated changes.
Croft et al. [33] also investigate reasons for label inaccuracy and list irrelevant code changes (our structural changes), inaccurate fix identification (our patch commit identification), and clean-up changes (our unrelated changes). Based on our evidence, we can confirm these findings.
RQ.1-b Context-Dependent Vulnerability
The main goal of our empirical study was to find out what proportion of the functions labeled as vulnerable in the top-most widely-used datasets actually can be detected without considering additional context. In other words, what proportion of vulnerable functions would not be vulnerable if the appropriate external context did not exist?
Results. Each of the actually vulnerable functions that resulted from the first step of our empirical study (RQ.1 (a)) was assigned one of two labels: context-independent or context-dependent. Figure 7 shows the results of this second labeling round. To our surprise, all 152 vulnerable functions in our study required additional context to be accurately identified (context-dependent). Not a single function could be detected without considering any additional context beyond the function itself (context-independent).
Based on the explanations we generated for each of the functions, we performed an in-depth analysis and identified the most prevalent types of context dependence in our sample. The results are shown in Table II.
Figure 9: Artificially constructed context in which the function HTTP_Clone from the DiverseVul dataset contains a null pointer dereference.

    #include <stdio.h>
    #include <stdlib.h>

    struct http {
        int vsl;
        int ws;
    };

    void HTTP_Dup(struct http *to, const struct http *fm) {
        return;
    }
    void HTTP_Clone(struct http *to, const struct http *fm) {
        HTTP_Dup(to, fm);
        to->vsl = fm->vsl;
        to->ws = fm->ws;
    }

    int main() {
        struct http *source = NULL;
        struct http destination;

        HTTP_Clone(&destination, source);

        return 0;
    }
Dependence on External Functions. The first and most prevalent type of dependence we identified is dependence on external functions. Consider the example in Figure 8a. The heap-based buffer overflow in lines 20-21 of this function from the BigVul dataset depends on the external function cfg80211_find_vendor_ie. Without knowing this external function, we do not know what values vs_ie can have, and consequently, we do not know whether the buffer overflow can ever be triggered. In our empirical study, 43%-53% of context-dependent vulnerabilities can be attributed to dependence on external functions.
Dependence on Function Arguments. The second type of dependence we identified is dependence on function arguments. Consider the example in Figure 8c. The null-pointer-dereference vulnerability in line 8 of this function from the BigVul dataset depends on the function argument sprinc. However, without knowing the context in which this function can be called, we do not know whether sprinc can ever be NULL. For example, it could be properly validated before passing it to the function in all cases where this function is actually called. In that case, there would be no vulnerability. In our empirical study, 39%-47% of context-dependent vulnerabilities can be attributed to dependence on function arguments.
Dependence on Type Declarations. The third type of dependence we identified is dependence on type declarations. Consider the example in Figure 8b. The integer overflow vulnerability in line 20 depends on the type declaration of PP_Flash_MenuItem. Only if sizeof(PP_Flash_MenuItem) * menu->count exceeds the range of menu->items does the operation overflow. In our empirical study, 0-3% of context-dependent vulnerabilities can be attributed to dependence on type declarations.
Dependence on Globals. The fourth type of dependence we identified is dependence on globals, such as macros and global variables. Consider the example in Figure 8d. Line 7 of this function from the Devign dataset is the cause of multiple buffer overflows in the JasPer repository (CVE-2014-8158). If HAVE_VLA is not defined and QMFB_JOINBUFSIZE is smaller than bufsize but still used without successful dynamic allocation, any attempt to use joinbuf for storing data will result in writing beyond its allocated size (QMFB_JOINBUFSIZE). In our empirical study, 2-10% of context-dependent vulnerabilities can be attributed to dependence on globals.
Dependence on the Execution Environment. The fifth type of dependence we identified is dependence on the execution environment. For example, a vulnerability may depend on the fulfillment of specific conditions in the system of the user (e.g., the presence of a file in a directory) outside of the code. In our empirical study, 0-3% of context-dependent vulnerabilities can be attributed to dependence on the execution environment.
RQ.1-c Context-dependent Security
Inspired by our finding that the vulnerability of all 152 vulnerable functions in our empirical study could not be detected without considering additional context, we were also interested in whether the same is true for secure functions. In other words, what proportion of non-vulnerable functions could be vulnerable if an appropriate external context existed?
Methodology. From each of the three datasets (BigVul, Devign, and DiverseVul), we randomly selected 30 samples that were labeled as secure. We publish the reproducible script and the sampled functions as part of our artifact. To find out what proportion of these functions could be vulnerable if an appropriate external context existed, we tried to construct an artificial vulnerable context for each of them. Consider the example in Figure 9. In this artificially crafted context, the function HTTP_Clone contains a null pointer dereference vulnerability. However, it is labeled as 'secure' in the DiverseVul dataset because in its original context (https://github.com/varnishcache/varnish-cache/commit/c5fd097e) it does not contain a security vulnerability. Without knowing the context, we cannot decide whether this function is secure. Similar to this example, we manually tried to construct a vulnerable context for all 90 functions in our sample. For full transparency, we include the generated vulnerable settings (as explanations in natural language) in our artifact.
Results. For 82 out of the 90 functions that were labeled as secure in the original datasets, we were able to construct a context in which they contain a security vulnerability.
RQ.1 Result Summary
Within our sample of functions, about half of those labeled as vulnerable did not actually contain security vulnerabilities. All of those that do contain security vulnerabilities are context-dependent, i.e., these functions are vulnerable only because an appropriate context exists under which the function is vulnerable. Similarly, the majority of functions labeled as secure would be considered vulnerable if an appropriate context existed. As there is evidently insufficient information in the base units—that are used for training, validation, and testing in these datasets—to decide their vulnerability without further context, we conclude that ML4VD cannot be soundly evaluated as a classic binary classification problem on the function-level.
RQ.2 Classifier Performance on Spurious Features
Since it is impossible for most functions to decide without further context whether they contain a vulnerability, there is currently no evidence that ML4VD techniques are actually capable of identifying security vulnerabilities in functions. But why do ML4VD papers still report high accuracy when evaluating their techniques using function-level datasets? What happens if we hide the code and only expose some features of the code, such as word counts?
Methodology. According to the CodeXGLUE benchmark for vulnerability detection [43], the state-of-the-art reported accuracy for the Devign dataset is 69%, achieved by the UniXcoder technique. UniXcoder [44] is a publicly available (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) large language model (LLM), pre-trained on 2.3 million functions and finetuned on the training subset of the Devign dataset for the task of vulnerability detection. During our initial experiments, we were able to reproduce the 69% accuracy using the evaluation subset of the Devign dataset. In order to test whether the Devign dataset contains spuriously correlated features that can be exploited to achieve a high accuracy without actually detecting security vulnerabilities, we then trained a simple classifier (Gradient Boosting Classifier) to detect vulnerabilities in Devign based on word counts only, completely disregarding the structure of the code. After training, we evaluated the resulting model on the evaluation subset of Devign to compute accuracy when no information on code structure and/or semantics is available during training.
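The essence of this word-count baseline fits in a few lines. The following Python sketch is an illustration rather than the exact script from our artifact; in particular, the file and field names ("func", "target") follow the CodeXGLUE release of Devign and are an assumption here, and hyperparameters are left at their defaults:

    # Illustrative sketch of the word-count baseline: a Gradient Boosting
    # Classifier trained on token counts only, ignoring code structure.
    import json
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score

    def load(path):
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        return [r["func"] for r in rows], [r["target"] for r in rows]

    train_code, y_train = load("train.jsonl")   # Devign training split
    valid_code, y_valid = load("valid.jsonl")   # Devign evaluation split

    # Bag of words: each function becomes a sparse vector of word counts.
    vectorizer = CountVectorizer(token_pattern=r"\w+")
    X_train = vectorizer.fit_transform(train_code)
    X_valid = vectorizer.transform(valid_code)

    clf = GradientBoostingClassifier()          # default hyperparameters
    clf.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))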
Results. The Gradient Boosting Classifier achieved an accuracy of 63.2% on the evaluation subset of Devign, which is only 5.8 percentage points lower than the state-of-the-art accuracy of UniXcoder (and within the Top-10 on CodeXGLUE). Surprisingly for an effective vulnerability detection model, the whole process of training and evaluation only took 10 minutes on a MacBook Pro, which is extremely fast compared to state-of-the-art training times for LLMs. State-of-the-art LLMs require expensive hardware (GPUs), weeks of computing time to pre-train, and at least multiple hours of computing time to finetune. Additionally, the Gradient Boosting Classifier was only trained on the training subset of Devign (21k functions) without any pre-training. For comparison, the UniXcoder technique utilized 2.3 million functions during pre-training. These results show that it is possible to achieve high accuracy on a function-level dataset while completely disregarding the structure and semantics of the code.
Our results provide an alternative explanation for the results of ML4VD techniques that were reported in the literature. While we did not prove that state-of-the-art ML4VD techniques actually achieve their high scores by relying on spuriously correlated features, we have shown that it is possible to exploit these datasets to achieve high accuracy without actually detecting security vulnerabilities.
VI Threats to the Validity
As for any empirical study, there are various threats to the validity of our results and conclusions.
VI-A Internal validity
Selection Bias. The random sampling of 100 functions from each dataset (BigVul, Devign, and DiverseVul) for our empirical study could introduce selection bias. Although random sampling aims to create a representative subset, it is possible that our sample may not fully capture the diversity and characteristics of the entire dataset. To mitigate this, we ensured that our sampling method was strictly random and publish the reproducible script and the sampled functions as part of our artifact (https://github.com/niklasrisse/TopScoreWrongExam).
Manual Labeling Errors. Labeling functions as vulnerable or secure and distinguishing between context-dependent and context-independent vulnerabilities involves subjective judgment. To address this, we employed a cross-labeling process where 200 out of the 300 functions were independently labeled for vulnerability by a second software engineering researcher, achieving a Cohen's kappa value of 0.7, indicating substantial agreement. Discrepancies were resolved through discussion, but some human error may still be present. For full transparency, we publish all labels and explanations as part of our artifact.
VI-B External validity
Generalizability. Our study focuses on three specific datasets (BigVul, Devign, and DiverseVul), which are widely used in the ML4VD community. However, the findings may not generalize to other datasets or real-world software systems. Future studies should replicate our methodology on additional datasets and real-world codebases to validate our conclusions.
Dataset Composition. The datasets analyzed primarily contain C code. Our findings might not generalize to other programming languages with different syntactic and semantic properties. Future research should include datasets from various programming languages to evaluate the broader applicability of our results.
VI-C Construct Validity
Performance Metrics. We used accuracy as the primary metric for evaluating classifier performance. While accuracy is standard, it does not differentiate between false positives and false negatives, which can have different implications in security contexts. Future studies should consider additional metrics such as precision, recall, and F1-score to provide a more nuanced evaluation of ML4VD techniques.
Spurious Correlations. Our results suggest that ML4VD techniques might achieve high accuracy by exploiting spurious correlations rather than genuinely detecting vulnerabilities. While we demonstrated this using word counts, further research should investigate other spuriously correlated features that might be exploited.
VII Discussion and Future Work
The goal of empirical evaluation in ML4VD is to assess the capabilities of specific techniques by putting them to the test on real-world data. However, if the evaluation is based on incorrect assumptions, the results become meaningless. Our findings demonstrate that this is precisely the issue with ML4VD. Techniques can achieve ‘perfect scores’ by exploiting spurious features, even without possessing the capabilities the test aims to measure. In our specific case, the only plausible explanation for the good performance of ML4VD techniques is spurious correlations.
Implications for general ML. Are we truly measuring the effectiveness of ML techniques in solving the specific tasks we expect them to solve? The issue of spurious correlations extends beyond ML4VD to other areas of machine learning, where the influence of spurious correlations may be less obvious. Therefore, it is crucial to develop methods to identify and measure these correlations during the benchmarking of ML techniques to ensure that we are genuinely evaluating their performance in solving the tasks they are intended to address.
Implications for Program Analysis. The context-dependency problem is a strong case for interprocedural analyses over intraprocedural analysis. While the latter is a common problem statement in static analysis, it is evidently insufficient for tasks requiring a comprehensive understanding of code contexts. Interprocedural analyses, which consider interactions between multiple procedures, are better suited to address the context-dependent nature of security vulnerabilities.
Abstention. What could be possible ways forward for ML4VD? A simple possible solution to the context-dependency problem of function-level vulnerability detection could be to cast it as binary classification with abstention. Given a function, an ML4VD technique could either decide the vulnerability of the function or abstain from this decision. However, the obvious disadvantage of this approach is that only context-independent security vulnerabilities could actually be detected, which appear to be very rare. The context-dependency labels generated by our empirical study could be used as a starting point to explore this direction.
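A minimal sketch of such an abstaining classifier, wrapped around any probabilistic model, could look as follows (illustrative only; the confidence threshold and the wrapped model are placeholders, not a proposal from the surveyed literature):

    # Sketch: function-level classification with abstention.
    # `model` is any classifier exposing predict_proba (e.g., a pipeline that
    # first vectorizes the function); the threshold is a placeholder.
    ABSTAIN = -1

    def classify_or_abstain(model, functions, threshold=0.9):
        """Return 1 (vulnerable), 0 (secure), or ABSTAIN for each function."""
        decisions = []
        for p_secure, p_vulnerable in model.predict_proba(functions):
            if p_vulnerable >= threshold:
                decisions.append(1)
            elif p_secure >= threshold:
                decisions.append(0)
            else:
                decisions.append(ABSTAIN)  # not decidable from the function alone
        return decisions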
Alternative Granularities. Another alternative approach to function-level vulnerability detection could be to rely on higher granularities, such as file-level, module-level, or even repository-level. However, higher granularities could suffer from the same problems as the function-level approach. For example, deciding whether a file, a module, or even a complete repository contains a security vulnerability might still depend on external context. Additionally, the classifiers might still rely on spurious correlations within the larger context, leading to misleading results. These potential issues need to be empirically verified in future work to determine if higher granularities offer a true advantage or merely shift the problems to a different level of abstraction.
Context-conditional Classification. Due to the hierarchical nature of software, determining the vulnerability of a given base unit (e.g., function) may require considering all necessary contexts, such as the complete state of a repository at a given commit ID and all other dependencies that might exist outside of the code (e.g., the user's system or the dependencies of the repository). In the past, this approach was practically infeasible due to limitations of existing approaches (e.g., context size of LLMs). However, recent advances might make this approach feasible in the near future. Since Devign, BigVul, and DiverseVul all include patch commit IDs, which allow for the reconstruction of complete context for a given function, they could be used to evaluate ML4VD based on this alternative problem statement. Future research is needed to find optimal ways to include this context into state-of-the-art techniques and to optimally make decisions based on large contexts.
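For illustration, reconstructing the pre-patch repository state (i.e., the complete code context in which a function was vulnerable) from a patch commit ID is straightforward; the following sketch uses a placeholder repository URL and commit hash:

    # Sketch: reconstruct the full repository context of a function from the
    # vulnerability-patching commit ID. URL and commit hash are placeholders.
    import subprocess

    def checkout_pre_patch_state(repo_url, patch_commit, workdir="context"):
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        # The parent of the patch commit is the last repository state that
        # still contains the vulnerability.
        subprocess.run(["git", "-C", workdir, "checkout", f"{patch_commit}^"], check=True)
        return workdir

    # checkout_pre_patch_state("https://github.com/example/project.git", "abc123")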
However, adding repository-level context alone may not resolve the benchmarking problem, as an ML technique that disregards this context could still appear to perform well. Addressing this challenge will require ensuring that the evaluation metrics and benchmarks actually capture the vulnerability detection capabilities of the techniques that are tested. The context-dependency explanations generated by our empirical study could be used as a starting point to explore this direction.
Overcoming Classification. Even with higher granularities, the essence of ML4VD would still be binary classification: Given an input, decide whether the input is vulnerable or secure. An alternative to classification could be generation of context. A context-generating ML4VD technique could generate the conditions under which a context-dependent vulnerability would be an actual vulnerability. For example, given a function, ML4VD techniques may generate a complete executable program in which the function contains a security vulnerability. This generated context could then be compared with the actual context to see whether the vulnerability exists in the real context.
In conclusion, addressing the issues of context dependency and spurious correlations is critical for the advancement of ML4VD and other ML applications. By exploring alternative methodologies and improving our evaluation frameworks, we can ensure more robust and reliable assessments, ultimately leading to more secure and effective solutions.
References
- [1]B.Steenhoek, H.Gao, and W.Le, “Dataflow analysis-inspired deep learning for efficient vulnerability detection,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24.New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3623345
- [2]S.Cao, X.Sun, X.Wu, D.Lo, L.Bo, B.Li, and W.Liu, “Coca: Improving and explaining graph neural network-based vulnerability detection systems,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24.New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639168
- [3]Z.Liu, Z.Tang, J.Zhang, X.Xia, and X.Yang, “Pre-training by predicting program dependencies for vulnerability analysis tasks,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24.New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639142
- [4]A.Sejfia, S.Das, S.Shafiq, and N.Medvidović, “Toward improved deep learning-based vulnerability detection,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24.New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3608141
- [5]M.M. Rahman, I.Ceka, C.Mao, S.Chakraborty, B.Ray, and W.Le, “Towards causal deep learning for vulnerability detection,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24.New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639170
- [6]S.B. Hossain, N.Jiang, Q.Zhou, X.LI, W.-H. Chiang, Y.Lyu, H.Nguyen, and O.Tripp, “A deep dive into large language models for automated bug localization and repair,” in Conference on the Foundations of Software Engineering (FSE) 2024, 2024. [Online]. Available: https://www.amazon.science/publications/a-deep-dive-into-large-language-models-for-automated-bug-localization-and-repair
- [7]J.Zhang, S.Liu, X.Wang, T.Li, and Y.Liu, “Learning to locate and describe vulnerabilities,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023, pp. 332–344.
- [8]X.Wen, X.Wang, C.Gao, S.Wang, Y.Liu, and Z.Gu, “When less is enough: Positive and unlabeled learning model for vulnerability detection,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE).Los Alamitos, CA, USA: IEEE Computer Society, sep 2023, pp. 345–357. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ASE56229.2023.00144
- [9]B.Steenhoek, M.M. Rahman, R.Jiles, and W.Le, “An empirical study of deep learning models for vulnerability detection,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 2237–2248. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00188
- [10]W.Wang, T.N. Nguyen, S.Wang, Y.Li, J.Zhang, and A.Yadavally, “Deepvd: Toward class-separation features for neural network vulnerability detection,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 2249–2261. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00189
- [11]X.Yang, S.Wang, Y.Li, and S.Wang, “Does data sampling improve deep learning-based vulnerability detection? yeas! and nays!” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 2287–2298. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00192
- [12]B.Yuan, Y.Lu, Y.Fang, Y.Wu, D.Zou, Z.Li, Z.Li, and H.Jin, “Enhancing deep learning-based vulnerability detection by building behavior graph model,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 2262–2274. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00190
- [13]X.-C. Wen, Y.Chen, C.Gao, H.Zhang, J.M. Zhang, and Q.Liao, “Vulnerability detection with graph simplification and enhanced graph representation learning,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 2275–2286. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00191
- [14]C.Ni, X.Yin, K.Yang, D.Zhao, Z.Xing, and X.Xia, “Distinguishing look-alike innocent and vulnerable code by subtle semantic representation learning and explanation,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023.New York, NY, USA: Association for Computing Machinery, 2023, p. 1611–1622. [Online]. Available: https://doi.org/10.1145/3611643.3616358
- [15]Y.Wu, D.Zou, S.Dou, W.Yang, D.Xu, and H.Jin, “Vulcnn: an image-inspired scalable vulnerability detection system,” in Proceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22.New York, NY, USA: Association for Computing Machinery, 2022, p. 2365–2376. [Online]. Available: https://doi.org/10.1145/3510003.3510229
- [16]Y.Li, S.Wang, and T.N. Nguyen, “Vulnerability detection with fine-grained interpretations,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2021.New York, NY, USA: Association for Computing Machinery, 2021, p. 292–303. [Online]. Available: https://doi.org/10.1145/3468264.3468597
- [17]Z.Chu, Y.Wan, Q.Li, Y.Wu, H.Zhang, Y.Sui, G.Xu, and H.Jin, “Graph neural networks for vulnerability detection: A counterfactual explanation,” arXiv preprint arXiv:2404.15687, 2024.
- [18]X.-C. Wen, C.Gao, S.Gao, Y.Xiao, and M.R. Lyu, “Scale: Constructing structured natural language comment trees for software vulnerability detection,” 2024.
- [19]Y.Hu, S.Wang, W.Li, J.Peng, Y.Wu, D.Zou, and H.Jin, “Interpreters for gnn-based vulnerability detection: Are we there yet?” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023.New York, NY, USA: Association for Computing Machinery, 2023, p. 1407–1419. [Online]. Available: https://doi.org/10.1145/3597926.3598145
- [20]X.Nie, N.Li, K.Wang, S.Wang, X.Luo, and H.Wang, “Understanding and tackling label errors in deep learning-based vulnerability detection (experience paper),” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023.New York, NY, USA: Association for Computing Machinery, 2023, p. 52–63. [Online]. Available: https://doi.org/10.1145/3597926.3598037
- [21]Y.Ding, S.Chakraborty, L.Buratti, S.Pujar, A.Morari, G.Kaiser, and B.Ray, “Concord: Clone-aware contrastive learning for source code,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023.New York, NY, USA: Association for Computing Machinery, 2023, p. 26–38. [Online]. Available: https://doi.org/10.1145/3597926.3598035
- [22]X.Cheng, G.Zhang, H.Wang, and Y.Sui, “Path-sensitive code embedding via contrastive learning for software vulnerability detection,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2022.New York, NY, USA: Association for Computing Machinery, 2022, p. 519–531. [Online]. Available: https://doi.org/10.1145/3533767.3534371
- [23]Y.Chen, Z.Ding, L.Alowain, X.Chen, and D.Wagner, “Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection,” in Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, ser. RAID ’23.New York, NY, USA: Association for Computing Machinery, 2023, p. 654–668. [Online]. Available: https://doi.org/10.1145/3607199.3607242
- [24]Q.L. Le, A.Raad, J.Villard, J.Berdine, D.Dreyer, and P.W. O’Hearn, “Finding real bugs in big programs with incorrectness logic,” Proc. ACM Program. Lang., vol.6, no. OOPSLA1, apr 2022. [Online]. Available: https://doi.org/10.1145/3527325
- [25]P.D. Schubert, B.Hermann, and E.Bodden, “Phasar: An inter-procedural static analysis framework for c/c++,” in International Conference on Tools and Algorithms for Construction and Analysis of Systems, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:93002205
- [26]S.Desikan and G.Ramesh, Software Testing: Principles and Practice.Pearson Education Canada, 2006. [Online]. Available: https://books.google.de/books?id=Yt2yRW6du9wC
- [27]S.Lipp, S.Banescu, and A.Pretschner, “An empirical study on the effectiveness of static c code analyzers for vulnerability detection,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2022.New York, NY, USA: Association for Computing Machinery, 2022, p. 544–555. [Online]. Available: https://doi.org/10.1145/3533767.3534380
- [28]B.Johnson, Y.Song, E.Murphy-Hill, and R.Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13.IEEE Press, 2013, p. 672–681.
- [29]M.J. Harrold and A.Orso, “Retesting software during development and maintenance,” in 2008 Frontiers of Software Maintenance, 2008, pp. 99–108.
- [30]K.Claessen and J.Hughes, “Quickcheck: a lightweight tool for random testing of haskell programs,” SIGPLAN Not., vol.35, no.9, p. 268–279, sep 2000. [Online]. Available: https://doi.org/10.1145/357766.351266
- [31]R.Padhye, C.Lemieux, and K.Sen, “Jqf: coverage-guided property-based testing in java,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2019.New York, NY, USA: Association for Computing Machinery, 2019, p. 398–401. [Online]. Available: https://doi.org/10.1145/3293882.3339002
- [32]R.Padhye, C.Lemieux, K.Sen, M.Papadakis, and Y.LeTraon, “Semantic fuzzing with zest,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2019.New York, NY, USA: Association for Computing Machinery, 2019, p. 329–340. [Online]. Available: https://doi.org/10.1145/3293882.3330576
- [33]R.Croft, M.A. Babar, and M.M. Kholoosi, “Data quality for software vulnerability datasets,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23.IEEE Press, 2023, p. 121–133. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00022
- [34]D.Arp, E.Quiring, F.Pendlebury, A.Warnecke, F.Pierazzi, C.Wressnegger, L.Cavallaro, and K.Rieck, “Dos and don’ts of machine learning in computer security,” in 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 3971–3988.
- [35]N.Risse and M.Böhme, “Uncovering the limits of machine learning for automatic vulnerability detection,” 2024. [Online]. Available: https://arxiv.org/abs/2306.17193
- [36]B.Kitchenham and S.Charters, “Guidelines for performing systematic literature reviews in software engineering,” vol.2, 01 2007.
- [37]K.Petersen, S.Vakkalanka, and L.Kuzniarz, “Guidelines for conducting systematic mapping studies in software engineering: An update,” Information and Software Technology, vol.64, 08 2015.
- [38]J.Fan, Y.Li, S.Wang, and T.N. Nguyen, “A c/c++ code vulnerability dataset with code changes and cve summaries,” in Proceedings of the 17th International Conference on Mining Software Repositories, ser. MSR ’20.New York, NY, USA: Association for Computing Machinery, 2020, p. 508–512. [Online]. Available: https://doi.org/10.1145/3379597.3387501
- [39]Y.Zhou, S.Liu, J.Siow, X.Du, and Y.Liu, Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks.Red Hook, NY, USA: Curran Associates Inc., 2019.
- [40]S.Chakraborty, R.Krishna, Y.Ding, and B.Ray, “Deep learning based vulnerability detection: Are we there yet?” IEEE Transactions on Software Engineering, vol.48, no.09, pp. 3280–3296, sep 2022.
- [41]J.Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol.20, pp. 37 – 46, 1960. [Online]. Available: https://api.semanticscholar.org/CorpusID:15926286
- [42]J.R. Landis and G.G. Koch, “The measurement of observer agreement for categorical data.” Biometrics, vol. 33 1, pp. 159–74, 1977. [Online]. Available: https://api.semanticscholar.org/CorpusID:11077516
- [43]S.Lu, D.Guo, S.Ren, J.Huang, A.Svyatkovskiy, A.Blanco, C.Clement, D.Drain, D.Jiang, D.Tang, G.Li, L.Zhou, L.Shou, L.Zhou, M.Tufano, M.GONG, M.Zhou, N.Duan, N.Sundaresan, S.K. Deng, S.Fu, and S.LIU, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J.Vanschoren and S.Yeung, Eds., vol.1, 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/c16a5320fa475530d9583c34fd356ef5-Paper-round1.pdf
- [44]D.Guo, S.Lu, N.Duan, Y.Wang, M.Zhou, and J.Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” arXiv preprint arXiv:2203.03850, 2022.