Threats to Validity and Relevance in Security Research

When reviewing papers and projects, we notice that many authors make the same mistakes. These mistakes undermine the claims in the papers, sometimes to the point of invalidating them. As a result, we find ourselves writing the same comments over and over again. For this reason alone, it would be useful to compile them into a more or less comprehensive list and explain why they are problematic.

There is another reason. It appears to us that some of these mistakes are becoming so common that they risk becoming part of the research culture: the kind of thing that reviewers no longer question because “everybody is doing it”. We think this way of thinking is a threat to the field of security research. How can you take the field seriously, if it does not take itself seriously?

Disclaimers. It is important to emphasise that our research in VUSec has made several of the mistakes listed here. Some mistakes are subtle. It takes effort and awareness to avoid them. As a community, we need to improve.

Also, we do not claim the list of issues as our own. Many people within VUSec and elsewhere have identified (subsets of) these and similar issues and thus contributed to this document. In particular, this document directly builds on (and includes) work on sound malware experiments and on work on benchmarking crimes. The following people were involved in these efforts.

Christian Rossow, Christian Dietrich, Chris Grier, Christian Kreibich, Vern Paxson, Herbert Bos and Maarten van Steen all co-authored our paper on soundness in malware experiments.
Erik van der Kouwe, Dennis Andriesse, Herbert Bos, and Cristiano Giuffrida all co-authored the benchmarking flaws paper.
Gernot Heiser deserves special mention for not only co-authoring the benchmarking flaws paper, but also inspiring it with his list of systems benchmarking crimes. If anything, this document is an extension of his work.

Other work in this area:

An interesting study on experimental bias in ML is “TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time” by Feargus Pendlebury, Fabio Pierazzi, Roberto Jordaney, Johannes Kinder, and Lorenzo Cavallaro.

There are many other researchers who have commented, written, or spoken about the issues below. We are not claiming much novelty in this document.

Note: we do not aim to point fingers, single out particular papers, or speculate on malicious intent. We are assuming that most of the mistakes are just that: honest mistakes. The causes may be lack of awareness, sloppiness, deadline pressure, or whatever. And yes, there may be a few bad apples that aim to pull a fast one, but we like to believe that this is rare.

We are treating all issues here as threats: to usefulness of the research results, to the culture in the security research community, and to the scientific process.

A. Threats to relevance

You do not solve the problem.
Many defense/detection/analysis (and occasionally attack) papers target the world as it exists today without taking into account how the world would react to the new defense (or attack) if it became popular. If you have a defense that stops all existing exploits, but that can be trivially bypassed by minor tweaks, it is no good at all (except perhaps in a paper reporting the current state of things).
You do not show how much you raise the bar.
A new defense or detection technique should really raise the bar for attackers in a fundamental way, either in terms of effort (e.g., exploits that used to be portable no longer are) or in performance (e.g., attackers who used to be able to leak sensitive data at gigabit rates can now only leak a few bits per second). In your paper, you should to the best of your ability explain to what extent this is the case, analytically or, ideally, quantitatively.

Some research projects are more suited than others for doing so. For instance, it is always difficult to quantify how much harder reverse engineering becomes because of a new obfuscation technique. In those cases, good papers try to show analytically how the technique raises the bar.
[Note: greatly improving an existing defense (e.g., making it faster or simpler) is of course a worthy contribution still.]
You did not examine the residual attack surface.
You solved one problem. What is left?
You solve a problem that does not exist (anymore) and there is no reason to think it will ever become a real problem
The world changes. For example, clever new tricks to detect shellcode in network traffic may no longer have much value for modern general purpose systems where shellcode injection is no longer possible. Convince yourself that the problem the paper addresses matters.

B. Threats to context

You do not discuss relevant related work
Be careful with claiming that your system is the first to do X, because it may not be. Look further than just the last 5 years and wider than just academic papers. Research ideas grow and die and are reinvented again, in an endless cycle of rebirths. This is known as the research wheel of karma. If the idea was good, it will probably resurface in a different context until it finds its niche. If the idea was bad, it will be put on the blockchain.
You discuss related work poorly
Treat related work fairly. Being negative about related work is unnecessary and often counterproductive. Factually point out limitations and show where your solution differs.
Your defense paper does not contain a threat model
It is important for a reader to know what attack scenario you consider. All assumptions should be listed.
Your threat model is not realistic.
You should convince the reader that your threat model makes sense and corresponds to situations that may occur, or occur frequently, in the real world.
You do not discuss limitations that do exist
Your solution is not the final answer. List the limitations. It is much better that you do this than the reviewers. (In fact, listing it yourself also makes it harder for the reviewer to reject your paper because of it.) Do not sweep anything under the carpet. Some reviewer will probably find out anyway and this will taint the paper. Not just for this submission, but also for subsequent ones.

C. Threats to valid evaluation

Your evaluation does not back up your claims.
All too often papers present experimental results that are, at best, not relevant for the claims or contributions and, at worst, contradicting them.
You do not explain all plots and all the results.
Seriously, tell the reader what he/she should notice. Explain outliers, and false positives/negatives, etc. Do this in a critical way and make sure the evaluation converges. In other words, summarize what we learned from the experiments and show how the results confirm all your claims.
You evaluate on 10-year-old exploits, 10-year-old defenses, 10-year-old whatever.
Make sure you evaluate on programs, systems and solutions that are recent/state-of-the-art.
You optimized for the benchmarks only
Benchmarks (and benchmark suites) are great, but scoring well on benchmarks is not the end goal. Often we see systems that have been tailored for particular benchmarks without too much regard for how they would perform in the real world. For instance, fuzzers that optimize for and evaluate on LAVA-M only do not necessarily find more bugs in real programs.
You forgot to explain those magic values
In your eagerness to optimize for certain tests, your solution may have a number of tuning knobs: constants that are set to specific values for your system to perform well. You should explain how and why you picked those values (and argue whether these values still work if you apply your solution in other environments/datasets).
Your paper contains one or more benchmarking ~~crimes~~ flaws
Properly benchmarking your work is more difficult than most people realise. System security solutions are a tradeoff between security and performance. Given infinite time, we can deliver arbitrarily strong security. We rarely have that much time. Since we trade security for performance, proper benchmarking is essential. Even a seemingly harmless mistake in benchmarking can compromise the guarantees provided by a systems security defense and threaten reproducibility and comparability.

Inspired by Gernot Heiser’s original blog about benchmarking crimes (check it out, it is awesome!), we looked into the problem of benchmarking a few years ago to see how good or bad things were in the system security domain. We evaluated two years of system security papers with benchmarking results from the top venues in system security and assessed to what extent they exhibited “benchmarking crimes”. The paper we wrote about this study was initially rejected everywhere, often harshly. In the end, we gave up and published the paper on arXiv. We thought we were done with this and quite ready to move on, but fellow researchers who considered this important for the community persuaded us to try one more time and it was eventually accepted at EuroS&P’19.

The text in this section simply summarises the benchmarking flaws in that paper.

6.A Selective benchmarking

6.A.1 Not evaluating potential performance degradation
You should always include benchmarks that evaluate all operations whose performance one might reasonably expect to be impacted. If a system improves one kind of workload compared to the state of the art but slows down another, it is important to show this to uncover tradeoffs and allow readers to decide whether this solution is actually faster overall or how it compares to related work. This flaw results in a lack of completeness.

6.A.2 Benchmark subsetting without proper justification
Any paper which arbitrarily selects a subset of benchmarks and presents it as a single overall performance overhead number as if it is still representative contains a serious benchmarking flaw. If the missing sub benchmarks happen to be those that incur most overhead, the overall performance number will be meaningless because important components are missing (lack of completeness) and misleads the reader into thinking the system performs better than it actually does (lack of relevance).

6.A.3 Selective data sets that hide deficiencies
Benchmark configurations are often flexible and allow performance to be measured in different settings, for example with different numbers of concurrent connections. Since such a configuration parameter is likely to affect overhead, it is important to measure a range of settings. Papers that fail to test performance over an appropriate range of settings contain a benchmarking flaw. For example, if throughput seems to scale linearly with the number of concurrent connections, it suggests that the range of this variable is too restricted because the system cannot keep this up forever. Like the other two flaws in this group, it potentially results in numbers that do not accurately reflect the performance impact of the system (lack of completeness).

6.B Improper handling of benchmark results

6.B.1 Microbenchmarks representing overall performance
Microbenchmarks measure the performance of specific operations. Such benchmarks can help determine whether a system succeeds in speeding up these particular operations, and they are useful for drilling down on performance issues. However, they are not an indication of how fast the system would run in practice. For this purpose, more realistic system benchmarks are needed.

6.B.2 Throughput degraded by x% ⇒ overhead is x%
Benchmarks usually run either a fixed workload to measure its runtime or repeat operations for a fixed amount of time to measure throughput. One common mistake is for papers to consider the increase in runtime or decrease in throughput to be the overhead. However, for many workloads the CPU is idle some of the time, for example waiting for I/O. If the CPU is working while it would otherwise have been waiting, this masks some of the overhead because it reduces the CPU time potentially available for other jobs. A typical example would be a lightly loaded server program (e.g., at 10% CPU) that reports no throughput degradation when heavily instrumented, given that the spare CPU cycles can be spent on running instrumentation code (at the expense of extra CPU load).
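
As a rough illustration of this point, the sketch below compares throughput degradation with CPU time spent per request. All numbers (utilisation, core count, request rate) and the helper name are invented for the example; they are not measurements from the paper.

```python
def cpu_seconds_per_request(cpu_utilisation, cores, requests_per_second):
    """CPU time consumed per request, in seconds.

    cpu_utilisation is a fraction between 0 and 1, averaged over all cores.
    """
    return cpu_utilisation * cores / requests_per_second


# Hypothetical lightly loaded server: throughput is identical before and
# after instrumentation, but CPU utilisation rises from 10% to 25%.
baseline = cpu_seconds_per_request(0.10, cores=4, requests_per_second=1000)
instrumented = cpu_seconds_per_request(0.25, cores=4, requests_per_second=1000)

throughput_degradation = 1 - 1000 / 1000    # looks like "no overhead"
cpu_overhead = instrumented / baseline - 1  # extra CPU time per request

print(f"throughput degradation: {throughput_degradation:.0%}")
print(f"CPU overhead per request: {cpu_overhead:.0%}")
```

In this made-up scenario the throughput number suggests the instrumentation is free, while the CPU cost per request has more than doubled. Reporting CPU load next to throughput makes the hidden cost visible.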

6.B.3 Bad math
This flaw refers to incorrect computations with overhead numbers. Well-known examples include the use of percentage points to present a difference in overhead, such as the case where the difference between 10% overhead and 20% overhead is presented as 10% more overhead, while it is actually 100% more (i.e., 2×). Another example is incorrectly computing slowdown, for example presenting a runtime that changes from 5s to 20s as a 75% slowdown (1 − 5/20 ) rather than a 300% slowdown ( 20/5 − 1). In all such cases, this flaw results in presenting numbers that are incorrect and therefore unsound.
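
The two examples above can be checked with a few lines of arithmetic. This is only a sketch; the function names are our own and chosen for illustration.

```python
def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


def slowdown(baseline_runtime, new_runtime):
    """Slowdown relative to the baseline runtime (0.0 means no slowdown)."""
    return new_runtime / baseline_runtime - 1


# Overhead going from 10% to 20%: a difference of 10 percentage points,
# but 100% more overhead in relative terms.
overhead_a, overhead_b = 0.10, 0.20
percentage_points = (overhead_b - overhead_a) * 100
relative = relative_increase(overhead_a, overhead_b)

# Runtime going from 5s to 20s.
wrong = 1 - 5 / 20       # the incorrect "75% slowdown" computation
right = slowdown(5, 20)  # the correct computation: 20/5 - 1

print(f"{percentage_points:.0f} percentage points, {relative:.0%} more overhead")
print(f"incorrect slowdown: {wrong:.0%}, correct slowdown: {right:.0%}")
```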

6.B.4 No indication of significance of data
When measuring runtimes or throughput numbers, there is always random variation due to measurement error. Large measurement errors suggest a problem with the experimental setup. Therefore, we consider the lack of some indication of variance, such as a standard deviation or significance test, to be a benchmarking flaw (lack of completeness).
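
A minimal sketch of reporting variance alongside a mean, using Python's standard `statistics` module. The runtimes are hypothetical, and the standard deviation is just one possible indication of variance; a confidence interval or significance test may be more appropriate for a given experiment.

```python
import statistics


def summarise(samples):
    """Return mean, sample standard deviation, and relative std deviation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return mean, stdev, stdev / mean


# Hypothetical runtimes in seconds from five repetitions of one benchmark.
runtimes = [10.2, 10.4, 10.1, 10.3, 10.5]
mean, stdev, rel = summarise(runtimes)

print(f"{mean:.2f}s +/- {stdev:.2f}s ({rel:.1%} of mean, n={len(runtimes)})")
```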

6.B.5 Incorrect averaging across benchmark scores
Papers that use benchmarking suites generally present a single overall overhead figure representing average overhead. Some authors use the arithmetic mean to summarize such numbers. However, this is inappropriate because the arithmetic mean over a number of ratios depends on which setup is chosen as a baseline and is therefore not a reliable metric. Only the geometric mean is appropriate for averaging overhead ratios. Papers that use the arithmetic mean (or other averaging strategies such as using the median) contain a benchmarking flaw: incorrect averaging across benchmark scores. This benchmarking flaw threatens soundness because it results in reporting incorrect overall overhead numbers.
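
The baseline dependence of the arithmetic mean is easy to see with a toy example. The runtimes below are invented; `statistics.geometric_mean` is part of the Python standard library from version 3.8.

```python
import statistics

# Hypothetical runtimes in seconds for two systems on two benchmarks.
runtimes_a = {"bench1": 10.0, "bench2": 20.0}
runtimes_b = {"bench1": 20.0, "bench2": 10.0}

# Per-benchmark ratios, once with A as the baseline and once with B.
b_over_a = [runtimes_b[k] / runtimes_a[k] for k in runtimes_a]  # [2.0, 0.5]
a_over_b = [runtimes_a[k] / runtimes_b[k] for k in runtimes_a]  # [0.5, 2.0]

arith_b_over_a = statistics.mean(b_over_a)
arith_a_over_b = statistics.mean(a_over_b)
geo_b_over_a = statistics.geometric_mean(b_over_a)
geo_a_over_b = statistics.geometric_mean(a_over_b)

print(f"arithmetic mean, A as baseline: {arith_b_over_a:.2f}")
print(f"arithmetic mean, B as baseline: {arith_a_over_b:.2f}")
print(f"geometric mean,  A as baseline: {geo_b_over_a:.2f}")
print(f"geometric mean,  B as baseline: {geo_a_over_b:.2f}")
```

With the arithmetic mean, each system appears 25% slower than the other, depending only on which one is chosen as the baseline. The geometric means are reciprocals of each other, so the conclusion does not change with the baseline.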

6.C Using the wrong benchmarks

6.C.1 Benchmarking of simplified simulated systems
Often benchmarks are not run on a real system but rather on an emulated version, for example through virtualization. While it is sometimes necessary to emulate a system if it is not available otherwise, it is best avoided because the characteristics of the emulated system are generally not identical to those of the real system. This results in unsound measurements.

6.C.2 Inappropriate and misleading benchmarks
The use of benchmarks that are not suitable to measure the expected overheads. Classic example: evaluating a security solution that instruments system calls solely with a CPU-intensive benchmark such as SPEC CPU.

6.C.3 Same dataset for calibration and validation
Do not benchmark your system using the same data set that you used to train it or, more generally, any test set that overlaps with the training set. A typical example would be profile-guided approaches which optimize for a specific workload and then use (parts of) that same workload to demonstrate the performance of the technique. The results from this approach lack relevance because they mislead the reader into believing the system performs better than it actually would in realistic scenarios.

6.D Improper comparison of benchmarking results

6.D.1 No proper baseline
Picking the right baseline is crucial. As an example, in systems defenses, the proper baseline is usually the original system using default settings with no defenses enabled. If the baseline is modified, for example by adding part of the requirements for the system being evaluated (such as specific compiler flags or virtualization), this misleads the reader by hiding some of the overhead in the baseline and therefore violates the relevance requirement.

6.D.2 Only evaluate against yourself
You should evaluate your solution against the state of the art. Merely comparing your new system to your own earlier work will mislead the reader. If better solutions are available, they should be included in the comparison; otherwise the comparison is not relevant.

6.D.3 Unfair benchmarking of competitors
When comparing to the competition, do this as fairly as possible. For example, try to use configurations that are optimal for the competing systems. Otherwise, you mislead the reader into thinking the presented system is better than it is, violating relevance.

6.E Benchmarking omissions

6.E.1 Not all contributions evaluated
If your paper claims to achieve a certain goal, it should empirically determine whether this goal has been reached. It is critical for the progress of science that papers verify their claims, since incorrect claims may prevent later work that does make the contributions from being published. This flaw violates completeness.

6.E.2 Only measure run-time overhead
When evaluating their performance, many papers measure run-time overhead. However, there are often other types of overhead that are also relevant for performance. A typical example would be memory overhead. Memory is a limited resource, so applications with high memory usage can slow down other processes running on the same system. Since most defenses need to use memory for bookkeeping, it is important to measure memory consumption. Omitting this overhead makes your evaluation incomplete.

6.E.3 False positives/negatives not tested
Unless it is obvious that the system can never get it wrong (e.g., security enforcement based on conservative program analysis), the evaluation needs to quantify failures due to false positives and false negatives. Failing to do so makes it impossible to judge a system’s value and makes the paper incomplete.
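
As a small sketch of what quantifying these failures can look like, the function below computes false positive and false negative rates from ground-truth labels and a system's verdicts. The function name, the boolean encoding, and the data are assumptions made for this example.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate).

    labels and predictions are equally long sequences of booleans,
    where True means "malicious" (or, more generally, "flagged").
    """
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    false_neg = sum(1 for truth, pred in pairs if truth and not pred)
    negatives = sum(1 for truth in labels if not truth)
    positives = sum(1 for truth in labels if truth)
    # A rate is undefined when there are no samples of that class.
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return fpr, fnr


# Hypothetical ground truth and detector output for ten samples.
labels = [True, True, True, True, False, False, False, False, False, False]
predictions = [True, True, True, False, True, False, False, False, False, False]
fpr, fnr = error_rates(labels, predictions)

print(f"false positive rate: {fpr:.1%}, false negative rate: {fnr:.1%}")
```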

6.E.4 Elements of solution not tested incrementally
Many systems consist of multiple components or steps that can to some extent be used independently. For example, an instrumentation-based system might use static analysis to eliminate irrelevant instrumentation points and improve performance. Such optimizations are optional as they do not affect functionality and can greatly increase complexity, so it is best to only include them if they result in substantial performance gains. Papers that do not measure the impact of such optional components individually contain a benchmarking flaw and are incomplete.

6.F Missing information

6.F.1 Missing platform specification
It is important to include a description of the hardware setup used to perform the experiments. To be able to reproduce the results, it is always important to know what type of CPU was used and how much memory was available. The cache architecture may be important to understand some performance effects. Depending on the type of system being evaluated, other characteristics such as hard drives and networking setup may also be essential for reproducibility.

6.F.2 Missing software versions
In addition to the hardware, you should also specify the type and version of operating system used, while other information such as hypervisors or compiler versions is also commonly needed. Like the previous flaw, such omissions lead to a lack of reproducibility.
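
One low-effort way to avoid forgetting these details is to record them automatically with every benchmark run. The sketch below uses only the Python standard library and is an assumption about how one might do this, not a prescribed method; it does not capture memory size, cache architecture, compiler or hypervisor versions, which need OS-specific tools.

```python
import json
import os
import platform
import sys


def describe_platform():
    """Collect basic platform facts available from the standard library."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be empty on some systems
        "logical_cpus": os.cpu_count(),
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "python": sys.version.split()[0],
    }


spec = describe_platform()
print(json.dumps(spec, indent=2))
```

Storing this output next to the raw results makes it easier to fill in the platform section of the paper later.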

6.F.3 Sub benchmarks not listed
In this case, you run a benchmarking suite but do not present the results of the individual sub benchmarks, just the overall number. This threatens completeness as the results on sub benchmarks often carry important information about the strong and the weak points of the system. Moreover, it is important to know whether the overhead is consistent across different applications or highly application-specific.

6.F.4 Relative numbers only
Many papers present only ratios of overheads (example: system X has half the overhead of system Y) without presenting the overhead itself (example: system X incurs 10% overhead). This is a bad flaw as the most important result is withheld and the reader cannot perform a sanity check of whether the results seem reasonable, threatening the evaluation’s completeness. A weaker version of this practice—presenting overheads compared to a baseline without presenting absolute runtimes or throughput numbers—is also undesirable. The absolute numbers are valuable for the reader to perform a sanity check (is the system configured in a reasonable way?) and because a slow baseline often means overhead will be less visible. The practice of omitting absolute numbers is perhaps not harmful enough to consider it a benchmarking flaw, but we do strongly encourage authors to include absolute numbers in addition to overheads.
Unsound malware experiments.
If your research builds on experiments with malware, where you run and/or analyse live malware, there are many wonderful ways to screw up. In 2012, researchers from VUSec, together with researchers from IIS (Gelsenkirchen), UC Berkeley, and ICSI, published a paper on how to perform sound experiments with live malware at the IEEE Security & Privacy conference (Oakland). We again summarise the main issues here.

The pitfalls identified in the paper fall into four categories. First, authors are not always careful compiling correct datasets from the outset. Second, many papers lack transparency about how the experiments were run. Third, experiments are not always realistic (for instance, because they change how the malware runs which may lead to different behaviour). Fourth, authors do not always discuss how they ensured safety and mitigated potential harm to others.

7.A Incorrect datasets

7.A.1 You did not remove goodware where you should have
Whereas goodware (legitimate software) has to be present for example in experiments to measure false alarms, it is typically not desirable to have goodware samples in datasets to estimate false negative rates. However, malware execution systems open to public sample submission lack control over whether specimens submitted to the system in fact consist of malware; the behavior of such samples remains initially unknown rather than malicious per se.

7.A.2 You did not balance datasets over malware families
In unbalanced datasets, aggressively polymorphic malware families will often unduly dominate datasets filtered by sample-uniqueness (e.g., MD5 hashes). Authors should discuss if such imbalances biased their experiments, and, if so, balance the datasets to the degree possible.
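
One possible way to balance a dataset is to cap the number of samples per family. This is only a sketch of that one strategy, with hypothetical sample identifiers and family labels; other balancing approaches may suit a given experiment better.

```python
import random
from collections import defaultdict


def balance_by_family(samples, max_per_family, seed=0):
    """Keep at most max_per_family randomly chosen samples per family.

    samples is a list of (sample_id, family) pairs.
    """
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    balanced = []
    for family in sorted(by_family):
        ids = by_family[family]
        if len(ids) > max_per_family:
            ids = rng.sample(ids, max_per_family)
        balanced.extend((sample_id, family) for sample_id in ids)
    return balanced


# Hypothetical dataset dominated by one polymorphic family.
samples = [(f"poly-{i}", "polymorphic_family") for i in range(1000)]
samples += [("a-1", "family_a"), ("a-2", "family_a"), ("b-1", "family_b")]
balanced = balance_by_family(samples, max_per_family=5)

print(f"{len(samples)} samples before, {len(balanced)} after balancing")
```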

7.A.3 Your training and evaluation datasets do not have distinct families
This is similar to the benchmarking crime listed earlier, but with a focus on malware experiments. When splitting datasets based on sample-uniqueness, two distinct malware samples of one family can potentially appear in both the training and validation dataset. Appearing in both may prove desirable for experiments that derive generic detection models for malware families by training on sample subsets. In contrast, authors designing experiments to evaluate on previously unseen malware types should separate the sets based on families.
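
A minimal sketch of a family-based split, assuming each sample already carries a family label. The function name, the test fraction, and the data are illustrative assumptions.

```python
import random


def split_by_family(samples, test_fraction=0.3, seed=0):
    """Split (sample_id, family) pairs so that no family is in both sets."""
    families = sorted({family for _, family in samples})
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [s for s in samples if s[1] not in test_families]
    test = [s for s in samples if s[1] in test_families]
    return train, test


# Hypothetical dataset: four families with three samples each.
samples = [(f"{fam}-{i}", fam) for fam in ("fam_a", "fam_b", "fam_c", "fam_d")
           for i in range(3)]
train, test = split_by_family(samples)

train_families = {family for _, family in train}
test_families = {family for _, family in test}
print(f"train families: {sorted(train_families)}")
print(f"test families:  {sorted(test_families)}")
```

Splitting on families rather than on individual samples guarantees that everything in the test set belongs to a family the model has not seen during training.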

7.A.4 Your analysis runs with the same (or lower) privileges as the malware.
Malware with rootkit functionality can interfere with the OS data structures that kernel-based sensors modify. Such malware can readily influence monitoring components, thus authors ought to report on the extent to which malware samples and monitoring mechanisms collide.

7.A.5 You ignored analysis artifacts and biases
Execution environment artifacts, such as the presence of specific strings (e.g., user names or OS serial keys) or the software configuration of an analysis environment, can manifest in the specifics of the behavior recorded for a given execution. Particularly when deriving models to detect malware, papers should explain the particular facets of the execution traces that a given model leverages. Similarly, biases arise if the malware behavior in an analysis environment differs from that manifest in an infected real system.

7.A.6 Your paper happily blends malware activity traces into benign background activity
The behavior exhibited by malware samples executing in dynamic analysis environments differs in a number of ways from that which would manifest in victim machines in the wild. Consequently, environment-specific performance aspects may poorly match those of the background activity with which experimenters combine them. The resulting idiosyncrasies may lead to seemingly excellent evaluation results, even though the system will perform worse in real-world settings. Authors should consider these issues, and discuss them explicitly if they decide to blend malicious traces with benign background activity.

7.B Lack of transparency

7.B.1 No family names of employed malware samples.
Consistent malware naming remains a thorny issue, but labeling the employed malware families in some form helps the reader identify for which malware a methodology works.

7.B.2 Your paper does not list which malware was analyzed when.
To understand and repeat experiments the reader requires a summary, perhaps provided externally to the paper, that fully describes the malware samples in the datasets. Given the ephemeral nature of some malware, it helps to capture the dates on which a given sample executed to put the observed behavior in context, say of a botnet’s lifespan that went through a number of versions or ended via a take-down effort.

7.B.3 No explanation of the malware sample selection.
Researchers often study only a subset of all malware specimens at their disposal. For instance, for statistically valid experiments, evaluating only a random selection of malware samples may prove necessary. Focusing on more recent analysis results and ignoring year-old data may increase relevance. In either case, authors should describe how they selected the malware subsets, and if not obvious, discuss any potential bias this induces.

7.B.4 You leave the system used during execution as a guessing exercise to the reader.
We saw this issue before in benchmarking: explain on what system you ran your experiment. Malware may execute differently (if at all) across various systems, software configurations and versions. Explicit description of the particular system(s) used renders experiments more transparent, especially as presumptions about the “standard” OS change with time. When relevant, authors should also include version information of installed software.

7.B.5 You did not bother explaining the network connectivity of the analysis environment
Malware families assign different roles of activity depending on a system’s connectivity, which can significantly influence the recorded behavior. For example, in the Waledac botnet, PCs connected via NAT primarily sent spam, while systems with public IP addresses acted as fast-flux “repeaters.”

7.B.6 Omission of the reasons for false positives and false negatives
False classification rates alone provide little clarification regarding a system’s performance. To reveal fully the limitations and potential of a given approach in other environments, we advocate thoughtful exploration of what led to the observed errors.

7.B.7 No explanation of the nature/diversity of true positives
Similarly, true positive rates alone often do not adequately reflect the potential of a methodology. For example, a malware detector flagging hundreds of infected hosts may sound promising, but not if it detects only a single malware family or leverages an environmental artifact. Papers should evaluate the diversity manifest in correct detections to understand to what degree a system has general discriminative power.

7.C Lack of realism

7.C.1 Evaluate on malware families that are not relevant
Using significant numbers of popular malware families bolsters the impact of experiments. Given the ongoing evolution of malware, exclusively using older or sinkholed specimens can undermine relevance.

7.C.2 Avoid evaluation in the real world.
We define a real-world experiment as an evaluation scenario that incorporates the behavior of a significant number of hosts in active use by people other than the authors. Real-world experiments play a vital role in evaluating the gap between a method and its application in practice.

7.C.3 Happily generalizing from a single OS version to the universe
By limiting analysis to a single OS version, experiments may fail with malware families that solely run or exhibit different behavior on disregarded OS versions. For studies that strive to develop results that generalize across OS versions, papers should consider to what degree we can generalize results based on one specific OS version.

7.C.4 You did not bother to choose appropriate malware stimuli.
Malware classes such as keyloggers require triggering by specific stimuli such as keypresses or user interaction in general. In addition, malware often expose additional behavior when allowed to execute for more than a short period. Authors should therefore describe why the analysis duration they chose suffices for their experiments.

7.C.5 You did not give Internet access to the malware.
Deferring legal and ethical considerations for a moment, we argue that experiments become significantly more realistic if the malware has Internet access. Malware often requires connectivity to communicate with command-and-control (C&C) servers and thus to expose its malicious behavior. In exceptional cases where experiments in simulated Internet environments are appropriate, authors need to describe the resulting limitations.

7.D Insufficient safety
You did not deploy or describe containment policies
Well-designed containment policies facilitate realistic experiments while mitigating the potential harm malware causes to others over time. Experiments should at a minimum employ basic containment policies such as redirecting spam and infection attempts, and identifying and suppressing DoS attacks. Authors should discuss the containment policies and their implications on the fidelity of the experiments. Ideally, authors also monitor and discuss security breaches in their containment.

Closing words

We do not claim that the above list is exhaustive. There are many other ways to jeopardise the validity of your paper. Nevertheless, it is a good start :). Moreover, the issues listed here are common and often subtle. If we were able to ban them across all security research papers, the field would be in better shape.

Systems and Network Security Group at VU Amsterdam