Properly benchmarking a system is a difficult and intricate task. Even a seemingly innocuous mistake can compromise the guarantees provided by a systems security defense and threaten reproducibility and comparability. Moreover, as many modern defenses trade security for performance, the damage caused by benchmarking mistakes is increasingly worrying. To analyze the magnitude of the phenomenon, we identify 22 benchmarking flaws that threaten the validity of systems security evaluations, and survey 50 defense papers published in top venues.
We show that benchmarking flaws are widespread even in papers published at tier-1 venues; tier-1 papers contain an average of five benchmarking flaws and we find only a single paper in our sample without any benchmarking flaws. Moreover, the scale of the problem appears constant over time, suggesting that the community is not yet taking sufficient countermeasures. This threatens the scientific process, which relies on reproducibility and comparability to ensure that published research advances the state of the art. We hope to raise awareness, provide recommendations for improving benchmarking quality, and safeguard the scientific process in our community.
We originally published a technical report on arXiv about our results after our paper “Benchmarking Crimes: An Emerging Threat in Systems Security” was repeatedly rejected at top security conferences. Eventually, however, we found IEEE EuroS&P willing to accept the paper, which is now titled “SoK: Benchmarking Flaws in Systems Security”. For this version we changed “crimes” to “flaws” to avoid the implication that incorrect benchmarking is necessarily intentional. The list of crimes/flaws is based on Gernot Heiser’s blog post about Systems Benchmarking Crimes.
The slides for the EuroS&P presentation given in Stockholm are available. We decided to respect local customs in making the slides. In case this is confusing for non-Swedish readers, a version with notes is also available.
The Register has covered our research.
This project was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 786669 (ReAct) and No. 825377 (UNICORE), by the United States Office of Naval Research (ONR) under contract N00014-17-1-2782, by Cisco Systems, Inc. through grant #1138109, and by the Netherlands Organisation for Scientific Research through grants NWO 639.023.309 VICI “Dowsing” and NWO 639.021.753 VENI “PantaRhei”. The public artifacts reflect only the authors’ view. The funding agencies are not responsible for any use that may be made of the information they contain.