The following matrix represents our framework for classifying anti-patterns in experimental evaluation. Its 16 cells, the cross product of the four components of experiment design and the four pitfalls, represent the 16 anti-patterns. Each cell links to a page with a description of the anti-pattern and a growing corpus of real-world examples. For background information and the motivation behind this matrix, please read our technical report "Can you trust your experimental results?".
When designing quantitative experiments, an experimenter must consider the following components:
Measurement Contexts identify the software and hardware components to vary or hold constant in the experiment.

Workloads identify the benchmarks, along with their inputs, to use in the experiment.

Metrics identify the properties to measure and how to measure them.

Data Analysis identifies how to analyze the data and how to interpret the results of the analysis to provide insight into the resulting claims.
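As a concrete illustration of the Data Analysis component (the numbers and the choice of technique are hypothetical, not taken from the report), a minimal sketch of one common, reproducibly describable analysis: summarizing repeated benchmark runs with a mean and a 95% confidence interval.

```python
import statistics

# Hypothetical execution times (seconds) from repeated runs of one benchmark.
runs = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 1.00, 0.97]

n = len(runs)
mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation

# 95% confidence interval using the t-distribution critical value
# for n - 1 = 7 degrees of freedom (t ~= 2.365).
t_crit = 2.365
half_width = t_crit * stdev / n ** 0.5

print(f"mean = {mean:.3f} s, 95% CI = [{mean - half_width:.3f}, {mean + half_width:.3f}] s")
```

Stating the analysis at this level of detail (number of runs, summary statistic, interval construction) is what lets others reproduce it.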
For each of the above four components, we identified four common pitfalls:
Inappropriate: An inappropriate component includes elements that are inappropriate for the experimenter's claim (for example, including a desktop workload when making claims about a supercomputer).

Ignored: An ignored component omits aspects relevant to the claim (for example, ignoring compile time when making claims about the quality of a just-in-time compiler).

Inconsistent: An inconsistent component compares aspects that are inconsistent with each other (for example, making claims about the benefit of A compared to B while comparing A on a modern system to B on an antiquated system).

Irreproducible: An irreproducible component is one that others cannot use to reproduce the experiments (for example, an inadequately described data analysis technique hinders reproducibility).
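The "Ignored" pitfall above can be made concrete with a small sketch (the per-iteration times are hypothetical, invented only for illustration): a just-in-time-compiled program whose early iterations include compilation overhead looks much faster if the analysis silently discards those warm-up iterations.

```python
# Hypothetical per-iteration times (seconds) for a JIT-compiled program;
# the first iterations include compilation overhead.
iterations = [2.50, 1.20, 0.40, 0.41, 0.40, 0.39, 0.40, 0.41]

steady_state = iterations[3:]  # dropping warm-up iterations hides compile time

mean_all = sum(iterations) / len(iterations)
mean_steady = sum(steady_state) / len(steady_state)

print(f"mean over all iterations: {mean_all:.3f} s")
print(f"mean over steady state:   {mean_steady:.3f} s")
# A claim about overall compiler quality based only on the steady-state
# mean ignores the compile-time cost visible in the early iterations.
```

Whether warm-up should be included depends on the claim being made; the pitfall is ignoring it while claiming overall quality.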