Evaluation Anti-Patterns

The following matrix is our framework for classifying anti-patterns in experimental evaluation. Its 16 cells, one per anti-pattern, are the cross product of the four components of experiment design and the four pitfalls. Each cell links to a page that describes the anti-pattern and collects a growing corpus of real-world examples. For background and the motivation for this matrix, please read our technical report "Can you trust your experimental results?".

Inappropriate Measurement Contexts | Ignored Measurement Contexts | Inconsistent Measurement Contexts | Irreproducible Measurement Contexts
Inappropriate Workloads | Ignored Workloads | Inconsistent Workloads | Irreproducible Workloads
Inappropriate Metrics | Ignored Metrics | Inconsistent Metrics | Irreproducible Metrics
Inappropriate Data Analysis | Ignored Data Analysis | Inconsistent Data Analysis | Irreproducible Data Analysis
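
As a minimal sketch of how the matrix is built, the 16 anti-pattern names can be enumerated as the cross product of the four components and the four pitfalls. The function name anti_patterns and the list names are illustrative assumptions, not part of the framework itself.

```python
from itertools import product

# Row labels (components) and column labels (pitfalls) of the matrix.
COMPONENTS = ["Measurement Contexts", "Workloads", "Metrics", "Data Analysis"]
PITFALLS = ["Inappropriate", "Ignored", "Inconsistent", "Irreproducible"]


def anti_patterns():
    """Return the anti-pattern names in row order (one component per row)."""
    return [
        f"{pitfall} {component}"
        for component, pitfall in product(COMPONENTS, PITFALLS)
    ]


if __name__ == "__main__":
    names = anti_patterns()
    # Print one matrix row per line, four cells per row.
    for start in range(0, len(names), len(PITFALLS)):
        print(" | ".join(names[start:start + len(PITFALLS)]))
```

Iterating over components in the outer position keeps each row of the output aligned with one component, matching the layout of the matrix.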

Components of an Experiment

When designing quantitative experiments, an experimenter must consider the following components:

Measurement Contexts
identify the software and hardware components to vary or hold constant in the experiment.
Workloads
identify the benchmarks, along with their inputs, to use in the experiment.
Metrics
identify the properties to measure and how to measure them.
Data Analysis
identifies how to analyze the data and how to interpret the results of that analysis to provide insight into the resulting claims.

Pitfalls in Experimentation

For each of the above four components, we identified four common pitfalls:

Inappropriate
An inappropriate component includes elements that do not fit the experimenter's claim (for example, including a desktop workload when making claims about a supercomputer).
Ignored
An ignored component omits aspects relevant to the claim (for example, ignoring compile time when making claims about the quality of a just-in-time compiler).
Inconsistent
An inconsistent component compares aspects that are inconsistent with each other (for example, making claims about the benefit of A compared to B while comparing A on a modern system to B on an antiquated system).
Irreproducible
An irreproducible component is one that others cannot use to reproduce the experiments (for example, describing a data analysis technique so inadequately that others cannot repeat it).
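
The inconsistent pitfall lends itself to a simple mechanical check. The sketch below assumes that each run's measurement context is recorded as a dictionary of factors; it reports every factor that differs between two runs even though it was not meant to vary. The function name, the factor names, and the values are all hypothetical and chosen only to mirror the A-versus-B example above.

```python
def inconsistent_factors(context_a, context_b, varied):
    """Return context factors that differ between two runs but are not in `varied`."""
    keys = set(context_a) | set(context_b)
    return sorted(
        key for key in keys
        if key not in varied and context_a.get(key) != context_b.get(key)
    )


# Hypothetical contexts: only "system" is supposed to differ between the runs.
run_a = {"system": "A", "cpu": "2023 server", "os": "Linux 6.1"}
run_b = {"system": "B", "cpu": "2009 desktop", "os": "Linux 6.1"}

print(inconsistent_factors(run_a, run_b, varied={"system"}))  # prints ['cpu']
```

A non-empty result does not prove that a comparison is invalid, but it flags factors the experimenter should either hold constant or account for explicitly.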