Tichy's "Should Computer Scientists Experiment More?" defines a benchmark as follows:

A benchmark is a task domain sample executed by a computer or by a human and computer. During execution, the human or computer records well-defined performance measurements.

Benchmarks (or, more generally, workloads) are an important component of an experiment (see Evaluation Anti-Patterns). A workload can be inappropriate, ignored, inconsistent, or irreproducible.
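To make the definition concrete, here is a minimal sketch of a benchmark in Tichy's sense: a fixed task sample that a computer executes while recording well-defined performance measurements. The names `run_benchmark` and `sort_workload`, the choice of sorting as the task, and the use of wall-clock time as the only measurement are all assumptions made for illustration; none of them come from the papers listed on this page.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_benchmark(workload: Callable[[], object], repetitions: int = 5) -> Dict[str, float]:
    """Execute a workload several times and record wall-clock timings (hypothetical helper)."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    timings: List[float] = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    # Report a few summary statistics rather than a single number.
    return {
        "repetitions": float(repetitions),
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def sort_workload() -> list:
    # Hypothetical task domain sample: sorting a fixed, deterministic input,
    # so that every run (and every experimenter) executes the same work.
    data = [(i * 7919) % 10007 for i in range(10_000)]
    return sorted(data)


result = run_benchmark(sort_workload, repetitions=5)
print(result)
```

The input is generated deterministically so that the workload stays consistent and reproducible across runs. Whether sorting is an appropriate sample of the task domain under study is a separate question, and one that code cannot answer.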

Prior work on benchmarks

  • Blackburn et al. 2006 The DaCapo Benchmarks: Java Benchmarking Development and Analysis
  • Tempero et al. 2010 Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies
  • Tichy 1998 Should Computer Scientists Experiment More?
    • "an effective way to simplify repeated experiments is by benchmarking"
    • "the most subjective and therefore weakest part of a benchmark test is the benchmark's composition"
    • "constructing a benchmark is usually intensive work"
    • "it is necessary to evolve benchmarks to prevent overfitting"
    • "benchmarks cause an area to blossom suddenly because they make it easy to identify promising approaches and to discard poor ones"

Check out all papers in the bibliography classified under "benchmarks".