Reproducibility

This page covers both how to conduct experiments that are actually reproducible and how to conduct reproduction studies. It also contrasts reproduction with replication.

The issue of reproducibility appears in all components of an experiment (see Evaluation Anti-Patterns): the measurement contexts, workloads, metrics, and data analysis of an experimental evaluation might be irreproducible.
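
As a concrete illustration (a minimal sketch, not taken from any of the works cited below; the file name, parameters, and toy workload are invented), one way to make those components easier to reproduce is to record the measurement context and workload parameters next to the raw results, so a later reproduction study can see exactly what was run and with which inputs:

```python
# Hypothetical sketch: save the measurement context, the workload parameters,
# and the raw results in one record that a reproduction study can consult.
import json
import platform
import random
import subprocess
import time


def capture_context():
    """Record parts of the measurement context that often go unreported."""
    try:
        revision = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip() or "unknown"
    except OSError:
        revision = "unknown"  # experiment code is not in a git checkout
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hostname": platform.node(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "code_revision": revision,
    }


def run_trial(workload_size, seed):
    """Stand-in workload: time a sort of pseudo-random data."""
    rng = random.Random(seed)  # fixed seed makes the input data repeatable
    data = [rng.random() for _ in range(workload_size)]
    start = time.perf_counter()
    sorted(data)
    elapsed = time.perf_counter() - start
    return {"workload_size": workload_size, "seed": seed, "elapsed_s": elapsed}


if __name__ == "__main__":
    record = {
        "context": capture_context(),
        "results": [run_trial(100_000, seed) for seed in range(5)],
    }
    # Keep the raw per-trial measurements, not only summary statistics,
    # so the data analysis itself can be redone independently.
    with open("experiment_record.json", "w") as f:
        json.dump(record, f, indent=2)
```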

Besides the material on this page, you may want to read about the Open Science Framework and its Reproducibility Project, read Tom Bartlett's Chronicle of Higher Education blog post "Is Psychology About to Come Undone?", and check out the Reproducible Research Planet web site.

Prior work on reproducibility

  • Feynman 1974. Cargo Cult Science
    • "When I was at Cornell. I often talked to the people in the psychology department. One of the students told me she wanted to do an experiment that went something like this—I don’t remember it in detail, but it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do, A. So her proposal was to do the experiment under circumstances Y and see if they still did A. I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person—to do it under condition X to see if she could also get result A—and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control. She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1935 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see hat happens."
    • "Nowadays there’s a certain danger of the same thing happening, even in the famous field of physics. I was shocked to hear of an experiment done at the big accelerator at the National Accelerator Laboratory, where a person used deuterium. In order to compare his heavy hydrogen results to what might happen to light hydrogen he had to use data from someone else’s experiment on light hydrogen, which was done on different apparatus. When asked he said it was because he couldn’t get time on the program (because there’s so little time and it’s such expensive apparatus) to do the experiment with light hydrogen on this apparatus because there wouldn’t be any new result. And so the men in charge of programs at NAL are so anxious for new results, in order to get more money to keep the thing going for public relations purposes, they are destroying—possibly—the value of the experiments themselves, which is the whole purpose of the thing."
  • Mudge 1996. Report on the panel: "how can computer architecture researchers avoid becoming the society for irreproducible results?"
    • Report on a panel discussion at HPCA
    • "[In] Computer Architecture [...] many results that are published are difficult or impossible to confirm"
    • "authors [...] give incomplete information about their experimental procedures"
    • "there is no kudos for validating experiments"
    • "Tilak Agerwala: referees should not accept papers where reproducibility is clearly questionable"
    • "Tilak Agerwala: funding agencies should make reproducibility a criteria for success"
    • "Tom Conte: the intellectual value should be in the architectural ideas, not in their evaluation"
    • "Tom Conte: component for reproducibility is a public forum [...] perhaps this forum should [...] take the form of a watchdog organization"
    • "Michel Dubois: "watchdog" [...] would create a huge overhead"
    • "Michel Dubois: most people don't really care that much about most paper's results"
    • "Michel Dubois: the problem may lie with readers, who should be more critical of the conclusions"
    • "Michael Foster: Reproducing a result means determining which details [of the experiment] are important and which are inessential"
    • "Michael Foster: To make validation possible, originators of the results and validators must cooperate"
    • "Michael Foster: some motivation [...] is needed to encourage you [the originator of the idea] to help in the task [of validation]. [...] co-authorship"
    • "Paul Schneck: many (most?) students of computer science are not educated as scientists. They are trained as programmers."
  • Basili 1996. The role of experimentation in software engineering: past, current, and future
  • Tichy 1998. Should Computer Scientists Experiment More?
    • "An important requirement for any experiment [...] is repeatability. Repeatability ensures that results can be checked independently and thus raises confidence in the results. It helps eliminate errors, hoaxes, and frauds."
    • "Assume that each idea published without validation would have to be followed by at least two validation studies (which is a very mild requirement)"
    • "To obtain such [solid] evidence, we need careful analysis involving experiments, data, and replication."
    • "an effective way to simplify repeated experiments is by benchmarking"
  • Clark et al. 2004. Xen and the art of repeated research
    • Reproduction study of a SOSP paper, includes lessons learned about reproduction studies
    • "repeated research [...] is difficult enough that it should not be left as an exercise to the reader"
    • "repeated research [...] adds additional insight beyond the original results"
    • "repeated research [...] is a great way to gain experience with research"
  • Feitelson 2006. Experimental Computer Science: The Need for a Cultural Change
    • "there is practically no independent replication of the experiments of others"
    • "we try to show that [...] there is a need for reproducibility and repetition of results as advocated by the scientific method"
    • Section 4 focuses entirely on reproducibility
    • "The point of reproducibility is to reproduce the insights, not the numbers. It is more qualitative than quantitative."
    • "In the context of reproducibility it may also be appropriate to challenge the prevailing emphasis on novelty and innovation in computer science, and especially in the systems area."
    • "methodologies should be collected in laboratory manuals [...] These serve as a repository for the collective experience regarding how things should be done"
    • "Replication is more about moving forward than about reviewing the past."
    • "Replication fosters progress because it is hardly ever completely precise. Each replication also introduces a small variation."
  • Drummond 2009. Replicability is not Reproducibility: Nor is it Good Science
    • "A critical point of reproducing an experimental result is that unimportant things are intentionally not replicated"
    • "reproducibility requires changes, replicability avoids them"
    • "removing these differences [between experiments] is what replicability would mean"
    • "the greater the difference from the first experiment, the greater the power of the second [the reproduction]"
    • "one should replicate the result, not the experiment"
    • "what is accepted is the idea that the experimental result empirically justifies"
    • "reproducibility is desirable, [...], the impoverished version, replicability, is one not worth having"
    • "[replication] is the weakest of all possible reproductions of an experimental result, the one with the least power"
    • "[replicability] would cause a great deal of wasted effort"
    • "at best, [replicability] would serve as little more than a policing tool, preventing outright fraud"
    • "sharing the full code would not achieve [reproducibility]"
  • Wieringa et al. 2009. How to Write and Read a Scientific Evaluation Paper
    • "Wrong reasons for rejecting a research paper would be, among others, that the investigated artifact is not novel (rejection for this reason would prevent accumulation of knowledge about an artifact)"
  • Begley 2012. More trial, less error - An effort to improve scientific studies
    • "Bayer Healthcare reported that its scientists could not reproduce some 75 percent of published findings in cardiovascular disease, cancer and women's health."
    • "Lee Ellis of M.D. Anderson Cancer Center and C. Glenn Begley, the former head of global cancer research at Amgen, reported that when the company's scientists tried to replicate 53 prominent studies in basic cancer biology, hoping to build on them for drug discovery, they were able to confirm the results of only six."
    • "Virginia's Nosek [...] recently ran a study in which 1,979 volunteers looked at words printed in different shades of gray and chose which hue on a color chart - from nearly black to almost white - matched that of the printed words. Self-described political moderates perceived the grays more accurately than liberals or conservatives, who literally saw the world in black and white, Nosek said. Rather than publishing the study, Nosek and his colleagues redid it, with 1,300 people. The ideology/shades-of-gray effect vanished. They decided not to publish, figuring the first result was a false positive."

Check out all papers in the bibliography classified under "reproducibility".