To demonstrate the superiority of an innovation, many evaluations compare it to an existing baseline. If one metric is used for the baseline and a different metric, even a slightly different one, is used for the innovation, the metrics are inconsistent and the measurement results are not necessarily comparable.
Example: You claim that your optimization reduces L2 cache misses. Your evaluation of the unoptimized program counts both demand‐load and prefetch request misses, but your evaluation of the optimized program counts only demand‐load misses.
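
One way to guard against this kind of mismatch is to define the metric once and apply that single definition to every configuration. The following is a minimal sketch of that idea, not a prescribed method. The counter names (`l2_demand_load_misses`, `l2_prefetch_misses`) and all counter values are hypothetical, chosen only to illustrate how far an inconsistent comparison can drift from a consistent one.

```python
# Hypothetical event names; real names depend on the CPU and profiling tool.
METRIC_EVENTS = ("l2_demand_load_misses", "l2_prefetch_misses")


def l2_misses(counters, events=METRIC_EVENTS):
    """Sum the same set of events for any configuration."""
    missing = [e for e in events if e not in counters]
    if missing:
        # Refuse to compute a partial metric rather than silently
        # comparing different quantities.
        raise KeyError(f"counters lack events required by the metric: {missing}")
    return sum(counters[e] for e in events)


# Made-up counter values, for illustration only.
baseline = {"l2_demand_load_misses": 1000, "l2_prefetch_misses": 400}
optimized = {"l2_demand_load_misses": 900, "l2_prefetch_misses": 450}

# Inconsistent: baseline counts both events, optimized counts only one.
inconsistent_reduction = 1 - optimized["l2_demand_load_misses"] / l2_misses(baseline)

# Consistent: the identical metric definition is applied to both sides.
consistent_reduction = 1 - l2_misses(optimized) / l2_misses(baseline)

print(f"inconsistent: {inconsistent_reduction:.1%}")
print(f"consistent:   {consistent_reduction:.1%}")
```

With these invented numbers, the inconsistent comparison suggests a reduction of roughly 36%, while the consistent one shows roughly 4%. Raising an error when an event is missing is a deliberate choice in this sketch: it makes a partial measurement fail loudly instead of producing a number that looks comparable but is not.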