We're constantly bombarded with benchmark results, used to pitch everything from web browsers to mobile services. But if benchmarks aren't built properly, the results are erroneous or misleading. Here's what goes into a great benchmark, and how to make your own.
Why Do Benchmarks Matter?
Benchmarks typically measure the performance of the bottlenecks in your system. Benchmarks of your car measure its speed, braking and cornering. Benchmarks of your mechanical toothbrush measure the percentage of plaque it can remove from your teeth. As you attempt to test more complex systems, it becomes increasingly difficult to create accurate benchmarks. These days, computers can be very difficult to test accurately.
On paper, making a great benchmark seems simple — it should be a quantitative test that measures something meaningful, delivers correct results and produces consistent results when repeated under similar circumstances. However, in the real world, it can be difficult to find a test that meets all three criteria. Worse, it's relatively easy for anyone with an agenda to change the starting variables enough to manipulate a benchmark's results. It's more important than ever for you to know the difference between good and bad benchmarks — especially if you want to avoid being hoodwinked.
There are dozens of examples of benchmark shenaniganry over the last decade, but I'm going to pick on Nvidia. In 2008 Nvidia famously claimed that then-high-end quad-core CPUs were overkill, and that the GPU could do everything the CPU could do better and faster. As is frequently the case, there was a demo to sell the point. Nvidia was showing a video transcoding app that used the power of Nvidia GPUs to convert video 19x faster than a quad-core CPU. However, the application used for the CPU part of the comparison was only able to utilise a single core on the CPU, an unusual situation for video conversion apps even then. When the exact same test was run using industry-standard software that could use all four CPU cores, the performance difference was much less dramatic. So, while Nvidia created a benchmark that really did work, the results weren't indicative of the actual performance that people in the real world would get.
The Lab vs. The Real World
There are two basic types of benchmarks: synthetic and real world. Even though we tend to favour real-world benchmarks at Maximum PC (where I am editor-in-chief), both types of tests have their place. Real-world benchmarks are fairly straightforward — they're tests that mimic a real-world workflow, typically using common applications (or games) in a setting common to the typical user. On the other hand, synthetic benchmarks are artificial tests, typically used to measure specific parts of a system. For example, synthetic benchmarks let you measure the pixel refresh speed of a display or the floating-point computational chutzpah of a CPU. However, the danger of relying on synthetic benchmarks is that they may not measure differences a user would actually experience.
Let's look at hard drive interface speeds, for instance. Synthetic benchmarks of the first-generation SATA interface showed a speedy pipe between SATA hard drives and the rest of the system—the connection benchmarked in the vicinity of 150MB/sec. When the second-generation SATA 3Gbps spec was introduced, tests showed it was twice as fast, delivering around 300MB/sec of bandwidth to each drive. However, it wasn't correct to say that SATA 3Gbps-equipped drives were twice as fast as their first-gen SATA kin. Why not? In the real world, that extra speed didn't matter. If you tested two identical drives, and enabled SATA 3Gbps on one and disabled it on the other, you'd notice minimal—if any—performance differences. The mechanical hard drives of the era weren't capable of filling either pipe to capacity—a higher ceiling means nothing when nobody's bumping their head. (Today, SSDs and even large mechanical disks can saturate a SATA 3Gbps pipe, but that's a topic for another day.)
So, real-world benchmarks are perfect, right? Not necessarily. Let's look at the Photoshop script we run at Maximum PC to measure system performance. We built a lengthy Photoshop script using dozens of the most common actions and filters, then we measure the time it takes to execute the script on a certain photo using a stopwatch. It's a relatively simple test, but there's still plenty of opportunity for us to muck it up. We could use an image file that's much smaller or larger than what you currently get from a digital camera. If we ran the script on a 128KB JPEG or a 2GB TIFF, it would measure something different than it does using the 15MB RAW file we actually use for the test.
So, how do we know that our Photoshop benchmark is delivering correct results? We test it. First, we run the benchmark many times on several different hardware configurations, tweaking every relevant variable on each configuration. Depending on the benchmark, we test different memory speeds, amounts of memory, CPU architectures, CPU speeds, GPU architectures, GPU memory configurations, different speed hard drives and a whole lot more; then we analyse the results to see which changes affected the benchmark, and by how much.
By comparing our results to the changes we made, as well as to other known-good tests, we can determine precisely what a particular benchmark measures. In the case of our Photoshop script, both CPU-intensive math and hard disk reads can change the results. With two variables affecting the outcome, we know that while the test result is very valuable, it is not, all by itself, definitive. That's an important concept: No one benchmark will tell you everything you need to know about the performance of a complex system.
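The bookkeeping behind that analysis is simple enough to automate. Here's a minimal sketch in Python (the configuration labels and timings are hypothetical, not our actual lab numbers) that averages several runs per configuration and reports each one's percentage change against a baseline, which makes it easy to see which variables actually moved the needle:

```python
from statistics import mean

def summarise(runs_by_config, baseline="stock"):
    """Compare mean runtimes of each configuration against a baseline.

    runs_by_config: dict mapping a config label to a list of runtimes
    in seconds. Returns {label: (mean_seconds, percent_change)}.
    """
    base = mean(runs_by_config[baseline])
    report = {}
    for label, runs in runs_by_config.items():
        m = mean(runs)
        report[label] = (m, (m - base) / base * 100.0)
    return report

# Hypothetical timings: only the faster CPU moves the needle much,
# which would tell you the test is CPU-bound rather than memory-bound.
runs = {
    "stock":      [120.4, 121.0, 120.7],
    "faster-ram": [119.8, 120.1, 119.9],
    "faster-cpu": [96.2, 95.9, 96.4],
}
for label, (m, pct) in summarise(runs).items():
    print(f"{label:12s} {m:7.1f}s  {pct:+6.1f}%")
```

The same table works for any benchmark: swap in whatever configurations you tested, and read off which changes mattered.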
Making Your Own Photoshop Benchmark
Once you get the hang of it, it's never a bad idea to run your own benchmarks on a fairly regular basis. It will help you monitor your machine to make sure its performance isn't degrading over time, and if you do add any upgrades, it will help you see if they're actually doing anything. Just don't forget to run a few tests when your computer is new (and theoretically performing at its peak), or before you swap in new RAM or a new HDD or other parts. If you forget, you won't have a starting data point to compare to future results.
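Keeping those baseline numbers somewhere safe is worth a few lines of code. Here's a rough Python sketch (the `bench_history.json` file name and the 10 per cent warning threshold are my own choices, not any standard) that logs each result with a timestamp and flags a run that's noticeably slower than your best:

```python
import json
import time
from pathlib import Path

LOG = Path("bench_history.json")  # hypothetical log file name

def record_result(test_name, seconds):
    """Append a timestamped result so future runs have a baseline to beat."""
    history = json.loads(LOG.read_text()) if LOG.exists() else []
    history.append({"test": test_name, "seconds": seconds, "when": time.time()})
    LOG.write_text(json.dumps(history, indent=2))
    # Warn if this run is more than 10% slower than the best recorded run.
    best = min(h["seconds"] for h in history if h["test"] == test_name)
    if seconds > best * 1.10:
        print(f"{test_name}: {seconds:.1f}s is "
              f"{seconds / best - 1:.0%} slower than your best ({best:.1f}s)")

# Example: log a (made-up) Photoshop script time from a fresh install.
record_result("photoshop-script", 124.5)
```

Run it after every benchmarking session and the history file quietly becomes the "starting data point" you'll wish you had later.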
If you don't own an expensive testing suite like MobileMark or 3DMark, don't sweat it. If you have an application that you use regularly and that can record and play back macros or scripts, like Photoshop, you can build a script that includes the activities you frequently perform. As one of our regular system-testing benchmarks at Maximum PC, we run a 10MP photograph through a series of the filters, rotations and resizes that we frequently use.
To make your own, launch Photoshop and open your image. Then go to Window —> Actions, and click the down arrow in that palette to select New Action. Name it and click Record, then proceed to put your file through your assorted mutations. Always remember to revert to the original file between each step, and make the final action a file close, so you can easily tell when the benchmark is done. Pile in a lot of actions: as a general rule, you want the total script to take at least two minutes to run — the longer it takes, the less small inaccuracies in your stopwatch work matter. When you're finished assigning actions and have closed the file, click the little Stop button in the action palette to finish your script.
Once finished, make sure your new action is highlighted, then click the menu down arrow in the Action palette again and select Action Options. Assign a function key, which will let you start your benchmark by pressing a keyboard shortcut. (We use F2.) Then, open the Action palette menu again, and select Playback Options. Set it to Step-by-Step and uncheck Pause for Audio Annotation. Once that's done, ready your stopwatch. (Most mobile phones include one, in case you aren't a track coach.) Load your image, then simultaneously start the stopwatch and press the keyboard shortcut you just selected. Stop the stopwatch when the file closes. We typically run this type of test three times, to minimise any human error we introduce by manually timing the test. If you want to try the same script we use at Maximum PC, you can download it here.
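If the application you're testing can be driven from the command line rather than by hand, you can skip the stopwatch entirely. This is a minimal Python sketch (the example command is a placeholder, not our actual Photoshop script) that runs a command several times and reports per-run and average wall-clock times, mirroring the run-it-three-times approach we use to smooth out timing error:

```python
import subprocess
import sys
import time
from statistics import mean, stdev

def time_command(cmd, runs=3):
    """Run a command several times and report per-run and average wall time.

    Repeating the run and averaging smooths out the same noise a manual
    stopwatch does; cmd is any command list subprocess can execute.
    """
    times = []
    for i in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
        print(f"run {i + 1}: {times[-1]:.2f}s")
    summary = f"average {mean(times):.2f}s"
    if runs > 1:
        summary += f", stdev {stdev(times):.2f}s"
    print(summary)
    return times

# Example: time a trivial placeholder workload three times.
time_command([sys.executable, "-c", "sum(range(10**6))"])
```

A low standard deviation across runs is your sign that the test is repeatable; a high one means something else on the machine is interfering.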
Additionally, if you're a gamer, there are tons of games with built-in benchmarks. These help you work out which settings to run to maximise image quality without sacrificing framerate, and they also let you track changes in your computer's overall speed over time.
Check out the Resident Evil 5 benchmark, which includes both DirectX 9 and DirectX 10 modes. Running this test is easy—simply install it and select DirectX 9 or DirectX 10 mode. (Remember, you'll need a Radeon 4800 series card or newer, or a GeForce 8800 series card or newer, and be running Vista or Windows 7 to use DirectX 10 mode.) If you want to compare performance over a period of time, we recommend the fixed run; it's simply more repeatable. If you're trying to decide which settings to use, the variable mode isn't as consistent, but it shows actual gameplay, which will be more representative of your in-game experience. Once you're in the game, you'll want to change to your flat panel's native resolution and do a test run of your benchmark. For a single-player game, we like to choose settings that minimise framerate drops below 30fps. For multiplayer, we sacrifice image quality for speed and target 60fps. After all, dropped frames in a deathmatch will get you killed.
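If your benchmarking tool can log per-frame render times (several frame-capture utilities can dump them to a file), you can check those framerate targets directly. Here's a small Python sketch, using a made-up frame-time list, that shows why the share of slow frames matters more for perceived smoothness than the average framerate does:

```python
def frame_report(frametimes_ms, target_fps=30):
    """Summarise a list of per-frame render times (in milliseconds).

    Returns (average_fps, share_of_frames_slower_than_target). A decent
    average can still hide stutter, so the slow-frame share is the number
    to watch when tuning settings.
    """
    total_s = sum(frametimes_ms) / 1000.0
    avg_fps = len(frametimes_ms) / total_s
    budget_ms = 1000.0 / target_fps  # e.g. 33.3ms per frame at 30fps
    slow = sum(1 for t in frametimes_ms if t > budget_ms)
    return avg_fps, slow / len(frametimes_ms)

# Hypothetical log: mostly smooth 16.7ms frames with a few 40ms stutters.
log = [16.7] * 95 + [40.0] * 5
avg, slow_share = frame_report(log, target_fps=30)
print(f"average {avg:.0f}fps, {slow_share:.0%} of frames below 30fps")
```

In this made-up log the average looks comfortably above 30fps, yet one frame in twenty still blows the 33ms budget — exactly the kind of drop that gets you killed in a deathmatch.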
The Practical Upshot
Like everything else, there are good benchmarks and bad benchmarks. However, there's absolutely nothing mysterious about the way a benchmark should work. In order to know whether you can trust the benchmarks you read online, you need to know exactly what's being tested — how the scenario starts, what variables are changed and exactly what's being measured. If you can't tell that a test is being run in a fair, apples-to-apples manner, ask questions or try duplicating the tests yourself. And when someone doesn't want to share their testing methodology? That's always a little suspicious to me.
Will Smith is the Editor-in-Chief of Maximum PC, not the famous actor/rapper. His work has appeared in many publications, including Maximum PC, Wired, Mac|Life, T3, and on the web at Maximum PC and Ars Technica. He's the author of The Maximum PC Guide to Building a Dream PC.