Since at least Computex, Intel has been raising concerns with reviewers about the types of tests we run, which applications reviewers tend to use, and whether those tests are capturing ‘real-world’ performance. Specifically, Intel feels that far too much emphasis is put on tests like Cinebench, while the applications that people actually use are practically ignored.
Let’s get a few things out of the way up-front.
Every company has benchmarks that it prefers and benchmarks that it dislikes. The fact that some tests run better on AMD versus Intel, or on Nvidia versus AMD, is not, in and of itself, evidence that the benchmark has been deliberately designed to favor one company or the other. Companies tend to raise concerns about which benchmarks reviewers are using when they are facing increased competitive pressure in the market. Those of you who think Intel is raising questions about the tests we reviewers collectively use partly because it’s losing in a lot of those tests are not wrong. But just because a company has self-interested reasons to be raising questions doesn’t automatically mean that the company is wrong, either. And since I don’t spend dozens of hours and occasional all-nighters testing hardware to give people a false idea of how it will perform, I’m always willing to revisit my own conclusions.
What follows are my own thoughts on this situation. I don’t claim to speak for any reviewer other than myself.
What Does ‘Real-World’ Performance Actually Mean?
Being in favor of real-world hardware benchmarks is one of the least controversial opinions one can hold in computing. I’ve met people who didn’t necessarily care about the difference between synthetic and real-world tests, but I don’t ever recall meeting someone who thought real-world testing was irrelevant. The fact that nearly everyone agrees on this point does not mean everyone agrees on where the lines are between a real-world and a synthetic benchmark. Consider the following scenarios:
- A developer creates a compute benchmark that tests GPU performance on both AMD and Nvidia hardware. It measures the performance both GPU families should offer in CUDA and OpenCL. Comparisons show that its results map reasonably well to applications in the field.
- A 3D rendering company creates a standalone version of its application to compare performance across CPUs and/or GPUs. The standalone test accurately captures the basic performance of the (very expensive) 3D rendering suite in a simple, easy-to-use test.
- A 3D rendering company creates a number of test scenes for benchmarking its full application suite. Each scene focuses on highlighting a specific technique or technology. They are collectively intended to show the performance impact of various features rather than offering a single overall render.
- A game includes a built-in benchmark test. Instead of replicating an exact scene from in-game, the developers build a demo that tests every aspect of engine performance over a several-minute period. The test can be used to measure the performance of new features in an API like DX11.
- A game includes a built-in benchmark test. This test is based on a single map or event in-game. It accurately measures performance in that specific map or scenario, but does not include any data on other maps or scenarios.
You’re going to have your own opinion about which of these scenarios (if any) constitute a real-world benchmark, and which do not. Let me ask you a different question — one that I genuinely believe is more important than whether a test is “real-world” or not. Which of these hypothetical benchmarks tells you something useful about the performance of the product being tested?
The answer is: “Potentially, all of them.” Which benchmark I pick is a function of the question that I’m asking. A synthetic or standalone test that functions as a good model for a different application is still accurately modeling performance in that application. It may be a far better model for real-world performance than tests performed in an application that has been heavily optimized for a specific architecture. Even though all of the tests in the optimized app are “real-world” — they reflect real workloads and tasks — the application may itself be an unrepresentative outlier.
All of the scenarios I outlined above have the potential to be good benchmarks, depending on how well they generalize to other applications. Generalization is important in reviewing. In my experience, reviewers generally try to balance applications known to favor one company with apps that run well on everyone’s hardware. Oftentimes, if a vendor-specific feature is enabled in one set of data, reviews will include a second set of data with the same feature disabled, in order to provide a more neutral comparison. Running vendor-specific flags can sometimes harm the ability of the test to speak to a wider audience.
Intel Proposes an Alternate Approach
Up until now, we’ve talked strictly about whether a test is real-world in light of whether the results generalize to other applications. There is, however, another way to frame the topic. Intel surveyed users to see which applications they actually used, then presented us with that data. It looks like this:
The implication here is that by testing the most common applications installed on people’s hardware, we can capture a better, more representative use-case. This feels intuitively true — but the reality is more complicated.
Just because an application is frequently used doesn’t make it an objectively good benchmark. Some applications are not particularly demanding. While there are absolutely scenarios in which measuring Chrome performance could be important, like the low-end notebook space, good reviews of these products already include these types of tests. In the high-end enthusiast context, Chrome is unlikely to be a taxing application. Are there test scenarios that can make it taxing? Yes. But those scenarios don’t reflect the way the application is most commonly used.
The real-world experience of using Chrome on a Ryzen 7 3800X is identical to using it on a Core i9-9900K. Even if this were not the case, Google makes it difficult to keep a previous version of Chrome available for continued A/B testing. Many people run extensions and adblockers, which have their own impact on performance. Does that mean reviewers shouldn’t test Chrome? Of course it doesn’t. That’s why many laptop reviews absolutely do test Chrome, particularly in the context of browser-based battery life, where Chrome, Firefox, and Edge are known to produce different results. Fit the benchmark to the situation.
There was a time when I spent much more time testing many of the applications on this list than I do now. When I began my career, most benchmark suites focused on office applications and basic 2D graphics tests. I remember when swapping out someone’s GPU could meaningfully improve 2D picture quality and Windows’ UI responsiveness, even without upgrading their monitor. When I wrote for Ars Technica, I wrote comparisons of CPU usage during HD content decoding, because at the time, there were meaningful differences to be found. If you think back to when Atom netbooks debuted, many reviews focused on issues like UI responsiveness with an Nvidia Ion GPU solution and compared it with Intel’s integrated graphics. Why? Because Ion made a noticeable difference to overall UI performance. Reviewers don’t ignore these issues. Publications tend to return to them when meaningful differentiation exists.
I do not pick review benchmarks solely because the application is popular, though popularity may figure into the final decision. The goal, in a general review, is to pick tests that will generalize well to other applications. The fact that a person has Steam or Battle.net installed tells me nothing. Is that person playing Overwatch or WoW Classic? Are they playing Minecraft or No Man’s Sky? Do they choose MMORPGs or FPS-type games, or are they just stalled out in Goat Simulator 2017? Are they actually playing any games at all? I can’t know without more data.
The applications on this list that show meaningful performance differences in common tasks are typically tested already. Publications like Puget Systems regularly publish performance comparisons in the Adobe suite. In some cases, the reason applications aren’t tested more often is that there have been longstanding concerns about the reliability and accuracy of the benchmark suite that most commonly includes them.
I’m always interested in better methods of measuring PC performance. Intel absolutely has a part to play in that process — the company has been helpful on many occasions when it comes to finding ways to highlight new features or troubleshoot issues. But the only way to find meaningful differences in hardware is to find meaningful differences in tests. Again, generally speaking, you’ll see reviewers check laptops for gaps in battery life and power consumption as well as performance. In GPUs, we look for differences in frame time and framerate. Because none of us can run every workload, we look for applications with generalizable results. At ET, I run multiple rendering applications specifically to ensure we aren’t favoring any single vendor or solution. That’s why I test Cinebench, Blender, Maxwell Render, and Corona Render. When it comes to media encoding, Handbrake is virtually everyone’s go-to solution — but we check in both H.264 and H.265 to ensure we capture multiple test scenarios. When tests prove to be inaccurate or insufficient to capture the data I need, I use different tests.
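To make the point about running several renderers concrete, here is a minimal sketch of one common way to check whether a single test is skewing the overall picture: normalize each result to a baseline system, summarize with a geometric mean, and look at the spread between the best and worst test. This is an illustration, not a description of how I or any particular publication scores reviews. The test names and every number below are made-up placeholders, and the sketch assumes all scores are "higher is better."

```python
import math


def relative_scores(scores, baseline):
    """Divide each test's score by the baseline system's score for that test.

    Assumes higher is better; a time-based result would need inverting first.
    """
    return {test: scores[test] / baseline[test] for test in baseline}


def geometric_mean(values):
    """Geometric mean, so one outlier test cannot dominate the summary."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical placeholder data, not measurements of any real hardware.
baseline = {"renderer_a": 100.0, "renderer_b": 100.0,
            "renderer_c": 100.0, "renderer_d": 100.0}
candidate = {"renderer_a": 150.0, "renderer_b": 105.0,
             "renderer_c": 102.0, "renderer_d": 98.0}

rel = relative_scores(candidate, baseline)
overall = geometric_mean(rel.values())
spread = max(rel.values()) / min(rel.values())

print(f"per-test ratios: {rel}")
print(f"geometric mean:  {overall:.3f}")
print(f"best/worst test: {spread:.2f}x")
```

In this made-up data, one renderer shows a far larger gap than the other three. A wide spread like that is a hint that the outlier may not generalize, which is exactly why a single rendering test is not enough on its own.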
The False Dichotomy
The much-argued difference between “synthetic” and “real-world” benchmarks is a poor framing of the issue. What matters, in the end, is whether the benchmark data presented by the reviewer collectively offers an accurate view of expected device performance. As Rob Williams details at Techgage, Intel has been only too happy to use Maxon’s Cinebench as a benchmark at times when its own CPU cores were dominating performance. In a recent post on Medium, Intel’s Ryan Shrout wrote:
Today at IFA we held an event for attending members of the media and analyst community on a topic that’s very near and dear to our heart — Real World Performance. We’ve been holding these events for a few months now beginning at Computex and then at E3, and we’ve learned a lot along the way. The process has reinforced our opinion on synthetic benchmarks: they provide value if you want a quick and narrow perspective on performance. We still use them internally and know many of you do as well, but the reality is they are increasingly inaccurate in assessing real-world performance for the user, regardless of the product segment in question.
Sounds damning. He follows it up with this slide:
To demonstrate the supposed inferiority of synthetic tests, Intel shows 14 separate results, 10 of which are drawn from 3DMark and PCMark. Both of these apps are generally considered to be synthetic applications. When the company presents data on its own performance versus ARM, it pulls the same trick again:
Why is Intel referring back to synthetic applications in the same blog post in which it specifically calls them out as a poor choice compared with supposedly superior “real-world” tests? Maybe it’s because Intel makes its benchmark choices just like we reviewers do — with an eye towards results that are representative and reproducible, using affordable tests, with good feature sets that don’t crash or fail for unknown reasons after install. Maybe Intel also has trouble keeping up with the sheer flood of software released on an ongoing basis and picks tests to represent its products that it can depend on. Maybe it wants to continue to develop its own synthetic benchmarks like WebXPRT without throwing that entire effort under a bus, even though it’s simultaneously trying to imply that the benchmarks AMD has relied on are inaccurate.
And maybe it’s because the entire synthetic-versus-real-world framing is bad to start with.