AMD has published a blog post discussing how temperatures and thermals are calculated on its Navi GPUs. There has been some concern in the enthusiast community about the temperatures posted by reference cards, given that these GPUs can report thermal junction temps of up to 110 degrees Celsius. This is substantially hotter than the old temperature of 95 C, which used to be treated as a thermal trip point.
Beginning with Radeon VII, AMD made significant changes to how it measures temperature across the GPU die. In the past, AMD writes, “the GPU core temperature was read by a single sensor that was placed in the vicinity of the legacy thermal diode.” That single reading was used to make decisions governing the GPU’s voltage and operating frequency. Radeon VII and now Navi do things differently. Instead of relying on a single sensor, they draw on data from a network of sensors spread across the GPU. AMD has deployed the same AVFS (Adaptive Voltage and Frequency Scaling) strategy that it uses for Ryzen to maximize the performance of its GPUs.
AVFS deploys a network of on-die sensors across the entire chip rather than relying on a single point of measurement. Rather than calibrating voltages and frequencies at the factory and preprogramming a series of defined voltage and frequency steps that all CPUs must achieve, AVFS dynamically measures and delivers the voltage required for each individual CPU to hit its desired clock frequencies. This allows for finer-grained power management across the CPU, improving both performance and power efficiency across a range of targets.
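To make the difference concrete, here is a minimal sketch of why a single preprogrammed voltage costs more than a per-chip one. The chip names and millivolt figures are invented for illustration and are not AMD data; the functions are hypothetical and only model the idea that a fixed table has to cover the weakest chip, while a per-chip scheme can give each chip just what it needs.

```python
# Hypothetical minimum stable voltage (in millivolts) that each chip needs
# to hold the same target clock. Invented numbers, not AMD data.
REQUIRED_MV = {"chip_a": 1020, "chip_b": 1080, "chip_c": 1150}


def fixed_table_voltage(required_mv):
    """A single preprogrammed voltage must satisfy the weakest chip,
    so every chip is driven at the highest requirement in the set."""
    return max(required_mv.values())


def per_chip_voltage(required_mv, chip):
    """A per-chip scheme supplies only what this particular chip needs."""
    return required_mv[chip]


def voltage_margin_saved(required_mv):
    """Millivolts each chip avoids when it is not held to the worst case."""
    fixed = fixed_table_voltage(required_mv)
    return {chip: fixed - per_chip_voltage(required_mv, chip)
            for chip in required_mv}


if __name__ == "__main__":
    print(voltage_margin_saved(REQUIRED_MV))
```

In this toy model the best chip runs 130 mV below the one-size-fits-all setting, which is the kind of headroom that can be spent on lower power or higher clocks.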
The 110-degree junction temperature is not evidence of a problem or a sudden issue with AMD graphics cards. AMD now measures its GPU temperature in new locations and reports additional data points that capture this information because it adopted more sophisticated measuring methods. Arguing that the company should be penalized for reporting data more accurately is akin to arguing that manufacturers ought to hide data because they’re afraid some customers won’t understand it or put it in the proper context.
AMD provides a pair of graphs to illustrate the difference between its Vega 64 and earlier measurement system and how it calibrates voltage on the 5700 XT today. The old discrete state method is shown below:
Now, compare that against the frequency/voltage curve for the 5700 XT.
The 5700 XT is designed to continue boosting performance until it hits its thermal junction threshold. From the company’s blog post:
Paired with this array of sensors is the ability to identify the ‘hotspot’ across the GPU die. Instead of setting a conservative, ‘worst case’ throttling temperature for the entire die, the Radeon RX 5700 series GPUs will continue to opportunistically and aggressively ramp clocks until any one of the many available sensors hits the ‘hotspot’ or ‘Junction’ temperature of 110 degrees Celsius. Operating at up to 110C Junction Temperature during typical gaming usage is expected and within spec. This enables the Radeon RX 5700 series GPUs to offer much higher performance and clocks out of the box, while maintaining acoustic and reliability targets.
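The behavior AMD describes can be sketched as a simple control step: take the hottest reading from all sensors and keep raising clocks until that reading reaches the junction limit. This is only an illustration of the idea, not AMD's firmware. The 110 C limit comes from the quote above; the function names, clock range, and step size are assumptions chosen for the example.

```python
JUNCTION_LIMIT_C = 110  # junction limit quoted by AMD for the RX 5700 series


def hotspot(sensor_temps_c):
    """The hotspot is simply the highest reading among all on-die sensors."""
    return max(sensor_temps_c)


def next_clock(current_mhz, sensor_temps_c,
               step_mhz=25, min_mhz=800, max_mhz=1900):
    """Pick the next clock target (step size and clock range are hypothetical).

    Back off one step if any sensor has reached the junction limit;
    otherwise ramp up one step, staying within the allowed clock range.
    """
    if hotspot(sensor_temps_c) >= JUNCTION_LIMIT_C:
        return max(min_mhz, current_mhz - step_mhz)
    return min(max_mhz, current_mhz + step_mhz)


if __name__ == "__main__":
    # Hottest sensor is at 104 C, so there is still headroom to ramp up.
    print(next_clock(1800, [78, 92, 104]))
    # One sensor has hit 110 C, so the clock backs off.
    print(next_clock(1800, [78, 92, 110]))
```

Because the decision keys off the hottest sensor rather than one fixed location, the limit can sit at the true junction temperature instead of a conservative value padded to cover hot spots that a single sensor might miss.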
There’s a certain knee-jerk “I don’t want 110-degree anything in my case!” reaction from enthusiasts that’s both perfectly understandable and somewhat misguided. It rests on an unconscious assumption that 110 degrees Celsius represents a dangerous temperature (it doesn’t) or implies an extremely loud cooler. The 5700 XT and 5700 are much quieter than Vega 64, but if that’s still too loud, third-party cards are starting to hit the market. Companies like Asus were able to build coolers that handled the R9 290X beautifully, so the 5700 XT should be tamable as well.
Higher temperatures are partially an artifact of better measurement. They’re also a reality of advanced silicon manufacturing nodes. Our ability to pack transistors closer together has outstripped our ability to reduce their power consumption by cutting operating voltages. As a result, increasing transistor density promotes hot spot formation and raises peak temperatures. AVFS helps mitigate this tendency by ensuring that operating voltage is precisely mapped to frequency, but it can’t fix the fact that AMD has packed more transistors into a smaller space, leading to higher thermal density.
Higher temperatures are not an intrinsic reason to be concerned about a product, provided the manufacturer certifies that this is expected behavior. When I got into computing, a CPU temperature of 50 C (measured via in-socket thermistor) was considered extremely high. Today, Intel and AMD build silicon that can operate reliably at 95 C or above for years at a time.