Intel Details Its Nervana Inference and Training AI Cards

This site may earn affiliate commissions from the links on this page. Terms of use.

Hot Chips 31 is underway this week, with presentations from a number of companies. Intel has decided to use the highly technical conference to discuss a variety of products, including major sessions focused on the company’s AI division. AI and machine learning are viewed as critical areas for the future of computing, and while Intel has tackled these fields with features like DL Boost on Xeon, it’s also building dedicated accelerators for the market.

The NNP-I 1000 (Spring Hill) and the NNP-T (Spring Crest) are intended for two different markets, inference and training. “Training” is the work of creating and teaching a neural network how to process data in the first place. Inference refers to the task of actually running the now-trained neural network model. It requires far more computational horsepower to train a neural network than it does to apply the results of that training to real-world categorization or classification tasks.
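To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy one-weight model, not anything Intel ships, and the function names `train` and `infer` are made up for illustration. Training loops over the data many times, running a forward pass, a gradient computation, and a weight update each time. Inference is a single forward pass with the weights frozen.

```python
# Toy illustration: fit y = w * x with gradient descent (training),
# then apply the learned weight to a new input (inference).

def train(samples, epochs=200, lr=0.01):
    """Many passes over the data, each with a forward and a backward step."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x                # forward pass
            grad = 2 * (pred - y) * x   # gradient of squared error w.r.t. w
            w -= lr * grad              # weight update
    return w


def infer(w, x):
    """A single forward pass with frozen weights."""
    return w * x


samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by y = 3x
w = train(samples)
print(round(w, 3), round(infer(w, 4.0), 3))
```

Even in this toy case, training performs 600 forward-and-update steps while inference performs one multiplication, which is the asymmetry the two Nervana parts are built around.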

Intel’s Spring Crest NNP-T is designed to scale out to an unprecedented degree, with a balance between tensor processing capability, on-package HBM, networking capability, and on-die SRAMs to boost processing performance. The underlying chip is built by TSMC (yes, TSMC) on 16nm, with a 680mm² die and a 1,200mm² interposer. The entire assembly comprises 27 billion transistors, four 8GB stacks of HBM2-2400 memory, and 24 Tensor Processing Clusters (TPCs) with a core frequency of up to 1.1GHz. Sixty-four lanes of SerDes HSIO provide 3.58Tbps of aggregate bandwidth, and the card supports an x16 PCIe 4.0 connection. Power consumption is expected to be between 150W and 250W. The chip was built using TSMC’s advanced CoWoS packaging (Chip-on-Wafer-on-Substrate), and carries 60MB of cache distributed across its various cores. CoWoS competes with Intel’s EMIB, but Intel has decided to build this hardware at TSMC rather than using its own foundries. Performance is estimated at up to 119 TOPS.
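A few derived figures follow from the quoted specs. The sketch below is back-of-the-envelope arithmetic, not Intel data. The 1,024-bit interface width per HBM2 stack is an assumption taken from the HBM2 standard rather than from Intel’s presentation, and the per-lane SerDes number is a plain division that says nothing about how the bandwidth is split between directions.

```python
# Figures quoted in the article.
HBM_STACKS = 4
GB_PER_STACK = 8
HBM_PIN_RATE_GBPS = 2.4          # HBM2-2400: 2.4 Gb/s per pin
SERDES_LANES = 64
SERDES_AGGREGATE_TBPS = 3.58

# Assumption (not from the article): the standard HBM2 interface
# width of 1,024 bits per stack.
HBM_BUS_BITS_PER_STACK = 1024

# Total on-package memory capacity in GB.
hbm_capacity_gb = HBM_STACKS * GB_PER_STACK

# Peak theoretical HBM bandwidth in GB/s (bits per second / 8).
hbm_bandwidth_gbs = HBM_STACKS * HBM_BUS_BITS_PER_STACK * HBM_PIN_RATE_GBPS / 8

# Aggregate SerDes bandwidth divided evenly across lanes, in Gb/s.
serdes_per_lane_gbps = SERDES_AGGREGATE_TBPS * 1000 / SERDES_LANES

print(f"HBM capacity:    {hbm_capacity_gb} GB")
print(f"HBM bandwidth:   {hbm_bandwidth_gbs:.1f} GB/s (theoretical peak)")
print(f"SerDes per lane: {serdes_per_lane_gbps:.1f} Gb/s")
```

Under those assumptions the card carries 32GB of HBM2 with roughly 1.2TB/s of theoretical peak bandwidth, and each SerDes lane accounts for about 56Gbps of the aggregate figure.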

“We don’t want to waste die area on things we don’t need,” Intel VP of Hardware Carey Kloss told Next Platform. “Our instruction set is simple; matrix multiply, linear algebra, convolutions. We don’t have registers per se, everything is a tensor (2D, 3D, or 4D).” A lot is defined in software, including the ability to program the chip the same way whether a model is split to run on or off die. “Think of it as a hierarchy,” Kloss said in the interview. “You can use the same set of instructions to move data between two clusters in one group next to one HBM or between groups or even die in a network. We want to make it simple for software to manage the communication.”

All data on the NNP-T architecture is courtesy of Intel, and the performance figures shared in the company’s microbenchmarks have not been validated by ExtremeTech.

The NNP-T is designed to scale outwards effectively without requiring a switch. Multiple NNP-T accelerators can be connected together in the same chassis, and the cards support glueless chassis-to-chassis and even rack-to-rack connections. There are four QSFP (Quad Small Form-factor Pluggable) network ports on the back of each mezzanine card.

We don’t have performance data yet, but this is the high-end training card Intel will come to market with to compete against the likes of Nvidia. It’s not yet clear how eventual solutions like Xe, which won’t ship for data centers until 2021, will fit into the company’s future product portfolio once it has both tensor processing cores and GPUs in the data center market.

Spring Hill / NNP-I: Ice Lake On-Board

Spring Hill, Intel’s new inference accelerator, is an entirely different beast. Where the NNP-T is designed for 150-250W power envelopes, the NNP-I is a 10-50W part intended to plug into an M.2 slot. It features two Ice Lake CPU cores paired with 12 Inference Compute Engines (ICE).

The 12 ICE engines and dual CPU cores are backed by 24MB of coherent L3 cache, and the CPU cores support both AVX-512 and VNNI instructions. There are two on-die LPDDR4X memory controllers connected to a pool of on-card LPDDR4X memory. DRAM bandwidth is up to 68GB/s, but Intel has not yet disclosed the capacity. Spring Hill can be added to any modern server that supports M.2 slots. According to Intel, the device communicates over the M.2 riser like a PCIe product rather than via NVMe.

The goal with the NNP-I is to run operations on the AI processor with less overhead required from the primary CPU in the system. The device connects via PCIe (both PCIe 3.0 and 4.0 are supported) and handles the AI workload, using the on-die Ice Lake cores for any necessary processing. The on-die SRAMs and DRAM provide local memory bandwidth.

The Inference Compute Engine supports a variety of numeric formats, ranging from FP16 down to INT1, with a programmable vector processor and 4MB of SRAM for each individual ICE.

There’s also a tensor engine, dubbed the Deep Learning Compute Grid, and a Tensilica Vision P6 DSP (used for processing workloads that aren’t tuned for running in the fixed-function DL Compute Grid).

The overall memory subsystem of the NNP-I is also optimized, with the L3 cache broken into eight 3MB slices, shared between the ICE and CPU cores. The goal is to keep data as near to the processing elements that need it as possible. Intel claims the NNP-I can deliver ResNet50 performance of 3,600 inferences per second when running at a 10W TDP. That works out to 4.8 TOPS/watt, which meets Intel’s overall efficiency goals (the company claims that NNP-I is most efficient at lower wattages).
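The NNP-I numbers quoted so far can be cross-checked with simple arithmetic. The sketch below uses only figures from the article. It assumes the 4.8 TOPS/watt efficiency figure applies at the same 10W operating point as the ResNet50 result, which the text implies but Intel’s slides are not quoted as confirming.

```python
# Figures quoted in the article.
L3_SLICES = 8
L3_SLICE_MB = 3
ICE_COUNT = 12
ICE_SRAM_MB = 4
RESNET50_INF_PER_SEC = 3600
TDP_WATTS = 10
TOPS_PER_WATT = 4.8

# Eight 3MB slices should add up to the 24MB of L3 quoted earlier.
l3_total_mb = L3_SLICES * L3_SLICE_MB

# Total SRAM private to the ICE units (separate from the shared L3).
ice_sram_total_mb = ICE_COUNT * ICE_SRAM_MB

# ResNet50 throughput normalized by power.
inf_per_sec_per_watt = RESNET50_INF_PER_SEC / TDP_WATTS

# Assumption: the efficiency figure holds at the 10W operating point.
implied_tops = TOPS_PER_WATT * TDP_WATTS

print(f"L3 total:            {l3_total_mb} MB")
print(f"ICE SRAM total:      {ice_sram_total_mb} MB")
print(f"ResNet50 efficiency: {inf_per_sec_per_watt:.0f} inferences/s per watt")
print(f"Implied throughput:  {implied_tops:.0f} TOPS at {TDP_WATTS} W")
```

The slice count matches the 24MB L3 figure, the ICE units hold another 48MB of SRAM between them, and the efficiency claim implies roughly 48 TOPS at 10W.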

Intel doesn’t expect the NNP-I to come to the retail market, but inference solutions are doing a brisk business compared with the high-end data center-centric training solutions. The NNP-I could ship to a wide range of customers in the not-too-distant future, depending on overall uptake.

Both of these solutions are intended to challenge Nvidia in the data center. While they’re both quite different from Xeon Phi, you could argue that they collectively target some of the spaces Intel wanted to sell Xeon Phi into, albeit in very different ways. That’s not necessarily a bad thing, however — when the original Larrabee was built, the idea of using GPUs for AI and data center work was a distant concept. Revisiting the topic with a new specialized architecture for both inference and training is a smart move for Intel, if the company can grab volume away from Nvidia.


Intel Announces Cooper Lake Will Be Socketed, Compatible With Future Ice Lake CPUs


Intel may have launched Cascade Lake relatively recently, but there’s another 14nm server refresh already on the horizon. Intel lifted the lid on Cooper Lake today, giving some new details on how the CPU fits into its product lineup with Ice Lake 10nm server chips already supposedly queuing up for 2020 deployment.

Cooper Lake’s features include support for the Google-developed bfloat16 format. It will also support up to 56 CPU cores in a socketed format, unlike Cascade Lake-AP, which scales up to 56 cores but only in a soldered, BGA configuration. The new socket will reportedly be known as LGA4189. There are reports that these chips could offer up to 16 memory channels (because Cascade Lake-AP and Cooper Lake both use multiple dies on the same chip, the implication is that Intel may launch up to 16 memory channels per socket with the dual-die version).


The bfloat16 support is a major addition to Intel’s AI efforts. The IEEE 754 standard has been around for more than 30 years and has defined a 16-bit half-precision format since its 2008 revision, but bfloat16 changes the balance between how much of the format is used for significant digits and how much is devoted to exponents. IEEE 754 half precision prioritizes precision, with just five exponent bits and 10 fraction bits. bfloat16 instead uses eight exponent bits and seven fraction bits, which allows for a much greater range of values but at lower precision. This is particularly valuable for AI and deep learning calculations, and is a major step on Intel’s path to improving the performance of these workloads on CPUs. Intel has published a whitepaper on bfloat16 if you’re looking for more information on the topic.

Google claims that using bfloat16 instead of conventional half-precision floating point can yield significant performance advantages. The company writes: “Some operations are memory-bandwidth-bound, which means the memory bandwidth determines the time spent in such operations. Storing inputs and outputs of memory-bandwidth-bound operations in the bfloat16 format reduces the amount of data that must be transferred, thus improving the speed of the operations.”
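The range-versus-precision trade-off is easy to see in code. A bfloat16 value has the same sign bit and eight exponent bits as a 32-bit float, followed by seven fraction bits, so it is effectively the top half of a float32. The sketch below uses only Python’s standard `struct` module, and the helper names are made up for illustration. It converts by truncation, which is a simplification: real hardware and libraries typically round to nearest even.

```python
import struct


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 form."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 fraction bits


def from_bfloat16_bits(bits16):
    """Expand a bfloat16 bit pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def bf16(x):
    """Round-trip x through bfloat16 (truncating)."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


def fits_in_float16(x):
    """True if x is within the range of IEEE 754 half precision."""
    try:
        struct.pack(">e", x)  # 'e' is the half-precision format code
        return True
    except OverflowError:
        return False


print(bf16(3.14159))           # 3.140625: only a few significant digits survive
print(fits_in_float16(1e30))   # False: half precision tops out at 65504
print(bf16(1e30))              # finite, within about 1% of 1e30
```

Each bfloat16 value also occupies two bytes rather than the four a float32 needs, which is the halving of memory traffic that the Google quote refers to.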

The other advantage of Cooper Lake is that the CPU will reportedly share a socket with Ice Lake servers coming in 2020. One major theorized distinction between the two families is that Ice Lake servers on 10nm may not support bfloat16, while 14nm Cooper Lake servers will. This could be the result of increased differentiation in Intel’s product lines, though it’s also possible that it reflects 10nm’s troubled development.

Bringing 56 cores to market in a socketed form factor indicates Intel expects Cooper Lake to expand to more customers than Cascade Lake / Cascade Lake-AP targeted. It also raises questions about what kind of Ice Lake servers Intel will bring to market, and whether we’ll see 56-core versions of these chips as well. To date, all of Intel’s messaging around 10nm Ice Lake has focused on servers or mobile. This may mirror the strategy Intel used for Broadwell, where desktop versions of the CPU were few and far between and mobile and server parts dominated the family. Intel later said, however, that skipping a Broadwell desktop release was a mistake. Whether Intel is keeping an Ice Lake desktop launch under its hat, or has decided that skipping desktop makes sense this time around, is still unclear.

Cooper Lake’s focus on AI processing implies that it isn’t necessarily intended to go toe-to-toe with AMD’s upcoming 7nm Epyc. AMD hasn’t said much about AI or machine learning workloads on its processors, and while its 7nm chips add support for 256-bit AVX2 operations, we haven’t heard anything from the CPU division at the company to imply a specific focus on the AI market. AMD’s efforts in this space are still GPU-based, and while its CPUs will certainly run AI code, it doesn’t seem to be gunning for the market the way Intel has. Between adding new support for AI to existing Xeons, its Movidius and Nervana products, projects like Loihi, and plans for the data center market with Xe, Intel is trying to build a market for itself to protect its HPC and high-end server business — and to tackle Nvidia’s own current dominance of the space.
