What do you get when you cross a modern superscalar out-of-order CPU core with more traditional microcontroller aspects such as no virtual memory, no memory cache, and no DDR or PCIe controllers? You get the Tesla Dojo, which Chips and Cheese recently did a deep dive on.
It starts with a comparison to the IBM Cell processors. The Cell of the mid-2000s featured something called the SPE (Synergistic Processing Element). These were smaller cores focused on vector processing or other specialized types of workloads. They did not access main memory and had to be given tasks by the fully featured CPU. Dojo has 1.25MB of SRAM that it can use as working memory with five ports, but it has no cache or virtual memory. It uses DMA to get the data it needs via a mesh system. The front end pulls RISC-V-like (heavily MIPS-influenced) instructions into a small instruction cache and decodes eight instructions per cycle.
Interestingly, the front end aggressively prunes instructions such as jumps or conditionals. However, eliminated instructions aren't tracked through the pipeline. Instructions are also not tracked through retirement, so during exceptions and debugging it is unclear which instruction faulted, since instructions are retired out of order.
Despite the wide front end, there are just two ALUs and two AGUs. This makes sense, as integer execution is primarily focused on control flow and logic. The real computing horsepower is in the vector and matrix execution pipelines. With 512-bit vectors and 8x8x4 matrices, each Dojo core comes close to a full BF16 TFLOP. The result is something that looks more like a microcontroller but is wide like a modern desktop CPU.
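To see where a figure near one TFLOP could come from, here is a back-of-the-envelope sketch. It rests on assumptions that are not stated above: it reads "8x8x4" as an (8x8) by (8x4) matrix multiply, counts each multiply-accumulate as two floating-point operations, assumes one such matrix operation issues per cycle, and assumes a 2 GHz clock. The function and constant names are made up for illustration.

```python
# Rough peak-throughput estimate for one Dojo core.
# ASSUMPTIONS (not from the article): 2 GHz clock, one matrix op per cycle,
# and "8x8x4" meaning an (8x8) @ (8x4) multiply.

ASSUMED_CLOCK_HZ = 2.0e9
ASSUMED_MATRIX_OPS_PER_CYCLE = 1


def peak_flops(m, k, n, clock_hz=ASSUMED_CLOCK_HZ,
               ops_per_cycle=ASSUMED_MATRIX_OPS_PER_CYCLE):
    """Peak FLOP/s for an (m x k) @ (k x n) matrix unit."""
    macs_per_op = m * k * n          # multiply-accumulates in one matrix op
    flops_per_op = 2 * macs_per_op   # one multiply plus one add per MAC
    return flops_per_op * ops_per_cycle * clock_hz


if __name__ == "__main__":
    tflops = peak_flops(8, 8, 4) / 1e12
    print(f"Estimated peak: {tflops:.3f} TFLOP/s per core")
```

Under those assumptions the matrix unit performs 512 floating-point operations per cycle, which works out to roughly one TFLOP per core. A different clock or issue rate would scale the estimate linearly.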
All these decisions might seem strange until you step back and look at what Tesla is trying to accomplish. They're going for the smallest possible core to fit as many cores on the die as possible. Without a cache, you don't need any snoop filters or tags in memory to maintain coherency. On TSMC's 7nm process, the Dojo core and SRAM fit in 1.1 square millimeters. Over 71.1% of the die is spent on cores and SRAM (compared to 56% for the AMD Zeppelin). A single Dojo D1 die has 354 Dojo cores. As you can imagine, a Dojo die must communicate with an interface processor, which connects to the host computer via PCIe. However, Dojo deployments often have 25 dies, making this a very scalable supercomputer.
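The scale these numbers imply is easy to tally. The sketch below uses only the figures quoted above (354 cores per die, 1.25MB of SRAM per core, 1.1 square millimeters per core plus SRAM, 25 dies per deployment); the function and constant names are hypothetical, and the area figure covers cores and SRAM only, not the whole die.

```python
# Tally cores, SRAM, and core area from the figures quoted in the article.
# Names are illustrative; only the four constants come from the text.

CORES_PER_DIE = 354
SRAM_PER_CORE_MB = 1.25
CORE_PLUS_SRAM_MM2 = 1.1
DIES_PER_DEPLOYMENT = 25


def scale_summary(dies=DIES_PER_DEPLOYMENT):
    """Return total cores, total SRAM (MB), and per-die core area (mm^2)."""
    cores = dies * CORES_PER_DIE
    return {
        "cores": cores,
        "sram_mb": cores * SRAM_PER_CORE_MB,
        "core_area_mm2_per_die": CORES_PER_DIE * CORE_PLUS_SRAM_MM2,
    }


if __name__ == "__main__":
    for dies in (1, DIES_PER_DEPLOYMENT):
        s = scale_summary(dies)
        print(f"{dies} die(s): {s['cores']} cores, {s['sram_mb']} MB SRAM, "
              f"{s['core_area_mm2_per_die']:.1f} mm^2 of cores+SRAM per die")
```

A 25-die deployment comes to 8,850 cores with a little over 11,000MB of SRAM spread across them, all of it software-managed working memory rather than cache.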
If you're curious about peeling back the layers of more compute cores, look into Alder Lake.