Up To 6x Faster Than NVIDIA H100 & 30x Faster Than Intel Xeon 8380, Available In 2H 2023

Tachyum has officially published the whitepaper for its 5nm Prodigy Universal Processor, which was first unveiled back in 2018.

Tachyum Promises Big Numbers In 5nm Prodigy Universal Processor Whitepaper, Up To 9 Times Higher Performance Efficiency Than NVIDIA’s H100

The Tachyum Prodigy CPUs utilize a universal processor design, meaning they can execute CPU, GPU, and TPU workloads on the same chip, saving costs over competing products while also offering very high performance.

The company aims to take on all three chip giants, AMD, Intel & NVIDIA, with its Prodigy lineup. In its presentations, Tachyum has estimated a 4x performance uplift over Intel’s Xeon CPUs, and on the HPC front, a 3x increase over NVIDIA’s H100 and a 6x increase in raw performance in AI & inference workloads. The chips are also said to offer over 10x the performance of competitors’ systems at the same power. Some of the key features of the CPUs include:

  • 128 high-performance unified 64-bit cores running at up to 5.7 GHz
  • 16 DDR5 memory controllers
  • 64 PCIe 5.0 lanes
  • Multiprocessor support for 2-socket and 4-socket platforms
  • Rack solutions for both air-cooled and liquid-cooled data centers
  • SPECrate 2017 Integer performance of around 4x Intel 8380 and around 3x AMD 7763
  • HPC double-precision floating-point performance of 3x NVIDIA H100
  • AI FP8 performance of 6x NVIDIA H100

Tachyum has now released the full whitepaper for its Prodigy Universal Processor, detailing the CPU architecture, platform, and lineup, which scales from the low-power T832-LP 32-core CPU at 180W TDP all the way up to the flagship T16128-AIX, which features a total of 128 cores.

Tachyum Prodigy Universal CPU Architecture – Custom 64-bit Design

The Tachyum Prodigy uses an OOO (Out-Of-Order) architecture that can decode and retire up to 8 instructions per clock and issue up to 11 instructions per clock, with an instruction queue that holds up to 48 instructions and a scheduler with 12 queues that are 15 entries deep. It comes with 4 ALUs, one load unit, one store unit, one load/store unit, one mask unit & two 1024-bit vector units. Each core also has an AI subsystem that includes a 4096-bit matrix unit. Each core is a single-threaded hardware design.
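
For a rough sense of what two 1024-bit vector units per core imply, the back-of-the-envelope sketch below simply multiplies the published widths and clocks. It assumes each vector unit can retire one FP64 fused multiply-add per 64-bit lane per cycle, which the whitepaper does not explicitly state, so treat the result as an illustration rather than a spec.

```python
# Back-of-the-envelope FP64 vector throughput for the flagship Prodigy T16128-AIX.
# Assumption (not stated in the whitepaper): each 1024-bit vector unit retires
# one FP64 fused multiply-add (2 FLOPs) per 64-bit lane per cycle.
VECTOR_UNITS_PER_CORE = 2
LANES_PER_UNIT = 1024 // 64      # 16 FP64 lanes per 1024-bit unit
FLOPS_PER_FMA = 2
CLOCK_HZ = 5.7e9                 # flagship clock
CORES = 128

per_core = VECTOR_UNITS_PER_CORE * LANES_PER_UNIT * FLOPS_PER_FMA * CLOCK_HZ
total = per_core * CORES
print(f"per-core peak : {per_core / 1e9:.0f} GFLOPS FP64")   # ~365 GFLOPS
print(f"chip peak     : {total / 1e12:.1f} TFLOPS FP64")     # ~46.7 TFLOPS dense
```

Doubling that dense figure for sparsity lands in the same ballpark as the roughly 90 TFLOPs (with sparsity) quoted later for the T16128-AIX, but again, the per-lane FMA assumption is ours, not Tachyum’s.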

Coming to the cache configuration, each core packs 64 KB of I-Cache & 64 KB of D-Cache with SECDED ECC. Each core also has 1 MB of L2 with double-error-correct, triple-error-detect (DECTED) ECC. Active cores can also pool the L2 cache of idle CPU cores to act as a shared L3 cache.
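
As a quick illustration of why SECDED protection is cheap relative to the data it guards, the snippet below computes the minimum number of Hamming check bits needed to correct a single-bit error, plus one overall parity bit for double-error detection. It is a generic textbook calculation, not a description of Tachyum’s actual ECC layout.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for SECDED over `data_bits` of data:
    smallest r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # +1 overall parity bit enables double-error detection

for width in (64, 128, 256, 512):
    print(f"{width:>3} data bits -> {secded_check_bits(width)} SECDED check bits")
# 64 data bits need 8 check bits (the classic 72/64 ECC layout);
# 512 data bits (a 64-byte cache line) need 11.
```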

Prodigy employs an innovative coherency protocol, T-MESI (Tachyum-MESI), that is based on MESI. T-MESI adds optimizations over standard MESI that improve latency and performance. In addition to on-chip cache coherency, Prodigy also supports hardware coherency between Prodigy devices, enabling both 2-socket and 4-socket platforms to be fully coherent. Prodigy’s hardware coherency uses eight full-duplex lanes of 112 Gbit/s SERDES links between each set of coherent devices, providing an aggregate of 1.8 Tbit/s of bandwidth between coherent devices.
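
The aggregate coherency bandwidth follows directly from those link numbers, as the quick check below shows; it simply multiplies lanes by lane rate and by two for full duplex.

```python
# Aggregate socket-to-socket coherency bandwidth from the published link specs.
LANES = 8                 # full-duplex SERDES lanes per pair of coherent devices
LANE_RATE_GBPS = 112      # gigabits per second, each direction
DIRECTIONS = 2            # full duplex: count both directions

aggregate_gbps = LANES * LANE_RATE_GBPS * DIRECTIONS
print(f"{aggregate_gbps} Gbit/s ≈ {aggregate_gbps / 1000:.1f} Tbit/s")  # 1792 Gbit/s ≈ 1.8 Tbit/s
```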

Prodigy’s TLB can handle large memory footprints for HPC, up to 128 TB. The MMU is hardware-managed for maximum performance and includes a sophisticated global purge mechanism.

Vector and matrix units

Prodigy’s 2×1024-bit vector subsystems are 2x the width of Intel’s and 4x the width of AMD’s top-end processors. Prodigy’s 4096-bit matrix unit supports 16 x 16, 8 x 8, and 4 x 4 operations. The vector and matrix subsystems support a wide range of data types, including FP64, FP32, TF32, BF16, Int8, FP8, as well as TAI, or Tachyum AI, a new data type that will be announced later this year and will deliver higher performance than FP8. Prodigy’s matrix operations support sparse data types for the highest performance, including 4:2 sparsity, which is also supported by the NVIDIA H100, as well as Tachyum’s Super-Sparsity, which allows even higher performance with an 8:3 ratio.

Sparse data types maximize performance for training and inference with a very minor reduction in accuracy. Lower-precision data types and sparsity are discussed in more detail in the section “Prodigy on the Leading Edge of AI Industry Trends” below. Scatter/Gather operations provide fast, efficient loading and storing for vectors and matrices.
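
To make the sparsity idea concrete, here is a minimal NumPy sketch of the general “keep the N largest values out of every M” structured-pruning pattern, shown as 2-of-4, the scheme NVIDIA documents for the H100. It only illustrates the concept and is not Tachyum’s Super-Sparsity implementation, whose details have not been published.

```python
import numpy as np

def structured_prune(weights: np.ndarray, keep: int = 2, group: int = 4) -> np.ndarray:
    """Zero out all but the `keep` largest-magnitude values in every `group`
    consecutive weights (e.g. keep=2, group=4 for 2-of-4 structured sparsity)."""
    flat = weights.reshape(-1, group)
    pruned = np.zeros_like(flat)
    # Indices of the largest-magnitude entries within each group.
    top = np.argsort(-np.abs(flat), axis=1)[:, :keep]
    rows = np.arange(flat.shape[0])[:, None]
    pruned[rows, top] = flat[rows, top]
    return pruned.reshape(weights.shape)

w = np.random.randn(4, 8).astype(np.float32)
w_sparse = structured_prune(w, keep=2, group=4)   # 50% of the weights survive
print((w_sparse != 0).mean())                     # -> 0.5
```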

Memory and I/O Subsystems

Prodigy integrates an industry-leading sixteen DDR5 memory controllers that run at up to DDR5-7200, providing roughly 1 TB/s of memory bandwidth and supporting 2 DIMMs per channel. Tachyum will be announcing a new feature later this year called “Bandwidth Amplification” that effectively doubles the memory bandwidth to a staggering 2 TB/s. The PCIe subsystem comprises 64 lanes of PCIe 5.0 with 32 PCIe controllers.
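
The roughly 1 TB/s figure can be sanity-checked from the controller count and transfer rate alone; the sketch below assumes the usual 64-bit (8-byte) data path per DDR5 controller, which is how such headline numbers are normally computed.

```python
# Peak theoretical DDR5 bandwidth from the published controller count and speed.
CONTROLLERS = 16
TRANSFER_RATE_MT_S = 7200      # DDR5-7200, mega-transfers per second
BYTES_PER_TRANSFER = 8         # assumed 64-bit data path per controller

peak_gb_s = CONTROLLERS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000
print(f"{peak_gb_s:.0f} GB/s, i.e. roughly {peak_gb_s / 1000:.1f} TB/s")  # ~922 GB/s
# "Bandwidth Amplification" would roughly double this, in line with the ~2 TB/s claim.
```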

The PCIe subsystem comprises four x16 PCIe functional blocks, and each x16 block includes 8 controllers that can bifurcate down to x2, offering maximum flexibility to support external devices ranging from high-performance NICs to large NVMe storage arrays.
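
For reference, the raw throughput those 64 Gen5 lanes represent works out as follows, using the standard 32 GT/s and 128b/130b encoding figures for PCIe 5.0; usable bandwidth is lower after protocol overhead.

```python
# Raw PCIe 5.0 throughput for Prodigy's 64 lanes, per direction.
LANES_TOTAL = 64
GT_PER_S = 32                      # PCIe 5.0 signalling rate per lane
ENCODING = 128 / 130               # 128b/130b line encoding

gb_s_per_lane = GT_PER_S * ENCODING / 8               # ~3.94 GB/s per lane, per direction
print(f"x16 block : {16 * gb_s_per_lane:.0f} GB/s per direction")           # ~63 GB/s
print(f"all lanes : {LANES_TOTAL * gb_s_per_lane:.0f} GB/s per direction")  # ~252 GB/s
# Each x16 block can also bifurcate into as many as eight x2 links (~7.9 GB/s each).
```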

Prodigy Runs x86, Arm & RISC-V Emulation

Prodigy supports software dynamic binary translation for other instruction set architectures (ISAs), including x86, Arm, and RISC-V. x86 is the established data center ISA, Arm is very prevalent in telco applications, and RISC-V is popular with academic institutions. The overhead of binary translation is roughly 30–40%, but Prodigy is expected to run at roughly twice the frequency of competing processors, so performance should be similar to running native code. Binary translation is intended to enable fast, easy out-of-the-box evaluation and testing for customers and partners, with customers migrating to Prodigy’s native ISA for production deployments for maximum performance.
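
That claim combines two factors, the translation overhead and the clock-speed advantage. A simple way to see how they trade off, using the article’s own numbers and treating the 2x frequency figure as Tachyum’s projection rather than a measured result:

```python
# Relative performance of binary-translated code vs. a competing native machine,
# using the overhead and frequency figures quoted above.
def translated_vs_native(overhead: float, freq_ratio: float) -> float:
    """Throughput of translated code relative to a native competitor."""
    return freq_ratio * (1.0 - overhead)

for overhead in (0.30, 0.40):
    rel = translated_vs_native(overhead=overhead, freq_ratio=2.0)
    print(f"{overhead:.0%} overhead, 2x clock -> {rel:.2f}x a native competitor")
# 30% overhead -> 1.40x, 40% overhead -> 1.20x, i.e. comparable to native
# or better under this simple model.
```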

All chips are fabricated on TSMC’s 5nm (N5P) process node, a slightly optimized variant of the standard 5nm (N5) node, and can run native as well as x86, Arm, and RISC-V binaries. As for HPC- and AI-specific features, the Tachyum Prodigy lineup includes:

  • 2 x 1024-bit vector units per core
  • 4096-bit matrix processors per core
  • FP64, FP32, TF32, BF16, Int8, FP8, and TAI data types
  • Sparse data types to optimize efficiency
  • Quantization support using low-precision data types (see the sketch after this list)
  • Scatter/Gather for efficiently loading and storing matrices
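
As a concrete example of what quantization to a low-precision type looks like in practice, here is a minimal symmetric int8 quantization sketch in NumPy. It illustrates the general technique only and makes no claims about how Prodigy’s hardware or Tachyum’s software stack performs it.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float values to int8."""
    scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
err = np.abs(dequantize(q, scale) - x).max()
print(f"max abs error after int8 round-trip: {err:.4f}")   # small relative to the data range
```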

Tachyum Prodigy Universal CPU Lineup/Platform – Scaling From 180W To 950W

All 128 cores on the flagship CPU are clocked at 5.7 GHz and above, and AI customers will get up to 16 memory channels, supporting up to 32 TB (64 DIMMs) of DDR5-7200. The processor will also rock 64 PCIe Gen 5.0 lanes and will come in a 950W TDP package.

The rest of the CPUs that Tachyum will offer are listed in the spec sheet below:

Model                 Cores   Clock     Memory          PCIe       TDP    Market Segment
Prodigy T16128-AIX    128     5.7 GHz   16x DDR5-7200   Gen5 x64   950W   HPC, Big AI
Prodigy T16128-AIM    128     4.5 GHz   16x DDR5-7200   Gen5 x64   700W   HPC, Big AI
Prodigy T16128-AIE    128     4.0 GHz   16x DDR5-7200   Gen5 x64   600W   HPC, Big AI
Prodigy T16128-HT     128     4.5 GHz   16x DDR5-6400   Gen5 x64   300W   Analytics, big data
Prodigy T864-HS       64      5.7 GHz   8x DDR5-6400    Gen5 x32   300W   Cloud, databases
Prodigy T864-HT       64      4.5 GHz   8x DDR5-6400    Gen5 x32   300W   Cloud, databases
Prodigy T832-HS       32      5.7 GHz   8x DDR5-6400    Gen5 x32   300W   Scalar workloads
Prodigy T832-LP       32      3.2 GHz   8x DDR5-4800    Gen5 x32   180W   Hosting, Storage, Edge

Now, that is just one chip, and Tachyum will enable full hardware coherency that supports 2- and 4-socket systems. That works out to up to 512 cores and 3800W of power from four Prodigy T16128-AIX-tier processors.

The Prodigy platform will come in various rack solutions, such as an air-cooled 2U server that can house up to four Tachyum Prodigy chips, 64 16 GB DDR5 DIMMs, and 2x 200 GbE RoCE NICs. There is also a custom 48U rack reference design that comes in two versions, one liquid-cooled and one air-cooled. The air-cooled version supports 40 4-socket 2U servers for a total of 160 chips, while the liquid-cooled version supports 88 4-socket 1U servers for a total of 352 chips. Both racks have a modular design, and two racks can be combined into a 2-rack cabinet to optimize floor space. Each server comes with four cLGA sockets.

Tachyum Prodigy Universal CPU Lineup – Hitting NVIDIA, Intel & AMD All At Once

Tachyum also provides some preliminary performance estimates against Intel Ice Lake, NVIDIA Hopper / Grace HPC chips, and AMD Milan CPUs. The company claims up to a 4x SPECrate 2017 Integer and 30x raw floating-point (FP64) performance increase versus the competition. NVIDIA’s Hopper H100 is the main chip Tachyum appears to have its eyes set on, as it is used in several comparative tests.

Some of the performance figures mentioned include:

  • 3x vs NVIDIA H100 in double-precision floating-point performance
  • 6x vs NVIDIA H100 in AI FP8 performance
  • 9x vs NVIDIA H100 in performance per watt
  • 4x vs Intel Xeon Platinum 8380 in SPECrate 2017 Integer performance
  • 30x vs Intel Xeon Platinum 8380 in FP64 performance


The Prodigy T16128-AIX offers around 90 TFLOPs of FP64 performance (with sparsity). The company uses an air-cooled Prodigy rack, estimated to deliver up to 6.2 PetaFLOPs of HPC FP64 horsepower, versus an NVIDIA H100 DGX POD rack, which offers 960 TFLOPs of FP64 HPC performance. The liquid-cooled Prodigy rack, which can hold higher-end chips, should offer over double the performance at 12.9 PetaFLOPs.
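
Using only the figures quoted above, the rack-level numbers line up as follows. The whitepaper does not spell out which SKU or operating point the rack estimates assume, so the per-chip values below are simply implied averages, not stated specs.

```python
# Rack-level FP64 figures implied by the whitepaper's own numbers.
racks = {
    #               servers, sockets/server, quoted rack FP64 (PFLOPS)
    "air-cooled":    (40, 4, 6.2),
    "liquid-cooled": (88, 4, 12.9),
}
for name, (servers, sockets, pflops) in racks.items():
    chips = servers * sockets
    per_chip_tflops = pflops * 1000 / chips
    print(f"{name:>13}: {chips} chips, {pflops} PFLOPS "
          f"(~{per_chip_tflops:.0f} TFLOPS per chip implied)")
# liquid-cooled vs air-cooled: 12.9 / 6.2 ≈ 2.1x the FP64 throughput from 2.2x the chips.
```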

Tachyum expects the first Prodigy chips to begin sampling later this year, with volume production anticipated in the second half of 2023. The next-gen update to Prodigy, known as Prodigy 2, is also listed on Tachyum’s roadmap and will offer a new 3nm architecture with even more cores, higher memory bandwidth, PCIe 6.0 + CXL support, and enhanced connectivity. Sampling for that should begin by the second half of 2024.
