[Repost] Examining Centaur CHA's (VIA Centaur CNS) Physical Implementation and Dual Socket Testing
Examining Centaur CHA’s Die and Implementation Goals
April 30, 2022 | clamchowder
In our last article, we examined Centaur’s CNS architecture. Centaur had a long history as a third x86 CPU designer, and CNS was the last CPU architecture Centaur had in the works before they were bought by Intel. In this article, we’ll take a look at Centaur CHA’s physical implementation. CHA is Centaur’s name for their system-on-chip (SoC) that targets edge server inference workloads by integrating a variety of blocks. These include:
- Eight x86-compatible CNS cores running at up to 2.5 GHz
- NCore, a machine learning accelerator, also running at 2.5 GHz
- 16 MB of L3 cache
- Quad channel DDR4 controller
- 44 PCIe lanes, and IO links for dual socket support
We’re going to examine how CHA allocates die area to various functions. From there, we’ll discuss how Centaur’s design goals influenced their x86 core implementation.
Compared to Haswell-E
Centaur says CNS has similar IPC to Haswell, so we’ll start with that obvious comparison. CHA is fabricated on TSMC’s 16nm FinFET process node, with a 194 mm² die size. Haswell-E was fabricated on Intel’s 22nm FinFET process node, and the 8 core die is 355 mm². So by server and high end desktop standards, the CHA die is incredibly compact.

CHA ends up being just slightly larger than half of Haswell-E, even though both chips are broadly comparable in terms of core count and IO capabilities. Both chips have eight cores fed by a quad channel DDR4 controller, and support dual socket configurations. Differences on the CPU and IO side are minor. CHA has slightly more PCIe lanes, with 44 compared to Haswell-E’s 40. And Haswell has 20 MB of L3 cache, compared to 16 MB on CHA. Now, let’s break down die area distribution a bit. The Haswell cores themselves occupy just under a third of the die. Add the L3 slices and ring stops, and we’ve accounted for about 50% of die area. The rest of the chip is used to implement IO.

CHA also has eight cores and similar amounts of IO. Like with Haswell-E, IO takes up about half the die. But the eight CNS cores and their L3 cache occupy a third of the die, compared to half on Haswell-E. That leaves enough die area to fit Centaur’s NCore. This machine learning accelerator takes about as much area as the eight CNS cores, so it was definitely a high priority for CHA’s design.

Centaur and Intel clearly have different goals. Haswell-E serves as a high end desktop chip, where it runs well above 3 GHz to achieve higher CPU performance. That higher performance especially applies in applications that don’t have lots of parallelism. To make that happen, Intel dedicates a much larger percentage of Haswell-E’s die area towards CPU performance. Haswell cores use that area to implement large, high clocking circuits. They’re therefore much larger, even after a process node shrink to Intel’s 14nm process (in the form of Broadwell).

In contrast, Centaur aims to make their cores as small as possible. That maximizes compute density on a small die, and makes room for NCore. Unlike Intel, CNS doesn’t try to cover a wide range of bases. Centaur is only targeting edge servers. In those environments, high core counts and limited power budgets usually force CPUs to run at conservative clock speeds anyway. It’s no surprise that clock speed takes a back seat to high density. The result is that CNS achieves roughly Haswell-like IPC in a much smaller core, albeit one that can’t scale to higher performance targets. We couldn’t get a Haswell based i7-4770 to perfectly hit 2.2 GHz, but you get the idea:

7-Zip has a lot of branches, and a lot of them seem to give branch predictors a hard time. In our previous article, we noted that Haswell’s branch predictor could track longer history lengths, and was faster with a lot of branches in play. That’s likely responsible for Haswell’s advantage in 7-Zip. Unfortunately, we don’t have documented performance counters for CNS, and can’t validate that hypothesis. Video encoding is a bit different. Branch performance takes a back seat to vector throughput. Our test uses libx264, which can use AVX-512. AVX-512 instructions account for about 14.67% of executed instructions, provided the CPU supports them. CNS roughly matches Haswell’s performance per clock in this scenario. But again, Haswell is miles ahead when the metaphorical handbrake is off:

CNS is most competitive when specialized AVX-512 instructions come into play. Y-Cruncher is an example of this, with 23.29% of executed instructions belonging to the AVX-512 ISA extension. VPMADD52LUQ (12.76%) and VPADDQ (9.06%) are the most common AVX-512 instructions. The former has no direct AVX2 equivalent as far as I know.
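For readers who haven’t run into it, here’s a minimal sketch of what VPMADD52LUQ does, expressed with the AVX-512 IFMA intrinsic it maps to. The function name, array layout, and loop are our own illustration rather than anything taken from Y-Cruncher:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* acc[i] += low 52 bits of (a[i] * b[i]), eight 64-bit lanes per iteration.
   Requires AVX-512F and AVX-512IFMA (e.g. gcc -O2 -mavx512f -mavx512ifma). */
void madd52_accumulate(uint64_t *acc, const uint64_t *a, const uint64_t *b, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512i va   = _mm512_loadu_si512(a + i);
        __m512i vb   = _mm512_loadu_si512(b + i);
        __m512i vacc = _mm512_loadu_si512(acc + i);
        /* VPMADD52LUQ: multiply the low 52 bits of each lane of va and vb,
           then add the low 52 bits of the product to the accumulator lane */
        vacc = _mm512_madd52lo_epu64(vacc, va, vb);
        _mm512_storeu_si512(acc + i, vacc);
    }
}
```

The 52-bit width is no accident: it matches the double precision mantissa, letting the instruction reuse the FP multiplier hardware. That’s why arbitrary-precision code like Y-Cruncher leans on it so heavily, and why AVX2 has no direct equivalent.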

Even though its vector units are not significantly more powerful than Haswell’s, CNS knocks this test out of the park. At similar clock speeds, Haswell gets left in the dust. It still manages to tie CNS at stock, showing the difference high clock speeds can make. We also can’t ignore Haswell’s SMT capability, which helps make up some of its density disadvantage.
Compared to Zeppelin and Coffee Lake
Perhaps comparisons with client dies make more sense, since they’re closer in area to CHA. AMD’s Zeppelin (Zen 1) and Intel’s Coffee Lake also contain eight cores with 16 MB of L3. Both are especially interesting because they’re implemented on process nodes that are vaguely similar to CHA’s.

Let’s start with Zeppelin. This AMD die implements eight Zen 1 cores in two clusters of four, each with 8 MB of L3. It’s about 9% larger than CHA, but has half as many DDR4 channels. Zeppelin has fewer PCIe lanes as well, with 32 against CHA’s 44.

L3 area is in the same ballpark on both Zeppelin and CHA, suggesting that TSMC 16nm and GlobalFoundries 14nm aren’t too far apart when high density SRAM is in play. Core area allocation falls between CHA’s and Haswell-E’s, although percentage-wise, it’s closer to the latter.

A lot of space on Zeppelin doesn’t belong to cores, cache, DDR4, or PCIe. That’s because Zeppelin is a building block for everything. Desktops are served with a single Zeppelin die that provides up to eight cores. Workstations and servers get multiple dies linked together to give more cores, more memory controller channels, and more PCIe. In the largest configuration, a single Epyc server chip uses four Zeppelins to give 32 cores, eight DDR4 channels, and 128 PCIe lanes. Zeppelin’s flexibility lets AMD streamline production. But nothing in life is free, and die space is used to make that happen. For client applications, Zeppelin tries to bring a lot of traditional chipset functionality onto its die. It packs a quad port USB 3.1 controller. That takes space. Some PCIe lanes can operate in SATA mode to support multi-mode M.2 slots, which means extra SATA controller logic. Multi-die setups are enabled by cross-die links (called Infinity Fabric On Package, or IFOP), which cost area as well.

AMD’s Zen 1 core was probably designed with that in mind. It can’t target low clocks because it has to reach desktop performance levels, so density is achieved by sacrificing vector performance instead. Zen 1 uses 128-bit execution units and registers, with 256-bit AVX instructions split into two micro-ops.

Then, Zen 1 took aim at modest IPC goals, and gave up on extremely high clock speeds. That helps improve their compute density, and lets them fit eight cores in a die that has all kinds of other connectivity packed into it. Like Haswell, Zen uses SMT to boost multithreaded performance with very little die area overhead.

In its server configuration, Zen 1 can’t boost as high as it can on desktops. But it still has a significant clock speed advantage over CNS on low threaded parts of workloads. Like Haswell, Zen 1 delivers better performance per clock than CNS with 7-Zip compression.

With video encoding, CNS again falls behind. Part of that is down to clock speed. Zen 1’s architecture is also superior in some ways. AMD has a large non-scheduling queue for vector operations, which can prevent renamer stalls when lots of vector operations are in play. In this case, that extra “fake” scheduling capacity seems to be more valuable than CNS’s higher execution throughput.

Finally, CNS leaves Zen 1 in the dust if an application is very heavy on vector compute and can take advantage of AVX-512. In Y-Cruncher, Zen 1 is no match for CNS.
Intel’s Coffee Lake
Coffee Lake’s die area distribution looks very different. Raw CPU performance is front and center, with CPU cores and L3 cache taking more than half the die.

About a quarter of Coffee Lake’s die goes to an integrated GPU. In both absolute area and percentage terms, Coffee Lake’s iGPU takes up more space than Centaur’s NCore, showing the importance Intel places on integrated graphics. After the cores, cache, and iGPU are accounted for, there’s not much die area left. Most of it is used to implement modest IO capabilities. That makes Coffee Lake very area efficient. It packs more CPU power into a smaller die than CHA or Zeppelin, thanks to its narrow focus on client applications. Like Haswell and Zen, the Skylake core used in Coffee Lake looks designed to reach very high clocks, at the expense of density. It’s much larger than CNS and Zen 1, and clocks a lot higher too.

Intel also sacrifices some area to make Skylake “AVX-512 ready”, for lack of a better term. This lets the same basic core design go into everything from low power ultrabooks to servers and supercomputers, saving design effort.

Perhaps it’s good to talk about AVX-512 for a bit, since that’s a clear goal in both Skylake and CNS’s designs.
AVX-512 Implementation Choices
CNS’s AVX-512 implementation looks focused on minimizing area overhead, rather than maximizing performance gain when AVX-512 instructions are used. Intel has done the opposite with Skylake-X. To summarize some AVX-512 design choices:
To me, CNS is very interesting because it shows how decent AVX-512 support can be implemented with as little cost as possible. Furthermore, CNS demonstrates that reasonably powerful vector units can be implemented in a core that takes up relatively little area. To be clear, this kind of strategy may not offer the best gains from AVX-512:

But it can still offer quite a bit of performance improvement in certain situations, as the benchmark above shows.
Thoughts about Centaur’s Design
Centaur aimed to create a low cost SoC that combines powerful inference capabilities with CPU performance appropriate for a server. Low cost means keeping die area down. But Centaur’s ML goals mean dedicating significant die area toward its NCore accelerator. Servers need plenty of PCIe lanes for high bandwidth network adapters, and that takes area too. All of that dictates a density oriented CPU design, in order to deliver the required core count within the remaining space.

To achieve this goal, Centaur started by targeting low clocks. High clocking designs usually take more area, and that’s especially apparent in Intel’s designs. There are tons of reasons for this. To start, process libraries designed for high speeds occupy more area than ones designed for high density. For example, Samsung’s high performance SRAM bitcells occupy 0.032 µm² on their 7nm node, while their high density SRAM bitcells take 0.026 µm². Then, higher clock speeds often require longer pipelines, and buffers between pipeline stages take space. On top of that, engineers might use more circuitry to reduce the number of dependent transitions that have to complete within a cycle. One example is using carry lookahead adders instead of simpler carry propagate adders, where additional gates are used to compute whether a carry will be needed at certain positions, rather than waiting for a carry signal to propagate across all of a number’s binary digits.

Next, AVX-512 is supported with minimal area overhead. Centaur extended the renamer’s register alias table to handle AVX-512 mask registers. They tweaked the decoders to recognize the extra instructions. And they added some execution units for specialized instructions they cared about. I doubt any of those took much die area. CNS definitely doesn’t follow Intel’s approach of trying to get the most out of AVX-512, which involves throwing core area at wider execution units and registers.

Finally, Centaur set modest IPC goals. They aimed for Haswell-like performance per clock, at a time when Skylake was Intel’s top of the line core. That avoids bloating out-of-order execution buffers and queues to chase diminishing IPC gains. To illustrate this, let’s look at simulated reorder buffer (ROB) utilization in an instruction trace of a 7-Zip compression workload. A CPU’s ROB tracks all instructions until their results are made final, so its capacity represents an upper bound on the CPU’s out-of-order execution window. We modified ChampSim to track ROB occupancy every cycle, to get an idea of how often a certain number of ROB entries are utilized.

Most cycles fall into one of two categories: either fewer than 200 ROB entries are in use, or the ROB is basically full, indicating it’s not big enough to absorb latency. We can also graph the percentage of cycles for which ROB utilization stays below a certain point (below). Again, we see clearly diminishing returns from increased reordering capacity.
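To make that concrete, here’s a minimal sketch of the bookkeeping involved (our own illustration, not ChampSim’s actual code): every simulated cycle, bump a histogram bucket for the current ROB occupancy, then turn the histogram into a “percentage of cycles at or below N entries” curve afterwards. The 320-entry capacity and the synthetic occupancies in main() are placeholders:

```c
#include <stdio.h>

#define ROB_SIZE 320   /* placeholder capacity, not any specific core's ROB size */

static unsigned long long rob_hist[ROB_SIZE + 1];  /* cycles observed at each occupancy */
static unsigned long long total_cycles;

/* Call once per simulated cycle with the number of valid ROB entries */
void record_rob_occupancy(int occupancy) {
    if (occupancy < 0) occupancy = 0;
    if (occupancy > ROB_SIZE) occupancy = ROB_SIZE;
    rob_hist[occupancy]++;
    total_cycles++;
}

/* After the run: cumulative "percentage of cycles with occupancy <= N" */
void print_occupancy_cdf(void) {
    unsigned long long running = 0;
    for (int n = 0; n <= ROB_SIZE; n++) {
        running += rob_hist[n];
        if (n % 32 == 0 || n == ROB_SIZE)
            printf("<= %3d entries: %6.2f%% of cycles\n",
                   n, 100.0 * (double)running / (double)total_cycles);
    }
}

int main(void) {
    /* Synthetic occupancies, just to exercise the bookkeeping */
    for (int c = 0; c < 100000; c++)
        record_rob_occupancy((c * 7) % (ROB_SIZE + 1));
    print_occupancy_cdf();
    return 0;
}
```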

Increasing ROB size implies beefing up other out-of-order buffers too. The renamed register files, scheduler, and load/store queues will have to get bigger, or they’ll fill up before the ROB does. That means IPC increases require disproportionate area tradeoffs, and Centaur’s approach has understandably been conservative.

The resulting CNS architecture ends up being very dense and well suited to CHA’s target environment, at the expense of flexibility. Under a heavy AVX load, the chip draws about 65 W at 2.2 GHz, but power draw at 2.5 GHz goes up to 140 W. Such a sharp increase in power consumption is a sign that the architecture won’t clock much further. That makes CNS completely unsuitable for consumer applications, where single threaded performance is of paramount importance.

At a higher level, CHA also ends up being narrowly targeted. Even though CNS cores are small, each CHA socket only has eight of them. That’s because Centaur spends area on NCore and uses a very small die. CHA therefore falls behind in multi-threaded CPU throughput, and won’t be competitive in applications that don’t use NCore’s ML acceleration capabilities. Perhaps the biggest takeaway from CNS is that it’s possible to implement powerful vector units in a physically small core, as long as clock speed and IPC targets are set with density in mind.
Surface-level look at core width/reordering capacity, vector execution, and clock speeds

Centaur did all this with a team of around 100 people, on a last-generation process node. That means a density oriented design is quite achievable with limited resources. I wonder what AMD or Intel could do with their larger engineering teams and access to cutting edge process nodes, if they prioritized density.

What Could Have Been?
Instead of drawing a conclusion, let’s try something a little different. I’m going to speculate and daydream about how else CNS could have been employed.

Pre-2020 – More Cores?
CHA tries to be a server chip with the die area of a client CPU. Then it tries to stuff a machine learning accelerator on the same die. That’s crazy. CHA wound up with fewer cores than Intel or AMD server CPUs, even though CNS cores prioritized density. What if Centaur didn’t try to create a server processor with the area of a client die? If we quadruple CHA’s 8-core CPU complex along with its cache, that would require about 246 mm² of die area. Of course, IO and connectivity would require space as well. But it’s quite possible for a 32 core CNS chip to be implemented using a much smaller die than, say, a 28-core Skylake-X chip.
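As a rough cross-check (assuming the ~246 mm² figure comes from quadrupling the measured cores-plus-L3 area), the implied per-complex area lines up with the earlier observation that the CNS cores and L3 take roughly a third of the 194 mm² die:

\[
\frac{246\ \text{mm}^2}{4} \approx 61.5\ \text{mm}^2 \approx \frac{194\ \text{mm}^2}{3} \approx 64.7\ \text{mm}^2
\]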

A fully enabled 28 core Skylake-X chip would likely outperform a hypothetical 32 core CNS one. But Intel uses 677.98 mm² to do that. A few years ago, Intel also suffered from a shortage of production capacity on their 14nm process. All of that pushes up prices for Skylake-X. That gives Centaur an opportunity to undercut Intel. Using a much smaller die on a previous generation TSMC node should make for a cheap chip. Before 2019, AMD could also offer 32 cores per socket with their top end Zen 1 Epyc SKUs. Against that, CNS would compete by offering AVX-512 support and better vector performance.

Post 2020 – A Die Shrink?
But AMD’s 2019 Zen 2 launch changes things. CNS’s ability to execute AVX-512 instructions definitely gives it an advantage. But Zen 2 has 256-bit execution units and registers just like CNS. More importantly, Zen 2 has a process node advantage. AMD uses that to implement a more capable out-of-order execution engine, more cache, and faster caches. That puts CNS in a rough position.

Core for core, CNS stands no chance. Even if we had CNS running at 2.5 GHz, its performance simply isn’t comparable to Zen 2. It gets worse in file compression:

Even in Y-Cruncher, which is a best case for CNS, Zen 2’s higher clocks and SMT support let it pull ahead.

Worse, TSMC’s 7nm process lets AMD pack 64 of those cores into one socket. In my view, CNS has to be ported to a 7nm class process to have any chance after 2020. It’s hard to guess what that would look like, but I suspect a CNS successor on 7nm would fill a nice niche. It’d be the smallest CPU core with AVX-512 support. Maybe it could be an Ampere Altra competitor with better vector performance. Under Intel, CNS could bring AVX-512 support to E-Cores in Intel’s hybrid designs. That would fix Alder Lake’s mismatched ISA awkwardness. Compared to Gracemont, CNS probably wouldn’t perform as well in integer applications, but vector performance would be much better. And CNS’s small core area would be consistent with Gracemont’s area efficiency goal. Perhaps a future E-Core could combine Gracemont’s integer prowess with CNS’s vector units and AVX-512 implementation.
Centaur CHA’s Probably Unfinished Dual Socket Implementation
April 23, 2022 | clamchowder
Centaur’s CHA chip targets the server market with a low core count. Its dual socket capability is therefore quite important, because it’d allow up to 16 cores in a single CHA-based server. Unfortunately for Centaur, modern dual socket implementations are quite complicated. CPUs today use memory controllers integrated into the CPU chip itself, meaning that each CPU socket has its own pool of memory. If CPU cores on one socket want to access memory connected to another socket, they’ll have to go through a cross-socket link. That creates a setup with non-uniform memory access (NUMA). Crossing sockets will always increase latency and reduce bandwidth, but a good NUMA implementation will minimize those penalties.

Cross-Socket Latency
Here, we’ll test how much latency the cross-socket link adds by allocating memory on different nodes, and using cores on different nodes to access that memory. This is basically our latency test being run only at the 1 GB test size, because that size is large enough to spill out of any caches. And we’re using 2 MB pages to avoid TLB miss penalties. That’s not realistic for most consumer applications, which use 4 KB pages, but we’re trying to isolate NUMA-related latency penalties instead of showing memory latency that applications will see in practice.
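For reference, here’s a minimal sketch of this style of test, using libnuma to pin the thread and place the buffer. Our harness differs in the details; the stride, iteration count, and chain-building below are simplified placeholders, and explicit 2 MB page setup is omitted:

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (1ULL << 30)   /* 1 GB, large enough to spill out of any caches */
#define STRIDE    4096           /* placeholder spacing between chain elements */
#define ITERS     (20 * 1000 * 1000)

int main(int argc, char **argv) {
    int cpu_node = argc > 1 ? atoi(argv[1]) : 0;   /* node running the test */
    int mem_node = argc > 2 ? atoi(argv[2]) : 0;   /* node holding the memory */

    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(cpu_node);                     /* pin ourselves to one node */

    /* Allocate on the target node. Backing the buffer with 2 MB pages (e.g. via
       madvise(MADV_HUGEPAGE)) is left out for brevity. */
    char *buf = numa_alloc_onnode(BUF_BYTES, mem_node);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* Build a pointer chain. A real test shuffles the order so hardware
       prefetchers can't follow it; this version is just a skeleton. */
    size_t slots = BUF_BYTES / STRIDE;
    for (size_t i = 0; i < slots; i++) {
        size_t next = (i * 1234567 + 89) % slots;   /* crude pseudo-random hop */
        *(void **)(buf + i * STRIDE) = buf + next * STRIDE;
    }

    void **p = (void **)buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) p = (void **)*p;   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (result %p)\n", ns / ITERS, (void *)p);
    numa_free(buf, BUF_BYTES);
    return 0;
}
```

Running it with different node arguments, or under something like `numactl --cpunodebind=0 --membind=1`, produces the local and remote latencies being compared here.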

Crossing sockets adds about 92 ns of additional latency, meaning that memory on a different socket takes almost twice as long to access. For comparison, Intel suffers less of a penalty.

On a dual socket Broadwell system, crossing sockets adds 42 ns of latency with the early snoop setting. Accessing remote memory takes 41.7% longer than hitting memory directly attached to the CPU. Compared to CNS, Intel is a mile ahead, partially because that early snoop mode is optimized for low latency. The other part is that Intel has plenty of experience working on multi-socket capable chips. If we go back over a decade to Intel’s Westmere based Xeon X5650, memory access latency on the same node is 70.3 ns, while remote memory access is 121.1 ns. The latency delta there is just above 50 ns. It’s worse than Broadwell, but still significantly better than Centaur’s showing.

Broadwell also supports a cluster-on-die setting, which creates two NUMA nodes per socket. In this mode, a NUMA node covers a single ring bus, connected to a dual channel DDR4 memory controller. This slightly reduces local memory access latency. But Intel has a much harder time with four pools of memory in play. Crossing sockets now takes almost as long as it does on CHA. Looking closer, we can see that memory latency jumps by nearly 70 ns when accessing “remote” memory connected to the same die. That’s bigger than the cross-socket latency delta, and suggests that Intel takes a lot longer to figure out where to send a memory request if it has three remote nodes to pick from.

Popping over to AMD, we have results from when we tested a Milan-X system on Azure. Like Broadwell’s cluster on die mode, AMD’s NPS2 mode creates two NUMA nodes within a socket. However, AMD seems to have very fast directories for figuring out which node is responsible for a memory address. Going from one half of a socket to another only adds 14.33 ns. The cross socket connection on Milan-X adds around 70-80 ns of latency, depending on which half of the remote socket you’re accessing.

To summarize, Centaur’s cross-node latency is mediocre. It’s worse than what we see from Intel or AMD, unless the Intel Broadwell system is juggling four NUMA nodes. But it’s not terrible for a company that has no experience in multi-socket designs.

Cross-Socket Bandwidth
Next, we’ll test bandwidth. Like with the latency test, we’re running our bandwidth test with different combinations of where memory is allocated and what CPU cores are used. The test size here is 3 GB, because that’s the largest size we have hardcoded into our bandwidth test. Size doesn’t really matter as long as it’s big enough to get out of caches. To keep things simple, we’re only testing read bandwidth.
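The bandwidth kernel is conceptually simple. A minimal single-threaded sketch is below; the real test uses wide vector loads and one thread per core in the measuring node, and would place the buffer with numa_alloc_onnode as in the latency sketch, so treat this as illustrative only:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read-bandwidth kernel: stream through a buffer and fold everything into a sum
   so the compiler can't drop the loads. */
double read_gb_per_s(const uint64_t *buf, size_t words, int passes) {
    volatile uint64_t sink = 0;
    uint64_t sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i++)
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum; (void)sink;
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)words * sizeof(uint64_t) * passes / secs / 1e9;
}

int main(void) {
    size_t bytes = 3ULL << 30;                  /* 3 GB, matching the test above */
    uint64_t *buf = malloc(bytes);
    if (!buf) return 1;
    for (size_t i = 0; i < bytes / 8; i++) buf[i] = i;   /* touch every page */
    printf("%.1f GB/s\n", read_gb_per_s(buf, bytes / 8, 4));
    free(buf);
    return 0;
}
```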

Centaur’s cross socket bandwidth is disastrously poor at just above 1.3 GB/s. When you can read faster from a good NVMe SSD, something is wrong. For comparison, Intel’s decade old Xeon X5650 can sustain 11.2 GB/s of cross socket bandwidth, even though its triple channel DDR3 setup only achieved 20.4 GB/s within a node. A newer design like Broadwell does even better.

With each socket represented by one NUMA node, Broadwell can get nearly 60 GB/s of read bandwidth from its four DDR4-2400 channels. Accessing that from a different socket drops bandwidth to 21.3 GB/s. That’s quite a jump over Westmere, showing the progress Intel has made over the past decade. If we switch Broadwell into cluster on die mode, each node of seven cores can still pull more cross-socket bandwidth than what Centaur can achieve. Curiously, Broadwell suffers a heavy penalty from crossing nodes within a die, with memory bandwidth cut approximately in half.

Finally, let’s have a look at AMD’s Milan-X:

Milan-X is a bandwidth monster compared to the other chips here. It has twice as many DDR4 channels as CHA and Broadwell, so high intra-node bandwidth comes as no surprise. Across nodes, AMD retains very good bandwidth when accessing the other half of the same socket. Across sockets, each NPS2 node can still pull over 40 GB/s, which isn’t far off CHA’s local memory bandwidth.

Core to Core Latency with Contested Atomics
Our last test evaluates cache coherency performance by using locked compare-and-exchange operations to modify data shared between two cores. Centaur does well here, with latencies around 90 to 130 ns when crossing sockets.
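Our test isn’t reproduced verbatim here, but the basic idea looks like this minimal sketch: two threads pinned to chosen cores take turns advancing a shared counter with compare-and-swap (a locked cmpxchg on x86), so every handoff forces the contested cache line to migrate between them. Core numbers and iteration counts are placeholders:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000

static _Atomic long shared_val = 0;   /* the contested cache line */

struct arg { int cpu; long parity; };

/* Each thread waits until the value matches its parity, then bumps it with a
   compare-and-swap, bouncing the line back and forth between the two cores. */
static void *bouncer(void *p) {
    struct arg *a = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (long i = a->parity; i < 2 * ITERS; i += 2) {
        long expected = i;
        while (!atomic_compare_exchange_weak(&shared_val, &expected, i + 1))
            expected = i;   /* lost the race or saw a stale value; retry */
    }
    return NULL;
}

int main(int argc, char **argv) {
    struct arg a0 = { argc > 1 ? atoi(argv[1]) : 0, 0 };
    struct arg a1 = { argc > 2 ? atoi(argv[2]) : 1, 1 };
    pthread_t t0, t1;
    struct timespec s, e;

    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t0, NULL, bouncer, &a0);
    pthread_create(&t1, NULL, bouncer, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);

    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("cores %d<->%d: %.1f ns per handoff\n", a0.cpu, a1.cpu, ns / (2.0 * ITERS));
    return 0;
}
```

Passing different pairs of logical CPU numbers on the command line covers the same-cluster, cross-cluster, and cross-socket cases.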

The core to core latency plot above is similar to Ampere Altra’s, where cache coherency operations on a cache line homed to a remote socket require a round trip over the cross-socket interconnect, even when the two cores communicating with each other are on the same chip. However, absolute latencies on CHA are far lower, thanks to CHA having far fewer cores and a less complex topology. Intel’s Westmere architecture from 2010 is able to do better than CHA when cache coherency goes across sockets. They’re able to handle cache coherency operations within a die (likely at the L3 level) even if the cache line is homed to a remote socket.

But this sort of excellent cross socket performance isn’t typical. Westmere likely benefits because all off-core requests go through a centralized global queue. Compared to the distributed approach used since Sandy Bridge, that approach suffers from higher latency and lower bandwidth for regular L3 accesses. But its simplicity and centralized nature likely enables excellent cross-socket cache coherency performance.

Broadwell’s cross-socket performance is similar to CHA’s. By looking at results from both cluster on die and early snoop modes, we can clearly see that the bulk of core to core cache coherence latency comes from how directory lookups are performed. If the transfer happens within a cluster on die node, coherency is handled via the inclusive L3 and its core valid bits. If the L3 is missed, the coherency mechanism is much slower. Intra-die, cross-node latencies are already over 100 ns. Crossing dies only adds another 10-20 ns. Early snoop mode shows that intra-die coherency can be quite fast. Latencies stay within the 50 ns range or under, even when rings are crossed. However, early snoop mode increases cross socket latency to about 140 ns, making it slightly worse than CNS’s.

We don’t have clean results from Milan-X because hypervisor core pinning on the cloud instance was a bit funky. But our results were roughly in line with Anandtech’s results on Epyc 7763. Intra-CCX latencies are very low. Cross-CCX latencies within a NPS2 node were in the 90 ns range. Crossing NPS2 nodes brought latencies to around 110 ns, and crossing sockets resulted in ~190 ns latency. Centaur’s cross socket performance in this category is therefore better than Epyc’s.

More generally, CHA puts in its best showing in this kind of test. It’s able to go toe to toe with AMD and Intel systems that smacked it around in our “clean” memory access tests. “Clean” here means that we don’t have multiple cores writing to the same cache line. Unfortunately for Centaur, we’ve seen exactly zero examples of applications that benefit from low core-to-core latency. Contested atomics just don’t seem to be very common in multithreaded code.

Final Words
Before CNS, Centaur focused on low power consumer CPUs with products like the VIA Nano. Some of that experience carries over into server designs. After all, low power consumption and small core size are common goals. But go beyond the CPU core, and servers are a different world. Servers require high core counts, lots of IO, and lots of memory bandwidth. They also need to support high memory capacity.

CHA delivers on some of those fronts. It can support hundreds of gigabytes of memory per socket. Its quad channel DDR4 memory controller and 44 PCIe lanes give adequate but not outstanding off-chip bandwidth. CHA is also the highest core count chip created by Centaur. But eight cores is a bit low for the server market today. Dual socket support could partially mitigate that.

Unfortunately, the dual socket work appears to be unfinished. CHA’s low cross socket bandwidth will cause serious problems, especially for NUMA-unaware workloads. It also sinks any possibility of using the system in interleaved mode, where accesses are striped across sockets to provide more bandwidth to NUMA-unaware applications at the expense of higher latency.

So what went wrong? Well, remember that cross socket accesses suffer extra latency. That’s common to all platforms. But achieving high bandwidth over a long latency connection requires being able to queue up a lot of outstanding requests. My guess is that Centaur implemented a queue in front of their cross socket link, but never got around to validating it. Centaur’s small staff and limited resources were probably busy covering all the new server-related technologies. What we’re seeing is probably a work in progress.
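A quick back-of-the-envelope calculation supports that theory. Sustained bandwidth over a link is roughly the number of outstanding requests times the cache line size, divided by the round-trip latency (Little’s law). Assuming a cross-socket load on CHA takes around 190 ns (local latency plus the ~92 ns penalty measured earlier, an assumption on our part), the observed 1.3 GB/s works out to only about four 64-byte lines in flight at a time:

\[
\text{lines in flight} \approx \frac{1.3\ \text{GB/s} \times 190\ \text{ns}}{64\ \text{B}} \approx 3.9
\]

Keeping a quad channel DDR4 setup busy over that latency would need dozens of lines in flight, which is why an undersized or unvalidated queue in front of the cross-socket link is a plausible culprit.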

Centaur has implemented the protocols and coherence directories necessary to make multiple sockets work. And they work with reasonably good latency. Unfortunately, the cross-socket work can’t be finished because Centaur doesn’t exist anymore, so we’ll never see CHA’s full potential in a dual socket setup. Special thanks to Brutus for setting the system up and running tests on it.