快速注册

实打实的代码，实打实的性能分析

xeeXas 2014-12-13 08:55:14

update2:
第一，发现a.c中当打开SANITY的时候一个小bug。虽然不影响测试结果，但还是说一下。
这个diff如下，大家可以自己手动改一下
diff --git a/a.c b/a.c
index be0a664..acecdfb 100644
--- a/a.c
+++ b/a.c
@@ -97,7 +97,7 @@ pop_flush_task(void *arg)
         struct msg *m = pop(q);
         if (m) {
             __sync_fetch_and_add(&pop_total, 1);
- msg_flags[(m - msgs)] = 1;
+ msg_flags[(m - msgs)] += 1;
         }
#else
         if (pop(q))
第二，因为云风提到了单线程的事情，所以我又跑了三个方法的单producer单consumer的测试。
新的测试结果加在日记后面了。
=====================================================
update:
听同学说pastebin被强了。所以把全部代码打包放这了：http://url.cn/YxG3ja
=====================================================
去了趟知乎，体验一个真理：要喷别人，自己得过硬，不然就得自己吃自己的屎。
所以刚去就吃了口自己的屎。还好，拼了一晚上，找回来了。
以下为全文转自知乎我自己的答案。
======================================================
ok 你彻底把我激怒了。。。当然，我也的确犯了一个又一个愚蠢错误。q3.h连编译都通不过。
之前我写的全是垃圾，我承认，所以干脆都删除了。
现在我得动真格的。正确性和性能都拼一下。三种方案一起上。
测试内容是：7个core跑push，7个core跑pop。总共7百万条消息，均分到7个push core。
每个thread都pin到不同core上。跑测试的机器是Xeon E5-2670 Sandybridge， 32核心。
测试总共用到15个core，并保证在同一个socket上。
测试结束条件是push完成了7百万条消息。当push thread达到目标后退出，计算每次push所需时间。同时强制退出所有pop thread，计算当前已经pop的消息数量和每条pop消耗的时间。
不允许push丢消息 -- 所有三种方案均满足此条件。
下面是全部的代码和测试结果。
a.c -- 测试用例
http://pastebin.com/mruCwEq6
=============================
q.h -- 云风的msg queue
http://pastebin.com/L9WxDgF8
pop total: 2,196,725
pop cycles/msg: 4610
push cycles/msg: 4447
push了7百万条消息，当push完成的时候，pop只完成了2百万条。
每次pop消耗4447 cycle，每次push消耗4610 cycle。
=============================
q3.h --- 我的msg queue
http://pastebin.com/D72PWT0z
pop total: 6,999,110
pop cycles/msg: 5625
push cycles/msg: 5307
push了7百万条消息，当push完成的时候，pop完成了非常接近7百万条。
每次pop消耗5625 cycle，每次push消耗5307 cycle。
=============================
qlock.h -- 朱欣愚的msg queue
http://pastebin.com/4mEk55BH
???
没耐心等下去了。所以没有结果。
=============================
总结：云风的msg queue的push效率正常，但pop效率极低。
我的msg queue的push和pop效率正常，至少是云风的msg queue的3倍。
朱欣愚的msg queue存在严重问题，无法通过基本测试。
下面我们修改一下结束条件。不仅push必须完成，pop也必须完成所有7百万消息。
修改过的测试代码： a2.c http://pastebin.com/JHhydtTh
在此条件下，q3.h测试结果：
[root@localhost ~]$ perf stat -d ./a.out
flushing: 1012
pop total: 7000000
pop cycles/msg: 5363
push cycles/msg: 5178
Performance counter stats for './a.out':
28358.315905 task-clock # 13.697 CPUs utilized
2950 context-switches # 0.104 K/sec
1 CPU-migrations # 0.000 K/sec
194 page-faults # 0.007 K/sec
73021185526 cycles # 2.575 GHz [39.96%]
69599721520 stalled-cycles-frontend # 95.31% frontend cycles idle [39.93%]
66141989649 stalled-cycles-backend # 90.58% backend cycles idle [39.98%]
2871667693 instructions # 0.04 insns per cycle
# 24.24 stalled cycles per insn [50.01%]
891092152 branches # 31.423 M/sec [50.02%]
36868384 branch-misses # 4.14% of all branches [50.10%]
795521720 L1-dcache-loads # 28.053 M/sec [50.09%]
183223900 L1-dcache-load-misses # 23.03% of all L1-dcache hits [50.10%]
121578500 LLC-loads # 4.287 M/sec [40.04%]
2711774 LLC-load-misses # 2.23% of all LL-cache hits [40.00%]
2.070478646 seconds time elapsed
两秒完成7百万push和pop。
===============================
q.h和qlock.h均没有测试结果，因为几分钟后仍然没有结束。
于是把7百万消息改为7千，还是不能结束。
有硬件的同学可以测试下q.h和qlock.h在总共70条消息下的表现，更有意思。
这两个测试显示，q3.h相比qlock.h和q.h的优势超过5个数量级。
===============================
为了给q.h和qlock.h有个benchmark结果，用ctrl-c打断他们，再看看perf的数据：
q.h 的结果是：
[root@localhost ~]$ perf stat -d ./a.out
flushing: 4479796
^C./a.out: Interrupt
Performance counter stats for './a.out':
     229446.571945 task-clock # 7.359 CPUs utilized
             23651 context-switches # 0.103 K/sec
                 3 CPU-migrations # 0.000 K/sec
              1089 page-faults # 0.005 K/sec
      593208091316 cycles # 2.585 GHz [39.98%]
      560023462512 stalled-cycles-frontend # 94.41% frontend cycles idle [39.97%]
      536361645156 stalled-cycles-backend # 90.42% backend cycles idle [39.98%]
       13796086610 instructions # 0.02 insns per cycle
                                             # 40.59 stalled cycles per insn [50.00%]
        3685125949 branches # 16.061 M/sec [49.99%]
          19225467 branch-misses # 0.52% of all branches [50.04%]
        4922001309 L1-dcache-loads # 21.452 M/sec [50.02%]
        1148076158 L1-dcache-load-misses # 23.33% of all L1-dcache hits [50.03%]
         522524359 LLC-loads # 2.277 M/sec [40.01%]
           1629746 LLC-load-misses # 0.31% of all LL-cache hits [40.01%]
      31.180113989 seconds time elapsed
再看看qlock.h的表现：
[root@localhost~]$ perf stat -d ./a.out
^C./a.out: Interrupt
Performance counter stats for './a.out':
     799360.463478 task-clock # 10.316 CPUs utilized
             81762 context-switches # 0.102 K/sec
                 3 CPU-migrations # 0.000 K/sec
               632 page-faults # 0.001 K/sec
     2061104851992 cycles # 2.578 GHz [39.98%]
     2038536012530 stalled-cycles-frontend # 98.91% frontend cycles idle [39.97%]
     2006513262888 stalled-cycles-backend # 97.35% backend cycles idle [39.96%]
       15013468072 instructions # 0.01 insns per cycle
                                             # 135.78 stalled cycles per insn [49.97%]
        3428746157 branches # 4.289 M/sec [49.99%]
         109570783 branch-misses # 3.20% of all branches [50.01%]
        3299557362 L1-dcache-loads # 4.128 M/sec [50.02%]
        2583057883 L1-dcache-load-misses # 78.28% of all L1-dcache hits [50.04%]
        1040850469 LLC-loads # 1.302 M/sec [40.02%]
          15998971 LLC-load-misses # 1.54% of all LL-cache hits [40.00%]
      77.485303626 seconds time elapsed
最明显的数据对比是L1 d$ miss。
q3.h: 183223900
q.h: 1148076158
qlock.h: 2583057883
然后是stalled cycles per insn:
q3.h: 24.24
q.h: 40.59
qlock.h: 135.78
而LLC miss的差距也很明显：
q3.h: 2711774
q.h: 1629746
qlock.h: 15998971
注意，q.h的LLC miss是最少的，stall cycle跟q3.h也相差不大，但q.h却始终无法在1分钟里完成7百万pop，q3.h却能在2秒钟里完成。这说明q.h的问题极有可能是livelock。
而qlock.h的表现无疑是最糟糕的。3项数据都比其它方案糟一个数量级，更不用说整体运行速度。
========================================
现在再改一下测试用例，使其更接近真实环境。在真实环境里，每pop出一条msg，都会有相应对msg的操作。现在在pop_task里，每取出一个msg就等待2000 cycles，来模拟pop_task的延迟。
这个delay的macro很简单，如下：
#define delay(c) do { \
    if ((c) == 0) break; \
    uint64_t start = rdtsc(); \
    uint64_t now = start; \
    while (now - start < (c)) { \
        _mm_pause(); \
        now = rdtsc(); \
    } \
} while (0)
引进delay(2000)之后，q3.h的表现没有显著变化，这次用了2.1秒。stall反而有更好的表现，从24降到了14。
q3.h
[root@localhost ~]$ perf stat -d ./a.out
flushing: 1014
pop total: 7000000
pop cycles/msg: 5521
push cycles/msg: 5369
Performance counter stats for './a.out':
      29294.731898 task-clock # 13.745 CPUs utilized
              3038 context-switches # 0.104 K/sec
                 1 CPU-migrations # 0.000 K/sec
               195 page-faults # 0.007 K/sec
       75677575418 cycles # 2.583 GHz [40.00%]
       67755864040 stalled-cycles-frontend # 89.53% frontend cycles idle [39.99%]
       62117972064 stalled-cycles-backend # 82.08% backend cycles idle [40.00%]
        4796302585 instructions # 0.06 insns per cycle
                                             # 14.13 stalled cycles per insn [50.01%]
         979789697 branches # 33.446 M/sec [50.00%]
          38429726 branch-misses # 3.92% of all branches [50.03%]
         425295288 L1-dcache-loads # 14.518 M/sec [50.06%]
         174605736 L1-dcache-load-misses # 41.06% of all L1-dcache hits [50.06%]
         114066256 LLC-loads # 3.894 M/sec [40.05%]
           1875774 LLC-load-misses # 1.64% of all LL-cache hits [40.03%]
       2.131322208 seconds time elapsed
而q.h和qlock.h同样没有显著变化，仍然是对7百万条，甚至70条信息无法结束任务。这里就不贴perf的结果了。
========================================
最后，是sanity test。我知道这个q3.h很多人看不懂。不过没关系，这个测试的目的是数据一致性。每个push的msg和pop的msg将会一一对应，如果全部7百万消息，有一个没收到，或者有一个被接收了超过一次，这个测试都将发现错误。
测试代码在这：http://pastebin.com/1M0S0YiJ
当然，q3.h通过了测试。而其它两个不清楚是否能通过这个sanity test，因为其它两个不管等多久也跑不完。
=========================================
在正确性得到保证，且符合所有需求的条件下，q3.h的速度是其它两个方案的远远不能比的。
知乎的代码编辑太难用，所以全部代码都贴在http://pastebin.com上。我也不清楚国内是不是墙了pastebin。如果是的话，请告知，我找过一个地方贴。
所有代码编译参数为-g -O2 -pthread。如果你没有32core的机器，也可以减少POP_CNT和PUSH_CNT，让它们能在你的机器上跑。当然，相应的你必须懂怎么修改socket_top，这关系到你的CPU的topology。
由于测试的多样性，建议采用a3.c为测试用例，因为q3.h即使是在加入sanity test的情况下，速度依旧没有改变。
==========================================
对我的代码风格有意见？为什么定义两个一样的internal struct？为什么一定要用static inline？
为什么mask是一样的值，却要定义在两个地方？是不是还要问为什么要asm volatile和padding？
如果你有这些疑问，不好意思，你现在不懂，以后自然也不会懂。
==========================================
下面是单producer单consumer的测试，push和pop各占一个core，消息总数量为1百万。
这次，q.h和qlock.h都飞快的完成了任务，所以perf的结果就比之前更有意义。下面看一下：
q3.h
-------
perf stat ./a.out
flushing: 397
pop total: 1000000
pop cycles/msg: 313
push cycles/msg: 313
Performance counter stats for './a.out':
        241.568726 task-clock # 1.967 CPUs utilized
                31 context-switches # 0.128 K/sec
                 0 CPU-migrations # 0.000 K/sec
               163 page-faults # 0.675 K/sec
         574367767 cycles # 2.378 GHz [82.91%]
         470717717 stalled-cycles-frontend # 81.95% frontend cycles idle [83.79%]
         420555239 stalled-cycles-backend # 73.22% backend cycles idle [67.00%]
          47221990 instructions # 0.08 insns per cycle
                                             # 9.97 stalled cycles per insn [83.59%]
           8793543 branches # 36.402 M/sec [83.38%]
             32125 branch-misses # 0.37% of all branches [83.59%]
       0.122797009 seconds time elapsed
==========================================
q.h
-----
perf stat -d ./a.out
flushing: 443450
pop total: 1000000
pop cycles/msg: 510
push cycles/msg: 438
Performance counter stats for './a.out':
        365.382189 task-clock # 1.837 CPUs utilized
                45 context-switches # 0.123 K/sec
                 0 CPU-migrations # 0.000 K/sec
               414 page-faults # 0.001 M/sec
         904288538 cycles # 2.475 GHz [39.84%]
         754521614 stalled-cycles-frontend # 83.44% frontend cycles idle [39.38%]
         672028182 stalled-cycles-backend # 74.32% backend cycles idle [39.68%]
          55959596 instructions # 0.06 insns per cycle
                                             # 13.48 stalled cycles per insn [49.85%]
          11774581 branches # 32.225 M/sec [49.90%]
            349833 branch-misses # 2.97% of all branches [51.00%]
          20296480 L1-dcache-loads # 55.549 M/sec [51.96%]
           8785886 L1-dcache-load-misses # 43.29% of all L1-dcache hits [51.58%]
           3674897 LLC-loads # 10.058 M/sec [40.80%]
              2589 LLC-load-misses # 0.07% of all LL-cache hits [40.40%]
       0.198946809 seconds time elapsed
===============================================
qlock.h
----------
perf stat -d ./a.out
flushing: 931992
pop total: 1000000
pop cycles/msg: 396
push cycles/msg: 330
Performance counter stats for './a.out':
        280.101692 task-clock # 1.806 CPUs utilized
                35 context-switches # 0.125 K/sec
                 0 CPU-migrations # 0.000 K/sec
               580 page-faults # 0.002 M/sec
         688274111 cycles # 2.457 GHz [38.96%]
         584988396 stalled-cycles-frontend # 84.99% frontend cycles idle [39.48%]
         520073758 stalled-cycles-backend # 75.56% backend cycles idle [39.84%]
          51100072 instructions # 0.07 insns per cycle
                                             # 11.45 stalled cycles per insn [50.19%]
          11161825 branches # 39.849 M/sec [50.94%]
             20756 branch-misses # 0.19% of all branches [52.47%]
          12643419 L1-dcache-loads # 45.139 M/sec [52.00%]
           5130755 L1-dcache-load-misses # 40.58% of all L1-dcache hits [51.28%]
           2406577 LLC-loads # 8.592 M/sec [40.34%]
              4564 LLC-load-misses # 0.19% of all LL-cache hits [39.79%]
       0.155101843 seconds time elapsed
=====================================
首先，3个测试用的时间差没有之前那么夸张了，但依旧是q3.h最快。q3.h比最慢的q.h快了约40%。
其次，注意到q.h和qlock.h延续了之前的问题：当push完成的时候，大量的消息还没有被pop出来。当push结束时，q3.h 还有397条消息在队列中，q.h还有443450条消息在队列中，qlock.h还有931992条消息没有被pop。这说明即使在没有consumer竞争的情况下，q.h和qlock.h还是存在严重问题。
然后看看具体的profiling数据。这次，三者的差距比多线程的测试更明显。
L1 d$ miss
q3.h: 0
q.h: 43.29%
qlock.h: 40.58%
即使在单线程的测试里，都有接近一半的L1 miss，这是非常惊人的损失。这直接说明q.h和qlock.h中，producer和consumer的竞争非常激烈。而相比之下，q3.h则完全没有。
除此之外，q.h和qlock.h还存在LLC miss，而q3.h则完全没有。
memory system是所有msg queue的无法避免的热点，参与的core越多，这个热点就越热。所以一个msg queue的终极设计目标就是减少内存共享--但永远不可能没有内存共享。
很明显，q.h和qlock.h没有这个基本理念，即使是单线程测试，还是在producer和consumer之间制造了大量麻烦。

回应转发赞收藏

xeeXas (新疆喀什)

INTERRUPTED BY FIREWORKS 1 did not like it 2 it was ok 3 liked it 4...

实打实的代码，实打实的性能分析

热门话题 · · · · · · ( 去话题广场 )