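Supermicro B200 Inference Performance Test

Summary

This evaluation uses the ShareGPT dataset with a maximum of 500 concurrent requests and 10,000 requests in total. vllm bench serves as the benchmark client, and the main metric of interest is throughput (tokens/s).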
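The setup described above might correspond to a client invocation along the following lines. This is a sketch only: the source does not show the exact command, so the model name, dataset file path, and server address are placeholders, and the flags are the ones `vllm bench serve` commonly accepts for a ShareGPT run.

```python
import shlex


def build_bench_command(model, dataset_path, base_url,
                        num_prompts=10_000, max_concurrency=500):
    """Assemble a `vllm bench serve` argv list for a ShareGPT run."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--base-url", base_url,
        "--dataset-name", "sharegpt",
        "--dataset-path", dataset_path,
        "--num-prompts", str(num_prompts),          # 10,000 requests in total
        "--max-concurrency", str(max_concurrency),  # 500 concurrent requests
    ]


cmd = build_bench_command(
    model="deepseek-ai/DeepSeek-R1-0528",                      # placeholder
    dataset_path="ShareGPT_V3_unfiltered_cleaned_split.json",  # placeholder
    base_url="http://127.0.0.1:8000",                          # placeholder
)
print(shlex.join(cmd))
```

Building the argument list in one place makes it easier to keep the request count and concurrency identical across the three frameworks being compared.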

DeepSeek-R1-0528 (TP8), output token throughput in tokens/s

Framework       B200      H200      B200/H200
TensorRT-LLM    3039.49   4139.24   0.734
SGLang          3613.17   3741.13   0.966
vLLM            3175.19   3941.19   0.806
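The B200/H200 column is simply the B200 throughput divided by the H200 throughput. A minimal sketch that recomputes it from the figures in the table (the dictionary layout and function name are illustrative):

```python
# Output token throughput (tok/s) from the table above: (B200, H200).
fp8_tp8 = {
    "TensorRT-LLM": (3039.49, 4139.24),
    "SGLang": (3613.17, 3741.13),
    "vLLM": (3175.19, 3941.19),
}


def b200_over_h200(results):
    """Return the B200/H200 throughput ratio per framework, to 3 decimals."""
    return {name: round(b200 / h200, 3) for name, (b200, h200) in results.items()}


ratios = b200_over_h200(fp8_tp8)
for name, ratio in ratios.items():
    print(f"{name:<14}{ratio:.3f}")
```

A ratio below 1.0 means the B200 run was slower than the H200 run for that framework.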

DeepSeek-R1-0528-FP4-v2 on B200, output token throughput in tokens/s

Framework       FP4 (TP4)   FP8 (TP8)   FP4*2 / FP8
TensorRT-LLM    3836.98     3039.49     2.525
SGLang          4442.75     3613.17     2.459
vLLM            3191.42     3175.19     2.010

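In terms of inference performance, ecosystem support for the B200 is much better than it was at the start of the year, but the frameworks still fail to fully exploit the B200's performance;

In single-node inference, the B200 is slightly slower than the H200 (73% to 97% of the H200 throughput, depending on the framework), and SGLang supports the B200 comparatively well. With FP4, each instance needs only half the GPUs, so the throughput gain for the whole machine is substantial (roughly 2.0x to 2.5x over FP8).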
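The "FP4*2 / FP8" column can be read as follows: an FP4 instance occupies 4 GPUs, so an 8-GPU node can host two of them, while an FP8 instance needs all 8. The sketch below recomputes that ratio from the reported throughput figures. It assumes two independent FP4 instances scale linearly, which is an assumption: the source does not report a run with two instances active at the same time.

```python
GPUS_PER_NODE = 8

# B200 output token throughput (tok/s): (FP4 at TP4, FP8 at TP8).
b200 = {
    "TensorRT-LLM": (3836.98, 3039.49),
    "SGLang": (4442.75, 3613.17),
    "vLLM": (3191.42, 3175.19),
}


def node_level_gain(fp4_tps, fp8_tps, fp4_gpus=4, fp8_gpus=8):
    """Whole-node FP4 throughput over whole-node FP8 throughput."""
    fp4_node = fp4_tps * (GPUS_PER_NODE // fp4_gpus)  # two TP4 instances
    fp8_node = fp8_tps * (GPUS_PER_NODE // fp8_gpus)  # one TP8 instance
    return fp4_node / fp8_node


gains = {name: node_level_gain(*pair) for name, pair in b200.items()}
for name, gain in gains.items():
    print(f"{name:<14}{gain:.3f}")
```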

Server Information

Quantity: 1

CPU: 2 x Intel Xeon 6960P, 72 cores each, 2.70 GHz

MEM: 3 TB

GPU: B200 DGX



$ lscpu
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               288
On-line CPU(s) list:                  0-287
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) 6960P
CPU family:                           6
Model:                                173
Thread(s) per core:                   2
Core(s) per socket:                   72
Socket(s):                            2
Stepping:                             1
CPU(s) scaling MHz:                   22%
CPU max MHz:                          3900.0000
CPU min MHz:                          800.0000
BogoMIPS:                             5400.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            6.8 MiB (144 instances)
L1i cache:                            9 MiB (144 instances)
L2 cache:                             288 MiB (144 instances)
L3 cache:                             864 MiB (2 instances)
NUMA node(s):                         6
NUMA node0 CPU(s):                    0-23,144-167
NUMA node1 CPU(s):                    24-47,168-191
NUMA node2 CPU(s):                    48-71,192-215
NUMA node3 CPU(s):                    72-95,216-239
NUMA node4 CPU(s):                    96-119,240-263
NUMA node5 CPU(s):                    120-143,264-287
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
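The NUMA layout in the `lscpu` output matters when pinning a serving process close to its GPUs. As a small illustration, the following sketch expands the CPU list strings printed above and checks them against the reported totals; the helper name `expand_cpu_list` is hypothetical.

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style list such as '0-23,144-167' into CPU ids."""
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A bare number such as '5' has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


# NUMA node -> CPU list, copied from the lscpu output above.
numa_nodes = {
    0: "0-23,144-167",
    1: "24-47,168-191",
    2: "48-71,192-215",
    3: "72-95,216-239",
    4: "96-119,240-263",
    5: "120-143,264-287",
}

cpus_per_node = {node: len(expand_cpu_list(spec)) for node, spec in numa_nodes.items()}
total_cpus = sum(cpus_per_node.values())
print(cpus_per_node, total_cpus)
```

The total should match sockets x cores per socket x threads per core from the same output (2 x 72 x 2).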

MEM



$ lsmem
RANGE                                 SIZE  STATE REMOVABLE  BLOCK
0x0000000000000000-0x000000007fffffff   2G online       yes      0
0x0000000100000000-0x000003007fffffff   3T online       yes 2-1536
 
Memory block size:         2G
Total online memory:       3T
Total offline memory:      0B
 
$ sudo dmidecode -t memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.7.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.
 
Handle 0x0014, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 12 TB
        Error Information Handle: Not Provided
        Number Of Devices: 48
 
Handle 0x0015, DMI type 17, 92 bytes
Memory Device
        Array Handle: 0x0014
        Error Information Handle: Not Provided
        Total Width: 80 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR5
        Type Detail: Synchronous Registered (Buffered)
        Speed: 6400 MT/s
        Manufacturer: Samsung
        Serial Number: 80CE01253101D68B4D
        Asset Tag: P1-DIMMA1_AssetTag25/31)
        Part Number: M321RAJA0MB2-CCPWC            
        Rank: 2
        Configured Memory Speed: 6400 MT/s
        Minimum Voltage: 1.1 V
        Maximum Voltage: 1.1 V
        Configured Voltage: 1.1 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: 0000 
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: 0xCE00
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 GB
        Cache Size: None
        Logical Size: None
...
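The `lsmem` and `dmidecode` figures can be cross-checked with a little arithmetic. The sketch below assumes every populated slot holds a 128 GB module like the one listed (the remaining `dmidecode` entries are elided above, so this is not confirmed) and that each module sits on its own memory channel; the bandwidth value is a theoretical peak, not a measurement.

```python
GIB_PER_TIB = 1024


def populated_dimms(total_tib, dimm_gib):
    """Number of equally sized DIMMs needed to reach the total capacity."""
    return total_tib * GIB_PER_TIB // dimm_gib


def peak_bandwidth_gbs(dimms, mt_per_s, data_width_bits=64):
    """Theoretical peak bandwidth in GB/s, one DIMM per channel assumed."""
    bytes_per_transfer = data_width_bits / 8
    return dimms * mt_per_s * 1e6 * bytes_per_transfer / 1e9


dimms = populated_dimms(total_tib=3, dimm_gib=128)  # from lsmem and dmidecode
slots = 48                                          # "Number Of Devices: 48"
bandwidth = peak_bandwidth_gbs(dimms, mt_per_s=6400)
print(f"{dimms} of {slots} slots populated, peak {bandwidth:.1f} GB/s")
```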

GPU



$ sudo lspci |grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
5f:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
70:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
97:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ba:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
dc:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ed:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
 
$ nvidia-smi 
Mon Oct 27 03:00:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:17:00.0 Off |                    0 |
| N/A   42C    P0            384W / 1000W |  175328MiB / 183359MiB |     87%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA B200                    On  |   00000000:3D:00.0 Off |                    0 |
| N/A   51C    P0            427W / 1000W |  174658MiB / 183359MiB |     83%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA B200                    On  |   00000000:5F:00.0 Off |                    0 |
| N/A   50C    P0            431W / 1000W |  174658MiB / 183359MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA B200                    On  |   00000000:70:00.0 Off |                    0 |
| N/A   41C    P0            388W / 1000W |  174498MiB / 183359MiB |     25%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA B200                    On  |   00000000:97:00.0 Off |                    0 |
| N/A   30C    P0            142W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA B200                    On  |   00000000:BA:00.0 Off |                    0 |
| N/A   35C    P0            142W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA B200                    On  |   00000000:DC:00.0 Off |                    0 |
| N/A   35C    P0            144W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA B200                    On  |   00000000:ED:00.0 Off |                    0 |
| N/A   30C    P0            139W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
 



lo@localhost:~$ nvidia-smi topo  -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	NIC10	NIC11	NIC12	NIC13	NIC14	NIC15	NIC16	NIC17	NIC18	NIC19	NIC20	NIC21	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-23,144-167	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-23,144-167	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-71,192-215	2		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-71,192-215	2		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	72-95,216-239	3		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	72-95,216-239	3		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	120-143,264-287	5		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	120-143,264-287	5		N/A
NIC0	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC1	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC2	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC3	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC4	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC5	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC6	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC7	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC8	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC9	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC10	SYS	SYS	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS				
NIC11	SYS	SYS	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS				
NIC12	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS				
NIC13	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS				
NIC14	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS				
NIC15	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS				
NIC16	SYS	SYS	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS				
NIC17	SYS	SYS	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS				
NIC18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE				
NIC19	SYS	SYS	SYS	SYS	SYS	SYS	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE				
NIC20	SYS	SYS	SYS	SYS	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX				
NIC21	SYS	SYS	SYS	SYS	SYS	SYS	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 				
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
 
NIC Legend:
 
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17
  NIC18: mlx5_18
  NIC19: mlx5_19
  NIC20: mlx5_20
  NIC21: mlx5_21
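The matrix above is easier to use once it is reduced to "which NICs sit closest to each GPU". The following sketch parses `nvidia-smi topo -m` style text and lists, per GPU, the NIC columns reached via `PIX`. The function name is hypothetical, and the excerpt contains only the header and the GPU0 and GPU5 rows from the output above, re-typed with single spaces.

```python
import re

TOPO_EXCERPT = """\
 GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 NIC18 NIC19 NIC20 NIC21 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,144-167 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE PIX PIX SYS SYS SYS SYS 72-95,216-239 3 N/A
"""


def nearest_nics(topo_text, link="PIX"):
    """Map each GPU row to the NIC columns connected via `link`."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # Keep only device columns; 'CPU Affinity' etc. do not match the pattern.
    columns = [t for t in lines[0].split() if re.fullmatch(r"(GPU|NIC)\d+", t)]
    result = {}
    for line in lines[1:]:
        tokens = line.split()
        if not tokens[0].startswith("GPU"):
            continue  # skip NIC rows and legend text
        links = tokens[1:1 + len(columns)]
        result[tokens[0]] = [col for col, kind in zip(columns, links)
                             if col.startswith("NIC") and kind == link]
    return result


print(nearest_nics(TOPO_EXCERPT))
```

Passing `link="NODE"` instead lists the NICs that share the GPU's NUMA node but sit behind a different host bridge.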

TensorRT-LLM

see: https://nvidia.github.io/TensorRT-LLM/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.html

DeepSeek-R1-0528

TP=8, EP=8



Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [10:44<00:00, 15.52it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  644.23    
Total input tokens:                      2273690   
Total generated tokens:                  1958122   
Request throughput (req/s):              15.52     
Output token throughput (tok/s):         3039.49   
Peak output token throughput (tok/s):    1312.00   
Peak concurrent requests:                557.00    
Total Token throughput (tok/s):          6568.81   
---------------Time to First Token----------------
Mean TTFT (ms):                          711.26    
Median TTFT (ms):                        563.85    
P99 TTFT (ms):                           4417.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          160.46    
Median TPOT (ms):                        163.61    
P99 TPOT (ms):                           199.14    
---------------Inter-token Latency----------------
Mean ITL (ms):                           1445.59   
Median ITL (ms):                         1570.76   
P99 ITL (ms):                            1997.93   
==================================================
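The throughput lines in this summary can be sanity-checked against the totals and the duration. The sketch below assumes each throughput figure is simply a total divided by the benchmark duration; under that assumption the derived values agree with the reported ones to within the rounding of the printed duration.

```python
def derived_throughput(duration_s, requests, input_tokens, output_tokens):
    """Recompute the summary throughput figures from totals and duration."""
    return {
        "request_throughput": requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Totals from the TP=8, EP=8 run above.
derived = derived_throughput(
    duration_s=644.23,
    requests=10_000,
    input_tokens=2_273_690,
    output_tokens=1_958_122,
)
reported = {
    "request_throughput": 15.52,
    "output_token_throughput": 3039.49,
    "total_token_throughput": 6568.81,
}
for key, value in derived.items():
    print(f"{key}: derived {value:.2f}, reported {reported[key]:.2f}")
```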

TP=8, EP=1



Maximum request concurrency: 500
100%|██████████| 10000/10000 [12:37<00:00, 13.20it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  757.78    
Total input tokens:                      2273690   
Total generated tokens:                  1958119   
Request throughput (req/s):              13.20     
Output token throughput (tok/s):         2584.00   
Peak output token throughput (tok/s):    1057.00   
Peak concurrent requests:                535.00    
Total Token throughput (tok/s):          5584.45   
---------------Time to First Token----------------
Mean TTFT (ms):                          969.35    
Median TTFT (ms):                        619.08    
P99 TTFT (ms):                           8525.06   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          192.43    
Median TPOT (ms):                        190.21    
P99 TPOT (ms):                           432.71    
---------------Inter-token Latency----------------
Mean ITL (ms):                           1696.98   
Median ITL (ms):                         1871.15   
P99 ITL (ms):                            2541.63   
==================================================

DeepSeek-R1-0528-FP4-v2

TP=4, EP=4



Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:30<00:00, 19.59it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  510.48    
Total input tokens:                      2273690   
Total generated tokens:                  1958698   
Request throughput (req/s):              19.59     
Output token throughput (tok/s):         3836.98   
Peak output token throughput (tok/s):    1507.00   
Peak concurrent requests:                550.00    
Total Token throughput (tok/s):          8291.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          551.74    
Median TTFT (ms):                        442.95    
P99 TTFT (ms):                           3300.34   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          126.87    
Median TPOT (ms):                        129.05    
P99 TPOT (ms):                           155.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           1144.90   
Median ITL (ms):                         1238.82   
P99 ITL (ms):                            1536.10   
==================================================

TP=4, EP=1



Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:31<00:00, 19.57it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  511.03    
Total input tokens:                      2273690   
Total generated tokens:                  1959263   
Request throughput (req/s):              19.57     
Output token throughput (tok/s):         3833.96   
Peak output token throughput (tok/s):    1349.00   
Peak concurrent requests:                564.00    
Total Token throughput (tok/s):          8283.20   
---------------Time to First Token----------------
Mean TTFT (ms):                          551.66    
Median TTFT (ms):                        442.69    
P99 TTFT (ms):                           3526.80   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          126.70    
Median TPOT (ms):                        128.25    
P99 TPOT (ms):                           162.89    
---------------Inter-token Latency----------------
Mean ITL (ms):                           1136.09   
Median ITL (ms):                         1247.91   
P99 ITL (ms):                            1577.18   
==================================================

TP=8, EP=8



Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:59<00:00, 18.55it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  539.19    
Total input tokens:                      2273690   
Total generated tokens:                  1957282   
Request throughput (req/s):              18.55     
Output token throughput (tok/s):         3630.02   
Peak output token throughput (tok/s):    1711.00   
Peak concurrent requests:                585.00    
Total Token throughput (tok/s):          7846.85   
---------------Time to First Token----------------
Mean TTFT (ms):                          562.96    
Median TTFT (ms):                        474.82    
P99 TTFT (ms):                           2562.39   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.51    
Median TPOT (ms):                        138.17    
P99 TPOT (ms):                           173.24    
---------------Inter-token Latency----------------
Mean ITL (ms):                           1220.94   
Median ITL (ms):                         1354.88   
P99 ITL (ms):                            1786.97   
==================================================
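To compare the TensorRT-LLM runs above at a glance, the sketch below collects the reported output token throughput per configuration and computes the change relative to a chosen baseline. Only numbers printed in the logs above are used; the dictionary layout and helper name are illustrative, and "FP8" refers to DeepSeek-R1-0528 while "FP4" refers to DeepSeek-R1-0528-FP4-v2.

```python
# (precision, TP, EP) -> output token throughput in tok/s
trtllm_runs = {
    ("FP8", 8, 8): 3039.49,
    ("FP8", 8, 1): 2584.00,
    ("FP4", 4, 4): 3836.98,
    ("FP4", 4, 1): 3833.96,
    ("FP4", 8, 8): 3630.02,
}


def relative_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Effect of enabling expert parallelism at a fixed TP size.
ep_gain_fp8 = relative_change(trtllm_runs[("FP8", 8, 8)], trtllm_runs[("FP8", 8, 1)])
ep_gain_fp4 = relative_change(trtllm_runs[("FP4", 4, 4)], trtllm_runs[("FP4", 4, 1)])
best_config = max(trtllm_runs, key=trtllm_runs.get)

print(f"FP8 TP8: EP8 vs EP1 {ep_gain_fp8:+.2f}%")
print(f"FP4 TP4: EP4 vs EP1 {ep_gain_fp4:+.2f}%")
print("highest throughput:", best_config)
```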

SGLang

DeepSeek-R1-0528



Maximum request concurrency: 500
100%|██████████| 10000/10000 [09:02<00:00, 18.44it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  542.43    
Total input tokens:                      2273690   
Total generated tokens:                  1959889   
Request throughput (req/s):              18.44     
Output token throughput (tok/s):         3613.17   
Peak output token throughput (tok/s):    10923.00  
Peak concurrent requests:                551.00    
Total Token throughput (tok/s):          7804.85   
---------------Time to First Token----------------
Mean TTFT (ms):                          556.70    
Median TTFT (ms):                        375.04    
P99 TTFT (ms):                           4964.01   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          136.97    
Median TPOT (ms):                        135.53    
P99 TPOT (ms):                           298.24    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.63    
Median ITL (ms):                         46.51     
P99 ITL (ms):                            464.98    
==================================================

DeepSeek-R1-0528-FP4-v2



Maximum request concurrency: 500
100%|██████████| 10000/10000 [07:20<00:00, 22.68it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  440.84    
Total input tokens:                      2273690   
Total generated tokens:                  1958537   
Request throughput (req/s):              22.68     
Output token throughput (tok/s):         4442.75   
Peak output token throughput (tok/s):    12956.00  
Peak concurrent requests:                563.00    
Total Token throughput (tok/s):          9600.39   
---------------Time to First Token----------------
Mean TTFT (ms):                          837.32    
Median TTFT (ms):                        366.68    
P99 TTFT (ms):                           9831.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          109.24    
Median TPOT (ms):                        105.83    
P99 TPOT (ms):                           302.07    
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.05    
Median ITL (ms):                         39.11     
P99 ITL (ms):                            476.41    
==================================================

vLLM

DeepSeek-R1-0528



Maximum request concurrency: 500
100%|██████████| 10000/10000 [10:17<00:00, 16.21it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  617.03    
Total input tokens:                      2273690   
Total generated tokens:                  1959184   
Request throughput (req/s):              16.21     
Output token throughput (tok/s):         3175.19   
Peak output token throughput (tok/s):    7273.00   
Peak concurrent requests:                537.00    
Total Token throughput (tok/s):          6860.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          814.28    
Median TTFT (ms):                        387.54    
P99 TTFT (ms):                           9494.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          154.87    
Median TPOT (ms):                        152.75    
P99 TPOT (ms):                           241.02    
---------------Inter-token Latency----------------
Mean ITL (ms):                           148.09    
Median ITL (ms):                         169.82    
P99 ITL (ms):                            235.63    
==================================================

DeepSeek-R1-0528-FP4-v2



Maximum request concurrency: 500
100%|██████████| 10000/10000 [09:33<00:00, 17.43it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10000     
Maximum request concurrency:             500       
Benchmark duration (s):                  573.83    
Total input tokens:                      2273690   
Total generated tokens:                  1831346   
Request throughput (req/s):              17.43     
Output token throughput (tok/s):         3191.42   
Peak output token throughput (tok/s):    9098.00   
Peak concurrent requests:                540.00    
Total Token throughput (tok/s):          7153.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          691.30    
Median TTFT (ms):                        354.41    
P99 TTFT (ms):                           7219.65   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          151.53    
Median TPOT (ms):                        151.23    
P99 TPOT (ms):                           258.52    
---------------Inter-token Latency----------------
Mean ITL (ms):                           145.27    
Median ITL (ms):                         161.73    
P99 ITL (ms):                            223.18    
==================================================

Other

During the inference tests, each GPU drew about 500 W.
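One way to obtain a per-GPU power figure of this kind is to sample `nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits` during a run and average the readings. The parser below is a minimal sketch of that idea. The sample text is built from the four active GPUs in the `nvidia-smi` snapshot earlier in this article, not from a new measurement, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_power_by_gpu(csv_text):
    """Average power.draw readings (watts) per GPU index."""
    readings = defaultdict(list)
    for line in csv_text.strip().splitlines():
        index, watts = (field.strip() for field in line.split(","))
        readings[int(index)].append(float(watts))
    return {gpu: mean(values) for gpu, values in readings.items()}


# One sample per GPU, in the "index, power.draw" CSV layout.
sample = """\
0, 384.00
1, 427.00
2, 431.00
3, 388.00
"""

per_gpu = mean_power_by_gpu(sample)
overall = mean(per_gpu.values())
print(per_gpu, overall)
```

Concatenating several samples taken over the course of a run gives a time-averaged figure per GPU rather than a single snapshot.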
