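Supermicro B200 Inference Performance Test
Summary
This evaluation uses the ShareGPT dataset with 10,000 requests at a maximum concurrency of 500. The client for every run is vllm bench, and the figure compared below is output token throughput (tok/s).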
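The article does not show the exact client invocation. As a rough sketch only, the setup described above (ShareGPT, 10,000 prompts, maximum concurrency 500) could be expressed as a `vllm bench serve` command line like the one built below. The server URL, the model identifier, and the dataset file name are assumptions for illustration and are not taken from the article.

```python
import shlex


def build_bench_cmd(model, dataset_path, base_url="http://127.0.0.1:8000",
                    num_prompts=10_000, max_concurrency=500):
    """Return the argv list for one `vllm bench serve` run against a live server."""
    return [
        "vllm", "bench", "serve",
        "--base-url", base_url,            # assumed address of the serving endpoint
        "--model", model,
        "--dataset-name", "sharegpt",
        "--dataset-path", dataset_path,
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
    ]


cmd = build_bench_cmd(
    model="deepseek-ai/DeepSeek-R1-0528",                       # assumed model id
    dataset_path="ShareGPT_V3_unfiltered_cleaned_split.json",   # assumed local file
)
print(shlex.join(cmd))  # copy-pasteable shell command
```

Keeping the same client and arguments for TensorRT-LLM, SGLang, and vLLM is what makes the throughput figures comparable across frameworks.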
DeepSeek-R1-0528(TP8)
| Framework | B200 (tok/s) | H200 (tok/s) | B200/H200 |
| --- | --- | --- | --- |
| TensorRT-LLM | 3039.49 | 4139.24 | 0.734 |
| SGLang | 3613.17 | 3741.13 | 0.966 |
| vLLM | 3175.19 | 3941.19 | 0.806 |
DeepSeek-R1-0528-FP4-v2
| Framework | FP4 (TP4) (tok/s) | FP8 (TP8) (tok/s) | FP4*2 / FP8 |
| --- | --- | --- | --- |
| TensorRT-LLM | 3836.98 | 3039.49 | 2.525 |
| SGLang | 4442.75 | 3613.17 | 2.460 |
| vLLM | 3191.42 | 3175.19 | 2.010 |
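In terms of inference performance, ecosystem support for B200 is much better than it was at the start of the year, but the frameworks still do not fully exploit the hardware.
On a single node at TP8, B200 is slightly slower than H200 (0.73x to 0.97x depending on the framework), and SGLang supports B200 comparatively well. With FP4, the model needs only half the GPUs, so the whole-node gain is significant: counting two TP4 instances per 8-GPU node, throughput is 2.0x to 2.5x that of FP8 at TP8.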
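The two ratio columns in the summary tables can be reproduced with a few lines of arithmetic. The sketch below assumes, as the FP4*2 column appears to, that an 8-GPU node can host two independent TP4 instances whose throughput adds up linearly. The article did not measure two concurrent instances, so treat the node-level figure as an estimate.

```python
# Output token throughput (tok/s) copied from the summary tables.
b200_fp8 = {"TensorRT-LLM": 3039.49, "SGLang": 3613.17, "vLLM": 3175.19}  # TP8
h200_fp8 = {"TensorRT-LLM": 4139.24, "SGLang": 3741.13, "vLLM": 3941.19}  # TP8
b200_fp4 = {"TensorRT-LLM": 3836.98, "SGLang": 4442.75, "vLLM": 3191.42}  # TP4


def b200_vs_h200(framework):
    """Ratio of B200 to H200 throughput at TP8."""
    return b200_fp8[framework] / h200_fp8[framework]


def fp4_node_vs_fp8_node(framework, gpus_per_node=8, fp4_tp=4):
    """Estimated whole-node FP4 throughput relative to one FP8 TP8 instance.

    Assumes gpus_per_node // fp4_tp independent FP4 instances scale linearly.
    """
    instances = gpus_per_node // fp4_tp
    return instances * b200_fp4[framework] / b200_fp8[framework]


for fw in b200_fp8:
    print(f"{fw:13s} B200/H200 = {b200_vs_h200(fw):.3f}   "
          f"FP4*2/FP8 = {fp4_node_vs_fp8_node(fw):.3f}")
```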
Server information
Quantity: 1
CPU: Intel Xeon 6960P, 2 sockets x 72 cores, 2.70 GHz
MEM: 3TB
GPU: B200 DGX
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 288
On-line CPU(s) list: 0-287
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) 6960P
CPU family: 6
Model: 173
Thread(s) per core: 2
Core(s) per socket: 72
Socket(s): 2
Stepping: 1
CPU(s) scaling MHz: 22%
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
BogoMIPS: 5400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 6.8 MiB (144 instances)
L1i cache: 9 MiB (144 instances)
L2 cache: 288 MiB (144 instances)
L3 cache: 864 MiB (2 instances)
NUMA node(s): 6
NUMA node0 CPU(s): 0-23,144-167
NUMA node1 CPU(s): 24-47,168-191
NUMA node2 CPU(s): 48-71,192-215
NUMA node3 CPU(s): 72-95,216-239
NUMA node4 CPU(s): 96-119,240-263
NUMA node5 CPU(s): 120-143,264-287
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
MEM
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0
0x0000000100000000-0x000003007fffffff 3T online yes 2-1536
Memory block size: 2G
Total online memory: 3T
Total offline memory: 0B
$ sudo dmidecode -t memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.7.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.
Handle 0x0014, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Single-bit ECC
Maximum Capacity: 12 TB
Error Information Handle: Not Provided
Number Of Devices: 48
Handle 0x0015, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0014
Error Information Handle: Not Provided
Total Width: 80 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR5
Type Detail: Synchronous Registered (Buffered)
Speed: 6400 MT/s
Manufacturer: Samsung
Serial Number: 80CE01253101D68B4D
Asset Tag: P1-DIMMA1_AssetTag25/31)
Part Number: M321RAJA0MB2-CCPWC
Rank: 2
Configured Memory Speed: 6400 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: 0000
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: 0xCE00
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 GB
Cache Size: None
Logical Size: None
...
GPU
$ sudo lspci |grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
5f:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
70:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
97:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ba:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
dc:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ed:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
$ nvidia-smi
Mon Oct 27 03:00:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA B200 On | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 384W / 1000W | 175328MiB / 183359MiB | 87% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA B200 On | 00000000:3D:00.0 Off | 0 |
| N/A 51C P0 427W / 1000W | 174658MiB / 183359MiB | 83% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA B200 On | 00000000:5F:00.0 Off | 0 |
| N/A 50C P0 431W / 1000W | 174658MiB / 183359MiB | 93% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA B200 On | 00000000:70:00.0 Off | 0 |
| N/A 41C P0 388W / 1000W | 174498MiB / 183359MiB | 25% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA B200 On | 00000000:97:00.0 Off | 0 |
| N/A 30C P0 142W / 1000W | 4MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA B200 On | 00000000:BA:00.0 Off | 0 |
| N/A 35C P0 142W / 1000W | 4MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA B200 On | 00000000:DC:00.0 Off | 0 |
| N/A 35C P0 144W / 1000W | 4MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA B200 On | 00000000:ED:00.0 Off | 0 |
| N/A 30C P0 139W / 1000W | 4MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 NIC18 NIC19 NIC20 NIC21 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,144-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE PIX PIX NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,144-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX PIX NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-71,192-215 2 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS NODE NODE PIX PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-71,192-215 2 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX PIX NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS 72-95,216-239 3 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE PIX PIX SYS SYS SYS SYS 72-95,216-239 3 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX PIX NODE NODE 120-143,264-287 5 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX PIX 120-143,264-287 5 N/A
NIC0 PIX NODE SYS SYS SYS SYS SYS SYS X PIX NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX NODE SYS SYS SYS SYS SYS SYS PIX X NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE X PIX NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE PIX X NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC7 SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC8 SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC9 SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC10 SYS SYS SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS
NIC11 SYS SYS SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS
NIC12 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE X PIX PIX PIX NODE NODE SYS SYS SYS SYS
NIC13 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX X PIX PIX NODE NODE SYS SYS SYS SYS
NIC14 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX PIX X PIX NODE NODE SYS SYS SYS SYS
NIC15 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX PIX PIX X NODE NODE SYS SYS SYS SYS
NIC16 SYS SYS SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE X PIX SYS SYS SYS SYS
NIC17 SYS SYS SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE PIX X SYS SYS SYS SYS
NIC18 SYS SYS SYS SYS SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX NODE NODE
NIC19 SYS SYS SYS SYS SYS SYS PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X NODE NODE
NIC20 SYS SYS SYS SYS SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE X PIX
NIC21 SYS SYS SYS SYS SYS SYS NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_15
NIC16: mlx5_16
NIC17: mlx5_17
NIC18: mlx5_18
NIC19: mlx5_19
NIC20: mlx5_20
NIC21: mlx5_21
TensorRT-LLM
see: https://nvidia.github.io/TensorRT-LLM/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.html
DeepSeek-R1-0528
TP=8, EP=8
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [10:44<00:00, 15.52it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 644.23
Total input tokens: 2273690
Total generated tokens: 1958122
Request throughput (req/s): 15.52
Output token throughput (tok/s): 3039.49
Peak output token throughput (tok/s): 1312.00
Peak concurrent requests: 557.00
Total Token throughput (tok/s): 6568.81
---------------Time to First Token----------------
Mean TTFT (ms): 711.26
Median TTFT (ms): 563.85
P99 TTFT (ms): 4417.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 160.46
Median TPOT (ms): 163.61
P99 TPOT (ms): 199.14
---------------Inter-token Latency----------------
Mean ITL (ms): 1445.59
Median ITL (ms): 1570.76
P99 ITL (ms): 1997.93
==================================================
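The throughput lines in each result block follow from the token totals and the benchmark duration, which gives a quick way to sanity-check any run. The sketch below parses a few lines copied from the TP=8, EP=8 run above and recomputes the rates. The parser is a hypothetical helper written for this article's output format, not part of vllm bench.

```python
import re

# A few lines copied from the TP=8, EP=8 result block above.
SAMPLE = """\
Successful requests: 10000
Benchmark duration (s): 644.23
Total input tokens: 2273690
Total generated tokens: 1958122
Request throughput (req/s): 15.52
Output token throughput (tok/s): 3039.49
Total Token throughput (tok/s): 6568.81
"""


def parse_result(text):
    """Map each 'name: number' line to a float, skipping separators and headers."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"^(.*?):\s+([0-9.]+)\s*$", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics


r = parse_result(SAMPLE)
duration = r["Benchmark duration (s)"]

# Recompute the three rates from raw counts and wall-clock duration.
req_s = r["Successful requests"] / duration
out_tok_s = r["Total generated tokens"] / duration
total_tok_s = (r["Total input tokens"] + r["Total generated tokens"]) / duration

print(f"req/s       reported {r['Request throughput (req/s)']:.2f}, derived {req_s:.2f}")
print(f"output tok/s reported {r['Output token throughput (tok/s)']:.2f}, derived {out_tok_s:.2f}")
print(f"total tok/s  reported {r['Total Token throughput (tok/s)']:.2f}, derived {total_tok_s:.2f}")
```

The derived values agree with the reported ones to within rounding of the duration, so the summary tables are comparing output tokens only, not input plus output.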
TP=8, EP=1
Maximum request concurrency: 500
100%|██████████| 10000/10000 [12:37<00:00, 13.20it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 757.78
Total input tokens: 2273690
Total generated tokens: 1958119
Request throughput (req/s): 13.20
Output token throughput (tok/s): 2584.00
Peak output token throughput (tok/s): 1057.00
Peak concurrent requests: 535.00
Total Token throughput (tok/s): 5584.45
---------------Time to First Token----------------
Mean TTFT (ms): 969.35
Median TTFT (ms): 619.08
P99 TTFT (ms): 8525.06
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 192.43
Median TPOT (ms): 190.21
P99 TPOT (ms): 432.71
---------------Inter-token Latency----------------
Mean ITL (ms): 1696.98
Median ITL (ms): 1871.15
P99 ITL (ms): 2541.63
==================================================
DeepSeek-R1-0528-FP4-v2
TP=4, EP=4
Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:30<00:00, 19.59it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 510.48
Total input tokens: 2273690
Total generated tokens: 1958698
Request throughput (req/s): 19.59
Output token throughput (tok/s): 3836.98
Peak output token throughput (tok/s): 1507.00
Peak concurrent requests: 550.00
Total Token throughput (tok/s): 8291.00
---------------Time to First Token----------------
Mean TTFT (ms): 551.74
Median TTFT (ms): 442.95
P99 TTFT (ms): 3300.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 126.87
Median TPOT (ms): 129.05
P99 TPOT (ms): 155.78
---------------Inter-token Latency----------------
Mean ITL (ms): 1144.90
Median ITL (ms): 1238.82
P99 ITL (ms): 1536.10
==================================================
TP=4, EP=1
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:31<00:00, 19.57it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 511.03
Total input tokens: 2273690
Total generated tokens: 1959263
Request throughput (req/s): 19.57
Output token throughput (tok/s): 3833.96
Peak output token throughput (tok/s): 1349.00
Peak concurrent requests: 564.00
Total Token throughput (tok/s): 8283.20
---------------Time to First Token----------------
Mean TTFT (ms): 551.66
Median TTFT (ms): 442.69
P99 TTFT (ms): 3526.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 126.70
Median TPOT (ms): 128.25
P99 TPOT (ms): 162.89
---------------Inter-token Latency----------------
Mean ITL (ms): 1136.09
Median ITL (ms): 1247.91
P99 ITL (ms): 1577.18
==================================================
TP=8, EP=8
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 500
100%|██████████| 10000/10000 [08:59<00:00, 18.55it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 539.19
Total input tokens: 2273690
Total generated tokens: 1957282
Request throughput (req/s): 18.55
Output token throughput (tok/s): 3630.02
Peak output token throughput (tok/s): 1711.00
Peak concurrent requests: 585.00
Total Token throughput (tok/s): 7846.85
---------------Time to First Token----------------
Mean TTFT (ms): 562.96
Median TTFT (ms): 474.82
P99 TTFT (ms): 2562.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 135.51
Median TPOT (ms): 138.17
P99 TPOT (ms): 173.24
---------------Inter-token Latency----------------
Mean ITL (ms): 1220.94
Median ITL (ms): 1354.88
P99 ITL (ms): 1786.97
==================================================
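The five TensorRT-LLM runs above differ in model precision, tensor parallelism (TP), and expert parallelism (EP), so a per-GPU view makes them easier to compare. The sketch below assumes that each run occupies exactly TP GPUs, which the article implies but does not state explicitly.

```python
# (model, TP, EP, output token throughput in tok/s) from the runs above.
runs = [
    ("DeepSeek-R1-0528",        8, 8, 3039.49),
    ("DeepSeek-R1-0528",        8, 1, 2584.00),
    ("DeepSeek-R1-0528-FP4-v2", 4, 4, 3836.98),
    ("DeepSeek-R1-0528-FP4-v2", 4, 1, 3833.96),
    ("DeepSeek-R1-0528-FP4-v2", 8, 8, 3630.02),
]


def per_gpu(tok_s, tp):
    """Throughput per GPU, assuming the run uses exactly `tp` GPUs."""
    return tok_s / tp


for model, tp, ep, tok_s in runs:
    print(f"{model:24s} TP={tp} EP={ep}  {tok_s:8.2f} tok/s  "
          f"{per_gpu(tok_s, tp):7.2f} tok/s per GPU")

# Run with the highest per-GPU throughput.
best = max(runs, key=lambda run: per_gpu(run[3], run[1]))
print("best per-GPU config:", best[0], f"TP={best[1]} EP={best[2]}")
```

On these numbers, enabling EP helps the FP8 model at TP8 (3039.49 vs 2584.00 tok/s) but makes almost no difference for FP4 at TP4, and FP4 at TP8 is slower in absolute terms than FP4 at TP4.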
SGLang
DeepSeek-R1-0528
Maximum request concurrency: 500
100%|██████████| 10000/10000 [09:02<00:00, 18.44it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 542.43
Total input tokens: 2273690
Total generated tokens: 1959889
Request throughput (req/s): 18.44
Output token throughput (tok/s): 3613.17
Peak output token throughput (tok/s): 10923.00
Peak concurrent requests: 551.00
Total Token throughput (tok/s): 7804.85
---------------Time to First Token----------------
Mean TTFT (ms): 556.70
Median TTFT (ms): 375.04
P99 TTFT (ms): 4964.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 136.97
Median TPOT (ms): 135.53
P99 TPOT (ms): 298.24
---------------Inter-token Latency----------------
Mean ITL (ms): 131.63
Median ITL (ms): 46.51
P99 ITL (ms): 464.98
==================================================
DeepSeek-R1-0528-FP4-v2
Maximum request concurrency: 500
100%|██████████| 10000/10000 [07:20<00:00, 22.68it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 440.84
Total input tokens: 2273690
Total generated tokens: 1958537
Request throughput (req/s): 22.68
Output token throughput (tok/s): 4442.75
Peak output token throughput (tok/s): 12956.00
Peak concurrent requests: 563.00
Total Token throughput (tok/s): 9600.39
---------------Time to First Token----------------
Mean TTFT (ms): 837.32
Median TTFT (ms): 366.68
P99 TTFT (ms): 9831.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 109.24
Median TPOT (ms): 105.83
P99 TPOT (ms): 302.07
---------------Inter-token Latency----------------
Mean ITL (ms): 105.05
Median ITL (ms): 39.11
P99 ITL (ms): 476.41
==================================================
vLLM
DeepSeek-R1-0528
Maximum request concurrency: 500
100%|██████████| 10000/10000 [10:17<00:00, 16.21it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 617.03
Total input tokens: 2273690
Total generated tokens: 1959184
Request throughput (req/s): 16.21
Output token throughput (tok/s): 3175.19
Peak output token throughput (tok/s): 7273.00
Peak concurrent requests: 537.00
Total Token throughput (tok/s): 6860.10
---------------Time to First Token----------------
Mean TTFT (ms): 814.28
Median TTFT (ms): 387.54
P99 TTFT (ms): 9494.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 154.87
Median TPOT (ms): 152.75
P99 TPOT (ms): 241.02
---------------Inter-token Latency----------------
Mean ITL (ms): 148.09
Median ITL (ms): 169.82
P99 ITL (ms): 235.63
==================================================
DeepSeek-R1-0528-FP4-v2
Maximum request concurrency: 500
100%|██████████| 10000/10000 [09:33<00:00, 17.43it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 500
Benchmark duration (s): 573.83
Total input tokens: 2273690
Total generated tokens: 1831346
Request throughput (req/s): 17.43
Output token throughput (tok/s): 3191.42
Peak output token throughput (tok/s): 9098.00
Peak concurrent requests: 540.00
Total Token throughput (tok/s): 7153.70
---------------Time to First Token----------------
Mean TTFT (ms): 691.30
Median TTFT (ms): 354.41
P99 TTFT (ms): 7219.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 151.53
Median TPOT (ms): 151.23
P99 TPOT (ms): 258.52
---------------Inter-token Latency----------------
Mean ITL (ms): 145.27
Median ITL (ms): 161.73
P99 ITL (ms): 223.18
==================================================
Other
During the inference tests, each GPU drew about 500 W.
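Combining the power figure with the measured throughput gives a rough energy-efficiency estimate. The sketch below assumes the approximate 500 W per GPU applies equally to the FP8 (TP8) and FP4 (TP4) runs and counts GPU power only. The article does not report power per configuration, so these are order-of-magnitude estimates rather than measurements.

```python
GPU_POWER_W = 500.0  # approximate per-GPU draw stated above; assumed for every run


def tokens_per_joule(output_tok_s, num_gpus, gpu_power_w=GPU_POWER_W):
    """Output tokens per joule of GPU energy (excludes host, fans, and cooling)."""
    return output_tok_s / (num_gpus * gpu_power_w)


# SGLang figures from the summary tables: (output tok/s, number of GPUs).
configs = {
    "SGLang FP8 (TP8)": (3613.17, 8),
    "SGLang FP4 (TP4)": (4442.75, 4),
}

for name, (tok_s, gpus) in configs.items():
    print(f"{name}: {tokens_per_joule(tok_s, gpus):.2f} output tokens per joule")
```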