Benchmark Methodology
ZCP's benchmark claims should be reproducible from the Python SDK repository, not maintained manually in the docs site.
Source Of Truth
Use only the benchmark harness in zero-context-protocol-python:
- script:
examples/compare_zcp_mcp_tool_call_benchmark.py - script (Excel semantic suite):
examples/compare_excel_client_protocol_benchmark.py - harness:
tools/benchmarking.py - harness (Excel semantic suite):
tools/excel_llm_benchmarking.py - report outputs:
benchmark_reports/zcp_mcp_tool_call_benchmark.jsonbenchmark_reports/zcp_mcp_tool_call_benchmark.mdbenchmark_reports/full_semantic_compare_v5/excel_llm_token_benchmark.jsonbenchmark_reports/full_semantic_compare_v5/excel_llm_token_benchmark.mdbenchmark_reports/full_semantic_compare_v5/semantic_benchmark_summary.md
The docs site should consume generated artifacts. It should not reimplement the runner logic.
Required Reporting Fields
Every benchmark statement should include:
- run date
- model and provider
- repeat count
- task set description
- prompt construction rules
- whether full MCP schema injection was included
- token accounting source
- failure handling rules
Reproduction
From zero-context-protocol-python:
python3 examples/compare_zcp_mcp_tool_call_benchmark.py --repeats 2For the current published Excel semantic comparison (full_semantic_compare_v5):
python3 examples/compare_excel_client_protocol_benchmark.py \
--model deepseek-chat \
--repeats 1 \
--tiers A,B,C,D \
--backends zcp_client_to_native_zcp,mcp_client_to_zcp_mcp_surface \
--output-dir benchmark_reports/full_semantic_compare_v5Interpretation Rules
- compare protocol overhead, not general model intelligence
- keep the model, provider, and run date explicit because benchmark numbers
drift over time
- use the generated JSON and markdown reports as the publishable evidence
- treat docs-site charts and tables as presentation only
Why ZCP Can Win
The intended efficiency gains come from:
- compact registry exchange
- schema-out-of-band design
- handle-based large result storage
- incremental disclosure of constraints instead of repeated full schemas
- task-oriented runtime state that stays outside the transcript
Every claim about token savings should be paired with the exact generated benchmark artifact used for that comparison.
Latest Published Results (full_semantic_compare_v5)
Publish metadata:
- run date:
2026-03-17 - model/provider:
deepseek-chatonhttps://api.deepseek.com - repeats:
1 - case count:
37 - backend pair:
- native:
zcp_client_to_native_zcp - MCP surface:
mcp_client_to_zcp_mcp_surface
| Scope | Native ZCP Avg Total | MCP Surface Avg Total | Token Delta (MCP - Native) | Ratio |
|---|---|---|---|---|
| Overall | 8027.9 | 30723.7 | 22695.8 | 3.83x |
| Tier A | 15979.4 | 17613.2 | 1633.8 | 1.10x |
| Tier B | 1826.6 | 29239.4 | 27412.8 | 16.01x |
| Tier C | 2091.1 | 72113.9 | 70022.7 | 34.49x |
| Tier D | 2018.3 | 19375.7 | 17357.3 | 9.60x |
Quality summary for native ZCP in this run:
- answer accuracy:
100.0% - workbook accuracy:
97.3% - tool compliance:
100.0%
Boundary notes:
- this report is a single-run (
repeats=1) snapshot, not a variance study 25/37cases favor native ZCP on tokens; the12non-winning cases are all Tier A primitive operations