Benchmark Methodology

ZCP's benchmark claims should be reproducible from the Python SDK repository, not maintained manually in the docs site.

Source Of Truth

Use only the benchmark harness in zero-context-protocol-python:

  • script: examples/compare_zcp_mcp_tool_call_benchmark.py
  • script (Excel semantic suite): examples/compare_excel_client_protocol_benchmark.py
  • harness: tools/benchmarking.py
  • harness (Excel semantic suite): tools/excel_llm_benchmarking.py
  • report outputs:
  • benchmark_reports/zcp_mcp_tool_call_benchmark.json
  • benchmark_reports/zcp_mcp_tool_call_benchmark.md
  • benchmark_reports/full_semantic_compare_v5/excel_llm_token_benchmark.json
  • benchmark_reports/full_semantic_compare_v5/excel_llm_token_benchmark.md
  • benchmark_reports/full_semantic_compare_v5/semantic_benchmark_summary.md

The docs site should consume generated artifacts. It should not reimplement the runner logic.

Required Reporting Fields

Every benchmark statement should include:

  • run date
  • model and provider
  • repeat count
  • task set description
  • prompt construction rules
  • whether full MCP schema injection was included
  • token accounting source
  • failure handling rules

Reproduction

From zero-context-protocol-python:

python3 examples/compare_zcp_mcp_tool_call_benchmark.py --repeats 2

For the current published Excel semantic comparison (full_semantic_compare_v5):

python3 examples/compare_excel_client_protocol_benchmark.py \
  --model deepseek-chat \
  --repeats 1 \
  --tiers A,B,C,D \
  --backends zcp_client_to_native_zcp,mcp_client_to_zcp_mcp_surface \
  --output-dir benchmark_reports/full_semantic_compare_v5

Interpretation Rules

  • compare protocol overhead, not general model intelligence
  • keep the model, provider, and run date explicit because benchmark numbers

drift over time

  • use the generated JSON and markdown reports as the publishable evidence
  • treat docs-site charts and tables as presentation only

Why ZCP Can Win

The intended efficiency gains come from:

  • compact registry exchange
  • schema-out-of-band design
  • handle-based large result storage
  • incremental disclosure of constraints instead of repeated full schemas
  • task-oriented runtime state that stays outside the transcript

Every claim about token savings should be paired with the exact generated benchmark artifact used for that comparison.

Latest Published Results (full_semantic_compare_v5)

Publish metadata:

  • run date: 2026-03-17
  • model/provider: deepseek-chat on https://api.deepseek.com
  • repeats: 1
  • case count: 37
  • backend pair:
  • native: zcp_client_to_native_zcp
  • MCP surface: mcp_client_to_zcp_mcp_surface
ScopeNative ZCP Avg TotalMCP Surface Avg TotalToken Delta (MCP - Native)Ratio
Overall8027.930723.722695.83.83x
Tier A15979.417613.21633.81.10x
Tier B1826.629239.427412.816.01x
Tier C2091.172113.970022.734.49x
Tier D2018.319375.717357.39.60x

Quality summary for native ZCP in this run:

  • answer accuracy: 100.0%
  • workbook accuracy: 97.3%
  • tool compliance: 100.0%

Boundary notes:

  • this report is a single-run (repeats=1) snapshot, not a variance study
  • 25/37 cases favor native ZCP on tokens; the 12 non-winning cases are all Tier A primitive operations