# Benchmark reports

This page publishes the curated benchmark artifacts that back the docs claims, so deployed builds can render the same evidence that was generated in the SDK repository. The benchmark runner itself still lives in the Python SDK repository.
## Latest snapshot

Model: `deepseek-chat`. Repeats: 1.
## Semantic Workflow Benchmark v5

### Headline

`zcp_client_to_native_zcp` vs `mcp_client_to_zcp_mcp_surface`: the native ZCP backend averages 3.83x fewer total tokens per task (token delta: 22695.8).

Artifacts: `benchmark_reports/full_semantic_compare_v5/semantic_benchmark_summary.json`
### Overall comparison
| Backend | Answer | Workbook | Tool | Avg Total Tokens | Avg Turns | Avg Tool Calls |
|---|---|---|---|---|---|---|
| zcp_client_to_native_zcp | 100.0% | 97.3% | 100.0% | 8027.9 | 2.1 | 1.1 |
| mcp_client_to_zcp_mcp_surface | 97.3% | 91.9% | 73.0% | 30723.7 | 3.9 | 3.0 |
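The headline figures follow directly from the Avg Total column in the table above; a minimal sketch of the arithmetic:

```python
# Average total tokens per task, taken from the overall-comparison table.
native_avg_total = 8027.9   # zcp_client_to_native_zcp
mcp_avg_total = 30723.7     # mcp_client_to_zcp_mcp_surface

advantage = mcp_avg_total / native_avg_total    # 3.83x
token_delta = mcp_avg_total - native_avg_total  # 22695.8

print(f"Advantage: {advantage:.2f}x")
print(f"Token delta: {token_delta:.1f}")
```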
### Tier comparison

| Tier | Native ZCP Avg Total Tokens | MCP Surface Avg Total Tokens | MCP / Native | Native Quality (Answer / Workbook / Tool) |
|---|---|---|---|---|
| A | 15979.4 | 17613.2 | 1.10x | 100.0% / 93.8% / 100.0% |
| B | 1826.6 | 29239.4 | 16.01x | 100.0% / 100.0% / 100.0% |
| C | 2091.1 | 72113.9 | 34.49x | 100.0% / 100.0% / 100.0% |
| D | 2018.3 | 19375.7 | 9.60x | 100.0% / 100.0% / 100.0% |
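Each tier's ratio is the MCP-surface average divided by the native-ZCP average; a quick check against the table values:

```python
# (native ZCP avg total, MCP surface avg total) per tier, from the table above.
tiers = {
    "A": (15979.4, 17613.2),
    "B": (1826.6, 29239.4),
    "C": (2091.1, 72113.9),
    "D": (2018.3, 19375.7),
}
for tier, (native, mcp) in tiers.items():
    print(f"Tier {tier}: {mcp / native:.2f}x")
```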
## Compact Tool Benchmark

Artifacts: `benchmark_reports/zcp_mcp_tool_call_benchmark.json`
### Compact summary

| Protocol | Runs | Answer Accuracy | Tool Compliance | Avg Prompt Tokens | Avg Completion Tokens | Avg Total Tokens |
|---|---|---|---|---|---|---|
| mcp | 8 | 100.0% | 100.0% | 4136.1 | 367.8 | 4503.9 |
| zcp | 8 | 100.0% | 100.0% | 2577.5 | 255.5 | 2833.0 |
### Compact case breakdown

| Case | ZCP Avg Total Tokens | MCP Avg Total Tokens | MCP / ZCP | Token Delta |
|---|---|---|---|---|
| warmer_city_delta | 2821.0 | 4579.5 | 1.62x | 1758.5 |
| shanghai_temp_f_and_humidity | 2565.0 | 3834.5 | 1.49x | 1269.5 |
| average_three_city_temperature | 3116.0 | 5237.5 | 1.68x | 2121.5 |
| more_humid_city_delta | 2830.0 | 4364.0 | 1.54x | 1534.0 |
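The per-case ratio and token delta columns are derived from the two average-total columns; reproducing them from the table above:

```python
# (ZCP avg total, MCP avg total) per case, from the compact breakdown table.
cases = {
    "warmer_city_delta": (2821.0, 4579.5),
    "shanghai_temp_f_and_humidity": (2565.0, 3834.5),
    "average_three_city_temperature": (3116.0, 5237.5),
    "more_humid_city_delta": (2830.0, 4364.0),
}
for name, (zcp, mcp) in cases.items():
    print(f"{name}: {mcp / zcp:.2f}x, delta {mcp - zcp:.1f}")
```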