Code Examples

This section provides practical code examples for common SDK use cases. All examples are available as runnable scripts in the `samples/` directory.

Quick Reference

Sample	Description
`benchmark_evaluation.py`	Run a model against a benchmark, wait for completion, retrieve results
`quickstart.py`	Minimal end-to-end trace evaluation
`async_workflow.py`	Full async evaluation workflow with concurrent operations
`async_results.py`	Fetch results for multiple evaluations concurrently
`model_benchmark_management.py`	Filter models by name, company, or region; add them to or remove them from a project
`evaluation_filtering.py`	Sort and filter evaluations by status, accuracy, date
`compare_evaluations.py`	Compare two models on a benchmark with outcome filtering
`paginated_results.py`	Paginate through results or fetch all at once
`custom_model.py`	Register a custom model with an OpenAI-compatible API
`custom_benchmark.py`	Create custom and smart benchmarks from data files
`create_judge.py`	Create, list, update, and delete judges
`basic_trace.py`	Upload, list, get, and delete traces
`trace_evaluation.py`	Run judges on traces, estimate cost, get results with steps
`judge_optimization.py`	Estimate, run, and apply judge optimizations
`public_catalog.py`	Browse public models, benchmarks, evaluations, and prompts
`integration_management.py`	List, inspect, and test configured integrations

The examples are grouped by topic:

Creating Evaluations -- Sync, async, and parallel evaluations
Retrieving Results -- Paginated, bulk, and concurrent result fetching
Models and Benchmarks -- Filtering, custom models, custom/smart benchmarks, project management
Judges and Traces -- Judge CRUD, trace uploads, trace evaluations, and optimizations
Public API -- Public models, benchmarks, evaluations, and comparisons
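
As a rough illustration of the concurrent result fetching mentioned above, the sketch below shows the general `asyncio.gather` pattern. It does not use the SDK: `fetch_results` is a hypothetical stand-in for whatever async call the SDK exposes, and the evaluation IDs and returned fields are made up for the example. See `async_results.py` for the actual usage.

```python
import asyncio


async def fetch_results(evaluation_id: str) -> dict:
    """Hypothetical stand-in for an SDK call (not the real API)."""
    await asyncio.sleep(0)  # placeholder for a network round trip
    return {"evaluation_id": evaluation_id, "status": "completed"}


async def fetch_all(evaluation_ids: list[str]) -> list[dict]:
    # gather() schedules the coroutines concurrently and returns
    # their results in the same order as the input IDs.
    return await asyncio.gather(*(fetch_results(e) for e in evaluation_ids))


if __name__ == "__main__":
    results = asyncio.run(fetch_all(["eval-1", "eval-2", "eval-3"]))
    for r in results:
        print(r["evaluation_id"], r["status"])
```

Because `gather` preserves input order, each result can be matched back to its evaluation ID by position, without extra bookkeeping.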

For the complete samples catalog including industry solutions, OpenClaw agent evaluation, CI/CD integration, and more, see the Samples Guide.