mlx-Chronos

Benchmark suite and community leaderboard for local LLM inference on Apple Silicon. Run a reproducible benchmark, save a sealed JSON result, and compare engines across Macs.

Overview

mlx-chronos is a standardized benchmark tool for local LLM inference engines on Apple Silicon. It detects your Mac, runs a fixed benchmark protocol against an OpenAI-compatible engine endpoint, and writes structured result files for local analysis or public leaderboard submission.

The public leaderboard is available at igurss.github.io/mlx-chronos.

What It Measures

Metric	Meaning	Public comparison use
TTFT cold	Time from request start to first non-empty streamed token with cache-avoiding prompts	Yes
TTFT cached	Time to first token after a cache-priming call with the same prompt	Yes
Request throughput	Completion tokens divided by full client-observed request time	Yes, when engine token usage is reliable
Sustained throughput	Optional long throughput run for heat buildup and late-run degradation	Yes, under the sustained profile
System RAM peak	Peak total Mac RAM in use during the benchmark	Yes
Engine RSS	Post-warmup RSS of the engine server process when identifiable	Diagnostic only
Thermal state	Start, end, worst state, samples, and affected benchmark phases when available	Context metadata
Tool calling	Planned future success-rate benchmark	Not yet available
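The TTFT and request-throughput definitions in the table can be sketched in a few lines of Python. This is an illustration of the definitions only, not the mlx-chronos implementation: `measure_stream`, its parameters, and the injectable `clock` are hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    completion_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttft_seconds, request_tokens_per_second) for one streamed request.

    `chunks` yields the text deltas of a streamed response (hypothetical input);
    `completion_tokens` is the count the engine reports in its usage block.
    """
    start = clock()
    ttft = None
    for text in chunks:
        # TTFT stops at the first non-empty piece of streamed content.
        if ttft is None and text:
            ttft = clock() - start
    total = clock() - start
    # Request throughput divides by the full client-observed request time,
    # so it includes request overhead, prefill, and decode.
    throughput = completion_tokens / total if total > 0 else 0.0
    return ttft, throughput
```

Passing a fake `clock` makes the arithmetic easy to check by hand: with readings of 0.0, 0.5, and 2.0 seconds and 10 completion tokens, TTFT is 0.5 s and request throughput is 5 tokens per second.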

Current Release

Version 0.3.0 adds guided benchmark workflows, update and preflight tools, reconstructible timing metadata, and stricter public leaderboard integrity checks. It retains internal protocol compatibility label 3.

Supported Engines

Engine	Project	Notes
Ollama	ollama/ollama	MLX backend
oMLX	jundot/omlx	OpenAI-compatible server
Rapid-MLX	raullenchai/Rapid-MLX	OpenAI-compatible server
vllm-mlx	waybarrios/vllm-mlx	OpenAI-compatible server
mlx-lm	ml-explore/mlx-lm	Apple MLX

Note: The engine server must already be running before mlx-chronos run, mlx-chronos models, or mlx-chronos validate can query it. See CONTRIBUTING.md for engine setup details.

Quick Start

1. Install

pip install mlx-chronos

Optional thermal-state support through macOS Foundation/PyObjC:

pip install "mlx-chronos[thermal]"

2. Check Version and Updates

mlx-chronos --version
mlx-chronos upgrade

When run in an interactive terminal, mlx-chronos performs a best-effort background PyPI version check. If a newer release is available, it prints a short notice recommending:

mlx-chronos upgrade

Set MLX_CHRONOS_DISABLE_UPDATE_CHECK=1 to disable the automatic check.

3. Inspect Your Engine

mlx-chronos engines
mlx-chronos models --engine omlx
mlx-chronos validate --engine omlx --model "Qwen3.5-4B-OptiQ-4bit"
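If you want to see what a running engine exposes without the CLI, the sketch below queries the standard OpenAI-style model list endpoint directly. It assumes the server answers `GET /v1/models` with a JSON object whose `data` array holds entries with an `id` field; `parse_model_ids` and `list_model_ids` are hypothetical helper names, not part of mlx-chronos.

```python
import json
import urllib.request
from typing import List


def parse_model_ids(payload: dict) -> List[str]:
    """Extract model IDs from an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_model_ids(base_url: str, timeout: float = 5.0) -> List[str]:
    """Fetch <base_url>/v1/models and return the exposed model IDs."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

For example, `list_model_ids("http://localhost:8000")` would query a server on the oMLX default port, provided that server is already running.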

4. Use the Interactive Wizard

mlx-chronos wizard

The wizard provides a terminal menu for common actions and a guided benchmark builder with engine, model, profile, token bounds, output format, cooldown, preflight, notes, and other run options. When the selected engine server is running, the wizard loads /models and lets you select a model from the exposed IDs, with manual entry as a fallback. Before launching a benchmark, it shows the equivalent mlx-chronos run ... command so the same configuration can be reused in scripts. You can return to the main menu from benchmark setup without starting a run.

5. Run a Benchmark Manually

mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit"

Results are written to results/local/ by default.

6. Useful Run Options

# Write both JSON and Markdown outputs
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --format all

# Choose a custom output directory
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --output-dir ~/Desktop/benchmarks

# Set output token bounds for the request-throughput trials (local experiments)
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --max-tokens 100 --min-tokens 80

# Run the longer heat/throttling-sensitive sustained profile
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --profile sustained

# Enforce cooldown after a recent run in the same output directory
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --cooldown-seconds 300

# Fail fast with an extra model access probe before measured work starts
mlx-chronos run --engine omlx --model "Qwen3.5-4B-OptiQ-4bit" --preflight

# Include a model reference URL, required for public leaderboard submissions
mlx-chronos run --engine omlx \
  --model "Qwen3.5-4B-OptiQ-4bit" \
  --model-url "https://huggingface.co/mlx-community/Qwen3.5-4B-OptiQ-4bit"

CLI Reference

Command	Purpose
`mlx-chronos --version`	Print the installed package version
`mlx-chronos wizard`	Open an interactive menu for common commands and guided benchmark setup
`mlx-chronos upgrade`	Check PyPI and upgrade the current Python environment if a newer release exists
`mlx-chronos engines`	List supported engines and local installed/running status
`mlx-chronos models --engine <name>`	List model IDs exposed by a running engine server
`mlx-chronos validate --engine <name> --model <model>`	Validate hardware, engine, server, and optional model access
`mlx-chronos run --engine <name> --model <model>`	Run a benchmark and save local result files
`mlx-chronos submit --file <result.json> --dry-run`	Validate whether a result is publishable
`mlx-chronos submit --file <result.json>`	Send a validated result to the maintainer inbox

Configuration

Setting	Example	What it changes
`MLX_CHRONOS_<ENGINE>_PORT`	`MLX_CHRONOS_OMLX_PORT=8002`	Overrides an engine server port
`MLX_CHRONOS_CACHED_TTFT_RATIO`	`MLX_CHRONOS_CACHED_TTFT_RATIO=0.8`	Sets the cached-TTFT warning threshold
`MLX_CHRONOS_DISABLE_UPDATE_CHECK`	`MLX_CHRONOS_DISABLE_UPDATE_CHECK=1`	Disables automatic background update checks
`MLX_CHRONOS_SUBMIT_ENDPOINT`	`https://example.test/form`	Overrides the maintainer inbox endpoint

Default engine ports:

Engine	Default port
oMLX	`8000`
Rapid-MLX	`8001`
vllm-mlx	`8000`
mlx-lm	`8080`
Ollama	`11434`

oMLX and vllm-mlx both default to port 8000. To avoid mislabeling results, mlx-Chronos checks the oMLX listener process with lsof; if that process cannot be inspected, oMLX validation may fail even when /v1/models responds.
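The way a port override and the default ports combine can be pictured with the sketch below. It is an assumption-laden illustration: `resolve_port` is a hypothetical helper, the lower-case engine keys other than `omlx` are guesses at the CLI names, and the rule that hyphens become underscores in the variable name is assumed (only `MLX_CHRONOS_OMLX_PORT` appears in the table above).

```python
import os
from typing import Mapping, Optional

# Default ports copied from the table above; key spellings are assumptions.
DEFAULT_PORTS = {
    "omlx": 8000,
    "rapid-mlx": 8001,
    "vllm-mlx": 8000,
    "mlx-lm": 8080,
    "ollama": 11434,
}


def resolve_port(engine: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Return the override from MLX_CHRONOS_<ENGINE>_PORT, else the default."""
    env = os.environ if env is None else env
    # Assumption: upper-case the engine name and replace hyphens with underscores.
    key = "MLX_CHRONOS_" + engine.upper().replace("-", "_") + "_PORT"
    override = env.get(key)
    return int(override) if override else DEFAULT_PORTS[engine.lower()]
```

With `MLX_CHRONOS_OMLX_PORT=8002` set, `resolve_port("omlx")` returns 8002. Giving oMLX its own port this way is one option for keeping it apart from vllm-mlx when both are installed.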

Benchmark Protocol

mlx-chronos run executes a fixed protocol against the running engine. The JSON result records exact prompt text, token bounds, benchmark profile, timing metadata, hardware metadata, and an integrity seal.

Measurement Flow

Phase	What happens
Hardware detection	Captures chip, machine model, memory, macOS, Python, architecture, battery state, Low Power Mode, and thermal context when available
Warmup	Uses a separate prompt so that prefix/KV cache hits from warmup do not remove prefill work from the throughput trials
Cold TTFT	Uses unique prompts inside the run to avoid same-run cache hits
Cached TTFT	Primes one fixed prompt, then measures consecutive cached trials
Throughput	Uses fixed protocol prompts and deterministic generation parameters
RAM and thermal tracking	Samples system RAM, diagnostic engine RSS, phase timings, and thermal state where available
Result sealing	Adds a tamper-evident integrity seal for public-submission validation

Important Details

Requests use deterministic generation parameters: temperature=0.0 and top_p=1.0.
Throughput is end-to-end request throughput, not pure decode speed. It includes request overhead, prefill, and decode.
Timed TTFT and throughput requests are never retried. A transient request failure invalidates the run instead of becoming part of a published timing.
Cached TTFT is recorded only after cache priming completes successfully.
Decode throughput records first-content-to-stream-end elapsed time so the value can be reconstructed from raw completion-token counts.
Throughput prompts intentionally vary to reduce cache artifacts, so run standard deviation includes workload variation plus system and engine noise.
If an engine cannot provide reliable usage.completion_tokens, the run falls back to a local estimate and is marked as not leaderboard-comparable.
p95 is reported only when at least 20 trials are available.
The default baseline run uses 5 trials. The prompt pool supports at most 30 unique cold and throughput prompts.
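The trial-count rule for p95 can be sketched as a small summary function. This is a minimal illustration, assuming a nearest-rank percentile and a sample standard deviation; the README does not state which percentile method mlx-chronos uses, and `summarize` is a hypothetical name.

```python
import statistics
from typing import Dict, List, Optional


def summarize(values: List[float], p95_min_trials: int = 20) -> Dict[str, Optional[float]]:
    """Summarize per-trial measurements; p95 stays None for small samples."""
    n = len(values)
    summary: Dict[str, Optional[float]] = {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values) if n > 1 else 0.0,
        "p95": None,
    }
    if n >= p95_min_trials:
        # Nearest-rank percentile (assumed method): ceil(0.95 * n) in integer math.
        rank = (95 * n + 99) // 100
        summary["p95"] = sorted(values)[rank - 1]
    return summary
```

A default 5-trial baseline run therefore reports mean and standard deviation but no p95. With so few trials, a 95th percentile would just be the maximum.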

Sustained Profile

--profile sustained runs one long throughput trial with max_tokens=1000 by default and records progress samples every 100 generated output units. Intermediate samples are estimates when the stream only reports exact token usage at the end.

If the sustained run observes a thermal-state change or non-nominal thermal state, result metadata includes a sustained throttling warning. The warning compares early and late progress-window averages, not a single first/last sample.
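Comparing window averages rather than single samples might look like the sketch below. The window size of 3 samples and the helper names are assumptions for illustration; the README does not specify how wide the early and late windows are.

```python
from statistics import fmean
from typing import List, Tuple


def window_averages(rates: List[float], window: int = 3) -> Tuple[float, float]:
    """Average the first and last `window` progress samples (tokens per second)."""
    if len(rates) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    return fmean(rates[:window]), fmean(rates[-window:])


def relative_drop(rates: List[float], window: int = 3) -> float:
    """Fraction by which the late-window average falls below the early one."""
    early, late = window_averages(rates, window)
    return (early - late) / early
```

Averaging over a window keeps one noisy sample at the start or end of the run from dominating the comparison.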

Cooldown Metadata

Before each run, mlx-Chronos checks the latest prior JSON result in the same output directory. The elapsed time is saved as meta.elapsed_since_last_benchmark_seconds.

Use --cooldown-seconds to enforce a pause before starting another run. The default recent-run warning threshold is 300 seconds.
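The cooldown check can be approximated as follows. This sketch uses file modification times to find the latest prior JSON result, which is an assumption: mlx-chronos may read a timestamp from inside the result file instead. Both function names are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_since_last_result(output_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the newest *.json in output_dir, or None if there is none."""
    results = list(output_dir.glob("*.json"))
    if not results:
        return None
    # Assumption: file mtime stands in for the recorded benchmark time.
    latest = max(p.stat().st_mtime for p in results)
    now = time.time() if now is None else now
    return max(0.0, now - latest)


def cooldown_remaining(elapsed: Optional[float], cooldown_seconds: float) -> float:
    """Seconds still to wait before the next run; 0.0 if no prior run exists."""
    if elapsed is None:
        return 0.0
    return max(0.0, cooldown_seconds - elapsed)
```

With a 300-second cooldown and a prior result written 100 seconds ago, `cooldown_remaining` returns 200.0.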

For a fuller explanation, see docs/methodology.md.

Leaderboard Rules

Local runs are intentionally flexible. You can change trial count, profile, output token bounds, cooldown, connection mode, notes, and other parameters for your own diagnostics.

Public leaderboard submissions are stricter so rows remain comparable.

Publishable Profiles

Profile	Trials	`max_tokens`	Minimum generated output	`min_tokens`
Baseline	5	100	80 tokens	Not allowed
Sustained	1	1000	800 tokens	Not allowed

Public Submission Requirements

Throughput must use the engine response's usage.completion_tokens.
The result must include model.reference_url, a link to the model used.
The inference engine version must be known; engine.version=unknown is not accepted for public comparison.
Hardware must report an Apple M-series chip, arm64, and a valid macOS version; timestamps may not be more than 10 minutes in the future.
All warmup calls must complete successfully (warmup_failures=0).
System RAM, engine RSS, and continuous Foundation thermal monitoring must complete without sampling errors.
macOS Low Power Mode must be disabled.
Decode throughput must include reconstructible raw decode elapsed time.
The JSON must pass mlx-chronos submit --dry-run.
The result must include a valid integrity seal.
The archive rejects duplicate integrity digests and duplicate run identities.
Custom token bounds, fallback token estimates, custom public-profile trial counts, short-output runs, and Low Power Mode runs are valid local records but are not accepted into the public leaderboard.
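Several of the rules above reduce to simple comparisons against the publishable-profile table. The sketch below shows that subset only, under stated assumptions: `RunSummary` is a hypothetical flat view of a run (the real result JSON layout may differ), and `mlx-chronos submit --dry-run` remains the authoritative check.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values copied from the Publishable Profiles table.
PUBLIC_PROFILES = {
    "baseline": {"trials": 5, "max_tokens": 100, "min_output_tokens": 80},
    "sustained": {"trials": 1, "max_tokens": 1000, "min_output_tokens": 800},
}


@dataclass
class RunSummary:
    """Hypothetical flat view of a result; field names are illustrative."""
    profile: str
    trials: int
    max_tokens: int
    min_tokens: Optional[int]      # None when --min-tokens was not passed
    shortest_output_tokens: int    # fewest tokens generated in any trial
    engine_reported_tokens: bool   # True if usage.completion_tokens was used
    low_power_mode: bool


def publish_blockers(run: RunSummary) -> List[str]:
    """Return the reasons a run would stay local-only (empty list if none)."""
    rules = PUBLIC_PROFILES.get(run.profile)
    if rules is None:
        return [f"unknown profile: {run.profile}"]
    blockers = []
    if run.trials != rules["trials"]:
        blockers.append("custom trial count")
    if run.max_tokens != rules["max_tokens"] or run.min_tokens is not None:
        blockers.append("custom token bounds")
    if run.shortest_output_tokens < rules["min_output_tokens"]:
        blockers.append("short output")
    if not run.engine_reported_tokens:
        blockers.append("fallback token estimate")
    if run.low_power_mode:
        blockers.append("Low Power Mode enabled")
    return blockers
```

A run that returns a non-empty list is still a valid local record; it would simply not be accepted into the public leaderboard.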

Result JSON also contains internal benchmark-protocol labels used by validators to detect incompatible result formats. Treat labels such as 1, 2, and 3 as implementation compatibility markers, not public protocol release versions.

Model reference URLs point to the model page used for the run. Model pages can change over time when maintainers update files or tags. Leaderboard comparisons keep model name, quantization, format, provenance, and revision separate so distinct variants are not grouped together.

Submit Results

Pull Request Workflow

1. Run mlx-chronos run on your Mac.
2. Find the generated JSON in results/local/.
3. Validate it locally:

mlx-chronos submit --file results/local/your-result.json --dry-run

4. Copy the checked JSON into results/submitted/ with a clear filename.
5. Open a pull request with only that JSON file changed.
6. GitHub Actions labels the PR as result-submission and validates schema and integrity. The maintainer then reviews it before merge.

Warning: Do not edit submitted JSON by hand after the run. Public submissions include an integrity seal over the canonical result payload; changing any benchmark field invalidates that seal.
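To see why a hand edit breaks the seal, consider the generic tamper-evidence sketch below. It is not the mlx-chronos seal: the README does not describe the actual algorithm, so a SHA-256 digest over sorted-key JSON is used here as a stand-in, and all three function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def seal(payload: dict) -> str:
    """Stand-in seal: hex SHA-256 digest of the canonical payload."""
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def seal_matches(payload: dict, digest: str) -> bool:
    """Check a payload against a previously computed digest."""
    return hmac.compare_digest(seal(payload), digest)
```

Because the digest covers a canonical form, reordering keys leaves it unchanged, while changing any value produces a different digest and the check fails.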

Inbox Fallback

If opening a PR is inconvenient, send a validated result directly:

mlx-chronos submit --file results/local/your-result.json

Maintainers can override the inbox endpoint with --endpoint or MLX_CHRONOS_SUBMIT_ENDPOINT.

See CONTRIBUTING.md for detailed contributor instructions.

Roadmap

Completed

Future

Evaluate a clearer TTFT naming model without breaking the v0.1 JSON contract
Add tool-calling success-rate benchmarks
Collect more results from M3, M4, and M5 systems

License

Apache 2.0. See LICENSE.
