fix(llm-trace): stash litellm tool-call usage so token cost survives arg-parse failures #2444

Merged
nicoloboschi merged 2 commits into
vectorize-io:mainfrom
r266-tech:fix/litellm-tool-call-usage-stash
Jun 30, 2026
@r266-tech

Contributor

Completes the tool-calling path for #2396 (closes #2387 — keep the provider-billed token cost on error traces).

#2396 made LiteLLMLLM.call() stash the response usage right after the billed completion so that a local failure (e.g. a length check or JSON parse) doesn't drop the tokens the provider already charged for. call_with_tools() was left without that stash:

  • the response is awaited and billed at the top of the try,
  • then tool-call arguments are parsed with an unguarded arguments = json.loads(arguments) (litellm_llm.py:423),
  • a malformed arguments string raises json.JSONDecodeError after billing; the broad except Exception classifies it as non-retryable and re-raises, and the wrapper error path reads current_response_usage() → None and records input/output_tokens = 0/0.

That is exactly the accounting loss #2387 set out to eliminate, just on the tool-calling route — and LiteLLMRouterLLM inherits it (it overrides _acompletion/_build_common_kwargs/_stage_label/_resolve_completion_model, not call_with_tools).

This adds the same stash_response_usage(_usage_from_litellm_response(response)) immediately after the billed response, mirroring call() and the anthropic/gemini call_with_tools paths. On the success path the usage is recomputed downstream as before; on the error path the provider cost now survives so the total-vs-effective-vs-failed token split #2387 asked for stays accurate.
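For readers who want to see the failure mode and the fix side by side, here is a minimal, self-contained sketch. It is not the project's code: the `Usage` dataclass, the contextvar-backed `stash_response_usage`/`current_response_usage` stand-ins, `fake_completion`, and `traced_call` are all assumptions made for illustration, and the real helpers may be implemented differently. It only shows why stashing right after the billed response keeps the tokens visible to an error path.

```python
import asyncio
import contextvars
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


# Hypothetical stand-in for the project's usage stash; the real helpers may differ.
_usage_var: contextvars.ContextVar[Optional[Usage]] = contextvars.ContextVar(
    "usage", default=None
)


def stash_response_usage(usage: Usage) -> None:
    _usage_var.set(usage)


def current_response_usage() -> Optional[Usage]:
    return _usage_var.get()


async def fake_completion() -> dict:
    """Pretend provider call: billed tokens plus a malformed arguments string."""
    return {
        "usage": {"prompt_tokens": 120, "completion_tokens": 8},
        "arguments": '{"city": "Par',  # truncated JSON
    }


async def call_with_tools(stash: bool) -> dict:
    response = await fake_completion()  # the provider has billed by this point
    if stash:
        u = response["usage"]
        stash_response_usage(Usage(u["prompt_tokens"], u["completion_tokens"]))
    return json.loads(response["arguments"])  # raises json.JSONDecodeError


async def traced_call(stash: bool) -> tuple[int, int]:
    """Hypothetical wrapper error path: returns the tokens it would record."""
    _usage_var.set(None)
    try:
        await call_with_tools(stash)
    except json.JSONDecodeError:
        usage = current_response_usage()
        return (usage.input_tokens, usage.output_tokens) if usage else (0, 0)
    raise AssertionError("expected the malformed arguments to fail parsing")


without_stash = asyncio.run(traced_call(stash=False))
with_stash = asyncio.run(traced_call(stash=True))
print(without_stash, with_stash)  # (0, 0) (120, 8)
```

In this sketch the only difference between the two runs is whether the stash happens before `json.loads`, which is the ordering this PR adds to call_with_tools().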

r266-tech and others added 2 commits June 28, 2026 16:54
Add a real-provider regression test for the fix in this PR: the existing
wrapper-level tools test uses a provider that already stashes, so it does
not guard LiteLLMLLM.call_with_tools. This drives the real provider with a
billed response whose tool arguments are malformed JSON and asserts the
error trace keeps the provider-reported tokens (input/output/cached). The
LiteLLMRouterLLM subclass inherits call_with_tools, so it is covered too.

Verified it fails (input_tokens=None) when the stash line is removed.
@nicoloboschi nicoloboschi merged commit b7080a1 into vectorize-io:main Jun 30, 2026
177 of 188 checks passed

Development

Successfully merging this pull request may close these issues.

Question: should llm_requests retain provider token usage when structured output parsing fails?

2 participants