
Conversation


@djangoz-nv djangoz-nv commented Dec 19, 2025

Overview:

Details:

Add hf-token-secret to the DynamoGraphDeployment YAMLs in the examples directory to reduce Hugging Face rate limiting in the Frontend service.

We have frequently encountered the following errors in the Frontend service in recent days.

2025-12-17T12:29:15.389368Z  WARN dynamo_llm::hub: Cannot connect to ModelExpress server: Transport error: transport error. Using direct download.
2025-12-17T12:29:15.389385Z  INFO modelexpress_common::download: Downloading model 'Qwen/Qwen3-0.6B' using provider: Hugging Face
2025-12-17T12:29:15.389389Z  WARN modelexpress_common::download: `ignore_weights` is set to true. All the model weight files will be ignored!
2025-12-17T12:29:15.389398Z  INFO modelexpress_common::providers::huggingface: Downloading model from Hugging Face: Qwen/Qwen3-0.6B
2025-12-17T12:29:15.389901Z  INFO modelexpress_common::providers::huggingface: Using cache directory: "/home/dynamo/.cache/huggingface/hub"
2025-12-17T12:29:15.401811Z DEBUG reqwest::connect: starting new connection: https://huggingface.co/
2025-12-17T12:29:15.526863Z  WARN dynamo_llm::hub: ModelExpress download failed for model 'Qwen/Qwen3-0.6B': Failed to fetch model 'Qwen/Qwen3-0.6B' from HuggingFace. Is this a valid HuggingFace ID? Error: request error: HTTP status client error (429 Too Many Requests) for url (https://huggingface.co/api/models/Qwen/Qwen3-0.6B/revision/main)
2025-12-17T12:29:15.526924Z ERROR dynamo_llm::discovery::watcher: Error adding model from discovery model_name="Qwen/Qwen3-0.6B" namespace="dynamo-cloud-sglang-agg-5aa3" error="Failed to fetch model 'Qwen/Qwen3-0.6B' from HuggingFace. Is this a valid HuggingFace ID? Error: request error: HTTP status client error (429 Too Many Requests) for url (https://huggingface.co/api/models/Qwen/Qwen3-0.6B/revision/main)"
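
For context, the fix is a single added line per file: each Frontend service gains an envFromSecret entry pointing at the Kubernetes secret. A minimal sketch of the resulting shape (abridged; the real manifests under examples/ carry additional fields):

    spec:
      services:
        Frontend:
          envFromSecret: hf-token-secret  # sources HF_TOKEN from the Kubernetes secret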

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Chores
    • Updated deployment configurations across all backends to enable secure credential management through Kubernetes secret-based environment variables for Frontend service deployments.


…directory to reduce hf rate limit

Signed-off-by: Django Zhang <[email protected]>
@djangoz-nv djangoz-nv requested review from a team as code owners December 19, 2025 03:29

copy-pr-bot bot commented Dec 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi djangoz-nv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added external-contribution Pull request is from an external contributor chore labels Dec 19, 2025
@djangoz-nv djangoz-nv changed the title chore: add hf-token-secret to DynamoGraphDeployment yamls in example … chore: add hf-token-secret to dgd yamls in example directory Dec 19, 2025
@djangoz-nv djangoz-nv enabled auto-merge (squash) December 19, 2025 03:32

coderabbitai bot commented Dec 19, 2025

Walkthrough

This pull request adds envFromSecret: hf-token-secret to the Frontend service specification across 35+ deployment YAML files in the examples directory. The change enables the Frontend container to source environment variables from a Kubernetes secret named hf-token-secret across multiple backend implementations and deployment patterns.

Changes

Cohort / File(s) — Change Summary

  • Mocker Backend (examples/backends/mocker/deploy/agg.yaml, disagg.yaml): added envFromSecret: hf-token-secret to Frontend service
  • SGLang Backend (examples/backends/sglang/deploy/agg.yaml, agg_logging.yaml, agg_router.yaml, disagg.yaml, disagg-multinode.yaml, disagg_planner.yaml): added envFromSecret: hf-token-secret to Frontend service
  • TRT-LLM Backend (examples/backends/trtllm/deploy/agg.yaml, agg-with-config.yaml, agg_router.yaml, disagg.yaml, disagg-multinode.yaml, disagg_planner.yaml, disagg_router.yaml): added envFromSecret: hf-token-secret to Frontend service
  • vLLM Backend (examples/backends/vllm/deploy/agg.yaml, agg_kvbm.yaml, agg_router.yaml, disagg.yaml, disagg-multinode.yaml, disagg_kvbm.yaml, disagg_kvbm_2p2d.yaml, disagg_kvbm_tp2.yaml, disagg_planner.yaml, disagg_router.yaml, lora/agg_lora.yaml): added envFromSecret: hf-token-secret to Frontend service
  • Examples & Other Deployments (examples/basics/kubernetes/Distributed_Inference/agg_router.yaml, disagg_router.yaml; examples/basics/kubernetes/shared_frontend/shared_frontend.yaml; examples/custom_backend/hello_world/deploy/hello_world.yaml; examples/deployments/GKE/sglang/disagg.yaml, vllm/disagg.yaml; examples/multimodal/deploy/agg_llava.yaml, agg_qwen.yaml): added envFromSecret: hf-token-secret to Frontend service

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Notes:

  • Homogeneous change applied uniformly across 35+ files with identical modification pattern
  • Simple configuration addition with no logic changes, behavioral impacts, or complexity
  • Change is already implemented consistently in other services across the codebase; this applies the same standard practice to Frontend

Poem

🐰 Secrets whispered to the frontend fair,
From hf-token vaults, environment to share,
Thirty files aligned in deployment grace,
Each Frontend now knows its rightful place!
~CodeRabbit 🌟

Pre-merge checks

✅ Passed checks (3 passed)
  • Description check — ✅ Passed: The PR description follows the template structure with Overview, Details, and Related Issues sections. It explains the motivation (reducing HF rate limits), provides context with error logs, and specifies what was changed.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Title check — ✅ Passed: The title accurately describes the main change: adding hf-token-secret to DynamoGraphDeployment YAML files in the examples directory, which aligns with the file changes throughout the pull request.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/deployments/GKE/sglang/disagg.yaml (1)

10-10: Add HuggingFace token secret to Frontend service for consistency and rate limiting mitigation.

The addition of envFromSecret: hf-token-secret to the Frontend service is consistent with Dynamo's SGLang disaggregated deployment pattern, where worker services reference this secret. The secret must be created manually in the Kubernetes cluster using kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} before deployment. This aligns with the required setup steps in the deployment documentation.
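
For reference, an equivalent declarative form of that secret is sketched below, assuming the single HF_TOKEN key implied by the kubectl command above (the repo also ships a template at recipes/hf_hub_secret/hf_hub_secret.yaml):

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
    type: Opaque
    stringData:
      HF_TOKEN: <your-hf-token>  # placeholder; never commit a real token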

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f17fcb1 and 2541755.

📒 Files selected for processing (34)
  • examples/backends/mocker/deploy/agg.yaml (1 hunks)
  • examples/backends/mocker/deploy/disagg.yaml (1 hunks)
  • examples/backends/sglang/deploy/agg.yaml (1 hunks)
  • examples/backends/sglang/deploy/agg_logging.yaml (1 hunks)
  • examples/backends/sglang/deploy/agg_router.yaml (1 hunks)
  • examples/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
  • examples/backends/sglang/deploy/disagg.yaml (1 hunks)
  • examples/backends/sglang/deploy/disagg_planner.yaml (1 hunks)
  • examples/backends/trtllm/deploy/agg-with-config.yaml (1 hunks)
  • examples/backends/trtllm/deploy/agg.yaml (1 hunks)
  • examples/backends/trtllm/deploy/agg_router.yaml (1 hunks)
  • examples/backends/trtllm/deploy/disagg-multinode.yaml (1 hunks)
  • examples/backends/trtllm/deploy/disagg.yaml (1 hunks)
  • examples/backends/trtllm/deploy/disagg_planner.yaml (1 hunks)
  • examples/backends/trtllm/deploy/disagg_router.yaml (1 hunks)
  • examples/backends/vllm/deploy/agg.yaml (1 hunks)
  • examples/backends/vllm/deploy/agg_kvbm.yaml (1 hunks)
  • examples/backends/vllm/deploy/agg_router.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg-multinode.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg_kvbm.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg_planner.yaml (1 hunks)
  • examples/backends/vllm/deploy/disagg_router.yaml (1 hunks)
  • examples/backends/vllm/deploy/lora/agg_lora.yaml (1 hunks)
  • examples/basics/kubernetes/Distributed_Inference/agg_router.yaml (1 hunks)
  • examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml (1 hunks)
  • examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1 hunks)
  • examples/custom_backend/hello_world/deploy/hello_world.yaml (1 hunks)
  • examples/deployments/GKE/sglang/disagg.yaml (1 hunks)
  • examples/deployments/GKE/vllm/disagg.yaml (1 hunks)
  • examples/multimodal/deploy/agg_llava.yaml (1 hunks)
  • examples/multimodal/deploy/agg_qwen.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (32)
examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml (1)

11-11: LGTM. The addition of envFromSecret: hf-token-secret to the Frontend service is correct and consistent with the existing VllmDecodeWorker and VllmPrefillWorker services. The Kubernetes secret creation is already documented in the README with the exact command and secret name used throughout the deployment manifests.

examples/backends/trtllm/deploy/disagg.yaml (1)

11-11: LGTM!

The Frontend service now sources environment variables from hf-token-secret, matching the configuration already present in TRTLLMPrefillWorker and TRTLLMDecodeWorker services.

examples/basics/kubernetes/Distributed_Inference/agg_router.yaml (1)

11-11: LGTM!

The envFromSecret addition complements the existing envs configuration (lines 18-20) and aligns with the VllmDecodeWorker service configuration.

examples/backends/vllm/deploy/disagg_kvbm.yaml (1)

11-11: LGTM!

The Frontend service configuration now matches the VllmDecodeWorker and VllmPrefillWorker services, which already reference hf-token-secret.

examples/backends/mocker/deploy/disagg.yaml (1)

11-11: LGTM!

The Frontend service configuration now matches the prefill and decode worker services, maintaining consistency across all services in the deployment.

examples/backends/vllm/deploy/disagg_planner.yaml (1)

11-11: LGTM!

The Frontend service now sources environment variables from hf-token-secret, consistent with VllmDecodeWorker and VllmPrefillWorker. The Planner service appropriately does not include this secret as it doesn't require Hugging Face access.

examples/backends/sglang/deploy/disagg.yaml (1)

11-11: LGTM!

The Frontend service configuration now aligns with the decode and prefill worker services, which already reference hf-token-secret.

examples/backends/sglang/deploy/disagg-multinode.yaml (1)

20-20: Verify whether the global and service-level HF_TOKEN configurations are both necessary.

The SGLang deployment has both a global HF_TOKEN environment variable (lines 10-14) extracted from hf-token-secret, and service-level envFromSecret: hf-token-secret on Frontend (line 20) and decode (line 30). This pattern differs from the vLLM deployment, which uses only the service-level envFromSecret without the global env. Clarify whether both configurations are intentional (e.g., global env for all services, service-level for overrides) or if one should be removed for consistency.
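
For illustration, the two coexisting patterns might look like the sketch below; the service-level line is taken from this PR, while the global-env syntax is assumed from standard Kubernetes secretKeyRef conventions rather than copied from the file:

    spec:
      # Pattern 1 (assumed syntax): a single HF_TOKEN variable projected from the secret
      envs:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: HF_TOKEN
      services:
        Frontend:
          # Pattern 2: every key in the secret becomes an environment variable
          envFromSecret: hf-token-secret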

examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml (1)

11-11: LGTM - Frontend now matches worker secret configuration.

This addition aligns the Frontend service environment source with VllmDecodeWorker (line 20) and VllmPrefillWorker (line 48).

examples/backends/vllm/deploy/disagg_router.yaml (1)

11-11: LGTM - Aligns Frontend with worker configurations.

The Frontend service now sources secrets consistently with VllmDecodeWorker (line 23) and VllmPrefillWorker (line 43).

examples/multimodal/deploy/agg_llava.yaml (1)

12-12: LGTM - Frontend aligned with all worker services.

Consistent with EncodeWorker (line 20), VLMWorker (line 37), and Processor (line 54) configurations.

examples/backends/trtllm/deploy/disagg_router.yaml (1)

11-11: LGTM - Frontend now matches worker secret sourcing.

Aligns with TRTLLMPrefillWorker (line 23) and TRTLLMDecodeWorker (line 49).

examples/backends/vllm/deploy/disagg-multinode.yaml (1)

11-11: LGTM - Consistent secret sourcing across services.

Frontend now matches decode (line 28) and prefill (line 51) worker configurations.

examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1)

21-21: LGTM - Shared Frontend now has consistent secret access.

The Frontend service uses globalDynamoNamespace: true (line 23) and now sources the same secret as all worker services (VllmDecodeWorker line 42, and workers in agg-qwen deployment at lines 69, 88, 107). This ensures consistent Hugging Face authentication across the shared infrastructure.

examples/backends/trtllm/deploy/agg-with-config.yaml (1)

32-32: LGTM - Frontend aligned with TRTLLMWorker configuration.

Consistent with TRTLLMWorker (line 40) secret sourcing.

examples/backends/trtllm/deploy/disagg-multinode.yaml (1)

93-93: Frontend now sources environment from hf-token-secret consistently with worker services.

This change aligns the Frontend service with the prefill and decode workers (lines 116, 154), which already reference hf-token-secret. The required documentation for creating this secret already exists in the deployment README, which includes explicit instructions: kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}.

examples/backends/sglang/deploy/agg_router.yaml (1)

11-11: LGTM - Consistent with worker configuration.

The Frontend service correctly sources environment variables from hf-token-secret, matching the decode service pattern (line 22).

examples/backends/vllm/deploy/agg_kvbm.yaml (1)

11-11: LGTM - Aligns Frontend with VllmDecodeWorker secret configuration.

examples/backends/trtllm/deploy/agg.yaml (1)

11-11: LGTM - Consistent secret configuration across services.

examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml (1)

11-11: LGTM - Secret configuration unified across all services.

Frontend now matches the VllmDecodeWorker (line 20) and VllmPrefillWorker (line 42) secret configuration.

examples/backends/mocker/deploy/agg.yaml (1)

11-11: LGTM - Frontend aligned with decode service secret configuration.

examples/backends/sglang/deploy/agg.yaml (1)

11-11: LGTM - Completes the secret configuration alignment.

The Frontend service now sources environment variables from hf-token-secret, consistent with the decode service (line 19).

examples/backends/sglang/deploy/agg_logging.yaml (1)

14-14: Code alignment is correct; HuggingFace token secret documentation is already in place.

The Frontend service now properly sources the HuggingFace token from hf-token-secret, aligning with the decode worker configuration. Documentation for creating this secret is provided in the examples/backends/sglang/deploy/README.md file, which includes prerequisite setup instructions and a bash command example. A template secret manifest is also available at recipes/hf_hub_secret/hf_hub_secret.yaml for users to reference.

examples/custom_backend/hello_world/deploy/hello_world.yaml (1)

12-12: Remove unnecessary envFromSecret reference from Frontend service.

The Frontend service references hf-token-secret, but neither the Frontend (client.py) nor the HelloWorldWorker (hello_world.py) uses any Hugging Face APIs or requires authentication. The custom backend simply generates text responses without model discovery, tokenizer loading, or HF interactions. Remove envFromSecret: hf-token-secret from the Frontend service configuration (line 12).

Likely an incorrect or invalid review comment.

examples/multimodal/deploy/agg_qwen.yaml (1)

12-12: LGTM! Frontend service now sources HF token from secret.

The addition of envFromSecret: hf-token-secret to the Frontend service is consistent with the worker services in this deployment (EncodeWorker, VLMWorker, Processor all use the same secret). This enables authenticated Hugging Face API access to mitigate rate limiting.

examples/backends/trtllm/deploy/agg_router.yaml (1)

11-11: LGTM! Aligns with worker configuration.

The Frontend service now sources environment variables from hf-token-secret, matching the TRTLLMWorker configuration.

examples/backends/trtllm/deploy/disagg_planner.yaml (1)

11-11: LGTM! Completes secret configuration across all services.

Adding envFromSecret: hf-token-secret to Frontend aligns with the Planner and worker services, ensuring consistent HF authentication across the deployment.

examples/backends/sglang/deploy/disagg_planner.yaml (1)

11-11: LGTM! Ensures consistent authentication.

The Frontend service now uses hf-token-secret like the other services (Planner, decode, prefill workers).

examples/backends/vllm/deploy/lora/agg_lora.yaml (1)

10-10: LGTM! Matches worker configuration.

Frontend now sources HF token from the same secret as VllmDecodeWorker.

examples/backends/vllm/deploy/disagg.yaml (1)

11-11: LGTM! Consistent across disaggregated deployment.

Frontend service now uses the same secret as the decode and prefill workers.

examples/backends/vllm/deploy/agg_router.yaml (1)

11-11: LGTM! Completes secret configuration.

The Frontend service now sources environment variables from hf-token-secret, matching the VllmDecodeWorker configuration in this router-enabled deployment.

examples/backends/vllm/deploy/agg.yaml (1)

11-11: Secret creation documentation exists in the vLLM Kubernetes deployment guide.

The addition of envFromSecret: hf-token-secret is correct. Documentation for creating this secret is available in the NVIDIA Dynamo vLLM deployment guide, which explicitly instructs users to run kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} before deploying. No action required.

spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret

@coderabbitai coderabbitai bot Dec 19, 2025


⚠️ Potential issue | 🟡 Minor

Ensure secret creation is documented for users.

This change correctly adds Hugging Face authentication to the Frontend service, addressing the rate-limiting issue. The syntax aligns with existing worker service configurations and follows standard Kubernetes patterns.

However, users deploying this manifest must create the hf-token-secret before applying it, or pods will fail to start. If this example lacks setup documentation, add a comment or update the accompanying guide to include the command: kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}

🤖 Prompt for AI Agents
In examples/deployments/GKE/vllm/disagg.yaml around line 11, the manifest
references envFromSecret: hf-token-secret but does not document that users must
create that secret beforehand; update the example by adding an inline YAML
comment above the envFromSecret line or update the accompanying README to
instruct users to create the secret (e.g. run: kubectl create secret generic
hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}) so pods won’t fail on
startup.
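
Concretely, the inline-comment option could look like this (a sketch of the suggestion, not the committed change):

    services:
      Frontend:
        # Prerequisite: kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}
        envFromSecret: hf-token-secret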

@djangoz-nv (Author)

VllmDecodeWorker already has envFromSecret: hf-token-secret, so I think it is fine to add it to the Frontend service in the same YAML.

@coderabbitai (Contributor)

@djangoz-nv You're absolutely right! I can see that both VllmDecodeWorker and VllmPrefillWorker already reference hf-token-secret in this same file (lines 15 and 32). Adding it to the Frontend service maintains consistency across all services in the deployment, which is the correct approach.

My documentation concern would apply to the entire deployment example rather than this specific change. Since the secret is already required by the worker services, your addition to Frontend doesn't introduce any new setup requirements. Thanks for the clarification! ✅

@erezzarum

Why does the frontend need access to huggingface? Can you confirm those errors are from the frontend?
Dynamo uses a very old huggingface library which doesn't retry, so it fails; newer huggingface libraries implement retry and backoff correctly, which will not fail the deployment but will silently retry and back off.
I recommend we also upgrade the huggingface library; I don't think there's a need to inject the HF token into the frontend service.

@djangoz-nv (Author)

> Why does the frontend need access to huggingface? Can you confirm those errors are from the frontend? Dynamo uses a very old huggingface library which doesn't retry, so it fails; newer huggingface libraries implement retry and backoff correctly, which will not fail the deployment but will silently retry and back off. I recommend we also upgrade the huggingface library; I don't think there's a need to inject the HF token into the frontend service.

Yes, I confirmed that these errors are from the Frontend service pod.

In the Dynamo code:

https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/model_card.rs#L405

    /// Download the files this card needs to work: config.json, tokenizer.json, etc.
    pub async fn download_config(&mut self) -> anyhow::Result<()> {
        if self.has_local_files() {
            tracing::trace!("All model config is local, not downloading");
            return Ok(());
        }

        // For TensorBased models, config files are not used - they handle everything in the backend
        if self.model_type.supports_tensor() {
            tracing::debug!(
                display_name = %self.display_name,
                "Skipping config download for TensorBased model"
            );
            return Ok(());
        }

        let ignore_weights = true;
        let local_path = crate::hub::from_hf(&self.display_name, ignore_weights).await?;

        self.update_dir(&local_path);
        Ok(())
    }


erezzarum commented Dec 19, 2025

> Why does the frontend need access to huggingface? Can you confirm those errors are from the frontend? […]
>
> Yes, I confirmed that these errors are from the Frontend service pod. […]

Edit: it does look like the frontend downloads only the model card (config files). Adding the env is not enough; you will still hit throttling at some point even with the token.
I also recommend upgrading to a more recent huggingface hub version to introduce retry and backoff so the process doesn't fail (you will still hit this if you scale your DGD).

It seems to use the Rust version of hf-hub, which appears to lack support for the HF_TOKEN env variable, so adding it should not have any impact:
huggingface/hf-hub#133


djangoz-nv commented Dec 19, 2025

@erezzarum The claim that the Rust version of hf-hub lacks support for HF_TOKEN is interesting.

I ran this several times today: with HF_TOKEN the frontend service always worked, but without HF_TOKEN it always failed.

See the logs below.

$ kubectl get po
NAME                                                              READY   STATUS      RESTARTS       AGE
dynamo-platform-dynamo-operator-controller-manager-f9db494ncgrt   2/2     Running     5 (5d1h ago)   7d9h
dynamo-platform-etcd-0                                            1/1     Running     2 (5d1h ago)   7d9h
dynamo-platform-nats-0                                            2/2     Running     4 (5d1h ago)   7d9h
profile-sla-online-2xgq6                                          0/2     Completed   0              8d
sglang-agg-decode-bfdfc775-brdcb                                  1/1     Running     0              5m48s
sglang-agg-frontend-b6b6585cd-bl5lx                               1/1     Running     0              5m48s
sglang-agg-with-hf-decode-5d58fb56cf-z8jft                        1/1     Running     0              63s
sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n                      1/1     Running     0              63s

$ kubectl logs sglang-agg-frontend-b6b6585cd-bl5lx
2025-12-19T10:39:05.177133Z  INFO dynamo_runtime::distributed: Initializing KV store discovery backend
2025-12-19T10:39:05.222041Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP(S) service protocol="HTTP" address="0.0.0.0:8000"
2025-12-19T10:39:48.077344Z  WARN dynamo_llm::hub: Cannot connect to ModelExpress server: Transport error: transport error. Using direct download.
2025-12-19T10:39:48.077386Z  INFO modelexpress_common::download: Downloading model 'Qwen/Qwen3-0.6B' using provider: Hugging Face
2025-12-19T10:39:48.077391Z  WARN modelexpress_common::download: `ignore_weights` is set to true. All the model weight files will be ignored!
2025-12-19T10:39:48.077401Z  INFO modelexpress_common::providers::huggingface: Downloading model from Hugging Face: Qwen/Qwen3-0.6B
2025-12-19T10:39:48.078492Z  INFO modelexpress_common::providers::huggingface: Using cache directory: "/home/dynamo/.cache/huggingface/hub"
2025-12-19T10:39:48.193113Z  WARN dynamo_llm::hub: ModelExpress download failed for model 'Qwen/Qwen3-0.6B': Failed to fetch model 'Qwen/Qwen3-0.6B' from HuggingFace. Is this a valid HuggingFace ID? Error: request error: HTTP status client error (429 Too Many Requests) for url (https://huggingface.co/api/models/Qwen/Qwen3-0.6B/revision/main)
2025-12-19T10:39:48.193171Z ERROR dynamo_llm::discovery::watcher: Error adding model from discovery model_name="Qwen/Qwen3-0.6B" namespace="dynamo-cloud-sglang-agg" error="Failed to fetch model 'Qwen/Qwen3-0.6B' from HuggingFace. Is this a valid HuggingFace ID? Error: request error: HTTP status client error (429 Too Many Requests) for url (https://huggingface.co/api/models/Qwen/Qwen3-0.6B/revision/main)"
2025-12-19T10:43:42.752711Z ERROR dynamo_llm::discovery::watcher: error removing model error=Missing ModelDeploymentCard for 66c29b1c3f37faff


$ kubectl logs sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n
2025-12-19T10:43:51.707374Z  INFO dynamo_runtime::distributed: Initializing KV store discovery backend
2025-12-19T10:43:51.733627Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP(S) service protocol="HTTP" address="0.0.0.0:8000"
2025-12-19T10:44:29.653420Z  WARN dynamo_llm::hub: Cannot connect to ModelExpress server: Transport error: transport error. Using direct download.
2025-12-19T10:44:29.653458Z  INFO modelexpress_common::download: Downloading model 'Qwen/Qwen3-0.6B' using provider: Hugging Face
2025-12-19T10:44:29.653464Z  WARN modelexpress_common::download: `ignore_weights` is set to true. All the model weight files will be ignored!
2025-12-19T10:44:29.653473Z  INFO modelexpress_common::providers::huggingface: Downloading model from Hugging Face: Qwen/Qwen3-0.6B
2025-12-19T10:44:29.654226Z  INFO modelexpress_common::providers::huggingface: Using cache directory: "/home/dynamo/.cache/huggingface/hub"
2025-12-19T10:44:29.844905Z  INFO modelexpress_common::providers::huggingface: Got model info: RepoInfo { siblings: [Siblings { rfilename: ".gitattributes" }, Siblings { rfilename: "LICENSE" }, Siblings { rfilename: "README.md" }, Siblings { rfilename: "config.json" }, Siblings { rfilename: "generation_config.json" }, Siblings { rfilename: "merges.txt" }, Siblings { rfilename: "model.safetensors" }, Siblings { rfilename: "tokenizer.json" }, Siblings { rfilename: "tokenizer_config.json" }, Siblings { rfilename: "vocab.json" }], sha: "c1899de289a04d12100db370d81485cdf75e47ca" }
2025-12-19T10:44:31.630035Z  INFO modelexpress_common::providers::huggingface: Downloaded model files for Qwen/Qwen3-0.6B
2025-12-19T10:44:31.630096Z  INFO dynamo_llm::hub: ModelExpress download completed successfully for model: Qwen/Qwen3-0.6B
2025-12-19T10:44:32.036361Z  INFO dynamo_llm::http::service::service_v2: chat endpoints enabled
2025-12-19T10:44:32.036416Z  INFO dynamo_llm::http::service::service_v2: completion endpoints enabled
2025-12-19T10:44:32.076350Z  INFO dynamo_runtime::pipeline::network::manager: Initializing NetworkManager mode=nats http_port=8888 tcp_port=9999
2025-12-19T10:44:32.076474Z  INFO dynamo_llm::discovery::watcher: Chat completions is ready
2025-12-19T10:44:32.096963Z  INFO dynamo_llm::discovery::watcher: Completions is ready
2025-12-19T10:44:32.096975Z  INFO dynamo_llm::discovery::watcher: added model model_name="Qwen/Qwen3-0.6B" namespace="dynamo-cloud-sglang-agg-with-hf"


$ kubectl describe po sglang-agg-frontend-b6b6585cd-bl5lx
Name:             sglang-agg-frontend-b6b6585cd-bl5lx
Namespace:        dynamo-cloud
Priority:         0
Service Account:  dynamo-platform-dynamo-operator-component
Node:             viking-cr-201/10.176.195.47
Start Time:       Fri, 19 Dec 2025 10:39:02 +0000
Labels:           nvidia.com/dynamo-component=Frontend
                  nvidia.com/dynamo-component-type=frontend
                  nvidia.com/dynamo-graph-deployment-name=sglang-agg
                  nvidia.com/dynamo-namespace=dynamo-cloud-sglang-agg
                  nvidia.com/metrics-enabled=true
                  nvidia.com/selector=sglang-agg-frontend
                  pod-template-hash=b6b6585cd
Annotations:      <none>
Status:           Running
IP:               10.42.0.22
IPs:
  IP:           10.42.0.22
Controlled By:  ReplicaSet/sglang-agg-frontend-b6b6585cd
Containers:
  main:
    Container ID:  containerd://e2c68456b39ef385456e9ed91d5e3ea2b34446808b8e274b12419aaa5a5f7758
    Image:         nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.1
    Image ID:      nvcr.io/nvidia/ai-dynamo/sglang-runtime@sha256:5e9f8bacc7921353c48c6b83d24ee3c18f5c39744cbb7de0ac0cc91e15f39e17
    Port:          8000/TCP (http)
    Host Port:     0/TCP (http)
    Command:
      python3
    Args:
      -m
      dynamo.frontend
    State:          Running
      Started:      Fri, 19 Dec 2025 10:39:04 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/live delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/health delay=10s timeout=3s period=10s #success=1 #failure=3
    Environment:
      DYNAMO_PORT:                   8000
      DYN_COMPONENT:                 frontend
      DYN_HTTP_PORT:                 8000
      DYN_NAMESPACE:                 dynamo-cloud-sglang-agg
      DYN_PARENT_DGD_K8S_NAME:       sglang-agg
      DYN_PARENT_DGD_K8S_NAMESPACE:  dynamo-cloud
      ETCD_ENDPOINTS:                dynamo-platform-etcd.dynamo-cloud.svc.cluster.local:2379
      NATS_SERVER:                   nats://dynamo-platform-nats.dynamo-cloud.svc.cluster.local:4222
      POD_NAME:                      sglang-agg-frontend-b6b6585cd-bl5lx (v1:metadata.name)
      POD_NAMESPACE:                 dynamo-cloud (v1:metadata.namespace)
      PROMETHEUS_ENDPOINT:           http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    Mounts:
      /dev/shm from shared-memory (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qc9gq (ro)



$ kubectl describe po sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n
Name:             sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n
Namespace:        dynamo-cloud
Priority:         0
Service Account:  dynamo-platform-dynamo-operator-component
Node:             viking-cr-201/10.176.195.47
Start Time:       Fri, 19 Dec 2025 10:43:47 +0000
Labels:           nvidia.com/dynamo-component=Frontend
                  nvidia.com/dynamo-component-type=frontend
                  nvidia.com/dynamo-graph-deployment-name=sglang-agg-with-hf
                  nvidia.com/dynamo-namespace=dynamo-cloud-sglang-agg-with-hf
                  nvidia.com/metrics-enabled=true
                  nvidia.com/selector=sglang-agg-with-hf-frontend
                  pod-template-hash=57f8bfcf8d
Annotations:      <none>
Status:           Running
IP:               10.42.0.26
IPs:
  IP:           10.42.0.26
Controlled By:  ReplicaSet/sglang-agg-with-hf-frontend-57f8bfcf8d
Containers:
  main:
    Container ID:  containerd://3c5fc0599f5d29f5bf88500daedf8be431334a6626f1e1225c4a0a85b5d78649
    Image:         nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.7.1
    Image ID:      nvcr.io/nvidia/ai-dynamo/sglang-runtime@sha256:5e9f8bacc7921353c48c6b83d24ee3c18f5c39744cbb7de0ac0cc91e15f39e17
    Port:          8000/TCP (http)
    Host Port:     0/TCP (http)
    Command:
      python3
    Args:
      -m
      dynamo.frontend
    State:          Running
      Started:      Fri, 19 Dec 2025 10:43:49 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/live delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/health delay=10s timeout=3s period=10s #success=1 #failure=3
    Environment Variables from:
      hf-token-secret  Secret  Optional: false
    Environment:
      DYNAMO_PORT:                   8000
      DYN_COMPONENT:                 frontend
      DYN_HTTP_PORT:                 8000
      DYN_NAMESPACE:                 dynamo-cloud-sglang-agg-with-hf
      DYN_PARENT_DGD_K8S_NAME:       sglang-agg-with-hf
      DYN_PARENT_DGD_K8S_NAMESPACE:  dynamo-cloud
      ETCD_ENDPOINTS:                dynamo-platform-etcd.dynamo-cloud.svc.cluster.local:2379
      NATS_SERVER:                   nats://dynamo-platform-nats.dynamo-cloud.svc.cluster.local:4222
      POD_NAME:                      sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n (v1:metadata.name)
      POD_NAMESPACE:                 dynamo-cloud (v1:metadata.namespace)
      PROMETHEUS_ENDPOINT:           http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    Mounts:
      /dev/shm from shared-memory (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rx8jl (ro)

And without HF_TOKEN, the frontend service never registers the model:

$ kubectl exec -it sglang-agg-frontend-b6b6585cd-bl5lx -- curl 127.0.0.1:8000/v1/models
{"object":"list","data":[]}

$ kubectl exec -it sglang-agg-with-hf-frontend-57f8bfcf8d-q2x7n -- curl 127.0.0.1:8000/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"object","created":1766141330,"owned_by":"nvidia"}]}


erezzarum commented Dec 19, 2025

> The claim that the Rust version of hf-hub lacks support for HF_TOKEN is interesting. I ran this several times today: with HF_TOKEN the frontend service always worked, but without HF_TOKEN it always failed. […]

That is probably because your rate bucket had refilled by the second run (rate limiting is per IP); it doesn't seem to be related to injecting HF_TOKEN into the frontend service.
It looks like the latest Rust version doesn't implement retry and backoff, so it's worth implementing this in Dynamo directly.
The Python version does have this.

Edit: it seems we need to wrap the modelexpress client with retry/backoff for when we get throttled. Dynamo uses modelexpress to download models and falls back to direct download if the modelexpress server is not available.

