chore: add hf-token-secret to dgd yamls in example directory #5028
base: main
Conversation
…directory to reduce hf rate limit Signed-off-by: Django Zhang <[email protected]>
👋 Hi djangoz-nv! Thank you for contributing to ai-dynamo/dynamo. 🚀
Walkthrough: This pull request adds `envFromSecret: hf-token-secret` to the Frontend service in the example DynamoGraphDeployment manifests.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/deployments/GKE/sglang/disagg.yaml (1)
10-10: Add Hugging Face token secret to Frontend service for consistency and rate-limiting mitigation.
The addition of `envFromSecret: hf-token-secret` to the Frontend service is consistent with Dynamo's SGLang disaggregated deployment pattern, where worker services reference this secret. The secret must be created manually in the Kubernetes cluster using `kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}` before deployment. This aligns with the required setup steps in the deployment documentation.
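For reference, the same secret can also be expressed declaratively instead of via `kubectl create secret`. This is a hedged sketch, not something shipped in the PR: the `stringData` field lets Kubernetes handle base64 encoding for you, and the placeholder token value is of course an assumption to be substituted.

```yaml
# Hypothetical declarative equivalent of the kubectl command above.
# Apply with: kubectl apply -f hf-token-secret.yaml -n <namespace>
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  HF_TOKEN: "<your-hugging-face-token>"  # placeholder; substitute a real token
```

The key name `HF_TOKEN` matches the `--from-literal=HF_TOKEN=...` form used throughout this thread, so services referencing the secret see the same environment variable either way.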
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
- examples/backends/mocker/deploy/agg.yaml (1 hunks)
- examples/backends/mocker/deploy/disagg.yaml (1 hunks)
- examples/backends/sglang/deploy/agg.yaml (1 hunks)
- examples/backends/sglang/deploy/agg_logging.yaml (1 hunks)
- examples/backends/sglang/deploy/agg_router.yaml (1 hunks)
- examples/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
- examples/backends/sglang/deploy/disagg.yaml (1 hunks)
- examples/backends/sglang/deploy/disagg_planner.yaml (1 hunks)
- examples/backends/trtllm/deploy/agg-with-config.yaml (1 hunks)
- examples/backends/trtllm/deploy/agg.yaml (1 hunks)
- examples/backends/trtllm/deploy/agg_router.yaml (1 hunks)
- examples/backends/trtllm/deploy/disagg-multinode.yaml (1 hunks)
- examples/backends/trtllm/deploy/disagg.yaml (1 hunks)
- examples/backends/trtllm/deploy/disagg_planner.yaml (1 hunks)
- examples/backends/trtllm/deploy/disagg_router.yaml (1 hunks)
- examples/backends/vllm/deploy/agg.yaml (1 hunks)
- examples/backends/vllm/deploy/agg_kvbm.yaml (1 hunks)
- examples/backends/vllm/deploy/agg_router.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg-multinode.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg_kvbm.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg_planner.yaml (1 hunks)
- examples/backends/vllm/deploy/disagg_router.yaml (1 hunks)
- examples/backends/vllm/deploy/lora/agg_lora.yaml (1 hunks)
- examples/basics/kubernetes/Distributed_Inference/agg_router.yaml (1 hunks)
- examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml (1 hunks)
- examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1 hunks)
- examples/custom_backend/hello_world/deploy/hello_world.yaml (1 hunks)
- examples/deployments/GKE/sglang/disagg.yaml (1 hunks)
- examples/deployments/GKE/vllm/disagg.yaml (1 hunks)
- examples/multimodal/deploy/agg_llava.yaml (1 hunks)
- examples/multimodal/deploy/agg_qwen.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (32)
examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml (1)
11-11: LGTM. The addition of `envFromSecret: hf-token-secret` to the Frontend service is correct and consistent with the existing VllmDecodeWorker and VllmPrefillWorker services. The Kubernetes secret creation is already documented in the README with the exact command and secret name used throughout the deployment manifests.
examples/backends/trtllm/deploy/disagg.yaml (1)
11-11: LGTM! The Frontend service now sources environment variables from `hf-token-secret`, matching the configuration already present in TRTLLMPrefillWorker and TRTLLMDecodeWorker services.
examples/basics/kubernetes/Distributed_Inference/agg_router.yaml (1)
11-11: LGTM! The `envFromSecret` addition complements the existing `envs` configuration (lines 18-20) and aligns with the VllmDecodeWorker service configuration.
examples/backends/vllm/deploy/disagg_kvbm.yaml (1)
11-11: LGTM! The Frontend service configuration now matches the VllmDecodeWorker and VllmPrefillWorker services, which already reference `hf-token-secret`.
examples/backends/mocker/deploy/disagg.yaml (1)
11-11: LGTM! The Frontend service configuration now matches the prefill and decode worker services, maintaining consistency across all services in the deployment.
examples/backends/vllm/deploy/disagg_planner.yaml (1)
11-11: LGTM! The Frontend service now sources environment variables from `hf-token-secret`, consistent with VllmDecodeWorker and VllmPrefillWorker. The Planner service appropriately does not include this secret as it doesn't require Hugging Face access.
examples/backends/sglang/deploy/disagg.yaml (1)
11-11: LGTM! The Frontend service configuration now aligns with the decode and prefill worker services, which already reference `hf-token-secret`.
examples/backends/sglang/deploy/disagg-multinode.yaml (1)
20-20: Verify whether the global and service-level HF_TOKEN configurations are both necessary.
The SGLang deployment has both a global `HF_TOKEN` environment variable (lines 10-14) extracted from `hf-token-secret`, and service-level `envFromSecret: hf-token-secret` on Frontend (line 20) and decode (line 30). This pattern differs from the vLLM deployment, which uses only the service-level `envFromSecret` without the global env. Clarify whether both configurations are intentional (e.g., global env for all services, service-level for overrides) or if one should be removed for consistency.
examples/backends/vllm/deploy/disagg_kvbm_tp2.yaml (1)
11-11: LGTM - Frontend now matches worker secret configuration. This addition aligns the Frontend service environment source with VllmDecodeWorker (line 20) and VllmPrefillWorker (line 48).
examples/backends/vllm/deploy/disagg_router.yaml (1)
11-11: LGTM - Aligns Frontend with worker configurations. The Frontend service now sources secrets consistently with VllmDecodeWorker (line 23) and VllmPrefillWorker (line 43).
examples/multimodal/deploy/agg_llava.yaml (1)
12-12: LGTM - Frontend aligned with all worker services. Consistent with EncodeWorker (line 20), VLMWorker (line 37), and Processor (line 54) configurations.
examples/backends/trtllm/deploy/disagg_router.yaml (1)
11-11: LGTM - Frontend now matches worker secret sourcing. Aligns with TRTLLMPrefillWorker (line 23) and TRTLLMDecodeWorker (line 49).
examples/backends/vllm/deploy/disagg-multinode.yaml (1)
11-11: LGTM - Consistent secret sourcing across services. Frontend now matches decode (line 28) and prefill (line 51) worker configurations.
examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1)
21-21: LGTM - Shared Frontend now has consistent secret access. The Frontend service uses `globalDynamoNamespace: true` (line 23) and now sources the same secret as all worker services (VllmDecodeWorker line 42, and workers in the agg-qwen deployment at lines 69, 88, 107). This ensures consistent Hugging Face authentication across the shared infrastructure.
examples/backends/trtllm/deploy/agg-with-config.yaml (1)
32-32: LGTM - Frontend aligned with TRTLLMWorker configuration. Consistent with TRTLLMWorker (line 40) secret sourcing.
examples/backends/trtllm/deploy/disagg-multinode.yaml (1)
93-93: Frontend now sources environment from hf-token-secret consistently with worker services. This change aligns the Frontend service with the prefill and decode workers (lines 116, 154), which already reference `hf-token-secret`. The required documentation for creating this secret already exists in the deployment README, which includes explicit instructions: `kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}`.
examples/backends/sglang/deploy/agg_router.yaml (1)
11-11: LGTM - Consistent with worker configuration. The Frontend service correctly sources environment variables from `hf-token-secret`, matching the decode service pattern (line 22).
examples/backends/vllm/deploy/agg_kvbm.yaml (1)
11-11: LGTM - Aligns Frontend with VllmDecodeWorker secret configuration.
examples/backends/trtllm/deploy/agg.yaml (1)
11-11: LGTM - Consistent secret configuration across services.
examples/backends/vllm/deploy/disagg_kvbm_2p2d.yaml (1)
11-11: LGTM - Secret configuration unified across all services. Frontend now matches the VllmDecodeWorker (line 20) and VllmPrefillWorker (line 42) secret configuration.
examples/backends/mocker/deploy/agg.yaml (1)
11-11: LGTM - Frontend aligned with decode service secret configuration.
examples/backends/sglang/deploy/agg.yaml (1)
11-11: LGTM - Completes the secret configuration alignment. The Frontend service now sources environment variables from `hf-token-secret`, consistent with the decode service (line 19).
examples/backends/sglang/deploy/agg_logging.yaml (1)
14-14: Code alignment is correct; Hugging Face token secret documentation is already in place. The Frontend service now properly sources the Hugging Face token from `hf-token-secret`, aligning with the decode worker configuration. Documentation for creating this secret is provided in the `examples/backends/sglang/deploy/README.md` file, which includes prerequisite setup instructions and a bash command example. A template secret manifest is also available at `recipes/hf_hub_secret/hf_hub_secret.yaml` for users to reference.
examples/custom_backend/hello_world/deploy/hello_world.yaml (1)
12-12: Remove unnecessary `envFromSecret` reference from Frontend service. The Frontend service references `hf-token-secret`, but neither the Frontend (client.py) nor the HelloWorldWorker (hello_world.py) uses any Hugging Face APIs or requires authentication. The custom backend simply generates text responses without model discovery, tokenizer loading, or HF interactions. Remove `envFromSecret: hf-token-secret` from the Frontend service configuration (line 12).
Likely an incorrect or invalid review comment.
examples/multimodal/deploy/agg_qwen.yaml (1)
12-12: LGTM! Frontend service now sources HF token from secret. The addition of `envFromSecret: hf-token-secret` to the Frontend service is consistent with the worker services in this deployment (EncodeWorker, VLMWorker, and Processor all use the same secret). This enables authenticated Hugging Face API access to mitigate rate limiting.
examples/backends/trtllm/deploy/agg_router.yaml (1)
11-11: LGTM! Aligns with worker configuration. The Frontend service now sources environment variables from `hf-token-secret`, matching the TRTLLMWorker configuration.
examples/backends/trtllm/deploy/disagg_planner.yaml (1)
11-11: LGTM! Completes secret configuration across all services. Adding `envFromSecret: hf-token-secret` to Frontend aligns with the Planner and worker services, ensuring consistent HF authentication across the deployment.
examples/backends/sglang/deploy/disagg_planner.yaml (1)
11-11: LGTM! Ensures consistent authentication. The Frontend service now uses `hf-token-secret` like the other services (Planner, decode, and prefill workers).
examples/backends/vllm/deploy/lora/agg_lora.yaml (1)
10-10: LGTM! Matches worker configuration. Frontend now sources HF token from the same secret as VllmDecodeWorker.
examples/backends/vllm/deploy/disagg.yaml (1)
11-11: LGTM! Consistent across disaggregated deployment. Frontend service now uses the same secret as the decode and prefill workers.
examples/backends/vllm/deploy/agg_router.yaml (1)
11-11: LGTM! Completes secret configuration. The Frontend service now sources environment variables from `hf-token-secret`, matching the VllmDecodeWorker configuration in this router-enabled deployment.
examples/backends/vllm/deploy/agg.yaml (1)
11-11: Secret creation documentation exists in the vLLM Kubernetes deployment guide. The addition of `envFromSecret: hf-token-secret` is correct. Documentation for creating this secret is available in the NVIDIA Dynamo vLLM deployment guide, which explicitly instructs users to run `kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}` before deploying. No action required.
spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret
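For context, the pattern this diff completes puts the same secret reference on every service in the graph. This is a hedged sketch only; the service names follow the vLLM examples discussed above and would differ per backend:

```yaml
# Sketch of the manifest pattern (service names assumed from the vLLM examples).
spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret   # added by this PR
    VllmDecodeWorker:
      envFromSecret: hf-token-secret   # already present
    VllmPrefillWorker:
      envFromSecret: hf-token-secret   # already present
```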
Ensure secret creation is documented for users.
This change correctly adds Hugging Face authentication to the Frontend service, addressing the rate-limiting issue. The syntax aligns with existing worker service configurations and follows standard Kubernetes patterns.
However, users deploying this manifest must create the `hf-token-secret` before applying it, or pods will fail to start. If this example lacks setup documentation, add a comment or update the accompanying guide to include the command: `kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}`
🤖 Prompt for AI Agents
In examples/deployments/GKE/vllm/disagg.yaml around line 11, the manifest
references envFromSecret: hf-token-secret but does not document that users must
create that secret beforehand; update the example by adding an inline YAML
comment above the envFromSecret line or update the accompanying README to
instruct users to create the secret (e.g. run: kubectl create secret generic
hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN}) so pods won’t fail on
startup.
VllmDecodeWorker already has `envFromSecret: hf-token-secret`, so I think it is OK to add it to the Frontend service in the same YAML.
@djangoz-nv You're absolutely right! I can see that both VllmDecodeWorker and VllmPrefillWorker already reference hf-token-secret in this same file (lines 15 and 32). Adding it to the Frontend service maintains consistency across all services in the deployment, which is the correct approach.
My documentation concern would apply to the entire deployment example rather than this specific change. Since the secret is already required by the worker services, your addition to Frontend doesn't introduce any new setup requirements. Thanks for the clarification! ✅
Signed-off-by: Django Zhang <[email protected]>
Why does the frontend need access to Hugging Face? Can you confirm those errors are from the frontend?
Yes, I confirmed that it is in the frontend service pod. See the dynamo code: https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/model_card.rs#L405
Edit: it does look like the frontend downloads the model card only (config files); adding the env is not enough, as you will still hit throttling at some point even with that. It seems to use the Rust version of hf-hub, and it seems to lack support for the …
@erezzarum Because I ran it several times: with HF_TOKEN the frontend service always works well, but without HF_TOKEN it always failed today (check the logs). And without HF_TOKEN, the frontend service cannot get the model.
That is probably because the 2nd time your rate bucket got filled (it's per IP); it doesn't seem to be related to you injecting it.
Edit: it seems we need to wrap the modelexpress client with retry/backoff when we get throttled; as dynamo uses modelexpress to download models, it falls back to direct download if the modelexpress server is not available.
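The retry/backoff wrapping suggested above could look like the following sketch. It is illustrative only (Python rather than the project's Rust, and `fetch_model_card`, `RateLimitError`, and `with_backoff` are hypothetical stand-ins for the real modelexpress/hf-hub client path): on a throttle, wait an exponentially growing, jittered delay before retrying, and give up after a bounded number of attempts.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 ("Too Many Requests") response."""

def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying with capped exponential backoff plus jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the throttling error
            # Delays grow 1x, 2x, 4x, ... of base_delay, capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1 * delay))

# Simulated model-card fetch that is throttled twice before succeeding.
calls = {"n": 0}
def fetch_model_card():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "config.json"

print(with_backoff(fetch_model_card, base_delay=0.01))  # prints "config.json"
```

With this shape, transient per-IP throttling (as described above) resolves itself after a short wait, while a persistently throttled client still fails loudly after `max_attempts`.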
Overview:
Details:
Add `hf-token-secret` to the DynamoGraphDeployment YAMLs in the examples directory to reduce HF rate limiting in the frontend service.
We have met the following errors in the frontend service frequently in recent days.
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit