Well-lit Path: Wide Expert Parallelism (EP/DP) with LeaderWorkerSet
Overview
- This example demonstrates how to deploy DeepSeek-R1-0528 using vLLM's P/D disaggregation support with NIXL in a wide expert-parallel pattern with LeaderWorkerSets
- This "path" has been validated on a cluster with 16x H200 GPUs split across two nodes with InfiniBand networking (see the capacity check below)
WARNING: We are still investigating and optimizing performance for other hardware and networking configurations
In this example, we will demonstrate a deployment of DeepSeek-R1-0528 with:
- 1 Prefill worker with DP=8
- 2 Decode workers with DP=4
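Before installing, you can optionally confirm that your cluster exposes the GPU capacity this topology assumes. A minimal check, assuming the standard NVIDIA device plugin resource name (nvidia.com/gpu); on the validated setup you would expect two nodes reporting 8 GPUs each:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'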
Installation
- Install your local dependencies (from /llm-d-infra/quickstart):
./install-deps.sh
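Optionally, you can sanity-check that the CLI tools used throughout this example are on your PATH. The tool list below is illustrative; install-deps.sh remains the source of truth for required tooling and versions:
for tool in kubectl helm helmfile stern jq curl; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done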
- Use the quickstart to deploy the Gateway CRDs + Gateway provider + infra chart (from /llm-d-infra/quickstart). This example only works out of the box with Istio as a provider, but with changes it is possible to run it with kgateway.
export HF_TOKEN=${HFTOKEN}
./llmd-infra-installer.sh --namespace llm-d-wide-ep -r infra-wide-ep -f examples/wide-ep-lws/infra-wide-ep/values.yaml --disable-metrics-collection
NOTE: The release name infra-wide-ep is important here, because it matches the pre-built values files used in this example.
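Before moving on, you can confirm that the infra release deployed cleanly and that the Gateway resource was created (the latter assumes the Gateway API CRDs were installed by the quickstart):
helm status infra-wide-ep -n llm-d-wide-ep
kubectl get gateway -n llm-d-wide-ep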
- Use the helmfile to apply the modelservice chart on top of it
cd examples/wide-ep-lws
helmfile --selector managedBy=helmfile apply -f helmfile.yaml --skip-diff-on-install
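The decode and prefill pods need to pull large images and download model weights, so they can take a while to become Ready. You can watch them come up with:
kubectl get pods -n llm-d-wide-ep -w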
Verifying the installation
- First, you should be able to see that both your infra and modelservice chart releases went smoothly:
$ helm list -n llm-d-wide-ep
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
infra-wide-ep llm-d-wide-ep 1 2025-07-25 05:43:35.263697 -0700 PDT deployed llm-d-infra-v1.1.1 v0.2.0
ms-wide-ep llm-d-wide-ep 1 2025-07-25 06:16:29.31419 -0700 PDT deployed llm-d-modelservice-v0.2.0 v0.2.0
- You should see all the pods you expect (2 decode pods, 1 prefill pod, 1 gateway pod, 1 EPP pod):
$ kubectl get pods -n llm-d-wide-ep
NAME READY STATUS RESTARTS AGE
infra-wide-ep-inference-gateway-istio-7f4cf9f5-hpqg4 1/1 Running 0 55m
ms-wide-ep-llm-d-modelservice-decode-0 2/2 Running 0 22m
ms-wide-ep-llm-d-modelservice-decode-0-1 2/2 Running 0 22m
ms-wide-ep-llm-d-modelservice-epp-749696866d-n24tx 1/1 Running 0 22m
ms-wide-ep-llm-d-modelservice-prefill-0 1/1 Running 0 22m
- You should be able to make inference requests. The first thing we need to check is that all of our vLLM servers have started, which can take some time. We recommend using stern to tail the decode logs together and wait for the message saying that the vLLM API server is up:
DECODE_PODS=$(kubectl get pods -n llm-d-wide-ep -l "llm-d.ai/inferenceServing=true,llm-d.ai/role=decode" --no-headers | awk '{print $1}' | tail -n 2)
stern "$(echo "$DECODE_PODS" | paste -sd'|' -)" -c vllm | grep -v "Avg prompt throughput"
Eventually you should see logs indicating vLLM is ready to start accepting requests:
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO 07-25 13:57:57 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8200
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO 07-25 13:57:57 [launcher.py:29] Available routes are:
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO 07-25 13:57:57 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO 07-25 13:57:57 [launcher.py:37] Route: /docs, Methods: GET, HEAD
...
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO: Started server process [1]
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO: Waiting for application startup.
ms-pd-llm-d-modelservice-decode-9666b4775-z8k46 vllm INFO: Application startup complete.
We should also make sure that prefill has come up:
PREFILL_POD=$(kubectl get pods -n llm-d-wide-ep -l "llm-d.ai/inferenceServing=true,llm-d.ai/role=prefill" | tail -n 1 | awk '{print $1}')
kubectl logs pod/${PREFILL_POD} | grep -v "Avg prompt throughput"
Again, look for the same server startup messages. But instead of two streams aggregated into a single log as with decode, you should only see one set of startup logs for prefill (hence no need for stern here):
INFO 07-25 18:46:12 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 07-25 18:46:12 [launcher.py:29] Available routes are:
INFO 07-25 18:46:12 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 07-25 18:46:12 [launcher.py:37] Route: /docs, Methods: GET, HEAD
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
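As an alternative to tailing logs, you can block until the pods report Ready. Note that readiness is driven by the pods' probes, which may not line up exactly with the API-server log lines above:
kubectl wait --for=condition=Ready pod -l "llm-d.ai/inferenceServing=true,llm-d.ai/role=decode" -n llm-d-wide-ep --timeout=30m
kubectl wait --for=condition=Ready pod -l "llm-d.ai/inferenceServing=true,llm-d.ai/role=prefill" -n llm-d-wide-ep --timeout=30m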
After this, we can port-forward the gateway service in one terminal:
$ kubectl port-forward -n llm-d-wide-ep service/infra-wide-ep-inference-gateway-istio 8000:80
And then you should be able to curl your gateway service:
$ curl -s http://localhost:8000/v1/models \
-H "Content-Type: application/json" | jq
{
"data": [
{
"created": 1753469354,
"id": "deepseek-ai/DeepSeek-R1-0528",
"max_model_len": 163840,
"object": "model",
"owned_by": "vllm",
"parent": null,
"permission": [
{
"allow_create_engine": false,
"allow_fine_tuning": false,
"allow_logprobs": true,
"allow_sampling": true,
"allow_search_indices": false,
"allow_view": true,
"created": 1753469354,
"group": null,
"id": "modelperm-7e5c28aac82549b09291f748cf209bf4",
"is_blocking": false,
"object": "model_permission",
"organization": "*"
}
],
"root": "deepseek-ai/DeepSeek-R1-0528"
}
],
"object": "list"
}
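If you plan to script further requests, you can capture the served model ID from this response rather than hard-coding it:
MODEL_ID=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
echo "$MODEL_ID"   # deepseek-ai/DeepSeek-R1-0528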
Finally, we should be able to send inference requests with curl:
$ curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528",
"prompt": "I will start this from the first set of prompts and see where this gets routed. Were going to start by significantly jacking up the tokens so that we can ensure that this request gets routed properly with regard to PD. I also verified that all the gateway assets seem to be properly configured and as far as I can tell, there are no mismatches between assets. Everything seems set, lets hope that this works right now!",
"max_tokens": 100,
"ignore_eos": "true",
"seed": "'$(date +%M%H%M%S)'"
}' | jq
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": " I'm going to use the following tokens to ensure that we get a proper response: \n\nToken: 250\nTemperature: 0.7\nMax Length: 500\nTop P: 1.0\nFrequency Penalty: 0.0\nPresence Penalty: 0.0\nStop Sequence: None\n\nNow, we are going to use the following prompt:\n\n\"Write a comprehensive and detailed tutorial on how to write a prompt that would be used with an AI like"
}
],
"created": 1753469430,
"id": "cmpl-882f51e0-c2df-4284-a9a4-557b44ed00b9",
"kv_transfer_params": null,
"model": "deepseek-ai/DeepSeek-R1-0528",
"object": "text_completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 100,
"prompt_tokens": 86,
"prompt_tokens_details": null,
"total_tokens": 186
}
}
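Since vLLM serves the standard OpenAI-compatible routes, a chat-style request should also work through the gateway (assuming /v1/chat/completions appears in the route list from the startup logs above):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-0528",
    "messages": [{"role": "user", "content": "In one sentence, what is wide expert parallelism?"}],
    "max_tokens": 50
  }' | jq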
Cleanup
To remove the deployment:
# Remove the model services
# From examples/wide-ep-lws
helmfile --selector managedBy=helmfile destroy -f helmfile.yaml
# Remove the infrastructure
helm uninstall infra-wide-ep -n llm-d-wide-ep
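You can verify that the workloads are gone, and optionally delete the namespace as well (only do this if nothing else is running in it):
kubectl get pods -n llm-d-wide-ep
kubectl delete namespace llm-d-wide-ep   # optional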
Customization
- Change model: Edit ms-wide-ep/values.yaml and update the modelArtifacts.uri and routing.modelName
- Adjust resources: Modify the GPU/CPU/memory requests in the container specifications
- Scale workers: Change the replicas count for the decode/prefill deployments (see the scaling sketch after this list)
This content is automatically synced from quickstart/examples/wide-ep-lws/README.md in the llm-d-incubation/llm-d-infra repository.