

llm-d-benchmark

This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.

Goal

Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.

📦 Repository Setup

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh

Quickstart

Stand up an llm-d stack (default deployment method is llm-d-modelservice, serving llama-1b), run a harness (default vllm-benchmark) with a load profile (default simple-random), and tear down the stack:

./e2e.sh

Run the harness inference-perf with the load profile chatbot_synthetic against a pre-deployed stack:

./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
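
For illustration, a filled-in invocation might look like the following; the --methods value here is a hypothetical name, so substitute a string that matches one of your own inference services or pods:

# Hypothetical example; "llama-1b" stands in for an actual service/pod name
./run.sh --harness inference-perf --workload chatbot_synthetic --methods llama-1b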

Architecture

The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.


Goals

Reproducibility

Each benchmark run collects enough information to enable execution on different clusters/environments with minimal setup effort.

Flexibility

Multiple load generators and multiple load profiles are available, in a pluggable architecture that allows expansion.

Well-defined set of Metrics

Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.

For a discussion of candidate relevant metrics, please consult this document

| Category | Metric | Unit |
|----------|--------|------|
| Throughput | Output tokens / second | tokens / second |
| Throughput | Input tokens / second | tokens / second |
| Throughput | Requests / second | qps |
| Latency | Time per output token (TPOT) | ms per output token |
| Latency | Time to first token (TTFT) | ms |
| Latency | Time per request (TTFT + TPOT * output length) | seconds per request |
| Latency | Normalized time per output token (TTFT / output length + TPOT), aka NTPOT | ms per output token |
| Latency | Inter-token latency (ITL): time between decode tokens within a request | ms per output token |
| Correctness | Failure rate | queries |
| Experiment | Benchmark duration | seconds |
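
As a quick sanity check on how these latency metrics relate, here is a small worked example with made-up numbers (TTFT = 200 ms, TPOT = 25 ms, output length = 400 tokens); the bc invocations are just illustrative arithmetic:

# Hypothetical numbers, purely for illustration
TTFT_MS=200; TPOT_MS=25; OUTPUT_TOKENS=400
# Time per request = TTFT + TPOT * output length  ->  10.20 seconds
echo "scale=2; ($TTFT_MS + $TPOT_MS * $OUTPUT_TOKENS) / 1000" | bc
# NTPOT = TTFT / output length + TPOT  ->  25.50 ms per output token
echo "scale=2; $TTFT_MS / $OUTPUT_TOKENS + $TPOT_MS" | bc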

Relevant collection of Workloads

Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.

For a discussion of relevant workloads, please consult this document

| Workload | Use Case | ISL | ISV | OSL | OSV | OSP | Latency |
|----------|----------|-----|-----|-----|-----|-----|---------|
| Interactive Chat | Chat agent | Medium | High | Medium | Medium | Medium | Per token |
| Classification of text | Sentiment analysis | Medium | | Short | Low | High | Request |
| Classification of images | Nudity filter | Long | Low | Short | Low | High | Request |
| Summarization / Information Retrieval | Q&A from docs, RAG | Long | High | Short | Medium | Medium | Per token |
| Text generation | | Short | High | Long | Medium | Low | Per token |
| Translation | | Medium | High | Medium | Medium | High | Per token |
| Code completion | Type ahead | Long | High | Short | Medium | Medium | Request |
| Code generation | Adding a feature | Long | High | Medium | High | Medium | Request |

Design and Roadmap

llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).

Main concepts (identified by specific directories)

Scenarios

Pieces of information identifying a particular cluster. This information includes, but is not limited to, GPU model, LLM model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice Helm charts).
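
As a rough sketch only: a scenario is expressed as shell-style variable assignments in an environment file (plus, optionally, a values.yaml for the modelservice Helm charts). The variable names below are hypothetical placeholders to show the shape, not the repository's actual variable set:

# Hypothetical scenario environment file; names are illustrative placeholders
export LLMDBENCH_EXAMPLE_GPU_MODEL=H100                     # hypothetical
export LLMDBENCH_EXAMPLE_LLM_MODEL=meta-llama/Llama-3.2-1B  # hypothetical
export LLMDBENCH_EXAMPLE_MODELSERVICE_VALUES=./values.yaml  # hypothetical, optional Helm values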

Harnesses

Load generator (Python code) that drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, and the benchmarks found in the benchmarks folder of vLLM. There are ongoing efforts to consolidate these and provide an easier way to support different load generators.

The nop harness, combined with the environment variables below, will parse the vLLM log and create reports with loading-time statistics when used in standalone mode.

The additional environment variables to set are:

| Environment Variable | Example Values |
|----------------------|----------------|
| LLMDBENCH_VLLM_STANDALONE_VLLM_LOAD_FORMAT | safetensors, tensorizer, runai_streamer, fastsafetensors |
| LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL | DEBUG, INFO, WARNING, etc. |
| LLMDBENCH_VLLM_STANDALONE_PREPROCESS | source /setup/preprocess/standalone-preprocess.sh ; /setup/preprocess/standalone-preprocess.py |

The environment variable LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL must be set to DEBUG so that the nop categories report finds all categories.

The environment variable LLMDBENCH_VLLM_STANDALONE_PREPROCESS must be set to the value above for the nop harness to install load-format dependencies, export additional environment variables, and pre-serialize models when using the tensorizer load format. The preprocess scripts run in the standalone vLLM pod before the vLLM server starts.
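
For example, a minimal shell setup for a standalone nop run might look like the following (the values are taken from the table above; the load format choice here is illustrative and any of the supported values applies):

# Example environment for a standalone nop run
export LLMDBENCH_VLLM_STANDALONE_VLLM_LOAD_FORMAT=tensorizer
export LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL=DEBUG   # DEBUG is required for the categories report
export LLMDBENCH_VLLM_STANDALONE_PREPROCESS="source /setup/preprocess/standalone-preprocess.sh ; /setup/preprocess/standalone-preprocess.py"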

Workload

Workload is the actual benchmark load specification which includes the LLM use case to benchmark, traffic pattern, input / output distribution and dataset. Supported workload profiles can be found under workload/profiles.
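
Assuming the profile file names under that directory correspond to the names accepted by run.sh's --workload flag, listing the directory in a repository checkout is a quick way to discover the available profiles:

ls workload/profiles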

[!IMPORTANT] The triple <scenario>, <harness>, <workload>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow an experiment to be reproduced.

Dependencies

Topics

  • Lifecycle
  • Reproducibility
  • Observability
  • Quickstart
  • FAQ

Contribute

  • Instructions on how to contribute including details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
  • We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d google groups will grant you access.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

Content Source

This content is automatically synced from README.md in the llm-d/llm-d-benchmark repository.

📝 To suggest changes, please edit the source file or create an issue.