

llm-d-benchmark

This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.

Goal

Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.

📦 Repository Setup

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh

Quickstart

Stand up an llm-d stack (default deployment method is llm-d-modelservice, serving llama-1b), run a harness (default vllm-benchmark) with a load profile (default simple-random), and tear down the stack:

./e2e.sh

Run the harness inference-perf with the load profile chatbot_synthetic against a pre-deployed stack:

./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
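
For illustration, a filled-in invocation might look like the following; the --methods value here is a hypothetical name, so substitute a string that matches one of your own inference services or pods:

# Hypothetical example; "llama-1b" stands in for an actual service/pod name
./run.sh --harness inference-perf --workload chatbot_synthetic --methods llama-1b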

Architecture

The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.


Goals

Reproducibility

Each benchmark run collects enough information to enable execution on different clusters/environments with minimal setup effort.

Flexibility

Multiple load generators and multiple load profiles are available, in a pluggable architecture that allows expansion.

Well-defined set of Metrics

Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.

For a discussion of candidate relevant metrics, please consult this document

| Category | Metric | Unit |
|----------|--------|------|
| Throughput | Output tokens / second | tokens / second |
| Throughput | Input tokens / second | tokens / second |
| Throughput | Requests / second | qps |
| Latency | Time per output token (TPOT) | ms per output token |
| Latency | Time to first token (TTFT) | ms |
| Latency | Time per request (TTFT + TPOT * output length) | seconds per request |
| Latency | Normalized time per output token (TTFT / output length + TPOT), aka NTPOT | ms per output token |
| Latency | Inter-token latency (ITL): time between decode tokens within a request | ms per output token |
| Correctness | Failure rate | queries |
| Experiment | Benchmark duration | seconds |
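
As a quick sanity check on how these latency metrics relate, here is a small worked example with made-up numbers (TTFT = 200 ms, TPOT = 25 ms, output length = 400 tokens); the bc invocations are just illustrative arithmetic:

# Hypothetical numbers, purely for illustration
TTFT_MS=200; TPOT_MS=25; OUTPUT_TOKENS=400
# Time per request = TTFT + TPOT * output length  ->  10.20 seconds
echo "scale=2; ($TTFT_MS + $TPOT_MS * $OUTPUT_TOKENS) / 1000" | bc
# NTPOT = TTFT / output length + TPOT  ->  25.50 ms per output token
echo "scale=2; $TTFT_MS / $OUTPUT_TOKENS + $TPOT_MS" | bc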

Relevant collection of Workloads

Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.

For a discussion of relevant workloads, please consult this document

| Workload | Use Case | ISL | ISV | OSL | OSV | OSP | Latency |
|----------|----------|-----|-----|-----|-----|-----|---------|
| Interactive Chat | Chat agent | Medium | High | Medium | Medium | Medium | Per token |
| Classification of text | Sentiment analysis | Medium | | Short | Low | High | Request |
| Classification of images | Nudity filter | Long | Low | Short | Low | High | Request |
| Summarization / Information Retrieval | Q&A from docs, RAG | Long | High | Short | Medium | Medium | Per token |
| Text generation | | Short | High | Long | Medium | Low | Per token |
| Translation | | Medium | High | Medium | Medium | High | Per token |
| Code completion | Type ahead | Long | High | Short | Medium | Medium | Request |
| Code generation | Adding a feature | Long | High | Medium | High | Medium | Request |

Design and Roadmap

llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).

Main concepts (identified by specific directories)

Scenarios

Pieces of information identifying a particular cluster. This information includes, but is not limited to, GPU model, LLM model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice Helm charts).
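
As a rough sketch only: a scenario is expressed as shell-style variable assignments in an environment file (plus, optionally, a values.yaml for the modelservice Helm charts). The variable names below are hypothetical placeholders to show the shape, not the repository's actual variable set:

# Hypothetical scenario environment file; names are illustrative placeholders
export LLMDBENCH_EXAMPLE_GPU_MODEL=H100                     # hypothetical
export LLMDBENCH_EXAMPLE_LLM_MODEL=meta-llama/Llama-3.2-1B  # hypothetical
export LLMDBENCH_EXAMPLE_MODELSERVICE_VALUES=./values.yaml  # hypothetical, optional Helm values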

Harnesses

Load generator (Python code) that drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, and the benchmarks found in the benchmarks folder of vLLM. There are ongoing efforts to consolidate these and provide an easier way to support different load generators.

The nop harness, combined with the environment variables below, will parse the vLLM log and create reports with loading-time statistics when used in standalone mode.

The additional environment variables to set are:

| Environment Variable | Example Values |
|----------------------|----------------|
| LLMDBENCH_VLLM_STANDALONE_VLLM_LOAD_FORMAT | safetensors, tensorizer, runai_streamer, fastsafetensors |
| LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL | DEBUG, INFO, WARNING, etc. |
| LLMDBENCH_VLLM_STANDALONE_PREPROCESS | source /setup/preprocess/standalone-preprocess.sh ; /setup/preprocess/standalone-preprocess.py |

The environment variable LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL must be set to DEBUG so that the nop categories report finds all categories.

The environment variable LLMDBENCH_VLLM_STANDALONE_PREPROCESS must be set to the value above for the nop harness to install load-format dependencies, export additional environment variables, and pre-serialize models when using the tensorizer load format. The preprocess scripts run in the standalone vLLM pod before the vLLM server starts.
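
For example, a minimal shell setup for a standalone nop run might look like the following (the values are taken from the table above; the load format choice here is illustrative and any of the supported values applies):

# Example environment for a standalone nop run
export LLMDBENCH_VLLM_STANDALONE_VLLM_LOAD_FORMAT=tensorizer
export LLMDBENCH_VLLM_STANDALONE_VLLM_LOGGING_LEVEL=DEBUG   # DEBUG is required for the categories report
export LLMDBENCH_VLLM_STANDALONE_PREPROCESS="source /setup/preprocess/standalone-preprocess.sh ; /setup/preprocess/standalone-preprocess.py"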

Workload

Workload is the actual benchmark load specification which includes the LLM use case to benchmark, traffic pattern, input / output distribution and dataset. Supported workload profiles can be found under workload/profiles.
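
Assuming the profile file names under that directory correspond to the names accepted by run.sh's --workload flag, listing the directory in a repository checkout is a quick way to discover the available profiles:

ls workload/profiles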

[!IMPORTANT] The triple <scenario>, <harness>, <workload>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow an experiment to be reproduced.

Dependencies

Topics

  • Lifecycle
  • Reproducibility
  • Observability
  • Quickstart
  • FAQ

Contribute

  • Instructions on how to contribute including details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
  • We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d google groups will grant you access.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

Content Source

This content is automatically synced from README.md in the llm-d/llm-d-benchmark repository.

📝 To suggest changes, please edit the source file or create an issue.