DrDroid
Resource Cheatsheet

vLLM

Quick reference guide and commands for vLLM.


Connection

Connect to vLLM server
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Hello", "max_tokens": 100}'

Set up vLLM server
python -m vllm.entrypoints.openai.api_server --model {model_name} --port {port}
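
The same completion request can be prepared from Python with only the standard library. This is a sketch: the base URL and model name are placeholders for your own deployment, and the send step is left commented out because it needs a running server.

```python
import json
import urllib.request

# Assumed endpoint and model name -- adjust to your deployment.
BASE_URL = "http://localhost:8000"
MODEL = "facebook/opt-125m"

def build_completion_request(prompt, max_tokens=100, temperature=0.7):
    """Prepare an HTTP request for the OpenAI-compatible /v1/completions route."""
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("Hello")
# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```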

Basic Operations

Text completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt here", "max_tokens": 100, "temperature": 0.7}'

Chat completion
curl -X POST http://{server_address}:{port}/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

Streaming completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": true}'
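
With "stream": true the server returns Server-Sent Events: one `data: {...}` chunk per piece of generated text, terminated by `data: [DONE]`. A minimal parser for that framing might look like the following (the sample lines are illustrative of the chunk shape, not captured server output):

```python
import json

def iter_stream_texts(lines):
    """Yield completion text fragments from SSE lines of a streamed response."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]

# Illustrative sample of a streamed /v1/completions response:
sample = [
    'data: {"choices": [{"index": 0, "text": "Hel"}]}',
    '',
    'data: {"choices": [{"index": 0, "text": "lo"}]}',
    'data: [DONE]',
]
print("".join(iter_stream_texts(sample)))  # -> Hello
```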

Engine Management

Load model
python -m vllm.entrypoints.openai.api_server --model {model_name}

Specify GPU devices
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model {model_name}

Set tensor parallelism
python -m vllm.entrypoints.openai.api_server --model {model_name} --tensor-parallel-size 4
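
Tensor parallelism shards each layer across GPUs, so the model's attention-head count must divide evenly by --tensor-parallel-size (vLLM rejects the configuration otherwise). A quick pre-launch sanity check, with illustrative head counts:

```python
def valid_tp_size(num_attention_heads: int, tp_size: int) -> bool:
    """True if the attention heads shard evenly across tp_size GPUs."""
    return num_attention_heads % tp_size == 0

# e.g. a 32-head model across 4 GPUs works, across 3 it does not:
print(valid_tp_size(32, 4))  # -> True
print(valid_tp_size(32, 3))  # -> False
```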

Inference Settings

Set temperature
Add "temperature": 0.7 in request JSON

Set top-p sampling
Add "top_p": 0.95 in request JSON

Set maximum output length
Add "max_tokens": 512 in request JSON

Set repetition penalty
Add "repetition_penalty": 1.2 in request JSON (values above 1 discourage repetition; the OpenAI-style "presence_penalty" and "frequency_penalty" fields are also supported)

Batch Processing

Set maximum batch size
python -m vllm.entrypoints.openai.api_server --model {model_name} --max-num-seqs {batch_size} --max-num-batched-tokens {num_tokens}

Multi-prompt batching
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": ["Prompt 1", "Prompt 2"], "max_tokens": 100}'
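
When "prompt" is a list, the response carries one choice per prompt, and each choice's "index" field identifies which prompt it answers. Re-pairing prompts with completions can be done like this (the sample response is illustrative of the shape, not real output):

```python
import json

def pair_prompts_with_completions(prompts, response_json):
    """Map each prompt to its completion text via the choice 'index' field."""
    response = json.loads(response_json)
    by_index = {c["index"]: c["text"] for c in response["choices"]}
    return [(p, by_index[i]) for i, p in enumerate(prompts)]

prompts = ["Prompt 1", "Prompt 2"]
# Choices may arrive in any order; 'index' is what ties them back:
sample = json.dumps({"choices": [
    {"index": 1, "text": " two"},
    {"index": 0, "text": " one"},
]})
print(pair_prompts_with_completions(prompts, sample))
# -> [('Prompt 1', ' one'), ('Prompt 2', ' two')]
```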

Monitoring

Check server status
curl http://{server_address}:{port}/health

Get server metrics
curl http://{server_address}:{port}/metrics
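
The /metrics endpoint returns Prometheus text-exposition format. A minimal parser for its value lines, with a sample that mimics the shape of vLLM's gauges (names and values below are illustrative, not captured output):

```python
def parse_metrics(text):
    """Parse Prometheus exposition text into {metric: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 2.0
vllm:num_requests_waiting 5.0
"""
m = parse_metrics(sample)
print(m["vllm:num_requests_running"])  # -> 2.0
```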