Commands Cheat Sheet

Most-Used Commands

Connection

Set up vLLM server
python -m vllm.entrypoints.openai.api_server --model {model_name} --port {port}

Connect to vLLM server
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Hello", "max_tokens": 100}'
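
The same round trip in Python, as a minimal sketch using the requests library; the host, port, and model name below are placeholders to substitute.

import requests

# Placeholders: point these at your own server and served model.
BASE_URL = "http://localhost:8000"
MODEL = "facebook/opt-125m"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"model": MODEL, "prompt": "Hello", "max_tokens": 100},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])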

Basic Operations

Text completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt here", "max_tokens": 100, "temperature": 0.7}'

Chat completion
curl -X POST http://{server_address}:{port}/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
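
Because the server speaks the OpenAI API, the official openai Python client (v1+) also works; a sketch, with base URL and model name as placeholders.

from openai import OpenAI

# The api_key is unused by a default vLLM server but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/opt-125m",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)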

Streaming completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": true}'
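
With "stream": true the server returns server-sent events; one way to consume them in Python, sketched below (host and model are placeholders).

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Your prompt",
          "max_tokens": 100, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":  # sentinel that ends the stream
        break
    chunk = json.loads(payload)
    # Completion streams carry incremental text in choices[0].text.
    print(chunk["choices"][0]["text"], end="", flush=True)
print()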

Engine Management

Load model
python -m vllm.entrypoints.openai.api_server --model {model_name}

Specify GPU devices
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model {model_name}

Set tensor parallelism
python -m vllm.entrypoints.openai.api_server --model {model_name} --tensor-parallel-size 4
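
The same engine options are exposed by vLLM's offline Python API, which skips the HTTP server entirely; a minimal sketch (model name illustrative).

from vllm import LLM, SamplingParams

# tensor_parallel_size mirrors the --tensor-parallel-size server flag.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=4)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.7, max_tokens=100))
print(outputs[0].outputs[0].text)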

Inference Settings

Set temperature
Add "temperature": 0.7 to the request JSON

Set top-p sampling
Add "top_p": 0.95 to the request JSON

Set maximum output length
Add "max_tokens": 512 to the request JSON

Set repetition penalty
Add "repetition_penalty": 1.1 to the request JSON (a vLLM extra parameter; the OpenAI-style additive "presence_penalty" is also accepted)

Batch Processing

Set maximum batch size
python -m vllm.entrypoints.openai.api_server --model {model_name} --max-num-seqs {max_seqs} --max-num-batched-tokens {num_tokens}

Multi-prompt batching
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": ["Prompt 1", "Prompt 2"], "max_tokens": 100}'
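
A batched request returns one choice per prompt; the "index" field maps each choice back to its prompt. A sketch (host and model are placeholders).

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m",
          "prompt": ["Prompt 1", "Prompt 2"], "max_tokens": 100},
)
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])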

Monitoring

Check server status
curl http://{server_address}:{port}/health

Get server metrics
curl http://{server_address}:{port}/metrics
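
The /metrics endpoint serves Prometheus text format; a sketch that keeps only vLLM's own series. Recent releases prefix them with "vllm:", but treat exact metric names as version-dependent.

import requests

text = requests.get("http://localhost:8000/metrics").text
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)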