vLLM Commands Cheat Sheet

Connection

Connect to vLLM server
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Hello", "max_tokens": 100}'
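
For example, with the server running locally on vLLM's default port 8000 and serving facebook/opt-125m (placeholder values; substitute your own host, port, and model):
curl -X POST http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 100}'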

Set up vLLM server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server --model {model_name} --port {port}
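
Recent vLLM releases also ship a vllm console command that launches the same OpenAI-compatible server; if your installation provides it, this is equivalent:
vllm serve {model_name} --port {port}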

Basic Operations

Text completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt here", "max_tokens": 100, "temperature": 0.7}'

Chat completion
curl -X POST http://{server_address}:{port}/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
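
The messages array follows the OpenAI chat format, so a system prompt and earlier turns can be included (placeholder content throughout):
curl -X POST http://{server_address}:{port}/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello"}], "max_tokens": 100, "temperature": 0.7}'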

Streaming completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": true}'
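
Streamed output arrives as server-sent events (data: ... chunks terminated by data: [DONE]); curl's -N flag disables buffering so tokens print as they are generated:
curl -N -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": true}'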

Engine Management

Load model
python -m vllm.entrypoints.openai.api_server --model {model_name}

Specify GPU devices (vLLM has no --gpu-ids flag; device selection uses the CUDA_VISIBLE_DEVICES environment variable)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model {model_name}

Set tensor parallelism
python -m vllm.entrypoints.openai.api_server --model {model_name} --tensor-parallel-size 4
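
Tensor parallelism shards each layer's weights across the selected GPUs, so --tensor-parallel-size should match the number of visible devices; for example, on a four-GPU host:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model {model_name} --tensor-parallel-size 4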

Inference Settings

Set temperature
Add "temperature": 0.7 in request JSON

Set top-p sampling
Add "top_p": 0.95 in request JSON

Set maximum output length
Add "max_tokens": 512 in request JSON

Set repetition penalty
Add "repetition_penalty": 1.2 in request JSON (values above 1.0 penalize repeats; the OpenAI-style "presence_penalty" is also supported)

Batch Processing

Set batching limits
python -m vllm.entrypoints.openai.api_server --model {model_name} --max-num-seqs {max_batch_size} --max-num-batched-tokens {num_tokens}

Multi-prompt batching
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": ["Prompt 1", "Prompt 2"], "max_tokens": 100}'
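
vLLM also batches concurrent requests from independent clients on the server side (continuous batching), so parallel single-prompt requests are scheduled together too; a shell sketch with placeholder values:
for i in 1 2 3; do
  curl -s -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d "{\"model\": \"{model_name}\", \"prompt\": \"Prompt $i\", \"max_tokens\": 100}" &
done
wait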

Monitoring

Check server status
curl http://{server_address}:{port}/health
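
The health endpoint returns HTTP 200 when the server is up, so curl's -f flag (fail on HTTP errors) lets a script gate on readiness:
curl -sf http://{server_address}:{port}/health || echo "server not ready"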

Get server metrics
curl http://{server_address}:{port}/metrics
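
The metrics endpoint serves Prometheus-format gauges and counters (running/waiting requests, token throughput, and so on); vLLM's own series are prefixed with vllm:, so they filter cleanly with grep:
curl -s http://{server_address}:{port}/metrics | grep "^vllm:"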