Connection
Connect to vLLM server
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Hello", "max_tokens": 100}'
Set up vLLM server
python -m vllm.entrypoints.openai.api_server --model {model_name} --port {port}
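The same connection check from Python, as a minimal sketch: it assumes the requests package is installed and that {server_address}, {port}, and {model_name} are substituted with real values.
import requests

base_url = "http://{server_address}:{port}"  # substitute real host and port
# /v1/models lists the model(s) served by the OpenAI-compatible server
resp = requests.get(f"{base_url}/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])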
Basic Operations
Text completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt here", "max_tokens": 100, "temperature": 0.7}'
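Equivalent text-completion call from Python (a sketch using the requests package; address and model name are placeholders):
import requests

payload = {
    "model": "{model_name}",
    "prompt": "Your prompt here",
    "max_tokens": 100,
    "temperature": 0.7,
}
resp = requests.post("http://{server_address}:{port}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])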
Chat completion
curl -X POST http://{server_address}:{port}/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
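Because the endpoint is OpenAI-compatible, the openai Python client (v1 or later) can also be pointed at it. A sketch, assuming the server was started without --api-key (the key string is then unused but must be non-empty):
from openai import OpenAI

client = OpenAI(base_url="http://{server_address}:{port}/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="{model_name}",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
print(chat.choices[0].message.content)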
Streaming completion
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": true}'
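Streamed responses arrive as server-sent events ("data: {...}" lines, terminated by "data: [DONE]"). A Python sketch that prints tokens as they arrive, assuming the requests package:
import json
import requests

payload = {"model": "{model_name}", "prompt": "Your prompt", "max_tokens": 100, "stream": True}
with requests.post("http://{server_address}:{port}/v1/completions", json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)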
Engine Management
Load model
python -m vllm.entrypoints.openai.api_server --model {model_name}
Specify GPU devices
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model {model_name}
Set tensor parallelism
python -m vllm.entrypoints.openai.api_server --model {model_name} --tensor-parallel-size 4
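The same engine options apply to vLLM's offline Python API. A sketch, assuming the vllm package is installed, four GPUs are visible, and {model_name} is substituted:
from vllm import LLM, SamplingParams

# Load the model sharded across 4 GPUs, mirroring --tensor-parallel-size 4 above
llm = LLM(model="{model_name}", tensor_parallel_size=4)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)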
Inference Settings
Set temperature
Add "temperature": 0.7 in request JSON
Set top-p sampling
Add "top_p": 0.95 in request JSON
Set maximum output length
Add "max_tokens": 512 in request JSON
Set repetition penalty
Add "repetition_penalty": 1.2 in request JSON (vLLM extension; "presence_penalty" and "frequency_penalty" are the OpenAI-style fields)
Batch Processing
Set maximum batch size
python -m vllm.entrypoints.openai.api_server --model {model_name} --max-num-seqs {batch_size} --max-num-batched-tokens {num_tokens}
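The same limits can be passed to the offline engine. A sketch with illustrative values (64 sequences, 8192 tokens per scheduler step); exact defaults and constraints vary by vLLM version:
from vllm import LLM

# max_num_seqs caps concurrent sequences per batch;
# max_num_batched_tokens caps tokens processed per scheduler step
llm = LLM(model="{model_name}", max_num_seqs=64, max_num_batched_tokens=8192)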
Multi-prompt batching
curl -X POST http://{server_address}:{port}/v1/completions -H "Content-Type: application/json" -d '{"model": "{model_name}", "prompt": ["Prompt 1", "Prompt 2"], "max_tokens": 100}'
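When "prompt" is a list, the server returns one choice per prompt. A Python sketch using the requests package:
import requests

payload = {
    "model": "{model_name}",
    "prompt": ["Prompt 1", "Prompt 2"],  # one choice is returned per prompt
    "max_tokens": 100,
}
resp = requests.post("http://{server_address}:{port}/v1/completions", json=payload, timeout=60)
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])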
Monitoring
Check server status
curl http://{server_address}:{port}/health
Get server metrics
curl http://{server_address}:{port}/metrics
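Both endpoints are easy to poll from Python. A sketch assuming the requests package; /metrics is Prometheus-format plain text, and exact metric names (such as vllm:num_requests_running) may vary across vLLM versions:
import requests

base_url = "http://{server_address}:{port}"
# /health returns HTTP 200 when the server is up
print("healthy:", requests.get(f"{base_url}/health", timeout=5).status_code == 200)

# Filter the Prometheus metrics text for a counter of interest
metrics = requests.get(f"{base_url}/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("vllm:num_requests_running"):
        print(line)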