
Running on clouds with SkyPilot

vLLM can be run and scaled out to multiple GPUs on any cloud with SkyPilot, an open-source framework.

To install SkyPilot and set up your cloud credentials, run:

pip install skypilot
sky check

See the SkyPilot YAML for serving, serving.yaml:

resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py

Start serving the LLaMA-13B model on an A100 GPU:

sky launch serving.yaml

Check the output of the command. There will be a shareable gradio link (like the last line of the following); open it in your browser to use the LLaMA model for text completion.

(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
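Besides the gradio UI, the demo api_server started above can be queried directly over HTTP. The sketch below is a minimal client, assuming the server's default /generate endpoint on port 8000 and that it is run from the cluster node itself (otherwise substitute the node's IP in API_URL); the helper names are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the api_server listens on port 8000 of the serving node.
API_URL = "http://localhost:8000/generate"


def build_payload(prompt: str, max_tokens: int = 64) -> bytes:
    """Serialize a completion request for vLLM's demo api_server."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()


def generate(prompt: str) -> str:
    """Send a prompt and return the first generated text."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"][0]
```

For example, `generate("My favourite food is")` would return the prompt continued by the model.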

Optional: serve the 65B model instead of the default 13B and use more GPUs:

sky launch -c vllm-serve-new -s serving.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf