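Running a Hugging Face DeepSeek distilled model with vLLM - Eoser's page!

Installing and configuring Anaconda and setting up the Hugging Face mirror are not covered here; see the earlier post 《BitNet运行大模型折腾事件》 (roughly, "Struggles running large models with BitNet").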

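Create a runtime environment

conda create -n vllm python=3.12 -y
# Activate the environment so the pip commands below install into it
conda activate vllm
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple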

Install vLLM on GPU

# Install vLLM
pip install vllm

Install vLLM on CPU

Official documentation: https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html

1. The CPU version has to be built from source, so install the build toolchain first

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
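
The update-alternatives line registers gcc-12 as the default compiler, but a different gcc earlier on PATH can still be picked up by the build. Below is a minimal sketch for checking which major version is actually in use. It assumes that major version 12 or newer is all that matters, matching the gcc-12 packages installed above. The function names are made up for this example.

```shell
# Succeed if a gcc major version meets the minimum (12 unless a second argument is given).
gcc_major_ok() {
    local major="$1" min="${2:-12}"
    [ "$major" -ge "$min" ]
}

# Print the major version of whichever gcc is first on PATH (empty if none).
current_gcc_major() {
    gcc -dumpversion 2>/dev/null | cut -d. -f1
}

# Report whether the active gcc looks new enough for the CPU build.
check_gcc() {
    local major
    major="$(current_gcc_major)"
    if [ -z "$major" ]; then
        echo "gcc not found on PATH"
        return 1
    fi
    if gcc_major_ok "$major"; then
        echo "gcc $major is new enough"
    else
        echo "gcc $major is older than 12"
        return 1
    fi
}
```

Running check_gcc after the commands above should report version 12 if the alternatives entry took effect.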

2. Clone the source

git clone --recursive https://github.com/vllm-project/vllm.git

3. Install the CPU build dependencies

cd vllm
pip install --upgrade pip

pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
# PyTorch downloads slowly; --proxy routes pip through a local proxy to speed it up (replace the address with your own proxy)
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu --proxy=http://192.168.3.2:10083

4. Build and install

VLLM_TARGET_DEVICE=cpu python setup.py install
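
Once the build finishes, it may be worth confirming that the package imports in the active environment before downloading any model. The sketch below is one way to do that. PYTHON is a hypothetical override variable introduced only for this example, and the function names are likewise made up. It is best run from outside the cloned vllm directory, so that Python imports the installed package rather than the source tree.

```shell
# Succeed if the given Python module can be imported.
# PYTHON is a hypothetical override for the interpreter; it defaults to "python".
can_import() {
    "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

# Print the installed vLLM version, or a message if it cannot be imported.
report_vllm() {
    if can_import vllm; then
        "${PYTHON:-python}" -c "import vllm; print('vllm', vllm.__version__)"
    else
        echo "vllm is not importable in this environment"
        return 1
    fi
}
```

Calling report_vllm inside the activated vllm environment should print the version that was just built.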

Download the model


export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# Serve the model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --host=0.0.0.0 --port=8080 --max-model-len=1024 --max-num-batched-tokens=6000 --enable-prefix-caching
# Test the model
curl http://localhost:8080/v1/chat/completions     -H "Content-Type: application/json"     -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [
            {"role": "system", "content": "You are a versatile assistant."},
            {"role": "user", "content": "Who won the World Cup in 2020?"}
        ]
    }'
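
Loading the model takes a while, so a curl request sent right after starting vllm serve may fail with a refused connection. One way to handle this is to poll the server until it answers. The sketch below assumes the host and port from the serve command above and uses vLLM's /health endpoint. The helper names, retry count and delay are illustrative choices, not values from the original setup.

```shell
# Retry a command until it succeeds, up to a maximum number of attempts.
# Usage: retry_until_ok <attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Probe the health endpoint; defaults match --port=8080 on localhost.
vllm_is_up() {
    curl -sf "http://${1:-localhost}:${2:-8080}/health"
}
```

For example, `retry_until_ok 60 5 vllm_is_up && echo "server is ready"` waits up to about five minutes before giving up, after which the chat completion request above can be sent.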
