fix with vllm requirements.txt

This commit is contained in:
zR 2024-06-06 13:57:22 +08:00
parent 1683d673d2
commit 8102212b9f
6 changed files with 78 additions and 72 deletions


@@ -1,6 +1,6 @@
# Basic Demo
-Read this in [English](README_en.md)
+Read this in [English](README_en.md).
In this demo, you will experience how to use the GLM-4-9B open-source model to perform basic tasks.
@@ -11,6 +11,7 @@ Read this in [English](README_en.md)
### Related inference test data
**The data in this document were measured in the hardware environment below; actual requirements and GPU memory usage vary slightly with the runtime environment, so refer to your actual environment.**
Test hardware information:
+ OS: Ubuntu 22.04
@@ -26,38 +27,38 @@ Read this in [English](README_en.md)
#### GLM-4-9B-Chat
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks                |
-|-----------|------------|--------------------------|------------------|------------------------|
-| BF16      | 19047MiB   | 0.1554s                  | 27.8193 tokens/s | Input length is 1000   |
-| BF16      | 20629MiB   | 0.8199s                  | 31.8613 tokens/s | Input length is 8000   |
-| BF16      | 27779MiB   | 4.3554s                  | 14.4108 tokens/s | Input length is 32000  |
-| BF16      | 57379MiB   | 38.1467s                 | 3.4205 tokens/s  | Input length is 128000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks                |
+|-----------|------------|------------|---------------|------------------------|
+| BF16      | 19 GB      | 0.2s       | 27.8 tokens/s | Input length is 1000   |
+| BF16      | 21 GB      | 0.8s       | 31.8 tokens/s | Input length is 8000   |
+| BF16      | 28 GB      | 4.3s       | 14.4 tokens/s | Input length is 32000  |
+| BF16      | 58 GB      | 38.1s      | 3.4 tokens/s  | Input length is 128000 |
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks               |
-|-----------|------------|--------------------------|------------------|-----------------------|
-| Int4      | 8251MiB    | 0.1667s                  | 23.3903 tokens/s | Input length is 1000  |
-| Int4      | 9613MiB    | 0.8629s                  | 23.4248 tokens/s | Input length is 8000  |
-| Int4      | 16065MiB   | 4.3906s                  | 14.6553 tokens/s | Input length is 32000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks               |
+|-----------|------------|------------|---------------|-----------------------|
+| INT4      | 8 GB       | 0.2s       | 23.3 tokens/s | Input length is 1000  |
+| INT4      | 10 GB      | 0.8s       | 23.4 tokens/s | Input length is 8000  |
+| INT4      | 17 GB      | 4.3s       | 14.6 tokens/s | Input length is 32000 |
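Taken together, the two tables above imply a simple selection rule for a given card. A minimal sketch (the `pick_precision` helper and the rounded figures are illustrative, not part of this repo):

```python
from typing import Optional

# Peak GPU memory (GB) taken from the two tables above (rounded values).
PEAK_MEMORY_GB = {
    ("BF16", 1000): 19, ("BF16", 8000): 21, ("BF16", 32000): 28, ("BF16", 128000): 58,
    ("INT4", 1000): 8, ("INT4", 8000): 10, ("INT4", 32000): 17,
}

def pick_precision(gpu_memory_gb: float, input_length: int) -> Optional[str]:
    """Return the highest precision whose tabulated peak fits the card, else None."""
    for precision in ("BF16", "INT4"):  # prefer BF16 when it fits
        peak = PEAK_MEMORY_GB.get((precision, input_length))
        if peak is not None and peak <= gpu_memory_gb:
            return precision
    return None
```

On a 24 GB card, for example, this selects BF16 at 8000 input tokens but falls back to INT4 at 32000.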
### GLM-4-9B-Chat-1M
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed    | Remarks                |
-|-----------|------------|--------------------------|-----------------|------------------------|
-| BF16      | 74497MiB   | 98.4930s                 | 2.3653 tokens/s | Input length is 200000 |
+| Precision | GPU Memory | Prefilling | Decode Speed | Remarks                |
+|-----------|------------|------------|--------------|------------------------|
+| BF16      | 75 GB      | 98.4s      | 2.3 tokens/s | Input length is 200000 |
If your input exceeds 200K, we recommend using the vLLM backend with multiple GPUs for inference to get better performance.
#### GLM-4V-9B
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks              |
-|-----------|------------|--------------------------|------------------|----------------------|
-| BF16      | 28131MiB   | 0.1016s                  | 33.4660 tokens/s | Input length is 1000 |
-| BF16      | 33043MiB   | 0.7935s                  | 39.2444 tokens/s | Input length is 8000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks              |
+|-----------|------------|------------|---------------|----------------------|
+| BF16      | 28 GB      | 0.1s       | 33.4 tokens/s | Input length is 1000 |
+| BF16      | 33 GB      | 0.7s       | 39.2 tokens/s | Input length is 8000 |
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks              |
-|-----------|------------|--------------------------|------------------|----------------------|
-| Int4      | 10267MiB   | 0.1685s                  | 28.7101 tokens/s | Input length is 1000 |
-| Int4      | 14105MiB   | 0.8629s                  | 24.2370 tokens/s | Input length is 8000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks              |
+|-----------|------------|------------|---------------|----------------------|
+| INT4      | 10 GB      | 0.1s       | 28.7 tokens/s | Input length is 1000 |
+| INT4      | 15 GB      | 0.8s       | 24.2 tokens/s | Input length is 8000 |
### Minimum hardware requirements
@@ -69,7 +70,7 @@ Read this in [English](README_en.md)
If you want to run all the code in this folder provided officially, you will also need:
+ Linux operating system (Debian series is best)
-+ GPU device with more than 8GB of GPU memory, supporting CUDA or ROCm and `BF16` inference (GPUs above A100; V100, 20-series, and older GPU architectures are not supported)
++ GPU device with more than 8GB of GPU memory, supporting CUDA or ROCm and `BF16` inference (`FP16` precision cannot be used for training, and inference has a small chance of problems)
Install dependencies
@@ -85,7 +86,6 @@ pip install -r requirements.txt
+ Use the command line to chat with the GLM-4-9B model.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```


@@ -9,8 +9,9 @@ Please follow the steps in the document strictly to avoid unnecessary errors.
### Related inference test data
**The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
environment.**
Test hardware information:
+ OS: Ubuntu 22.04
@@ -22,44 +23,45 @@ Test hardware information:
The stress test data of relevant inference are as follows:
-**All tests are performed on a single GPU, and all video memory consumption is calculated based on the peak value**
+**All tests are performed on a single GPU, and all GPU memory consumption is calculated based on the peak value**
### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|------------------------|
-| BF16  | 19047MiB   | 0.1554s    | 27.8193 tokens/s | Input length is 1000   |
-| BF16  | 20629MiB   | 0.8199s    | 31.8613 tokens/s | Input length is 8000   |
-| BF16  | 27779MiB   | 4.3554s    | 14.4108 tokens/s | Input length is 32000  |
-| BF16  | 57379MiB   | 38.1467s   | 3.4205 tokens/s  | Input length is 128000 |
+|-------|------------|------------|---------------|------------------------|
+| BF16  | 19 GB      | 0.2s       | 27.8 tokens/s | Input length is 1000   |
+| BF16  | 21 GB      | 0.8s       | 31.8 tokens/s | Input length is 8000   |
+| BF16  | 28 GB      | 4.3s       | 14.4 tokens/s | Input length is 32000  |
+| BF16  | 58 GB      | 38.1s      | 3.4 tokens/s  | Input length is 128000 |
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|-----------------------|
-| Int4  | 8251MiB    | 0.1667s    | 23.3903 tokens/s | Input length is 1000  |
-| Int4  | 9613MiB    | 0.8629s    | 23.4248 tokens/s | Input length is 8000  |
-| Int4  | 16065MiB   | 4.3906s    | 14.6553 tokens/s | Input length is 32000 |
+|-------|------------|------------|---------------|-----------------------|
+| INT4  | 8 GB       | 0.2s       | 23.3 tokens/s | Input length is 1000  |
+| INT4  | 10 GB      | 0.8s       | 23.4 tokens/s | Input length is 8000  |
+| INT4  | 17 GB      | 4.3s       | 14.6 tokens/s | Input length is 32000 |
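Peak memory in the BF16 table grows roughly linearly with input length (about 0.3 GB per 1000 tokens beyond the ~19 GB base), so intermediate lengths can be estimated by interpolating between the tabulated points. A rough sketch (the helper is illustrative, not part of this repo):

```python
# (input length, peak GB) points for BF16 from the table above (rounded values).
BF16_POINTS = [(1000, 19), (8000, 21), (32000, 28), (128000, 58)]

def estimate_bf16_memory_gb(input_length: int) -> float:
    """Linearly interpolate peak GPU memory between tabulated points."""
    pts = BF16_POINTS
    if input_length <= pts[0][0]:
        return float(pts[0][1])
    if input_length >= pts[-1][0]:
        return float(pts[-1][1])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= input_length <= x1:
            return y0 + (y1 - y0) * (input_length - x0) / (x1 - x0)
```

This is only a first-order estimate; measured peaks depend on batch size, cache settings, and the runtime.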
### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|--------------|
-| BF16  | 74497MiB   | 98.4930s   | 2.3653 tokens/s  | 输入长度为 200000 |
+|-------|------------|------------|------------------|------------------------|
+| BF16  | 74497MiB   | 98.4s      | 2.3653 tokens/s  | Input length is 200000 |
-If your input exceeds 200K, we recommend that you use the vLLM backend with multi gpus for inference to get better performance.
+If your input exceeds 200K, we recommend that you use the vLLM backend with multiple GPUs for inference to get better
+performance.
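As a back-of-the-envelope capacity check for the multi-GPU recommendation, the ~75 GB peak for a 200K input can be divided across tensor-parallel ranks. A minimal sketch (the `min_gpus` helper and the 10% headroom factor are our assumptions; real per-GPU usage also includes a weight shard and runtime overhead):

```python
import math

def min_gpus(peak_memory_gb: float, per_gpu_memory_gb: float, headroom: float = 0.9) -> int:
    """Smallest GPU count whose combined usable memory covers the peak."""
    usable_per_gpu = per_gpu_memory_gb * headroom  # keep some headroom free
    return math.ceil(peak_memory_gb / usable_per_gpu)
```

Under this estimate the ~75 GB peak needs two 80 GB A100s or three 40 GB A100s; note that vLLM's `tensor_parallel_size` must also divide the model's attention head count, so in practice you would round up to a valid value.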
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| BF16  | 28131MiB   | 0.1016s    | 33.4660 tokens/s | Input length is 1000 |
-| BF16  | 33043MiB   | 0.7935s    | 39.2444 tokens/s | Input length is 8000 |
+|-------|------------|------------|---------------|----------------------|
+| BF16  | 28 GB      | 0.1s       | 33.4 tokens/s | Input length is 1000 |
+| BF16  | 33 GB      | 0.7s       | 39.2 tokens/s | Input length is 8000 |
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| Int4  | 10267MiB   | 0.1685s    | 28.7101 tokens/s | Input length is 1000 |
-| Int4  | 14105MiB   | 0.8629s    | 24.2370 tokens/s | Input length is 8000 |
+|-------|------------|------------|---------------|----------------------|
+| INT4  | 10 GB      | 0.1s       | 28.7 tokens/s | Input length is 1000 |
+| INT4  | 15 GB      | 0.8s       | 24.2 tokens/s | Input length is 8000 |
### Minimum hardware requirements
@@ -71,8 +73,8 @@ If you want to run the most basic code provided by the official (transformers ba
If you want to run all the codes in this folder provided by the official, you also need:
+ Linux operating system (Debian series is best)
-+ GPU device with more than 8GB video memory, supporting CUDA or ROCM and supporting `BF16` reasoning (GPUs above A100,
-  V100, 20 and older GPU architectures are not supported)
++ GPU device with more than 8GB GPU memory, supporting CUDA or ROCm and `BF16` inference (`FP16` precision
+  cannot be fine-tuned, and there is a small chance of problems during inference)
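The architecture cutoff above maps onto CUDA compute capability: BF16 tensor operations require Ampere (8.0) or newer, which is why V100 (7.0) and Turing 20-series (7.5) cards are excluded. A small sketch of the check (the capability values are from NVIDIA's public specs; the helper name is ours):

```python
# CUDA compute capability for a few common GPUs (illustrative subset).
COMPUTE_CAPABILITY = {
    "V100": (7, 0),      # Volta: no BF16
    "RTX 2080": (7, 5),  # Turing 20 series: no BF16
    "A100": (8, 0),      # Ampere: BF16 supported
    "RTX 4090": (8, 9),  # Ada: BF16 supported
}

def supports_bf16(gpu: str) -> bool:
    """BF16 requires compute capability >= 8.0 (Ampere or newer)."""
    major, _minor = COMPUTE_CAPABILITY[gpu]
    return major >= 8
```

At runtime the equivalent check is `torch.cuda.is_bf16_supported()` on the actual device.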
Install dependencies


@@ -1,3 +1,6 @@
+# use vllm
+# vllm>=0.4.3
torch>=2.3.0
torchvision>=0.18.0
transformers==4.40.0
@@ -8,16 +11,14 @@ timm>=0.9.16
tiktoken>=0.7.0
accelerate>=0.30.1
sentence_transformers>=2.7.0
-vllm>=0.4.3
# web demo
-gradio>=4.31.5
+gradio>=4.33.0
# openai demo
-openai>=1.30.3
+openai>=1.31.1
einops>=0.7.0
sse-starlette>=2.1.0
-# Int4
+# INT4
bitsandbytes>=0.43.1
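With `vllm` moved into a comment as above, a plain `pip install -r requirements.txt` skips it and users opt in separately. A quick illustration of how pip treats the commented line (the temp file path is just for demonstration):

```shell
# Build a tiny requirements file mirroring the pattern above.
cat > /tmp/demo-requirements.txt <<'EOF'
# use vllm
# vllm>=0.4.3
torch>=2.3.0
EOF

# pip ignores '#' comment lines, so only torch would be installed:
grep -v '^#' /tmp/demo-requirements.txt

# Opting in to the vLLM backend is then an explicit step:
# pip install 'vllm>=0.4.3'
```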


@@ -141,7 +141,7 @@ Users can upload documents and use the long text capability of GLM-4-9B to under
pdf and other files.
+ Tool calls and system prompt words are not supported in this mode.
-+ If the text is very long, the model may require a high amount of video memory. Please confirm your hardware
++ If the text is very long, the model may require a high amount of GPU memory. Please confirm your hardware
configuration.
## Image Understanding Mode


@@ -1,22 +1,25 @@
-accelerate
+# use vllm
+# vllm>=0.4.3
+accelerate>=0.30.1
huggingface_hub>=0.19.4
ipykernel>=6.26.0
ipython>=8.18.1
jupyter_client>=8.6.0
-langchain
-langchain-community
-matplotlib
+langchain>=0.2.1
+langchain-community>=0.2.1
+matplotlib>=3.9.0
pillow>=10.1.0
-pymupdf
-python-docx
-python-pptx
+pymupdf>=1.24.5
+python-docx>=1.1.2
+python-pptx>=0.6.23
pyyaml>=6.0.1
requests>=2.31.0
sentencepiece
streamlit>=1.35.0
-tiktoken
+tiktoken>=0.7.0
transformers==4.40.0
zhipuai>=2.1.0
+# Please install vllm if you'd like to use the long context model.
+# vllm


@@ -6,7 +6,7 @@ not supported). Please strictly follow the steps in the document to avoid unnece
## Hardware check
**The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
environment.**
Test hardware information:
@@ -17,7 +17,7 @@ Test hardware information:
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
-| Fine-tuning solution | Video memory usage | Weight save point size |
+| Fine-tuning solution | GPU memory usage   | Weight save point size |
|----------------------|----------------------------------------------|------------------------|
| lora (PEFT) | 21531MiB | 17M |
| p-tuning v2 (PEFT) | 21381MiB | 121M |