fix with vllm requirements.txt
parent 1683d673d2
commit 8102212b9f

@@ -1,6 +1,6 @@
# Basic Demo

Read this in [English](README_en.md).

In this demo, you will experience how to use the open-source GLM-4-9B model to perform basic tasks.

@@ -11,6 +11,7 @@ Read this in [English](README_en.md)
### Related inference test data

**All data in this document were measured in the hardware environment below; actual requirements and GPU memory usage vary slightly with the runtime environment, so refer to your actual environment.**

Test hardware information:

+ OS: Ubuntu 22.04

@@ -26,38 +27,38 @@ Read this in [English](README_en.md)
#### GLM-4-9B-Chat

| Precision | GPU Memory | Prefilling | Decode Speed | Remarks |
|-----------|------------|------------|---------------|------------------------|
| BF16 | 19 GB | 0.2s | 27.8 tokens/s | Input length is 1000 |
| BF16 | 21 GB | 0.8s | 31.8 tokens/s | Input length is 8000 |
| BF16 | 28 GB | 4.3s | 14.4 tokens/s | Input length is 32000 |
| BF16 | 58 GB | 38.1s | 3.4 tokens/s | Input length is 128000 |

| Precision | GPU Memory | Prefilling | Decode Speed | Remarks |
|-----------|------------|------------|---------------|-----------------------|
| INT4 | 8 GB | 0.2s | 23.3 tokens/s | Input length is 1000 |
| INT4 | 10 GB | 0.8s | 23.4 tokens/s | Input length is 8000 |
| INT4 | 17 GB | 4.3s | 14.6 tokens/s | Input length is 32000 |
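As a rough cross-check of the figures above, weight memory alone can be estimated from the parameter count. The sketch below is plain arithmetic under stated assumptions (about 9.4B parameters, no activation or KV-cache overhead), not a measurement.

```python
# Rough weight-memory estimate for GLM-4-9B (parameter count assumed ~9.4e9).
# BF16 uses 2 bytes per parameter; INT4 uses 0.5 bytes per parameter.
# KV cache and activations are ignored, so real peaks are higher.
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """GiB needed to hold the weights alone."""
    return n_params * bytes_per_param / 1024**3

bf16 = weight_gib(9.4e9, 2.0)  # ~17.5 GiB, near the 19 GB short-input entry
int4 = weight_gib(9.4e9, 0.5)  # ~4.4 GiB; the rest of the 8 GB is overhead
print(f"BF16 weights ≈ {bf16:.1f} GiB, INT4 weights ≈ {int4:.1f} GiB")
```

The gap between these estimates and the table's peaks grows with input length, which is the KV cache scaling with context.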

### GLM-4-9B-Chat-1M

| Precision | GPU Memory | Prefilling | Decode Speed | Remarks |
|-----------|------------|------------|--------------|------------------------|
| BF16 | 75 GB | 98.4s | 2.3 tokens/s | Input length is 200000 |

If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.

#### GLM-4V-9B

| Precision | GPU Memory | Prefilling | Decode Speed | Remarks |
|-----------|------------|------------|---------------|----------------------|
| BF16 | 28 GB | 0.1s | 33.4 tokens/s | Input length is 1000 |
| BF16 | 33 GB | 0.7s | 39.2 tokens/s | Input length is 8000 |

| Precision | GPU Memory | Prefilling | Decode Speed | Remarks |
|-----------|------------|------------|---------------|----------------------|
| INT4 | 10 GB | 0.1s | 28.7 tokens/s | Input length is 1000 |
| INT4 | 15 GB | 0.8s | 24.2 tokens/s | Input length is 8000 |

### Minimum hardware requirements

@@ -69,7 +70,7 @@ Read this in [English](README_en.md)
If you want to run all of the code provided in this folder, you will also need:

+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (`FP16` cannot be used for training, and inference with it has a small chance of problems)
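To check a specific card against the figures above, the sketch below hard-codes the peak numbers from this document's GLM-4-9B-Chat tables (copied, not measured here) and reports which precision/input-length combinations fit.

```python
# Peak GPU memory in GB per (precision, input length), copied from the
# GLM-4-9B-Chat tables in this document; treat the values as approximate.
PEAK_GB = {
    ("BF16", 1000): 19, ("BF16", 8000): 21,
    ("BF16", 32000): 28, ("BF16", 128000): 58,
    ("INT4", 1000): 8, ("INT4", 8000): 10, ("INT4", 32000): 17,
}

def fits(gpu_gb: float) -> list:
    """Combinations whose measured peak fits within gpu_gb of GPU memory."""
    return sorted(combo for combo, peak in PEAK_GB.items() if peak <= gpu_gb)

# A 24 GB card handles BF16 up to 8000 tokens and INT4 up to 32000 tokens.
print(fits(24))
```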

Install dependencies

@@ -85,7 +86,6 @@ pip install -r requirements.txt
+ Use the command line to chat with the GLM-4-9B model.

```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```
@@ -9,8 +9,9 @@ Please follow the steps in the document strictly to avoid unnecessary errors.
### Related inference test data

**The data in this document were tested in the following hardware environment. Actual operating-environment requirements and GPU memory usage differ slightly; refer to your actual environment.**

Test hardware information:

+ OS: Ubuntu 22.04

@@ -22,44 +23,45 @@ Test hardware information:
The relevant inference stress-test data are as follows:

**All tests are performed on a single GPU, and all GPU memory consumption is calculated based on the peak value.**

### GLM-4-9B-Chat

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|------------------------|
| BF16 | 19 GB | 0.2s | 27.8 tokens/s | Input length is 1000 |
| BF16 | 21 GB | 0.8s | 31.8 tokens/s | Input length is 8000 |
| BF16 | 28 GB | 4.3s | 14.4 tokens/s | Input length is 32000 |
| BF16 | 58 GB | 38.1s | 3.4 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|-----------------------|
| INT4 | 8 GB | 0.2s | 23.3 tokens/s | Input length is 1000 |
| INT4 | 10 GB | 0.8s | 23.4 tokens/s | Input length is 8000 |
| INT4 | 17 GB | 4.3s | 14.6 tokens/s | Input length is 32000 |
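Prefilling time and decode speed combine into a simple end-to-end latency estimate. The sketch below uses the BF16 row for 8000-token inputs above as an assumed operating point.

```python
# Total generation latency ≈ prefill time + output tokens / decode speed.
# 0.8s and 31.8 tokens/s come from the BF16, 8000-token-input row above.
def generation_seconds(prefill_s: float, decode_tok_s: float, out_tokens: int) -> float:
    return prefill_s + out_tokens / decode_tok_s

# Generating 512 new tokens after an 8000-token prompt:
t = generation_seconds(0.8, 31.8, 512)
print(f"≈ {t:.1f}s end to end")
```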

### GLM-4-9B-Chat-1M

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|--------------|------------------------|
| BF16 | 75 GB | 98.4s | 2.3 tokens/s | Input length is 200000 |

If your input exceeds 200K tokens, we recommend using the vLLM backend with multiple GPUs for inference to get better performance.

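A multi-GPU vLLM launch could look like the sketch below. The Hugging Face model id is taken from the model name in this document; the GPU count and context length are assumptions, and the heavy call is behind a flag so the kwargs helper can be inspected without a GPU.

```python
# Hedged sketch: multi-GPU inference with vLLM's offline API.
# tensor_parallel_size shards the weights across GPUs; max_model_len caps
# the context window. GPU count and lengths here are assumptions.
def vllm_launch_kwargs(model: str, num_gpus: int, max_len: int) -> dict:
    return {
        "model": model,
        "tensor_parallel_size": num_gpus,  # e.g. 4 GPUs for >200K inputs
        "max_model_len": max_len,
        "trust_remote_code": True,
    }

RUN_INFERENCE = False  # set True on a multi-GPU CUDA machine with vllm installed

if RUN_INFERENCE:
    from vllm import LLM, SamplingParams
    llm = LLM(**vllm_launch_kwargs("THUDM/glm-4-9b-chat-1m", 4, 262144))
    out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
```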
#### GLM-4V-9B

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|----------------------|
| BF16 | 28 GB | 0.1s | 33.4 tokens/s | Input length is 1000 |
| BF16 | 33 GB | 0.7s | 39.2 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|----------------------|
| INT4 | 10 GB | 0.1s | 28.7 tokens/s | Input length is 1000 |
| INT4 | 15 GB | 0.8s | 24.2 tokens/s | Input length is 8000 |

### Minimum hardware requirements

@@ -71,8 +73,8 @@ If you want to run the most basic code provided by the official (transformers ba
If you want to run all the code provided in this folder, you also need:

+ Linux operating system (Debian series is best)
+ GPU device with more than 8 GB of GPU memory, supporting CUDA or ROCm and `BF16` inference (`FP16` precision cannot be used for fine-tuning, and inference with it has a small probability of problems)

Install dependencies

@@ -1,3 +1,6 @@
# use vllm
# vllm>=0.4.3

torch>=2.3.0
torchvision>=0.18.0
transformers==4.40.0
@@ -8,16 +11,14 @@ timm>=0.9.16
tiktoken>=0.7.0
accelerate>=0.30.1
sentence_transformers>=2.7.0

# web demo
gradio>=4.33.0

# openai demo
openai>=1.31.1
einops>=0.7.0
sse-starlette>=2.1.0

# INT4
bitsandbytes>=0.43.1
@@ -141,7 +141,7 @@ Users can upload documents and use the long text capability of GLM-4-9B to under
pdf and other files.

+ Tool calls and system prompts are not supported in this mode.
+ If the text is very long, the model may require a high amount of GPU memory. Please confirm your hardware configuration.

## Image Understanding Mode
@@ -1,22 +1,25 @@
# use vllm
# vllm>=0.4.3

accelerate>=0.30.1
huggingface_hub>=0.19.4
ipykernel>=6.26.0
ipython>=8.18.1
jupyter_client>=8.6.0
langchain>=0.2.1
langchain-community>=0.2.1
matplotlib>=3.9.0
pillow>=10.1.0
pymupdf>=1.24.5
python-docx>=1.1.2
python-pptx>=0.6.23
pyyaml>=6.0.1
requests>=2.31.0
sentencepiece
streamlit>=1.35.0
tiktoken>=0.7.0
transformers==4.40.0
zhipuai>=2.1.0
@@ -6,8 +6,8 @@ not supported). Please strictly follow the steps in the document to avoid unnece
## Hardware check

**The data in this document are tested in the following hardware environment. Actual operating-environment requirements and GPU memory usage differ slightly; refer to your actual environment.**

Test hardware information:

+ OS: Ubuntu 22.04
@@ -17,7 +17,7 @@ Test hardware information:
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8

| Fine-tuning solution | GPU memory usage | Saved checkpoint size |
|----------------------|------------------|-----------------------|
| lora (PEFT) | 21531MiB | 17M |
| p-tuning v2 (PEFT) | 21381MiB | 121M |