fix with vllm requirements.txt

This commit is contained in:
zR 2024-06-06 13:57:22 +08:00
parent 1683d673d2
commit 8102212b9f
6 changed files with 78 additions and 72 deletions


@@ -1,6 +1,6 @@
# Basic Demo
-Read this in [English](README_en.md)
+Read this in [English](README_en.md).
In this demo, you will experience how to use the GLM-4-9B open-source model to perform basic tasks.
@@ -11,6 +11,7 @@ Read this in [English](README_en.md)
### Related inference test data
**The data in this document were measured in the hardware environment below; actual requirements and GPU memory usage vary slightly with the runtime environment, so refer to your actual environment.**
Test hardware information:
+ OS: Ubuntu 22.04
@@ -26,38 +27,38 @@ Read this in [English](README_en.md)
#### GLM-4-9B-Chat
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks                |
-|-----------|------------|--------------------------|------------------|------------------------|
-| BF16      | 19047MiB   | 0.1554s                  | 27.8193 tokens/s | Input length is 1000   |
-| BF16      | 20629MiB   | 0.8199s                  | 31.8613 tokens/s | Input length is 8000   |
-| BF16      | 27779MiB   | 4.3554s                  | 14.4108 tokens/s | Input length is 32000  |
-| BF16      | 57379MiB   | 38.1467s                 | 3.4205 tokens/s  | Input length is 128000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks                |
+|-----------|------------|------------|---------------|------------------------|
+| BF16      | 19 GB      | 0.2s       | 27.8 tokens/s | Input length is 1000   |
+| BF16      | 21 GB      | 0.8s       | 31.8 tokens/s | Input length is 8000   |
+| BF16      | 28 GB      | 4.3s       | 14.4 tokens/s | Input length is 32000  |
+| BF16      | 58 GB      | 38.1s      | 3.4 tokens/s  | Input length is 128000 |
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks               |
-|-----------|------------|--------------------------|------------------|-----------------------|
-| Int4      | 8251MiB    | 0.1667s                  | 23.3903 tokens/s | Input length is 1000  |
-| Int4      | 9613MiB    | 0.8629s                  | 23.4248 tokens/s | Input length is 8000  |
-| Int4      | 16065MiB   | 4.3906s                  | 14.6553 tokens/s | Input length is 32000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks               |
+|-----------|------------|------------|---------------|-----------------------|
+| INT4      | 8 GB       | 0.2s       | 23.3 tokens/s | Input length is 1000  |
+| INT4      | 10 GB      | 0.8s       | 23.4 tokens/s | Input length is 8000  |
+| INT4      | 17 GB      | 4.3s       | 14.6 tokens/s | Input length is 32000 |
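Taken together, the two tables above imply a simple selection rule for a given card. A minimal sketch (the `pick_precision` helper and the rounded figures are illustrative, not part of this repo):

```python
from typing import Optional

# Peak GPU memory (GB) taken from the two tables above (rounded values).
PEAK_MEMORY_GB = {
    ("BF16", 1000): 19, ("BF16", 8000): 21, ("BF16", 32000): 28, ("BF16", 128000): 58,
    ("INT4", 1000): 8, ("INT4", 8000): 10, ("INT4", 32000): 17,
}

def pick_precision(gpu_memory_gb: float, input_length: int) -> Optional[str]:
    """Return the highest precision whose tabulated peak fits the card, else None."""
    for precision in ("BF16", "INT4"):  # prefer BF16 when it fits
        peak = PEAK_MEMORY_GB.get((precision, input_length))
        if peak is not None and peak <= gpu_memory_gb:
            return precision
    return None
```

On a 24 GB card, for example, this selects BF16 at 8000 input tokens but falls back to INT4 at 32000.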
### GLM-4-9B-Chat-1M
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed    | Remarks                |
-|-----------|------------|--------------------------|-----------------|------------------------|
-| BF16      | 74497MiB   | 98.4930s                 | 2.3653 tokens/s | Input length is 200000 |
+| Precision | GPU Memory | Prefilling | Decode Speed | Remarks                |
+|-----------|------------|------------|--------------|------------------------|
+| BF16      | 75 GB      | 98.4s      | 2.3 tokens/s | Input length is 200000 |
If your input exceeds 200K, we recommend using the vLLM backend with multiple GPUs for inference to get better performance.
#### GLM-4V-9B
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks              |
-|-----------|------------|--------------------------|------------------|----------------------|
-| BF16      | 28131MiB   | 0.1016s                  | 33.4660 tokens/s | Input length is 1000 |
-| BF16      | 33043MiB   | 0.7935s                  | 39.2444 tokens/s | Input length is 8000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks              |
+|-----------|------------|------------|---------------|----------------------|
+| BF16      | 28 GB      | 0.1s       | 33.4 tokens/s | Input length is 1000 |
+| BF16      | 33 GB      | 0.7s       | 39.2 tokens/s | Input length is 8000 |
-| Precision | GPU Memory | Prefilling / First Token | Decode Speed     | Remarks              |
-|-----------|------------|--------------------------|------------------|----------------------|
-| Int4      | 10267MiB   | 0.1685s                  | 28.7101 tokens/s | Input length is 1000 |
-| Int4      | 14105MiB   | 0.8629s                  | 24.2370 tokens/s | Input length is 8000 |
+| Precision | GPU Memory | Prefilling | Decode Speed  | Remarks              |
+|-----------|------------|------------|---------------|----------------------|
+| INT4      | 10 GB      | 0.1s       | 28.7 tokens/s | Input length is 1000 |
+| INT4      | 15 GB      | 0.8s       | 24.2 tokens/s | Input length is 8000 |
### Minimum hardware requirements
@@ -69,7 +70,7 @@ Read this in [English](README_en.md)
If you want to run all the code in this folder provided officially, you will also need:
+ Linux operating system (Debian series is best)
-+ GPU device with more than 8GB of GPU memory, supporting CUDA or ROCm and `BF16` inference (GPUs above A100; V100, 20-series, and older GPU architectures are not supported)
++ GPU device with more than 8GB of GPU memory, supporting CUDA or ROCm and `BF16` inference (`FP16` precision cannot be used for training, and inference has a small chance of problems)
Install dependencies
@@ -85,7 +86,6 @@ pip install -r requirements.txt
+ Use the command line to chat with the GLM-4-9B model.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```


@@ -9,8 +9,9 @@ Please follow the steps in the document strictly to avoid unnecessary errors.
### Related inference test data
**The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
environment.**
Test hardware information:
+ OS: Ubuntu 22.04
@@ -22,44 +23,45 @@ Test hardware information:
The stress test data of relevant inference are as follows:
-**All tests are performed on a single GPU, and all video memory consumption is calculated based on the peak value**
+**All tests are performed on a single GPU, and all GPU memory consumption is calculated based on the peak value**
### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|------------------------|
-| BF16  | 19047MiB   | 0.1554s    | 27.8193 tokens/s | Input length is 1000   |
-| BF16  | 20629MiB   | 0.8199s    | 31.8613 tokens/s | Input length is 8000   |
-| BF16  | 27779MiB   | 4.3554s    | 14.4108 tokens/s | Input length is 32000  |
-| BF16  | 57379MiB   | 38.1467s   | 3.4205 tokens/s  | Input length is 128000 |
+|-------|------------|------------|---------------|------------------------|
+| BF16  | 19 GB      | 0.2s       | 27.8 tokens/s | Input length is 1000   |
+| BF16  | 21 GB      | 0.8s       | 31.8 tokens/s | Input length is 8000   |
+| BF16  | 28 GB      | 4.3s       | 14.4 tokens/s | Input length is 32000  |
+| BF16  | 58 GB      | 38.1s      | 3.4 tokens/s  | Input length is 128000 |
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|-----------------------|
-| Int4  | 8251MiB    | 0.1667s    | 23.3903 tokens/s | Input length is 1000  |
-| Int4  | 9613MiB    | 0.8629s    | 23.4248 tokens/s | Input length is 8000  |
-| Int4  | 16065MiB   | 4.3906s    | 14.6553 tokens/s | Input length is 32000 |
+|-------|------------|------------|---------------|-----------------------|
+| INT4  | 8 GB       | 0.2s       | 23.3 tokens/s | Input length is 1000  |
+| INT4  | 10 GB      | 0.8s       | 23.4 tokens/s | Input length is 8000  |
+| INT4  | 17 GB      | 4.3s       | 14.6 tokens/s | Input length is 32000 |
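Peak memory in the BF16 table grows roughly linearly with input length (about 0.3 GB per 1000 tokens beyond the ~19 GB base), so intermediate lengths can be estimated by interpolating between the tabulated points. A rough sketch (the helper is illustrative, not part of this repo):

```python
# (input length, peak GB) points for BF16 from the table above (rounded values).
BF16_POINTS = [(1000, 19), (8000, 21), (32000, 28), (128000, 58)]

def estimate_bf16_memory_gb(input_length: int) -> float:
    """Linearly interpolate peak GPU memory between tabulated points."""
    pts = BF16_POINTS
    if input_length <= pts[0][0]:
        return float(pts[0][1])
    if input_length >= pts[-1][0]:
        return float(pts[-1][1])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= input_length <= x1:
            return y0 + (y1 - y0) * (input_length - x0) / (x1 - x0)
```

This is only a first-order estimate; measured peaks depend on batch size, cache settings, and the runtime.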
### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|--------------|
-| BF16  | 74497MiB   | 98.4930s   | 2.3653 tokens/s  | 输入长度为 200000 |
+|-------|------------|------------|------------------|------------------------|
+| BF16  | 74497MiB   | 98.4s      | 2.3653 tokens/s  | Input length is 200000 |
-If your input exceeds 200K, we recommend that you use the vLLM backend with multi gpus for inference to get better performance.
+If your input exceeds 200K, we recommend that you use the vLLM backend with multiple GPUs for inference to get better
+performance.
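As a back-of-the-envelope capacity check for the multi-GPU recommendation, the ~75 GB peak for a 200K input can be divided across tensor-parallel ranks. A minimal sketch (the `min_gpus` helper and the 10% headroom factor are our assumptions; real per-GPU usage also includes a weight shard and runtime overhead):

```python
import math

def min_gpus(peak_memory_gb: float, per_gpu_memory_gb: float, headroom: float = 0.9) -> int:
    """Smallest GPU count whose combined usable memory covers the peak."""
    usable_per_gpu = per_gpu_memory_gb * headroom  # keep some headroom free
    return math.ceil(peak_memory_gb / usable_per_gpu)
```

Under this estimate the ~75 GB peak needs two 80 GB A100s or three 40 GB A100s; note that vLLM's `tensor_parallel_size` must also divide the model's attention head count, so in practice you would round up to a valid value.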
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| BF16  | 28131MiB   | 0.1016s    | 33.4660 tokens/s | Input length is 1000 |
-| BF16  | 33043MiB   | 0.7935s    | 39.2444 tokens/s | Input length is 8000 |
+|-------|------------|------------|---------------|----------------------|
+| BF16  | 28 GB      | 0.1s       | 33.4 tokens/s | Input length is 1000 |
+| BF16  | 33 GB      | 0.7s       | 39.2 tokens/s | Input length is 8000 |
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| Int4  | 10267MiB   | 0.1685s    | 28.7101 tokens/s | Input length is 1000 |
-| Int4  | 14105MiB   | 0.8629s    | 24.2370 tokens/s | Input length is 8000 |
+|-------|------------|------------|---------------|----------------------|
+| INT4  | 10 GB      | 0.1s       | 28.7 tokens/s | Input length is 1000 |
+| INT4  | 15 GB      | 0.8s       | 24.2 tokens/s | Input length is 8000 |
### Minimum hardware requirements
@@ -71,8 +73,8 @@ If you want to run the most basic code provided by the official (transformers ba
If you want to run all the codes in this folder provided by the official, you also need:
+ Linux operating system (Debian series is best)
-+ GPU device with more than 8GB video memory, supporting CUDA or ROCM and supporting `BF16` reasoning (GPUs above A100,
-  V100, 20 and older GPU architectures are not supported)
++ GPU device with more than 8GB GPU memory, supporting CUDA or ROCm and `BF16` inference (`FP16` precision
+  cannot be fine-tuned, and there is a small chance of problems during inference)
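The architecture cutoff above maps onto CUDA compute capability: BF16 tensor operations require Ampere (8.0) or newer, which is why V100 (7.0) and Turing 20-series (7.5) cards are excluded. A small sketch of the check (the capability values are from NVIDIA's public specs; the helper name is ours):

```python
# CUDA compute capability for a few common GPUs (illustrative subset).
COMPUTE_CAPABILITY = {
    "V100": (7, 0),      # Volta: no BF16
    "RTX 2080": (7, 5),  # Turing 20 series: no BF16
    "A100": (8, 0),      # Ampere: BF16 supported
    "RTX 4090": (8, 9),  # Ada: BF16 supported
}

def supports_bf16(gpu: str) -> bool:
    """BF16 requires compute capability >= 8.0 (Ampere or newer)."""
    major, _minor = COMPUTE_CAPABILITY[gpu]
    return major >= 8
```

At runtime the equivalent check is `torch.cuda.is_bf16_supported()` on the actual device.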
Install dependencies


@@ -1,3 +1,6 @@
+# use vllm
+# vllm>=0.4.3
torch>=2.3.0
torchvision>=0.18.0
transformers==4.40.0
@@ -8,16 +11,14 @@ timm>=0.9.16
tiktoken>=0.7.0
accelerate>=0.30.1
sentence_transformers>=2.7.0
-vllm>=0.4.3
# web demo
-gradio>=4.31.5
+gradio>=4.33.0
# openai demo
-openai>=1.30.3
+openai>=1.31.1
einops>=0.7.0
sse-starlette>=2.1.0
-# Int4
+# INT4
bitsandbytes>=0.43.1
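With `vllm` moved into a comment as above, a plain `pip install -r requirements.txt` skips it and users opt in separately. A quick illustration of how pip treats the commented line (the temp file path is just for demonstration):

```shell
# Build a tiny requirements file mirroring the pattern above.
cat > /tmp/demo-requirements.txt <<'EOF'
# use vllm
# vllm>=0.4.3
torch>=2.3.0
EOF

# pip ignores '#' comment lines, so only torch would be installed:
grep -v '^#' /tmp/demo-requirements.txt

# Opting in to the vLLM backend is then an explicit step:
# pip install 'vllm>=0.4.3'
```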


@@ -141,7 +141,7 @@ Users can upload documents and use the long text capability of GLM-4-9B to under
pdf and other files.
+ Tool calls and system prompt words are not supported in this mode.
-+ If the text is very long, the model may require a high amount of video memory. Please confirm your hardware
++ If the text is very long, the model may require a high amount of GPU memory. Please confirm your hardware
configuration.
## Image Understanding Mode


@@ -1,22 +1,25 @@
-accelerate
+# use vllm
+# vllm>=0.4.3
+accelerate>=0.30.1
huggingface_hub>=0.19.4
ipykernel>=6.26.0
ipython>=8.18.1
jupyter_client>=8.6.0
-langchain
-langchain-community
-matplotlib
+langchain>=0.2.1
+langchain-community>=0.2.1
+matplotlib>=3.9.0
pillow>=10.1.0
-pymupdf
-python-docx
-python-pptx
+pymupdf>=1.24.5
+python-docx>=1.1.2
+python-pptx>=0.6.23
pyyaml>=6.0.1
requests>=2.31.0
sentencepiece
streamlit>=1.35.0
-tiktoken
+tiktoken>=0.7.0
transformers==4.40.0
zhipuai>=2.1.0
+# Please install vllm if you'd like to use the long context model.
+# vllm


@@ -6,7 +6,7 @@ not supported). Please strictly follow the steps in the document to avoid unnece
## Hardware check
**The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
environment.**
Test hardware information:
@@ -17,7 +17,7 @@ Test hardware information:
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
-| Fine-tuning solution | Video memory usage | Weight save point size |
+| Fine-tuning solution | GPU memory usage   | Weight save point size |
|----------------------|----------------------------------------------|------------------------|
| lora (PEFT) | 21531MiB | 17M |
| p-tuning v2 (PEFT) | 21381MiB | 121M |