diff --git a/basic_demo/README.md b/basic_demo/README.md
index 510d04a..0e404b2 100644
--- a/basic_demo/README.md
+++ b/basic_demo/README.md
@@ -1,6 +1,6 @@
 # Basic Demo
 
-Read this in [English](README_en.md)
+Read this in [English](README_en.md).
 
 本 demo 中,你将体验到如何使用 GLM-4-9B 开源模型进行基本的任务。
 
@@ -11,6 +11,7 @@ Read this in [English](README_en.md)
 ### 相关推理测试数据
 
 **本文档的数据均在以下硬件环境测试,实际运行环境需求和运行占用的显存略有不同,请以实际运行环境为准。**
+
 测试硬件信息:
 
 + OS: Ubuntu 22.04
@@ -26,38 +27,38 @@ Read this in [English](README_en.md)
 
 #### GLM-4-9B-Chat
 
-| 精度 | 显存占用 | Prefilling / 首响 | Decode Speed | Remarks |
-|------|----------|-----------------|------------------|--------------|
-| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | 输入长度为 1000 |
-| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | 输入长度为 8000 |
-| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | 输入长度为 32000 |
-| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | 输入长度为 128000 |
+| 精度 | 显存占用 | Prefilling | Decode Speed | Remarks |
+|------|-------|------------|---------------|--------------|
+| BF16 | 19 GB | 0.2s | 27.8 tokens/s | 输入长度为 1000 |
+| BF16 | 21 GB | 0.8s | 31.8 tokens/s | 输入长度为 8000 |
+| BF16 | 28 GB | 4.3s | 14.4 tokens/s | 输入长度为 32000 |
+| BF16 | 58 GB | 38.1s | 3.4 tokens/s | 输入长度为 128000 |
 
-| 精度 | 显存占用 | Prefilling / 首响 | Decode Speed | Remarks |
-|------|----------|-----------------|------------------|-------------|
-| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | 输入长度为 1000 |
-| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | 输入长度为 8000 |
-| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | 输入长度为 32000 |
+| 精度 | 显存占用 | Prefilling | Decode Speed | Remarks |
+|------|-------|------------|---------------|-------------|
+| INT4 | 8 GB | 0.2s | 23.3 tokens/s | 输入长度为 1000 |
+| INT4 | 10 GB | 0.8s | 23.4 tokens/s | 输入长度为 8000 |
+| INT4 | 17 GB | 4.3s | 14.6 tokens/s | 输入长度为 32000 |
 
 ### GLM-4-9B-Chat-1M
 
-| 精度 | 显存占用 | Prefilling / 首响 | Decode Speed | Remarks |
-|------|----------|-----------------|------------------|--------------|
-| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | 输入长度为 200000 |
+| 精度 | 显存占用 | Prefilling | Decode Speed | Remarks |
+|------|-------|------------|--------------|--------------|
+| BF16 | 75 GB | 98.4s | 2.3 tokens/s | 输入长度为 200000 |
 
 如果您的输入超过200K,我们建议您使用vLLM后端进行多卡推理,以获得更好的性能。
 
 #### GLM-4V-9B
 
-| 精度 | 显存占用 | Prefilling / 首响 | Decode Speed | Remarks |
-|------|----------|-----------------|------------------|------------|
-| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | 输入长度为 1000 |
-| BF16 | 33043MiB | 0.7935a | 39.2444 tokens/s | 输入长度为 8000 |
+| 精度 | 显存占用 | Prefilling | Decode Speed | Remarks |
+|------|-------|------------|---------------|------------|
+| BF16 | 28 GB | 0.1s | 33.4 tokens/s | 输入长度为 1000 |
+| BF16 | 33 GB | 0.7s | 39.2 tokens/s | 输入长度为 8000 |
 
-| 精度 | 显存占用 | Prefilling / 首响 | Decode Speed | Remarks |
-|------|----------|-----------------|------------------|------------|
-| Int4 | 10267MiB | 0.1685a | 28.7101 tokens/s | 输入长度为 1000 |
-| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | 输入长度为 8000 |
+| 精度 | 显存占用 | Prefilling | Decode Speed | Remarks |
+|------|-------|------------|---------------|------------|
+| INT4 | 10 GB | 0.1s | 28.7 tokens/s | 输入长度为 1000 |
+| INT4 | 15 GB | 0.8s | 24.2 tokens/s | 输入长度为 8000 |
 
 ### 最低硬件要求
 
@@ -69,7 +70,7 @@
 如果您希望运行官方提供的本文件夹的所有代码,您还需要:
 
 + Linux 操作系统 (Debian 系列最佳)
-+ 大于 8GB 显存的,支持 CUDA 或者 ROCM 并且支持 `BF16` 推理的 GPU 设备 (A100以上GPU,V100,20以及更老的GPU架构不受支持)
++ 大于 8GB 显存的,支持 CUDA 或者 ROCM 并且支持 `BF16` 推理的 GPU 设备。(`FP16` 精度无法训练,推理有小概率出现问题)
 
 安装依赖
 
@@ -85,7 +86,6 @@ pip install -r requirements.txt
 
 + 使用命令行与 GLM-4-9B 模型进行对话。
-
 ```shell
 python trans_cli_demo.py # GLM-4-9B-Chat
 python trans_cli_vision_demo.py # GLM-4V-9B
 ```
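The BF16 and INT4 rows above correspond to two loading paths in `transformers`. A minimal sketch of the two configurations, assuming the Hugging Face model id `THUDM/glm-4-9b-chat` and a bitsandbytes 4-bit config (illustrative code, not the demo scripts themselves):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "THUDM/glm-4-9b-chat"  # assumed Hugging Face model id

def load_glm4(int4: bool = False):
    """Load GLM-4-9B-Chat in BF16, or in INT4 via bitsandbytes."""
    # INT4 roughly matches the table: ~19 GB (BF16) vs ~8 GB at 1K input length.
    quant = (
        BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
        if int4 else None
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        quantization_config=quant,
        trust_remote_code=True,  # GLM-4 ships custom modeling code
        device_map="auto",
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    return model, tokenizer
```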
diff --git a/basic_demo/README_en.md b/basic_demo/README_en.md
index 8ec5c48..6f4f74f 100644
--- a/basic_demo/README_en.md
+++ b/basic_demo/README_en.md
@@ -9,8 +9,9 @@ Please follow the steps in the document strictly to avoid unnecessary errors.
 ### Related inference test data
 
 **The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
-environment. **
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
+environment.**
+
 Test hardware information:
 
 + OS: Ubuntu 22.04
@@ -22,44 +23,45 @@ Test hardware information:
 
 The stress test data of relevant inference are as follows:
 
-**All tests are performed on a single GPU, and all video memory consumption is calculated based on the peak value**
+**All tests are performed on a single GPU, and all GPU memory consumption is calculated based on the peak value**
 
 #### GLM-4-9B-Chat
 
-| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|------------------------|
-| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
-| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
-| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
-| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |
+| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
+|-------|------------|------------|---------------|------------------------|
+| BF16 | 19 GB | 0.2s | 27.8 tokens/s | Input length is 1000 |
+| BF16 | 21 GB | 0.8s | 31.8 tokens/s | Input length is 8000 |
+| BF16 | 28 GB | 4.3s | 14.4 tokens/s | Input length is 32000 |
+| BF16 | 58 GB | 38.1s | 3.4 tokens/s | Input length is 128000 |
 
-| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|-----------------------|
-| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
-| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
-| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |
+| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
+|-------|------------|------------|---------------|-----------------------|
+| INT4 | 8 GB | 0.2s | 23.3 tokens/s | Input length is 1000 |
+| INT4 | 10 GB | 0.8s | 23.4 tokens/s | Input length is 8000 |
+| INT4 | 17 GB | 4.3s | 14.6 tokens/s | Input length is 32000 |
 
 ### GLM-4-9B-Chat-1M
 
-| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|--------------|
-| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | 输入长度为 200000 |
+| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
+|-------|------------|------------|--------------|------------------------|
+| BF16 | 75 GB | 98.4s | 2.3 tokens/s | Input length is 200000 |
 
-If your input exceeds 200K, we recommend that you use the vLLM backend with multi gpus for inference to get better performance.
+If your input exceeds 200K, we recommend that you use the vLLM backend with multiple GPUs for inference to get better
+performance.
 
 #### GLM-4V-9B
 
-| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
-| BF16 | 33043MiB | 0.7935a | 39.2444 tokens/s | Input length is 8000 |
+| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
+|-------|------------|------------|---------------|----------------------|
+| BF16 | 28 GB | 0.1s | 33.4 tokens/s | Input length is 1000 |
+| BF16 | 33 GB | 0.7s | 39.2 tokens/s | Input length is 8000 |
 
-| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
-|-------|------------|------------|------------------|----------------------|
-| Int4 | 10267MiB | 0.1685a | 28.7101 tokens/s | Input length is 1000 |
-| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |
+| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
+|-------|------------|------------|---------------|----------------------|
+| INT4 | 10 GB | 0.1s | 28.7 tokens/s | Input length is 1000 |
+| INT4 | 15 GB | 0.8s | 24.2 tokens/s | Input length is 8000 |
 
 ### Minimum hardware requirements
 
@@ -71,8 +73,8 @@ If you want to run the most basic code provided by the official (transformers ba
 If you want to run all the codes in this folder provided by the official, you also need:
 
 + Linux operating system (Debian series is best)
-+ GPU device with more than 8GB video memory, supporting CUDA or ROCM and supporting `BF16` reasoning (GPUs above A100,
-  V100, 20 and older GPU architectures are not supported)
++ GPU device with more than 8GB GPU memory, supporting CUDA or ROCM and supporting `BF16` inference (`FP16` precision
+  cannot be used for fine-tuning, and there is a small chance of problems during inference)
 
 Install dependencies
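For inputs beyond 200K tokens, the recommended vLLM multi-GPU path might look like the sketch below; `tensor_parallel_size` and `max_model_len` are illustrative values that must be tuned to the available GPUs, and the model id `THUDM/glm-4-9b-chat-1m` is assumed:

```python
from vllm import LLM, SamplingParams

# Shard the 1M-context model across 4 GPUs; both values below are illustrative,
# since the KV cache for very long inputs dominates GPU memory.
llm = LLM(
    model="THUDM/glm-4-9b-chat-1m",  # assumed Hugging Face model id
    trust_remote_code=True,
    tensor_parallel_size=4,          # number of GPUs to shard across
    max_model_len=200_000,           # cap the context to what the KV cache can hold
)
outputs = llm.generate(
    ["<a very long document> Summarize the document above."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```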
diff --git a/basic_demo/requirements.txt b/basic_demo/requirements.txt
index 95dccd6..0f2ce2e 100644
--- a/basic_demo/requirements.txt
+++ b/basic_demo/requirements.txt
@@ -1,3 +1,6 @@
+# use vllm
+# vllm>=0.4.3
+
 torch>=2.3.0
 torchvision>=0.18.0
 transformers==4.40.0
@@ -8,16 +11,14 @@ timm>=0.9.16
 tiktoken>=0.7.0
 accelerate>=0.30.1
 sentence_transformers>=2.7.0
-vllm>=0.4.3
 
 # web demo
-gradio>=4.31.5
+gradio>=4.33.0
 
 # openai demo
-openai>=1.30.3
+openai>=1.31.1
 
 einops>=0.7.0
 sse-starlette>=2.1.0
-# Int4
-
+# INT4
 bitsandbytes>=0.43.1
\ No newline at end of file
diff --git a/composite_demo/README_en.md b/composite_demo/README_en.md
index 5b386ed..c586a20 100644
--- a/composite_demo/README_en.md
+++ b/composite_demo/README_en.md
@@ -141,7 +141,7 @@ Users can upload documents and use the long text capability of GLM-4-9B to under
 pdf and other files.
 
 + Tool calls and system prompt words are not supported in this mode.
-+ If the text is very long, the model may require a high amount of video memory. Please confirm your hardware
++ If the text is very long, the model may require a high amount of GPU memory. Please confirm your hardware
   configuration.
 
 ## Image Understanding Mode
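Both READMEs report GPU memory as peak values. A short sketch of how such a peak can be measured with PyTorch's allocator statistics (not the project's actual benchmark harness):

```python
import torch

# Clear the running peak, run one inference call, then read the high-water mark.
torch.cuda.reset_peak_memory_stats()
# ... run a single model.generate(...) call here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gib:.1f} GiB")
```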
diff --git a/composite_demo/requirements.txt b/composite_demo/requirements.txt
index df67213..0cf6992 100644
--- a/composite_demo/requirements.txt
+++ b/composite_demo/requirements.txt
@@ -1,22 +1,25 @@
-accelerate
+# use vllm
+# vllm>=0.4.3
+
+accelerate>=0.30.1
 huggingface_hub>=0.19.4
 ipykernel>=6.26.0
 ipython>=8.18.1
 jupyter_client>=8.6.0
-langchain
-langchain-community
-matplotlib
+
+langchain>=0.2.1
+langchain-community>=0.2.1
+
+matplotlib>=3.9.0
 pillow>=10.1.0
-pymupdf
-python-docx
-python-pptx
+pymupdf>=1.24.5
+python-docx>=1.1.2
+python-pptx>=0.6.23
 pyyaml>=6.0.1
 requests>=2.31.0
 sentencepiece
 streamlit>=1.35.0
-tiktoken
+tiktoken>=0.7.0
 transformers==4.40.0
 zhipuai>=2.1.0
-# Please install vllm if you'd like to use long context model.
-# vllm
diff --git a/finetune_demo/README_en.md b/finetune_demo/README_en.md
index d406c18..b1dd9c4 100644
--- a/finetune_demo/README_en.md
+++ b/finetune_demo/README_en.md
@@ -6,8 +6,8 @@ not supported). Please strictly follow the steps in the document to avoid unnece
 ## Hardware check
 
 **The data in this document are tested in the following hardware environment. The actual operating environment
-requirements and the video memory occupied by the operation are slightly different. Please refer to the actual operating
-environment. **
+requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating
+environment.**
 
 Test hardware information:
 
 + OS: Ubuntu 22.04
@@ -17,7 +17,7 @@ Test hardware information:
 + GPU Driver: 535.104.05
 + GPU: NVIDIA A100-SXM4-80GB * 8
 
-| Fine-tuning solution | Video memory usage | Weight save point size |
+| Fine-tuning solution | GPU memory usage | Weight save point size |
 |----------------------|------------------|------------------------|
 | lora (PEFT) | 21531MiB | 17M |
 | p-tuning v2 (PEFT) | 21381MiB | 121M |
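The small weight save points in the table follow from PEFT storing only the adapter weights. A minimal LoRA sketch of the kind of setup benchmarked above; the rank, alpha, and `target_modules` name are assumptions, not the demo's actual configuration:

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", torch_dtype=torch.bfloat16, trust_remote_code=True
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # illustrative rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # assumed name of GLM's attention projection
)
model = get_peft_model(model, lora_config)
# Only the adapters train, which is why the lora save point is ~17M.
model.print_trainable_parameters()
```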