fix #232
parent 5722878e25
commit e5b5630498
@ -11,7 +11,9 @@ Read this in [English](README_en.md)
## Project Updates

- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some known model inference issues. You are welcome to clone the latest model repository.
- 🔥🔥 **News**: ``2024/6/24``: We updated the running files and configuration files of the model repository to support Flash Attention 2.
  Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some known model inference issues. You are welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793); feel free to check it out.
- 🔥 **News**: ``2024/6/05``: We released the GLM-4-9B series of open-source models.
@ -9,6 +9,8 @@
</p>

## Update

- 🔥🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2.
  Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py` (a minimal loading sketch also follows this list).
- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some known model inference issues. You are welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793); feel free to check it out.
- 🔥 **News**: ``2024/6/05``: We released the GLM-4-9B series of open-source models.
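For orientation, here is a minimal sketch of loading the model with Flash Attention 2 enabled, assembled from the commented-out lines in `basic_demo/trans_cli_demo.py` further down this diff; the checkpoint name is an assumption, and flash-attn must be installed separately:

```python
# Minimal sketch: load the model with Flash Attention 2 enabled.
# Assumes `pip install flash-attn` and a CUDA GPU; the checkpoint name
# below is illustrative -- substitute your local path if needed.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,               # flash-attn needs bfloat16 or float16
    device_map="auto",
).eval()
```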
@ -91,10 +91,11 @@ python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```

+ Use the Gradio web client to communicate with the GLM-4-9B-Chat model.
+ Use the Gradio web client to communicate with the GLM-4-9B model.

```shell
python trans_web_demo.py
python trans_web_demo.py # GLM-4-9B-Chat
python trans_web_vision_demo.py # GLM-4V-9B
```

+ Use batch inference.
@ -96,10 +96,11 @@ python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```

+ Use the Gradio web client to communicate with the GLM-4-9B-Chat model.
+ Use the Gradio web client to communicate with the GLM-4-9B model.

```shell
python trans_web_demo.py
python trans_web_demo.py # GLM-4-9B-Chat
python trans_web_vision_demo.py # GLM-4V-9B
```

+ Use batch inference.
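Batch inference is not spelled out in the snippets above; the following is a minimal sketch using standard `transformers` batch generation rather than any repository-specific helper. The checkpoint name, prompts, and padding setup are assumptions.

```python
# Minimal batch-inference sketch; checkpoint name and prompts are
# illustrative, not repository code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

prompts = ["Hello, who are you?", "Summarize Flash Attention in one line."]
if tokenizer.pad_token is None:          # guard: some tokenizers define no pad token
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"          # left padding keeps generations contiguous
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```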
@ -8,6 +8,8 @@ Usage:
Note: The script includes a modification to handle markdown-to-plain-text conversion,
ensuring that the CLI interface displays formatted text correctly.

If you use Flash Attention, you should install the flash-attn package and pass
attn_implementation="flash_attention_2" when loading the model.
"""

import os
@ -40,9 +42,12 @@ tokenizer = AutoTokenizer.from_pretrained(
    trust_remote_code=True,
    encode_special_tokens=True
)

model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    # attn_implementation="flash_attention_2",  # enable Flash Attention 2
    # torch_dtype=torch.bfloat16,  # flash-attn requires bfloat16 or float16
    device_map="auto").eval()
@ -32,10 +32,12 @@ tokenizer = AutoTokenizer.from_pretrained(
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    # attn_implementation="flash_attention_2",  # enable Flash Attention 2
    # torch_dtype=torch.bfloat16,  # flash-attn requires bfloat16 or float16
    device_map="auto",
    torch_dtype=torch.bfloat16
).eval()


## For INT4 inference
# model = AutoModel.from_pretrained(
#     MODEL_PATH,
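The commented-out INT4 block is cut off by this hunk. As an assumption about the intended approach, here is a minimal 4-bit loading sketch using `transformers` with `bitsandbytes`; it may differ from what the repository actually ships.

```python
# Hypothetical INT4 loading sketch -- assumes `pip install bitsandbytes`;
# one common approach, not necessarily the repository's exact code.
from transformers import AutoModel, BitsAndBytesConfig

model_int4 = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
).eval()
```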