Intel OpenVINO support
parent 6ae0f088ac
commit d8828b19fd
@@ -11,7 +11,10 @@ Read this in [English](README_en.md)

## Project Updates

- 🔥🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment
  tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model.
  Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
  please update the model configuration file and refer to the example code in `basic_demo/trans_cli_demo.py`.
- 🔥 **News**: ``2024/6/19``: We have updated the running files and configuration files of the model repository and fixed some known model inference issues. You are welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We have released the [technical report](https://arxiv.org/pdf/2406.12793); feel free to check it out.

@@ -5,11 +5,11 @@
</p>
<p align="center">
📍Experience and use a larger-scale GLM business model on the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">Zhipu AI Open Platform</a>

</p>

## Update
- 🔥🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
  Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793), welcome to check it out.

@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat Model Using OpenVINO

Read this in [English](README_en.md).

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve inference performance, and reduce the memory footprint of models.
This example shows how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

The Hugging Face model has to be converted to an OpenVINO IR model, so you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

### Optional parameters

* `--model_id` - Hugging Face model id, or the (absolute) path to the directory containing the original model.
* `--output` - Path where the converted model will be saved.
* `--precision` - Precision of the conversion (`fp16`, `int8` or `int4`).
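
For reference, the int4 export that `convert.py` performs boils down to a single `optimum-intel` call. The sketch below is only an illustration; the model id, output directory and compression settings mirror the defaults in this commit's `convert.py`:

```python
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

# Export the Hugging Face checkpoint to OpenVINO IR with int4 weight compression
# (asymmetric quantization, group size 128, ~80% of the weight layers in 4 bit).
ov_model = OVModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    export=True,
    compile=False,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, group_size=128, ratio=0.8),
)
ov_model.save_pretrained("glm-4-9b-chat-ov")
```
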
The conversion process looks like this:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on (e.g. `CPU` or `GPU`).
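
The device name passed to `--device` has to be one that your OpenVINO runtime can actually see. A quick way to check (assuming the `openvino` runtime package that `optimum-intel` relies on is installed) is:

```python
import openvino as ov

# Lists the device names OpenVINO can target on this machine, e.g. ['CPU', 'GPU'].
print(ov.Core().available_devices)
```
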
### Reference code

This code is adapted from the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat model using OpenVINO

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, you need to install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

### Optional parameters

* `--model_id` - Hugging Face model id, or the (absolute) path to the directory containing the original model.
* `--output` - Path where the converted model will be saved.
* `--precision` - Precision of the conversion (`fp16`, `int8` or `int4`).

The conversion process is as follows:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on (e.g. `CPU` or `GPU`).
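
If you would rather drive the converted model from your own script than through `chat.py`, loading and generating with `optimum-intel` looks roughly like the sketch below (the path, prompt and generation settings here are illustrative placeholders):

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "glm-4-9b-chat-ov"  # directory produced by convert.py

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Compile the OpenVINO model for the chosen device ("CPU", "GPU", ...).
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build chat-template inputs for a single user turn and generate a reply.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
output = ov_model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
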
### Reference code

This code is adapted from the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
@@ -0,0 +1,72 @@
"""
This script is used to convert the original model to OpenVINO IR format.
The original code can be found at https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
from transformers import AutoTokenizer, AutoConfig
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

import os
from pathlib import Path
import argparse


if __name__ == '__main__':
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_id',
                        default='THUDM/glm-4-9b-chat',
                        required=False,
                        type=str,
                        help='original model path or Hugging Face model id')
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16, int8 or int4')
    parser.add_argument('-o',
                        '--output',
                        default='./glm-4-9b-ov',
                        required=False,
                        type=str,
                        help='path to save the converted IR model')
    args = parser.parse_args()

    ir_model_path = Path(args.output)
    if not ir_model_path.exists():
        os.mkdir(ir_model_path)

    model_kwargs = {
        "trust_remote_code": True,
        "config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
    }
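    # Weight-compression settings for the int4 export: asymmetric quantization,
    # group size 128, with roughly 80% of the weight layers compressed to 4 bit.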
    compression_configs = {
        "sym": False,
        "group_size": 128,
        "ratio": 0.8,
    }

    print("====Exporting IR=====")
    if args.precision == "int4":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs), **model_kwargs)
    elif args.precision == "int8":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=True, **model_kwargs)
    else:
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=False, **model_kwargs)

    ov_model.save_pretrained(ir_model_path)

    print("====Exporting tokenizer=====")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id, trust_remote_code=True)
    tokenizer.save_pretrained(ir_model_path)
@@ -0,0 +1,122 @@
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)


class StopOnTokens(StoppingCriteria):
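    # Stopping criterion that ends generation as soon as the last generated
    # token matches one of the given token ids.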
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. Path to the OpenVINO IR model directory')
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='Optional. Maximum number of new tokens to generate')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference, e.g. CPU or GPU')
    args = parser.parse_args()
    model_dir = args.model_path
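    # OpenVINO runtime options: latency-oriented performance hint, a single
    # inference stream, and an empty CACHE_DIR (no compiled-model cache).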
    ov_config = {"PERFORMANCE_HINT": "LATENCY",
                 "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)

    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
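    # Stop generation on GLM-4's special end-of-turn tokens (assumed to be the
    # model's eos / role-switch token ids).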
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
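        # Convert the (user, assistant) history into chat-template token ids;
        # the final user message is appended without an assistant reply.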
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("用户: ")
        if input_text.lower() == 'stop':
            break

        if input_text.lower() == 'clear':
            history = []
            print("AI助手: 对话历史已清空")
            continue

        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
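        # Sampling settings for the demo: low-temperature sampling with a
        # repetition penalty; tokens are pushed to the streamer and generation
        # halts on the stop tokens defined above.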
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens)
        )
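        # Run generation in a background thread so the streamer can print tokens
        # in the main thread as they are produced.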
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()

        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
@@ -0,0 +1,2 @@
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6