Intel OpenVINO support

This commit is contained in:
zR 2024-06-28 16:09:51 +08:00
parent 6ae0f088ac
commit d8828b19fd
7 changed files with 342 additions and 3 deletions

View File

@@ -11,7 +11,10 @@ Read this in [English](README_en.md)
## Project updates
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2. Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥 **News**: ``2024/6/19``: We have updated the running files and configuration files of the model repository and fixed some known model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793); welcome to check it out.

View File

@@ -5,11 +5,11 @@
</p>
<p align="center">
📍Experience and use a larger-scale GLM business model on the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">Zhipu AI Open Platform</a>
</p>
## Update
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2. Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793); welcome to check it out.

View File

@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat model using OpenVINO
Read this in [English](README_en.md).
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open-source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve inference performance, and reduce model memory usage.
This example shows how to deploy the GLM-4-9B-Chat model with OpenVINO.
## 1. Environment configuration
First, install the dependencies:
```bash
pip install -r requirements.txt
```
## 2. Convert the model
The Hugging Face model needs to be converted to an OpenVINO IR model, so you need to download the model and convert it:
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
### Optional parameters
* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path where the converted model will be saved.
* `--precision` - Precision of the conversion (`fp16`, `int8`, or `int4`).
The conversion process looks like this:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
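As a reference, the int4 export that `convert.py` performs corresponds roughly to the sketch below. This is a minimal, illustrative snippet, not the full script: the output path is a placeholder and the compression settings are the defaults used by `convert.py`.

```python
from transformers import AutoConfig, AutoTokenizer
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

model_id = "THUDM/glm-4-9b-chat"
output_dir = "glm-4-9b-chat-ov"  # placeholder output path

# Export the Hugging Face model to OpenVINO IR with int4 weight compression.
ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    compile=False,
    trust_remote_code=True,
    config=AutoConfig.from_pretrained(model_id, trust_remote_code=True),
    quantization_config=OVWeightQuantizationConfig(
        bits=4, sym=False, group_size=128, ratio=0.8),
)
ov_model.save_pretrained(output_dir)

# Save the tokenizer next to the IR files so chat.py can load everything from one directory.
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(output_dir)
```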
## 3. Run the GLM-4-9B-Chat model
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on.
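If you want to call the converted model from your own code instead of the interactive `chat.py`, a minimal sketch using the same API calls as the demo script might look like the following; the model path, device, and generation settings are placeholders.

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "glm-4-9b-chat-ov"  # placeholder: directory produced by convert.py

# Load the tokenizer and compile the OpenVINO model (CPU by default).
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",  # use "GPU" for an Intel GPU
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build the prompt with the chat template and generate a reply.
messages = [{"role": "user", "content": "Hello, who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
output = ov_model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```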
### Reference code
This code is adapted from the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).

View File

@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat model using OpenVINO
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.
## 1. Environment configuration
First, install the dependencies:
```bash
pip install -r requirements.txt
```
## 2. Convert the model
The Hugging Face model needs to be converted to an OpenVINO IR model, so you need to download the model and convert it:
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
The conversion process is as follows:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
### Optional parameters
* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path to where the converted model is saved.
* `--precision` - Precision of the conversion (`fp16`, `int8`, or `int4`).
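As a reference, the int4 export that `convert.py` performs corresponds roughly to the sketch below. This is a minimal, illustrative snippet, not the full script: the output path is a placeholder and the compression settings are the defaults used by `convert.py`.

```python
from transformers import AutoConfig, AutoTokenizer
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

model_id = "THUDM/glm-4-9b-chat"
output_dir = "glm-4-9b-chat-ov"  # placeholder output path

# Export the Hugging Face model to OpenVINO IR with int4 weight compression.
ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    compile=False,
    trust_remote_code=True,
    config=AutoConfig.from_pretrained(model_id, trust_remote_code=True),
    quantization_config=OVWeightQuantizationConfig(
        bits=4, sym=False, group_size=128, ratio=0.8),
)
ov_model.save_pretrained(output_dir)

# Save the tokenizer next to the IR files so chat.py can load everything from one directory.
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(output_dir)
```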
## 3. Run the GLM-4-9B-Chat model
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on.
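If you want to call the converted model from your own code instead of the interactive `chat.py`, a minimal sketch using the same API calls as the demo script might look like the following; the model path, device, and generation settings are placeholders.

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "glm-4-9b-chat-ov"  # placeholder: directory produced by convert.py

# Load the tokenizer and compile the OpenVINO model (CPU by default).
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",  # use "GPU" for an Intel GPU
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build the prompt with the chat template and generate a reply.
messages = [{"role": "user", "content": "Hello, who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
output = ov_model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```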
### Reference code
This code is modified based on the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).

View File

@@ -0,0 +1,72 @@
"""
This script is used to convert the original model to OpenVINO IR format.
The original code is available at https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
from transformers import AutoTokenizer, AutoConfig
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM
import os
from pathlib import Path
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_id',
                        default='THUDM/glm-4-9b-chat',
                        required=False,
                        type=str,
                        help='Original model path or Hugging Face model id')
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16, int8 or int4')
    parser.add_argument('-o',
                        '--output',
                        default='./glm-4-9b-ov',
                        required=False,
                        type=str,
                        help='Path to save the OpenVINO IR model')
    args = parser.parse_args()
    ir_model_path = Path(args.output)
    if not ir_model_path.exists():
        os.mkdir(ir_model_path)

    model_kwargs = {
        "trust_remote_code": True,
        "config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
    }
    # Weight-compression settings used for the int4 export.
    compression_configs = {
        "sym": False,
        "group_size": 128,
        "ratio": 0.8,
    }

    print("====Exporting IR=====")
    if args.precision == "int4":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs), **model_kwargs)
    elif args.precision == "int8":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=True, **model_kwargs)
    else:
        # fp16: export without weight quantization.
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=False, **model_kwargs)
    ov_model.save_pretrained(ir_model_path)

    print("====Exporting tokenizer=====")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id, trust_remote_code=True)
    tokenizer.save_pretrained(ir_model_path)

View File

@@ -0,0 +1,122 @@
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)


class StopOnTokens(StoppingCriteria):
    """Stop generation when the last generated token is one of the given stop token ids."""

    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-h',
'--help',
action='help',
help='Show this help message and exit.')
parser.add_argument('-m',
'--model_path',
required=True,
type=str,
help='Required. model path')
parser.add_argument('-l',
'--max_sequence_length',
default=256,
required=False,
type=int,
help='Required. maximun length of output')
parser.add_argument('-d',
'--device',
default='CPU',
required=False,
type=str,
help='Required. device for inference')
args = parser.parse_args()
model_dir = args.model_path
ov_config = {"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1", "CACHE_DIR": ""}
tokenizer = AutoTokenizer.from_pretrained(
model_dir, trust_remote_code=True)
print("====Compiling model====")
ov_model = OVModelForCausalLM.from_pretrained(
model_dir,
device=args.device,
ov_config=ov_config,
config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
trust_remote_code=True,
)
streamer = TextIteratorStreamer(
tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
)
stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Turn the (user, assistant) turn history into input ids via the chat template.
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("用户: ")
        if input_text.lower() == 'stop':
            break

        if input_text.lower() == 'clear':
            history = []
            print("AI助手: 对话历史已清空")
            continue

        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens)
        )
        # Generate on a background thread so tokens can be streamed as they are produced.
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()

        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text

View File

@@ -0,0 +1,2 @@
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6