Intel OpenVINO support
parent 6ae0f088ac
commit d8828b19fd
@@ -11,7 +11,10 @@ Read this in [English](README_en.md)

## Project Updates

- 🔥🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment
  tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model.
  Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
  please update the model configuration file and refer to the example code in `basic_demo/trans_cli_demo.py`.
- 🔥 **News**: ``2024/6/19``: We have updated the running files and configuration files of the model repository and fixed some known model inference issues. You are welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We have released the [technical report](https://arxiv.org/pdf/2406.12793); feel free to check it out.

@@ -5,11 +5,11 @@
</p>
<p align="center">
📍Experience and use a larger-scale GLM business model on the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">Zhipu AI Open Platform</a>

</p>

## Update
- 🔥🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2,
  Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793), welcome to check it out.

@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat Model Using OpenVINO

Read this in [English](README_en.md).

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve inference performance, and reduce the memory footprint of models.
This example shows how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

The Hugging Face model has to be converted to an OpenVINO IR model, so you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

### Optional parameters

* `--model_id` - Hugging Face model id, or the (absolute) path to the directory containing the original model.
* `--output` - Path where the converted model will be saved.
* `--precision` - Precision of the conversion (`fp16`, `int8` or `int4`).
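
For reference, the int4 export that `convert.py` performs boils down to a single `optimum-intel` call. The sketch below is only an illustration; the model id, output directory and compression settings mirror the defaults in this commit's `convert.py`:

```python
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

# Export the Hugging Face checkpoint to OpenVINO IR with int4 weight compression
# (asymmetric quantization, group size 128, ~80% of the weight layers in 4 bit).
ov_model = OVModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    export=True,
    compile=False,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, group_size=128, ratio=0.8),
)
ov_model.save_pretrained("glm-4-9b-chat-ov")
```
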
The conversion process looks like this:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on (e.g. `CPU` or `GPU`).
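
The device name passed to `--device` has to be one that your OpenVINO runtime can actually see. A quick way to check (assuming the `openvino` runtime package that `optimum-intel` relies on is installed) is:

```python
import openvino as ov

# Lists the device names OpenVINO can target on this machine, e.g. ['CPU', 'GPU'].
print(ov.Core().available_devices)
```
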
### Reference code

This code is adapted from the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat model using OpenVINO

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, you need to install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

### Optional parameters

* `--model_id` - Hugging Face model id, or the (absolute) path to the directory containing the original model.
* `--output` - Path where the converted model will be saved.
* `--precision` - Precision of the conversion (`fp16`, `int8` or `int4`).

The conversion process is as follows:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - Maximum number of new tokens to generate.
* `--device` - Device to run inference on (e.g. `CPU` or `GPU`).
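
If you would rather drive the converted model from your own script than through `chat.py`, loading and generating with `optimum-intel` looks roughly like the sketch below (the path, prompt and generation settings here are illustrative placeholders):

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "glm-4-9b-chat-ov"  # directory produced by convert.py

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Compile the OpenVINO model for the chosen device ("CPU", "GPU", ...).
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build chat-template inputs for a single user turn and generate a reply.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
output = ov_model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
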
### Reference code

This code is adapted from the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
@@ -0,0 +1,72 @@
"""
This script is used to convert the original model to OpenVINO IR format.
The original code can be found at https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
from transformers import AutoTokenizer, AutoConfig
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

import os
from pathlib import Path
import argparse


if __name__ == '__main__':
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_id',
                        default='THUDM/glm-4-9b-chat',
                        required=False,
                        type=str,
                        help='original model path or Hugging Face model id')
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16, int8 or int4')
    parser.add_argument('-o',
                        '--output',
                        default='./glm-4-9b-ov',
                        required=False,
                        type=str,
                        help='path to save the converted IR model')
    args = parser.parse_args()

    ir_model_path = Path(args.output)
    if not ir_model_path.exists():
        os.mkdir(ir_model_path)

    model_kwargs = {
        "trust_remote_code": True,
        "config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
    }
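    # Weight-compression settings for the int4 export: asymmetric quantization,
    # group size 128, with roughly 80% of the weight layers compressed to 4 bit.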
    compression_configs = {
        "sym": False,
        "group_size": 128,
        "ratio": 0.8,
    }

    print("====Exporting IR=====")
    if args.precision == "int4":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs), **model_kwargs)
    elif args.precision == "int8":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=True, **model_kwargs)
    else:
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=False, **model_kwargs)

    ov_model.save_pretrained(ir_model_path)

    print("====Exporting tokenizer=====")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id, trust_remote_code=True)
    tokenizer.save_pretrained(ir_model_path)
@@ -0,0 +1,122 @@
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)


class StopOnTokens(StoppingCriteria):
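    # Stopping criterion that ends generation as soon as the last generated
    # token matches one of the given token ids.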
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. Path to the OpenVINO IR model directory')
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='Optional. Maximum number of new tokens to generate')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference, e.g. CPU or GPU')
    args = parser.parse_args()
    model_dir = args.model_path
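    # OpenVINO runtime options: latency-oriented performance hint, a single
    # inference stream, and an empty CACHE_DIR (no compiled-model cache).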
    ov_config = {"PERFORMANCE_HINT": "LATENCY",
                 "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)

    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
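    # Stop generation on GLM-4's special end-of-turn tokens (assumed to be the
    # model's eos / role-switch token ids).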
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
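        # Convert the (user, assistant) history into chat-template token ids;
        # the final user message is appended without an assistant reply.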
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("用户: ")
        if input_text.lower() == 'stop':
            break

        if input_text.lower() == 'clear':
            history = []
            print("AI助手: 对话历史已清空")
            continue

        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
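        # Sampling settings for the demo: low-temperature sampling with a
        # repetition penalty; tokens are pushed to the streamer and generation
        # halts on the stop tokens defined above.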
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens)
        )
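        # Run generation in a background thread so the streamer can print tokens
        # in the main thread as they are produced.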
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()

        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
@@ -0,0 +1,2 @@
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6