intel openvino support
Parent: 6ae0f088ac
Commit: d8828b19fd

@@ -11,7 +11,10 @@ Read this in [English](README_en.md)

## Project Updates

- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2. Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥 **News**: ``2024/6/19``: We have updated the running files and configuration files of the model repository and fixed some known model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We have released a [technical report](https://arxiv.org/pdf/2406.12793); welcome to check it out.

@@ -5,11 +5,11 @@
</p>
<p align="center">
📍Experience and use a larger-scale GLM business model on the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">Zhipu AI Open Platform</a>
</p>

## Update

- 🔥 **News**: ``2024/6/28``: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model. Welcome to [view](intel_device_demo).
- 🔥 **News**: ``2024/6/24``: We have updated the running files and configuration files of the model repository to support Flash Attention 2. Please update the model configuration file and refer to the sample code in `basic_demo/trans_cli_demo.py`.
- 🔥🔥 **News**: ``2024/6/19``: We updated the running files and configuration files of the model repository and fixed some model inference issues. Welcome to clone the latest model repository.
- 🔥 **News**: ``2024/6/18``: We released a [technical report](https://arxiv.org/pdf/2406.12793); welcome to check it out.
@@ -0,0 +1,70 @@
# Deploy the GLM-4-9B-Chat model using OpenVINO

Read this in [English](README_en.md).

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, you need to install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

### Optional parameters

* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path to where the converted model is saved.
* `--precision` - Precision of the conversion.
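
By default `convert.py` exports with int4 weight compression. For reference, the sketch below shows roughly what the `--precision int4` path of `convert.py` (included later in this commit) does; the output directory is only an example.

```python
# Minimal sketch of the int4 export path, mirroring convert.py's settings.
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

ov_model = OVModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    export=True,            # convert the Hugging Face checkpoint to OpenVINO IR
    compile=False,          # export only; do not compile for a device yet
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(
        bits=4, sym=False, group_size=128, ratio=0.8),  # same values as convert.py
)
ov_model.save_pretrained("./glm-4-9b-chat-ov")  # example output directory
```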

The conversion process is as follows:

```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of output tokens.
* `--device` - The device to run inference on.
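
If you want to call the converted model from your own code rather than through the interactive `chat.py`, the sketch below follows the same optimum-intel API that `chat.py` (included later in this commit) uses; the model path is a placeholder.

```python
# Minimal sketch: load the exported OpenVINO IR model and generate one reply.
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "./glm-4-9b-chat-ov"  # adjust to the --output path you used with convert.py
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU", trust_remote_code=True)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```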
|
||||||
|
|
||||||
|
### 参考代码
|
||||||
|
|
||||||
|
本代码参考 [OpenVINO 官方示例](https://github.com/OpenVINO-dev-contest/chatglm3.openvino) 进行修改。
|
|
@ -0,0 +1,70 @@
|
||||||
|
# Deploy the GLM-4-9B-Chat model using OpenVINO

[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.

## 1. Environment configuration

First, you need to install the dependencies:

```bash
pip install -r requirements.txt
```

## 2. Convert the model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download the model and convert it:

```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```

The conversion process is as follows:

```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

### Optional parameters

* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path to where the converted model is saved.
* `--precision` - Precision of the conversion.
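
After conversion you can sanity-check the output directory; the sketch below assumes the default `--output` location of `convert.py`, so adjust the path if you passed a different one.

```python
# List what the export produced. The log above shows openvino_config.json being written here;
# the exported IR weights and the tokenizer files are saved alongside it.
from pathlib import Path

out_dir = Path("./glm-4-9b-ov")  # default --output of convert.py
for f in sorted(out_dir.iterdir()):
    print(f.name)
```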

## 3. Run the GLM-4-9B-Chat model

```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```

### Optional parameters

* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of output tokens.
* `--device` - The device to run inference on.
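
`--device` takes an OpenVINO device string such as `CPU` or `GPU`. To check which device names OpenVINO detects on your machine, you can use the standard OpenVINO runtime API (a small sketch):

```python
# Print the devices OpenVINO can run inference on, e.g. ['CPU', 'GPU'].
import openvino as ov

print(ov.Core().available_devices)
```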

### Reference code

This code is adapted from the [official OpenVINO example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).

@@ -0,0 +1,72 @@
"""
This script converts the original model to OpenVINO IR format.
The original code can be found at https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
from transformers import AutoTokenizer, AutoConfig
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

import os
from pathlib import Path
import argparse


if __name__ == '__main__':
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_id',
                        default='THUDM/glm-4-9b-chat',
                        required=False,
                        type=str,
                        help='original model path or model id')
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16, int8 or int4')
    parser.add_argument('-o',
                        '--output',
                        default='./glm-4-9b-ov',
                        required=False,
                        type=str,
                        help='path to save the IR model')
    args = parser.parse_args()

    # Create the output directory if it does not exist yet.
    ir_model_path = Path(args.output)
    if not ir_model_path.exists():
        os.mkdir(ir_model_path)

    model_kwargs = {
        "trust_remote_code": True,
        "config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
    }
    # Weight-compression settings for the int4 export:
    # asymmetric quantization, group size 128, 80% of the weights kept in 4 bit.
    compression_configs = {
        "sym": False,
        "group_size": 128,
        "ratio": 0.8,
    }

    # Export the Hugging Face checkpoint to OpenVINO IR at the requested precision.
    print("====Exporting IR=====")
    if args.precision == "int4":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs), **model_kwargs)
    elif args.precision == "int8":
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=True, **model_kwargs)
    else:
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=False, **model_kwargs)

    ov_model.save_pretrained(ir_model_path)

    print("====Exporting tokenizer=====")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id, trust_remote_code=True)
    tokenizer.save_pretrained(ir_model_path)
@@ -0,0 +1,122 @@
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)


class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of the given ids."""

    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. model path')
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='maximum length of the output')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='device for inference')
    args = parser.parse_args()
    model_dir = args.model_path

    # OpenVINO runtime options: optimize for latency, single inference stream, no model cache.
    ov_config = {"PERFORMANCE_HINT": "LATENCY",
                 "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)

    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    # Stream tokens back to the console as they are generated.
    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
    # GLM-4 special token ids that should end generation.
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Turn the (user, assistant) history into model input ids via the chat template.
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("User: ")
        if input_text.lower() == 'stop':
            break

        if input_text.lower() == 'clear':
            history = []
            print("AI assistant: conversation history cleared")
            continue

        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens)
        )

        # Run generation in a background thread so the streamer can be consumed here.
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()

        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
@@ -0,0 +1,2 @@
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6