first commit

xxl 2025-02-25 09:26:49 +08:00
parent c73dc99a1c
commit 412cb3c21a
15 changed files with 2838 additions and 2 deletions

400
README.md

@@ -1,3 +1,399 @@
# Baichuan-M1-14B-Instruct
---
language:
- en
- zh
tags:
- medical
---
<div align="center">
<h1>
Baichuan-M1-14B-Instruct
</h1>
</div>
<p align="center">
🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base" target="_blank">Baichuan-M1-14B-Base</a> • 🤗 <a href="https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct" target="_blank">Baichuan-M1-14B-Instruct</a> • 📗 <a href="https://arxiv.org/abs/2502.12671" target="_blank">Technical Report</a> • 💬 <a href="https://y41.8if.cn/JQCj6n" target="_blank">WeChat</a>
</p>
---
# 📖 Table of Contents
- [🏁 Model Introduction](#intro)
- [🔬 Data Collection and Processing](#data)
- [🧠 New Model Architecture](#structure)
- [⚙️ Training Methodology](#training)
- [📊 Benchmark Results](#benchmark)
- [🚀 Quick Start](#quick)
- [📜 License and Statement](#declare)
- [🏷️ Reference](#reference)
---
<a name="intro"></a>
# 🏁 Model Introduction
**Baichuan-M1-14B** is the industry's first open-source large language model developed from scratch by Baichuan Intelligence specifically for medical scenarios. While retaining strong general capabilities, it delivers powerful performance in the medical field: it matches similarly sized models on most general benchmarks while outperforming models five times larger in medical scenarios. Its core features are:
- Trained from scratch on **20 trillion tokens** of high-quality medical and general data.
- Specialized modeling for **20+ medical departments** with fine-grained medical expertise.
- Introduces **innovative model architecture**, significantly improving context understanding and long-sequence task performance.
- Provides **[🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base)** and **[🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct)**.
---
<a name="data"></a>
# 🔬 Data Collection and Processing
## Medical Data Collection
We conducted meticulous data collection and synthesis for the medical field, including:
- **Tens of millions of professional medical data entries**: Chinese and English research papers, medical cases, medical textbooks, knowledge bases, etc.
- **Hundreds of millions of medical Q&A and clinical records**: covering complex medical reasoning and real-world clinical cases.
- **Comprehensive data classification and evaluation**: data is categorized by medical department, content, and value to ensure a balanced distribution and to filter for truly valuable medical data.
## Data Synthesis and Optimization
- **Synthetic data design**: Combining knowledge graphs, cases, and textbooks to generate diverse, high-quality medical reasoning data.
- **Self-reflection mechanism and reward model**: Continuously improving the quality of synthetic data, ultimately generating **nearly a trillion tokens** of reasoning data, covering long-tail knowledge and complex scenarios.
## General Data Collection
- **20T multilingual general dataset**: Including 14T English data, 4T Chinese data, and 2T data covering 30 mainstream languages.
- **Deduplication and upsampling strategy**: Upsampling high-quality data to significantly enhance model performance.
- **27 global knowledge categories**: Optimizing data ratios based on small model experiments to balance general and domain-specific capabilities.
---
<a name="structure"></a>
# 🧠 New Model Architecture
## Short Convolution Attention Mechanism
- Lightweight short convolutions are applied when computing the Key and Value states, which significantly reduces the standard Transformer's reliance on induction heads for in-context learning. Traditional Transformers depend on induction heads to capture repetitive patterns and contextual dependencies in sequences, which demands a certain model width and depth; applying a short convolution to the Key and Value sequences along the time dimension strengthens in-context learning directly. Extensive experiments, from toy models to models with over ten billion parameters, show that short convolution attention excels at language modeling, especially on tasks that depend heavily on contextual information. A minimal sketch of the idea follows.
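A minimal sketch, assuming each position is causally mixed with its immediate predecessor using a per-head kernel (the checkpoint stores `conv_k`/`conv_v` parameters per layer and `conv_window: 2` in `config.json`; the exact kernel shapes and combination used in `modeling_baichuan.py` may differ):

```python
import torch
import torch.nn.functional as F

def short_conv(x: torch.Tensor, conv_weight: torch.Tensor, conv_window: int = 2) -> torch.Tensor:
    """Causally mix each position with its previous positions along the time axis.

    x:           (batch, seq_len, num_heads, head_dim) -- Key or Value states
    conv_weight: (num_heads, conv_window)              -- hypothetical per-head mixing kernel
    """
    batch, seq_len, num_heads, head_dim = x.shape
    # Left-pad the time dimension so position t never sees tokens after t.
    x_pad = F.pad(x, (0, 0, 0, 0, conv_window - 1, 0))
    out = torch.zeros_like(x)
    for w in range(conv_window):
        # w = 0 picks the previous token, w = conv_window - 1 picks the current one.
        out = out + conv_weight[None, None, :, w, None] * x_pad[:, w:w + seq_len]
    return out

# Toy usage: mix the Key states with a window of 2 before attention.
k = torch.randn(1, 8, 4, 16)                    # (batch, seq, heads, head_dim)
w_k = torch.softmax(torch.randn(4, 2), dim=-1)  # hypothetical learned per-head weights
print(short_conv(k, w_k).shape)                 # torch.Size([1, 8, 4, 16])
```

Because the mixing window covers only two positions, the extra parameters and compute are negligible next to the attention and MLP blocks.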
## Sliding Window Attention Mechanism
- Some layers use a sliding window attention mechanism to reduce KV cache memory usage.
- **Optimization**: balances computational efficiency and performance, and is especially well suited to long-sequence tasks (see the sketch below).
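For intuition, a small illustrative sketch of the sliding-window causal mask (in this checkpoint, `config.json` sets `sliding_window: 8192` and applies it on the odd-indexed layers listed in `sliding_window_layers`; this is not the production attention kernel):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True iff query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # Causal (j <= i) and within the last `window` tokens (j > i - window).
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
```

Because keys outside the window are never attended to, those layers only need to keep the most recent `window` entries in their KV cache.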
## Optimizing Position Encoding Oscillation
- By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
- **Result**: More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
## High Peak Learning Rate Strategy
- Uses a **WSD (Warmup-Stable-Decay) learning rate schedule** with a high peak learning rate to promote model generalization (sketched below).
- **Comparison results**: significant improvement on benchmark task performance.
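A rough sketch of a WSD-style schedule: linear warmup to a high peak, a long stable stage at the peak, then a final decay. The stage lengths and peak value below are placeholders, not the values used to train Baichuan-M1:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule with illustrative stage fractions."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)          # warmup
    if step < stable_end:
        return peak_lr                                        # stable stage at the peak
    return peak_lr * max(0.0, (total_steps - step) / max(decay_steps, 1))  # final decay

print([round(wsd_lr(s, 1000, 3e-4), 6) for s in (0, 5, 500, 950, 1000)])
```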
## Adaptive Gradient Update
- **Dynamic gradient clipping**: updates are skipped when gradients are excessively large, reducing instability caused by unusual samples or steep regions of the loss landscape (see the sketch below).
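A minimal sketch of the idea in PyTorch: clip the gradient norm as usual, but skip the optimizer step outright when the raw norm is far above the threshold. The thresholds here are illustrative, not the actual training recipe:

```python
import torch

def clipped_or_skipped_step(optimizer: torch.optim.Optimizer,
                            parameters,
                            max_norm: float = 1.0,
                            skip_factor: float = 4.0) -> bool:
    """Clip gradients to `max_norm`; skip the update if the raw norm is extreme.

    Returns True if an optimizer step was taken.
    """
    params = [p for p in parameters if p.grad is not None]
    grad_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)  # returns pre-clip norm
    if grad_norm > skip_factor * max_norm:
        optimizer.zero_grad(set_to_none=True)   # drop this batch's gradients entirely
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True

# Toy usage with a linear layer.
layer = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
layer(torch.randn(2, 4)).sum().backward()
print("stepped:", clipped_or_skipped_step(opt, layer.parameters()))
```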
---
<a name="training"></a>
# ⚙️ Training Methodology
We innovatively adopted a **multi-stage curriculum learning and alignment optimization** approach, systematically enhancing model capabilities through the following two parts:
## 1. Multi-Stage Curriculum Learning
Training is divided into three stages, progressively optimizing the model's general and medical domain capabilities:
1. **General Knowledge Enhancement Stage**: Focused on general language modeling to build basic language ability and common-sense knowledge.
2. **Medical Basic Knowledge Enhancement Stage**: Introduces high-quality medical data to strengthen reasoning, mathematics, and core medical knowledge.
3. **Medical Advanced Knowledge Enhancement Stage**: Further improves data quality, focusing on complex medical reasoning, disease diagnosis, and long-tail knowledge.
## 2. Alignment Optimization
Enhancing model generation quality, logical reasoning, and user preference alignment through reinforcement learning and pairwise data optimization:
1. **Pairwise Data**: Covering multi-turn dialogues, instruction following, math and code, and reasoning tasks, sourced from human annotations and multi-model generation.
2. **Optimization Process**:
- **ELO**: Optimizing diverse, high-quality chain-of-thought generation based on maximum likelihood.
- **TDPO**: Using pairwise data to optimize the generation model for better user preference alignment.
- **PPO**: Further enhancing generation logic and task performance through policy optimization.
This combined approach of multi-stage and alignment optimization enables the model to achieve exceptional performance in both general and medical domain capabilities.
---
<a name="benchmark"></a>
# 📊 Benchmark Results
Our evaluation covers all mainstream benchmarks, achieving excellent metrics in both open-source and closed-source evaluations, demonstrating outstanding medical scenario capabilities while maintaining strong general performance.
<table style="border: 1px solid #000; border-collapse: collapse; width: 100%; text-align: center;">
<thead>
<tr>
<th>Category</th>
<th>Benchmark</th>
<th style="font-size:15px;">Baichuan-M1-14B-Instruct</th>
<th style="font-size:15px;">Qwen2.5-14B-Instruct</th>
<th style="font-size:15px;">Qwen2.5-72B-Instruct</th>
<th style="font-size:15px;">claude-3.5-sonnet-20241022</th>
<th style="font-size:15px;">gpt-4o</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2" style="text-align: center;">Average Score</td>
<td>72.23</td>
<td>65.39</td>
<td>70.51</td>
<td>74.85</td>
<td>75.00</td>
</tr>
<tr>
<td rowspan="7" style="vertical-align: middle;">Clinical Practice</td>
<td style="text-align: left;">cmbclin</td>
<td>77.40</td>
<td>71.51</td>
<td>75.36</td>
<td>78.37</td>
<td>75.36</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_diag</td>
<td>70.90</td>
<td>68.85</td>
<td>72.23</td>
<td>75.00</td>
<td>73.05</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_hos</td>
<td>70.05</td>
<td>68.83</td>
<td>70.53</td>
<td>65.58</td>
<td>69.38</td>
</tr>
<tr>
<td style="text-align: left;">clinicalbench_treat</td>
<td>56.38</td>
<td>55.03</td>
<td>57.30</td>
<td>64.03</td>
<td>59.35</td>
</tr>
<tr>
<td style="text-align: left;">rarearena_rdc</td>
<td>81.80</td>
<td>66.40</td>
<td>76.20</td>
<td>89.60</td>
<td>88.40</td>
</tr>
<tr>
<td style="text-align: left;">rarearena_rds</td>
<td>54.00</td>
<td>42.60</td>
<td>49.80</td>
<td>59.80</td>
<td>57.20</td>
</tr>
<tr>
<td style="text-align: left;">rarebench</td>
<td>59.60</td>
<td>52.80</td>
<td>60.60</td>
<td>65.30</td>
<td>62.80</td>
</tr>
<tr>
<td rowspan="10" style="vertical-align: middle;">Exams</td>
<td style="text-align: left;">cmexam</td>
<td>80.10</td>
<td>77.70</td>
<td>82.70</td>
<td>77.50</td>
<td>78.00</td>
</tr>
<tr>
<td style="text-align: left;">Pediatric Qualification Exam</td>
<td>78.48</td>
<td>74.68</td>
<td>84.81</td>
<td>76.58</td>
<td>78.48</td>
</tr>
<tr>
<td style="text-align: left;">Internal Medicine Qualification Exam</td>
<td>83.42</td>
<td>86.10</td>
<td>87.17</td>
<td>87.70</td>
<td>83.42</td>
</tr>
<tr>
<td style="text-align: left;">General Practice Qualification Exam</td>
<td>87.07</td>
<td>88.44</td>
<td>88.44</td>
<td>81.63</td>
<td>84.35</td>
</tr>
<tr>
<td style="text-align: left;">USMLE</td>
<td>78.00</td>
<td>67.20</td>
<td>76.70</td>
<td>85.90</td>
<td>87.10</td>
</tr>
<tr>
<td style="text-align: left;">medbullets</td>
<td>66.88</td>
<td>54.22</td>
<td>64.29</td>
<td>72.40</td>
<td>75.97</td>
</tr>
<tr>
<td style="text-align: left;">mediq</td>
<td>83.40</td>
<td>66.80</td>
<td>79.90</td>
<td>88.80</td>
<td>90.20</td>
</tr>
<tr>
<td style="text-align: left;">nejmqa</td>
<td>49.75</td>
<td>45.69</td>
<td>50.76</td>
<td>69.54</td>
<td>54.31</td>
</tr>
<tr>
<td style="text-align: left;">pubmedqa</td>
<td>75.20</td>
<td>76.40</td>
<td>75.60</td>
<td>77.00</td>
<td>77.60</td>
</tr>
<tr>
<td style="text-align: left;">redisqa</td>
<td>74.50</td>
<td>69.70</td>
<td>75.00</td>
<td>83.20</td>
<td>82.80</td>
</tr>
<tr>
<td rowspan="5" style="vertical-align: middle;">Basic Capabilities</td>
<td style="text-align: left;">mednli_dis</td>
<td>80.40</td>
<td>68.90</td>
<td>74.90</td>
<td>58.30</td>
<td>79.80</td>
</tr>
<tr>
<td style="text-align: left;">medcalc</td>
<td>56.00</td>
<td>31.40</td>
<td>37.90</td>
<td>52.60</td>
<td>49.00</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-anatomy</td>
<td>80.00</td>
<td>67.41</td>
<td>71.11</td>
<td>86.67</td>
<td>91.11</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-virology</td>
<td>54.82</td>
<td>56.02</td>
<td>53.01</td>
<td>54.22</td>
<td>57.23</td>
</tr>
<tr>
<td style="text-align: left;">MMLU-genetics</td>
<td>91.00</td>
<td>82.00</td>
<td>87.00</td>
<td>97.00</td>
<td>95.00</td>
</tr>
</tbody>
</table>
---
<a name="quick"></a>
# 🚀 Quick Start
### 🤗 Hugging Face Transformers
We recommend using the latest version of the Transformers library (at least 4.47.0). The following code snippet demonstrates how to use the **Baichuan-M1-14B-Instruct** model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load the pre-trained model and tokenizer
model_name = "baichuan-inc/Baichuan-M1-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# 2. Input prompt text
prompt = "May I ask you some questions about medical knowledge?"

# 3. Encode the input text for the model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# 4. Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# 5. Decode the generated text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# 6. Output the result
print("Generated text:")
print(response)
```
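To watch tokens as they are produced, a streamer can be attached to the same setup. This is a small optional addition that reuses `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```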
---
<a name="declare"></a>
# 📜 License and Statement
The use of the model must comply with the [《Baichuan-M1-14B模型社区许可协议》](https://github.com/baichuan-inc/Baichuan-M1-14B/blob/main/Baichuan-M1-14B模型社区许可协议.pdf) (Baichuan-M1-14B Model Community License Agreement).
The Baichuan development team has not developed any commercial applications based on this model. All users must comply with applicable laws and regulations and must not use the model for purposes that endanger national security or for any other illegal activity.
---
<a name="reference"></a>
# 🏷️ Reference
If you need to cite our work, please use the following reference:
```
@article{baichuan-m1-2025,
  title={Baichuan-M1: Pushing the Medical Capability of Large Language Models},
  author={Bingning Wang and Haizhou Zhao and Huozhi Zhou and Liang Song and Mingyu Xu and Wei Cheng and Xiangrong Zeng and Yupeng Zhang and Yuqi Huo and Zecheng Wang and Zhengyun Zhao and others},
  journal={arXiv preprint arXiv:2502.12671},
  year={2025}
}
```

56
config.json Normal file

@@ -0,0 +1,56 @@
{
"architectures": [
"BaichuanM1ForCausalLM"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_baichuan.BaichuanM1Config",
"AutoModelForCausalLM": "modeling_baichuan.BaichuanM1ForCausalLM"
},
"bos_token_id": 1,
"conv_window": 2,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 32768,
"model_max_length": 32768,
"model_type": "baichuan_m1",
"num_attention_heads": 20,
"num_hidden_layers": 40,
"num_key_value_heads": 2,
"num_swa_attention_heads": 40,
"num_swa_key_value_heads": 8,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 8192,
"sliding_window_layers": [
1,
3,
5,
7,
9,
11,
13,
15,
17,
19,
21,
23,
25,
27,
29,
31,
33,
35,
37,
39
],
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.48.1",
"use_cache": true,
"vocab_size": 133120
}
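Reading this file makes the hybrid attention layout explicit: the layers listed in `sliding_window_layers` (every odd-indexed layer) are presumably the ones that use the `num_swa_*` head counts with the 8192-token window, while the remaining layers use full attention with `num_attention_heads`/`num_key_value_heads`; the exact mapping lives in `modeling_baichuan.py`. A small sketch that assumes the file is available locally:

```python
import json

# List which layers use sliding-window attention and which use full attention.
with open("config.json") as f:
    cfg = json.load(f)

swa_layers = set(cfg["sliding_window_layers"])
for layer_idx in range(cfg["num_hidden_layers"]):
    kind = f"sliding-window ({cfg['sliding_window']})" if layer_idx in swa_layers else "full attention"
    print(f"layer {layer_idx:2d}: {kind}")
```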

119
configuration_baichuan.py Normal file

@@ -0,0 +1,119 @@
# coding=utf-8
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
class BaichuanM1Config(PretrainedConfig):
r"""
Configuration objects inherit from [`PretrainedConfig`] and control the behavior of model outputs. For more details,
refer to the documentation of [`PretrainedConfig`].
Args:
vocab_size (`int`, *optional*, defaults to 133120):
The size of the vocabulary used by the model.
hidden_size (`int`, *optional*, defaults to 5120):
The dimensionality of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 17408):
The dimensionality of the intermediate (MLP) representations.
num_hidden_layers (`int`, *optional*, defaults to 40):
The number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 40):
The number of attention heads for each attention layer in the Transformer encoder.
num_key_value_heads (`int`, *optional*, defaults to 2):
The number of key-value heads used to implement Grouped Query Attention (GQA).
- If `num_key_value_heads == num_attention_heads`, the model uses Multi-Head Attention (MHA).
- If `num_key_value_heads == 1`, the model uses Multi-Query Attention (MQA).
- Otherwise, the model uses Grouped Query Attention (GQA).
When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value heads are constructed
by mean-pooling the original heads within that group. For more details, refer to [this paper](https://arxiv.org/pdf/2305.13245.pdf).
If not specified, this defaults to `num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (either a string or a callable function) used in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 32768):
The maximum sequence length the model can handle.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated normal initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon value used by the RMS normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether the model should return the last key/value attentions. This is only relevant if `config.is_decoder=True`.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie the model's input and output word embeddings.
rope_theta (`float`, *optional*, defaults to 100000.0):
The base period of the Rotary Position Embeddings (RoPE).
use_sliding_window (`bool`, *optional*, defaults to `False`):
Whether to enable sliding window attention.
sliding_window (`int`, *optional*, defaults to 2048):
The size of the sliding window for sliding window attention (SWA).
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio applied to the attention probabilities.
"""
model_type = "baichuan"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=133120,
hidden_size=5120,
intermediate_size=17408,
num_hidden_layers=40,
num_attention_heads=40,
num_key_value_heads=2,
num_swa_attention_heads: int = 20,
num_swa_key_value_heads=8,
sliding_window_layers: list = None,
hidden_act="silu",
max_position_embeddings=32768,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
tie_word_embeddings=False,
rope_theta=100000.0,
sliding_window=2048,
attention_dropout=0.0,
conv_window = 2,
**kwargs,
):
self.sliding_window_layers = sliding_window_layers
self.num_swa_key_value_heads = num_swa_key_value_heads
self.num_swa_attention_heads = num_swa_attention_heads
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.sliding_window = sliding_window
# for backward compatibility
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.attention_dropout = attention_dropout
self.conv_window = conv_window
super().__init__(
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)

14
generation_config.json Normal file

@@ -0,0 +1,14 @@
{
"assistant_token_id": 74,
"bos_token_id": 1,
"do_sample": true,
"eos_token_id": 2,
"max_new_tokens": 2048,
"pad_token_id": 0,
"repetition_penalty": 1.05,
"temperature": 0.3,
"top_k": 5,
"top_p": 0.85,
"transformers_version": "4.48.1",
"user_token_id": 73
}
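These sampling defaults are picked up automatically when the model is loaded from the repository; if you want to construct them explicitly, an equivalent `GenerationConfig` looks like the following sketch:

```python
from transformers import GenerationConfig

# Same sampling defaults as the file above, built by hand.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    top_k=5,
    top_p=0.85,
    repetition_penalty=1.05,
    max_new_tokens=2048,
)
# outputs = model.generate(**model_inputs, generation_config=gen_config)
```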

BIN
model-00001-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00002-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00004-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00005-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

BIN
model-00006-of-00006.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

370
model.safetensors.index.json Normal file

@@ -0,0 +1,370 @@
{
"metadata": {
"total_size": 28941528640
},
"weight_map": {
"lm_head.weight": "model-00006-of-00006.safetensors",
"model.embed_tokens.weight": "model-00001-of-00006.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.10.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.10.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.11.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.12.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.13.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.13.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.13.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.16.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.17.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.18.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.input_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.19.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.20.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.20.self_attn.W_pack.weight": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.conv_k": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.conv_v": "model-00003-of-00006.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
"model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.22.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.23.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.24.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.24.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.25.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.25.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.26.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.26.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.input_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.27.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.W_pack.weight": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.27.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
"model.layers.28.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.28.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.28.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.28.self_attn.conv_k": "model-00004-of-00006.safetensors",
"model.layers.28.self_attn.conv_v": "model-00004-of-00006.safetensors",
"model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.29.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.29.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.30.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.30.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.30.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.31.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.31.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.32.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.32.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.33.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.33.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.input_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.34.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.34.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.35.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.35.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.35.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.35.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.35.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.35.self_attn.W_pack.weight": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.conv_k": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.conv_v": "model-00005-of-00006.safetensors",
"model.layers.35.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
"model.layers.36.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.36.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.36.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.37.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.37.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.38.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.38.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.input_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.39.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.W_pack.weight": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.conv_k": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.conv_v": "model-00006-of-00006.safetensors",
"model.layers.39.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
"model.layers.4.input_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.5.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.5.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.5.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.5.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.5.self_attn.W_pack.weight": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.conv_k": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.conv_v": "model-00001-of-00006.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
"model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
"model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.W_pack.weight": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.conv_k": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.conv_v": "model-00002-of-00006.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
"model.norm.weight": "model-00006-of-00006.safetensors"
}
}
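The index maps every tensor name to the shard that stores it, which makes it easy to inspect the checkpoint layout without loading any weights. A small sketch, assuming the repository files are available locally:

```python
import json

# Look up the shard that holds a given tensor before loading anything.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"] / 1e9, "GB of weights")          # ~28.9 GB
print(index["weight_map"]["model.layers.0.self_attn.conv_k"])          # shard filename
```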

1197
modeling_baichuan.py Normal file

File diff suppressed because it is too large

46
special_tokens_map.json Normal file

@@ -0,0 +1,46 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>",
"<B_SYS>",
"<B_USYS>",
"<C_Q>",
"<C_A>",
"<|im_sep|>",
"<|tool_call|>",
"<|arguments|>"
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"pad_token": "<pad>",
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}

231
tokenization_baichuan.py Normal file

@@ -0,0 +1,231 @@
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple
import sentencepiece as spm
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
from transformers.utils import logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {},
"tokenizer_file": {},
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
class BaichuanTokenizer(PreTrainedTokenizer):
"""
Construct a Baichuan tokenizer based on SentencePiece.
Args:
vocab_file (`str`):
Path to the vocabulary file.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
vocab_file,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token=None,
sp_model_kwargs: Optional[Dict[str, Any]] = None,
add_bos_token=True,
add_eos_token=False,
clean_up_tokenization_spaces=False,
**kwargs,
):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
self.vocab_file = vocab_file
self.add_bos_token = add_bos_token
self.add_eos_token = add_eos_token
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
add_bos_token=add_bos_token,
add_eos_token=add_eos_token,
sp_model_kwargs=self.sp_model_kwargs,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
self.pad_token_id = self._convert_token_to_id(self.pad_token)
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
return state
def __setstate__(self, d):
self.__dict__ = d
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(self.vocab_file)
@property
def vocab_size(self):
"""Returns vocab size"""
return self.sp_model.get_piece_size()
def get_vocab(self):
"""Returns vocab as a dict"""
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
vocab.update(self.added_tokens_encoder)
return vocab
def _tokenize(self, text):
"""Returns a tokenized string."""
return self.sp_model.encode(text, out_type=str)
def _convert_token_to_id(self, token):
"""Converts a token (str) to an id using the vocab."""
return self.sp_model.piece_to_id(token)
def _convert_id_to_token(self, index):
"""Converts an index (integer) to a token (str) using the vocab."""
token = self.sp_model.IdToPiece(index)
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (strings) into a single string."""
current_sub_tokens = []
out_string = ""
prev_is_special = False
for i, token in enumerate(tokens):
# make sure that special tokens are not decoded using sentencepiece model
if token in self.all_special_tokens:
if not prev_is_special and i != 0:
out_string += " "
out_string += self.sp_model.decode(current_sub_tokens) + token
prev_is_special = True
current_sub_tokens = []
else:
current_sub_tokens.append(token)
prev_is_special = False
out_string += self.sp_model.decode(current_sub_tokens)
return out_string
def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
"""
Save the vocabulary and special tokens file to a directory.
Args:
save_directory (`str`):
The directory in which to save the vocabulary.
Returns:
`Tuple(str)`: Paths to the files saved.
"""
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
copyfile(self.vocab_file, out_vocab_file)
elif not os.path.isfile(self.vocab_file):
with open(out_vocab_file, "wb") as fi:
content_spiece_model = self.sp_model.serialized_model_proto()
fi.write(content_spiece_model)
return (out_vocab_file,)
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
bos_token_id = [self.bos_token_id] if self.add_bos_token else []
eos_token_id = [self.eos_token_id] if self.add_eos_token else []
output = bos_token_id + token_ids_0 + eos_token_id
if token_ids_1 is not None:
output = output + bos_token_id + token_ids_1 + eos_token_id
return output
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
bos_token_id = [1] if self.add_bos_token else []
eos_token_id = [1] if self.add_eos_token else []
if token_ids_1 is None:
return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
return (
bos_token_id
+ ([0] * len(token_ids_0))
+ eos_token_id
+ bos_token_id
+ ([0] * len(token_ids_1))
+ eos_token_id
)
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
sequence pair mask has the following format:
```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
if token_ids_1 is None, only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
bos_token_id = [self.bos_token_id] if self.add_bos_token else []
eos_token_id = [self.eos_token_id] if self.add_eos_token else []
output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
if token_ids_1 is not None:
output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
return output
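A quick sketch of exercising the tokenizer defined above; it assumes the repository has been downloaded and the tokenizer loaded as `tokenizer`, for example via `AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)` as in the README's quick start:

```python
text = "What are the common symptoms of influenza?"
ids = tokenizer(text)["input_ids"]
print(len(ids), "tokens")
print(tokenizer.convert_ids_to_tokens(ids)[:10])   # SentencePiece pieces
print(tokenizer.decode(ids))                       # decodes back to (approximately) the input text
```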

BIN
tokenizer.model (Stored with Git LFS) Normal file

Binary file not shown.

389
tokenizer_config.json Normal file

@@ -0,0 +1,389 @@
{
"add_bos_token": false,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<pad>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"50": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"51": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"52": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"53": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"54": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"55": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"56": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"57": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"58": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"59": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"60": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"61": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"62": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"63": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"64": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"65": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"66": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"67": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"68": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"69": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"70": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"71": {
"content": "<B_SYS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"72": {
"content": "<B_USYS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"73": {
"content": "<C_Q>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"74": {
"content": "<C_A>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"75": {
"content": "<B_FUNC>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"76": {
"content": "<B_CODE>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"77": {
"content": "<B_APE>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"78": {
"content": "<function_calling>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"79": {
"content": "<calc_start>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"80": {
"content": "<calc_end>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"81": {
"content": "<inner_think>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"82": {
"content": "<|im_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"83": {
"content": "<|tool_call|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"84": {
"content": "<|arguments|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"85": {
"content": "<|o1_step|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"86": {
"content": "<|o1_answer|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"87": {
"content": "<tree_node>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"88": {
"content": "</tree_node>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>",
"<B_SYS>",
"<B_USYS>",
"<C_Q>",
"<C_A>",
"<|im_sep|>",
"<|tool_call|>",
"<|arguments|>"
],
"auto_map": {
"AutoTokenizer": [
"tokenization_baichuan.BaichuanTokenizer",
null
]
},
"bos_token": "<s>",
"chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{'<B_SYS>' + message['content']}}{% elif message['role'] == 'user_system' %}{{'<B_USYS>' + message['content']}}{% elif message['role'] == 'user' %}{{'<C_Q>' + message['content']}}{% elif message['role'] == 'assistant' %}{{'<C_A>' + message['content']}}{% elif message['role'] == 'function' %}{{'<B_FUNC>' + message['content']}}{% elif message['role'] == 'code' %}{{'<B_CODE>' + message['content']}}{% else %}{{ raise_exception('Invalid message role: ' + message['role']) }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<C_A>'}}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "</s>",
"extra_special_tokens": {},
"model_max_length": 32768,
"pad_token": "<pad>",
"sp_model_kwargs": {},
"tokenizer_class": "BaichuanTokenizer",
"unk_token": "<unk>",
"use_fast": false
}
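The `chat_template` above wraps each role in its control token (`<B_SYS>` for system, `<B_USYS>` for user_system, `<C_Q>` for user, `<C_A>` for assistant, `<B_FUNC>` for function, `<B_CODE>` for code) and, with `add_generation_prompt=True`, appends the assistant prefix `<C_A>`. A short sketch of the resulting prompt string, assuming the tokenizer has been loaded as `tokenizer` as in the README's quick start:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "May I ask you some questions about medical knowledge?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <B_SYS>You are a helpful assistant.<C_Q>May I ask you some questions about medical knowledge?<C_A>
```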