first commit

This commit is contained in:
xxl 2025-03-12 16:58:03 +08:00
parent f4ae513617
commit dcfedc8dc5
13 changed files with 253658 additions and 2 deletions

LICENSE (new file, 316 lines)
@@ -0,0 +1,316 @@
Instella-3B [RESEARCH-ONLY RAIL-MS]
Licensed Artifact(s):
- Model
- Source Code
Section I: PREAMBLE
BY ACCESSING, DOWNLOADING, INSTALLING, OR USING THE ARTIFACT, YOU AGREE
TO BE BOUND BY THIS LICENSE. IF YOU DO NOT AGREE TO ALL OF THE TERMS AND
CONDITIONS OF THIS LICENSE, DO NOT ACCESS, DOWNLOAD, INSTALL, OR USE THE
ARTIFACT.
1. Definitions
(a) “Application” refers to a sequence of instructions or statements
written in machine code language, including object code (that is the
product of a compiler), binary code (data using a two-symbol system)
or an intermediate language (such as register transfer language).
(b) “Artifact” refers to a software application (in either binary or
source code format), Model, and/or Source Code, in accordance with
what is specified above as the “Licensed Artifact”.
(c) “Contribution” means any work, including any modifications or
additions to an Artifact, that is intentionally submitted to
Licensor for inclusion or incorporation in the Artifact directly or
indirectly by the rights owner. For the purposes of this definition,
“submitted” means any form of electronic, verbal, or written
communication sent to the Licensor or its representatives, including
but not limited to communication on electronic mailing lists, source
code control systems, and issue tracking systems that are managed
by, or on behalf of, the Licensor for the purpose of discussing,
sharing and improving the Artifact, but excluding communication that
is conspicuously marked or otherwise designated in writing by the
contributor as “Not a Contribution.”
(d) “Contributor” means Licensor or any other individual or legal entity
that creates or owns a Contribution that is added to or incorporated
into an Artifact or its Derivative.
(e) “Data” means a collection of information and/or content extracted
from the dataset used with a given Model, including to train,
pretrain, or otherwise evaluate the Model. The Data is not licensed
under this License.
(f) “Derivative” means a work derived from or based upon an Artifact,
and includes all modified versions of such Artifact.
(g) “Distribution” means any transmission, reproduction, publication or
other sharing of an Artifact or Derivative to a Third Party,
including providing a hosted service incorporating the Artifact,
which is made available by electronic or other remote means -
e.g. API-based or web access.
(h) “Harm” includes but is not limited to physical, mental,
psychological, financial and reputational damage, pain, or loss.
(i) “License” means the terms and conditions for use, reproduction, and
Distribution as defined in this document.
(j) “Licensor” means the rights owner (by virtue of creation or
documented transfer of ownership) or entity authorized by the rights
owner (e.g., exclusive licensee) that is granting the rights in this
License.
(k) “Model” means any machine-learning based assembly or assemblies
(including checkpoints), consisting of learnt weights, parameters
(including optimizer states), corresponding to the model
architecture as embodied in the Source Code.
(l) “Output” means the results of operating a Model as embodied in
informational content resulting therefrom.
(m) “Permitted Purpose” means for academic or research purposes only.
(n) “Source Code” means any collection of text written using
human-readable programming language, including the code and scripts
used to define, run, load, benchmark or evaluate a Model or any
component thereof, and/or used to prepare data for training or
evaluation, if any. Source Code includes any accompanying
documentation, tutorials, examples, etc., if any. For clarity, the
term “Source Code” as used in this License includes any and all
Derivatives of such Source Code.
(o) “Third Parties” means individuals or legal entities that are not
under common control with Licensor or You.
(p) “Use” includes accessing, using, copying, modifying, and/or
distributing an Artifact; in connection with a Model as Artifact,
Use also includes creating content, fine-tuning, updating, running,
training, evaluating and/or re-parametrizing such Model.
(q) “You” (or “Your”) means an individual or legal entity receiving and
exercising permissions granted by this License and/or making use of
the Artifact for permitted purposes and in any permitted field of
use, including usage of the Artifact in an end-use application -
e.g. chatbot, translator, image generator, etc.
Section II: INTELLECTUAL PROPERTY RIGHTS
Both copyright and patent grants may apply to the Artifact. The Artifact
is subject to additional terms and conditions as described in Section III
below.
2. Grant of Copyright License. Conditioned upon compliance with Section
III below and subject to the terms and conditions of this License, each
Contributor hereby grants to You, only in connection with the Permitted
Purpose, a worldwide, non-exclusive, royalty-free copyright license to
reproduce, use, publicly display, publicly perform, sublicense, and
distribute the Artifact and Derivatives thereof.
3. Grant of Patent License. Conditioned upon compliance with Section III
below and subject to the terms and conditions of this License, and only
where and as applicable, each Contributor hereby grants to You, only in
connection with the Permitted Purpose, a worldwide, non-exclusive,
royalty-free, irrevocable (except as stated in this paragraph) patent
license to make, have made, use, sell, offer to sell, import, and
otherwise transfer the Artifact where such license applies only to those
patent claims licensable by such Contributor that are necessarily
infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Artifact to which such Contribution(s) was
submitted. If You institute patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the
Artifact and/or a Contribution incorporated within the Artifact
constitutes direct or contributory patent infringement, then any patent
licenses granted to You under this License in connection with the
Artifact shall terminate as of the date such litigation is asserted or
filed.
Licensor and Contributor each have the right to grant the licenses
above.
Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
4. Use-based Restrictions. The restrictions contained in the AMD
Responsible AI Use Policy set forth in Attachment A are mandatory Use-
based restrictions. Therefore You may not Use the Artifact in violation
of such restrictions. You may Use the Artifact only subject to this
License; if Section II is held unenforceable or inapplicable, this
Section III will continue to govern any use of the Artifact. You shall
require all of Your users who Use the Artifact or its Derivative
to comply with the terms and conditions of this License, including
those contained in this paragraph, and only for the Permitted Purpose.
5. The Output You Generate with a Model (as Artifact). Except as set
forth herein, Licensor claims no rights in the Output You generate. You
are accountable for the Output You generate and its subsequent uses. No
use of the Output may contravene any provision as stated in this
License.
6. Distribution and Redistribution. You may host for Third Party remote
access purposes (e.g. software-as-a-service), reproduce and distribute
copies of the Artifact or its Derivatives in any medium, with or without
modifications, provided that You meet the following conditions:
6.1. Use-based restrictions in paragraph 4 MUST be included as a
condition precedent to effect any type of legal agreement (e.g. a
license) governing the use and/or distribution of the Artifact or
its Derivatives, and You shall give such notice to any subsequent
Third Party recipients;
6.2. You shall give any Third Party recipients of the Artifact or its
Derivatives a copy of this License;
6.3. You shall cause any modified files to carry prominent notices
stating that You changed the files;
6.4. You shall retain all copyright, patent, trademark, and attribution
notices excluding those notices that do not pertain to any part of
the Artifact or its Derivatives.
6.5. You and any Third Party recipients of the Artifact or its
Derivative shall adhere to the Permitted Purpose.
You may add Your own copyright statement to Your modifications and may
provide additional or different license terms and conditions with
respect to paragraph 6.1., to govern the use, reproduction, or
Distribution of Your modifications, or for any Derivative, provided that
Your use, reproduction, and Distribution of the Artifact or its
Derivative otherwise complies with the conditions stated in this
License. In other words, the Use-based restrictions in Attachment A form
the minimum set of terms for You to license to Third Parties any
Artifact or its Derivative, but You may add more restrictive terms if
You deem it necessary.
Section IV: OTHER PROVISIONS
7. Updates and Runtime Restrictions. To the maximum extent permitted by
law, Licensor reserves the right to restrict (remotely or otherwise)
usage of the Artifact in violation of this License or update the
Artifact through electronic means.
8. Trademarks and Related. Nothing in this License permits You to make
use of Licensor's trademarks, trade names, or logos, or to otherwise suggest
endorsement or misrepresent the relationship between the parties; and
any rights not expressly granted herein are reserved by the Licensors.
9. Disclaimer of Warranty. Unless required by applicable law or agreed
to in writing, Licensor provides the Artifact (and each Contributor
provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied, including, without
limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT,
MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely
responsible for determining the appropriateness of using the Artifact,
and assume any risks associated with Your exercise of permissions under
this License.
10. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise, unless
required by applicable law (such as deliberate and grossly negligent
acts) or agreed to in writing, shall any Contributor be liable to You
for damages, including any direct, indirect, special, incidental, or
consequential damages of any character arising as a result of this
License or out of the use or inability to use the Artifact (including
but not limited to damages for loss of goodwill, work stoppage, computer
failure or malfunction, or any and all other commercial damages or
losses), even if such Contributor has been advised of the possibility of
such damages.
11. If any provision of this License is held to be invalid, illegal or
unenforceable, the remaining provisions shall be unaffected thereby and
remain valid as if such provision had not been set forth herein.
12. Term and Termination. The term of this License will commence upon
the earlier of Your (a) acceptance of this License or (b) accessing the
Artifact; and will continue in full force and effect until terminated in
accordance with the terms and conditions herein. Licensor may terminate
this License if You are in breach of any term or condition of this
License. Upon termination of this License, all licenses granted to You
will terminate and You must promptly delete and cease use of the
Artifact. Sections 1, 7, 8, 9, 10, 11, and 12 survive termination of
this License.
END OF TERMS AND CONDITIONS
Attachment A
AMD Responsible AI Use Policy
AMD is committed to the responsible use of its Artificial Intelligence
(AI) products and technologies (“AMD AI”). AMD AI may include
artificial intelligence or machine learning technologies that use
algorithms to analyze data and generate output using predictions based
on patterns in data. This policy explains the uses that AMD
specifically prohibits.
If you use any AMD AI, you are agreeing to use the AMD AI in compliance
with applicable laws and not for any of the following prohibited uses.
Prohibited Uses:
1) No Illegal Acts. Do not use AMD AI in violation of any applicable
national, state, local, or other jurisdictional law, rule, regulation,
or sanction.
2) No Explicit Content. Do not use AMD AI to submit (as input),
generate, or disseminate content depicting violent or sexually explicit
content or to create sexual chatbots.
3) No Harm. Do not use AMD AI for any potentially harmful uses,
including fraud, deception, discrimination, abuse, or harassment,
including the following:
a) Harm or abuse of a minor, including grooming and child sexual
exploitation.
b) Impersonation of human beings for purposes of deception.
c) Generation or dissemination of information you know to be false
for the purpose of harming others.
d) Intentionally defame, disparage, or otherwise harass others.
e) Intentionally attempting to materially distort the behavior of a
person in a manner that causes or is likely to cause that person
or another person physical or psychological harm.
f) Providing medical advice or interpretation of medical results that
is intended to be a substitute for professional medical advice,
diagnosis, or treatment.
g) Engaging in the unlawful or unauthorized practice of any
profession, including financial, legal, medical, health, or
related professional practices.
h) Judgment of, discrimination against, or harm to individuals or
groups based on legally protected characteristics or categories,
online or offline social behavior, or known or predicted personal
or personality characteristics, including any of the foregoing
uses in social credit systems.
4) No High-Risk Activity. Do not use AMD AI in any high-risk activities
or applications that create a risk of personal injury, death, or
severe property or environmental damage, including in weapons or
military applications.
5) No Personal Information. Do not use AMD AI to collect, process, or
disclose personal data, including health or sensitive personal
information, without the necessary rights or consents.
6) No Infringement. Do not use AMD AI to generate or disseminate any
information that infringes upon or misappropriates the intellectual
property rights of others, including copyright, trademark, patent, and
trade secret rights, rights to privacy, and publicity rights.
7) No Malware. Do not use AMD AI to generate or disseminate malware or
any other content to be used for the purpose of facilitating unpermitted
access to, or use of, computer systems or data.
8) No Obfuscation. Do not inappropriately obfuscate or fail to disclose
to end users the presence of AI in any application in which AMD AI is
deployed, along with any known risks or dangers of using AI without
appropriate safeguards, oversight and human control.
9) No Reliance. Do not rely on any information generated using AMD AI
without assessing it for accuracy, potential for harm, or other specific
risks applicable to the use case.

NOTICES (new file, 209 lines)
@@ -0,0 +1,209 @@
NOTICES Instella-3B
Dependencies on allenai_OLMo (Apache-2.0), Copyright Allen Institute for AI
Copyright Statements
# Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
License Text: https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
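As the appendix above notes, the boilerplate should be enclosed in the comment syntax appropriate to the file format. A minimal sketch of what that looks like at the top of a Python source file (the year and owner name are hypothetical placeholders, not values taken from this repository):

```python
# Copyright 2025 Example Org  (hypothetical placeholder values)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

The same text would use `//` comments in a C-family file or `<!-- -->` in XML/HTML; only the comment syntax changes, not the wording.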
Standard License Header
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Dependencies on Qwen2.5-72B-Instruct (Qwen LICENSE AGREEMENT, Apache-2.0)
Copyright Statements
Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
# Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
License Text
Qwen LICENSE AGREEMENT
Release Date: September 19, 2024
By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Qwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We" (or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
3. Redistribution
You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that you changed the files;
c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, you shall request a license from us. You cannot exercise your rights under this Agreement without our express authorization.
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
7. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT IS CAUSED.
d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
---------------------------------------------------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

---
license: other
license_link: LICENSE
pipeline_tag: text-generation
library_name: transformers
---
# Instella✨: Fully Open Language Models with Stellar Performance
Instella-3B-Instruct
AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.
<div align="center">
<img src="scaling_perf_instruct.png" style="object-fit: contain;"/>
<em><b>Figure 1:</b> Pareto frontier of pre-training tokens vs average performance for pre-trained and instruction-tuned models.</em>
</div>
By training Instella from scratch on Instinct MI300X GPUs, we highlight our hardware's capability and scalability in handling demanding large-scale AI training workloads, offering a viable alternative in the AI hardware landscape. In line with AMD's commitment to open source, we are releasing all artifacts related to Instella models [here](#additional-resources), including the model weights, detailed training configurations, datasets, and code, enabling the AI community to collaborate, replicate, and innovate, thereby accelerating progress.
## Takeaways
- **Announcing Instella**, a series of 3 billion parameter language models developed by AMD, trained from scratch on 128 Instinct MI300X GPUs.
- **Instella models significantly outperform existing fully open LMs** (Figure 1) of comparable size, and bridge the gap between fully open and open-weight models by achieving competitive performance compared to state-of-the-art open-weight models and their instruction-tuned counterparts.
- **Fully open and accessible**: fully open-source release of model weights, training hyperparameters, datasets, and code, fostering innovation and collaboration within the AI community.
- Supported by the AMD ROCm software stack, Instella employs efficient training techniques such as **FlashAttention-2, Torch Compile, and Fully Sharded Data Parallelism (FSDP)** with hybrid sharding to **scale model training over a large cluster.**
## Instella Models
In this release, we introduce the following Instella models:
<div align="center">
| Model | Stage | Training Data (Tokens) | Description |
| :----: | :----: | :----: | :---- |
| [Instella-3B-Stage1](https://huggingface.co/amd/Instella-3B-Stage1) | Pre-training (Stage 1) | 4.065 Trillion | First stage pre-training to develop proficiency in natural language. |
| [Instella-3B](https://huggingface.co/amd/Instella-3B) | Pre-training (Stage 2) | 57.575 Billion | Second stage pre-training to further enhance problem-solving capabilities. |
| [Instella-3B-SFT](https://huggingface.co/amd/Instella-3B-SFT) | SFT | 8.902 Billion (x3 epochs) | Supervised fine-tuning (SFT) to enable instruction-following capabilities. |
| [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-instruct) | DPO | 760 Million | Alignment with human preferences and stronger chat capabilities via direct preference optimization (DPO). |
| | **Total:** | **4.15 Trillion** | |
<em><b>Table 1:</b> Instella models and training stages.</em>
</div>
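The per-stage token counts in Table 1 can be sanity-checked with a quick sum (counting the SFT tokens once per epoch):

```python
# Verify the "Total" row in Table 1 (all counts in billions of tokens).
stage_tokens_b = {
    "Pre-training Stage 1": 4065.0,   # 4.065 Trillion
    "Pre-training Stage 2": 57.575,
    "SFT": 8.902 * 3,                 # 8.902 B tokens seen over 3 epochs
    "DPO": 0.760,
}
total_t = sum(stage_tokens_b.values()) / 1000  # convert billions -> trillions
print(f"Total training tokens: {total_t:.2f}T")  # → 4.15T
```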
The Instella models are text-only, autoregressive transformer-based LMs having 3 billion parameters. Architecture-wise, Instella is packed with 36 decoder layers, each having 32 attention heads. These models support a sequence length of up to 4,096 tokens and have a vocabulary size of ~50,000 tokens using the OLMo tokenizer. During both pre-training and fine-tuning, we utilized FlashAttention-2, Torch Compile, and bfloat16 mixed-precision training to reduce memory usage, leading to computational speedups and optimal resource utilization. To balance inter-node memory efficiency and intra-node communication overhead within our cluster, we employed fully sharded data parallelism (FSDP) with hybrid sharding, with model parameters, gradients, and optimizer states sharded within a node and replicated across the nodes.
Our training pipeline is based on the open-source OLMo codebase, adapted and optimized for our hardware and model architecture. For pre-training we used a total of 128 Instinct MI300X GPUs distributed across 16 nodes (8x Instinct MI300X GPUs per node). We evaluated our models and baselines using standard tasks from [OLMES](https://github.com/allenai/olmes/tree/main), [FastChat MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md), and [Alpaca](https://github.com/tatsu-lab/alpaca_eval/tree/main). For more details about the architecture, training pipeline/hyperparameters, and evaluation results, please refer to our [Blog](https://rocm.blogs.amd.com/artificial-intelligence/introducing-instella-3B/README.html), [Hugging Face model card](https://huggingface.co/amd/Instella-3B) and [GitHub repository](https://github.com/AMD-AIG-AIMA/Instella).
## Training Pipeline
The training of the Instella models comprised four stages, each incrementally enhancing the model's capabilities, from fundamental natural language understanding to instruction following and alignment with human preferences.
### Model Summary
| Stage | Model | Training Tokens | Layers | Attention Heads | Model Hidden Size | MLP Hidden Size | Context Length | RoPE Theta |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Pre-training | Instella-3B-stage1 | 4.065T | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| Pre-training | Instella-3B | 57.575B | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| SFT | Instella-3B-SFT | 8.902B (x3) | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| SFT+DPO | Instella-3B-instruct | 760M | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
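The 3B model size can be roughly reproduced from the architecture numbers in the table above. The sketch below makes a few assumptions not stated in the table (they follow common OLMo-style conventions, but are assumptions nonetheless): a SwiGLU MLP in which the listed MLP hidden size of 13824 counts the gate and up projections together, untied input/output embeddings, and a vocabulary of 50304 tokens (the "~50,000" mentioned earlier):

```python
# Rough parameter-count estimate from the Model Summary table.
# ASSUMPTIONS (not in the table): SwiGLU MLP where mlp_hidden counts gate+up
# combined, untied input/output embeddings, vocab size 50304 ("~50,000").
layers, d_model, mlp_hidden, vocab = 36, 2560, 13824, 50304

attn = 4 * d_model * d_model                                # Q, K, V, output projections
mlp = d_model * mlp_hidden + (mlp_hidden // 2) * d_model    # gate+up, then down projection
embed = 2 * vocab * d_model                                 # untied input + output embeddings

total = layers * (attn + mlp) + embed
print(f"~{total / 1e9:.2f}B parameters")  # → ~3.11B, matching the size column in Table 2
```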
### Hyperparameters
|Stage | Optimizer | Peak LR | LR Scheduler | Alpha F | Warmup (steps) | Weight Decay | Decay Norm & Bias | Decay Embedding | Batch Size (Tokens) | Epochs |
|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| Pretraining Stage 1 | AdamW(0.9,0.95) | 4.0e-4 | cosine_with_warmup | 0.1 | 2000 | 0.1 | True | True | 4M | 1 |
| Pretraining Stage 2 | AdamW(0.9,0.95) | 4.0e-5 | cosine_with_warmup | 0.0 | 0 | 0.1 | True | True | 4M | 1 |
| SFT | AdamW(0.9,0.95) | 1.0e-5 | linear_with_warmup | 0.001 | 500 | 0.1 | True | True | 0.5M | 3 |
| DPO | AdamW(0.9,0.95) | 5.0e-7 | linear | -- | 10% | 0.1 | -- | -- | 0.25M | 1 |
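As an illustration, the Stage 1 `cosine_with_warmup` schedule above (peak LR 4.0e-4, 2000 warmup steps, final LR of Alpha F × peak = 0.1 × peak) could be sketched as follows; note that `total_steps` here is an illustrative placeholder, not a value from the table:

```python
import math

def cosine_with_warmup(step, peak_lr=4.0e-4, warmup=2000, total_steps=100_000, alpha_f=0.1):
    """Linear warmup to peak_lr, then cosine decay to alpha_f * peak_lr.

    A sketch of the Stage 1 pre-training schedule from the table above;
    total_steps is a placeholder, not a published value.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    min_lr = alpha_f * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

print(cosine_with_warmup(1000))     # halfway through warmup → 2.0e-4
print(cosine_with_warmup(2000))     # end of warmup → peak LR 4.0e-4
print(cosine_with_warmup(100_000))  # end of training → 4.0e-5 (alpha_f * peak)
```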
## Getting Started
### Installation
First, install [PyTorch](https://pytorch.org) according to the instructions specific to your operating system. For AMD GPUs, you can also start from a [rocm/pytorch](https://hub.docker.com/r/rocm/pytorch/tags?name=pytorch) Docker image.
To install from source (recommended for training/fine-tuning) run:
```bash
git clone https://github.com/AMD-AIG-AIMA/Instella.git
cd Instella
# install Flash-Attention on MI300X
GPU_ARCH=gfx942 MAX_JOBS=$(nproc) pip install git+https://github.com/Dao-AILab/flash-attention.git -v
# install other dependencies
pip install -e .[all]
```
### Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "amd/Instella-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)
prompt = [{"role": "user", "content": "What are the benefits of open-source AI research?"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
### Chat in TRL
You can also use the TRL CLI to chat with the model from the terminal:
```bash
pip install trl
trl chat --model_name_or_path amd/Instella-3B-Instruct --trust_remote_code --max_new_tokens 1024
# <root>:
# which is bigger 9.8 or 9.11?
# <amd/Instella-3B-Instruct>:
# 9.8 is bigger than 9.11. The difference between the two numbers is 0.69 (9.8 - 9.11 = 0.69), which indicates that 9.8 is 0.69 units larger than 9.11.
```
## Results
### Pre-training
<div class="table-wrapper" align="center">
<table>
<thead>
<tr>
<th>Models</th>
<th>Size</th>
<th>Training Tokens</th>
<th>Avg</th>
<th>ARC Challenge</th>
<th>ARC Easy</th>
<th>BoolQ</th>
<th>Hellaswag</th>
<th>PiQA</th>
<th>SciQ</th>
<th>Winogrande</th>
<th>OpenBookQA</th>
<th>MMLU</th>
<th>BBH (3-shot)</th>
<th>GSM8k (8-shot)</th>
</tr>
</thead>
<tbody>
<tr>
<th colspan="15">Open Weight Models</th>
</tr>
<tr>
<td>Gemma-2-2B</td>
<td>2.61B</td>
<td>~2T</td>
<td>59.34</td>
<td>39.46</td>
<td>59.30</td>
<td>74.50</td>
<td>70.50</td>
<td>76.40</td>
<td><strong>96.60</strong></td>
<td>69.80</td>
<td>44.80</td>
<td>53.28</td>
<td>40.75</td>
<td>27.37</td>
</tr>
<tr>
<td>Llama-3.2-3B</td>
<td>3.21B</td>
<td>~9T</td>
<td>62.51</td>
<td>47.16</td>
<td>64.91</td>
<td>74.80</td>
<td>73.10</td>
<td>75.90</td>
<td>95.30</td>
<td>70.30</td>
<td>51.20</td>
<td>57.81</td>
<td><ins>47.00</ins></td>
<td>30.10</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>3.09B</td>
<td>~18T</td>
<td><strong>68.30</strong></td>
<td>51.51</td>
<td>67.19</td>
<td><strong>79.10</strong></td>
<td>72.10</td>
<td>77.40</td>
<td>95.50</td>
<td>69.30</td>
<td><ins>51.40</ins></td>
<td><strong>67.22</strong></td>
<td><strong>56.69</strong></td>
<td><strong>63.84</strong></td>
</tr>
<tr>
<th colspan="15">Fully Open Models</th>
</tr>
<tr>
<td>Pythia-2.8b</td>
<td>2.91B</td>
<td>300B</td>
<td>49.83</td>
<td>40.47</td>
<td>60.70</td>
<td>64.80</td>
<td>60.10</td>
<td>72.50</td>
<td>89.70</td>
<td>60.80</td>
<td>42.60</td>
<td>26.09</td>
<td>27.69</td>
<td>2.73</td>
</tr>
<tr>
<td>GPTNeo-2.7B</td>
<td>2.72B</td>
<td>~420B</td>
<td>47.96</td>
<td>38.46</td>
<td>54.56</td>
<td>62.70</td>
<td>55.20</td>
<td>70.80</td>
<td>88.00</td>
<td>58.30</td>
<td>40.80</td>
<td>27.83</td>
<td>27.25</td>
<td>3.71</td>
</tr>
<tr>
<td>OpenELM-3B</td>
<td>3.04B</td>
<td>~1.5T</td>
<td>52.28</td>
<td>37.46</td>
<td>58.42</td>
<td>68.60</td>
<td>71.70</td>
<td>75.60</td>
<td>92.50</td>
<td>65.40</td>
<td>46.40</td>
<td>26.69</td>
<td>29.40</td>
<td>2.96</td>
</tr>
<tr>
<td>StableLM-3B-4E1T</td>
<td>2.8B</td>
<td>~4T</td>
<td>58.51</td>
<td>44.82</td>
<td>67.02</td>
<td>75.40</td>
<td><ins>74.20</ins></td>
<td><strong>78.40</strong></td>
<td>93.40</td>
<td>68.40</td>
<td>48.60</td>
<td>45.19</td>
<td>37.33</td>
<td>10.84</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/amd/Instella-3B-Stage1">Instella-3B-Stage1</a></strong></td>
<td>3.11B</td>
<td>~4T</td>
<td>61.33</td>
<td><strong>53.85</strong></td>
<td><strong>73.16</strong></td>
<td><ins>78.70</ins></td>
<td><ins>74.20</ins></td>
<td>77.50</td>
<td>94.90</td>
<td><ins>71.20</ins></td>
<td><ins>51.40</ins></td>
<td>54.69</td>
<td>34.30</td>
<td>10.77</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/amd/Instella-3B">Instella-3B</a></strong></td>
<td>3.11B</td>
<td>~4T+60B</td>
<td><ins>66.59</ins></td>
<td><ins>52.84</ins></td>
<td><ins>70.53</ins></td>
<td>76.50</td>
<td><strong>75.00</strong></td>
<td><ins>77.80</ins></td>
<td><ins>96.40</ins></td>
<td><strong>73.10</strong></td>
<td><strong>52.40</strong></td>
<td><ins>58.31</ins></td>
<td>39.74</td>
<td><ins>59.82</ins></td>
</tr>
</tbody>
</table>
<em><strong>Table 2:</strong> Pre-trained model performance on standard benchmarks. Here <strong>Bold</strong> represents the best performance, and <ins>Underline</ins> represents the second best performance.</em>
</div>
- Both Instella-3B-Stage1 and Instella-3B outperform all other fully open models on every benchmark individually (except PiQA). **Our final pre-trained checkpoint Instella-3B outperforms the existing top-performing fully open pre-trained models by a lead of ⬆8.08% on average**, with significant improvements in `ARC Challenge [+8.02%], ARC Easy [+3.51%], Winogrande [+4.7%], OpenBookQA [+3.88%], MMLU [+13.12%] and GSM8K [+48.98%]`.
- **Second stage pre-training elevated the overall average performance relative to stage 1 by ⬆5.26%**, substantially narrowing the performance gap between Instella-3B and state-of-the-art open-weight models: it **outperforms Llama-3.2-3B by ⬆4.08% on average** (`+5.69% [ARC Challenge], +5.61% [ARC Easy], and +29.72% [GSM8k]`) and **Gemma-2-2B by ⬆7.25% on average** (`+13.38% [ARC Challenge], +11.23% [ARC Easy], +4.5% [Hellaswag], +7.6% [OpenBookQA], +5.03% [MMLU], and +32.45% [GSM8k]`), and is **competitive with Qwen-2.5-3B** on the majority of the benchmarks.
- The multi-stage pre-training with a diverse, high-quality data mix significantly enhanced Instella-3B's capabilities, establishing it as a competitive and open alternative among language models of comparable size.
### Instruction-tuning Results
<div class="table-wrapper" align="center">
<table>
<thead>
<tr>
<th>Models</th>
<th>Size</th>
<th>Training Tokens</th>
<th>Avg</th>
<th>MMLU</th>
<th>TruthfulQA</th>
<th>BBH</th>
<th>GPQA</th>
<th>GSM8K</th>
<th>Minerva MATH</th>
<th>IFEval</th>
<th>AlpacaEval 2</th>
<th>MT-Bench</th>
</tr>
</thead>
<tbody>
<tr>
<th colspan="13">Open Weight Models</th>
</tr>
<tr>
<td>Gemma-2-2B-Instruct</td>
<td>2.61B</td>
<td>~2T</td>
<td>39.04</td>
<td>58.35</td>
<td><ins>55.76</ins></td>
<td>42.96</td>
<td>25.22</td>
<td>53.45</td>
<td>22.48</td>
<td>55.64</td>
<td><strong>29.41</strong></td>
<td><strong>8.07</strong></td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>3.21B</td>
<td>~9T</td>
<td><ins>47.53</ins></td>
<td><ins>61.50</ins></td>
<td>50.23</td>
<td><strong>61.50</strong></td>
<td><ins>29.69</ins></td>
<td><strong>77.03</strong></td>
<td><ins>46.00</ins></td>
<td><strong>75.42</strong></td>
<td>19.31</td>
<td>7.13</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>3.09B</td>
<td>~18T</td>
<td><strong>48.72</strong></td>
<td><strong>66.90</strong></td>
<td><strong>57.16</strong></td>
<td><ins>57.29</ins></td>
<td>28.13</td>
<td><ins>75.97</ins></td>
<td><strong>60.42</strong></td>
<td>62.48</td>
<td><ins>22.12</ins></td>
<td><ins>8.00</ins></td>
</tr>
<tr>
<th colspan="13">Fully Open Models</th>
</tr>
<tr>
<td>StableLM-zephyr-3B</td>
<td>2.8B</td>
<td>4T</td>
<td>30.50</td>
<td>45.10</td>
<td>47.90</td>
<td>39.32</td>
<td>25.67</td>
<td>58.38</td>
<td>10.38</td>
<td>34.20</td>
<td>7.51</td>
<td>6.04</td>
</tr>
<tr>
<td>OpenELM-3B-Instruct</td>
<td>3.04B</td>
<td>~1.5T</td>
<td>14.11</td>
<td>27.36</td>
<td>38.08</td>
<td>24.24</td>
<td>18.08</td>
<td>1.59</td>
<td>0.38</td>
<td>16.08</td>
<td>0.21</td>
<td>1.00</td>
</tr>
<tr>
<td><a href="https://huggingface.co/amd/Instella-3B-SFT">Instella-3B-SFT</a></td>
<td>3.11B</td>
<td>~4T</td>
<td>42.05</td>
<td>58.76</td>
<td>52.49</td>
<td>46.00</td>
<td>28.13</td>
<td>71.72</td>
<td>40.50</td>
<td>66.17</td>
<td>7.58</td>
<td>7.07</td>
</tr>
<tr>
<td><a href="https://huggingface.co/amd/Instella-3B-Instruct">Instella-3B-Instruct</a></td>
<td>3.11B</td>
<td>~4T</td>
<td>44.87</td>
<td>58.90</td>
<td>55.47</td>
<td>46.75</td>
<td><strong>30.13</strong></td>
<td>73.92</td>
<td>42.46</td>
<td><ins>71.35</ins></td>
<td>17.59</td>
<td>7.23</td>
</tr>
</tbody>
</table>
<em><strong>Table 3:</strong> Instruct model performance on standard benchmarks. Here <strong>Bold</strong> represents the best performance, and <ins>Underline</ins> represents the second best performance.</em>
</div>
- **Instella-3B-Instruct consistently outperforms other fully open models across all evaluated benchmarks, with a significant average score lead of ⬆️14.37%** over the next best fully open instruction-tuned model, and with substantial margins across all the chat benchmarks (`+13% [MMLU], +7.57% [TruthfulQA], +7.43% [BBH], +4.46% [GPQA], +37.15% [IFEval], +10.08% [AlpacaEval 2], and +1.2 [MT-Bench]`).
- **Instella-3B-Instruct narrows the performance gap with leading open-weight models.** Instella-3B-Instruct performs **on par with or slightly surpasses existing state-of-the-art open weight instruction-tuned models** such as Llama-3.2-3B-Instruct (`+5.24% [TruthfulQA], 0.45% [GPQA], and +0.1% [MT-Bench]`), and Qwen2.5-3B-Instruct (`+2.01% [GPQA] and +8.87% [IFEval]`), while significantly outperforming Gemma-2-2B-Instruct with an average score lead of ⬆5.83% (`+0.55% [MMLU], +3.79 [BBH], +4.91 [GPQA], +20.47 [GSM8k], +19.98 [Minerva MATH], and +15.17% [IFEval]`).
- **Overall, Instella-3B-Instruct excels at instruction-following and question-answering tasks such as TruthfulQA, GPQA, IFEval, and MT-Bench**, and remains highly competitive with existing state-of-the-art open-weight models on other knowledge-recall and math benchmarks despite being trained on significantly fewer tokens.
## Training Data
| Stage | Model | Dataset | License |
| :---- | :---- | :---- | :---- |
| Pre-training Stage 1 | Instella-3B-stage1 | [https://huggingface.co/datasets/allenai/OLMoE-mix-0924](https://huggingface.co/datasets/allenai/OLMoE-mix-0924) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/allenai/dolmino-mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | Refer source materials |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/TIGER-Lab/WebinstructSub](https://huggingface.co/datasets/TIGER-Lab/WebinstructSub) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | MIT |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/python-edu](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/python-edu) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://github.com/google-deepmind/mathematics_dataset](https://github.com/google-deepmind/mathematics_dataset) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic) | [LICENSE](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/LICENSE) |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/nvidia/OpenMathinstruct-2](https://huggingface.co/datasets/nvidia/OpenMathinstruct-2) | CC-BY-4.0 |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/cais/mmlu](https://huggingface.co/datasets/cais/mmlu) | MIT |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | Apache-2.0 |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/GAIR/o1-journey](https://huggingface.co/datasets/GAIR/o1-journey) | Refer source materials |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following (subset of Tulu3)](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) | ODC-BY-1.0 |
| DPO | Instella-3B-instruct | [https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) | ODC-BY-1.0 |
> [!NOTE]
> Further information concerning the training datasets, including applicable licensing terms and use restrictions, may be located at the linked source location.
## Conclusion
The release of the Instella family of models represents a significant stride in advancing open-source AI and in demonstrating the capabilities of AMD hardware for large-scale language model training. The 3-billion-parameter Instella models significantly outperform existing fully open models of comparable size on key benchmarks while remaining competitive with comparable open-weight models, which we attribute to the high-quality data-mix selection, the multi-stage training pipeline, and the use of high-performance Instinct MI300X GPUs for large-scale training.
By fully open-sourcing the Instella models, including weights, training configurations, datasets, and code, we aim to foster innovation and collaboration within the AI community. We believe that transparency, reproducibility, and accessibility are key drivers of progress in AI research and development. We invite developers, researchers, and AI enthusiasts to explore Instella, contribute to its ongoing improvement, and join us in pushing the boundaries of what is possible with language models.
We will continue enhancing the models across multiple dimensions, including context length, reasoning ability, and multimodal capabilities. Additionally, we will scale up both the model and the dataset while exploring diverse architectural approaches. Keep your eyes peeled for more exciting blogs on the Instella LM family, its features, and its capabilities!
## Additional Resources
### Hugging Face Model Cards
- Pre-trained models:
  - Instella-3B-Stage1: [amd/Instella-3B-Stage1](https://huggingface.co/amd/Instella-3B-Stage1), first-stage pre-training checkpoint.
  - Instella-3B: [amd/Instella-3B](https://huggingface.co/amd/Instella-3B), final pre-training checkpoint.
- Instruction-tuned models:
  - Instella-3B-SFT: [amd/Instella-3B-SFT](https://huggingface.co/amd/Instella-3B-SFT), supervised fine-tuned checkpoint.
  - Instella-3B-Instruct: [amd/Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct), final instruction-tuned checkpoint.
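The checkpoints above can be loaded with Hugging Face `transformers`. A minimal sketch for the final instruction-tuned model follows; the prompt and generation settings are illustrative, and `trust_remote_code=True` is needed because the repository ships custom Instella modeling code via its `auto_map` entries:

```python
# Minimal generation sketch for Instella-3B-Instruct (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "amd/Instella-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "What is an RMSNorm layer?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that this downloads roughly 6 GB of weights on first use.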
### Datasets
Second-stage pre-training GSM8K synthetic dataset: [amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic)
- The dataset consists of two splits: `train` and `train_119K`.
- For the second-stage pre-training of Instella-3B, we used the `train_119K` split, which is a subset of the larger `train` split.
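A sketch of loading that subset with the `datasets` library, using the split names listed above:

```python
# Load the 119K-sample subset used for Instella-3B second-stage pre-training.
from datasets import load_dataset

ds = load_dataset("amd/Instella-GSM8K-synthetic", split="train_119K")
print(ds)  # shows the column names and number of rows
```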
### Code
- Github: [https://github.com/AMD-AIG-AIMA/Instella](https://github.com/AMD-AIG-AIMA/Instella)
Please refer to the following blogs to get started with using these techniques on AMD GPUs:
- [PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/fsdp-training-pytorch/README.html)
- [Accelerating Large Language Models with Flash Attention on AMD GPUs](https://rocm.blogs.amd.com/artificial-intelligence/flash-attention/README.html)
- [Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html)
- [Introducing the First AMD 1B Language Models: AMD OLMo](https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html)
## Bias, Risks, and Limitations
- The models are released for research purposes only. They are not intended for use cases that require high levels of factuality, for safety-critical situations, for health or medical applications, or for generating false information or facilitating toxic conversations.
- Model checkpoints are made accessible without any safety guarantees. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms appropriate to their respective use cases.
- It may be possible to prompt the model to generate content that is factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be produced by prompts that were not intended to elicit it. Users are therefore requested to be aware of this and to exercise caution and responsible thinking when using the model.
- The models' multilingual abilities have not been tested; the models may therefore misunderstand prompts and generate erroneous responses in languages other than English.
## License
- The Instella-3B models are licensed for academic and research purposes under a ResearchRAIL license.
- The [amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic) dataset used in second-stage pre-training is built with Qwen2.5-72B-Instruct and is licensed for academic and research purposes under a ResearchRAIL license. Refer to the [LICENSE](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/LICENSE) and [NOTICES](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/NOTICES) files in the dataset card for more information.
- For the Instella-3B models, refer to the [LICENSE](https://huggingface.co/amd/Instella-3B/blob/main/LICENSE) and [NOTICES](https://huggingface.co/amd/Instella-3B/blob/main/NOTICES) files for more information.
## Citations
Feel free to cite our Instella-3B models:
```bibtex
@misc{Instella,
    title = {Instella: Fully Open Language Models with Stellar Performance},
    url = {https://huggingface.co/amd/Instella-3B},
    author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
    month = {March},
    year = {2025}
}
```

**config.json**

```json
{
  "architectures": [
    "InstellaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "modeling_instella.InstellaConfig",
    "AutoModelForCausalLM": "modeling_instella.InstellaForCausalLM"
  },
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 4096,
  "model_type": "instella",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 32,
  "pad_token_id": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.0",
  "use_cache": true,
  "vocab_size": 50304
}
```
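The configuration above implies a parameter count of roughly 3.1B. A back-of-the-envelope check (the widths of the q/k norms and the presence of a final norm are assumptions about the architecture, but the result agrees exactly with the shard index's `total_size` of 6,225,351,680 bytes at 2 bytes per bfloat16 parameter):

```python
# Back-of-the-envelope parameter count from config.json above.
# Assumption: q_norm/k_norm each have width hidden_size, and there is
# one final RMSNorm; the bfloat16 byte total confirms this breakdown.

vocab_size = 50304
hidden_size = 2560
intermediate_size = 6912
num_layers = 36

embed = vocab_size * hidden_size           # input embedding table
lm_head = vocab_size * hidden_size         # untied output head ("tie_word_embeddings": false)

attn = 4 * hidden_size * hidden_size       # q/k/v/o projections
qk_norms = 2 * hidden_size                 # q_norm + k_norm weights
mlp = 3 * hidden_size * intermediate_size  # gate/up/down projections
layer_norms = 2 * hidden_size              # pre-attention + pre-feedforward norms
per_layer = attn + qk_norms + mlp + layer_norms

total = embed + lm_head + num_layers * per_layer + hidden_size  # + final norm
print(total)      # -> 3112675840 parameters (~3.1B)
print(total * 2)  # -> 6225351680 bytes in bfloat16
```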

**generation_config.json**

```json
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 1,
  "transformers_version": "4.48.0"
}
```

**model-00001-of-00002.safetensors** and **model-00002-of-00002.safetensors**: binary weight shards (stored with Git LFS; contents not shown).

Safetensors weight index (maps each tensor to its shard):
{
"metadata": {
"total_size": 6225351680
},
"weight_map": {
"lm_head.weight": "model-00002-of-00002.safetensors",
"model.embed_tokens.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.29.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.29.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.29.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.norm.weight": "model-00002-of-00002.safetensors"
}
}
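The `weight_map` in the index above assigns every tensor name to one of the two shard files. A sharded-checkpoint loader typically inverts this mapping first, grouping tensor names by shard so each `.safetensors` file is opened only once. A minimal sketch of that grouping step (the three-entry dict is an illustrative subset of the full index, not the loader `transformers` actually uses):

```python
from collections import defaultdict

# Illustrative subset of the weight_map from the index above.
weight_map = {
    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors",
}

def tensors_per_shard(weight_map):
    """Group tensor names by the shard file that stores them,
    so each shard only needs to be opened once during loading."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return dict(shards)

print(tensors_per_shard(weight_map))
```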

1251
modeling_instella.py Normal file

File diff suppressed because it is too large

BIN
scaling_perf_instruct.png (Stored with Git LFS) Normal file

Binary file not shown.

24
special_tokens_map.json Normal file

@@ -0,0 +1,24 @@
{
"bos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": "<padding>",
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
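Note the two entry shapes in this map: `pad_token` is a bare string, while the other entries are AddedToken-style dicts whose `content` field carries the token text, and `bos_token`, `eos_token`, and `unk_token` all resolve to the same `<|endoftext|>` sentinel. A small sketch of reading both shapes uniformly (the inlined dict mirrors the file above):

```python
import json

# The special-tokens map above, inlined with only the fields used here.
special_tokens_map = json.loads("""
{
  "bos_token": {"content": "<|endoftext|>"},
  "eos_token": {"content": "<|endoftext|>"},
  "pad_token": "<padding>",
  "unk_token": {"content": "<|endoftext|>"}
}
""")

def token_text(entry):
    """Entries are either a bare string or an AddedToken-style dict."""
    return entry if isinstance(entry, str) else entry["content"]

# bos, eos and unk all share one sentinel token; only padding differs.
print({k: token_text(v) for k, v in special_tokens_map.items()})
```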

250613
tokenizer.json Normal file

File diff suppressed because it is too large

248
tokenizer_config.json Normal file

@@ -0,0 +1,248 @@
{
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|padding|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50254": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50255": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50256": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50257": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50258": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50259": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50260": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50261": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50262": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50263": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50264": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50265": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50266": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50267": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50268": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50269": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50270": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50271": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50272": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50273": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50274": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50275": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50276": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50277": {
"content": "|||EMAIL_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50278": {
"content": "|||PHONE_NUMBER|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50279": {
"content": "|||IP_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50280": {
"content": "<padding>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<|endoftext|>",
"chat_template": "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{% if not loop.last %}{{ '<|assistant|>\n' + message['content'] + eos_token + '\n' }}{% else %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"extra_special_tokens": {},
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<padding>",
"tokenizer_class": "GPTNeoXTokenizer",
"unk_token": "<|endoftext|>"
}
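The `chat_template` above prepends the BOS token, wraps each turn in a `<|system|>` / `<|user|>` / `<|assistant|>` header, appends EOS only after assistant turns (with a newline except on the last turn), and leaves an open `<|assistant|>` header when a generation prompt is requested. In practice this is rendered by `tokenizer.apply_chat_template`; the following is a plain-Python re-implementation of the Jinja logic, for illustration only:

```python
def apply_chat_template(messages, bos="<|endoftext|>", eos="<|endoftext|>",
                        add_generation_prompt=True):
    """Mirror of the Jinja chat template above: BOS, role-tagged turns,
    EOS after assistant turns, and an open assistant header at the end
    when a generation prompt is requested."""
    out = bos
    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        if m["role"] == "system":
            out += "<|system|>\n" + m["content"] + "\n"
        elif m["role"] == "user":
            out += "<|user|>\n" + m["content"] + "\n"
        elif m["role"] == "assistant":
            out += "<|assistant|>\n" + m["content"] + eos
            if not last:
                out += "\n"
        if last and add_generation_prompt:
            out += "<|assistant|>\n"
    return out

print(apply_chat_template([{"role": "user", "content": "Hi"}]))
```

With a single user turn and `add_generation_prompt=True`, this yields the prompt ending in an open `<|assistant|>` header, ready for the model to continue.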