first commit

This commit is contained in:
xxl 2025-03-12 16:58:03 +08:00
parent f4ae513617
commit dcfedc8dc5
13 changed files with 253658 additions and 2 deletions

LICENSE (new file, 316 lines)
@@ -0,0 +1,316 @@
Instella-3B [RESEARCH-ONLY RAIL-MS]
Licensed Artifact(s):
- Model
- Source Code
Section I: PREAMBLE
BY ACCESSING, DOWNLOADING, INSTALLING, OR USING THE ARTIFACT, YOU AGREE
TO BE BOUND BY THIS LICENSE. IF YOU DO NOT AGREE TO ALL OF THE TERMS AND
CONDITIONS OF THIS LICENSE, DO NOT ACCESS, DOWNLOAD, INSTALL, OR USE THE
ARTIFACT.
1. Definitions
(a) “Application” refers to a sequence of instructions or statements
written in machine code language, including object code (that is the
product of a compiler), binary code (data using a two-symbol system)
or an intermediate language (such as register transfer language).
(b) “Artifact” refers to a software application (in either binary or
source code format), Model, and/or Source Code, in accordance with
what is specified above as the “Licensed Artifact”.
(c) “Contribution” means any work, including any modifications or
additions to an Artifact, that is intentionally submitted to
Licensor for inclusion or incorporation in the Artifact directly or
indirectly by the rights owner. For the purposes of this definition,
“submitted” means any form of electronic, verbal, or written
communication sent to the Licensor or its representatives, including
but not limited to communication on electronic mailing lists, source
code control systems, and issue tracking systems that are managed
by, or on behalf of, the Licensor for the purpose of discussing,
sharing and improving the Artifact, but excluding communication that
is conspicuously marked or otherwise designated in writing by the
contributor as “Not a Contribution.”
(d) “Contributor” means Licensor or any other individual or legal entity
that creates or owns a Contribution that is added to or incorporated
into an Artifact or its Derivative.
(e) “Data” means a collection of information and/or content extracted
from the dataset used with a given Model, including to train,
pretrain, or otherwise evaluate the Model. The Data is not licensed
under this License.
(f) “Derivative” means a work derived from or based upon an Artifact,
and includes all modified versions of such Artifact.
(g) “Distribution” means any transmission, reproduction, publication or
other sharing of an Artifact or Derivative to a Third Party,
including providing a hosted service incorporating the Artifact,
which is made available by electronic or other remote means -
e.g. API-based or web access.
(h) “Harm” includes but is not limited to physical, mental,
psychological, financial and reputational damage, pain, or loss.
(i) “License” means the terms and conditions for use, reproduction, and
Distribution as defined in this document.
(j) “Licensor” means the rights owner (by virtue of creation or
documented transfer of ownership) or entity authorized by the rights
owner (e.g., exclusive licensee) that is granting the rights in this
License.
(k) “Model” means any machine-learning based assembly or assemblies
(including checkpoints), consisting of learnt weights, parameters
(including optimizer states), corresponding to the model
architecture as embodied in the Source Code.
(l) “Output” means the results of operating a Model as embodied in
informational content resulting therefrom.
(m) “Permitted Purpose” means for academic or research purposes only.
(n) “Source Code” means any collection of text written using
human-readable programming language, including the code and scripts
used to define, run, load, benchmark or evaluate a Model or any
component thereof, and/or used to prepare data for training or
evaluation, if any. Source Code includes any accompanying
documentation, tutorials, examples, etc., if any. For clarity, the
term “Source Code” as used in this License includes any and all
Derivatives of such Source Code.
(o) “Third Parties” means individuals or legal entities that are not
under common control with Licensor or You.
(p) “Use” includes accessing, using, copying, modifying, and/or
distributing an Artifact; in connection with a Model as Artifact,
Use also includes creating content, fine-tuning, updating, running,
training, evaluating and/or re-parametrizing such Model.
(q) “You” (or “Your”) means an individual or legal entity receiving and
exercising permissions granted by this License and/or making use of
the Artifact for permitted purposes and in any permitted field of
use, including usage of the Artifact in an end-use application -
e.g. chatbot, translator, image generator, etc.
Section II: INTELLECTUAL PROPERTY RIGHTS
Both copyright and patent grants may apply to the Artifact. The Artifact
is subject to additional terms and conditions as described in Section III
below.
2. Grant of Copyright License. Conditioned upon compliance with Section
III below and subject to the terms and conditions of this License, each
Contributor hereby grants to You, only in connection with the Permitted
Purpose, a worldwide, non-exclusive, royalty-free copyright license to
reproduce, use, publicly display, publicly perform, sublicense, and
distribute the Artifact and Derivatives thereof.
3. Grant of Patent License. Conditioned upon compliance with Section III
below and subject to the terms and conditions of this License, and only
where and as applicable, each Contributor hereby grants to You, only in
connection with the Permitted Purpose, a worldwide, non-exclusive,
royalty-free, irrevocable (except as stated in this paragraph) patent
license to make, have made, use, sell, offer to sell, import, and
otherwise transfer the Artifact where such license applies only to those
patent claims licensable by such Contributor that are necessarily
infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Artifact to which such Contribution(s) was
submitted. If You institute patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the
Artifact and/or a Contribution incorporated within the Artifact
constitutes direct or contributory patent infringement, then any patent
licenses granted to You under this License in connection with the
Artifact shall terminate as of the date such litigation is asserted or
filed.
Licensor and Contributor each have the right to grant the licenses
above.
Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
4. Use-based Restrictions. The restrictions contained in the AMD
Responsible AI Use Policy set forth in Attachment A are mandatory Use-
based restrictions. Therefore You may not Use the Artifact in violation
of such restrictions. You may Use the Artifact only subject to this
License; if Section II is held unenforceable or inapplicable, this
Section III will continue to govern any use of the Artifact. You shall
require all of Your users who Use the Artifact or its Derivative
to comply with the terms and conditions of this License, including
those contained in this paragraph, and only for the Permitted Purpose.
5. The Output You Generate with a Model (as Artifact). Except as set
forth herein, Licensor claims no rights in the Output You generate. You
are accountable for the Output You generate and its subsequent uses. No
use of the Output may contravene any provision as stated in this
License.
6. Distribution and Redistribution. You may host for Third Party remote
access purposes (e.g. software-as-a-service), reproduce and distribute
copies of the Artifact or its Derivatives in any medium, with or without
modifications, provided that You meet the following conditions:
6.1. Use-based restrictions in paragraph 4 MUST be included as a
condition precedent to effect any type of legal agreement (e.g. a
license) governing the use and/or distribution of the Artifact or
its Derivatives, and You shall give such notice to any subsequent
Third Party recipients;
6.2. You shall give any Third Party recipients of the Artifact or its
Derivatives a copy of this License;
6.3. You shall cause any modified files to carry prominent notices
stating that You changed the files;
6.4. You shall retain all copyright, patent, trademark, and attribution
notices excluding those notices that do not pertain to any part of
the Artifact or its Derivatives.
6.5. You and any Third Party recipients of the Artifact or its
Derivative shall adhere to the Permitted Purpose.
You may add Your own copyright statement to Your modifications and may
provide additional or different license terms and conditions with
respect to paragraph 6.1., to govern the use, reproduction, or
Distribution of Your modifications, or for any Derivative, provided that
Your use, reproduction, and Distribution of the Artifact or its
Derivative otherwise complies with the conditions stated in this
License. In other words, the Use-based restrictions in Attachment A form
the minimum set of terms for You to license to Third Parties any
Artifact or its Derivative, but You may add more restrictive terms if
You deem it necessary.
Section IV: OTHER PROVISIONS
7. Updates and Runtime Restrictions. To the maximum extent permitted by
law, Licensor reserves the right to restrict (remotely or otherwise)
usage of the Artifact in violation of this License or update the
Artifact through electronic means.
8. Trademarks and Related. Nothing in this License permits You to make
use of Licensor's trademarks, trade names, or logos, or to otherwise suggest
endorsement or misrepresent the relationship between the parties; and
any rights not expressly granted herein are reserved by the Licensors.
9. Disclaimer of Warranty. Unless required by applicable law or agreed
to in writing, Licensor provides the Artifact (and each Contributor
provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied, including, without
limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT,
MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely
responsible for determining the appropriateness of using the Artifact,
and assume any risks associated with Your exercise of permissions under
this License.
10. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise, unless
required by applicable law (such as deliberate and grossly negligent
acts) or agreed to in writing, shall any Contributor be liable to You
for damages, including any direct, indirect, special, incidental, or
consequential damages of any character arising as a result of this
License or out of the use or inability to use the Artifact (including
but not limited to damages for loss of goodwill, work stoppage, computer
failure or malfunction, or any and all other commercial damages or
losses), even if such Contributor has been advised of the possibility of
such damages.
11. If any provision of this License is held to be invalid, illegal or
unenforceable, the remaining provisions shall be unaffected thereby and
remain valid as if such provision had not been set forth herein.
12. Term and Termination. The term of this License will commence upon
the earlier of Your (a) acceptance of this License or (b) accessing the
Artifact; and will continue in full force and effect until terminated in
accordance with the terms and conditions herein. Licensor may terminate
this License if You are in breach of any term or condition of this
License. Upon termination of this License, all licenses granted to You
will terminate and You must promptly delete and cease use of the
Artifact. Sections 1, 7, 8, 9, 10, 11, and 12 survive termination of
this License.
END OF TERMS AND CONDITIONS
Attachment A
AMD Responsible AI Use Policy
AMD is committed to the responsible use of its Artificial Intelligence
(AI) products and technologies (“AMD AI”). AMD AI may include
artificial intelligence or machine learning technologies that use
algorithms to analyze data and generate output using predictions based
on patterns in data. This policy explains the uses that AMD
specifically prohibits.
If you use any AMD AI, you are agreeing to use the AMD AI in compliance
with applicable laws and not for any of the following prohibited uses.
Prohibited Uses:
1) No Illegal Acts. Do not use AMD AI in violation of any applicable
national, state, local, or other jurisdictional law, rule, regulation,
or sanction.
2) No Explicit Content. Do not use AMD AI to submit (as input),
generate, or disseminate content depicting violent or sexually explicit
content or to create sexual chatbots.
3) No Harm. Do not use AMD AI for any potentially harmful uses,
including fraud, deception, discrimination, abuse, or harassment,
including the following:
a) Harm or abuse of a minor, including grooming and child sexual
exploitation.
b) Impersonation of human beings for purposes of deception.
c) Generation or dissemination of information you know to be false
for the purpose of harming others.
d) Intentionally defame, disparage, or otherwise harass others.
e) Intentionally attempting to materially distort the behavior of a
person in a manner that causes or is likely to cause that person
or another person physical or psychological harm.
f) Providing medical advice or interpretation of medical results that
is intended to be a substitute for professional medical advice,
diagnosis, or treatment.
g) Engaging in the unlawful or unauthorized practice of any
profession, including financial, legal, medical, health, or
related professional practices.
h) Judgment of, discrimination against, or harm to individuals or
groups based on legally protected characteristics or categories,
online or offline social behavior, or known or predicted personal
or personality characteristics, including any of the foregoing
uses in social credit systems.
4) No High-Risk Activity. Do not use AMD AI in any high-risk activities
or applications that create a risk of personal injury, death, or
severe property or environmental damage, including in weapons or
military applications.
5) No Personal Information. Do not use AMD AI to collect, process, or
disclose personal data, including health or sensitive personal
information, without the necessary rights or consents.
6) No Infringement. Do not use AMD AI to generate or disseminate any
information that infringes upon or misappropriates the intellectual
property rights of others, including copyright, trademark, patent, and
trade secret rights, rights to privacy, and publicity rights.
7) No Malware. Do not use AMD AI to generate or disseminate malware or
any other content to be used for the purpose of facilitating unpermitted
access to, or use of, computer systems or data.
8) No Obfuscation. Do not inappropriately obfuscate or fail to disclose
to end users the presence of AI in any application in which AMD AI is
deployed, along with any known risks or dangers of using AI without
appropriate safeguards, oversight and human control.
9) No Reliance. Do not rely on any information generated using AMD AI
without assessing it for accuracy, potential for harm, or other specific
risks applicable to the use case.

NOTICES (new file, 209 lines)
@@ -0,0 +1,209 @@
NOTICES Instella-3B
Dependencies on allenai_OLMo (Apache-2.0), Copyright Allen Institute for AI
Copyright Statements
# Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
License Text: https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
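As the appendix above notes, the boilerplate should be enclosed in the comment syntax appropriate to the file format. A minimal sketch of what that looks like at the top of a Python source file (the year and owner name are hypothetical placeholders, not values taken from this repository):

```python
# Copyright 2025 Example Org  (hypothetical placeholder values)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

The same text would use `//` comments in a C-family file or `<!-- -->` in XML/HTML; only the comment syntax changes, not the wording.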
Standard License Header
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Dependencies on Qwen2.5-72B-Instruct (Qwen LICENSE AGREEMENT, Apache-2.0)
Copyright Statements
Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
# Modifications copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
License Text
Qwen LICENSE AGREEMENT
Release Date: September 19, 2024
By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Qwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We" (or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
3. Redistribution
You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that you changed the files;
c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, you shall request a license from us. You cannot exercise your rights under this Agreement without our express authorization.
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
7. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT IS CAUSED.
d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
---------------------------------------------------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

---
license: other
license_link: LICENSE
pipeline_tag: text-generation
library_name: transformers
---
# Instella✨: Fully Open Language Models with Stellar Performance
Instella-3B-Instruct
AMD is excited to announce Instella, a family of fully open state-of-the-art 3-billion-parameter language models (LMs) trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.
<div align="center">
<img src="scaling_perf_instruct.png" style="object-fit: contain;"/>
<em><b>Figure 1:</b> Pareto frontier of pre-training tokens vs average performance for pre-trained and instruction-tuned models.</em>
</div>
By training Instella from scratch on Instinct MI300X GPUs, we highlight our hardware's capability and scalability in handling demanding large-scale AI training workloads, offering a viable alternative in the AI hardware landscape. In line with AMD's commitment to open source, we are releasing all artifacts related to Instella models [here](#additional-resources), including the model weights, detailed training configurations, datasets, and code, enabling the AI community to collaborate, replicate, and innovate, thereby accelerating progress.
## Takeaways
- **Announcing Instella**, a series of 3 billion parameter language models developed by AMD, trained from scratch on 128 Instinct MI300X GPUs.
- **Instella models significantly outperform existing fully open LMs** (Figure 1) of comparable size, and bridge the gap between fully open and open-weight models by achieving competitive performance compared to state-of-the-art open-weight models and their instruction-tuned counterparts.
- **Fully open and accessible**: fully open-source release of model weights, training hyperparameters, datasets, and code, fostering innovation and collaboration within the AI community.
- Supported by the AMD ROCm software stack, Instella employs efficient training techniques such as **FlashAttention-2, Torch Compile, and Fully Sharded Data Parallelism (FSDP)** with hybrid sharding to **scale model training over a large cluster.**
## Instella Models
In this release, we introduce the following Instella models:
<div align="center">
| Model | Stage | Training Data (Tokens) | Description |
| :----: | :----: | :----: | :---- |
| [Instella-3B-Stage1](https://huggingface.co/amd/Instella-3B-Stage1) | Pre-training (Stage 1) | 4.065 Trillion | First stage pre-training to develop proficiency in natural language. |
| [Instella-3B](https://huggingface.co/amd/Instella-3B) | Pre-training (Stage 2) | 57.575 Billion | Second stage pre-training to further enhance problem-solving capabilities. |
| [Instella-3B-SFT](https://huggingface.co/amd/Instella-3B-SFT) | SFT | 8.902 Billion (x3 epochs) | Supervised fine-tuning (SFT) to enable instruction-following capabilities. |
| [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-instruct) | DPO | 760 Million | Alignment with human preferences and stronger chat capabilities via direct preference optimization (DPO). |
| | **Total:** | **4.15 Trillion** | |
<em><b>Table 1:</b> Instella models and training stages.</em>
</div>
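The per-stage token counts in Table 1 can be sanity-checked with a quick sum (counting the SFT tokens once per epoch):

```python
# Verify the "Total" row in Table 1 (all counts in billions of tokens).
stage_tokens_b = {
    "Pre-training Stage 1": 4065.0,   # 4.065 Trillion
    "Pre-training Stage 2": 57.575,
    "SFT": 8.902 * 3,                 # 8.902 B tokens seen over 3 epochs
    "DPO": 0.760,
}
total_t = sum(stage_tokens_b.values()) / 1000  # convert billions -> trillions
print(f"Total training tokens: {total_t:.2f}T")  # → 4.15T
```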
The Instella models are text-only, autoregressive transformer-based LMs having 3 billion parameters. Architecture-wise, Instella is packed with 36 decoder layers, each having 32 attention heads. These models support a sequence length of up to 4,096 tokens and have a vocabulary size of ~50,000 tokens using the OLMo tokenizer. During both pre-training and fine-tuning, we utilized FlashAttention-2, Torch Compile, and bfloat16 mixed-precision training to reduce memory usage, leading to computational speedups and optimal resource utilization. To balance inter-node memory efficiency and intra-node communication overhead within our cluster, we employed fully sharded data parallelism (FSDP) with hybrid sharding, with model parameters, gradients, and optimizer states sharded within a node and replicated across the nodes.
Our training pipeline is based on the open-source OLMo codebase, adapted and optimized for our hardware and model architecture. For pre-training we used a total of 128 Instinct MI300X GPUs distributed across 16 nodes (8x Instinct MI300X GPUs per node). We evaluated our models and baselines using standard tasks from [OLMES](https://github.com/allenai/olmes/tree/main), [FastChat MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md), and [Alpaca](https://github.com/tatsu-lab/alpaca_eval/tree/main). For more details about the architecture, training pipeline/hyperparameters, and evaluation results, please refer to our [Blog](https://rocm.blogs.amd.com/artificial-intelligence/introducing-instella-3B/README.html), [Hugging Face model card](https://huggingface.co/amd/Instella-3B) and [GitHub repository](https://github.com/AMD-AIG-AIMA/Instella).
## Training Pipeline
The training of the Instella models comprised four stages, each incrementally enhancing the model's capabilities, from fundamental natural language understanding to instruction following and alignment with human preferences.
### Model Summary
| Stage | Model | Training Tokens | Layers | Attention Heads | Model Hidden Size | MLP Hidden Size | Context Length | RoPE Theta |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Pre-training | Instella-3B-stage1 | 4.065T | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| Pre-training | Instella-3B | 57.575B | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| SFT | Instella-3B-SFT | 8.902B (x3) | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
| SFT+DPO | Instella-3B-instruct | 760M | 36 | 32 | 2560 | 13824 | 4096 | 10,000 |
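The 3B model size can be roughly reproduced from the architecture numbers in the table above. The sketch below makes a few assumptions not stated in the table (they follow common OLMo-style conventions, but are assumptions nonetheless): a SwiGLU MLP in which the listed MLP hidden size of 13824 counts the gate and up projections together, untied input/output embeddings, and a vocabulary of 50304 tokens (the "~50,000" mentioned earlier):

```python
# Rough parameter-count estimate from the Model Summary table.
# ASSUMPTIONS (not in the table): SwiGLU MLP where mlp_hidden counts gate+up
# combined, untied input/output embeddings, vocab size 50304 ("~50,000").
layers, d_model, mlp_hidden, vocab = 36, 2560, 13824, 50304

attn = 4 * d_model * d_model                                # Q, K, V, output projections
mlp = d_model * mlp_hidden + (mlp_hidden // 2) * d_model    # gate+up, then down projection
embed = 2 * vocab * d_model                                 # untied input + output embeddings

total = layers * (attn + mlp) + embed
print(f"~{total / 1e9:.2f}B parameters")  # → ~3.11B, matching the size column in Table 2
```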
### Hyperparameters
|Stage | Optimizer | Peak LR | LR Scheduler | Alpha F | Warmup (steps) | Weight Decay | Decay Norm & Bias | Decay Embedding | Batch Size (Tokens) | Epochs |
|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| Pretraining Stage 1 | AdamW(0.9,0.95) | 4.0e-4 | cosine_with_warmup | 0.1 | 2000 | 0.1 | True | True | 4M | 1 |
| Pretraining Stage 2 | AdamW(0.9,0.95) | 4.0e-5 | cosine_with_warmup | 0.0 | 0 | 0.1 | True | True | 4M | 1 |
| SFT | AdamW(0.9,0.95) | 1.0e-5 | linear_with_warmup | 0.001 | 500 | 0.1 | True | True | 0.5M | 3 |
| DPO | AdamW(0.9,0.95) | 5.0e-7 | linear | -- | 10% | 0.1 | -- | -- | 0.25M | 1 |
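As an illustration, the Stage 1 `cosine_with_warmup` schedule above (peak LR 4.0e-4, 2000 warmup steps, final LR of Alpha F × peak = 0.1 × peak) could be sketched as follows; note that `total_steps` here is an illustrative placeholder, not a value from the table:

```python
import math

def cosine_with_warmup(step, peak_lr=4.0e-4, warmup=2000, total_steps=100_000, alpha_f=0.1):
    """Linear warmup to peak_lr, then cosine decay to alpha_f * peak_lr.

    A sketch of the Stage 1 pre-training schedule from the table above;
    total_steps is a placeholder, not a published value.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    min_lr = alpha_f * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

print(cosine_with_warmup(1000))     # halfway through warmup → 2.0e-4
print(cosine_with_warmup(2000))     # end of warmup → peak LR 4.0e-4
print(cosine_with_warmup(100_000))  # end of training → 4.0e-5 (alpha_f * peak)
```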
## Getting Started
### Installation
First, install [PyTorch](https://pytorch.org) according to the instructions specific to your operating system. For AMD GPUs, you can also start from a [rocm/pytorch](https://hub.docker.com/r/rocm/pytorch/tags?name=pytorch) Docker image.
To install from source (recommended for training/fine-tuning) run:
```bash
git clone https://github.com/AMD-AIG-AIMA/Instella.git
cd Instella
# install Flash-Attention on MI300X
GPU_ARCH=gfx942 MAX_JOBS=$(nproc) pip install git+https://github.com/Dao-AILab/flash-attention.git -v
# install other dependencies
pip install -e .[all]
```
### Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "amd/Instella-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)
prompt = [{"role": "user", "content": "What are the benefits of open-source AI research?"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
### Chat in TRL
You can also use the TRL CLI to chat with the model from the terminal:
```bash
pip install trl
trl chat --model_name_or_path amd/Instella-3B-Instruct --trust_remote_code --max_new_tokens 1024
# <root>:
# which is bigger 9.8 or 9.11?
# <amd/Instella-3B-Instruct>:
# 9.8 is bigger than 9.11. The difference between the two numbers is 0.69 (9.8 - 9.11 = 0.69), which indicates that 9.8 is 0.69 units larger than 9.11.
```
## Results
### Pre-training
<div class="table-wrapper" align="center">
<table>
<thead>
<tr>
<th>Models</th>
<th>Size</th>
<th>Training Tokens</th>
<th>Avg</th>
<th>ARC Challenge</th>
<th>ARC Easy</th>
<th>BoolQ</th>
<th>Hellaswag</th>
<th>PiQA</th>
<th>SciQ</th>
<th>Winogrande</th>
<th>OpenBookQA</th>
<th>MMLU</th>
<th>BBH (3-shot)</th>
<th>GSM8k (8-shot)</th>
</tr>
</thead>
<tbody>
<tr>
<th colspan="15">Open Weight Models</th>
</tr>
<tr>
<td>Gemma-2-2B</td>
<td>2.61B</td>
<td>~2T</td>
<td>59.34</td>
<td>39.46</td>
<td>59.30</td>
<td>74.50</td>
<td>70.50</td>
<td>76.40</td>
<td><strong>96.60</strong></td>
<td>69.80</td>
<td>44.80</td>
<td>53.28</td>
<td>40.75</td>
<td>27.37</td>
</tr>
<tr>
<td>Llama-3.2-3B</td>
<td>3.21B</td>
<td>~9T</td>
<td>62.51</td>
<td>47.16</td>
<td>64.91</td>
<td>74.80</td>
<td>73.10</td>
<td>75.90</td>
<td>95.30</td>
<td>70.30</td>
<td>51.20</td>
<td>57.81</td>
<td><ins>47.00</ins></td>
<td>30.10</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>3.09B</td>
<td>~18T</td>
<td><strong>68.30</strong></td>
<td>51.51</td>
<td>67.19</td>
<td><strong>79.10</strong></td>
<td>72.10</td>
<td>77.40</td>
<td>95.50</td>
<td>69.30</td>
<td><ins>51.40</ins></td>
<td><strong>67.22</strong></td>
<td><strong>56.69</strong></td>
<td><strong>63.84</strong></td>
</tr>
<tr>
<th colspan="15">Fully Open Models</th>
</tr>
<tr>
<td>Pythia-2.8b</td>
<td>2.91B</td>
<td>300B</td>
<td>49.83</td>
<td>40.47</td>
<td>60.70</td>
<td>64.80</td>
<td>60.10</td>
<td>72.50</td>
<td>89.70</td>
<td>60.80</td>
<td>42.60</td>
<td>26.09</td>
<td>27.69</td>
<td>2.73</td>
</tr>
<tr>
<td>GPTNeo-2.7B</td>
<td>2.72B</td>
<td>~420B</td>
<td>47.96</td>
<td>38.46</td>
<td>54.56</td>
<td>62.70</td>
<td>55.20</td>
<td>70.80</td>
<td>88.00</td>
<td>58.30</td>
<td>40.80</td>
<td>27.83</td>
<td>27.25</td>
<td>3.71</td>
</tr>
<tr>
<td>OpenELM-3B</td>
<td>3.04B</td>
<td>~1.5T</td>
<td>52.28</td>
<td>37.46</td>
<td>58.42</td>
<td>68.60</td>
<td>71.70</td>
<td>75.60</td>
<td>92.50</td>
<td>65.40</td>
<td>46.40</td>
<td>26.69</td>
<td>29.40</td>
<td>2.96</td>
</tr>
<tr>
<td>StableLM-3B-4E1T</td>
<td>2.8B</td>
<td>~4T</td>
<td>58.51</td>
<td>44.82</td>
<td>67.02</td>
<td>75.40</td>
<td><ins>74.20</ins></td>
<td><strong>78.40</strong></td>
<td>93.40</td>
<td>68.40</td>
<td>48.60</td>
<td>45.19</td>
<td>37.33</td>
<td>10.84</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/amd/Instella-3B-Stage1">Instella-3B-Stage1</a></strong></td>
<td>3.11B</td>
<td>~4T</td>
<td>61.33</td>
<td><strong>53.85</strong></td>
<td><strong>73.16</strong></td>
<td><ins>78.70</ins></td>
<td><ins>74.20</ins></td>
<td>77.50</td>
<td>94.90</td>
<td><ins>71.20</ins></td>
<td><ins>51.40</ins></td>
<td>54.69</td>
<td>34.30</td>
<td>10.77</td>
</tr>
<tr>
<td><strong><a href="https://huggingface.co/amd/Instella-3B">Instella-3B</a></strong></td>
<td>3.11B</td>
<td>~4T+60B</td>
<td><ins>66.59</ins></td>
<td><ins>52.84</ins></td>
<td><ins>70.53</ins></td>
<td>76.50</td>
<td><strong>75.00</strong></td>
<td><ins>77.80</ins></td>
<td><ins>96.40</ins></td>
<td><strong>73.10</strong></td>
<td><strong>52.40</strong></td>
<td><ins>58.31</ins></td>
<td>39.74</td>
<td><ins>59.82</ins></td>
</tr>
</tbody>
</table>
<em><strong>Table 2:</strong> Pre-trained model performance on standard benchmarks. Here <strong>Bold</strong> represents the best performance, and <ins>Underline</ins> represents the second best performance.</em>
</div>
- Both Instella-3B-Stage1 and Instella-3B outperform all other fully open models on every benchmark individually (except PiQA). **Our final pre-trained checkpoint Instella-3B outperforms the existing top-performing fully open pre-trained models by a lead of ⬆8.08% on average**, with significant improvements in `ARC Challenge [+8.02%], ARC Easy [+3.51%], Winogrande [+4.7%], OpenBookQA [+3.88%], MMLU [+13.12%] and GSM8K [+48.98%]`.
- **Second stage pre-training elevated the overall average performance relative to stage 1 by ⬆5.26%**, substantially narrowing the performance gap between Instella-3B and state-of-the-art open-weight models: it **outperforms Llama-3.2-3B by ⬆4.08% on average** (`+5.69% [ARC Challenge], +5.61% [ARC Easy], and +29.72% [GSM8k]`) and **Gemma-2-2B by ⬆7.25% on average** (`+13.38% [ARC Challenge], +11.23% [ARC Easy], +4.5% [Hellaswag], +7.6% [OpenBookQA], +5.03% [MMLU], and +32.45% [GSM8k]`), and is **competitive with Qwen-2.5-3B** on the majority of the benchmarks.
- The multi-stage pre-training with a diverse, high-quality data mix significantly enhanced Instella-3B's capabilities, establishing it as a competitive and open alternative among language models of comparable size.
### Instruction-tuning Results
<div class="table-wrapper" align="center">
<table>
<thead>
<tr>
<th>Models</th>
<th>Size</th>
<th>Training Tokens</th>
<th>Avg</th>
<th>MMLU</th>
<th>TruthfulQA</th>
<th>BBH</th>
<th>GPQA</th>
<th>GSM8K</th>
<th>Minerva MATH</th>
<th>IFEval</th>
<th>AlpacaEval 2</th>
<th>MT-Bench</th>
</tr>
</thead>
<tbody>
<tr>
<th colspan="13">Open Weight Models</th>
</tr>
<tr>
<td>Gemma-2-2B-Instruct</td>
<td>2.61B</td>
<td>~2T</td>
<td>39.04</td>
<td>58.35</td>
<td><ins>55.76</ins></td>
<td>42.96</td>
<td>25.22</td>
<td>53.45</td>
<td>22.48</td>
<td>55.64</td>
<td><strong>29.41</strong></td>
<td><strong>8.07</strong></td>
</tr>
<tr>
<td>Llama-3.2-3B-Instruct</td>
<td>3.21B</td>
<td>~9T</td>
<td><ins>47.53</ins></td>
<td><ins>61.50</ins></td>
<td>50.23</td>
<td><strong>61.50</strong></td>
<td><ins>29.69</ins></td>
<td><strong>77.03</strong></td>
<td><ins>46.00</ins></td>
<td><strong>75.42</strong></td>
<td>19.31</td>
<td>7.13</td>
</tr>
<tr>
<td>Qwen2.5-3B-Instruct</td>
<td>3.09B</td>
<td>~18T</td>
<td><strong>48.72</strong></td>
<td><strong>66.90</strong></td>
<td><strong>57.16</strong></td>
<td><ins>57.29</ins></td>
<td>28.13</td>
<td><ins>75.97</ins></td>
<td><strong>60.42</strong></td>
<td>62.48</td>
<td><ins>22.12</ins></td>
<td><ins>8.00</ins></td>
</tr>
<tr>
<th colspan="13">Fully Open Models</th>
</tr>
<tr>
<td>StableLM-zephyr-3B</td>
<td>2.8B</td>
<td>4T</td>
<td>30.50</td>
<td>45.10</td>
<td>47.90</td>
<td>39.32</td>
<td>25.67</td>
<td>58.38</td>
<td>10.38</td>
<td>34.20</td>
<td>7.51</td>
<td>6.04</td>
</tr>
<tr>
<td>OpenELM-3B-Instruct</td>
<td>3.04B</td>
<td>~1.5T</td>
<td>14.11</td>
<td>27.36</td>
<td>38.08</td>
<td>24.24</td>
<td>18.08</td>
<td>1.59</td>
<td>0.38</td>
<td>16.08</td>
<td>0.21</td>
<td>1.00</td>
</tr>
<tr>
<td><a href="https://huggingface.co/amd/Instella-3B-SFT">Instella-3B-SFT</a></td>
<td>3.11B</td>
<td>~4T</td>
<td>42.05</td>
<td>58.76</td>
<td>52.49</td>
<td>46.00</td>
<td>28.13</td>
<td>71.72</td>
<td>40.50</td>
<td>66.17</td>
<td>7.58</td>
<td>7.07</td>
</tr>
<tr>
<td><a href="https://huggingface.co/amd/Instella-3B-Instruct">Instella-3B-Instruct</a></td>
<td>3.11B</td>
<td>~4T</td>
<td>44.87</td>
<td>58.90</td>
<td>55.47</td>
<td>46.75</td>
<td><strong>30.13</strong></td>
<td>73.92</td>
<td>42.46</td>
<td><ins>71.35</ins></td>
<td>17.59</td>
<td>7.23</td>
</tr>
</tbody>
</table>
<em><strong>Table 3:</strong> Instruct model performance on standard benchmarks. Here <strong>Bold</strong> represents the best performance, and <ins>Underline</ins> represents the second best performance.</em>
</div>
- **Instella-3B-Instruct consistently outperforms other fully open models across all evaluated benchmarks, with a significant average score lead of ⬆️14.37%** over the next best fully open instruction-tuned model, and with substantial margins across all the chat benchmarks (`+13% [MMLU], +7.57% [TruthfulQA], +7.43% [BBH], +4.46% [GPQA], +37.15% [IFEval], +10.08% [AlpacaEval 2], and +1.2 [MT-Bench]`).
- **Instella-3B-Instruct narrows the performance gap with leading open-weight models.** Instella-3B-Instruct performs **on par with or slightly surpasses existing state-of-the-art open weight instruction-tuned models** such as Llama-3.2-3B-Instruct (`+5.24% [TruthfulQA], 0.45% [GPQA], and +0.1% [MT-Bench]`), and Qwen2.5-3B-Instruct (`+2.01% [GPQA] and +8.87% [IFEval]`), while significantly outperforming Gemma-2-2B-Instruct with an average score lead of ⬆5.83% (`+0.55% [MMLU], +3.79 [BBH], +4.91 [GPQA], +20.47 [GSM8k], +19.98 [Minerva MATH], and +15.17% [IFEval]`).
- **Overall, Instella-3B-Instruct excels at instruction-following and question-answering tasks such as TruthfulQA, GPQA, IFEval, and MT-Bench**, and remains highly competitive with existing state-of-the-art open-weight models on other knowledge-recall and math benchmarks despite being trained on significantly fewer tokens.
## Training Data
| Stage | Model | Dataset | License |
| :---- | :---- | :---- | :---- |
| Pre-training Stage 1 | Instella-3B-stage1 | [https://huggingface.co/datasets/allenai/OLMoE-mix-0924](https://huggingface.co/datasets/allenai/OLMoE-mix-0924) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/allenai/dolmino-mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | Refer source materials |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/TIGER-Lab/WebinstructSub](https://huggingface.co/datasets/TIGER-Lab/WebinstructSub) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | MIT |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/python-edu](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/python-edu) | ODC-BY-1.0 |
| Pre-training Stage 2 | Instella-3B | [https://github.com/google-deepmind/mathematics_dataset](https://github.com/google-deepmind/mathematics_dataset) | Apache-2.0 |
| Pre-training Stage 2 | Instella-3B | [https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic) | [LICENSE](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/LICENSE) |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/nvidia/OpenMathinstruct-2](https://huggingface.co/datasets/nvidia/OpenMathinstruct-2) | CC-BY-4.0 |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/cais/mmlu](https://huggingface.co/datasets/cais/mmlu) | MIT |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | Apache-2.0 |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/GAIR/o1-journey](https://huggingface.co/datasets/GAIR/o1-journey) | Refer source materials |
| SFT | Instella-3B-SFT | [https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following (subset of Tulu3)](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) | ODC-BY-1.0 |
| DPO | Instella-3B-instruct | [https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) | ODC-BY-1.0 |
> [!NOTE]
> Further information concerning the training datasets, including applicable licensing terms and use restrictions, may be located at the linked source location.
## Conclusion
The release of the Instella family of models represents a significant stride in advancing open-source AI and in demonstrating the capabilities of AMD hardware for large-scale language model training. The 3-billion-parameter Instella models significantly outperform existing fully open models of comparable size on key benchmarks while remaining competitive with comparable open-weight models, which we attribute to the high-quality data-mix selection, the multi-stage training pipeline, and the use of high-performance Instinct MI300X GPUs for large-scale training.
By fully open-sourcing the Instella models, including weights, training configurations, datasets, and code, we aim to foster innovation and collaboration within the AI community. We believe that transparency, reproducibility, and accessibility are key drivers of progress in AI research and development. We invite developers, researchers, and AI enthusiasts to explore Instella, contribute to its ongoing improvement, and join us in pushing the boundaries of what is possible with language models.
We will continue enhancing the models across multiple dimensions, including context length, reasoning ability, and multimodal capabilities. Additionally, we will scale up both the model and the dataset while exploring diverse architectural approaches. Keep your eyes peeled for more exciting blogs on the Instella LM family, its features, and its capabilities!
## Additional Resources
### Hugging Face Model Cards
- Pre-trained models:
  - Instella-3B-Stage1: [amd/Instella-3B-Stage1](https://huggingface.co/amd/Instella-3B-Stage1), first-stage pre-training checkpoint.
  - Instella-3B: [amd/Instella-3B](https://huggingface.co/amd/Instella-3B), final pre-training checkpoint.
- Instruction-tuned models:
  - Instella-3B-SFT: [amd/Instella-3B-SFT](https://huggingface.co/amd/Instella-3B-SFT), supervised fine-tuned checkpoint.
  - Instella-3B-Instruct: [amd/Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct), final instruction-tuned checkpoint.
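The checkpoints above can be loaded with Hugging Face `transformers`. A minimal sketch for the final instruction-tuned model follows; the prompt and generation settings are illustrative, and `trust_remote_code=True` is needed because the repository ships custom Instella modeling code via its `auto_map` entries:

```python
# Minimal generation sketch for Instella-3B-Instruct (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "amd/Instella-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "What is an RMSNorm layer?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that this downloads roughly 6 GB of weights on first use.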
### Datasets
Second-stage pre-training GSM8K synthetic dataset: [amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic)
- The dataset consists of two splits: `train` and `train_119K`.
- For the second-stage pre-training of Instella-3B, we used the `train_119K` split, which is a subset of the larger `train` split.
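A sketch of loading that subset with the `datasets` library, using the split names listed above:

```python
# Load the 119K-sample subset used for Instella-3B second-stage pre-training.
from datasets import load_dataset

ds = load_dataset("amd/Instella-GSM8K-synthetic", split="train_119K")
print(ds)  # shows the column names and number of rows
```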
### Code
- Github: [https://github.com/AMD-AIG-AIMA/Instella](https://github.com/AMD-AIG-AIMA/Instella)
Please refer to the following blogs to get started with using these techniques on AMD GPUs:
- [PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/fsdp-training-pytorch/README.html)
- [Accelerating Large Language Models with Flash Attention on AMD GPUs](https://rocm.blogs.amd.com/artificial-intelligence/flash-attention/README.html)
- [Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html)
- [Introducing the First AMD 1B Language Models: AMD OLMo](https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html)
## Bias, Risks, and Limitations
- The models are released for research purposes only. They are not intended for use cases that require high levels of factuality, for safety-critical situations, for health or medical applications, or for generating false information or facilitating toxic conversations.
- Model checkpoints are made accessible without any safety guarantees. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms appropriate to their respective use cases.
- It may be possible to prompt the model to generate content that is factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be produced by prompts that were not intended to elicit it. Users are therefore requested to be aware of this and to exercise caution and responsible thinking when using the model.
- The models' multilingual abilities have not been tested; the models may therefore misunderstand prompts and generate erroneous responses in languages other than English.
## License
- The Instella-3B models are licensed for academic and research purposes under a ResearchRAIL license.
- The [amd/Instella-GSM8K-synthetic](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic) dataset used in second-stage pre-training is built with Qwen2.5-72B-Instruct and is licensed for academic and research purposes under a ResearchRAIL license. Refer to the [LICENSE](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/LICENSE) and [NOTICES](https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic/blob/main/NOTICES) files in the dataset card for more information.
- For the Instella-3B models, refer to the [LICENSE](https://huggingface.co/amd/Instella-3B/blob/main/LICENSE) and [NOTICES](https://huggingface.co/amd/Instella-3B/blob/main/NOTICES) files for more information.
## Citations
Feel free to cite our Instella-3B models:
```bibtex
@misc{Instella,
    title = {Instella: Fully Open Language Models with Stellar Performance},
    url = {https://huggingface.co/amd/Instella-3B},
    author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
    month = {March},
    year = {2025}
}
```

**config.json**

```json
{
  "architectures": [
    "InstellaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "modeling_instella.InstellaConfig",
    "AutoModelForCausalLM": "modeling_instella.InstellaForCausalLM"
  },
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 4096,
  "model_type": "instella",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 32,
  "pad_token_id": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.0",
  "use_cache": true,
  "vocab_size": 50304
}
```
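The configuration above implies a parameter count of roughly 3.1B. A back-of-the-envelope check (the widths of the q/k norms and the presence of a final norm are assumptions about the architecture, but the result agrees exactly with the shard index's `total_size` of 6,225,351,680 bytes at 2 bytes per bfloat16 parameter):

```python
# Back-of-the-envelope parameter count from config.json above.
# Assumption: q_norm/k_norm each have width hidden_size, and there is
# one final RMSNorm; the bfloat16 byte total confirms this breakdown.

vocab_size = 50304
hidden_size = 2560
intermediate_size = 6912
num_layers = 36

embed = vocab_size * hidden_size           # input embedding table
lm_head = vocab_size * hidden_size         # untied output head ("tie_word_embeddings": false)

attn = 4 * hidden_size * hidden_size       # q/k/v/o projections
qk_norms = 2 * hidden_size                 # q_norm + k_norm weights
mlp = 3 * hidden_size * intermediate_size  # gate/up/down projections
layer_norms = 2 * hidden_size              # pre-attention + pre-feedforward norms
per_layer = attn + qk_norms + mlp + layer_norms

total = embed + lm_head + num_layers * per_layer + hidden_size  # + final norm
print(total)      # -> 3112675840 parameters (~3.1B)
print(total * 2)  # -> 6225351680 bytes in bfloat16
```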

**generation_config.json**

```json
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 1,
  "transformers_version": "4.48.0"
}
```

**model-00001-of-00002.safetensors** and **model-00002-of-00002.safetensors**: binary weight shards (stored with Git LFS; contents not shown).

Safetensors weight index (maps each tensor to its shard):
{
"metadata": {
"total_size": 6225351680
},
"weight_map": {
"lm_head.weight": "model-00002-of-00002.safetensors",
"model.embed_tokens.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.27.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.28.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.29.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.29.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.29.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.29.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.pre_attention_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.pre_attention_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
"model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
"model.norm.weight": "model-00002-of-00002.safetensors"
}
}
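The `weight_map` in the index above assigns every tensor name to one of the two shard files. A sharded-checkpoint loader typically inverts this mapping first, grouping tensor names by shard so each `.safetensors` file is opened only once. A minimal sketch of that grouping step (the three-entry dict is an illustrative subset of the full index, not the loader `transformers` actually uses):

```python
from collections import defaultdict

# Illustrative subset of the weight_map from the index above.
weight_map = {
    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors",
}

def tensors_per_shard(weight_map):
    """Group tensor names by the shard file that stores them,
    so each shard only needs to be opened once during loading."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return dict(shards)

print(tensors_per_shard(weight_map))
```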

1251
modeling_instella.py Normal file

File diff suppressed because it is too large

BIN
scaling_perf_instruct.png (Stored with Git LFS) Normal file

Binary file not shown.

24
special_tokens_map.json Normal file

@@ -0,0 +1,24 @@
{
"bos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": "<padding>",
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
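Note the two entry shapes in this map: `pad_token` is a bare string, while the other entries are AddedToken-style dicts whose `content` field carries the token text, and `bos_token`, `eos_token`, and `unk_token` all resolve to the same `<|endoftext|>` sentinel. A small sketch of reading both shapes uniformly (the inlined dict mirrors the file above):

```python
import json

# The special-tokens map above, inlined with only the fields used here.
special_tokens_map = json.loads("""
{
  "bos_token": {"content": "<|endoftext|>"},
  "eos_token": {"content": "<|endoftext|>"},
  "pad_token": "<padding>",
  "unk_token": {"content": "<|endoftext|>"}
}
""")

def token_text(entry):
    """Entries are either a bare string or an AddedToken-style dict."""
    return entry if isinstance(entry, str) else entry["content"]

# bos, eos and unk all share one sentinel token; only padding differs.
print({k: token_text(v) for k, v in special_tokens_map.items()})
```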

250613
tokenizer.json Normal file

File diff suppressed because it is too large

248
tokenizer_config.json Normal file

@@ -0,0 +1,248 @@
{
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|padding|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50254": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50255": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50256": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50257": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50258": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50259": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50260": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50261": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50262": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50263": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50264": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50265": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50266": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50267": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50268": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50269": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50270": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50271": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50272": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50273": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50274": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50275": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50276": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50277": {
"content": "|||EMAIL_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50278": {
"content": "|||PHONE_NUMBER|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50279": {
"content": "|||IP_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50280": {
"content": "<padding>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<|endoftext|>",
"chat_template": "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{% if not loop.last %}{{ '<|assistant|>\n' + message['content'] + eos_token + '\n' }}{% else %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"extra_special_tokens": {},
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<padding>",
"tokenizer_class": "GPTNeoXTokenizer",
"unk_token": "<|endoftext|>"
}
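The `chat_template` above prepends the BOS token, wraps each turn in a `<|system|>` / `<|user|>` / `<|assistant|>` header, appends EOS only after assistant turns (with a newline except on the last turn), and leaves an open `<|assistant|>` header when a generation prompt is requested. In practice this is rendered by `tokenizer.apply_chat_template`; the following is a plain-Python re-implementation of the Jinja logic, for illustration only:

```python
def apply_chat_template(messages, bos="<|endoftext|>", eos="<|endoftext|>",
                        add_generation_prompt=True):
    """Mirror of the Jinja chat template above: BOS, role-tagged turns,
    EOS after assistant turns, and an open assistant header at the end
    when a generation prompt is requested."""
    out = bos
    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        if m["role"] == "system":
            out += "<|system|>\n" + m["content"] + "\n"
        elif m["role"] == "user":
            out += "<|user|>\n" + m["content"] + "\n"
        elif m["role"] == "assistant":
            out += "<|assistant|>\n" + m["content"] + eos
            if not last:
                out += "\n"
        if last and add_generation_prompt:
            out += "<|assistant|>\n"
    return out

print(apply_chat_template([{"role": "user", "content": "Hi"}]))
```

With a single user turn and `add_generation_prompt=True`, this yields the prompt ending in an open `<|assistant|>` header, ready for the model to continue.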