first commit

This commit is contained in:
xxl 2025-03-12 17:45:08 +08:00
parent 9d912d0875
commit 7d7785283f
17 changed files with 255767 additions and 2 deletions

LICENSE (new file, 316 lines)

@@ -0,0 +1,316 @@
Instella-VL-1B Model [RESEARCH-ONLY RAIL-MS]
Licensed Artifact(s):
- Model
- Source Code
Section I: PREAMBLE
BY ACCESSING, DOWNLOADING, INSTALLING, OR USING THE ARTIFACT, YOU AGREE
TO BE BOUND BY THIS LICENSE. IF YOU DO NOT AGREE TO ALL OF THE TERMS AND
CONDITIONS OF THIS LICENSE, DO NOT ACCESS, DOWNLOAD, INSTALL, OR USE THE
ARTIFACT.
1. Definitions
(a) “Application” refers to a sequence of instructions or statements
written in machine code language, including object code (that is the
product of a compiler), binary code (data using a two-symbol system)
or an intermediate language (such as register transfer language).
(b) “Artifact” refers to a software application (in either binary or
source code format), Model, and/or Source Code, in accordance with
what is specified above as the “Licensed Artifact”.
(c) “Contribution” means any work, including any modifications or
additions to an Artifact, that is intentionally submitted to
Licensor for inclusion or incorporation in the Artifact directly or
indirectly by the rights owner. For the purposes of this definition,
“submitted” means any form of electronic, verbal, or written
communication sent to the Licensor or its representatives, including
but not limited to communication on electronic mailing lists, source
code control systems, and issue tracking systems that are managed
by, or on behalf of, the Licensor for the purpose of discussing,
sharing and improving the Artifact, but excluding communication that
is conspicuously marked or otherwise designated in writing by the
contributor as “Not a Contribution.”
(d) “Contributor” means Licensor or any other individual or legal entity
that creates or owns a Contribution that is added to or incorporated
into an Artifact or its Derivative.
(e) “Data” means a collection of information and/or content extracted
from the dataset used with a given Model, including to train,
pretrain, or otherwise evaluate the Model. The Data is not licensed
under this License.
(f) “Derivative” means a work derived from or based upon an Artifact,
and includes all modified versions of such Artifact.
(g) “Distribution” means any transmission, reproduction, publication or
other sharing of an Artifact or Derivative to a Third Party,
including providing a hosted service incorporating the Artifact,
which is made available by electronic or other remote means -
e.g. API-based or web access.
(h) “Harm” includes but is not limited to physical, mental,
psychological, financial and reputational damage, pain, or loss.
(i) “License” means the terms and conditions for use, reproduction, and
Distribution as defined in this document.
(j) “Licensor” means the rights owner (by virtue of creation or
documented transfer of ownership) or entity authorized by the rights
owner (e.g., exclusive licensee) that is granting the rights in this
License.
(k) “Model” means any machine-learning based assembly or assemblies
(including checkpoints), consisting of learnt weights, parameters
(including optimizer states), corresponding to the model
architecture as embodied in the Source Code.
(l) “Output” means the results of operating a Model as embodied in
informational content resulting therefrom.
(m) “Permitted Purpose” means for academic or research purposes only.
(n) “Source Code” means any collection of text written using
human-readable programming language, including the code and scripts
used to define, run, load, benchmark or evaluate a Model or any
component thereof, and/or used to prepare data for training or
evaluation, if any. Source Code includes any accompanying
documentation, tutorials, examples, etc., if any. For clarity, the
term “Source Code” as used in this License includes any and all
Derivatives of such Source Code.
(o) “Third Parties” means individuals or legal entities that are not
under common control with Licensor or You.
(p) “Use” includes accessing, using, copying, modifying, and/or
distributing an Artifact; in connection with a Model as Artifact,
Use also includes creating content, fine-tuning, updating, running,
training, evaluating and/or re-parametrizing such Model.
(q) “You” (or “Your”) means an individual or legal entity receiving and
exercising permissions granted by this License and/or making use of
the Artifact for permitted purposes and in any permitted field of
use, including usage of the Artifact in an end-use application -
e.g. chatbot, translator, image generator, etc.
Section II: INTELLECTUAL PROPERTY RIGHTS
Both copyright and patent grants may apply to the Artifact. The Artifact
is subject to additional terms and conditions as described in Section III
below.
2. Grant of Copyright License. Conditioned upon compliance with Section
III below and subject to the terms and conditions of this License, each
Contributor hereby grants to You, only in connection with the Permitted
Purpose, a worldwide, non-exclusive, royalty-free copyright license to
reproduce, use, publicly display, publicly perform, sublicense, and
distribute the Artifact and Derivatives thereof.
3. Grant of Patent License. Conditioned upon compliance with Section III
below and subject to the terms and conditions of this License, and only
where and as applicable, each Contributor hereby grants to You, only in
connection with the Permitted Purpose, a worldwide, non-exclusive,
royalty-free, irrevocable (except as stated in this paragraph) patent
license to make, have made, use, sell, offer to sell, import, and
otherwise transfer the Artifact where such license applies only to those
patent claims licensable by such Contributor that are necessarily
infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Artifact to which such Contribution(s) was
submitted. If You institute patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the
Artifact and/or a Contribution incorporated within the Artifact
constitutes direct or contributory patent infringement, then any patent
licenses granted to You under this License in connection with the
Artifact shall terminate as of the date such litigation is asserted or
filed.
Licensor and Contributor each have the right to grant the licenses
above.
Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
4. Use-based Restrictions. The restrictions contained in the AMD
Responsible AI Use Policy set forth in Attachment A are mandatory Use-
based restrictions. Therefore You may not Use the Artifact in violation
of such restrictions. You may Use the Artifact only subject to this
License; if Section II is held unenforceable or inapplicable, this
Section III will continue to govern any use of the Artifact. You shall
require all of Your users who Use the Artifact or its Derivative
to comply with the terms and conditions of this License, including
those contained in this paragraph, and only for the Permitted Purpose.
5. The Output You Generate with a Model (as Artifact). Except as set
forth herein, Licensor claims no rights in the Output You generate. You
are accountable for the Output You generate and its subsequent uses. No
use of the Output may contravene any provision as stated in this
License.
6. Distribution and Redistribution. You may host for Third Party remote
access purposes (e.g. software-as-a-service), reproduce and distribute
copies of the Artifact or its Derivatives in any medium, with or without
modifications, provided that You meet the following conditions:
6.1. Use-based restrictions in paragraph 4 MUST be included as a
condition precedent to effect any type of legal agreement (e.g. a
license) governing the use and/or distribution of the Artifact or
its Derivatives, and You shall give such notice to any subsequent
Third Party recipients;
6.2. You shall give any Third Party recipients of the Artifact or its
Derivatives a copy of this License;
6.3. You shall cause any modified files to carry prominent notices
stating that You changed the files;
6.4. You shall retain all copyright, patent, trademark, and attribution
notices excluding those notices that do not pertain to any part of
the Artifact or its Derivatives.
6.5. You and any Third Party recipients of the Artifact or its
Derivative shall adhere to the Permitted Purpose.
You may add Your own copyright statement to Your modifications and may
provide additional or different license terms and conditions with
respect to paragraph 6.1., to govern the use, reproduction, or
Distribution of Your modifications, or for any Derivative, provided that
Your use, reproduction, and Distribution of the Artifact or its
Derivative otherwise complies with the conditions stated in this
License. In other words, the Use-based restrictions in Attachment A form
the minimum set of terms for You to license to Third Parties any
Artifact or its Derivative, but You may add more restrictive terms if
You deem it necessary.
Section IV: OTHER PROVISIONS
7. Updates and Runtime Restrictions. To the maximum extent permitted by
law, Licensor reserves the right to restrict (remotely or otherwise)
usage of the Artifact in violation of this License or update the
Artifact through electronic means.
8. Trademarks and Related. Nothing in this License permits You to make
use of Licensor's trademarks, trade names, logos or to otherwise suggest
endorsement or misrepresent the relationship between the parties; and
any rights not expressly granted herein are reserved by the Licensors.
9. Disclaimer of Warranty. Unless required by applicable law or agreed
to in writing, Licensor provides the Artifact (and each Contributor
provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied, including, without
limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT,
MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely
responsible for determining the appropriateness of using the Artifact,
and assume any risks associated with Your exercise of permissions under
this License.
10. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise, unless
required by applicable law (such as deliberate and grossly negligent
acts) or agreed to in writing, shall any Contributor be liable to You
for damages, including any direct, indirect, special, incidental, or
consequential damages of any character arising as a result of this
License or out of the use or inability to use the Artifact (including
but not limited to damages for loss of goodwill, work stoppage, computer
failure or malfunction, or any and all other commercial damages or
losses), even if such Contributor has been advised of the possibility of
such damages.
11. If any provision of this License is held to be invalid, illegal or
unenforceable, the remaining provisions shall be unaffected thereby and
remain valid as if such provision had not been set forth herein.
12. Term and Termination. The term of this License will commence upon
the earlier of Your (a) acceptance of this License or (b) accessing the
Artifact; and will continue in full force and effect until terminated in
accordance with the terms and conditions herein. Licensor may terminate
this License if You are in breach of any term or condition of this
License. Upon termination of this License, all licenses granted to You
will terminate and You must promptly delete and cease use of the
Artifact. Sections 1, 7, 8, 9, 10, 11, and 12 survive termination of
this License.
END OF TERMS AND CONDITIONS
Attachment A
AMD Responsible AI Use Policy
AMD is committed to the responsible use of its Artificial Intelligence
(AI) products and technologies (“AMD AI”). AMD AI may include
artificial intelligence or machine learning technologies that use
algorithms to analyze data and generate output using predictions based
on patterns in data. This policy explains the uses that AMD
specifically prohibits.
If you use any AMD AI, you are agreeing to use the AMD AI in compliance
with applicable laws and not for any of the following prohibited uses.
Prohibited Uses:
1) No Illegal Acts. Do not use AMD AI in violation of any applicable
national, state, local, or other jurisdictional law, rule, regulation,
or sanction.
2) No Explicit Content. Do not use AMD AI to submit (as input),
generate, or disseminate content depicting violent or sexually explicit
content or to create sexual chatbots.
3) No Harm. Do not use AMD AI for any potentially harmful uses,
including fraud, deception, discrimination, abuse, or harassment,
including the following:
a) Harm or abuse of a minor, including grooming and child sexual
exploitation.
b) Impersonation of human beings for purposes of deception.
c) Generation or dissemination of information you know to be false
for the purpose of harming others.
d) Intentionally defame, disparage, or otherwise harass others.
e) Intentionally attempting to materially distort the behavior of a
person in a manner that causes or is likely to cause that person
or another person physical or psychological harm.
f) Providing medical advice or interpretation of medical results that
is intended to be a substitute for professional medical advice,
diagnosis, or treatment.
g) Engaging in the unlawful or unauthorized practice of any
profession, including financial, legal, medical, health, or
related professional practices.
h) Judgment of, discrimination against, or harm to individuals or
groups based on legally protected characteristics or categories,
online or offline social behavior, or known or predicted personal
or personality characteristics, including any of the foregoing
uses in social credit systems.
4) No High-Risk Activity. Do not use AMD AI in any high-risk activities
or applications that create a risk of personal injury, death, or
severe property or environmental damage, including in weapons or
military applications.
5) No Personal Information. Do not use AMD AI to collect, process, or
disclose personal data, including health or sensitive personal
information, without the necessary rights or consents.
6) No Infringement. Do not use AMD AI to generate or disseminate any
information that infringes upon or misappropriates the intellectual
property rights of others, including copyright, trademark, patent, and
trade secret rights, rights to privacy, and publicity rights.
7) No Malware. Do not use AMD AI to generate or disseminate malware or
any other content to be used for the purpose of facilitating unpermitted
access to, or use of, computer systems or data.
8) No Obfuscation. Do not inappropriately obfuscate or fail to disclose
to end users the presence of AI in any application in which AMD AI is
deployed, along with any known risks or dangers of using AI without
appropriate safeguards, oversight and human control.
9) No Reliance. Do not rely on any information generated using AMD AI
without assessing it for accuracy, potential for harm, or other specific
risks applicable to the use case.

NOTICE (new file, 444 lines)

@@ -0,0 +1,444 @@
NOTICES Instella_VL_1B
Copyright Statements
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.
License Text https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
amd-AMD-OLMo-1B-SFT v-u (Apache-2.0)
Copyright Statements
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.
License Text https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Dependencies on FastChat v-u (Apache-2.0)
Copyright Statements
"Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved."
Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
License Text https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Dependencies on LLaVA-NeXT v-u (Apache-2.0)
Copyright Statements
"Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved."
Copyright 2023 Haotian Liu
License Text https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Dependencies on OpenGVLab-InternVL v-u (MIT)
Copyright Statements
"Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved."
Copyright (c) 2023 OpenGVLab
License Text https://spdx.org/licenses/MIT.html
# "Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved."
# --------------------------------------------------------
# InternVL
# Copyright (c) 2023 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
LLaVA-NeXT v-u (Apache-2.0)
Copyright Statements
Copyright 2022 The HuggingFace Team. All rights reserved.
Copyright 2023 Haotian Liu
Copyright 2024 Duc Q. Nguyen, Haotian Liu and Bo Li
Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
Copyright 2023 DDPO-pytorch authors (Kevin Black), The HuggingFace Team, metric-space. All rights reserved.
License Text https://spdx.org/licenses/Apache-2.0.html
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
microsoft-unilm v-u (MIT)
Copyright Statements
Copyright (c) Microsoft Corporation
License Text https://spdx.org/licenses/MIT.html
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
openai-CLIP v-u (MIT)
Copyright Statements
Copyright (c) 2021 OpenAI.
License Text https://spdx.org/licenses/MIT.html
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
salesforce-LAVIS v-u (BSD-3-Clause)
Copyright Statements
Copyright (c) 2023, salesforce.com, inc.
License Text https://spdx.org/licenses/BSD-3-Clause.html
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Copyright Statements
Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Tongyi Qianwen LICENSE AGREEMENT
Tongyi Qianwen Release Date: August 23, 2023
By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models (including Qwen-VL model and Qwen-VL-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
3. Redistribution
You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that You changed the files;
c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
7. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT'S CAUSED.
d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
------------- LICENSE FOR NVIDIA Megatron-LM code --------------
Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
------------- LICENSE FOR OpenAI tiktoken code --------------
MIT License
Copyright (c) 2022 OpenAI, Shantanu Jain
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md
# Instella-VL-1B ✨
Welcome to the official repository for **Instella-VL-1B**, AMD's first ever Vision-Language Model (VLM). This repository provides a detailed guide for training and inference with **Instella-VL-1B**. Developed from AMD's **Instella-1B** (previously known as [AMD OLMo 1B SFT](https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html) LLM), this model is fully open-source, with both model weights and training code available for AMD GPUs (MI300). Its compact size aims to make it accessible to a broad spectrum of researchers, developers, and enthusiasts, enabling them to build upon, modify, and integrate it into their own projects.
[[GitHub](https://github.com/AMD-AIG-AIMA/InstellaVL)][[Blog](https://rocm.blogs.amd.com/artificial-intelligence/Instella-BL-1B-VLM/README.html)]
## Main Results
We compare our model with models that release only their model weights (marked with * in the table below) as well as models that release their weights, data curation, and all training details.
<table class="tg"><thead>
<tr>
<td class="tg-0pky"></td>
<td class="tg-c3ow">DeepSeek-VL-1.3B *</td>
<td class="tg-c3ow">InternVL2-1B *</td>
<td class="tg-c3ow">InternVL2.5-1B *</td>
<td class="tg-c3ow">TinyLLaVA-2.4B</td>
<td class="tg-c3ow">TinyLLaVA-1.5B</td>
<td class="tg-c3ow">LLaVA-OneVision-1B</td>
<td class="tg-c3ow">MiniCPM-V-2</td>
<td class="tg-c3ow">Instella-VL-1B</td>
</tr></thead>
<tbody>
<tr>
<td class="tg-c3ow">GQA</td>
<td class="tg-c3ow">--</td>
<td class="tg-c3ow">55.06</td>
<td class="tg-c3ow">56.66</td>
<td class="tg-c3ow">61.58</td>
<td class="tg-c3ow">60.28</td>
<td class="tg-c3ow">57.95</td>
<td class="tg-c3ow">--</td>
<td class="tg-c3ow">61.52</td>
</tr>
<tr>
<td class="tg-c3ow">SQA</td>
<td class="tg-c3ow">64.52</td>
<td class="tg-c3ow">89.54</td>
<td class="tg-c3ow">93.90</td>
<td class="tg-c3ow">64.30</td>
<td class="tg-c3ow">59.69</td>
<td class="tg-c3ow">59.25</td>
<td class="tg-c3ow">76.10</td>
<td class="tg-c3ow">83.74</td>
</tr>
<tr>
<td class="tg-c3ow">POPE</td>
<td class="tg-c3ow">85.80</td>
<td class="tg-c3ow">87.40</td>
<td class="tg-c3ow">89.95</td>
<td class="tg-c3ow">85.66</td>
<td class="tg-c3ow">84.77</td>
<td class="tg-c3ow">87.17</td>
<td class="tg-c3ow">86.56</td>
<td class="tg-c3ow">86.73</td>
</tr>
<tr>
<td class="tg-c3ow">MM-Bench</td>
<td class="tg-c3ow">64.34</td>
<td class="tg-c3ow">61.70</td>
<td class="tg-c3ow">68.40</td>
<td class="tg-c3ow">58.16</td>
<td class="tg-c3ow">51.28</td>
<td class="tg-c3ow">44.60</td>
<td class="tg-c3ow">70.44</td>
<td class="tg-c3ow">69.17</td>
</tr>
<tr>
<td class="tg-c3ow">SeedBench</td>
<td class="tg-c3ow">65.94</td>
<td class="tg-c3ow">65.90</td>
<td class="tg-c3ow">71.30</td>
<td class="tg-c3ow">63.30</td>
<td class="tg-c3ow">60.04</td>
<td class="tg-c3ow">65.43</td>
<td class="tg-c3ow">66.90</td>
<td class="tg-c3ow">68.47</td>
</tr>
<tr>
<td class="tg-c3ow">MMMU</td>
<td class="tg-c3ow">28.67</td>
<td class="tg-c3ow">32.40</td>
<td class="tg-c3ow">35.60</td>
<td class="tg-c3ow">32.11</td>
<td class="tg-c3ow">29.89</td>
<td class="tg-c3ow">30.90</td>
<td class="tg-c3ow">38.55</td>
<td class="tg-c3ow">29.30</td>
</tr>
<tr>
<td class="tg-c3ow">RealWorldQA</td>
<td class="tg-c3ow">50.20</td>
<td class="tg-c3ow">51.90</td>
<td class="tg-c3ow">58.30</td>
<td class="tg-c3ow">52.42</td>
<td class="tg-c3ow">46.67</td>
<td class="tg-c3ow">51.63</td>
<td class="tg-c3ow">55.03</td>
<td class="tg-c3ow">58.82</td>
</tr>
<tr>
<td class="tg-c3ow">MMStar</td>
<td class="tg-c3ow">38.30</td>
<td class="tg-c3ow">46.18</td>
<td class="tg-c3ow">47.93</td>
<td class="tg-c3ow">37.17</td>
<td class="tg-c3ow">31.87</td>
<td class="tg-c3ow">37.38</td>
<td class="tg-c3ow">40.93</td>
<td class="tg-c3ow">43.21</td>
</tr>
<tr>
<td class="tg-c3ow"><span style="font-weight:bold">Average</span></td>
<td class="tg-c3ow">-</td>
<td class="tg-c3ow">61.26</td>
<td class="tg-c3ow">65.26</td>
<td class="tg-c3ow">56.84</td>
<td class="tg-c3ow">53.06</td>
<td class="tg-c3ow">54.29</td>
<td class="tg-c3ow">-</td>
<td class="tg-c3ow">62.62</td>
</tr>
<tr>
<td class="tg-c3ow">OCRBench</td>
<td class="tg-c3ow">41.40</td>
<td class="tg-c3ow">74.40</td>
<td class="tg-c3ow">74.20</td>
<td class="tg-c3ow">28.90</td>
<td class="tg-c3ow">34.40</td>
<td class="tg-c3ow">43.00</td>
<td class="tg-c3ow">60.00</td>
<td class="tg-c3ow">67.90</td>
</tr>
<tr>
<td class="tg-c3ow">TextVQA</td>
<td class="tg-c3ow">57.54</td>
<td class="tg-c3ow">69.60</td>
<td class="tg-c3ow">72.96</td>
<td class="tg-c3ow">47.05</td>
<td class="tg-c3ow">49.54</td>
<td class="tg-c3ow">49.54</td>
<td class="tg-c3ow">74.23</td>
<td class="tg-c3ow">71.23</td>
</tr>
<tr>
<td class="tg-c3ow">AI2D</td>
<td class="tg-c3ow">51.13</td>
<td class="tg-c3ow">62.40</td>
<td class="tg-c3ow">67.58</td>
<td class="tg-c3ow">49.58</td>
<td class="tg-c3ow">43.10</td>
<td class="tg-c3ow">57.35</td>
<td class="tg-c3ow">64.40</td>
<td class="tg-c3ow">66.65</td>
</tr>
<tr>
<td class="tg-c3ow">ChartQA</td>
<td class="tg-c3ow">47.40</td>
<td class="tg-c3ow">71.52</td>
<td class="tg-c3ow">75.76</td>
<td class="tg-c3ow">12.96</td>
<td class="tg-c3ow">15.24</td>
<td class="tg-c3ow">61.24</td>
<td class="tg-c3ow">59.80</td>
<td class="tg-c3ow">72.52</td>
</tr>
<tr>
<td class="tg-c3ow">DocVQA</td>
<td class="tg-c3ow">35.70</td>
<td class="tg-c3ow">80.94</td>
<td class="tg-c3ow">82.76</td>
<td class="tg-c3ow">25.82</td>
<td class="tg-c3ow">30.38</td>
<td class="tg-c3ow">71.22</td>
<td class="tg-c3ow">69.54</td>
<td class="tg-c3ow">80.30</td>
</tr>
<tr>
<td class="tg-c3ow">InfoVQA</td>
<td class="tg-c3ow">20.52</td>
<td class="tg-c3ow">46.30</td>
<td class="tg-c3ow">53.62</td>
<td class="tg-c3ow">21.35</td>
<td class="tg-c3ow">24.46</td>
<td class="tg-c3ow">41.18</td>
<td class="tg-c3ow">38.24</td>
<td class="tg-c3ow">46.40</td>
</tr>
<tr>
<td class="tg-c3ow">OCR Average</td>
<td class="tg-c3ow">42.28</td>
<td class="tg-c3ow">67.53</td>
<td class="tg-c3ow">71.15</td>
<td class="tg-c3ow">30.94</td>
<td class="tg-c3ow">32.85</td>
<td class="tg-c3ow">53.92</td>
<td class="tg-c3ow">61.04</td>
<td class="tg-c3ow">67.50</td>
</tr>
</tbody></table>
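As a sanity check, the two Average rows can be recomputed from the per-benchmark scores reported for Instella-VL-1B in the table above:

```python
# Instella-VL-1B scores copied from the table above
general = [61.52, 83.74, 86.73, 69.17, 68.47, 29.30, 58.82, 43.21]  # GQA, SQA, POPE, MM-Bench, SeedBench, MMMU, RealWorldQA, MMStar
ocr = [67.90, 71.23, 66.65, 72.52, 80.30, 46.40]  # OCRBench, TextVQA, AI2D, ChartQA, DocVQA, InfoVQA

avg = round(sum(general) / len(general), 2)
ocr_avg = round(sum(ocr) / len(ocr), 2)
print(avg, ocr_avg)  # 62.62 67.5
```

Both values match the Average and OCR Average entries in the Instella-VL-1B column.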
### Quick Start
> [!NOTE]
> Install the following package versions to set up the inference environment.
> ```bash
> pip==25.0
> wheel==0.45.1
> setuptools==75.8.0
> torch==2.6.0
> torchvision==0.21.0
> transformers==4.49.0
> einops==0.8.0
> ```
```python
import torch
from transformers import AutoTokenizer, AutoProcessor, AutoConfig, AutoModelForCausalLM
from PIL import Image
import requests
from io import BytesIO
def load_image(image_file):
    if image_file.startswith("http") or image_file.startswith("https"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image
config = AutoConfig.from_pretrained("amd/Instella-VL-1B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("amd/Instella-VL-1B", config=config, trust_remote_code=True)
processor = AutoProcessor.from_pretrained("amd/Instella-VL-1B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("amd/Instella-VL-1B", trust_remote_code=True).to('cuda') # or 'cpu'
model.eval()
# For single image and text
query="Describe the image."
image=load_image("path/to/your_image") # can be a https:// url
out = processor.encode(query, image, model.get_vision_tower().image_processor, tokenizer, config)
inputs = {k: v.to(model.device) for k, v in out.items() if isinstance(v, torch.Tensor)}
with torch.inference_mode():
    output_ids = model.generate(inputs["input_ids"], images=inputs['image_tensor'], image_sizes=out['image_sizes'], do_sample=True, num_beams=1, temperature=0.2, max_new_tokens=1024, use_cache=True, stopping_criteria=out['stopping_criteria'], eos_token_id=out['eos_token_id'])
outputs = processor.decode(output_ids)
print("InstellaVL: ", outputs)
# For batch of images and text.
query=["Describe the image.", "What is the color of the dog?"]
image=[load_image("../assets/images/instellavl.png"), load_image("../assets/images/example2_dog.jpg")]
outs = processor.batch_encode(query, image, model.get_vision_tower().image_processor, tokenizer, config)
for idx, o in enumerate(outs):
    ins = {k: v.to(model.device) for k, v in o.items() if isinstance(v, torch.Tensor)}
    with torch.inference_mode():
        output_ids = model.generate(ins["input_ids"],
                                    images=ins['image_tensor'],
                                    image_sizes=o['image_sizes'],
                                    do_sample=True,
                                    num_beams=1,
                                    temperature=0.2,
                                    max_new_tokens=1024,
                                    use_cache=True,
                                    stopping_criteria=o['stopping_criteria'],
                                    eos_token_id=o['eos_token_id'])
    outputs = processor.decode(output_ids)
    print("Query: ", query[idx])
    print("InstellaVL: ", outputs)
```
<details>
<summary><b>TL;DR</b>: Loading from locally saved checkpoint</summary>
<p><strong>Note:</strong> Run <code>pip install -e . --no-deps</code> from the root of the InstellaVL repo to register it as the <code>instellavl</code> package.</p>
```python
import torch
# Import essential modules
from instellavl.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from instellavl.conversation import conv_templates, SeparatorStyle
from instellavl.model.builder import load_pretrained_model
from instellavl.utils import disable_torch_init
from instellavl.mm_utils import process_images, tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image
import requests
from io import BytesIO
# Login into HF Hub
from huggingface_hub import login
login(token = "<Your HFtoken id>") # Enter your token
def load_image(image_file):
    if image_file.startswith("http") or image_file.startswith("https"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image
#
# ========= CHANGE IMAGE and Query only HERE ============
image_file = '/path/to/Instella-VL-repo/assets/images/example2_dog.jpg' # Enter the test image path
query = 'Describe this image.'
# =======================================================
disable_torch_init()
conv_mode = 'instella'
# Model loading
model_path = '<path/to/model-checkpoint-saved-locally>' # Enter your model path, should contain instellavl substring in the name.
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, False, False)
model.eval()
model = model.to('cuda') # change to 'cpu' if not 'cuda'
# Image pre-processing
image = load_image(image_file)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.dtype)
# Text pre-processing - follow the below logic too when there is no Image:
# if images is not None and len(image_tensor) != 0 and DEFAULT_IMAGE_TOKEN not in text:
# question = DEFAULT_IMAGE_TOKEN + "\n" + text
# else:
# question = text
query = query.replace(DEFAULT_IMAGE_TOKEN, "").strip()
question = DEFAULT_IMAGE_TOKEN + "\n" + query
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
# Final arrangements required
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
keywords = [conv.sep]
image_sizes = [image.size]
stopping_criteria = [KeywordsStoppingCriteria(keywords, tokenizer, input_ids)]
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("|||IP_ADDRESS|||")]
with torch.inference_mode():
    output_ids = model.generate(input_ids.to(model.device), images=image_tensor.to(model.device), image_sizes=image_sizes, do_sample=True, num_beams=1, temperature=0.2, max_new_tokens=1024, use_cache=True, stopping_criteria=stopping_criteria, eos_token_id=terminators)
outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
print("InstellaVL: ", outputs)
```
</details>
## Model Architecture
| Parts | Parameter size | Number of layers | Number of heads | Hidden size | Patch Size |
| ------------- |:-------------:|:-----:|:-----:|:-----:|:-----:|
| Vision Encoder | 300M | 24| 16 | 1024 | 14 |
| MLP | 6.3M | 2 | - | 2048 | - |
| LM | 1.2B | 16 | 16 | 2048 | - |
We initialize the vision encoder from [CLIP-ViT-L/14@336](https://huggingface.co/openai/clip-vit-large-patch14-336) and the LM from [AMD OLMo 1B SFT](https://huggingface.co/amd/AMD-OLMo-1B-SFT).
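Given the 14×14 patch size and CLIP-ViT-L/14@336's 336×336 input resolution, the number of visual tokens per image tile follows directly; a quick check (the helper name is illustrative, not from the repo):

```python
def num_vision_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square input (CLS token excluded)."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

print(num_vision_tokens(336, 14))  # 576 visual tokens per 336x336 tile
```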
## Training Stages
| Stages | MLP Warmup | Pretraining | Instruction Tuning |
| ------------- |:-------------:|:-----:|:-----:|
| Tunable Parts | Adapter | Entire Model | Entire Model |
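In practice, the staged schedule above amounts to toggling `requires_grad` on the relevant submodules: adapter only during MLP warmup, everything afterwards. A minimal PyTorch sketch under assumed module names (`vision_encoder`, `adapter`, and `lm` are illustrative stand-ins, not the repo's actual classes):

```python
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # stand-in for the CLIP ViT
        self.adapter = nn.Linear(8, 8)         # stand-in for the 2-layer MLP projector
        self.lm = nn.Linear(8, 8)              # stand-in for the language model

def set_stage(model: nn.Module, stage: str) -> None:
    """MLP warmup: train the adapter only; later stages: train the entire model."""
    full = stage in ("pretraining", "instruction_tuning")
    for name, p in model.named_parameters():
        p.requires_grad = full or name.startswith("adapter")

m = TinyVLM()
set_stage(m, "mlp_warmup")
trainable = [n for n, p in m.named_parameters() if p.requires_grad]
print(trainable)  # only adapter parameters remain trainable
```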
## Hardware
Training was conducted on up to 4 nodes, totaling 32 GPUs. Each node comprises [8 AMD Instinct™ MI300X GPUs](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html).
**MLP warmup**: 1 node
**Pretraining**: 2 nodes
**Finetune**: 4 nodes
## Datasets
### MLP Warmup
[BLIP558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
<h3 align="center">Pretraining Stage</h3>
| **Domain** | **Datasets** | **Num of Examples** | **Licenses** |
|---|:---:|---:|:---|
| Image Captions | [BLIP150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain), [COCO118K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain), [CC3M-Recap](https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC3M), [Pixmo_Cap](https://huggingface.co/datasets/allenai/pixmo-cap) | 3.52M | BSD 3-Clause for BLIP150K, COCO118K; Apache 2 for CC3M-Recap; ODC-BY-1.0 for Pixmo_Cap; see source materials for CC3M-Recap |
| OCR | [SynthDog_EN](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data), [SynthDog_ZH](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data), [UReader](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data), [ART](https://rrc.cvc.uab.es/?ch=14&com=downloads), [COCO-Text](https://bgshih.github.io/cocotext/), [HierText](https://github.com/google-research-datasets/hiertext), [Uber-Text](https://s3-us-west-2.amazonaws.com/uber-common-public/ubertext/index.html), [TextOCR](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [OpenVINO](https://github.com/openvinotoolkit/cvat), [MLT-17](https://rrc.cvc.uab.es/?ch=8&com=downloads) | 913K | Apache 2 for SynthDog_EN, SynthDog_ZH, UReader, TextOCR, OpenVINO; CC By 4.0 for COCO-Text; CC BY-SA 4.0 for HierText, Uber-Text; See source materials for ART, MLT-17 |
| Doc | [DocVQA](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [DocStruct4M](https://huggingface.co/datasets/mPLUG/DocStruct4M) | 410K | Apache 2 |
| Table & Chart & Plot | [Chart2Text](https://github.com/vis-nlp/Chart-to-text/tree/main/pew_dataset/dataset/imgs), [UniChart](https://huggingface.co/datasets/ahmed-masry/unichart-pretrain-data), [PlotQA](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [WidgetCaption](https://huggingface.co/datasets/rootsautomation/RICO-WidgetCaptioning?row=0), [Screen2Words](https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words), [SciGraphQA-295K](https://huggingface.co/datasets/alexshengzhili/SciGraphQA-295K-train), [Paper2Fig100K](https://zenodo.org/records/7299423#.Y2lzonbMKUl), [MMC Instruction](https://huggingface.co/datasets/xywang1/MMC/viewer/MMC-Instruction), [M-Paper](https://huggingface.co/datasets/mPLUG/M-Paper) | 1.97M | GPL-3.0 for Chart2Text; MIT for UniChart, SciGraphQA-295K; Apache 2 for PlotQA, M-Paper; CC By 4.0 for WidgetCaption, Screen2Words, Paper2Fig100K; CC BY-SA 4.0 for MMC Instruction |
| Text Only | [Evol-Instruct-GPT-4](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/evol_instruct) | 70K | Apache 2 |
<h3 align="center">Instruction-tuning Stage</h3>
| **Domain** | **Datasets** | **Num of Examples** | **Licenses** |
|---|:---:|---:|:---|
| General | [AOKVQA, CLEVR, Hateful Memes, Image Textualization, OKVQA, ScienceQA, ShareGPT-4V, TallyQA, Visual7W, VizWiz, VQAv2, WebSight, ALLaVA Instruct, Cambrian, COCO Caption, IconQA, LLaVA-158K, LLaVAR, RefCOCO, ShareGPT-4O, Vision FLAN, VisText, VQARAD, VSR, InterGPS](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [Image-Paragraph-Captioning, ImageNet, COCO-GOI, COCO-ITM, Visual Dialog, SNLI-VE](https://huggingface.co/datasets/MMInstruction/M3IT), [Web-Landmark, Web-Celebrity, SAM, LAION-GPT-4V-Dataset, OODVQA]( https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/tree/main), [Pixmo_Cap](https://huggingface.co/datasets/allenai/pixmo-cap), [Pixmo_Count](https://huggingface.co/datasets/allenai/pixmo-count), [Pixmo_Points](https://huggingface.co/datasets/allenai/pixmo-points), [Pixmo_Ask_Model_Anything](https://huggingface.co/datasets/allenai/pixmo-ask-model-anything), [SVIT_Core_150K](https://huggingface.co/datasets/BAAI/SVIT), [Localized Narratives](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | 2.66M | see source materials for Image-Paragraph-Captioning, ImageNet, COCO-GOI, COCO-ITM, Visual Dialog, SNLI-VE; ODC-BY-1.0 for Pixmo_Cap, Pixmo_Count, Pixmo_Points, Pixmo_Ask_Model_Anything; CC By 4.0 for SVIT_Core_150K, Localized Narratives; Apache 2 for rest of the datasets; |
| Table & Chart & Screen | [AI2D, ChartQA, DocVQA, FigureQA, InfographicVQA, RoBUT-SQA, RoBUT-WTQ, TQA, UReader IE, UReader QA, Chart2Text, Diagram Image2Text, DVQA, HiTab, LRV Chart, RoBUT WikiSQL, Screen2Words, UReader Caption, UReader KG, VisualMRC](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [TinyChartData](https://huggingface.co/datasets/mPLUG/TinyChartData) | 866K | Apache 2 |
| Doc | [ArxivQA](https://huggingface.co/datasets/MMInstruction/ArxivQA), [DocDownstream-1.0](https://huggingface.co/datasets/mPLUG/DocDownstream-1.0), [DocReason25K](https://huggingface.co/datasets/mPLUG/DocReason25K), [DocStruct4M](https://huggingface.co/datasets/mPLUG/DocStruct4M), [Pixmo_Docs](https://huggingface.co/datasets/allenai/pixmo-docs) | 522K | CC BY-SA 4.0 for ArxivQA; Apache 2 for DocDownstream-1.0, DocReason25K, DocStruct4M; ODC-BY-1.0 for Pixmo_Docs |
| General OCR | [ChromeWriting, IIIT5K, K12 Printing, Rendered Text, TextCaps, HME100K, IAM, TextOCR-GPT-4V](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [SynthDog-EN](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data) | 84K | Apache 2 |
| Math & Reasoning | [MAVIS Manual Collection, CLEVR-Math, Geo170K QA, GEOS, GeoMVerse, MapQA, Super-CLEVR, UniGeo, LRV Normal, Visual Genome, MAVIS Data Engine, Geo170K Align, Geometry3K, GeoQA+, TabMWP, GQA, RAVEN, MathVision, KVQA, VCR](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [FinQA](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [Design2Code, IDK](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/) | 460K | CC By 4.0 for FinQA; Apache 2 for rest of the datasets |
| Others | [IQA, MOCHEG, Shapes](https://huggingface.co/datasets/MMInstruction/M3IT), [ALFWorld, Q-Instruct-DB](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/) | 479K | see source materials for IQA, MOCHEG, Shapes; Apache 2 for ALFWorld, Q-Instruct-DB |
| Text Only | [MathQA, Magpie Pro (L3 MT), Magpie Pro (Qwen2 ST), Magpie Pro (L3 ST)](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) | 480K | Apache 2 |
> [!NOTE]
> Further, to strengthen the model's understanding of science-based and general reasoning questions, as identified through error analysis, we oversampled (almost doubling the volume of) specific datasets from the SFT dataset pool, as detailed below.
>
> Oversampled (~2x sampling rate): ScienceQA, AI2D, PMC-VQA, Cambrian, and TQA
>
> Further information concerning the training datasets, including applicable licensing terms and use restrictions, can be found at the linked source locations.
For details of the training hyperparameters, please check [our GitHub repo](https://github.com/AMD-AIG-AIMA/Instella-VL).
## Contributors
**Core contributors:** [Ximeng Sun](https://sunxm2357.github.io/), [Aditya Kumar Singh](https://rodosingh.github.io), [Gowtham Ramesh](https://www.linkedin.com/in/gowtham1/), [Zicheng Liu](https://zicliu.wixsite.com/mysite)
**Contributors:** [Pratik Prabhanjan Brahma](https://www.linkedin.com/in/pratik-p-brahma/), [Ze Wang](https://www.linkedin.com/in/ze-wang-1379601a5/), [Jiang Liu](https://joellliu.github.io/), [Jialian Wu](https://jialianwu.com/), [Prakamya Mishra](https://prakamya-mishra.github.io/), [Xiaodong Yu](https://www.xiaodongyu.me/), [Yusheng Su](https://yushengsu-thu.github.io/), [Sudhanshu Ranjan](https://www.linkedin.com/in/sudhanshu-ranjan-33a216124), [Emad Barsoum](https://www.linkedin.com/in/ebarsoum/)
## Bias, Risks, and Limitations
This model is made accessible without any safety guarantees. Users should be aware that the model may generate outputs that are sensitive, inaccurate, harmful, biased, or otherwise objectionable based on user prompts. It is crucial for users to conduct comprehensive safety evaluations, implement safety filtering, and verify the model's outputs to mitigate these risks.
## License
See the Files section of this repository for the license and any notices.
## Citing
```bibtex
@misc{Instella-VL-1B,
title = {Instella-VL-1B: First AMD Vision Language Model},
url = {https://huggingface.co/amd/Instella-VL-1B},
author = {Ximeng Sun and Aditya Singh and Gowtham Ramesh and Jiang Liu and Ze Wang and Sudhanshu Ranjan and Pratik Prabhanjan Brahma and Prakamya Mishra and Jialian Wu and Xiaodong Yu and Yusheng Su and Emad Barsoum and Zicheng Liu},
month = {March},
year = {2025}
}
```

chat_template.json Normal file
@@ -0,0 +1,2 @@
{"chat_template": "|||IP_ADDRESS|||\n{% for message in messages -%}{{ message['role'] + message['content']}}{%- if not loop.last -%}{{ '\\n' if loop.index % 2 == 1 else '|||IP_ADDRESS|||\\n'}}{%- endif %}{%- endfor -%}"
}
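For readers who want to sanity-check the template, here is a plain-Python rendition of its alternation logic (a sketch, not the template engine itself; the `|||IP_ADDRESS|||` strings are copied verbatim from the scrubbed template above, and the role strings are illustrative):

```python
def render(messages, boundary="|||IP_ADDRESS|||"):
    # Mirrors the Jinja template: roles and contents are concatenated,
    # and turn separators alternate between "\n" and the boundary token.
    out = boundary + "\n"
    for i, m in enumerate(messages, start=1):  # Jinja's loop.index is 1-based
        out += m["role"] + m["content"]
        if i != len(messages):
            out += "\n" if i % 2 == 1 else boundary + "\n"
    return out

chat = [
    {"role": "<|user|>\n", "content": "Hi"},
    {"role": "<|assistant|>\n", "content": "Hello"},
]
```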

config.json Normal file
@@ -0,0 +1,100 @@
{
"_name_or_path": "/home/goramesh/local/gramesh/Instella-VL-1B/",
"architectures": [
"InstellaVLForCausalLM"
],
"auto_map": {
"AutoConfig": "modeling_instellavl.InstellaVLConfig",
"AutoModelForCausalLM": "modeling_instellavl.InstellaVLForCausalLM"
},
"attention_bias": false,
"attention_dropout": 0.0,
"clip_qkv": null,
"eos_token_id": 50279,
"hidden_act": "silu",
"hidden_size": 2048,
"image_aspect_ratio": "anyres",
"image_crop_resolution": null,
"image_grid_pinpoints": [
[
336,
336
],
[
336,
672
],
[
336,
1008
],
[
336,
1344
],
[
336,
1680
],
[
672,
336
],
[
672,
672
],
[
1008,
336
],
[
1344,
336
],
[
1680,
336
]
],
"image_split_resolution": null,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 2048,
"mm_anyres_choose_method": "best_fit",
"mm_compact_visual_tokens": false,
"mm_downsample_ratio": 1,
"mm_hidden_size": 1024,
"mm_newline_position": "one_token",
"mm_patch_merge_type": "spatial_unpad",
"mm_projector_lr": null,
"mm_projector_type": "mlp2x_gelu",
"mm_resampler_type": null,
"mm_spatial_pool_mode": "bilinear",
"mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
"mm_use_im_patch_token": false,
"mm_use_im_start_end": false,
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"mm_vision_tower": "openai/clip-vit-large-patch14-336",
"mm_vision_tower_lr": null,
"model_type": "instellavl",
"num_attention_heads": 16,
"num_hidden_layers": 16,
"num_key_value_heads": 16,
"online_training": true,
"pad_token_id": 1,
"pos_skipping_range": 4096,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": true,
"tokenizer_model_max_length": 32768,
"tokenizer_padding_side": "right",
"torch_dtype": "float16",
"transformers_version": "4.45.1",
"use_cache": true,
"use_mm_proj": true,
"use_pos_skipping": false,
"vision_tower_pretrained": null,
"vocab_size": 50282
}

conversation.py Normal file
@@ -0,0 +1,334 @@
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.
import re
import base64
import dataclasses
from PIL import Image
from io import BytesIO
from enum import auto, Enum
from typing import List, Any, Dict, Union, Tuple
from transformers import AutoTokenizer
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
MPT = auto()
INSTELLA = auto()
@dataclasses.dataclass
class Conversation:
r"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
tokenizer_id: str = ""
tokenizer: Any = None
# Stop criteria (the default one is EOS token)
stop_str: Union[str, List[str]] = None
# Stops generation if meeting any token in this list
stop_token_ids: List[int] = None
skip_next: bool = False
def get_prompt(self):
"""
Generates a formatted prompt string based on the messages and separator style.
The function processes the messages stored in the instance, applies specific formatting rules
based on the separator style, and returns the resulting prompt string.
Returns:
`str`: The formatted prompt string.
Raises:
`ValueError`: If an invalid separator style is specified.
"""
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0]
if "mmtag" in self.version:
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
elif not init_msg.startswith("<image>"):
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, "<image>\n" + init_msg)
else:
messages[0] = (init_role, init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.INSTELLA:
seps = [self.sep, self.sep2]
ret = "|||IP_ADDRESS|||"
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
if i % 2 == 1:
message = message.strip()
ret += role + message + seps[i % 2]
else:
ret += role
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def process_image(self, image: Union[str, Image.Image], image_process_mode: str, return_pil: bool=False, image_format: str="PNG")->Union[str, Image.Image]:
r"""
Processes an image according to the specified mode and returns either a PIL image or a base64 encoded string.
Args:
- image (Union[str, Image.Image]): The image to be processed. Can be a file path or a PIL Image object.
- image_process_mode (str): The mode of image processing. Options are "Pad", "Default", "Crop", or "Resize".
- return_pil (bool, optional): If True, returns a PIL Image object. If False, returns a base64 encoded string. Defaults to False.
- image_format (str, optional): The format to save the image in if returning a base64 encoded string. Defaults to "PNG".
Returns:
Union[str, Image.Image]: The processed image, either as a PIL Image object or a base64 encoded string.
Raises:
ValueError: If an invalid image_process_mode is provided.
"""
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if type(image) is not Image.Image:
image = Image.open(image).convert("RGB")
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 672, 448
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
return image
else:
buffered = BytesIO()
image.save(buffered, format=image_format)
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
return img_b64_str
def get_images(self, return_pil: bool=False, return_path: bool=False) -> List[Union[str, Image.Image]]:
"""
Retrieve images from the conversation messages.
Args:
return_pil (bool): If True, return images as PIL objects. Defaults to False.
return_path (bool): If True, return the image file paths instead of processing them. Defaults to False.
Returns:
list: A list of images or image paths depending on the arguments.
"""
images = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
for img in image:
    if not return_path and self.is_image_file(img):
        img = self.process_image(img, image_process_mode, return_pil=return_pil)
    images.append(img)
return images
def is_image_file(self, filename: str)->bool:
image_extensions = [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"]
return any(filename.lower().endswith(ext) for ext in image_extensions)
def is_video_file(self, filename: str)->bool:
video_extensions = [".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".mpeg", ".mpg"]
return any(filename.lower().endswith(ext) for ext in video_extensions)
def to_gradio_chatbot(self)->list:
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
if len(image) == 1:
msg = "<image>\n" + msg.replace("<image>", "").strip()
else:
msg = re.sub(r"(<image>)\n(?=<image>)", r"\1 ", msg)
img_str_list = []
for img in image:
if self.is_image_file(img):
img_b64_str = self.process_image(img, "Default", return_pil=False, image_format="JPEG")
img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" style="max-width: 256px; max-height: 256px; width: auto; height: auto; object-fit: contain;"/>'
img_str_list.append(img_str)
elif self.is_video_file(img):
ret.append(((img,), None))
msg = msg.strip()
img_place_holder = ""
for img_str in img_str_list:
img_place_holder += f"{img_str}\n\n"
if len(img_str_list) > 0:
msg = f"{img_place_holder}\n\n{msg}"
if len(msg) > 0:
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self)->"Conversation":
return Conversation(system=self.system, roles=self.roles, messages=[[x, y] for x, y in self.messages], offset=self.offset, sep_style=self.sep_style, sep=self.sep, sep2=self.sep2, version=self.version)
def dict(self)->Dict[str, Any]:
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_vicuna_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[
["Human", "What are the key differences between renewable and non-renewable energy sources?"],
[
"Assistant",
"Renewable energy sources are those that can be replenished naturally in a relatively "
"short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
"Non-renewable energy sources, on the other hand, are finite and will eventually be "
"depleted, such as coal, oil, and natural gas. Here are some key differences between "
"renewable and non-renewable energy sources:\n"
"1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
"energy sources are finite and will eventually run out.\n"
"2. Environmental impact: Renewable energy sources have a much lower environmental impact "
"than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
"and other negative effects.\n"
"3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
"have lower operational costs than non-renewable sources.\n"
"4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
"locations than non-renewable sources.\n"
"5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
"situations and needs, while non-renewable sources are more rigid and inflexible.\n"
"6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
"non-renewable sources are not, and their depletion can lead to economic and social instability.\n",
],
],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_instella = Conversation(
system="",
roles=("<|user|>\n", "<|assistant|>\n"),
version="instella",
messages=[],
offset=0,
sep_style=SeparatorStyle.INSTELLA,
sep="\n",
sep2='|||IP_ADDRESS|||\n'
)
default_conversation = conv_instella
conv_templates = {
"default": conv_instella,
"mpt": conv_mpt,
"instella": conv_instella,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())
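As a quick check of the `INSTELLA` separator style, the relevant branch of `get_prompt` can be restated standalone (a sketch that assumes generation starts at an empty assistant turn; the class above is not imported):

```python
def instella_prompt(turns, sep="\n", sep2="|||IP_ADDRESS|||\n"):
    # Mirrors SeparatorStyle.INSTELLA: the prompt opens with the boundary
    # token, user turns end with "\n", assistant turns end with the
    # boundary token; an empty message leaves the role open for generation.
    ret = "|||IP_ADDRESS|||"
    seps = [sep, sep2]
    for i, (role, message) in enumerate(turns):
        if message:
            if i % 2 == 1:
                message = message.strip()
            ret += role + message + seps[i % 2]
        else:
            ret += role
    return ret

turns = [("<|user|>\n", "What is in this image?"),
         ("<|assistant|>\n", None)]
```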

generation_config.json Normal file
@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"eos_token_id": 50279,
"pad_token_id": 1,
"transformers_version": "4.45.1"
}

@@ -0,0 +1,30 @@
from typing import List
from PIL.Image import Image
from transformers import CLIPImageProcessor
from transformers.image_processing_utils import BaseImageProcessor
from .mm_utils import process_images
# TODO can inherit from CLIPImageProcessor instead and use the process function directly.
class InstellaVLImageProcessor(BaseImageProcessor):
r"""
Pre-process images
"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
def process(self,
            images: List[Image],
            processor: CLIPImageProcessor,
            model_cfg: dict
            ):
    # Check for empty input before handing it to process_images.
    if images is None:
        return {"pixel_values": None}
    image_tensors = process_images(images, processor, model_cfg)
    return {"pixel_values": image_tensors}
InstellaVLImageProcessor.register_for_auto_class()

mm_utils.py Normal file
@@ -0,0 +1,519 @@
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.
r"""This module provides various utility functions for processing images, including resizing, cropping, padding,
and extracting patches. It also includes functions for processing images with different resolutions and
tokenizing image prompts."""
import re
import ast
import math
import torch
import base64
import torch.distributed as dist
from PIL import Image
from io import BytesIO
from typing import List, Tuple, Union, Any
from transformers import StoppingCriteria, PreTrainedTokenizer
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
def resize_and_center_crop(image: Image.Image, shortest_edge_length: int) -> Image.Image:
r"""
Resize the given image such that its shortest edge matches the specified length,
and then center crop it to a square of the same size.
Args:
- image (`Image.Image`): The input image to be resized and cropped.
- shortest_edge_length (`int`): The length of the shortest edge after resizing.
Returns:
`Image.Image`: The resized and center-cropped image.
"""
# Calculate new dimensions and resize
aspect_ratio = float(image.width) / float(image.height)
if (aspect_ratio > 1):
new_width = int(shortest_edge_length * aspect_ratio)
new_height = shortest_edge_length
else:
new_width = shortest_edge_length
new_height = int(shortest_edge_length / aspect_ratio)
resized_image = image.resize((new_width, new_height), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
# Calculate the position and perform the center crop
left = (new_width - shortest_edge_length) / 2
top = (new_height - shortest_edge_length) / 2
right = (new_width + shortest_edge_length) / 2
bottom = (new_height + shortest_edge_length) / 2
cropped_image = resized_image.crop((left, top, right, bottom))
return cropped_image
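The geometry above reduces to a small calculation; a standalone arithmetic sketch (no PIL; the helper name `center_crop_box` is ours):

```python
def center_crop_box(width, height, edge):
    # Resize so the shortest side equals `edge`, then take the centered
    # edge-by-edge square -- the same box resize_and_center_crop crops.
    aspect = width / height
    if aspect > 1:
        new_w, new_h = int(edge * aspect), edge
    else:
        new_w, new_h = edge, int(edge / aspect)
    left, top = (new_w - edge) / 2, (new_h - edge) / 2
    return (new_w, new_h), (left, top, left + edge, top + edge)
```

For a 1024x512 input cropped at 256, the image is first resized to 512x256 and the crop box is the centered 256x256 square.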
def auto_pad_images(image: Image.Image, grid_params: list) -> Image.Image:
r"""
Automatically pads an input image to match the closest aspect ratio from a list of grid parameters.
Args:
- image (`Image.Image`): The input image to be padded. Must be a Pillow Image object.
- grid_params (`list`): A list of integers representing the grid parameters to determine the target aspect ratio.
Returns:
`Image.Image`: The padded image with the closest aspect ratio from the grid parameters.
Raises:
`AssertionError`: If the input is not a Pillow Image object or if the grid parameters list is empty.
"""
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert len(grid_params) > 0, "Grid parameters should not be empty"
# Step 1: Calculate and find the closest aspect ratio
input_width, input_height = image.size
input_aspect_ratio = input_width / input_height
candidate_resolutions = [(w / h, w, h) for w in grid_params for h in grid_params]
closest_aspect_ratio = min(candidate_resolutions, key=lambda x: abs(input_aspect_ratio - x[0]))
candidate_resolutions = [(x[1], x[2]) for x in candidate_resolutions if abs(x[0] - closest_aspect_ratio[0]) < 1e-3]
target_resolution = min(candidate_resolutions, key=lambda res: abs(max(input_width, input_height) / max(res) - 1))
resize_width, resize_height = target_resolution
if input_width > input_height:
resize_height = int(resize_width / input_aspect_ratio)
else:
resize_width = int(resize_height * input_aspect_ratio)
resized_image = image.resize((resize_width, resize_height), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
# Step 5: Pad the resized image if necessary to match the target resolution
pad_width = target_resolution[0] - resize_width
pad_height = target_resolution[1] - resize_height
padded_image = Image.new("RGB", target_resolution, color=(0, 0, 0))
padded_image.paste(resized_image, (pad_width // 2, pad_height // 2))
return padded_image
def extract_patches(image: Image.Image, patch_size: int, overlap_ratio: float) -> List[Image.Image]:
r"""
Extracts patches from a given image with specified patch size and overlap ratio.
Args:
- image (`Image.Image`): The input image from which patches are to be extracted. Must be a Pillow Image.
- patch_size (`int`): The size of each patch (both width and height). Must be greater than 0.
- overlap_ratio (`float`): The ratio of overlap between adjacent patches. Must be between 0 and 1 (exclusive).
Returns:
`List[Image.Image]`: A list of extracted patches as Pillow Images.
Raises:
`AssertionError`: If the input image is not a Pillow Image.
`AssertionError`: If the patch size is not greater than 0.
`AssertionError`: If the overlap ratio is not between 0 and 1.
"""
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert patch_size > 0, "Patch size should be greater than 0"
assert 0 <= overlap_ratio < 1, "Overlap ratio should be between 0 and 1"
W, H = image.size
patches = []
stride = int(patch_size * (1 - overlap_ratio))
num_patches_y = (H - patch_size) // stride + 1
num_patches_x = (W - patch_size) // stride + 1
y_start = (H - (num_patches_y - 1) * stride - patch_size) // 2
x_start = (W - (num_patches_x - 1) * stride - patch_size) // 2
for y in range(y_start, y_start + num_patches_y * stride, stride):
for x in range(x_start, x_start + num_patches_x * stride, stride):
patch = image.crop((x, y, x + patch_size, y + patch_size))
patches.append(patch)
return patches
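The stride arithmetic above determines where each patch starts; restated standalone for illustration (pure arithmetic, our helper name):

```python
def patch_origins(width, height, patch_size, overlap_ratio):
    # Same placement rule as extract_patches: fixed stride, with the
    # grid of patches centered inside the image.
    stride = int(patch_size * (1 - overlap_ratio))
    ny = (height - patch_size) // stride + 1
    nx = (width - patch_size) // stride + 1
    y0 = (height - (ny - 1) * stride - patch_size) // 2
    x0 = (width - (nx - 1) * stride - patch_size) // 2
    return [(x, y)
            for y in range(y0, y0 + ny * stride, stride)
            for x in range(x0, x0 + nx * stride, stride)]
```

With no overlap, a 672x672 image and 336-pixel patches yield a 2x2 grid of origins.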
def process_highres_image_crop_split(image: Image.Image, data_args, processor=None) -> torch.Tensor:
"""
Process a high-resolution image by cropping and splitting it into patches.
Args:
- image (`PIL.Image.Image`): The input image to be processed.
- data_args: The data arguments containing crop and split resolutions.
- processor: The image processor object. If None, it will be taken from data_args.
Returns:
`torch.Tensor`: A tensor containing the processed image patches.
"""
crop_resolution = data_args.image_crop_resolution
split_resolution = data_args.image_split_resolution
if processor is None:
processor = data_args.image_processor
image_crop = resize_and_center_crop(image, crop_resolution)
image_patches = extract_patches(image_crop, patch_size=split_resolution, overlap_ratio=0)
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def process_highres_image(image: Image.Image, processor, grid_pinpoints: str) -> torch.Tensor:
r"""
Processes a high-resolution image by resizing, padding, and extracting patches.
Args:
- image (`Image.Image`): The input image to be processed.
- processor: An object that contains image processing parameters and methods.
- grid_pinpoints (`str`): A comma-separated string of grid sizes to consider for resizing.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
grid_params = [int(x) for x in grid_pinpoints.split(",")]
width_height = max(image.size)
fit_grid_params = [x for x in grid_params if x >= width_height]
if len(fit_grid_params) == 0:
select_size = max(grid_params)
else:
select_size = min(fit_grid_params)
# FIXME: always select the 448
select_size = max(grid_params)
image_padded = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
# FIXME: this seems to be a bug that it always resizes instead of padding
image_original_resize = image.resize((processor.size["shortest_edge"], processor.size["shortest_edge"]))
image_padded = image_padded.resize((select_size, select_size))
image_patches = extract_patches(image_padded, patch_size=processor.size["shortest_edge"], overlap_ratio=0)
image_patches = [image_original_resize] + image_patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def select_best_resolution(original_size: tuple, possible_resolutions: List[Tuple[int, int]]) -> tuple:
"""
Selects the best resolution from a list of possible resolutions based on the original size.
Args:
- original_size (`tuple`): The original size of the image in the format (width, height).
- possible_resolutions (`List[Tuple[int, int]]`): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
Returns:
`tuple`: The best fit resolution in the format (width, height).
"""
original_width, original_height = original_size
best_fit = None
max_effective_resolution = 0
min_wasted_resolution = float("inf")
for width, height in possible_resolutions:
# Calculate the downscaled size to keep the aspect ratio
scale = min(width / original_width, height / original_height)
downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
# Calculate effective and wasted resolutions
effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
wasted_resolution = (width * height) - effective_resolution
if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
max_effective_resolution = effective_resolution
min_wasted_resolution = wasted_resolution
best_fit = (width, height)
return best_fit
def resize_and_pad_image(image: Image.Image, target_resolution: tuple) -> Image.Image:
r"""
Resize and pad an image to a target resolution while maintaining aspect ratio.
Args:
- image (`Image.Image`): The input image.
- target_resolution (`tuple`): The target resolution (width, height) of the image.
Returns:
`Image.Image`: The resized and padded image.
"""
original_width, original_height = image.size
target_width, target_height = target_resolution
# Determine which dimension (width or height) to fill
scale_w = target_width / original_width
scale_h = target_height / original_height
if scale_w < scale_h:
# Width will be filled completely
new_width = target_width
new_height = min(math.ceil(original_height * scale_w), target_height)
else:
# Height will be filled completely
new_height = target_height
new_width = min(math.ceil(original_width * scale_h), target_width)
# Resize the image
resized_image = image.resize((new_width, new_height))
# Create a new image with the target size and paste the resized image onto it
new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0))
paste_x = (target_width - new_width) // 2
paste_y = (target_height - new_height) // 2
new_image.paste(resized_image, (paste_x, paste_y))
return new_image
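The scale-then-center step can be checked numerically; a sketch with an exact-arithmetic example (helper name is ours):

```python
import math

def fit_and_offset(original, target):
    # Mirrors resize_and_pad_image: scale to fit inside `target` while
    # keeping aspect ratio, then center; returns the resized size and
    # the top-left paste position.
    ow, oh = original
    tw, th = target
    scale_w, scale_h = tw / ow, th / oh
    if scale_w < scale_h:
        new_w, new_h = tw, min(math.ceil(oh * scale_w), th)
    else:
        new_w, new_h = min(math.ceil(ow * scale_h), tw), th
    return (new_w, new_h), ((tw - new_w) // 2, (th - new_h) // 2)
```

A 336x168 image fitted into 672x672 is resized to 672x336 and pasted at (0, 168), i.e. vertically centered.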
def divide_to_patches(image: Image.Image, patch_size: int) -> list:
"""
Divides an image into patches of a specified size.
Args:
- image (`Image.Image`): The input image.
- patch_size (`int`): The size of each patch.
Returns:
`list`: A list of Image.Image objects representing the patches.
"""
patches = []
width, height = image.size
for i in range(0, height, patch_size):
for j in range(0, width, patch_size):
box = (j, i, j + patch_size, i + patch_size)
patch = image.crop(box)
patches.append(patch)
return patches
def get_anyres_image_grid_shape(image_size: Tuple[int, int], grid_pinpoints: Union[str, list], patch_size: int) -> Tuple[int, int]:
r"""
Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
Args:
- image_size (`tuple`): The size of the input image in the format (width, height).
- grid_pinpoints (`str` or `list`): A string representation of a list of possible resolutions.
- patch_size (`int`): The size of each image patch.
Returns:
`tuple`: The shape of the image patch grid in the format (width, height).
"""
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
width, height = select_best_resolution(image_size, possible_resolutions)
return width // patch_size, height // patch_size
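Putting the selection rule together with the grid computation, using the `image_grid_pinpoints` from `config.json` above: an 800x600 input maps to the 672x672 candidate, i.e. a 2x2 patch grid at patch size 336. A standalone restatement for illustration (not an import of the code above):

```python
def best_resolution(original, candidates):
    # Same rule as select_best_resolution: maximize effective
    # (non-upscaled) pixels, then minimize wasted pixels.
    ow, oh = original
    best, best_eff, best_waste = None, 0, float("inf")
    for w, h in candidates:
        scale = min(w / ow, h / oh)
        eff = min(int(ow * scale) * int(oh * scale), ow * oh)
        waste = w * h - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (w, h), eff, waste
    return best

# image_grid_pinpoints from config.json
PINPOINTS = [(336, 336), (336, 672), (336, 1008), (336, 1344), (336, 1680),
             (672, 336), (672, 672), (1008, 336), (1344, 336), (1680, 336)]

res = best_resolution((800, 600), PINPOINTS)
grid = (res[0] // 336, res[1] // 336)  # as in get_anyres_image_grid_shape
```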
def process_anyres_image(image: Image.Image, processor: Any, grid_pinpoints: Union[str, List[Tuple[int, int]]]) -> torch.Tensor:
r"""
Process an image with variable resolutions.
Args:
- image (`Image.Image`): The input image to be processed.
- processor: The image processor object.
- grid_pinpoints (`str`): A string representation of a list of possible resolutions.
Returns:
`torch.Tensor`: A tensor containing the processed image patches.
"""
# Convert grid_pinpoints from string to list
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions)
image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image_padded, processor.crop_size["height"])
# FIXME: this seems to be a bug that it resizes instead of pads.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
image_patches = torch.stack(image_patches, dim=0)
return image_patches
def load_image_from_base64(image):
return Image.open(BytesIO(base64.b64decode(image)))
def expand2square(pil_img: Image.Image, background_color: tuple) -> Image.Image:
r"""
Expands a given PIL image to a square by adding a background color.
Args:
- pil_img (`Image.Image`): The input PIL image to be expanded.
- background_color (`tuple`): The background color to use for expansion, specified as an RGB tuple.
Returns:
`Image.Image`: The expanded square PIL image.
"""
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
def process_images(images: List[Image.Image], image_processor: Any, model_cfg: Any) -> Union[torch.Tensor, List[torch.Tensor]]:
r"""
Processes a list of images based on the specified model configuration.
Args:
- images (`list`): A list of images to be processed.
- image_processor (`ImageProcessor`): An instance of the image processor to be used.
- model_cfg (`ModelConfig`): Configuration object containing model settings.
Returns:
`torch.Tensor` or list: Processed images as a tensor if all images have the same shape,
otherwise a list of processed images.
"""
image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", "")  # "" rather than None, so the membership test below cannot raise
new_images = []
if image_aspect_ratio == "highres":
for image in images:
image = process_highres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio == "anyres" or "anyres_max" in image_aspect_ratio:
for image in images:
image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio == "crop_split":
for image in images:
image = process_highres_image_crop_split(image, model_cfg, image_processor)
new_images.append(image)
elif image_aspect_ratio == "pad":
for image in images:
image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
new_images.append(image)
else:
return image_processor.preprocess(images, return_tensors="pt")["pixel_values"]
if all(x.shape == new_images[0].shape for x in new_images):
new_images = torch.stack(new_images, dim=0)
return new_images
def tokenizer_image_token(prompt: str, tokenizer: PreTrainedTokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None)->Union[torch.Tensor, List[torch.Tensor]]:
r"""
Tokenizes a prompt containing image tokens and inserts the specified image token index at the appropriate positions.
Args:
- prompt (str): The input prompt string containing text and "<image>" placeholders.
- tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenizing the text chunks.
- image_token_index (int): The token index to use for the image placeholders. Default is IMAGE_TOKEN_INDEX.
- return_tensors (str, optional): The type of tensor to return. If "pt", returns a PyTorch tensor. Default is None.
Returns:
list or torch.Tensor: The tokenized input IDs as a list or a PyTorch tensor if return_tensors is specified.
"""
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
# FIXME: prompt_chunks = [tokenizer(chunk, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True).input_ids for chunk in prompt.split("<image>")]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == "pt":
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f"Unsupported tensor type: {return_tensors}")
return input_ids
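The splicing logic above is easiest to see with a toy tokenizer. This sketch reproduces the function's core (the token ids and the tokenizer are hypothetical; only the BOS/offset handling and the `insert_separator` trick come from the code above — the separator has length `offset + 1` so that after dropping the first `offset` ids, exactly one image token remains between chunks):

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder id

class _Enc:
    def __init__(self, ids):
        self.input_ids = ids

class ToyTokenizer:
    bos_token_id = 0
    def __call__(self, text):
        # toy scheme: BOS followed by one id (1) per whitespace word
        return _Enc([self.bos_token_id] + [1] * len(text.split()))

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
    input_ids, offset = [], 0
    if prompt_chunks and prompt_chunks[0] and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    return input_ids

print(tokenizer_image_token("hi <image> there", ToyTokenizer()))
# [0, 1, -200, 1]  -- BOS kept once, one image token spliced in
```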
def get_model_name_from_path(model_path: str)->str:
model_path = model_path.strip("/")
model_paths = model_path.split("/")
if model_paths[-1].startswith("checkpoint-"):
return model_paths[-2] + "_" + model_paths[-1]
else:
return model_paths[-1]
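The path parsing above folds a checkpoint directory into its parent's name; a quick standalone sketch (the example paths are hypothetical):

```python
def get_model_name_from_path(model_path: str) -> str:
    model_path = model_path.strip("/")
    parts = model_path.split("/")
    if parts[-1].startswith("checkpoint-"):
        return parts[-2] + "_" + parts[-1]
    return parts[-1]

print(get_model_name_from_path("ckpts/instella-vl-1b/checkpoint-500"))
# instella-vl-1b_checkpoint-500
print(get_model_name_from_path("amd/Instella-VL-1B/"))
# Instella-VL-1B
```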
class KeywordsStoppingCriteria(StoppingCriteria):
def __init__(self, keywords, tokenizer, input_ids):
self.keywords = keywords
self.keyword_ids = []
for keyword in keywords:
cur_keyword_ids = tokenizer(keyword).input_ids
if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
cur_keyword_ids = cur_keyword_ids[1:]
self.keyword_ids.append(torch.tensor(cur_keyword_ids))
self.tokenizer = tokenizer
self.start_len = input_ids.shape[1]
def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
assert output_ids.shape[0] == 1, "Only batch size 1 is supported (for now)"  # TODO
offset = min(output_ids.shape[1] - self.start_len, 3)
self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
for keyword_id in self.keyword_ids:
if (output_ids[0, -keyword_id.shape[0] :] == keyword_id).all():
return True
outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
for keyword in self.keywords:
if keyword in outputs:
return True
return False
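The per-keyword check compares the last `len(keyword)` generated ids against the keyword ids; note the elementwise `==` must be reduced with `.all()` to get a boolean. A small sketch with assumed toy ids:

```python
import torch

output_ids = torch.tensor([[11, 42, 7, 9]])   # generated so far
keyword_id = torch.tensor([7, 9])             # stop keyword's ids
matches = bool((output_ids[0, -keyword_id.shape[0]:] == keyword_id).all())
print(matches)  # True
```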
def rank0_print(*args):
if dist.is_initialized():
if dist.get_rank() == 0:
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)

BIN
model.safetensors (Stored with Git LFS) Normal file

Binary file not shown.

2463
modeling_instellavl.py Normal file

File diff suppressed because it is too large Load Diff

7
preprocessor_config.json Normal file

@@ -0,0 +1,7 @@
{
"auto_map": {
"AutoImageProcessor": "image_processing_instellavl.InstellaVLImageProcessor",
"AutoProcessor": "processing_instellavl.InstellaVLProcessor"
},
"processor_class": "InstellaVLProcessor"
}

212
processing_instellavl.py Normal file

@@ -0,0 +1,212 @@
from PIL import ImageOps
from PIL.Image import Image
import torch
from typing import Union, List
from tqdm import tqdm
from transformers.image_utils import ImageInput
from transformers.tokenization_utils_base import TextInput
from transformers import CLIPImageProcessor
from transformers.processing_utils import (
ProcessorMixin,
)
from transformers import AutoTokenizer, PreTrainedTokenizer
from .image_processing_instellavl import InstellaVLImageProcessor
from .mm_utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX, KeywordsStoppingCriteria
from .conversation import conv_templates
def tokenizer_image_token(prompt: str, tokenizer: PreTrainedTokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None)->Union[torch.Tensor, List[torch.Tensor]]:
r"""
Tokenizes a prompt containing image tokens and inserts the specified image token index at the appropriate positions.
Args:
- prompt (str): The input prompt string containing text and DEFAULT_IMAGE_TOKEN="<image>" placeholders.
- tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenizing the text chunks.
- image_token_index (int): The token index to use for the image placeholders. Default is IMAGE_TOKEN_INDEX.
- return_tensors (str, optional): The type of tensor to return. If "pt", returns a PyTorch tensor. Default is None.
Returns:
list or torch.Tensor: The tokenized input IDs as a list or a PyTorch tensor if return_tensors is specified.
"""
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == "pt":
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f"Unsupported tensor type: {return_tensors}")
return input_ids
class InstellaVLProcessor(ProcessorMixin):
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "GPTNeoXTokenizerFast"
def __init__(self, image_processor: InstellaVLImageProcessor = None, tokenizer: AutoTokenizer = None, **kwargs):
super().__init__(image_processor, tokenizer, **kwargs)
def pad_sequence(self, input_ids: Union[List[torch.Tensor], List[List[torch.Tensor]]], batch_first: bool, padding_value: int, tokenizer: AutoTokenizer):
if tokenizer.padding_side == "left":
input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value)
if tokenizer.padding_side == "left":
input_ids = torch.flip(input_ids, [1])
return input_ids
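`pad_sequence`'s left-padding path relies on a flip/pad/flip trick, since `torch.nn.utils.rnn.pad_sequence` only right-pads. A sketch with assumed toy sequences:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
# flip each sequence, right-pad the batch, then flip back
flipped = [torch.flip(s, [0]) for s in seqs]
padded = pad_sequence(flipped, batch_first=True, padding_value=0)
left_padded = torch.flip(padded, [1])
print(left_padded.tolist())  # [[1, 2, 3], [0, 4, 5]]
```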
def encode(self,
text: TextInput = None,
images: ImageInput = None,
image_processor: CLIPImageProcessor = None,
tokenizer: AutoTokenizer = None,
model_cfg: dict = None,
) -> dict:
if images is not None:
if isinstance(images, Image):
# Handle images with EXIF orientation tags, which PIL will ignore by default
# https://github.com/python-pillow/Pillow/issues/4703
ImageOps.exif_transpose(images, in_place=True)
image_sizes = [images.size]
images = [images]
elif isinstance(images, list):
image_sizes = []
for i in images:
ImageOps.exif_transpose(i, in_place=True)
image_sizes.append(i.size)
image_tensor = self.image_processor.process(images, image_processor, model_cfg)['pixel_values']
text = text.replace(DEFAULT_IMAGE_TOKEN, "").strip()
if images is not None and len(image_tensor) != 0 and DEFAULT_IMAGE_TOKEN not in text:
question = DEFAULT_IMAGE_TOKEN + "\n" + text
else:
question = text
conv = conv_templates["instella"].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
keywords = [conv.sep]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("|||IP_ADDRESS|||")]
out = {
"input_ids": input_ids,
"stopping_criteria": [stopping_criteria],
"eos_token_id": terminators,
}
if images is not None:
out = {
"image_tensor": image_tensor,
"image_sizes": image_sizes,
**out,
}
self.tokenizer = tokenizer
return out
def batch_encode(self,
texts: List[TextInput] = None,
images: List[ImageInput] = None,
image_processor: CLIPImageProcessor = None,
tokenizer: AutoTokenizer = None,
model_cfg: dict = None,
):
if texts is None:
raise ValueError("Text must be provided for batch encoding.")
if images is None:
images = [None] * len(texts)
assert isinstance(texts, list), "For batch encoding, provide the texts as a list."
assert len(texts) == len(images), "The number of texts and images must be equal."
batch_outs = []
for txt, img in tqdm(zip(texts, images), total=len(texts), desc="Total Samples to encode"):
batch_outs.append(self.encode(txt, img, image_processor, tokenizer, model_cfg))
return batch_outs
# batched_image_tensors = []
# batched_text_tokens = []
# stopping_criterias = []
# image_sizes = []
# for t, img in tqdm(zip(text, images), desc="Total Samples to encode"):
# if img is not None:
# if isinstance(img, Image):
# ImageOps.exif_transpose(img, in_place=True)
# image_sizes.append(img.size)
# img = [img]
# elif isinstance(img, list):
# tmp_img_sizes = []
# for i in img:
# ImageOps.exif_transpose(i, in_place=True)
# tmp_img_sizes.append(i.size)
# image_sizes.append(tmp_img_sizes)
# batched_image_tensors.append(self.image_processor.process(img, image_processor, model_cfg)['pixel_values'].squeeze(0))
# t = t.replace(DEFAULT_IMAGE_TOKEN, "").strip()
# if img is not None and len(batched_image_tensors[-1]) != 0 and DEFAULT_IMAGE_TOKEN not in t:
# question = DEFAULT_IMAGE_TOKEN + "\n" + t
# else:
# question = t
# conv = conv_templates["instella"].copy()
# conv.append_message(conv.roles[0], question)
# conv.append_message(conv.roles[1], None)
# prompt_question = conv.get_prompt()
# input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
# stopping_criterias.append(KeywordsStoppingCriteria([conv.sep], tokenizer, input_ids.unsqueeze(0)))
# batched_text_tokens.append(input_ids)
# terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("|||IP_ADDRESS|||")]
# # Pad the text tokens.
# pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
# input_ids = self.pad_sequence(batched_text_tokens, batch_first=True, padding_value=pad_token_ids, tokenizer=tokenizer)
# attention_masks = input_ids.ne(pad_token_ids)
# batch_outs = {
# "input_ids": input_ids,
# "attention_mask": attention_masks,
# "pad_token_id": pad_token_ids,
# "stopping_criteria": stopping_criterias,
# "eos_token_id": terminators,
# }
# if images is not None:
# batch_outs = {
# "image_tensor": batched_image_tensors,
# "image_sizes": image_sizes,
# **batch_outs
# }
# self.tokenizer = tokenizer
# return batch_outs
def decode(self, output_ids: torch.Tensor)->str:
return self.tokenizer.decode(output_ids[0, :], skip_special_tokens=True).strip()
def batch_decode(self, output_ids_lst: List[torch.Tensor])->List[str]:
raise NotImplementedError("Batch decode is not implemented for InstellaVLProcessor")
# text_decoded_outs = []
# for out_ids in output_ids_lst:
# text_decoded_outs.append(self.decode(out_ids))
# return text_decoded_outs
InstellaVLProcessor.register_for_auto_class()

6
processor_config.json Normal file

@@ -0,0 +1,6 @@
{
"auto_map": {
"AutoProcessor": "processing_instellavl.InstellaVLProcessor"
},
"processor_class": "InstellaVLProcessor"
}

16
special_tokens_map.json Normal file

@@ -0,0 +1,16 @@
{
"eos_token": {
"content": "|||IP_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|padding|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

250622
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

255
tokenizer_config.json Normal file

@@ -0,0 +1,255 @@
{
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|padding|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50254": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50255": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50256": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50257": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50258": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50259": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50260": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50261": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50262": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50263": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50264": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50265": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50266": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50267": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50268": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50269": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50270": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50271": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50272": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50273": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50274": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50275": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50276": {
"content": " ",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50277": {
"content": "|||EMAIL_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50278": {
"content": "|||PHONE_NUMBER|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": false
},
"50279": {
"content": "|||IP_ADDRESS|||",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
},
"50280": {
"content": "<point>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"50281": {
"content": "</point>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": null,
"clean_up_tokenization_spaces": true,
"eos_token": "|||IP_ADDRESS|||",
"chat_template": "|||IP_ADDRESS|||\n{% for message in messages -%}{{ message['role'] + message['content']}}{%- if not loop.last -%}{{ '\\n' if loop.index % 2 == 1 else '|||IP_ADDRESS|||\\n'}}{%- endif %}{%- endfor -%}",
"model_max_length": 32768,
"pad_token": "<|padding|>",
"tokenizer_class": "GPTNeoXTokenizer",
"unk_token": null
}