---
pipeline_tag: image-to-text
tags:
- image-captioning
languages:
- en
license: bsd-3-clause
---
# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Model card for image captioning pretrained on the COCO dataset - base architecture (with ViT large backbone).
| ![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) |
| :----------------------------------------------------------: |
| **Figure pulled from the official BLIP repo** |
## Abstract
The authors write in the abstract of the [paper](https://arxiv.org/abs/2201.12086):
Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
## Usage
You can use this model for conditional and unconditional image captioning.
### Using the PyTorch model
#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
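The `generate()` call accepts the standard 🤗 Transformers generation keyword arguments, so caption length and decoding strategy can be tuned. A minimal sketch, reusing `model`, `processor`, and `raw_image` from the block above (the specific values are illustrative, not recommendations):
```python
# Beam-searched, longer caption; reuses model/processor/raw_image from above.
# max_new_tokens and num_beams are standard generate() kwargs; values are illustrative.
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, num_beams=4)
print(processor.decode(out[0], skip_special_tokens=True))
```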
#### Running the model on GPU
##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
```
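As an alternative to calling the processor and model directly, the high-level `pipeline` API wraps the same preprocessing, generation, and decoding steps. A minimal sketch (device placement and dtype can be set with the same options shown above):
```python
from transformers import pipeline

# The "image-to-text" pipeline bundles the BLIP processor and model shown above;
# it accepts an image URL, a local path, or a PIL image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
result = captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
print(result)  # [{'generated_text': '...'}]
```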
## BibTeX entry and citation info
```text
@misc{https://doi.org/10.48550/arxiv.2201.12086,
doi = {10.48550/ARXIV.2201.12086},
url = {https://arxiv.org/abs/2201.12086},
author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```