---
tags:
- image-to-text
- image-captioning
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
---

# nlpconnect/vit-gpt2-image-captioning

This is an image captioning model trained by @ydshieh in Flax; this is the PyTorch version of that model. It pairs a ViT image encoder with a GPT-2 text decoder in a `VisionEncoderDecoderModel`.
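
For context, a `VisionEncoderDecoderModel` of this kind can be assembled from a separately pretrained vision encoder and language decoder. A minimal sketch, assuming the standard `google/vit-base-patch16-224-in21k` and `gpt2` checkpoints as building blocks (the released checkpoint was trained by @ydshieh and may use a different recipe):

```python
from transformers import VisionEncoderDecoderModel

# Combine a pretrained ViT encoder with a pretrained GPT-2 decoder.
# These checkpoint names are illustrative assumptions, not the exact
# training setup of nlpconnect/vit-gpt2-image-captioning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# Generation-related special tokens must be configured before the
# decoder can be used for caption generation.
model.config.decoder_start_token_id = model.config.decoder.bos_token_id
model.config.pad_token_id = model.config.decoder.eos_token_id
```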

# The Illustrated Image Captioning using transformers

# Sample running code

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

# Load the model, image processor, and tokenizer from the Hub.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The ViT processor expects RGB input.
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        images.append(i_image)

    # Preprocess the batch of images into pixel tensors.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate caption token ids with beam search.
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode the ids back into text and strip whitespace.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds


predict_step(['doctor.e16ba4e4.jpg'])  # ['a woman in a hospital bed with a woman in a hospital bed']
```
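
The example above reads images from local paths. A small variation, loading an image over HTTP instead, reusing the `feature_extractor`, `model`, `device`, and `gen_kwargs` defined above (the URL is the blog's example image, used purely for illustration):

```python
import requests
from io import BytesIO
from PIL import Image
import torch

# Fetch an image over HTTP and caption it with the objects defined above.
url = "https://ankur3107.github.io/assets/images/image-captioning-example.png"
image = Image.open(BytesIO(requests.get(url, timeout=10).content)).convert("RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values.to(device)
with torch.no_grad():  # inference only, no gradients needed
    output_ids = model.generate(pixel_values, **gen_kwargs)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```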

# Sample running code using transformers pipeline

```python
from transformers import pipeline

# The image-to-text pipeline wraps model loading, preprocessing, and decoding.
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")

# [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
```
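
Decoding options can also be passed through the pipeline. A minimal sketch, assuming the `generate_kwargs` argument of the image-to-text pipeline, mirroring the beam-search settings from the manual example above:

```python
from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Forward decoding options to model.generate(); these mirror the
# max_length=16 / num_beams=4 settings used earlier.
image_to_text(
    "https://ankur3107.github.io/assets/images/image-captioning-example.png",
    generate_kwargs={"max_length": 16, "num_beams": 4},
)
```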


# Contact for any help