mPLUG-Owl3-2B-241014

license: apache-2.0
language: en
pipeline_tag: visual-question-answering
tags: chat

mPLUG-Owl3

Introduction

mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenges of long image-sequence understanding. We propose Hyper Attention, which speeds up long visual-sequence understanding in multimodal large language models by a factor of six and allows processing of visual sequences eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.

GitHub: mPLUG-Owl (https://github.com/X-PLUG/mPLUG-Owl)

Quickstart

Load mPLUG-Owl3. Only attn_implementation values in ['sdpa', 'flash_attention_2'] are currently supported.



import torch
from modelscope import AutoConfig, AutoModel

model_path = 'iic/mPLUG-Owl3-2B-241014'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the model in half precision with the SDPA attention backend.
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
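
If FlashAttention-2 (the flash-attn package) is installed and the GPU supports it, the same call works with the other supported backend. This is a minimal variant of the snippet above, not an additional requirement.

# Optional: use the FlashAttention-2 backend instead of SDPA.
model = AutoModel.from_pretrained(
    model_path,
    attn_implementation='flash_attention_2',
    torch_dtype=torch.half,
    trust_remote_code=True,
)
model.eval().cuda()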

Chat with images.

from PIL import Image

from modelscope import AutoTokenizer
model_path = 'iic/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')  # a solid-red placeholder; use Image.open(...) for a real image

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)
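
A follow-up turn can reuse the same processor by appending the model's reply to the message history. This is a hedged sketch: it assumes the processor accepts multi-turn histories in the same format and that g is a list of decoded strings (decode_text=True); check processing_mplugowl3.py in the repository if either assumption does not hold.

# Hypothetical follow-up turn reusing the history; the question below is illustrative.
followup_messages = [
    {"role": "user", "content": "<|image|>\nDescribe this image."},
    {"role": "assistant", "content": g[0]},   # previous reply, assuming g is a list of strings
    {"role": "user", "content": "What is the dominant color?"},
    {"role": "assistant", "content": ""},     # empty slot for the new reply
]

inputs = processor(followup_messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({'tokenizer': tokenizer, 'max_new_tokens': 100, 'decode_text': True})
print(model.generate(**inputs))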

Chat with a video.

from PIL import Image

from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
model_path = 'iic/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)


messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']  # replace with a path to a local video

MAX_NUM_FRAMES=16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n indices evenly spread across the list l.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps())  # sample roughly one frame per second
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)
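
If a fixed frame budget is preferred regardless of clip length or frame rate, a simpler sampler can replace encode_video. This is a sketch under the same decord/PIL assumptions; encode_video_fixed is an illustrative name, not part of the repository.

import numpy as np

# Hypothetical alternative: always return exactly MAX_NUM_FRAMES frames,
# uniformly spaced from the first to the last frame of the clip.
def encode_video_fixed(video_path, num_frames=MAX_NUM_FRAMES):
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, num=num_frames, dtype=int).tolist()
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

# Drop-in replacement for the sampler above:
# video_frames = [encode_video_fixed(path) for path in videos]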

Citation

If you find our work helpful, please consider citing it.

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}, 
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840}, 
}