MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

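📣 News

⏳⏳⏳ Training a stronger model at a higher image resolution (e.g. 768 × 768).

⏳⏳⏳ Training MoE-LLaVA-Qwen1.5 to better support Chinese.

[2024.03.16] 🎉 We released all stage2 models; check our model zoo.

[2024.02.03] 🎉 We released a stronger MoE-LLaVA-StableLM. With 2.0B sparsely activated parameters, its average performance is close to that of LLaVA-1.5-7B; check our model zoo.

[2024.02.02] 🤝 Enjoy the demos created by @camenduru, who generously supports our research!

[2024.02.01] 🔥 Those who cannot access HF can now download the models through ModelScope; check our model zoo.

[2024.01.30] 🔥 We released a stronger MoE-LLaVA-Phi2. With 3.6B sparsely activated parameters, its average performance surpasses that of LLaVA-1.5-7B; check our model zoo.

[2024.01.27] 🤗 The Hugging Face demo and all code and datasets are now available! Watch 👀 this repository for the latest updates.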

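😮 Highlights

MoE-LLaVA shows excellent performance in multi-modal learning.

🔥 High performance, but with fewer parameters

With just 3B sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on the object hallucination benchmark.

🚀 Simple baseline, learning multi-modal interactions with sparse pathways

By adding a simple MoE tuning stage, we can complete the training of MoE-LLaVA on 8 A100 GPUs within 1 day.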
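
To make "sparsely activated" concrete, here is a minimal, dependency-free sketch of top-k expert routing, matching the "×4-Top2" naming used for the models (4 experts, 2 active per token). It is a generic illustration, not the repository's implementation: the function names, the toy experts, and the choice to renormalize the top-k weights are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: the weights of the selected experts are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Four toy "experts": each one just scales the token by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token

print(top_k_route(logits, k=2))
print(moe_forward([1.0, 1.0], experts, logits))
```

Only two of the four experts run for each token, which is why the activated parameter count can be far smaller than the total parameter count.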

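🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following commands; it includes all features currently supported by MoE-LLaVA. We also provide an online demo in Hugging Face Spaces.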


# use phi2
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e"
# use qwen
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e"
# use stablelm
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e"


CLI inference


# use phi2
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e" --image-file "image.jpg"
# use qwen
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e" --image-file "image.jpg"
# use stablelm
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e" --image-file "image.jpg"
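
When scripting the CLI, it can help to build the command from a backbone name rather than copying it by hand. The sketch below only assembles the argument list shown above; the helper name `build_cli_command` and the `MODEL_PATHS` table are hypothetical conveniences, not part of the repository.

```python
import shlex

# Model paths copied from the commands above.
MODEL_PATHS = {
    "phi2": "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e",
    "qwen": "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e",
    "stablelm": "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e",
}


def build_cli_command(backbone, image_file, gpu=0):
    """Return the deepspeed CLI invocation as an argument list (hypothetical helper)."""
    if backbone not in MODEL_PATHS:
        raise ValueError(f"unknown backbone {backbone!r}; expected one of {sorted(MODEL_PATHS)}")
    return [
        "deepspeed", "--include", f"localhost:{gpu}",
        "moellava/serve/cli.py",
        "--model-path", MODEL_PATHS[backbone],
        "--image-file", image_file,
    ]


cmd = build_cli_command("qwen", "image.jpg")
print(shlex.join(cmd))
# To actually launch it from the repository root, one could pass the list to
# subprocess.run(cmd, check=True).
```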

🐳 Model Zoo

Model	Activated Param	Transformers(HF)	ModelScope(HF)	Avg	VQAv2	GQA	VizWiz	SQA-IMG	T-VQA	POPE	MME	MM-Bench	MM-Vet
MoE-LLaVA-1.6B×4-Top2	2.0B	🤗LanguageBind/MoE-LLaVA-StableLM-1.6B-4e	PKU-YuanLab/MoE-LLaVA-StableLM-1.6B-4e	57.3	76.7	60.3	36.2	62.6	50.1	85.7	1318.1	60.2	26.9
MoE-LLaVA-1.8B×4-Top2	2.2B	🤗LanguageBind/MoE-LLaVA-Qwen-1.8B-4e	PKU-YuanLab/MoE-LLaVA-Qwen-1.8B-4e	56.7	76.2	61.5	32.6	63.1	-	-	1291.6	59.6	25.3
MoE-LLaVA-2.7B×4-Top2	3.6B	🤗LanguageBind/MoE-LLaVA-Phi2-2.7B-4e	PKU-YuanLab/MoE-LLaVA-Phi2-2.7B-4e	61.1	77.6	61.4	43.9	68.5	51.4	86.3	1423	65.2	34.3
MoE-LLaVA-1.6B×4-Top2-384	2.0B	🤗LanguageBind/MoE-LLaVA-StableLM-1.6B-4e-384	PKU-YuanLab/MoE-LLaVA-StableLM-1.6B-4e-384	-	78.6	61.5	40.5	63.9	54.3	85.9	1335.7	63.3	32.3
MoE-LLaVA-2.7B×4-Top2-384	3.6B	🤗LanguageBind/MoE-LLaVA-Phi2-2.7B-4e-384	PKU-YuanLab/MoE-LLaVA-Phi2-2.7B-4e-384	62.9	79.9	62.6	43.7	70.3	-	85.7	1431.3	-	35.9
LLaVA-1.5	-	🤗liuhaotian/llava-v1.5-7b	-	-	78.5	-	-	66.8	58.2	85.9	1510.7	64.3	30.5
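
The source does not say how the Avg column is computed. For the two rows whose scores are all present, the reported value is reproduced by averaging the eight benchmarks other than MME. The check below rests on that inferred rule and uses a hypothetical helper name.

```python
# Scores copied from the table, in the order
# VQAv2, GQA, VizWiz, SQA-IMG, T-VQA, POPE, MM-Bench, MM-Vet (MME left out).
ROWS = {
    "MoE-LLaVA-1.6B×4-Top2": ([76.7, 60.3, 36.2, 62.6, 50.1, 85.7, 60.2, 26.9], 57.3),
    "MoE-LLaVA-2.7B×4-Top2": ([77.6, 61.4, 43.9, 68.5, 51.4, 86.3, 65.2, 34.3], 61.1),
}


def mean_excluding_mme(scores):
    """Average of the eight percentage-style benchmarks (assumed definition of Avg)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in ROWS.items():
    computed = round(mean_excluding_mme(scores), 1)
    print(f"{name}: computed {computed}, reported {reported}")
```

MME is scored on a different scale (values above 1000), which is presumably why it is left out of the average.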

Stage2 models

🚨 Please be aware of #27.

Model	Checkpoint
MoE-LLaVA-1.6B×4-Top2	LanguageBind/MoE-LLaVA-StableLM-Stage2
MoE-LLaVA-1.6B×4-Top2-384	LanguageBind/MoE-LLaVA-StableLM-Stage2-384
MoE-LLaVA-1.8B×4-Top2	LanguageBind/MoE-LLaVA-Qwen-Stage2
MoE-LLaVA-2.7B×4-Top2	LanguageBind/MoE-LLaVA-Phi2-Stage2
MoE-LLaVA-2.7B×4-Top2-384	LanguageBind/MoE-LLaVA-Phi2-Stage2-384

Pretrained models

Model	Checkpoint
MoE-LLaVA-1.6B×4-Top2	LanguageBind/MoE-LLaVA-StableLM-Pretrain
MoE-LLaVA-1.6B×4-Top2-384	LanguageBind/MoE-LLaVA-StableLM-384-Pretrain
MoE-LLaVA-1.8B×4-Top2	LanguageBind/MoE-LLaVA-Qwen-Pretrain
MoE-LLaVA-2.7B×4-Top2	LanguageBind/MoE-LLaVA-Phi2-Pretrain
MoE-LLaVA-2.7B×4-Top2-384	LanguageBind/MoE-LLaVA-Phi2-384-Pretrain
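
The Stage2 and Pretrain checkpoints place the resolution tag differently (`-Stage2-384` versus `-384-Pretrain`), which is easy to get wrong when typing names by hand. The sketch below encodes the naming seen in the two tables above; `checkpoint_repo` is a hypothetical helper, and it only formats names, it does not download anything.

```python
BACKBONES = {"StableLM", "Qwen", "Phi2"}
BACKBONES_WITH_384 = {"StableLM", "Phi2"}  # no 384 Qwen checkpoint is listed above


def checkpoint_repo(backbone, stage, res384=False):
    """Return the checkpoint name for a backbone and stage, following the tables above."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone {backbone!r}")
    if res384 and backbone not in BACKBONES_WITH_384:
        raise ValueError(f"no 384 checkpoint is listed for {backbone}")
    if stage == "Stage2":
        suffix = "Stage2-384" if res384 else "Stage2"
    elif stage == "Pretrain":
        suffix = "384-Pretrain" if res384 else "Pretrain"
    else:
        raise ValueError(f"unknown stage {stage!r}; expected 'Stage2' or 'Pretrain'")
    return f"LanguageBind/MoE-LLaVA-{backbone}-{suffix}"


print(checkpoint_repo("Phi2", "Stage2", res384=True))
print(checkpoint_repo("StableLM", "Pretrain", res384=True))
```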

⚙️ Requirements and Installation

We recommend the following requirements.

Python == 3.10

Pytorch == 2.0.1

CUDA Version >= 11.7

Transformers == 4.37.0

Tokenizers == 0.15.1

Install the required packages:


git clone https://github.com/PKU-YuanGroup/MoE-LLaVA
cd MoE-LLaVA
conda create -n moellava python=3.10 -y
conda activate moellava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

# The steps below are optional and only needed for the Qwen model.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The steps below are also optional. Installing them might be slow.
# pip install csrc/layer_norm
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary

Warning

🚨 We find that using flash attention2 causes performance degradation.
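
After installation, a quick way to compare the environment against the recommended versions is to query the installed package metadata. This is a minimal sketch using the standard library's `importlib.metadata`; the `RECOMMENDED` table and `check_versions` helper are hypothetical, and stripping local build tags such as `+cu117` before comparing is an assumption about how PyTorch wheels are labeled on your system.

```python
from importlib import metadata

# Recommended versions from the list above, keyed by pip distribution name.
RECOMMENDED = {"torch": "2.0.1", "transformers": "4.37.0", "tokenizers": "0.15.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(recommended, lookup=installed_version):
    """Return {package: (recommended, found)} for every mismatch."""
    mismatches = {}
    for package, wanted in recommended.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu117" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in check_versions(RECOMMENDED).items():
    print(f"{package}: recommended {wanted}, found {found}")
```

An empty result means all three packages match the recommendation; the CUDA and Python versions still need to be checked separately.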

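🗝️ Training & Validating

The training and validation instructions are in TRAIN.md and EVAL.md.

💡 Customizing your MoE-LLaVA

The instructions are in CUSTOM.md.

😍 Visualization

The instructions are in VISUALIZATION.md.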

🤖 API

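We have open-sourced all the code. If you want to load a model (e.g. LanguageBind/MoE-LLaVA-Phi2-2.7B-4e) locally, you can use the code snippet below, saved as predict.py.

Run the code with the following command.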


deepspeed --include localhost:0 predict.py


import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'moellava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/MoE-LLaVA-Phi2-2.7B-4e'  # LanguageBind/MoE-LLaVA-Qwen-1.8B-4e or LanguageBind/MoE-LLaVA-StableLM-1.6B-4e
    device = 'cuda'
    load_4bit, load_8bit = False, False  # FIXME: Deepspeed support 4bit or 8bit?
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "phi"  # qwen or stablelm
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    # Preprocess the image into a half-precision tensor on the model's device.
    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)

    # Echo the question under the user role (roles[0]); roles[1] is the assistant.
    print(f"{roles[0]}: {inp}")
    # Prepend the image placeholder token, then build the conversation prompt.
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
    # Stop generating once the conversation separator appears.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    # Decode only the newly generated tokens, skipping the prompt.
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)

if __name__ == '__main__':
    main()
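
The snippet hard-codes `conv_mode = "phi"` and notes in a comment that `qwen` or `stablelm` should be used for the other backbones. When switching `model_path`, the mode has to change with it. The helper below is a hypothetical convenience that infers the mode from the model path by substring matching; it is not part of the moellava package.

```python
def infer_conv_mode(model_path):
    """Guess the conversation template name from a model path (hypothetical helper)."""
    name = model_path.lower()
    # Modes taken from the snippet's comment: "phi", "qwen" or "stablelm".
    for key in ("phi", "qwen", "stablelm"):
        if key in name:
            return key
    raise ValueError(f"cannot infer conv_mode from {model_path!r}")


for path in (
    "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e",
    "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e",
    "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e",
):
    print(path, "->", infer_conv_mode(path))
```

In the snippet, one could then write `conv_mode = infer_conv_mode(model_path)` instead of editing two lines by hand.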

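🙌 Related Projects

Video-LLaVA

This framework enables the model to make efficient use of united visual tokens.

LanguageBind

An open-source, language-based retrieval framework covering five modalities.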

👍 Acknowledgement

LLaVA is the codebase we built upon; it is an efficient large language and vision assistant.

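🔒 License

The majority of this project is released under the Apache 2.0 license, as found in the LICENSE file.

The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, the Terms of Use of the data generated by OpenAI, and the Privacy Practices of ShareGPT. Please contact us if you find any potential violation.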

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝.


@article{lin2024moe,
  title={MoE-LLaVA: Mixture of Experts for Large Vision-Language Models},
  author={Lin, Bin and Tang, Zhenyu and Ye, Yang and Cui, Jiaxi and Zhu, Bin and Jin, Peng and Zhang, Junwu and Ning, Munan and Yuan, Li},
  journal={arXiv preprint arXiv:2401.15947},
  year={2024}
}


@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}


MoE-LLaVA

PKU-YuanGroup • Updated Nov 10, 2024
