MoE-LLaVA: Mixture-of-Experts for Large Vision-Language Models

MoE-LLaVA, a Mixture-of-Experts (MoE) model for large vision-language tasks, achieves strong performance while activating far fewer parameters than comparable dense models. With just 3 billion sparsely activated parameters, it rivals LLaVA-1.5-7B on a range of visual understanding benchmarks and even outperforms LLaVA-1.5-13B on object hallucination evaluation. This efficiency comes from a simple MoE tuning stage, which allows the model to be trained on 8 V100 GPUs in just two days.
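To make "sparsely activated parameters" concrete, here is a minimal sketch of a sparse Mixture-of-Experts feed-forward layer with top-k routing, the general mechanism by which only a few experts process each token. This is an illustrative implementation of the technique, not MoE-LLaVA's actual code; the expert count, top-k value, and dimensions are arbitrary placeholders.

```python
# Sketch of a sparse MoE feed-forward layer with top-k token routing.
# Illustrative only; not the MoE-LLaVA implementation. Hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens so each one is routed independently
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)                     # renormalize over the selected experts

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                     # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoEFFN(d_model=64, d_hidden=256)
    y = layer(torch.randn(2, 8, 64))
    print(y.shape)  # torch.Size([2, 8, 64])
```

Because each token passes through only its top-k experts, the total parameter count can grow with the number of experts while the compute and activated parameters per token stay small, which is the trade-off the 3B "sparsely activated" figure reflects.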

The model’s capabilities can be explored through a Gradio Web UI demo or command-line interface (CLI) inference, with the code and pretrained models available for download. A model zoo highlights MoE-LLaVA’s results, including average scores across multiple benchmarks such as VQAv2 and GQA.
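As a rough idea of what a Gradio-based demo involves, the sketch below wires an image-plus-question interface to a placeholder answer function. The `answer_fn` stub is hypothetical; a real demo would call a loaded MoE-LLaVA checkpoint (via the project's own serving code) instead of echoing the question.

```python
# Minimal Gradio sketch for an image + question demo; the model call is a hypothetical stub.
import gradio as gr
from PIL import Image


def answer_fn(image: Image.Image, question: str) -> str:
    # Placeholder: a real deployment would run MoE-LLaVA inference on (image, question).
    return f"(model answer to: {question!r})"


demo = gr.Interface(
    fn=answer_fn,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="MoE-LLaVA demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```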

For those interested in using or adapting MoE-LLaVA, the project provides comprehensive documentation on requirements, installation, training, validation, customization, and visualization. The project is open source, with all code available, and the majority of it is released under the Apache 2.0 license. It also acknowledges related frameworks such as Video-LLaVA and LanguageBind, which contribute to the field of multi-modal learning.
Read more at GitHub… and arXiv