GitHub – punica-ai/punica: Serving multiple LoRA finetuned LLM as one

Punica introduces an approach to efficiently serve many finetuned Large Language Models (LLMs) built with Low-Rank Adaptation (LoRA), a technique that adds only minimal storage and memory overhead on top of a pretrained model. Because each LoRA adapter consists of small matrices that modify the pretrained weights, Punica can run multiple LoRA finetuned models at roughly the computational cost of running just one.
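
For intuition, a LoRA adapter replaces a full finetuned weight update with a low-rank product, so only two small matrices are stored per model. A minimal NumPy sketch of that idea (the dimensions and rank here are illustrative, not Punica's):

```python
import numpy as np

# Illustrative sizes: a pretrained weight W and a rank-16 LoRA adapter.
d_out, d_in, rank = 1024, 1024, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((rank, d_in))    # LoRA down-projection
B = np.zeros((d_out, rank))              # LoRA up-projection (zero-initialized)

x = rng.standard_normal(d_in)

# Adapted forward pass: y = W x + B (A x).
# Only A and B are stored per finetuned model:
# rank * (d_in + d_out) = 32,768 values here,
# versus d_in * d_out = 1,048,576 for a full weight matrix.
y = W @ x + B @ (A @ x)
```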

The key to Punica's efficiency is its Segmented Gather Matrix-Vector multiplication (SGMV) CUDA kernel, which handles the extra computation the LoRA adapters introduce. SGMV preserves the strong batching effect: requests destined for different LoRA models can still be processed together in a single batch, keeping GPU utilization, and therefore throughput, high.
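
The semantics of SGMV can be sketched as a reference loop: the batch is grouped into contiguous segments, each segment belonging to a different LoRA adapter, and every segment is multiplied by its own gathered adapter matrix. A NumPy reference of those semantics (this illustrates what the kernel computes, not the CUDA implementation, and the names are hypothetical):

```python
import numpy as np

def sgmv_reference(x, weights, seg_starts):
    """Reference semantics of segmented gather matrix-vector multiply.

    x:          (batch, d_in) inputs, grouped so rows that belong to the
                same LoRA adapter are contiguous.
    weights:    list of per-adapter matrices, each (d_in, d_out).
    seg_starts: segment offsets into the batch, length len(weights) + 1.
    """
    y = np.zeros((x.shape[0], weights[0].shape[1]))
    for i, w in enumerate(weights):
        lo, hi = seg_starts[i], seg_starts[i + 1]
        y[lo:hi] = x[lo:hi] @ w   # one small matmul per adapter segment
    return y

# Three requests for adapter 0 and two for adapter 1, served in one batch.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 64))
weights = [rng.standard_normal((64, 16)) for _ in range(2)]
y = sgmv_reference(x, weights, seg_starts=[0, 3, 5])
```

The fused CUDA kernel does this gathering and multiplication in one pass, which is what lets a mixed batch of different LoRA models run nearly as fast as a uniform batch.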

In benchmarks, Punica significantly outperforms other systems, achieving up to 12 times the text-generation throughput of state-of-the-art alternatives. This makes it a scalable solution for serving many diverse LoRA finetuned LLMs simultaneously.

Punica can be installed from a prebuilt binary package for quick setup, or built from source for customization. The repository also includes examples for serving multiple LoRA models, finetuning, converting model weights to Punica's format, and benchmarking text-generation performance. The project's paper describes the multi-tenant LoRA serving design in more depth.
Read more at GitHub…
