LLMLingua: To speed up LLMs' inference and enhance LLM's perceive of key information

LLMLingua and its extension, LongLLMLingua, are tools that compress prompts for large language models (LLMs) such as GPT-3.5, GPT-4, and Llama. By reducing the number of tokens a prompt needs, they address two common constraints, token limits and high operating costs, and make inference more efficient.

LLMLingua can compress prompts by up to 20 times with minimal loss in performance, shortening both the prompt and the generated text. This compression lowers costs and leaves room for longer contexts, which is particularly useful where a lot of context must be passed to the model, such as in Retrieval-Augmented Generation (RAG), online meetings, Chain of Thought (CoT), and coding applications.
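To make the arithmetic concrete, here is a minimal sketch of what a given compression ratio means in tokens and cost. The function name, the 10,000-token prompt, and the price per 1,000 tokens are hypothetical placeholders chosen for illustration, not figures from the project or from any provider's price list.

```python
import math


def compression_savings(original_tokens, ratio, price_per_1k_tokens):
    """Estimate prompt size and cost saved at a given compression ratio.

    All inputs are illustrative assumptions: `ratio` is the compression
    factor (20 means "20 times smaller") and `price_per_1k_tokens` is a
    made-up input-token price.
    """
    if original_tokens < 0 or ratio < 1 or price_per_1k_tokens < 0:
        raise ValueError("expected original_tokens >= 0, ratio >= 1, price >= 0")
    # Round up: a prompt cannot contain a fraction of a token.
    compressed_tokens = math.ceil(original_tokens / ratio)
    tokens_saved = original_tokens - compressed_tokens
    cost_saved = tokens_saved / 1000 * price_per_1k_tokens
    return compressed_tokens, tokens_saved, cost_saved


# A hypothetical 10,000-token prompt at the "up to 20 times" ratio.
compressed, saved, cost = compression_savings(10_000, 20, 0.01)
print(f"{compressed} tokens remain, {saved} tokens saved, about {cost:.3f} saved per call")
```

At a ratio of 20, the compressed prompt is 5% of the original length, so the input-token cost of each call drops by roughly 95% under these assumptions.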

LongLLMLingua extends this approach to long-context scenarios. It addresses the “lost in the middle” problem, in which models make poorer use of information located in the middle of a long context than of information near its beginning or end. Using prompt compression, LongLLMLingua improves RAG performance by up to 21.4% while using only a quarter of the tokens.

The tools require no additional training of the target LLM, and they retain the essential information in the original prompt. They also support KV-Cache Compression to speed up inference.

For those interested in using these tools, LLMLingua can be installed via pip, and its PromptCompressor class is the entry point for compressing prompts. The project provides examples and documentation to help users apply the tools to real-world scenarios.
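As a rough illustration, the sketch below wraps the PromptCompressor class in a small helper. It assumes the package is installed with `pip install llmlingua`, that `compress_prompt` accepts `instruction`, `question`, and `target_token` arguments, and that it returns a dictionary with a `compressed_prompt` key, as the project's README describes at the time of writing. The helper name and its `compressor` parameter are hypothetical conveniences, so check the project documentation before relying on any of this.

```python
def compress_prompt_text(context, question="", target_token=200, compressor=None):
    """Return a compressed version of `context` as a string.

    `compressor` is a hypothetical injection point: pass an existing
    PromptCompressor to reuse it across calls, or leave it as None to
    create one with the library defaults.
    """
    if compressor is None:
        # Imported lazily so this helper can be defined without the package.
        # Assumption: creating a default PromptCompressor downloads a
        # language model on first use, which can take time and memory.
        from llmlingua import PromptCompressor

        compressor = PromptCompressor()
    result = compressor.compress_prompt(
        context,
        instruction="",
        question=question,
        target_token=target_token,
    )
    # Assumption: the result is a dict that also carries token statistics;
    # only the compressed text is returned here.
    return result["compressed_prompt"]


# Example call (not executed here, since it needs the package and a model):
# short_prompt = compress_prompt_text(long_document, question="What changed?")
```

Creating the compressor once and passing it in avoids reloading the underlying model on every call.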

The development of LLMLingua and LongLLMLingua is part of an ongoing effort to enhance the practicality and accessibility of LLMs, making them more viable for a wider range of applications while managing costs and computational resources. The research behind these tools has been documented in papers available on arXiv and presented at the EMNLP 2023 conference.
