New AI System Accelerates Large Language Model Serving

A new artificial intelligence system called SpecInfer can significantly accelerate large language model inference, according to a recent paper from CMU, Shanghai Jiao Tong University, Peking University, and UCSD.

Large language models like GPT-3 and OPT-175B have shown impressive capabilities in generating natural language text. However, these models contain hundreds of billions of parameters, making inference computationally expensive and slow. For example, GPT-3 takes several seconds to generate text for a single prompt.

SpecInfer combines small speculative models and a novel tree-based parallel decoding algorithm to reduce the latency and computational costs of large language model inference.

Figure: Comparison of the incremental decoding approach used by existing LLM serving systems with the speculative inference and token tree verification approach used by SpecInfer.

The key ideas (illustrated in the sketch after this list) are:

  • Using multiple small models to jointly predict the large model’s outputs. Their predictions are organized into a tree structure called a token tree.
  • Verifying the correctness of all speculated tokens in the token tree in parallel against the large model using a novel tree-based decoding mechanism.
  • Reducing the number of large model decoding steps needed via speculative inference and batched token tree verification.
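To make the workflow concrete, here is a minimal, illustrative Python sketch of speculative inference with token tree verification. It is not SpecInfer's implementation: the "models" are toy stand-in functions, and the tree is verified by a sequential walk for clarity, whereas a real system would score all tree nodes against the large model in a single batched forward pass.

```python
# Toy sketch of speculative inference with token-tree verification.
# Assumption: each "model" is a function mapping a token sequence to the
# next token id; in practice these would be small draft LLMs and one large LLM.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    token: int
    children: dict = field(default_factory=dict)  # token id -> TreeNode

def build_token_tree(prefix, small_models, depth):
    """Merge greedy continuations from several small models into one token tree."""
    root = TreeNode(token=-1)  # dummy root standing for the current prefix
    for model in small_models:
        node, seq = root, list(prefix)
        for _ in range(depth):
            tok = model(seq)                         # small model speculates the next token
            node = node.children.setdefault(tok, TreeNode(tok))
            seq.append(tok)
    return root

def verify_token_tree(prefix, root, large_model):
    """Accept the longest root-to-leaf path the large model agrees with.

    A real system verifies every tree node in parallel with one large-model
    pass; this sequential walk just shows the acceptance logic.
    """
    accepted, node, seq = [], root, list(prefix)
    while node.children:
        target = large_model(seq)                    # token the large model would emit here
        if target in node.children:                  # speculation matches: accept and descend
            accepted.append(target)
            seq.append(target)
            node = node.children[target]
        else:                                        # mismatch: keep the large model's token, stop
            accepted.append(target)
            break
    return accepted

# Toy stand-ins over a 50-token vocabulary.
small_models = [lambda s: (s[-1] + 1) % 50, lambda s: (s[-1] + 2) % 50]
large_model = lambda s: (s[-1] + 1) % 50             # happens to agree with the first draft model

prefix = [3]
tree = build_token_tree(prefix, small_models, depth=4)
print(verify_token_tree(prefix, tree, large_model))  # several tokens accepted per large-model step
```

The point of the sketch is the acceptance loop: whenever the speculated tree contains the token the large model would have produced, that token is accepted without spending a separate large-model decoding step on it.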

The authors evaluated SpecInfer on large language models like OPT-175B and compared it against existing inference systems. For distributed inference across multiple GPUs, SpecInfer achieved 1.3-2.4x lower latency than current systems. For offloading-based inference on a single GPU, SpecInfer attained 2.6-3.5x speedups.

A key benefit of SpecInfer is reducing accesses to the large model’s parameters, which translates to less GPU memory bandwidth usage and lower energy consumption. SpecInfer also enables greater parallelism across tokens via its tree-based decoding, similar in spirit to the recently proposed “Skeleton-of-Thought” technique for parallel decoding with large language models.
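To see why fewer large-model decoding steps means less memory traffic, here is a rough back-of-envelope sketch. The acceptance rate and byte counts below are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope illustration (all numbers are assumptions, not from the paper):
# incremental decoding reads the full weight set once per generated token,
# while token tree verification reads it once per verification pass,
# which on average accepts several tokens.

params = 175e9                 # parameters in an OPT-175B-scale model
bytes_per_param = 2            # fp16 weights
tokens_to_generate = 128
avg_accepted_per_step = 3.0    # hypothetical average tokens accepted per verification pass

incremental_reads = tokens_to_generate * params * bytes_per_param
speculative_reads = (tokens_to_generate / avg_accepted_per_step) * params * bytes_per_param

print(f"incremental: {incremental_reads / 1e12:.1f} TB of weight traffic")
print(f"speculative: {speculative_reads / 1e12:.1f} TB of weight traffic")
```

Under these assumed numbers, weight traffic per generated sequence drops by roughly the average number of tokens accepted per verification pass.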

The techniques used in SpecInfer could have broad impacts on serving large language models where latency and cost are important. This includes production environments and more interactive applications. Looking ahead, speculative execution could be combined with other optimizations like model compression and quantization for further improvements. For example, the “LLaMA-SSP” method also uses smaller models to speed up large model inference. Exploring synergies with tree-based parallel decoding could inspire new ways of parallelizing transformer computations.
