Not all Layers of LLMs are Necessary during Inference (Short Summary)

Statistical analysis suggests that not all layers of an LLM are actively used during inference.

TLDR - LLM inference is computationally expensive, which makes real-time applications difficult. Statistical analysis shows that not every layer of an LLM is actively used during inference. AdaInfer is a new algorithm that decides when to stop inference early, depending on the difficulty of the input. It does not change the LLM's parameters and is compatible with multiple tasks.

--> For video tutorials on top LLM papers, check the Kalyan KS YouTube channel

--> For top LLM papers of the week, check the newsletter.

The Problem:

  • Large Language Models (LLMs) are powerful for tasks like text generation and question answering, adapting well through in-context learning.

  • However, running them (inference) is computationally expensive due to their size and complexity (many layers, hidden units, etc.).

Existing Approaches and Limitations

  • Model Pruning: Removing parts of the LLM might hurt its generalization ability.

  • Sparse Models: Designed for efficiency, but compatibility with other acceleration methods can be an issue.

New Strategy: Adaptive Inference

  • Inspiration: Humans think differently about simple vs. complex tasks. Studies suggest that 'easy' inputs can be handled by a model's earlier layers, while harder inputs require deeper layers.

  • The Idea: Dynamically stop the LLM's inference process early depending on the input's complexity – saving computation for easier tasks.

AdaInfer: The Proposed Solution

  • Key Principle: Doesn't change the LLM's weights, preserving performance.

  • How it Works:

    1. Statistical analysis of the LLM's internal behavior during different tasks.

    2. Features are built at each layer, primarily from the logits (the model's raw output scores before the final prediction).

    3. A simple classifier decides whether the LLM can confidently stop at an earlier layer (a minimal sketch of this early-exit loop follows this list).
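As an illustration of this pipeline, here is a minimal sketch of layer-wise early exit with logit-based stopping features. It uses a Hugging Face GPT-2 model purely for convenience; the chosen features (top-1 probability and the gap between the top two probabilities) and the simple threshold rule standing in for the trained classifier are assumptions based on the description above, not the authors' exact implementation.

```python
# Minimal sketch: layer-wise early exit with logit-based stopping features.
# Assumptions: GPT-2 via Hugging Face transformers; features = top-1 probability
# and the top-1/top-2 gap; a fixed threshold stands in for AdaInfer's trained
# classifier, which this sketch does not reproduce.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

def early_exit_layer(prompt, top_prob_threshold=0.8, gap_threshold=0.5):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding output; the rest are the block outputs.
    for layer_idx, hidden in enumerate(outputs.hidden_states[1:], start=1):
        # Project the intermediate hidden state of the last token through the
        # final layer norm and the LM head to get per-layer logits.
        last_token = hidden[:, -1, :]
        logits = model.lm_head(model.transformer.ln_f(last_token))
        probs = torch.softmax(logits, dim=-1)

        top2 = torch.topk(probs, k=2, dim=-1).values.squeeze(0)
        top_prob = top2[0].item()          # confidence of the best next token
        gap = (top2[0] - top2[1]).item()   # margin over the runner-up

        # Stand-in for the classifier: stop when both features are high enough.
        if top_prob >= top_prob_threshold and gap >= gap_threshold:
            return layer_idx, top_prob, gap

    # No early exit was confident enough; all layers were used.
    return len(outputs.hidden_states) - 1, top_prob, gap

layer, conf, margin = early_exit_layer("The capital of France is")
print(f"Could stop at layer {layer} (top prob {conf:.2f}, gap {margin:.2f})")
```

In the paper's setup, features like these would be collected over many examples to train the stopping classifier offline; projecting intermediate hidden states through the LM head is just one convenient way to obtain per-layer logits for this sketch.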

Contributions

  • Results: AdaInfer saves around 14.8% of computations on average, up to 50% in some cases, without sacrificing accuracy.

  • Compatibility: Crucially, AdaInfer can be combined with other acceleration techniques for even greater efficiency.

  • First of its Kind: This is an early exploration of adaptive inference specifically for LLMs.

--> For complete details, refer to the paper.