It is observed that during LLM inference, not all layers need to be actively used for every input.
TLDR - LLM inference is computationally expensive, which makes real-time applications difficult. Statistical analysis shows that not every layer of an LLM is actively needed for every input during inference. AdaInfer is a new algorithm that decides when to stop inference early based on the difficulty of the input. Moreover, it does not change any LLM parameters and works across multiple tasks.
--> For video tutorials on top LLM papers, check the Kalyan KS YouTube channel.
--> For top LLM papers of the week, check the newsletter.
The Problem:
Large Language Models (LLMs) are powerful for tasks like text generation and question answering, adapting well through in-context learning.
However, running them (inference) is computationally expensive due to their size and complexity (many layers, hidden units, etc.).
Existing Approaches and Limitations
Model Pruning: Removing parts of the LLM might hurt its generalization ability.
Sparse Models: Designed for efficiency, but compatibility with other acceleration methods can be an issue.
New Strategy: Adaptive Inference
Inspiration: Humans think differently for simple vs. complex tasks. Studies suggest that 'easy' inputs can be resolved in a model's earlier layers, while harder inputs require deeper layers.
The Idea: Dynamically stop the LLM's inference process early depending on the input's complexity, saving computation on easier inputs.
AdaInfer: The Proposed Solution
Key Principle: Doesn't change the LLM's weights, preserving performance.
How it Works:
Statistical analysis of the LLM's internal behavior during different tasks.
Features are built at each layer, mainly from the logits (the model's raw output scores before the final token prediction).
A simple classifier decides if the LLM can stop confidently at an earlier layer.
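Below is a minimal sketch of this early-exit loop, using toy tensors in place of a real decoder. At each layer, the current hidden state is projected through a shared output head to get intermediate logits, two simple features are computed from them (the top probability and the gap between the top-1 and top-2 probabilities), and a stand-in for AdaInfer's lightweight stopping classifier decides whether to exit. The tensor sizes, threshold values, and `should_stop` rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

torch.manual_seed(0)

# Toy dimensions; real LLMs use e.g. 32+ layers, 4096-dim hidden states,
# and vocabularies of ~32K tokens.
num_layers, hidden_size, vocab_size = 32, 512, 1000

# Stand-in for the model's shared output projection (lm_head).
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)

def logit_features(hidden_state: torch.Tensor) -> tuple[float, float]:
    """Features from the current layer's logits: top prob and top-1/top-2 gap."""
    probs = torch.softmax(lm_head(hidden_state), dim=-1)
    top2 = torch.topk(probs, k=2).values
    return top2[0].item(), (top2[0] - top2[1]).item()

def should_stop(top_prob: float, gap: float) -> bool:
    # Stand-in for AdaInfer's learned stopping classifier; fixed thresholds
    # (assumed values) play the same role here for illustration.
    return top_prob > 0.8 and gap > 0.3

# Toy per-layer hidden states for the last token position. In a real model,
# these would be produced one layer at a time, and the loop below would
# break out of the decoder's forward pass.
hidden_states = [torch.randn(hidden_size) for _ in range(num_layers)]

exit_layer = num_layers
for layer_idx, h in enumerate(hidden_states, start=1):
    top_prob, gap = logit_features(h)
    if should_stop(top_prob, gap):
        exit_layer = layer_idx
        break

print(f"Exited after layer {exit_layer} of {num_layers}")
```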
Contributions
Results: AdaInfer saves around 14.8% of computation on average, and up to 50% in some cases, without sacrificing accuracy (see the rough arithmetic after this list).
Compatibility: Crucially, AdaInfer can be combined with other acceleration techniques for even greater efficiency.
First of its Kind: This is an early exploration of adaptive inference specifically for LLMs.
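As a rough, illustrative back-of-the-envelope calculation (the layer counts below are assumed, not measurements from the paper): the fraction of per-token decoder-layer compute saved is simply the fraction of layers skipped.

```python
def layer_compute_saved(exit_layer: int, total_layers: int) -> float:
    """Fraction of per-token decoder-layer compute skipped by exiting early."""
    return 1.0 - exit_layer / total_layers

# Illustrative only: a 32-layer model exiting at layer 16 skips ~50% of the
# layer compute (the best case reported), while exiting around layer 27 on
# average corresponds roughly to the ~15% average saving.
print(layer_compute_saved(16, 32))  # 0.5
print(layer_compute_saved(27, 32))  # ~0.156
```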
--> For complete details, refer to the paper.