ShortGPT: Layers in Large Language Models are
More Redundant Than You Expect (short summary)

ShortGPT - A New LLM Pruning Approach

TLDR - Large language models (LLMs) keep growing larger to achieve better performance, but their size creates bottlenecks for deployment. Model compression techniques, like pruning, make LLMs smaller by removing some parameters while maintaining almost the same performance. Existing pruning methods require extra information (like gradients) or are complex to apply. The authors propose a new approach that removes less important layers, ranked by a new metric called Block Influence (BI).

--> For video tutorials on top LLM papers, check Kalyan KS YouTube channel

--> For top LLM papers of the week, check the newsletter.

Introduction

Large language models (LLMs) have become increasingly powerful, but their massive size makes them difficult to use in real-world applications. Model compression, specifically pruning (removing parameters), is one way to address this. However, many pruning methods are complicated and not well-suited to LLMs.

Main Findings

This paper presents a simpler and more effective LLM pruning method based on these discoveries:

  • LLMs have lots of redundant layers: Layers can be removed with minimal impact on performance. Example: A 40-layer LLM had 25% of its layers removed with only a slight performance drop.

  • Block Influence (BI) metric: BI measures how much a layer transforms its input, computed from the similarity between a layer's input and output hidden states. Layers with low BI change their input little, so this metric is useful for identifying which layers are least important.

  • Layer-Based Pruning: Simply removing layers with low BI scores works remarkably well for pruning LLMs. This method is more effective than previous, more complicated techniques.
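The BI-based pruning described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the paper defines BI as one minus the average per-token cosine similarity between a layer's input and output hidden states, and the function names and shapes below are my own assumptions (hidden states would come from running a calibration dataset through the model).

```python
import numpy as np

def block_influence(hidden_in: np.ndarray, hidden_out: np.ndarray) -> float:
    """BI score for one layer: 1 minus the mean cosine similarity
    between each token's hidden state before and after the layer.
    A low score means the layer barely transforms its input (redundant).
    hidden_in, hidden_out: arrays of shape (num_tokens, hidden_dim).
    """
    dots = np.sum(hidden_in * hidden_out, axis=-1)
    norms = np.linalg.norm(hidden_in, axis=-1) * np.linalg.norm(hidden_out, axis=-1)
    return float(1.0 - np.mean(dots / norms))

def layers_to_prune(bi_scores, num_remove: int) -> list:
    """Indices of the num_remove layers with the lowest BI scores."""
    order = np.argsort(bi_scores)  # ascending: least influential first
    return sorted(order[:num_remove].tolist())

# Example: given per-layer BI scores, drop the two least influential layers.
scores = [0.5, 0.1, 0.9, 0.2]
print(layers_to_prune(scores, 2))  # -> [1, 3]
```

In practice, one would collect the hidden states at every layer boundary for a calibration corpus, compute one BI score per layer, and then simply delete the lowest-scoring layers from the model, which is what makes the method so easy to implement.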

Benefits & Additional Points

  • Simplicity: This layer-based pruning method is easy to implement in practice.

  • Efficiency boost: Smaller models are less computationally demanding.

  • Combination potential: This pruning method can be combined with other compression techniques like quantization for further benefits.

--> For complete details, refer to the ShortGPT paper.