2024-04-19 LLM Shearing

Harvest date: Friday, 2024-04-19

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

My take:

This paper proposes a method, LLM shearing, for starting from a "large" model, reusing part of its weights to build a small model, and then continuing to train it on a small amount of data. Since there is little training data, this is inexpensive, yet performance is very good relative to the model's size and training cost.
The approach builds on Llama 2 in particular, and will no doubt be replicated on the new Llama 3 models, which would make it possible to produce excellent small models (1B parameters?).

Full text:

Mengzhou Xia, Tianyu Gao (Princeton University)


We introduce the Sheared-LLaMA models, the strongest 1.3B and 2.7B public base large language models (LLMs). Our models are produced by LLM-Shearing, an efficient method of constructing LLMs by first pruning a larger existing model and then continually pre-training it. Sheared-LLaMA models are first pruned from the LLaMA2-7B model and then trained on only 50B tokens, 5% of the budget of the previous strongest public 3B model.

Paper: https://arxiv.org/abs/2310.06694
Code: https://github.com/princeton-nlp/LLM-Shearing
Models: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B


Highlight of our results

[Figure] Comparison of a series of ~2.7B public models, including our Sheared-LLaMA model.

Swift Iterations of Open-Source LLMs

Model          Date         Model Scale   Training Tokens   Training Corpora
Pythia         02/13/2023   70M - 12B     300B              The Pile
LLaMA          02/27/2023   7B - 70B      1T                RedPajama*
INCITE         05/05/2023   3B - 7B       800B              RedPajama
OpenLLaMA-v1   06/07/2023   3B - 13B      1T                RedPajama
OpenLLaMA-v2   07/07/2023   3B - 13B      1T                RedPajama, StarCoder, RefinedWeb
LLaMA2         07/18/2023   7B - 70B      2T                Unknown
Mistral        09/28/2023   7B            Unknown           Unknown

*RedPajama is a public reproduction of the LLaMA training data.

Various institutions are actively and consistently releasing more capable open-source LLMs, trained with an increasing amount of compute. Despite being much smaller than proprietary models (GPT-4, Claude, PaLM), these open-source models remain costly to train. To put it into perspective, training a LLaMA2-7B model demands a substantial 184,320 A100 GPU hours. In this blog post, we introduce our methodology to accelerate pre-training by pruning existing strong LLMs.

Overview

Research Question

Can we produce a smaller, general-purpose, and competitive LLM by leveraging existing pre-trained LLMs, while using much less compute than training one from scratch?

Our answer is yes! And surprisingly, the compute savings are tremendous. Specifically, we use structured pruning to achieve this goal, building on prior work on structured pruning and on dynamic data-proportion reweighting (Xie et al., 2023).

Our Approach: LLM-Shearing

We propose two techniques in LLM-Shearing:

Targeted structured pruning: We prune a source model to a pre-specified target architecture (e.g., an existing model's config) while maximizing the pruned model's performance.

Dynamic batch loading: Pruning retains different amounts of information across data domains. Inspired by Xie et al. (2023), we load more data from domains that recover more slowly, with the loading proportions adjusted on the fly; see the sketches after this list.

Combining these two steps allows us to produce a strong smaller model at a small fraction of the usual pre-training compute.
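To make targeted structured pruning concrete, here is a simplified Python sketch. It assumes per-unit importance scores are already available and simply keeps the highest-scoring attention heads and FFN dimensions so the pruned model matches the target shape; the actual method instead learns pruning masks under constraints, so treat this only as an illustration of "pruning to a pre-specified architecture".

import torch

def prune_to_target(head_scores: torch.Tensor,  # (n_layers, n_heads) importance per head
                    ffn_scores: torch.Tensor,   # (n_layers, d_ffn) importance per FFN dim
                    target_heads: int,
                    target_ffn: int):
    """Return boolean keep-masks that match the target architecture in every layer."""
    head_mask = torch.zeros_like(head_scores, dtype=torch.bool)
    ffn_mask = torch.zeros_like(ffn_scores, dtype=torch.bool)
    for layer in range(head_scores.shape[0]):
        head_mask[layer, head_scores[layer].topk(target_heads).indices] = True
        ffn_mask[layer, ffn_scores[layer].topk(target_ffn).indices] = True
    return head_mask, ffn_mask

# Toy example: shrink a 32-layer, 32-head, 11008-dim-FFN source (LLaMA2-7B-like)
# to a hypothetical 16-head, 5504-dim-FFN target; the scores are random placeholders.
head_mask, ffn_mask = prune_to_target(torch.rand(32, 32), torch.rand(32, 11008), 16, 5504)

Dynamic batch loading can likewise be sketched as a simple reweighting step. The update below boosts the sampling weight of domains whose current loss is furthest above a per-domain reference loss, in the spirit of Xie et al. (2023); the exact update rule and the way reference losses are obtained (a scaling-law fit in the paper) are assumptions here, not the verbatim implementation.

import numpy as np

def update_domain_weights(base_weights: np.ndarray,      # original data proportions
                          current_losses: np.ndarray,    # per-domain loss of the pruned model
                          reference_losses: np.ndarray   # per-domain target loss
                          ) -> np.ndarray:
    excess = np.maximum(current_losses - reference_losses, 0.0)  # how far each domain lags
    new_weights = base_weights * np.exp(excess)                  # load more from lagging domains
    return new_weights / new_weights.sum()                       # renormalize to a distribution

# Toy example with three domains (e.g., web, code, books); all numbers are made up.
print(update_domain_weights(np.array([0.7, 0.2, 0.1]),
                            np.array([2.9, 1.5, 2.2]),
                            np.array([2.6, 1.5, 2.1])))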

Future Implications

Performance

Downstream Tasks

We evaluate on an extensive set of downstream tasks, including reasoning, reading comprehension, language modeling, and knowledge-intensive tasks. Our Sheared-LLaMA models outperform existing open-source models of comparable size.

Model                # Pre-training Tokens   Average Performance
LLaMA2-7B            2T                      64.6

1.3B models
OPT-1.3B             300B                    48.2
Pythia-1.4B          300B                    48.9
Sheared-LLaMA-1.3B   50B                     51.0

3B models
OPT-2.7B             300B                    51.4
Pythia-2.8B          300B                    52.5
INCITE-Base-3B       800B                    54.7
Open-LLaMA-3B-v1     1T                      55.1
Open-LLaMA-3B-v2     1T                      55.7
Sheared-LLaMA-2.7B   50B                     56.7
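As an aside on how zero-shot numbers like those above are typically computed, here is a generic sketch of log-likelihood-based multiple-choice scoring with Hugging Face transformers. The model ID assumes the public Hugging Face release of Sheared-LLaMA-1.3B, and the question and options are a toy example, not an actual benchmark item; the paper's results come from a full evaluation harness across many tasks.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "princeton-nlp/Sheared-LLaMA-1.3B"  # assumed Hugging Face model id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the context.
    Assumes the context tokenization is a prefix of the full tokenization."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    start = ctx_ids.shape[1] - 1                           # first option token in targets
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

question = "Q: What gas do plants absorb from the air?\nA:"
options = [" Carbon dioxide", " Oxygen", " Nitrogen"]
scores = [option_logprob(question, o) for o in options]
print(options[int(torch.tensor(scores).argmax())])  # pick the highest-likelihood option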

Instruction Tuning

We instruction-tuned Sheared-LLaMA and other public LMs of similar scale on ShareGPT data and evaluated their open-ended generation ability with GPT-4 as the judge. Sheared-LLaMA's instruction-following ability is also better than that of comparable models.

Continual Pre-Training

Given the same amount of compute, we find that continuing to pre-train a pruned model consistently outperforms continuing to pre-train an existing small LM. When a larger source model exists that is significantly stronger than all existing smaller ones (e.g., LLaMA2-7B is superior to all public 3B models), pruning from the larger model is more cost-efficient than continually training the existing small models.

Consider using it!

We propose a pruning approach, LLM-Shearing, which turns an existing large model into a strong small model using only a small fraction of the usual pre-training compute.

If you pre-train stronger LLMs with better data compositions or new data, LLM-Shearing can be applied to them to derive strong small models at low cost.

If you are an LLM practitioner looking for strong small-scale LLMs to prototype your experiments, consider the Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B models.
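For quick prototyping, a minimal way to load and sample from a released checkpoint with Hugging Face transformers is sketched below. It assumes the models are hosted under the princeton-nlp organization on the Hub; adjust the name if the hosting differs. These are base models, so expect plain continuations rather than chat-style answers.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "princeton-nlp/Sheared-LLaMA-2.7B"  # assumed model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Structured pruning of language models is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))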