2024-04-23 Hugging Face used 120k GPU hours to publish FineWeb

Date collected: Tuesday, 2024-04-23

Thomas Wolf on LinkedIn: A note on our FineWeb dataset release

My take:

A few days ago (2024-04-23), Hugging Face released FineWeb, the largest high-quality text corpus for training LLMs (15 trillion tokens!). In this LinkedIn post, Thomas Wolf, Hugging Face's Chief Science Officer, explains how they built the dataset. To judge the quality of a dataset from an LLM-training point of view, there is no better method than training LLMs on it and measuring how good they turn out to be. The Hugging Face team therefore ran "ablation studies" and trained 200 different "small" and 15 "large" models to compare their performance... which amounts to roughly 120,000 GPU hours. By releasing the data (and the code used to prepare it), Hugging Face saves time and money (several hundred thousand euros/dollars) for everyone who wants to train foundation models!

Full text:

A note on our FineWeb dataset release: surprisingly, the size of the dataset (15T tokens) is not the most important part; what matters much more is why we spent ~120k GPU hours on our H100 cluster to prepare/share a ... dataset.

Let's take a moment to dive into it!

First, where can you get data at scale for web-scale LLM pretraining? Well, we're all lucky that Common Crawl has been doing, in the open, a large chunk of the work of crawling/archiving the web for years, work that otherwise only private teams like Google/Bing would have access to.

Next: can you just train directly on the petabytes of the Common Crawl corpus, maybe just extracting the text from the HTML pages and training on it? You would have the largest possible dataset!

The answer we learned over time is: no! You actually want a dataset that is both large and high quality.
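(For illustration, here is a minimal sketch of that naive extraction step, assuming a locally downloaded Common Crawl WARC segment and the third-party warcio and trafilatura libraries; the file path is a placeholder, and the real FineWeb pipeline does considerably more than this.)

```python
# Minimal sketch: pull plain text out of a Common Crawl WARC file.
# Assumptions: a WARC segment downloaded locally (placeholder path below)
# and the `warcio` and `trafilatura` packages installed.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

WARC_PATH = "CC-MAIN-example.warc.gz"  # placeholder: any Common Crawl WARC segment


def iter_page_texts(warc_path):
    """Yield (url, extracted_text) for each HTML response record in the WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # returns None when extraction fails
            if text:
                yield url, text


if __name__ == "__main__":
    for url, text in iter_page_texts(WARC_PATH):
        print(url, len(text))
```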

What is high quality for a web-scale LLM pretraining dataset, and how do you know you have it? Should I just train on Wikipedia only? The answer to this is also no: first because Wikipedia is too small, but more importantly because our intuitive notion of data quality is not always reflected in the performance of the models, and some aspects of data filtering can be counter-intuitive.

Before I dive deeper into this, let me give you an example of unintuitive behavior. Between 2022 and 2023, the "LLM quality" of Common Crawl dropped significantly, in the sense that training an LLM on the 2022-2023 crawls will give you lower performance on a set of evals. What happened? It turns out the Common Crawl team had been filtering domains with adult content more aggressively. Not really the cause we'd intuitively think of, right?

So how do you know you have good-quality data? Well, the simple, circular answer is: you just train on it. Train smaller models so it's not (too) expensive, but models that are big enough (and evaluated on sensitive enough benchmarks) to give signal about the quality of a larger model trained on the dataset: what we call "ablation models".
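(To make the protocol concrete, here is a schematic sketch of the ablation loop. Everything in it is a stand-in: the two heuristic filters are toy examples, not FineWeb's actual filters, and the training/evaluation step is collapsed into a dummy placeholder so the sketch runs end to end.)

```python
# Schematic sketch of the "ablation model" protocol: apply each candidate
# filter to the raw corpus, train a small model on each filtered variant,
# and compare eval scores. Filters and the train/eval step are placeholders,
# NOT the actual FineWeb pipeline.

def filter_min_length(doc: str) -> bool:
    """Toy heuristic: keep documents with at least 50 words."""
    return len(doc.split()) >= 50


def filter_alpha_ratio(doc: str) -> bool:
    """Toy heuristic: keep documents where most characters are alphabetic."""
    alpha = sum(c.isalpha() for c in doc)
    return alpha / max(len(doc), 1) > 0.7


CANDIDATE_FILTERS = {
    "baseline (no filter)": lambda doc: True,
    "min_length": filter_min_length,
    "alpha_ratio": filter_alpha_ratio,
}


def train_and_eval(docs):
    """Placeholder for the expensive part: train a small ablation model on
    `docs` and return its aggregate benchmark score. Returns a dummy number
    here so the sketch executes; it is NOT a real quality signal."""
    return float(len(docs))


def compare_filters(raw_corpus):
    """Run one ablation per filter option and collect the eval scores."""
    scores = {}
    for name, keep in CANDIDATE_FILTERS.items():
        filtered = [doc for doc in raw_corpus if keep(doc)]
        scores[name] = train_and_eval(filtered)
    return scores


if __name__ == "__main__":
    corpus = ["some short page", "a much longer page " * 20, "12345 67890 !!!"]
    print(compare_filters(corpus))
```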

We settled on 2 ways of training ablation models: in the first, we trained a 1.8B-parameter model on 28B tokens (about 5h on 64 H100s). In the second, we trained the same model for 350B tokens (about 2.5 days). Note that these larger ablations were trained on more tokens than GPT-3, for instance...

For each of the filter options we explored, we trained ablation models and compared performance to see improvements or regressions. We trained about 200 small and 15 larger models, for a total of ~120k GPU hours.
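(As a back-of-the-envelope check, and assuming the larger 350B-token ablations also ran on 64 H100s, which the post implies but does not state explicitly, these numbers roughly reconcile with the ~120k GPU-hour figure:)

```python
# Back-of-the-envelope check of the ~120k GPU-hour figure.
# Assumption: the 350B-token ablations also ran on 64 H100s (not stated explicitly).
small_runs = 200 * 64 * 5        # 200 small ablations x 64 GPUs x ~5 h      = 64,000 GPU hours
large_runs = 15 * 64 * 2.5 * 24  # 15 large ablations x 64 GPUs x ~2.5 days  = 57,600 GPU hours
print(small_runs + large_runs)   # ~121,600 GPU hours, i.e. roughly 120k
```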

The results of these evaluations are shown, e.g., in the performance plots we include in the dataset card (on 350B tokens).

This is the main difference with raw Common Crawl or RedPajama-V2: in those cases you still need to do the work of deciding how to filter the data yourself, whereas that is precisely the work we wanted to provide to the community with FineWeb.
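(For anyone who wants to use the result directly, loading FineWeb from the Hugging Face Hub looks roughly like this. The `HuggingFaceFW/fineweb` repo id and the sampled config name below are taken from the dataset card at the time of writing; check the card for the current options.)

```python
# Minimal sketch: stream a sample of FineWeb from the Hugging Face Hub.
# Repo id and config name are as listed on the dataset card; they may evolve.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # small sampled config; the full dataset is ~15T tokens
    split="train",
    streaming=True,       # stream rather than downloading terabytes locally
)

for row in ds.take(3):
    print(row["url"], row["text"][:200])
```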

At least a first version of it, since we see FineWeb as a resource to improve over time along various directions: better filtering, multilinguality, etc.