The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Abstract: Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B and 7.5B parameter language models trained on it.

Synopsis

Overview

  • Keywords: RefinedWeb, Falcon LLM, web data, large language models, dataset quality
  • Objective: Demonstrate that a properly filtered and deduplicated web dataset can outperform curated corpora in training large language models.
  • Hypothesis: Web data alone, when adequately processed, can yield language models that surpass those trained on curated datasets.
  • Innovation: Introduction of the RefinedWeb dataset, which utilizes stringent filtering and deduplication to create a high-quality, large-scale web dataset for training language models.

Background

  • Preliminary Theories:

    • Scaling Laws: The principle that both model size and dataset size should increase together to improve performance in large language models.
    • Curation vs. Web Data: Traditional belief that curated datasets yield better model performance compared to raw web data, which is often seen as inferior.
    • Deduplication: The process of removing duplicate entries from datasets, which has been shown to enhance model performance by reducing memorization (see the MinHash sketch after this list).
    • Zero-Shot Learning: The ability of a model to generalize to unseen tasks without additional training, a key evaluation metric for language models.
  • Prior Research:

    • C4 Dataset (2020): Introduced a large-scale web dataset but was criticized for not matching the quality of curated datasets.
    • The Pile (2020): A curated dataset combining various sources, including web data, books, and technical papers, widely used for training language models.
    • OSCAR Datasets (2019-2022): Focused on multilingual datasets but faced issues with deduplication and filtering, impacting performance.
    • GPT-3 (2020): Demonstrated the capabilities of large language models trained on curated datasets, setting a benchmark for subsequent models.
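
The fuzzy-deduplication idea referenced above is typically implemented with MinHash, which approximates the Jaccard similarity between documents from compact signatures (RefinedWeb pairs this with exact substring deduplication). Below is a minimal, self-contained Python sketch of the technique; the shingle size, number of hash functions, and 0.8 similarity threshold are illustrative assumptions, not the paper's settings.

```python
import hashlib

NUM_HASHES = 128           # number of hash functions per signature (illustrative)
SHINGLE_SIZE = 5           # word n-gram size used as the document "shingle"
DUPLICATE_THRESHOLD = 0.8  # estimated Jaccard similarity treated as a near-duplicate


def shingles(text: str, n: int = SHINGLE_SIZE) -> set:
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hashed(seed: int, shingle: str) -> int:
    """Deterministic 64-bit hash of a shingle under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(doc_shingles: set, num_hashes: int = NUM_HASHES) -> list:
    """MinHash signature: for each seeded hash function, keep the minimum value."""
    return [min(hashed(seed, s) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
    doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
    sig1 = minhash_signature(shingles(doc1))
    sig2 = minhash_signature(shingles(doc2))
    similarity = estimated_jaccard(sig1, sig2)
    print(f"estimated Jaccard similarity: {similarity:.2f}")
    if similarity >= DUPLICATE_THRESHOLD:
        print("near-duplicates: keep only one copy")
```

In a production pipeline the signatures would be bucketed with locality-sensitive hashing so that only candidate pairs are compared, rather than all pairs.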

Methodology

  • Key Ideas:

    • MacroData Refinement (MDR): A novel pipeline for filtering and deduplicating web data from CommonCrawl, aimed at producing high-quality datasets.
    • Stringent Filtering: Implementation of strict heuristic rules to remove low-quality content and preserve the dataset's integrity (see the filtering sketch after this list).
    • Exact and Fuzzy Deduplication: Combining both methods to maximize the removal of duplicate content while retaining diverse data.
  • Experiments:

    • Ablation Studies: Evaluated the impact of different stages of data processing (raw, filtered, final) on model performance.
    • Zero-Shot Evaluation: Used a variety of tasks to assess the generalization capabilities of models trained on RefinedWeb compared to those trained on curated datasets.
    • Benchmark Comparisons: Compared performance against state-of-the-art models, including GPT-3 and models trained on The Pile.
  • Implications: The methodology suggests that high-quality web data can serve as a viable alternative to curated datasets, challenging existing paradigms in dataset preparation for language models.
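
As referenced in the Key Ideas above, the sketch below illustrates what a stringent document-filtering stage can look like in practice: document-level quality heuristics followed by line-wise corrections. The thresholds and helper names (`repetition_ratio`, `keep_document`, `clean_lines`, `refine`) are illustrative assumptions, not the MDR pipeline's exact rules, which also include URL filtering, trafilatura-based text extraction, and language identification.

```python
from collections import Counter


def repetition_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams taken up by the single most frequent n-gram."""
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    most_common_count = Counter(ngrams).most_common(1)[0][1]
    return most_common_count / len(ngrams)


def keep_document(text: str) -> bool:
    """Document-level quality heuristics; return True to keep the document."""
    words = text.split()
    if not 50 <= len(words) <= 100_000:          # length bounds (assumed values)
        return False
    if repetition_ratio(text) > 0.2:             # drop highly repetitive pages
        return False
    mean_word_length = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_length <= 10:          # crude gibberish/boilerplate check
        return False
    alpha_fraction = sum(w.isalpha() for w in words) / len(words)
    if alpha_fraction < 0.7:                     # too many symbols or numbers
        return False
    return True


def clean_lines(text: str) -> str:
    """Line-wise correction: drop very short, navigation-like lines."""
    kept = [line for line in text.splitlines() if len(line.split()) >= 3]
    return "\n".join(kept)


def refine(documents: list) -> list:
    """Filter a batch of raw documents, then apply line-wise corrections."""
    return [clean_lines(doc) for doc in documents if keep_document(doc)]
```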

Findings

  • Outcomes:

    • Models trained on RefinedWeb significantly outperformed those trained on curated datasets such as The Pile on zero-shot tasks (scored as in the sketch after this list).
    • Stringent filtering and deduplication were crucial in enhancing model performance, demonstrating that web data can be competitive with curated sources.
    • The public release of a 600-billion-token extract from RefinedWeb provides a large-scale resource for future research.
  • Significance: This research challenges the prevailing belief that curated datasets are essential for high-performing language models, suggesting that with proper processing, web data can achieve similar or superior results.

  • Future Work: Further exploration of the MDR pipeline could enhance the quality of existing datasets, and research into the implications of using web data for multilingual models is warranted.

  • Potential Impact: If the methodologies and findings are widely adopted, they could reshape the landscape of dataset preparation for language models, leading to more efficient training processes and broader accessibility of high-quality datasets.
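
The zero-shot comparisons above follow the standard protocol of scoring each candidate answer by the model's log-likelihood of that continuation given the prompt and predicting the highest-scoring option. A minimal sketch with Hugging Face Transformers is shown below; the `gpt2` checkpoint and the toy question are placeholders, not the paper's models or benchmark tasks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a small placeholder checkpoint used only to keep the sketch light.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` conditioned on `prompt`.

    Assumes the prompt tokenization is a prefix of the prompt+continuation
    tokenization, which holds for typical leading-space continuations with BPE.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from prefix
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_continuation = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logprobs[0, -n_continuation:].sum().item()


prompt = "Question: What is the capital of France?\nAnswer:"
options = [" Paris", " Berlin", " Madrid"]
scores = {option: continuation_logprob(prompt, option) for option in options}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```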

Notes

Meta

Published: 2023-06-01

Updated: 2025-08-27

URL: https://arxiv.org/abs/2306.01116v1

Authors: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

Citations: 502

H Index: 54

Categories: cs.CL, cs.AI

Model: gpt-4o-mini