Building an LLM Pretraining Data Pipeline
A practical guide to building an LLM pretraining data pipeline — cleaning raw text, tokenizing, packing sequences, and exporting training-ready datasets with Hugging Face and Python.
Everyone talks about training language models. Almost nobody talks about the step that determines whether training works at all: the data pipeline. A model's quality ceiling is set by the data it sees — its format, cleanliness, and packing efficiency — long before a single gradient is computed.
This repo is my implementation of that unglamorous but critical step. Starting from a raw text corpus, the notebook walks through every stage of the data packaging pipeline: cleaning, deduplication, tokenization, sequence packing, and finally serializing the result into Parquet — a training-ready format that the Hugging Face datasets library can stream directly into a training loop.
This post explains the design decisions behind each stage, the non-obvious choices in the packing step, and why Parquet is the right end format for this kind of work.
Introduction
Before a language model can learn anything, its training data has to be transformed into a shape the model can consume: fixed-length sequences of token IDs, packed densely to minimize wasted compute, stored in a format that supports streaming from disk without loading everything into RAM.
This is what "data packaging" means in the context of LLM pretraining. It's distinct from the more commonly discussed steps of fine-tuning or evaluation. Pretraining data is typically unstructured — web crawls, books, code, Wikipedia dumps — and it arrives in inconsistent formats, variable lengths, and with quality ranging from excellent to unusable. The packaging pipeline is what turns this raw material into something a training loop can ingest at scale.
Three problems drive the design of this pipeline:
Quality decay without filtering. Raw web text contains duplicate paragraphs, near-empty documents, and repetitive boilerplate. Left in the dataset, these inflate token counts without adding learning signal. Filtering them out before tokenization is significantly cheaper than training through them.
Padding waste without packing. A naive tokenization approach produces sequences of variable length, which must be padded to a uniform length before batching. On a dataset of typical web documents, this wastes a substantial fraction of GPU memory attending to meaningless pad tokens.
Memory limits without streaming. Pretraining datasets are large. Loading a full dataset into RAM before processing is either slow or impossible. The pipeline needs to process data in passes, writing the result incrementally to disk.
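To make the padding cost concrete, here is a back-of-envelope comparison. The document lengths below are made up for illustration, not measured from any corpus:

```python
# Hypothetical document lengths (tokens per document) — illustrative only
context_length = 128
doc_lengths = [30, 45, 120, 60, 25, 128, 90, 14]

# Padding: every document becomes one context_length sequence
padded_total = len(doc_lengths) * context_length
real_tokens = sum(doc_lengths)
pad_fraction = 1 - real_tokens / padded_total

# Packing: concatenate everything, keep only full windows
packed_sequences = real_tokens // context_length

print(f"padded batch: {pad_fraction:.0%} of tokens are padding")
print(f"packing yields {packed_sequences} full sequences with zero padding")
```

With these toy numbers, half of every padded batch is pad tokens, while packing produces four dense sequences from the same text.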
Core constraint: The output of this pipeline needs to be consumable by a standard Hugging Face training loop with no additional preprocessing — clean token IDs, packed to context length, stored in a streamable format.
Architecture
The pipeline is a linear sequence of transformations applied to the raw dataset, with the preprocessed Parquet file as the only persistent artifact.
Raw Text Corpus (.parquet from Releases)
             │
             ▼
┌─────────────────────────┐
│    Quality Filtering    │ ← Remove short docs, high-repetition paragraphs
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      Deduplication      │ ← Drop duplicate or near-duplicate text blocks
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      Tokenization       │ ← Text → token IDs via HuggingFace tokenizer
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Sequence Packing     │ ← Concatenate + chunk to fixed context length
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Parquet Serialization  │ → preprocessed_dataset.parquet
└─────────────────────────┘
Each stage is a pure transformation on the dataset — no stage reads the output of any other stage except its immediate predecessor. This makes each step independently testable and replaceable.
The preprocessed Parquet file is distributed via GitHub Releases rather than committed to the repository directly. This keeps the repo size manageable while making the processed dataset versioned and reproducible alongside the code.
Implementation: Quality Filtering and Deduplication
The first pass over the dataset removes documents that would contribute noise rather than signal to a language model's training. Two categories of documents get filtered:
Short documents below a minimum token threshold are dropped. A two-sentence document provides almost no useful context for next-token prediction. The model sees the beginning of a sequence and immediately hits an EOS token, learning nothing about long-range structure. The filter threshold is a deliberate hyperparameter — too aggressive and you lose real short-form content; too loose and the noise remains.
High-repetition paragraphs are identified by comparing unique word count against total word count within a document. If a paragraph repeats the same phrases above a configurable ratio, it's discarded. This catches the boilerplate-heavy pages that dominate raw web crawls — cookie consent text, navigation menus scraped as body content, generated spam.
def is_quality_document(text: str, min_tokens: int = 50, max_repeat_ratio: float = 0.3) -> bool:
    tokens = text.split()

    # Filter documents that are too short to be useful
    if len(tokens) < min_tokens:
        return False

    # Filter documents with excessive repetition
    unique_tokens = set(tokens)
    repeat_ratio = 1 - (len(unique_tokens) / len(tokens))
    if repeat_ratio > max_repeat_ratio:
        return False

    return True

# Apply across the full dataset using HF datasets .filter()
filtered_dataset = raw_dataset.filter(
    lambda example: is_quality_document(example["text"]),
    batched=False,
)
Deduplication follows filtering and operates at the paragraph level rather than the full document level. Exact duplicate paragraphs are identified by hash and removed. This step is less dramatic than filtering in terms of rows removed, but it prevents the model from memorizing specific phrasings that appear hundreds of times across a web corpus.
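A minimal sketch of what hash-based paragraph dedup can look like — the helper name and the blank-line paragraph split are assumptions for illustration, not the notebook's exact code:

```python
import hashlib

def dedupe_paragraphs(text: str, seen_hashes: set) -> str:
    """Drop paragraphs whose exact content has been seen before.

    Sketch of exact-match dedup: paragraphs are split on blank lines and
    fingerprinted with SHA-1; the first occurrence of each is kept.
    """
    kept = []
    for para in text.split("\n\n"):
        normalized = para.strip()
        if not normalized:
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier paragraph
        seen_hashes.add(digest)
        kept.append(normalized)
    return "\n\n".join(kept)

seen = set()
doc = "Accept cookies to continue.\n\nReal article text.\n\nAccept cookies to continue."
print(dedupe_paragraphs(doc, seen))
```

Because `seen_hashes` persists across documents, boilerplate that repeats across the whole corpus is also removed, not just repeats within one page. Near-duplicate detection (e.g. MinHash) would need more machinery than this.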
Implementation: Tokenization
After filtering, the cleaned text is tokenized using a Hugging Face tokenizer. The tokenizer converts raw text strings into sequences of integer token IDs — the actual input the model will consume.
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128  # Target sequence length for packing

def tokenize(element: dict) -> dict:
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,  # Preserve tokens beyond context_length
        return_length=True,
    )
    # Keep only sequences that fill the context window
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_dataset = filtered_dataset.map(
    tokenize,
    batched=True,
    remove_columns=filtered_dataset.column_names,
)
Two tokenization choices here are worth calling out explicitly:
return_overflowing_tokens=True tells the tokenizer not to discard tokens beyond max_length — instead, it splits long documents into multiple chunks. This is critical for pretraining: a long article shouldn't be truncated to 128 tokens; it should produce multiple training examples.
Only sequences with length == context_length are retained. This might seem wasteful for the final partial chunk of a document, but it eliminates the need for any padding in the training loop. Every sequence in the output is exactly context_length tokens — no attention mask manipulation, no wasted compute on padding tokens.
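The combined effect of overflow chunking plus the full-window filter can be mimicked with plain lists, without invoking a real tokenizer — the helper below is an illustration of the behavior, not the tokenizer's implementation:

```python
def chunk_full_windows(token_ids: list, context_length: int) -> list:
    """Mimic return_overflowing_tokens plus the length == context_length
    filter: split one document's token IDs into consecutive windows and
    keep only the windows that completely fill the context."""
    chunks = [
        token_ids[i : i + context_length]
        for i in range(0, len(token_ids), context_length)
    ]
    return [c for c in chunks if len(c) == context_length]

# A 300-token document at context_length=128 yields two full training
# examples; the 44-token tail is dropped at this stage.
doc = list(range(300))
print(len(chunk_full_windows(doc, 128)))  # 2
```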
Implementation: Sequence Packing and Parquet Serialization
Packing concatenates multiple short sequences into a single context-length window, separated by EOS tokens. Without packing, a dataset of short documents produces batches where most of each sequence is padding. With packing, the model sees real tokens from start to finish of every training example.
from datasets import Dataset
import numpy as np

def pack_sequences(tokenized_dataset, context_length: int) -> Dataset:
    # Flatten all token IDs into a single stream
    all_tokens = np.concatenate(tokenized_dataset["input_ids"])

    # Chunk into fixed-length sequences
    total_length = (len(all_tokens) // context_length) * context_length
    packed = all_tokens[:total_length].reshape(-1, context_length)
    return Dataset.from_dict({"input_ids": packed.tolist()})

packed_dataset = pack_sequences(tokenized_dataset, context_length=128)
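One detail worth noting: pack_sequences concatenates token streams as-is, and GPT-2's tokenizer does not append an EOS token by default. If you want document boundaries marked explicitly in the packed stream, a variant like the sketch below can append the separator before flattening. This is an illustrative variant, not the notebook's exact implementation; in practice, read the id from tokenizer.eos_token_id rather than hard-coding it:

```python
import numpy as np

EOS_ID = 50256  # GPT-2's end-of-text id — assumption; use tokenizer.eos_token_id

def pack_with_eos(docs_token_ids, context_length: int) -> np.ndarray:
    """Packing variant that appends an EOS token after each document
    before flattening, so document boundaries survive in the packed
    stream. Sketch only."""
    stream = []
    for ids in docs_token_ids:
        stream.extend(ids)
        stream.append(EOS_ID)
    total = (len(stream) // context_length) * context_length
    return np.array(stream[:total]).reshape(-1, context_length)

packed = pack_with_eos([[1, 2, 3], [4, 5], [6, 7, 8, 9]], context_length=4)
print(packed.shape)  # (3, 4)
```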
After packing, the dataset is serialized to Parquet:
# Save to Parquet — columnar format, efficient for streaming during training
packed_dataset.to_parquet("data/preprocessed_dataset.parquet")
# Verify round-trip integrity
reloaded = Dataset.from_parquet("data/preprocessed_dataset.parquet")
print(f"Total sequences: {len(reloaded)}")
print(f"Sequence shape: {len(reloaded[0]['input_ids'])} tokens")
Parquet is the right end format here for three reasons. It's columnar — reading only the input_ids column during training doesn't incur the I/O cost of reading any other metadata. It compresses well compared to raw JSON or CSV. And the Hugging Face datasets library can stream it directly from disk or from S3/Hugging Face Hub, enabling training on datasets larger than available RAM.
Usage
# 1. Clone the repository
git clone https://github.com/omarnahdi/data-packaging
cd data-packaging
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download the preprocessed dataset from GitHub Releases
# Navigate to: https://github.com/omarnahdi/data-packaging/releases
# Download: preprocessed_dataset.parquet
# 4. Place it in the data/ directory
mkdir -p data && mv preprocessed_dataset.parquet data/
# 5. Run the notebook
jupyter notebook "data packaging.ipynb"
The notebook is designed to be run top-to-bottom. Each cell is a self-contained stage in the pipeline with its own output inspection — you can verify the shape and contents of the dataset after filtering, after tokenization, and after packing before committing to the full run.
What We Tried That Didn't Work
Padding instead of packing. The first version of the pipeline padded each tokenized document to context_length rather than packing multiple documents into a single sequence. The result was a dataset where the majority of tokens in every sequence were pad tokens. The model trained, but it was spending most of its compute budget attending to padding — GPU time that produced no learning signal.
Full in-memory processing. An early iteration loaded the entire dataset into a Python list before tokenization. For the dataset sizes this pipeline targets, that's workable on a machine with enough RAM, but it's an unnecessary constraint. Switching to Hugging Face datasets .map() — which processes batches and caches intermediates to disk — made the pipeline usable on machines where the dataset exceeds available memory.
Committing the Parquet file to the repository. The preprocessed dataset is binary and large. Committing it to the Git history bloats the repo permanently and makes cloning slow for anyone who only wants to study the code. GitHub Releases is the right home for large binary artifacts — it's versioned, downloadable on demand, and keeps the repository itself lightweight.
Lesson: The data pipeline is where the model's learning potential is either unlocked or constrained. Small decisions — whether to pack or pad, which documents to filter, what context length to target — have downstream effects that are hard to diagnose once training has started.
Takeaways
Packing is not optional at any meaningful scale. Padding sequences to a fixed length is simple to implement but expensive in practice. Every pad token in a training batch is memory allocated and compute spent on noise. Packing multiple documents into a single context window eliminates this waste entirely.
Filter aggressively before tokenization, not after. Tokenization is a relatively expensive step. Running quality filters on raw text — which is fast — before handing data to the tokenizer means you tokenize only what you intend to train on. The reverse order is a common mistake that wastes time on documents that will never enter the model anyway.
Parquet is the right serialization format for ML datasets. JSON and CSV are readable by humans and writable by anything, but they're expensive to read at training time — every row is parsed on the fly. Parquet's columnar format, built-in compression, and first-class support in the Hugging Face datasets library make it the obvious choice for anything beyond a toy dataset.
What's Next
The natural extension to this pipeline is support for multi-source mixing — combining documents from different corpora (web text, code, books) at configurable ratios. The proportions of different data sources during pretraining have a measurable effect on downstream capability, and the pipeline currently treats all input text identically.
A second extension worth building is streaming tokenization with progress checkpointing. For datasets measured in tens of gigabytes, a single tokenization pass can take hours. If the process is interrupted, the current pipeline restarts from the beginning. Adding checkpoint files after each batch would make long preprocessing runs resumable.
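A minimal sketch of what that checkpointing could look like — the file name, function names, and JSON schema here are all hypothetical, not part of the current pipeline:

```python
import json
import os

CHECKPOINT_PATH = "tokenize_checkpoint.json"  # hypothetical checkpoint file

def load_checkpoint() -> int:
    """Return the index of the next batch to process (0 if starting fresh)."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["next_batch"]
    return 0

def save_checkpoint(next_batch: int) -> None:
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump({"next_batch": next_batch}, f)

def process_batches(batches, process_fn):
    """Resume from the last recorded batch; record progress after each one.
    If the run is interrupted, the next run skips completed batches."""
    start = load_checkpoint()
    for i in range(start, len(batches)):
        process_fn(batches[i])
        save_checkpoint(i + 1)
```

For real runs you'd also want to persist each batch's tokenized output (e.g. one Parquet shard per batch) so the checkpoint index and the written shards stay in sync.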
If you're building a pretraining pipeline and want to compare notes on filtering strategies or packing implementations, the repository is open on GitHub. The notebook is the best place to start — every design decision is annotated inline.
Built with Hugging Face datasets, transformers, and pandas. The preprocessing decisions here are informed by the data pipelines used for models like GPT-2 and the approaches documented in the Hugging Face LLM course.