Building an LLM Pretraining Data Pipeline
A practical guide to building an LLM pretraining data pipeline — cleaning raw text, tokenizing, packing sequences, and exporting training-ready datasets with Hugging Face and Python.
A running log of what I’m building and learning while developing AI agents, machine learning systems, and modern software. Expect experiments, technical deep dives, architecture decisions, and insights from real projects.
How I built RustSight, a Rust CLI tool that analyzes CSV datasets of any size in a single streaming pass — no Python, no memory limits, no runtime errors. A technical deep dive into the streaming architecture, column type detection, and report generation that makes pre-ML data validation instant.
A deep dive into building Narrative AI — an AI agent system that generates authentic LinkedIn posts using research, web search, and structured prompts. Learn the architecture, agent design patterns, and lessons from building a real-world AI content system.