RustSight: Building a Fast, Streaming CSV Analyzer in Rust
How I built RustSight, a Rust CLI tool that analyzes CSV datasets of any size in a single streaming pass — no Python environment, no memory ceiling, no dependency juggling. A technical deep dive into the streaming architecture, column type detection, and report generation that make pre-ML data validation instant.
Data scientists spend a significant portion of their time on the step that comes before any model training: validating and understanding the data. It's unglamorous, often manual, and in Python, it gets slow fast. The moment your CSV crosses a few hundred megabytes, pandas.read_csv() starts to feel like a liability.
That's the problem RustSight was built to solve. Not a new algorithm or a novel architecture — just the boring, critical work of dataset analysis, done properly. Written in Rust, it brings streaming CSV analysis, statistical summaries, and automated report generation to the command line in a tool that handles files larger than your available RAM without breaking a sweat.
This post covers the technical decisions behind RustSight, why Rust was the right choice for this problem specifically, what the streaming architecture looks like in practice, and what I'd extend next.
Introduction
The standard pre-ML workflow looks something like this: load a CSV into a pandas DataFrame, run .describe(), check .isnull().sum(), inspect column dtypes, then start cleaning. For small datasets this is fine. For anything over a few hundred megabytes it becomes painful — and for files that exceed available RAM, it becomes impossible without chunking, which adds complexity and breaks the exploratory flow.
The deeper issue is that Python's data analysis tools are optimized for interactivity, not throughput. pandas loads the entire dataset into memory as a prerequisite to any analysis. Even lightweight profiling tools like ydata-profiling follow the same model: full load first, analysis second.
There's also the ecosystem fragility problem. A Python-based CLI tool carries an implicit dependency on the right Python version, a virtual environment, and a compatible set of library versions. Distributing it to a teammate or a CI pipeline means managing all of that. A compiled Rust binary has none of those requirements.
Core constraint: The tool needs to handle CSV files of arbitrary size — including files larger than available RAM — without requiring the user to know or care about chunking, batching, or memory management.
The combination of these constraints pointed clearly toward a compiled, streaming-first approach. Rust was the natural fit.
Architecture
The central architectural decision in RustSight is that the full dataset is never held in memory at once. Analysis is done in a single streaming pass over the file.
Input CSV File (any size)
           │
           ▼
┌─────────────────────┐
│  csv crate reader   │ ← Row-by-row streaming, no full load
└──────────┬──────────┘
           │ one row at a time
           ▼
┌─────────────────────────────────────────────┐
│          Column State Accumulators          │
│                                             │
│  Numeric columns  → min, max, sum, count    │
│  Categorical cols → value frequency map     │
│  All columns      → missing value counter   │
└──────────────────────┬──────────────────────┘
                       │ after final row
                       ▼
              ┌─────────────────┐
              │  Report Writer  │ → filename_report.txt
              └─────────────────┘
Each column maintains its own accumulator struct. Statistics are computed incrementally — mean, min, max, and missing counts update on each row without storing previous rows. The report is written only after the full file has been streamed through.
The implication: analyzing a 10GB CSV file requires the same working memory as analyzing a 10MB one. The only thing that scales with file size is time, not RAM.
Implementation: Column Type Detection
The first non-trivial problem is that CSV files carry no type information. Every value arrives as a string. RustSight needs to determine — column by column — whether a field is numeric or categorical, and it has to make that determination on the fly without a schema.
The approach is a single-pass heuristic. Every column tries to parse each cell as f64. If parsing succeeds, the value is routed into numeric accumulators. If not, it increments a categorical frequency counter. Empty cells are counted separately as missing values.
// Simplified column classification logic
fn classify_and_accumulate(cell: &str, state: &mut ColumnState) {
    let trimmed = cell.trim();
    if trimmed.is_empty() {
        state.missing_count += 1;
        return;
    }
    match trimmed.parse::<f64>() {
        Ok(val) => {
            // Update numeric accumulators in-place
            state.numeric_sum += val;
            state.numeric_count += 1;
            if val < state.numeric_min { state.numeric_min = val; }
            if val > state.numeric_max { state.numeric_max = val; }
        }
        Err(_) => {
            // Treat as categorical — increment frequency counter
            *state.category_counts
                .entry(trimmed.to_string())
                .or_insert(0) += 1;
        }
    }
}
One deliberate choice: the f64 parse attempt happens on every cell, not just the first N rows. Some real-world CSVs have mixed columns — mostly numeric, with occasional string sentinels like "N/A" or "unknown". Sampling only the first few rows would misclassify these columns as purely numeric. The full-pass approach catches them correctly, at the cost of a parse attempt per cell — which is cheap relative to the I/O cost of reading the file.
Implementation: Report Generation
The output of RustSight isn't printed to stdout and forgotten — it writes a persistent filename_report.txt alongside the analyzed file. This matters for documentation and reproducibility: a data scientist running analysis on CVD Dataset.csv gets CVD Dataset_report.txt automatically, with no extra flags required.
// Report writer — plain text, grep-able, version-control friendly
use std::fs::File;
use std::io::Write;

fn write_report(path: &str, columns: &[ColumnState]) -> std::io::Result<()> {
    let report_path = format!("{}_report.txt", path.trim_end_matches(".csv"));
    let mut file = File::create(&report_path)?;
    writeln!(file, "=== RustSight Dataset Analysis ===")?;
    writeln!(file, "Source: {}\n", path)?;
    for col in columns {
        writeln!(file, "Column: {}", col.name)?;
        writeln!(file, "  Type: {}", col.inferred_type())?;
        writeln!(file, "  Missing: {} ({:.1}%)",
            col.missing_count,
            col.missing_pct()
        )?;
        if col.is_numeric() {
            writeln!(file, "  Min: {:.4}", col.numeric_min)?;
            writeln!(file, "  Max: {:.4}", col.numeric_max)?;
            writeln!(file, "  Mean: {:.4}", col.mean())?;
        } else {
            writeln!(file, "  Unique: {}", col.category_counts.len())?;
            // Top 5 categories by frequency
            for (val, count) in col.top_categories(5) {
                writeln!(file, "    {:>6}x {}", count, val)?;
            }
        }
        writeln!(file)?;
    }
    Ok(())
}
The report format is intentionally plain text rather than JSON or HTML. It's grep-able, diffable in version control, and readable by anyone without a tool. For a workflow artifact that gets shared between teammates or checked into a repo alongside a dataset, that's the right call.
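For a sense of what that artifact looks like, a report for one numeric and one categorical column would read roughly like this (all values are illustrative, not real output):

```text
=== RustSight Dataset Analysis ===
Source: stockdata.csv

Column: close
  Type: numeric
  Missing: 3 (0.1%)
  Min: 12.0400
  Max: 98.7100
  Mean: 45.3321

Column: sector
  Type: categorical
  Missing: 0 (0.0%)
  Unique: 11
     812x technology
     604x finance
```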
Usage
RustSight ships as a single compiled binary. You can install it directly from crates.io:
# Install via cargo
cargo install rustsight
# Analyze a CSV file — produces stockdata_report.txt
rustsight csv stockdata.csv
# Analyze any file (text or binary metadata)
rustsight analyze dataset.csv
# Healthcare example
rustsight csv "CVD Dataset.csv"
No Python environment. No pip install. No version conflicts. Use rustsight --help for a full list of commands and options.
What We Tried That Didn't Work
A visual TUI dashboard. The temptation was to add a terminal UI with charts using something like ratatui. It was cut. A report file that works in every environment — CI, remote SSH, headless servers — is more useful than a visual that requires a compatible terminal. Adding a TUI is easy later; removing a dependency that users have come to rely on is not.
Automatic outlier detection. Several iterations included a statistical outlier flag (values beyond N standard deviations from the mean). It was removed because the threshold is inherently domain-dependent. A value that's an outlier in healthcare data is perfectly normal in financial data. Flagging it automatically creates false confidence. The report gives you the min, max, and mean — the decision about what constitutes an anomaly belongs to the data scientist, not the tool.
JSON output mode. This one is genuinely worth revisiting. Piping report output into a downstream tool (a CI quality gate, a data contract checker) would be cleaner with structured JSON than with parsed text. It's the one deliberate cut that may get reversed.
Lesson: A focused CLI tool that does one thing correctly is more useful than a feature-rich tool that requires configuration to use correctly. Every flag you add is a decision you're pushing onto the user.
Takeaways
Streaming architecture is the correct default for file analysis tools. Loading a file into memory to analyze it is a design smell, not a starting point. If your analysis can be expressed as a single-pass accumulation — and most summary statistics can — you get arbitrary file size support for free.
Rust's type system prevents whole classes of runtime errors that plague data processing pipelines. Option<T> makes missing values explicit instead of surfacing as NullPointerExceptions, there are no implicit coercions between numeric types, and debug builds panic on integer overflow rather than wrapping silently. The compiler catches the categories of bugs that show up at 2am in production Python pipelines.
A report file is a better artifact than stdout. Results that exist only in a terminal session are lost the moment the session closes. Writing a persistent, human-readable file alongside the input makes analysis reproducible and shareable with no extra steps from the user.
What's Next
The most useful extension is parallel column processing. Right now all columns are processed in a single-threaded loop. For wide datasets — thousands of columns — distributing column accumulation across Rayon's thread pool would reduce analysis time proportionally to core count. The streaming row reader would remain single-threaded (disk I/O doesn't parallelize usefully), but the per-cell accumulation is embarrassingly parallel.
The second extension is a --schema flag that accepts a YAML file defining expected column types, allowed value ranges, and required non-null columns. This turns RustSight from an exploratory tool into a data contract enforcer suitable for CI pipelines — a dataset that violates its schema exits with a non-zero code and a diff-friendly report.
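A schema file for that flag might look something like the following. The format below is purely hypothetical, sketched to illustrate the idea, not an implemented spec:

```yaml
# Hypothetical schema for the proposed --schema flag.
# Field names are illustrative only.
columns:
  age:
    type: numeric
    min: 0
    max: 120
    required: true     # any missing value fails the contract
  city:
    type: categorical
    required: false
```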
If you're doing pre-training data validation at scale or want to replace a slow Python profiling step in your ML pipeline, the project is MIT licensed on GitHub and contributions are welcome.
Built with Rust and the csv crate. Thanks to the Rust community for making systems programming approachable enough that a tool like this takes days, not weeks.