Data Quality Odyssey
The data is the real problem. Better data beats better models.
The Data Is the Model
Before SHGAT, before GRU, before any architecture decision — there is the data. Four notebooks document how execution traces go from raw, corrupted, mixed-format records to a clean training set. Each notebook surfaced a different class of problem.
First systematic look at Hit@1 scores. Result: a disappointing 38.2%. The question shifts from "how do we train better?" to "what is the training data hiding?" Each data fix is measured in isolation: FQDN unification, n8n filter, canonicalization. Together the gains compound.
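To keep those per-fix comparisons honest, every run is scored with the same Hit@1 metric. A minimal sketch in TypeScript; the hitAt1 name and the ranked-candidate-list shape are illustrative, not taken from the notebooks:

```typescript
// Hit@1: the fraction of examples whose top-ranked prediction equals
// the ground-truth next leaf node. predictions[i] is a ranked candidate
// list for example i; this harness shape is an assumption.
function hitAt1(predictions: string[][], targets: string[]): number {
  if (predictions.length !== targets.length || targets.length === 0) {
    throw new Error("predictions and targets must be aligned and non-empty");
  }
  let hits = 0;
  for (let i = 0; i < targets.length; i++) {
    if (predictions[i][0] === targets[i]) hits++;
  }
  return hits / targets.length;
}
```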
Attempted byte-pair encoding on leaf node names to handle vocabulary explosion. Result: BPE hurts more than it helps at this scale — semantic signal in full leaf node names outweighs the compression benefit.
Workflow data from n8n was added to augment the training set. This notebook
reveals the problem: the n8n data is 60% Smithery noise, and PML-only traces
make up just 40% of the combined pool. Mixing the two drowns the production
signal.
Adding SKIP_N8N=true recovers 3–4 Hit@1 points.
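A minimal sketch of that filter, assuming each trace carries a source tag; the Trace shape and its source field are illustrative, only the flag name comes from the notebooks:

```typescript
// Gate the n8n pool behind SKIP_N8N. The Trace shape and its "source"
// tag are assumptions for illustration.
interface Trace {
  source: "pml" | "n8n";
  leafNodes: string[];
}

function filterTraining(traces: Trace[]): Trace[] {
  if (process.env.SKIP_N8N !== "true") return traces;
  // Keep only PML-native traces; the n8n pool is mostly Smithery noise.
  return traces.filter((t) => t.source !== "n8n");
}
```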
Systematic audit of the data pipeline. UUID corruption (18.4% of traces),
FQDN format inconsistencies (6.1%), mixed short/long leaf node name formats.
Fixing the source — reading from task_results instead of
executed_path — eliminated the corruption without losing traces.
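As a sketch of that source swap, assuming each record exposes both fields; the record shape here is hypothetical, only the two field names come from the audit:

```typescript
// Rebuild each trace's path from task_results instead of the
// UUID-corrupted executed_path. Record shape is hypothetical.
interface RawRecord {
  executed_path?: string[];            // legacy source, corrupted in 18.4% of traces
  task_results?: { toolId: string }[]; // clean per-step results
}

function extractPath(record: RawRecord): string[] {
  if (record.task_results && record.task_results.length > 0) {
    return record.task_results.map((step) => step.toolId);
  }
  // Fall back to the legacy field so no trace is dropped outright.
  return record.executed_path ?? [];
}
```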
Canonicalization collapses 1,240 phantom duplicates: 2,160 leaf nodes → 920 clean.
SKIP_N8N=true is now mandatory.
The Key Insight
The same leaf node can appear as std:psql_query, as
pml.mcp.std.psql_query.db48, or as pml.mcp.std.psql_query.3cd9:
three different strings for the same operation, the last two differing
only by instance hash.
Without normalization, these count as three distinct vocabulary entries.
The softmax over 2,160 "leaf nodes" includes 1,240 phantom duplicates.
normalizeToolId() collapses all three to std:psql_query.
Vocabulary drops from 2,160 to 920. Softmax concentrates. Hit@1 goes up.
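A sketch of normalizeToolId, assuming the instance-qualified form always follows the pml.mcp.&lt;namespace&gt;.&lt;name&gt;.&lt;hash&gt; pattern seen above; the regex is inferred from these examples, not lifted from the pipeline:

```typescript
// Collapse instance-qualified FQDNs to a canonical namespace:name form.
// The pattern is inferred from the examples above
// (pml.mcp.std.psql_query.db48 → std:psql_query); the real rule may
// cover more formats.
function normalizeToolId(toolId: string): string {
  // Already canonical: "std:psql_query".
  if (toolId.includes(":")) return toolId;
  // Instance-qualified: "pml.mcp.std.psql_query.3cd9", with an
  // optional trailing 4-char hex instance hash.
  const match = toolId.match(/^pml\.mcp\.([^.]+)\.(.+?)(?:\.[0-9a-f]{4})?$/);
  if (match) {
    const [, namespace, name] = match;
    return `${namespace}:${name}`;
  }
  // Unknown format: leave untouched rather than guess.
  return toolId;
}
```

Returning unknown formats untouched keeps the normalization safe to rerun over the whole corpus without silently rewriting names it does not recognize.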