
What Didn't Work

The roads not taken. Five ideas that failed.


Negative Results Are Results

Five notebooks document failed approaches. Each one taught something useful about the data, about the model, or about which assumptions were wrong. They are not regrets. They are the map of the space.

NB-03 Graph Toy Problem (Manual MP)

Message passing was implemented manually on a toy 5-node graph to validate the gradient computation before scaling up. The toy problem worked, but the real lesson was that 282 training examples are not enough for message passing to generalize: the W_up and W_down projection matrices need 1,000+ examples to learn meaningful projections. The manual implementation was abandoned in favor of the attention-based approach.
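A minimal sketch of the manual step, assuming dense adjacency and single-round propagation; the W_up/W_down names come from the notebook, everything else (shapes, the tanh nonlinearity, the dummy loss) is illustrative:

```python
import torch

torch.manual_seed(0)

# Toy 5-node graph: dense adjacency (1 where an edge exists). Illustrative.
A = torch.tensor([
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=torch.float32)

d = 8                                           # hidden dimension (assumed)
H = torch.randn(5, d, requires_grad=True)       # node states
W_up = torch.randn(d, d, requires_grad=True)    # projection into message space
W_down = torch.randn(d, d, requires_grad=True)  # projection back to node space

# One round of manual message passing:
# 1. project node states up, 2. aggregate over neighbors, 3. project back.
messages = A @ (H @ W_up)               # sum of neighbors' projected states
H_next = torch.tanh(messages @ W_down)  # updated node states

# Gradient check against a dummy loss, as in the notebook's validation step.
loss = H_next.pow(2).sum()
loss.backward()
print(W_up.grad.shape, W_down.grad.shape)  # torch.Size([8, 8]) twice
```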

Lesson: 282 examples are not enough to train message-passing weight matrices.
[Figure: GRU workflow sequencer on toy graph, manual message-passing validation]
NB-03 · Toy graph. Manual gradient check on a 5-node hypergraph. The math worked, but 282 traces weren't enough to train the weight matrices.
NB-04 Sequential Graph Scaling

Attempted to scale up the sequential-graph GRU by increasing the hidden dimension and adding layers. Training time grew quadratically while accuracy gains were marginal; the scaling law was unfavorable at this data size. Smaller and smarter (attention + residual) outperformed larger and naive.
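The quadratic cost follows directly from the GRU's parameter count, which grows with the square of the hidden dimension. A quick check (the sizes are illustrative, not NB-04's exact configuration):

```python
import torch.nn as nn

# GRU parameters grow roughly as O(hidden^2): three gates, each with
# input-to-hidden and hidden-to-hidden weight matrices.
for hidden in (64, 128, 256, 512):
    gru = nn.GRU(input_size=64, hidden_size=hidden, num_layers=2)
    n_params = sum(p.numel() for p in gru.parameters())
    print(f"hidden={hidden:4d}  params={n_params:,}")

# Doubling the hidden dimension roughly quadruples the parameter count,
# while (per NB-04) accuracy gains at this data size were marginal.
```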

Lesson: scaling parameters without data is just overfitting.
[Figure: sequential-graph GRU scaling, accuracy vs hidden dimension]
NB-04 · Scaling law. Training cost scales quadratically; accuracy gains are marginal. Smaller and smarter beats larger and naive.
NB-18 N8N Data Augmentation

n8n workflow data was added to the training pool to compensate for the small production dataset: 12,000 n8n traces vs 4,600 production traces, with n8n weighted at 0.3 and production oversampled 3x. Result: Hit@1 dropped from 38.2% to 35.3%, MRR from 0.547 to 0.508, and training time grew 4x. 60% of the n8n data is Smithery noise; it expanded the vocabulary with 1,240 phantom nodes and diluted the production signal. The augmentation was removed entirely.
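A sketch of the mixing scheme as described, here with PyTorch's WeightedRandomSampler; the dataset objects are stand-ins, and only the counts and weights (12,000 vs 4,600, 0.3, 3x) come from the notebook:

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-in datasets: 12,000 n8n traces, 4,600 production traces.
n8n = TensorDataset(torch.zeros(12_000, 1))
prod = TensorDataset(torch.ones(4_600, 1))
pool = ConcatDataset([n8n, prod])

# Per-example sampling weights: 0.3 for n8n, 3x oversampling for production.
weights = torch.cat([
    torch.full((len(n8n),), 0.3),
    torch.full((len(prod),), 3.0),
])
sampler = WeightedRandomSampler(weights, num_samples=len(pool), replacement=True)
loader = DataLoader(pool, batch_size=64, sampler=sampler)

# Even with this weighting, NB-18 found the n8n noise (60% Smithery traces,
# 1,240 phantom vocabulary nodes) diluted the production signal.
```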

Lesson: noisy data at high volume beats clean data at low volume — in the wrong direction.
[Figure: beam rescoring, baseline vs SHGAT-rescored vs oracle]
NB-18 · Beam rescoring. SHGAT rescoring adds 0.0pp over the 64.6% baseline. The oracle reaches 81.3%; the gap is there, but the rescorer can't access it.
NB-23 Centroid Inference Strategy

Instead of predicting the next node directly, the model would predict a centroid in embedding space and retrieve the nearest vocabulary entry. This would generalize to unseen nodes without retraining. With 104 nodes being multi-cap (belonging to multiple parent nodes) and 324 pairs being SHGAT-similar, the centroid idea had surface appeal.
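The mechanism, sketched with stand-in embeddings; the vocabulary size, dimension, and cosine retrieval are assumptions, not NB-23's exact setup:

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 500, 64
vocab_emb = F.normalize(torch.randn(vocab_size, dim), dim=-1)  # node embeddings

def centroid_predict(model_output: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Retrieve the nearest vocabulary entries to the predicted centroid."""
    centroid = F.normalize(model_output, dim=-1)
    sims = centroid @ vocab_emb.T           # cosine similarity to every node
    return sims.topk(k, dim=-1).indices     # nearest-neighbor retrieval step

# The appeal: an unseen node only needs an embedding row, not retraining.
# The failure mode (NB-23): the retrieval step adds error that direct
# classification over the vocabulary never pays.
pred = centroid_predict(torch.randn(1, dim))
print(pred.shape)  # torch.Size([1, 1])
```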

In practice: centroid inference produced worse Hit@1 than direct prediction on every test configuration. The retrieval step introduced irreducible error. Not recommended.

Lesson: centroid inference loses to direct classification at this vocabulary size.
[Figure: centroid vs medoid inference, Hit@1 comparison]
NB-23 · Centroid vs medoid. Both retrieval strategies underperform direct classification across all configurations.
[Figure: Hit@1 by strategy: centroid, medoid, direct classification]
NB-23 · Strategy comparison. Direct classification wins. Retrieval adds irreducible error in every test.
NB-24 Class-Weighted Cross-Entropy Loss

Five variants of class-weighted cross-entropy loss were tested: inverse frequency, sqrt-weighted, sqrt with a frequency cap, no focal, and source-based weights. All five performed worse than the 38.2% baseline.

The root cause: the existing focal loss (γ=2) already handles class rebalancing adaptively, so adding static class weights created a double penalty for rare classes. Cap Hit@1 collapsed from 40.5% to 10–13% across all variants. Tool Hit@1 improved by 2pp, but the cap regression was catastrophic.
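The interaction is easy to see in code. A minimal focal-loss sketch (γ=2, not the project's exact loss) with optional static class weights shows how the two penalties multiply:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Focal loss; optional static class weights stack multiplicatively."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    loss = -((1 - pt) ** gamma) * log_pt      # adaptive rebalancing, built in
    if class_weights is not None:
        loss = loss * class_weights[targets]  # second, static penalty on top
    return loss.mean()

# A rare class already gets a large focal term from its low p_t; an
# inverse-frequency weight multiplies that term again. That double penalty
# is what collapsed cap Hit@1 across NB-24's five variants.
logits = torch.randn(4, 10)
targets = torch.tensor([0, 1, 2, 3])
inv_freq = torch.linspace(1.0, 10.0, 10)  # illustrative static weights
print(focal_loss(logits, targets).item(),
      focal_loss(logits, targets, class_weights=inv_freq).item())
```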

Lesson: focal loss with γ=2 is already adaptive rebalancing. Do not stack static weights on top.
[Figure: class-weight distributions across the 5 variants: inverse, sqrt, sqrt+cap, no-focal, source]
NB-24 · Weight distributions. All 5 class-weight variants produced worse results than baseline. Static weights fight focal loss.
[Figure: frequency vs assigned weight scatter]
NB-24 · Freq vs weight. Static weights create a double penalty for rare classes. Cap Hit@1 collapsed from 40.5% to 10–13%.