casys/engine · deep dive · 5 notebooks

Nodes All The Way Down

Unified vocabulary: leaf nodes AND parent nodes predicted by the same model.

NB-11 · NB-13 · NB-17 · NB-21 · NB-22

The Vocabulary Problem

A naive GRU predicts from a flat list of 920 leaf nodes. But execution traces contain more than leaf nodes: they contain parent nodes — named sequences of leaf nodes that recur across workflows. Ignoring parent nodes means the model never learns to predict "I need a database operation" before predicting the specific leaf node.
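To make this concrete, here is a minimal sketch (Python is assumed throughout) of a unified vocabulary: one contiguous id space covering leaf nodes and parent nodes, so a single softmax head scores both. The leaf names are hypothetical; the parent names appear elsewhere on this page.

```python
# Minimal sketch of a unified vocabulary. Leaf names are hypothetical;
# std:psql_query and db:postgresQuery are parent nodes named on this page.

leaf_nodes = ["db:connect", "db:runQuery", "fs:readFile"]   # 920 in NB-13
parent_nodes = ["std:psql_query", "db:postgresQuery"]       # 245 in NB-13

# One contiguous id space: leaves occupy the front, parents the tail.
vocab = {name: i for i, name in enumerate(leaf_nodes + parent_nodes)}

def is_parent(idx: int) -> bool:
    """Parent nodes sit in the tail of the unified id space."""
    return idx >= len(leaf_nodes)

# A trace can now target both granularities in one sequence: the model
# first predicts "a database operation" (parent), then the specific leaf.
trace = ["std:psql_query", "db:runQuery"]
targets = [vocab[t] for t in trace]   # ids the GRU softmax must rank
```

No architectural change is needed: the GRU's output layer simply grows from 920 entries to len(vocab), and `is_parent` is a constant-time check on the predicted id.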

Five notebooks trace the discovery that leaf nodes and parent nodes must share a single vocabulary, and the consequences of that decision.

NB-11 · Parent-as-Terminal. 918 leaf nodes + 326 parent nodes = 1,244 nodes. Parent node Hit@1: 0% (bug in predictNext).
NB-13 · Canon + FQDN fix. 920 leaf nodes + 245 parent nodes = 1,165 nodes. Parent node Hit@1: 41.6% (after bug fix).
NB-21 · Frequency capping (MAX=30, FPS). 2,584 → 1,769 training examples. Parent node Hit@1: 82.3%.
NB-11 · Parent node distribution (examples per parent node, log scale, with cumulative coverage). db:postgresQuery dominates with 952 traces. 47% of parent nodes have exactly 1 example, an extreme imbalance that makes direct classification hard.
NB-13 · Argmax landing (histogram of correct-rank distribution). 40.4% of model predictions land on parent nodes. Median correct rank = 34 without a unified vocabulary.

The Canonicalization Breakthrough

The softmax had a hidden enemy: 28 duplicate leaf-node groups, each generating a separate parent node entry. A parent node named std:cap_rename and another named code:cap_rename were the same logical operation split across two vocabulary slots.

Canonicalization at training time — without touching the database — collapsed 71 duplicate parent nodes to 240 canonical entries. The softmax signal concentrated. Parent node Hit@1 jumped from 40.6% to 66.3%, a +25.7pp gain from deduplication alone.

Key insight
Do not merge duplicates in the database. Canonicalize at training time. The database stores truth; the model trains on a view.
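A minimal sketch of "the model trains on a view", assuming duplicates are recorded in a plain alias table. `CANONICAL`, `canonicalize`, and `training_view` are hypothetical names; the alias pair is the std:cap_rename / code:cap_rename example from above.

```python
# Hypothetical alias table: maps each duplicate parent node name to its
# canonical form. The database keeps both rows; only training sees this.

CANONICAL = {
    "code:cap_rename": "std:cap_rename",
    # ...one entry per duplicate parent node group found in NB-13
}

def canonicalize(name: str) -> str:
    """Map an alias to its canonical parent node name (identity otherwise)."""
    return CANONICAL.get(name, name)

def training_view(traces):
    """Yield traces with duplicate parent nodes collapsed. The database
    is untouched: this view exists only for vocab building and training."""
    for trace in traces:
        yield [canonicalize(node) for node in trace]
```

Building the softmax vocabulary from this view is what concentrates the signal: probability mass that was split across two slots now lands on one.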
NB-11 · Embedding alignment (leaf node vs parent node similarity distributions, per-dimension alignment). Leaf nodes and parent nodes share embedding space with corr = 0.98+. They are geometrically compatible.
NB-11 · Parent node ambiguity (parent nodes per leaf-node set; cosine similarity of ambiguous pairs). 36 ambiguous leaf-node sets. 100% of pairs have similarity > 0.95: true duplicates that confuse the softmax.

Frequency Capping with FPS

Some parent nodes dominate the training set: std:psql_query appears in thousands of traces. Without capping, the model learns to predict the most frequent nodes regardless of context.

NB-21 evaluated two capping strategies: random subsampling versus Farthest Point Sampling (FPS). FPS selects the most diverse subset by maximizing distance in intent embedding space, producing a training set with mean pairwise similarity 0.62 vs 0.68 for random — better coverage of the intent distribution with the same budget.
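A sketch of what capping-with-FPS can look like, assuming intent embeddings as NumPy arrays and Euclidean distance; `fps_subset` and `cap_examples` are our names, not NB-21's API, and the default cap matches the MAX=30 setting above.

```python
import numpy as np

def fps_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy Farthest Point Sampling: pick k rows that maximize
    pairwise distance, i.e. the most diverse subset."""
    n = len(embeddings)
    if n <= k:
        return list(range(n))
    chosen = [0]                                   # arbitrary seed point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())                  # farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return chosen

def cap_examples(examples_by_parent: dict, embed, max_per_cap: int = 30):
    """Keep at most max_per_cap examples per parent node, chosen by FPS
    over their intent embeddings."""
    kept = []
    for parent, examples in examples_by_parent.items():
        vecs = np.stack([embed(e) for e in examples])
        kept.extend(examples[i] for i in fps_subset(vecs, max_per_cap))
    return kept
```

Greedy FPS costs O(n·k) distance evaluations per parent node, which is cheap at these sizes; unlike random subsampling, it keeps near-duplicate traces only after every distinct region of the intent space is covered.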

NB-22 revealed the cost: MAX_PER_CAP=30 raises parent node Hit@1 to 82.3% but drops leaf node Hit@1 from 49.3% to 37.2%. The capping was too aggressive. The search for the right cap value continues.

NB-21 · Gini coefficient (Lorenz curve; top-10 parent nodes before and after the MAX=30 cap). Frequency capping reduces imbalance from 0.754 to 0.544. FPS maintains intent coverage.
NB-22 · Vocab composition (stacked bar). 1,165 active nodes; 84 excluded: 40 canonical duplicates, 24 genuinely missing, 20 test/fake.
NB-17 · Vocab sizes (five filter scenarios, from 920 leaf-nodes-only to 1,257 all-parent-nodes). Unique + ≥2 examples (1,038 nodes) is the recommended configuration: best coverage without noise.
NB-10 · t-SNE by parent node (projection after SHGAT enrichment). SHGAT-enriched embeddings cluster by parent node. Leaf nodes that co-execute share embedding neighborhoods.