casys/engine · research

Scoring Tool Relevance Without an LLM

How a 258K-parameter pipeline outperforms embedding similarity for next-node prediction in agentic systems.

920 leaf nodes indexed
258K parameters
<5ms inference
24 Jupyter notebooks · February 2026
01 The Graph

920 Tools, Hubs, and a Long Tail

When an LLM agent has 920 tools available, how does it choose the right one? The standard approach is embedding similarity: embed the user's intent, embed each tool description, find the nearest vectors.
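A minimal sketch of that baseline, assuming the intent and the tool descriptions have already been embedded (the 1024-dim vectors come from later sections; function and variable names here are illustrative):

```python
import numpy as np

def rank_tools_by_similarity(intent_emb: np.ndarray, tool_embs: np.ndarray, k: int = 5):
    """Rank tools by cosine similarity between the intent and each tool description.

    intent_emb: (d,) embedding of the user's intent
    tool_embs:  (n_tools, d) embeddings of the tool descriptions
    """
    intent = intent_emb / np.linalg.norm(intent_emb)
    tools = tool_embs / np.linalg.norm(tool_embs, axis=1, keepdims=True)
    scores = tools @ intent               # cosine similarity per tool
    top = np.argsort(-scores)[:k]         # indices of the k nearest tools
    return top, scores[top]
```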

This works for 10 tools. At 920, the graph structure matters. Tools form hierarchies and sequences that pure embeddings cannot capture. The co-execution graph has:

  • L0 · Tools: 920 leaf nodes
  • L1 · Capabilities: 245 parent nodes
  • L2 · Meta: compositions

875 edges connect them. Average node degree: 4.7 — a sparse graph where hub parent nodes aggregate many leaf nodes, and the long tail of single-use leaf nodes needs a different treatment than popular hubs.
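Concretely, the graph is just levelled nodes plus parent and co-execution edges. A toy sketch of that structure (the capability name bom_generation is hypothetical; the two tool names are borrowed from Section 02):

```python
from collections import defaultdict

# Levels: 0 = tool (leaf), 1 = capability (parent), 2 = meta composition.
levels = {"query_parts_db": 0, "generate_bom": 0, "bom_generation": 1}
edges = [
    ("query_parts_db", "bom_generation"),   # tool -> parent capability
    ("generate_bom", "bom_generation"),
    ("query_parts_db", "generate_bom"),     # co-execution edge between tools
]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Hubs are the high-degree parents; the long tail is everything with degree 1-3.
long_tail = [n for n, d in degree.items() if d <= 3]
```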

Degree distribution · 920-node co-execution graph. A few hub parent nodes connect to 10+ leaf nodes; most leaf nodes appear in 1–3 workflows. This skew drives the adaptive residual design in Section 04.
Deep dive: Nodes All The Way Down (NB-11, 13, 17, 21, 22 · 10 plots)
02 Raw ≠ Ready

Chaos vs. Structure

Raw tool embeddings cluster by lexical similarity: tools with similar descriptions sit near each other, regardless of how they are actually used together. Two PostgreSQL tools look identical; generate_bom and query_parts_db look distant despite always executing in sequence.

After SHGAT enrichment, the geometry changes. Tools that co-execute move together. Tools that serve different contexts — even with similar names — are pushed apart.

t-SNE projection · 920 tool embeddings. Left: raw 1024-dim vectors — tools cluster by lexical similarity. Right: after SHGAT enrichment — tools that co-execute cluster together.
The structure is learned from behavior, not descriptions. Tools that work together move together in embedding space.
Same embeddings, colored by capability. After SHGAT, tools naturally cluster by functional domain — database tools, file operations, deployment tools form distinct neighborhoods.
Deep dive: Data Quality Odyssey (NB-05, 09, 19, 20 · 8 plots)
03 Message Passing

Smoothing Kills. Contrastive Discriminates.

Standard message passing averages a node's embedding with its neighbors. On a tool graph, this is lethal: a tool that rarely co-executes with others gets washed into its neighborhood and loses its identity.

The key insight from NB-01: contrastive loss during message passing pushes siblings apart while pulling co-executors together. Smoothing creates uniformity. Contrastive creates discrimination.

NB-01 toy problem · message passing variants. Smoothing (left): embeddings collapse toward neighborhood mean, losing distinctiveness. Contrastive (right): siblings are pushed apart, co-executors attracted — correct geometry.
Smoothing message passing kills discrimination. Contrastive message passing pushes siblings apart.

Production SHGAT uses InfoNCE contrastive loss with PER (Prioritized Experience Replay) and curriculum scheduling: easy pairs first, hard negatives after the model has learned the basic structure.
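A minimal sketch of one InfoNCE step over node embeddings, with a co-executor as the positive and siblings as hard negatives. The actual SHGAT loss, PER sampling, and curriculum schedule live in the notebooks, so treat the shapes and temperature here as assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Pull the anchor toward its co-executor, push it away from sibling nodes.

    anchor:    (d,)   embedding of the current node
    positive:  (d,)   embedding of a node it co-executes with
    negatives: (k, d) embeddings of sibling nodes (hard negatives)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor @ positive).unsqueeze(0) / temperature   # (1,)
    neg_logits = (negatives @ anchor) / temperature              # (k,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)     # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```

Curriculum scheduling then decides which negatives are allowed into this batch: random pairs early, same-parent siblings once the basic structure has been learned.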

04 Residual

How Much to Keep of the Original Signal?

After message passing, how much of a node's original embedding should survive? Leaf nodes (few connections) should keep most of their identity. Hub parent nodes (many connections) should blend more aggressively with their neighborhood.

The balance is learned, not hand-tuned. Our residual formula:

E_new[c] = ELU(Σ α · H′) + γ(n_c) · E[c], where γ(n) = σ(a · log(n + 1) + b) with a and b the two learnable parameters.
Learned γ(n) function. The model discovered an adaptive strategy: leaf nodes with 1–3 children retain ~80% of their original embedding, while hub parent nodes with 10+ children blend down to ~40%.
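Read as code, the residual gate is tiny: two scalars a and b shared across every node, with the blend driven only by the child count. A sketch in PyTorch, assuming the attention output Σ α · H′ is computed upstream (initial values are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveResidual(nn.Module):
    """gamma(n) = sigmoid(a * log(n + 1) + b): how much of its original embedding
    a node keeps after message passing, as a function of its number of children n."""

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.0))   # illustrative init
        self.b = nn.Parameter(torch.tensor(1.0))

    def forward(self, aggregated: torch.Tensor, original: torch.Tensor,
                n_children: torch.Tensor) -> torch.Tensor:
        # aggregated: (n_nodes, d) attention-weighted message, i.e. the sum of alpha * H'
        # original:   (n_nodes, d) pre-message-passing embedding E[c]
        # n_children: (n_nodes,)   child count per node
        gamma = torch.sigmoid(self.a * torch.log(n_children.float() + 1.0) + self.b)
        return F.elu(aggregated) + gamma.unsqueeze(-1) * original
```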
43.4% without residual
66.1% Hit@1 with residual
+22.7pp
Residual weight sweep across training runs, showing the optimal range.
PCA: no residual · fixed · adaptive γ

Training: 1,876 capability-tool pairs · InfoNCE contrastive loss · PER + curriculum · 30 epochs · 7.35M params · ~4 min on CPU

Deep dive: Two Parameters, +22.7pp (NB-12, 14, 15, 16 · 10 plots)
05 Sequence

Execution Traces as Directed Sequences

With enriched embeddings from SHGAT, we need a model that predicts which node comes next in a sequence. Execution traces — the ordered list of nodes an agent called to fulfill an intent — are the training signal.

The GRU (Gated Recurrent Unit) is deliberately tiny: 258,000 trainable parameters.

Five inputs at each step:

  1. SHGAT-enriched embedding of the current node (1024→64 projection)
  2. Intent embedding (what the user asked for)
  3. Hierarchy level (L0 / L1 / L2)
  4. Positional encoding (sequence position)
  5. Edge features (from the graph)

The hidden state (64-dim GRU) captures execution context. At each step, it predicts over the full vocabulary: 1,165 nodes = 920 leaf nodes + 245 parent nodes.
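A sketch of that predictor; the 1024→64 projection, the 64-dim hidden state, and the 1,165-node vocabulary are as stated above, while the remaining feature widths are assumptions:

```python
import torch
import torch.nn as nn

class NextNodeGRU(nn.Module):
    """Tiny GRU over execution traces: at each step, score every node in the vocabulary."""

    def __init__(self, vocab_size: int = 1165, intent_dim: int = 64,
                 level_dim: int = 4, pos_dim: int = 8, edge_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.node_proj = nn.Linear(1024, 64)        # SHGAT-enriched 1024-dim -> 64
        step_dim = 64 + intent_dim + level_dim + pos_dim + edge_dim
        self.gru = nn.GRU(step_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)   # logits over 920 tools + 245 parents

    def forward(self, node_emb, intent, level, pos, edge):
        # node_emb: (B, T, 1024); the other inputs are per-step feature tensors (B, T, *)
        x = torch.cat([self.node_proj(node_emb), intent, level, pos, edge], dim=-1)
        h, _ = self.gru(x)        # (B, T, hidden): execution context
        return self.head(h)       # (B, T, vocab_size): next-node logits
```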

Why Not an LLM?

  • Latency: GRU <1ms · LLM 500ms+
  • Cost: GRU $0 (local) · LLM per-call
  • Signal: GRU trains on execution traces · LLM works from descriptions
  • Inspection: GRU beam search is inspectable · LLM is a black box
Execution trace flows · 2,571 traces. Each path from intent to terminal node is a training sequence for the GRU. Width encodes frequency; color encodes hierarchy level.

Component Results

37.2% Tool Hit@1
82.3% Capability Hit@1
89.9% Terminal Hit@1

Training: 2,571 execution traces · frequency capping (30/cap, FPS) · early stopping at epoch 48 · ~3 min on CPU

Deep dive: What Didn't Work (NB-03, 04, 18, 23, 24 · 5 failed approaches)
06 End-to-End Results

The Full Pipeline

The real test is end-to-end: given an intent, predict the full node sequence using beam search (width 5) with length normalization.
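A sketch of that decoding loop; `step_log_probs` stands in for a call into the GRU above and is an assumption, while the width-5 beam and length normalization are as stated:

```python
def beam_search(step_log_probs, start_node, width: int = 5, max_len: int = 8):
    """Length-normalized beam search over node sequences.

    step_log_probs(seq) -> iterable of (next_node, log_prob, is_terminal).
    Hypotheses are ranked by total log-probability divided by sequence length,
    so longer multi-node sequences are not unfairly penalized.
    """
    beams = [([start_node], 0.0)]          # (sequence, summed log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for node, logp, terminal in step_log_probs(seq):
                hyp = (seq + [node], score + logp)
                (finished if terminal else candidates).append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda h: h[1] / len(h[0]), reverse=True)
        beams = candidates[:width]         # keep the `width` best open hypotheses

    finished.extend(beams)
    return max(finished, key=lambda h: h[1] / len(h[0]))
```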

  • GRU alone (raw embeddings): 64.6% first-N accuracy (baseline)
  • GRU + SHGAT enrichment: 70.8% first-N accuracy (+6.2pp)
70.8% E2E beam accuracy

For 7 out of 10 user intents, the predicted node sequence starts with the correct nodes.

The entire SHGAT + GRU pipeline completes inference in under 5ms. No GPU. No API dependency. 258K parameters running on any device.

E2E benchmark. Beam search (width 5) with length normalization · 2,571 execution traces.
  • SHGAT-TF: Hit@1 66.1% · MRR 0.68 · 7.35M params
  • GRU: global Hit@1 57.6% · capability Hit@1 82.3% · 258K params
  • E2E pipeline: SHGAT contribution +6.2pp (64.6% → 70.8%)

What Remains Open

  • Vocabulary growth — adding leaf nodes requires retraining. No online learning yet.
  • Cold start — new leaf nodes with zero traces get no structural benefit.
  • Cap-frequency tradeoff — aggressive capping helps parent node prediction but hurts leaf node prediction.
  • Canonicalization — 28 duplicate toolset groups dilute softmax (+25.7pp when canonicalized, not yet in prod).
Deep Dives

24 Notebooks of Evidence

Each design decision above is backed by ablation studies, visualizations, and failure analysis. Five deep dive tracks cover the full research arc.

Based on 24 Jupyter notebooks of experiments, January–February 2026.