CSV with columns question (required), ground_truth, subject, difficulty, qid (optional). If ground_truth is present we skip generation; otherwise step 2 generates it with a frontier model.
2 — generate ground truth demo mode — cached
·
qid
question
subject
ground truth
3 — evaluate cascade routers on this data
Sweeps thresholds for each selected router. Each cascade decision is scored against your ground truth by an LLM-as-judge (gpt-4o-mini). Output is a cost-vs-accuracy frontier — you pick your operating point.
Auto-mode samples each router's own observed score distribution. A fixed list like 0.2…0.7 misses
where a router's scores actually live and collapses the Pareto chart to a single point.
Row limit
cap how many questions to evaluate (blank = all ).
Partial runs aren't saved to disk.