Skillforest Kiosk

Agent Skill Evaluation Arena

Trajectory-first benchmarking • self-contained skill packages • live leaderboard demo
Leaderboard Entries: 2
Top Composite: 0.988
Latest Run: run-supabase-probe-1771761920
Runs (total): 7
Composite Score Curve
Rank-ordered leaderboard scores
Point 1: 0.988 • Point 2: 0.988
Composite by rank
Top Entry Score Shape
Outcome / trajectory / efficiency / reliability / safety
Outcome: 1.000 • Trajectory: 0.950 • Efficiency: 1.000 • Reliability: 1.000 • Safety: 1.000
Top entry
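
The page does not state how the five component scores combine into the composite. As a minimal sketch, assuming the composite is a weighted mean of the components, the snippet below shows one weighting that happens to reproduce the displayed 0.988 for the top entry. The weights, the `WEIGHTS` name, and the `composite_score` function are hypothetical illustrations, not the arena's actual formula.

```python
# Hypothetical weights, chosen only so the result matches the displayed
# composite of 0.988. The real weighting is not published on this page.
WEIGHTS = {
    "outcome": 0.40,
    "trajectory": 0.24,
    "efficiency": 0.12,
    "reliability": 0.12,
    "safety": 0.12,
}


def composite_score(components, weights=WEIGHTS):
    """Return the weighted mean of the component scores (assumed formula)."""
    if set(components) != set(weights):
        raise ValueError("components and weights must cover the same keys")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * components[name] for name in weights)
    return weighted_sum / total_weight


# Component scores for the top entry, as shown on the dashboard.
top_entry = {
    "outcome": 1.000,
    "trajectory": 0.950,
    "efficiency": 1.000,
    "reliability": 1.000,
    "safety": 1.000,
}

print(f"{composite_score(top_entry):.3f}")  # prints 0.988
```

Under this assumption, only the trajectory score (0.950) falls short of 1.000, so the composite is 1 minus 0.050 times the trajectory weight. Any weighting that gives trajectory a normalized weight of 0.24 yields the same 0.988, so the displayed numbers alone do not pin down the other four weights.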
Judge Flow
1) Trigger run on dashboard → 2) Compare scores → 3) Replay trajectory → 4) Show sandbox + persistence ops