Traditional software has tests. AI systems need evals.
The difference: tests check exact behavior ("2 + 2 must equal 4"), so the result is either right or wrong. Evals score quality on a spectrum ("the answer should be helpful, accurate, and concise; score it 0 to 1").
Tests: assertEqual(add(2, 2), 4) → PASS or FAIL
Evals: scoreQuality(llm_response) → 0.0 to 1.0
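
To make the contrast concrete, here is a minimal Python sketch. The test half uses the standard `unittest` module. The eval half is a toy: `score_quality`, its `required_terms` and `max_words` parameters, and the 0.7/0.3 weights are all assumptions made up for illustration, standing in for whatever grader is actually used. It rewards covering expected terms and penalizes excess length, which only loosely approximates "accurate" and "concise".

```python
import unittest


def add(a: int, b: int) -> int:
    return a + b


class AddTest(unittest.TestCase):
    """A test: one exact expected value, so the outcome is PASS or FAIL."""

    def test_add(self) -> None:
        self.assertEqual(add(2, 2), 4)


def score_quality(response: str, required_terms: list[str], max_words: int = 50) -> float:
    """An eval (hypothetical toy scorer): returns a score from 0.0 to 1.0.

    Assumed heuristic, not a standard metric:
      - coverage: fraction of required_terms found in the response
      - concision: 1.0 up to max_words, then shrinking as the response grows
    """
    words = response.split()
    if not words:
        return 0.0  # an empty response earns nothing

    text = response.lower()
    if required_terms:
        hits = sum(term.lower() in text for term in required_terms)
        coverage = hits / len(required_terms)
    else:
        coverage = 1.0  # nothing required, so nothing is missing

    concision = min(1.0, max_words / len(words))

    # Arbitrary illustrative weights; they sum to 1 so the score stays in [0, 1].
    return round(0.7 * coverage + 0.3 * concision, 3)
```

With these assumptions, `score_quality("Paris is the capital of France.", ["paris", "france"])` gives 1.0, while a short answer that mentions neither term, such as `"I am not sure."`, gives 0.3. The point is the shape of the result: the test can only pass or fail, while the eval can land anywhere in between.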