The demo is the easy 80%. The product is the boring 20% that decides whether anyone keeps using it after the novelty wears off.
Retrieval over fine-tuning
For most features, getting the right context in front of the model beats teaching the model new facts. Retrieval is cheaper to build, easier to update, and far easier to debug when it goes wrong.
Evals before features
Without an eval set you are tuning prompts by vibes. A few dozen graded examples turn "it feels better" into a number you can move, and stop regressions you would otherwise ship blind.
A fast loop
The team that can change a prompt, run its evals, and see the delta in minutes will out-ship the team with the fancier model. Optimise the loop first; everything else compounds on top of it.
