Let's get that out of the way immediately. If you're a principal engineer, your job isn't to fiddle with hyperparameters at 2am. Your job is to make sure the entire ML system — from data ingestion to model serving to monitoring — works reliably at scale, and that the team building it doesn't burn out or burn money.
The gap between "we trained a model in a notebook" and "this model serves 10M requests/day with p99 < 50ms" is where you live.