While developing machine learning (ML) algorithms in production environments, we usually optimize a function or a loss that has nothing to do with our business goals. We generally care about metrics such as click-through-rate or diversity or ad-revenue but our machine learning algorithms often minimize log-loss or root mean squared error (RMSE) of arbitrary quantities for computational ease. It is often seen that these ML metrics don’t correlate with the original targets, which is why running online A/B tests is so vital in production because it lets us verify our ML models against the true objectives of a task.
However, running A/B tests is expensive because it requires some productionized version of an experiment, which needs to run for a significant amount of time to get reliable results during which you risk potentially exposing a poorer system to the users. This is why reliable offline evaluation is absolutely critical – it encourages more experimentations in a sandbox environment as well as help in gaining intuitions on what is worth launching/testing in the wild.