Author(s)

Bart Van Parys

We study the problem of designing optimal learning and decision-making formulations when only historical data is available. Prior work typically commits to a particular class of data-driven formulations and subsequently tries to establish out-of-sample performance guarantees. Following Van Parys et al. (From data to decisions: Distributionally robust optimization is optimal. Management Science 2020), we take the opposite approach here. We first define a sensible yardstick with which to measure the quality of any data-driven formulation and subsequently seek an “optimal” such formulation. Informally, any data-driven formulation can be seen to balance the proximity of the estimated cost to the actual cost against a guaranteed level of out-of-sample performance. Given an acceptable level of out-of-sample performance, we explicitly construct a data-driven formulation that is uniformly closer to the true cost than any other formulation enjoying the same out-of-sample performance. We show the existence of three distinct out-of-sample performance regimes: a superexponential regime, an exponential regime, and a subexponential regime. The optimal data-driven formulations can be interpreted as a classically robust formulation in the superexponential regime, an entropic distributionally robust formulation in the exponential regime, and a variance-penalized formulation in the subexponential regime. This final observation unveils a surprising connection, hidden until now, between three data-driven formulations that at first glance seem unrelated.
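To make the three formulations named above concrete, the following is a minimal sketch of how each could be evaluated on a finite sample of observed costs. It is an illustration under stated assumptions, not the construction from the paper: the function names are hypothetical, the entropic ambiguity set is taken to be the ball {P : KL(P || P_hat) <= r} around the empirical distribution (the paper may use the reverse divergence direction), the variance penalty is taken to be sqrt(2 r Var), and the robust formulation is taken over a finite list of costs on an assumed known support.

```python
import math


def empirical_mean(costs):
    """Average of the observed costs (the plain sample-average estimate)."""
    return sum(costs) / len(costs)


def empirical_variance(costs):
    """Population-style variance of the observed costs (divides by n)."""
    mu = empirical_mean(costs)
    return sum((c - mu) ** 2 for c in costs) / len(costs)


def robust_cost(support_costs):
    """Classically robust estimate: worst cost over an assumed finite support."""
    return max(support_costs)


def variance_penalized_cost(costs, r):
    """Mean plus a standard-deviation penalty.

    The scaling sqrt(2 * r * variance) is an assumption chosen to match the
    small-radius behaviour of the entropic estimate below.
    """
    return empirical_mean(costs) + math.sqrt(2.0 * r * empirical_variance(costs))


def entropic_dro_cost(costs, r, log_lam_lo=-10.0, log_lam_hi=12.0, iters=200):
    """Worst-case expected cost over {P : KL(P || P_hat) <= r}.

    Uses the standard dual  inf_{lam > 0} lam * r + lam * log E_hat[exp(c / lam)],
    minimized by ternary search over log(lam) on a fixed, assumed search range.
    """
    c_max = max(costs)
    n = len(costs)

    def dual(log_lam):
        lam = math.exp(log_lam)
        # Log-mean-exp shifted by the largest cost for numerical stability.
        shifted = sum(math.exp((c - c_max) / lam) for c in costs) / n
        return lam * r + c_max + lam * math.log(shifted)

    lo, hi = log_lam_lo, log_lam_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            hi = m2
        else:
            lo = m1
    return dual(0.5 * (lo + hi))
```

In this sketch, calling `entropic_dro_cost(costs, r)` with a very small radius `r` gives a value close to `variance_penalized_cost(costs, r)`, while a large radius drives it toward `robust_cost(costs)`. This is only a numerical illustration of how the three formulations can relate; the precise regimes and constants are those established in the paper.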

Date Published: 2025

Citations: Bennouna, Amine, and Bart Van Parys. 2025. Learning and decision-making with data: optimal formulations and phase transitions. Mathematical Programming.