Carlo Lucibello, Bocconi University
Memorization and generalization in generative diffusion under the manifold hypothesis
We study the memorization and generalization capabilities of a generative diffusion model in the case of structured data defined on a latent manifold. Specifically, we consider a set of P data points in N dimensions lying on a latent subspace of dimension D, generated according to the hidden manifold model. Our analysis takes the generative reverse process driven by the empirical score function as a proxy for the true one, and precisely characterizes this process in the high-dimensional limit by exploiting a connection with the random energy model (REM). We provide evidence for the existence of an onset time, at which traps appear in the time-varying potential, although they do not affect typical trajectories. We compute the size of the basins of attraction of these traps at any time. Moreover, we derive the collapse time, at which trajectories fall into the basin of one of the training points, implying memorization. We show that the curse of dimensionality is mitigated for highly structured data, i.e., the relevant dimension governing the time at which memorization happens is the intrinsic dimension D rather than the ambient dimension N.
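As a rough illustration of the objects named in the abstract, the sketch below draws P points in N dimensions from a D-dimensional latent space and evaluates the empirical score at a noisy point. It is a minimal sketch under assumptions not stated in the abstract: a variance-exploding forward process x_t = x_0 + sqrt(t) z, a tanh nonlinearity in the hidden manifold model, and the hypothetical helper names `hidden_manifold_data` and `empirical_score`.

```python
import numpy as np

def hidden_manifold_data(P, N, D, rng, nonlinearity=np.tanh):
    """Draw P points in N dimensions from a D-dimensional latent space.

    Assumed form: xi = nonlinearity(F z / sqrt(D)), with Gaussian F and z.
    The tanh nonlinearity is an illustrative choice.
    """
    F = rng.standard_normal((N, D))   # fixed feature matrix
    Z = rng.standard_normal((P, D))   # latent coordinates, one row per point
    return nonlinearity(Z @ F.T / np.sqrt(D))

def empirical_score(x, data, t):
    """Score of the Gaussian-smoothed empirical distribution at time t.

    Assumes x_t = x_0 + sqrt(t) * z, so p_t is a mixture of Gaussians
    centered on the training points with variance t.
    """
    diffs = data - x                                  # shape (P, N)
    logits = -np.sum(diffs**2, axis=1) / (2.0 * t)    # log-weights up to a constant
    logits -= logits.max()                            # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    return (weights @ diffs) / t, weights

rng = np.random.default_rng(0)
data = hidden_manifold_data(P=50, N=100, D=5, rng=rng)

# At small t, a point close to one training example is dominated by it:
# the softmax weights concentrate on a single index.
t = 1e-3
x = data[0] + np.sqrt(t) * rng.standard_normal(100)
score, weights = empirical_score(x, data, t)
```

The softmax weights over training points play the role of Boltzmann weights, which is where the analogy with the random energy model enters. When a single weight dominates, the score pulls the trajectory toward that training point.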