Stanford CS236: Deep Generative Models I 2023 I Lecture 12 - Energy Based Models

Published 2024-05-06

All Comments (2)
  • @CPTSMONSTER
    5:40 Contrastive divergence: the gradient of the log partition function wrt theta is easy to estimate if samples from the model can be drawn (see the contrastive-divergence sketch after this comment)
    8:25 Training energy-based models by maximum likelihood is feasible to the extent that model samples can be generated, e.g. via MCMC
    14:00? MCMC methods, detailed balance condition
    22:00? log x = x' term
    23:25? Comparing likelihoods of two points is easy for EBMs, since the partition function cancels in the ratio
    24:15 Very expensive to train EBMs: every training step requires a sample to be generated from the model, and generating a sample involves Langevin MCMC with ~1000 steps (see the Langevin sketch after this comment)
    37:30 Connection between KL and Fisher divergence: convolve both densities with Gaussian noise; the derivative of their KL divergence wrt the noise level is the Fisher divergence
    38:40 Score matching; the score is the gradient of the log-density wrt x, so x must be continuous
    47:10 Score matching derivation: integration by parts removes the dependence on the unknown score of p_data
    51:15? Equivalent to the Fisher divergence
    52:35 Interpretation of the loss function: the first term (squared score norm) makes data points stationary points of the log-likelihood, so small perturbations of a data point should not change the log-likelihood by much; the second term (trace of the Hessian) makes those stationary points local maxima rather than minima (see the score-matching sketch after this comment)
    55:30? Need one backward pass per input dimension to compute the trace of the Hessian
    56:20 Proved equivalence to the Fisher divergence; with infinite data the fit would recover the exact data distribution
    57:45 Fitting an EBM with a similar flavor to GANs: instead of contrasting data with samples from the model, contrast data with noise
    1:00:10 Instead of letting the discriminator be an arbitrary neural network, define it with the same form as the optimal discriminator: rather than feeding x into an arbitrary network, evaluate its likelihood under the model p_theta and under the noise distribution. Because of this pre-defined form, the optimal p_theta must match p_data. Parameterize p_theta with an EBM. (In a GAN, the discriminator itself would be parameterized by a neural network; see the NCE sketch after this comment.)
    1:03:00? Classifiers in noise contrastive estimation
    1:11:30 The loss function does not require sampling from the model during training, but sampling from the fitted EBM still requires Langevin MCMC steps
    1:19:00 GAN vs NCE: the generator is trained in a GAN, while the noise distribution in NCE is fixed but must have a tractable likelihood
    1:22:20 Noise contrastive estimation where the noise distribution is a flow learned adversarially
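
A minimal PyTorch sketch of the contrastive-divergence-style gradient mentioned at 5:40 and 24:15. The names energy_net, x_data, and x_model are illustrative assumptions: energy_net maps a batch of inputs to scalar energies, and x_model holds approximate model samples produced by MCMC.

    import torch

    def contrastive_divergence_loss(energy_net, x_data, x_model):
        # Maximum-likelihood gradient of an EBM:
        #   grad_theta log p_theta(x) = -grad_theta E_theta(x)
        #                               + E_{x'~p_theta}[grad_theta E_theta(x')]
        # With model samples x_model, the surrogate loss below has that gradient.
        # Detach the samples so gradients flow only through the energy network.
        e_data = energy_net(x_data).mean()
        e_model = energy_net(x_model.detach()).mean()
        return e_data - e_model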
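
The ~1000-step Langevin MCMC sampler referenced at 24:15 could look roughly like the sketch below; the step size and initialization are assumptions, not values from the lecture.

    import torch

    def langevin_sample(energy_net, x_init, n_steps=1000, step_size=1e-2):
        # Unadjusted Langevin dynamics:
        #   x_{t+1} = x_t - (step_size / 2) * grad_x E(x_t) + sqrt(step_size) * noise
        x = x_init.clone().detach()
        for _ in range(n_steps):
            x.requires_grad_(True)
            grad = torch.autograd.grad(energy_net(x).sum(), x)[0]
            x = (x - 0.5 * step_size * grad
                 + step_size ** 0.5 * torch.randn_like(x)).detach()
        return x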
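
A sketch of the exact score-matching objective discussed around 47:10-55:30, assuming x is a batch of flattened continuous inputs of shape [batch, d]; the per-dimension loop is the "one backward pass per dimension" cost of the Hessian trace noted at 55:30.

    import torch

    def score_matching_loss(energy_net, x):
        # J(theta) = E_pdata[ 0.5 * ||s_theta(x)||^2 + tr(grad_x s_theta(x)) ],
        # where s_theta(x) = grad_x log p_theta(x) = -grad_x E_theta(x).
        x = x.clone().detach().requires_grad_(True)
        score = -torch.autograd.grad(energy_net(x).sum(), x, create_graph=True)[0]
        sq_norm = 0.5 * (score ** 2).sum(dim=1)
        # Trace of the Hessian of log p_theta: one extra backward pass per dimension.
        trace = torch.zeros(x.shape[0], device=x.device)
        for i in range(x.shape[1]):
            grad_i = torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0]
            trace = trace + grad_i[:, i]
        return (sq_norm + trace).mean()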
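
A sketch of the noise contrastive estimation setup from 57:45-1:19:00: the classifier is fixed to the optimal-discriminator form, log Z is treated here as a learnable scalar (an assumption about the parameterization), and noise_dist stands in for any distribution with tractable sampling and log_prob, such as a torch.distributions object; all names are illustrative.

    import torch
    import torch.nn.functional as F

    def nce_loss(energy_net, log_z, x_data, noise_dist):
        # Discriminator fixed to D(x) = p_theta(x) / (p_theta(x) + p_noise(x)),
        # with log p_theta(x) = -E_theta(x) - log_z (log_z a learnable scalar).
        x_noise = noise_dist.sample((x_data.shape[0],))

        def logit(x):
            # log p_theta(x) - log p_noise(x); sigmoid of this is D(x).
            return -energy_net(x).squeeze(-1) - log_z - noise_dist.log_prob(x)

        ones = torch.ones(x_data.shape[0], device=x_data.device)
        zeros = torch.zeros(x_data.shape[0], device=x_data.device)
        # Binary cross-entropy: data labeled 1, noise labeled 0.
        return (F.binary_cross_entropy_with_logits(logit(x_data), ones)
                + F.binary_cross_entropy_with_logits(logit(x_noise), zeros))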