Melih Gokay Yigit
(gokay.yigit@metu.edu.tr)
- Why Discrete Latent Variables?
- Stochastic Optimization
- REINFORCE Method
- Variational Learning of Latent Variable Models
- Neural Variational Inference and Learning (NVIL)
- Towards Reparameterized, Continuous Relaxations
- Categorical Distributions and Gumbel-Softmax
- Combinatorial, Discrete Objects
- Summary and Conclusions
- References
Discrete latent variables are hidden variables within models that can take on a finite set of distinct values. These variables play a crucial role in decision-making processes and learning structures because they effectively capture the inherent discreteness present in various types of data and systems. This capability is particularly important in real-world data representations, where discrete attributes are often prevalent.
To illustrate the importance of discrete latent variables, consider several examples:
[Fig1]. DNA sequence data representation: genetic sequences are strings over the discrete alphabet {A, C, G, T}.
[Fig2]. Game state data representation: game levels and board positions take one of finitely many discrete configurations.
[Fig3]. A network representation of social relationships among the 34 individuals in the karate club studied by Zachary: the edges of such a graph are discrete, binary relations.
Beyond these examples, there are numerous additional data domains where the data is inherently discrete or exhibits discrete characteristics. These domains include text data, images, speech and audio, molecules, geographical data, market basket items, programming codes, healthcare records, financial transactions, e-commerce clickstream data, etc.
The natural and abundant presence of discrete attributes in these data domains shifts the focus toward using discrete latent variable models. These models allow for better capture of the underlying meanings and representations of the data. Moreover, when working with real-world data, continuous latent variable models often face challenges due to the discontinuity of the data. This often necessitates a shift towards discrete latent variables to more accurately model the data and avoid assumption failures inherent in continuous representations.
In summary, the widespread occurrence of discrete data in various domains highlights the necessity of discrete latent variable models. These models provide a more accurate and meaningful representation of the underlying data structures, making them indispensable for working with real-world data.
The term "stochastic optimization" refers to the process of minimizing or maximizing an objective function when it involves randomness. This non-deterministic optimization process can be useful in many cases, particularly when the data is too large to fit into GPU memory or too complex to be processed deterministically. Moreover, SGD (or mini-batch gradient descent) may reduce the chance of getting trapped in a poor-performing shallow local minimum, since the stochasticity provides better exploration of the optimization space.
In the VAE (Variational Autoencoder) context, this stochasticity is inherent in sampling from the latent space. The latent space is a high-dimensional space where the data is represented in a lower-dimensional form. The VAE model tries to learn the distribution of the data in this latent space, which also can be interpreted as learning the data manifold. Since the latent variables are sampled from a distribution, the optimization process becomes inherently stochastic.
Before diving into the details of optimization methods for discrete latent variable models, here is a brief overview of the VAE optimization process:
We model our data as a latent variable model
$$p(\mathbf{x}; \theta) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}; \theta)$$
where $\mathbf{z}$ is the latent variable.
We can pick a variational distribution $q(\mathbf{z}; \phi)$ to approximate the intractable posterior $p(\mathbf{z} \mid \mathbf{x}; \theta)$.
Roughly, the objective is to maximize the following function (maximizing the ELBO):
$$\mathcal{L}(\mathbf{x}; \theta, \phi) = \mathbb{E}_{q(\mathbf{z}; \phi)}[\log p(\mathbf{x}, \mathbf{z}; \theta) - \log q(\mathbf{z}; \phi)]$$
Now, we can consider the following objective in its generic form:
$$\mathbb{E}_{q(\mathbf{z}; \phi)}[f(\theta, \phi, \mathbf{z})]$$
Trying to get the gradients to maximize the ELBO objective above cannot always be directly achieved with the reparametrization trick. First, let us see if we can obtain the gradient with respect to $\theta$.
We can get the gradient of the expectation with respect to $\theta$ by moving the gradient inside the expectation (the sampling distribution $q(\mathbf{z}; \phi)$ does not depend on $\theta$):
$$\nabla_\theta \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\theta, \phi, \mathbf{z})] = \mathbb{E}_{q(\mathbf{z}; \phi)}[\nabla_\theta f(\theta, \phi, \mathbf{z})]$$
And we can approximate this expectation by Monte Carlo sampling:
$$\nabla_\theta \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\theta, \phi, \mathbf{z})] \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta f(\theta, \phi, \mathbf{z}^{(k)}), \quad \mathbf{z}^{(k)} \sim q(\mathbf{z}; \phi)$$
Hence, it is not a problem to optimize the objective with respect to $\theta$. The difficulty is the gradient with respect to $\phi$, since the expectation itself is taken over a distribution that depends on $\phi$; for continuous $\mathbf{z}$, the reparametrization trick handles this by expressing $\mathbf{z} = g(\epsilon; \phi)$ with $\epsilon$ drawn from a fixed distribution.
But, when any of the assumptions above fails (e.g. z is not continuous), directly using the reparametrization trick is not possible.
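To make this concrete, here is a minimal NumPy sketch of the reparametrized gradient estimator for a Gaussian $q$; the choice $f(z) = z^2$ and all constants are illustrative assumptions, and the chain rule is written out by hand instead of relying on autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 1.0   # parameters of q(z) = N(mu, sigma^2)
K = 100_000

# Reparametrization: z = mu + sigma * eps with eps ~ N(0, 1),
# so the randomness no longer depends on the parameters.
eps = rng.standard_normal(K)
z = mu + sigma * eps

# With f(z) = z**2, the chain rule gives df/dmu = 2*z * dz/dmu = 2*z.
grad_mu = np.mean(2 * z)

# Analytic check: E[z^2] = mu^2 + sigma^2, so d/dmu E[z^2] = 2*mu = 3.0.
print(grad_mu)
```

The Monte Carlo estimate lands very close to the analytic gradient because the pathwise estimator has low variance; this is exactly the property that is lost when $\mathbf{z}$ is discrete.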
In this case, one of the things that can be done is utilizing the REINFORCE method, explained in the next section.

REINFORCE is a Monte Carlo variation of a policy gradient method that is used to update a network's weights. It was introduced by Ronald J. Williams in 1992, and the name "REINFORCE" is an acronym for:
REward Increment = Non-negative Factor × Offset Reinforcement × Characteristic Eligibility
Its primary goal is to optimize the expected reward, but it can also be used in problems where discrete latent variables or discrete actions exist. We are going to use the REINFORCE method to optimize the following objective function:
$$\mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})]$$
Here, our policy is the distribution $q(\mathbf{z}; \phi)$ from which we sample, and the reward is the value $f(\mathbf{z})$.
But first, we need some mathematical tricks in the computation of the gradients. Normally, we want to calculate (or estimate):
$$\nabla_\phi \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})]$$
However, with this form, we cannot calculate the gradient directly (it is infeasible to evaluate the expectation over all possible $\mathbf{z}$ values). Furthermore, we cannot estimate the gradient by simply sampling $\mathbf{z}$ values from $q(\mathbf{z}; \phi)$, because the gradient is taken with respect to the parameters $\phi$ of the sampling distribution itself. So we manipulate the expression with the following steps.
Expand the expectation:
$$\nabla_\phi \sum_{\mathbf{z}} q(\mathbf{z}; \phi) f(\mathbf{z})$$
Take the derivative inside the sum:
$$\sum_{\mathbf{z}} f(\mathbf{z}) \nabla_\phi q(\mathbf{z}; \phi)$$
Introduce the factor $\frac{q(\mathbf{z}; \phi)}{q(\mathbf{z}; \phi)}$:
$$\sum_{\mathbf{z}} q(\mathbf{z}; \phi) f(\mathbf{z}) \frac{\nabla_\phi q(\mathbf{z}; \phi)}{q(\mathbf{z}; \phi)}$$
Here is the important part. We have the log-derivative rule:
$$\nabla_\phi \log q(\mathbf{z}; \phi) = \frac{\nabla_\phi q(\mathbf{z}; \phi)}{q(\mathbf{z}; \phi)}$$
Now, we can rewrite the equation as an expectation:
$$\nabla_\phi \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})] = \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z}) \nabla_\phi \log q(\mathbf{z}; \phi)]$$
Having made these manipulations, we can now estimate the gradient by sampling $\mathbf{z}$ values from $q(\mathbf{z}; \phi)$:
$$\nabla_\phi \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})] \approx \frac{1}{K} \sum_{k=1}^{K} f(\mathbf{z}^{(k)}) \nabla_\phi \log q(\mathbf{z}^{(k)}; \phi), \quad \mathbf{z}^{(k)} \sim q(\mathbf{z}; \phi)$$
We are able to sample $\mathbf{z}^{(k)}$ directly from $q(\mathbf{z}; \phi)$ because the gradient no longer needs to pass through the sampling operation itself.
This form can be interpreted as a gradient update, weighted in a way that maximizes the expected reward.
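The estimator above can be sketched in a few lines of NumPy for a Gaussian $q(z; \mu)$. The reward $f(z) = z^2$ and all constants here are illustrative assumptions; for a Gaussian, the score function is $\nabla_\mu \log q(z; \mu) = (z - \mu)/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 1.0   # q(z; mu) = N(mu, sigma^2)
K = 200_000

# Sampling is all we need -- no gradient flows through it.
z = rng.normal(mu, sigma, size=K)

# Score function of the Gaussian: grad_mu log q(z; mu) = (z - mu) / sigma^2.
score = (z - mu) / sigma**2

# REINFORCE estimate of grad_mu E[f(z)] with f(z) = z**2.
grad_mu = np.mean(z**2 * score)

# Analytic value: d/dmu (mu^2 + sigma^2) = 2*mu = 3.0.
print(grad_mu)
```

Note that `f` is only ever evaluated, never differentiated, which is why the same recipe works when $f$ is non-differentiable or $\mathbf{z}$ is discrete.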
Variational learning is a powerful technique in probabilistic modeling and Bayesian inference, providing a scalable alternative to traditional methods like Markov Chain Monte Carlo (MCMC). The core idea is to approximate complex, intractable posterior distributions $p(\mathbf{z}|\mathbf{x})$ with simpler, parameterized distributions $q(\mathbf{z}; \phi)$.
The variational learning approach typically optimizes the Evidence Lower Bound (ELBO), defined as:
$$\text{ELBO} = \mathbb{E}_{q(\mathbf{z}; \phi)}[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \phi)]$$
Maximizing the ELBO is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(q(\mathbf{z}; \phi) \,\|\, p(\mathbf{z}|\mathbf{x}))$, thereby making $q(\mathbf{z}; \phi)$ a close approximation of the true posterior.
While this framework works well for continuous latent variables, handling discrete latent variables introduces significant challenges. Discrete variables do not allow for the straightforward application of gradient-based optimization techniques due to their non-differentiable nature. This makes the direct computation of gradients of the ELBO with respect to the parameters $\phi$ infeasible.
The ELBO can be expressed as:
$$\mathcal{L}(\phi) = \mathbb{E}_{q(\mathbf{z}; \phi)}[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \phi)]$$
or
$$\mathcal{L}(\phi) = \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\phi, \mathbf{z})]$$
Here, the function inside the expectation, $f(\phi, \mathbf{z}) = \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \phi)$, depends on the variational parameters $\phi$ as well as on $\mathbf{z}$.
The REINFORCE rule for this objective function is:
$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{q(\mathbf{z}; \phi)}\left[f(\phi, \mathbf{z}) \nabla_\phi \log q(\mathbf{z}; \phi) + \nabla_\phi f(\phi, \mathbf{z})\right]$$
This rule is more generic since $f(\phi, \mathbf{z})$ is allowed to depend on $\phi$ directly (the earlier derivation is recovered when it does not), and it applies even when $f$ is not differentiable with respect to $\mathbf{z}$, which is exactly the situation with discrete latent variables.
The REINFORCE rule is viable for gradient estimation with discrete variables but suffers from high variance in gradient estimates, leading to slow convergence. The variance issue arises due to the stochastic nature of the samples used in Monte Carlo estimates. Each sample $\mathbf{z}^{(i)}$ can produce a very different value of the learning signal, for two main reasons:
- Inherent stochasticity: Discrete latent variables amplify the variance in gradient estimates.
- Log-likelihood ratio: The term $\log p(\mathbf{x}, \mathbf{z}^{(i)}) - \log q(\mathbf{z}^{(i)}; \phi)$ can vary widely, especially when $q(\mathbf{z}; \phi)$ is a poor approximation of $p(\mathbf{z}|\mathbf{x})$.
For example, in a Gaussian distribution parameterized by $\mu$ and $\sigma$, a sample can be written as
$$\mathbf{z} = \mu + \sigma \epsilon$$
with $\epsilon \sim \mathcal{N}(0, 1)$, so the randomness is isolated in $\epsilon$. Using the reparameterization trick reduces variance, leading to faster convergence:
$$\nabla_\phi \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}[\nabla_\phi f(\mu + \sigma \epsilon)]$$
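The variance gap can be checked numerically. The following NumPy sketch (assuming the same toy setup, $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$) compares per-sample variances of the score-function and reparameterized estimators of the same gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, K = 1.5, 1.0, 50_000

# One shared set of samples z ~ N(mu, sigma^2).
z = rng.normal(mu, sigma, size=K)

# Per-sample estimates of d/dmu E[z^2] (true value: 2*mu = 3.0).
reinforce = z**2 * (z - mu) / sigma**2   # score-function (REINFORCE)
reparam = 2 * z                          # pathwise (reparameterized)

# Both are unbiased, but the score-function estimator is far noisier.
print(reinforce.mean(), reparam.mean())
print(reinforce.var(), reparam.var())
```

Even on this one-dimensional toy problem the score-function estimator's variance is roughly an order of magnitude larger, which is what motivates the variance-reduction machinery of the next section.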
However, as previously discussed, the reparameterization trick cannot be applied to discrete latent variables. While REINFORCE offers a method for gradient estimation with discrete variables, its high variance presents challenges for efficient optimization. Strategies involving control variates will be discussed in the following section to address this issue.
Neural Variational Inference and Learning (NVIL) represents an advancement in variational inference, particularly tailored for belief networks with discrete latent variables. The high variance in gradient estimates is mitigated in NVIL through the use of control variates.
NVIL extends the traditional variational inference framework by employing neural networks to parameterize the variational distribution $q(\mathbf{z}|\mathbf{x}; \phi)$, so that inference is amortized across observations.
For discrete latent variables, the REINFORCE algorithm is often used to estimate gradients. As noted, this approach suffers from high variance due to the stochastic nature of the samples and the variability in the log-likelihood ratio. NVIL addresses these issues using control variates, significantly reducing variance and enhancing convergence.
Control variates are auxiliary terms that help in reducing the variance of an estimator. In the context of NVIL, these are employed to stabilize the gradient estimates of the ELBO.
- Baseline Subtraction: The variance of the gradient estimator can be reduced by subtracting a baseline $b(\mathbf{x})$ from the objective function:
$$\nabla_\phi \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \left(f(\mathbf{x}, \mathbf{z}^{(k)}) - b(\mathbf{x})\right) \nabla_\phi \log q(\mathbf{z}^{(k)}; \phi)$$
The baseline $b(\mathbf{x})$ leaves the expected gradient unchanged, since $\mathbb{E}_{q(\mathbf{z}; \phi)}[\nabla_\phi \log q(\mathbf{z}; \phi)] = 0$, but it can substantially reduce the variance.
- Learned Baselines: Instead of a fixed baseline, NVIL often uses a neural network to learn an adaptive baseline $b_\psi(\mathbf{x})$:
$$\nabla_\phi \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \left(f(\mathbf{x}, \mathbf{z}^{(k)}) - b_\psi(\mathbf{x})\right) \nabla_\phi \log q(\mathbf{z}^{(k)}; \phi)$$
The parameters $\psi$ are trained alongside the model, typically by minimizing the squared error between $b_\psi(\mathbf{x})$ and the observed learning signal.
- Control Variate Networks: NVIL can incorporate control variate networks, which predict components of the objective function that contribute to high variance. The control variate $c(z)$ is introduced to reduce variance:
$$\nabla_\phi \mathcal{L} = \mathbb{E}_{q(\mathbf{z}; \phi)}\left[\left(f(\mathbf{x}, \mathbf{z}) - c(\mathbf{z})\right) \nabla_\phi \log q(\mathbf{z}; \phi)\right] + \mathbb{E}_{q(\mathbf{z}; \phi)}\left[c(\mathbf{z}) \nabla_\phi \log q(\mathbf{z}; \phi)\right]$$
A well-chosen $c(z)$ is strongly correlated with $f(\mathbf{x}, \mathbf{z})$, so the first term has low variance, while the second (correction) term can be computed or estimated cheaply.
To illustrate the variance reduction, let's compare the REINFORCE rule with and without control variates. The REINFORCE gradient estimate is:
$$\nabla_\phi \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} f(\mathbf{x}, \mathbf{z}^{(k)}) \nabla_\phi \log q(\mathbf{z}^{(k)}; \phi)$$
In contrast, with a control variate $c(\mathbf{z})$ the estimate becomes:
$$\nabla_\phi \mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \left(f(\mathbf{x}, \mathbf{z}^{(k)}) - c(\mathbf{z}^{(k)})\right) \nabla_\phi \log q(\mathbf{z}^{(k)}; \phi)$$
The variance reduction is mathematically evident if $c(\mathbf{z})$ is positively correlated with $f(\mathbf{x}, \mathbf{z})$: the difference $f - c$ then fluctuates less than $f$ alone. Control variates help in reducing variance by effectively isolating the noise component of the gradient estimate. By introducing a term whose expectation is known (or cheap to estimate), they subtract out fluctuations that would otherwise contaminate the learning signal.
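Here is a minimal NumPy sketch of baseline subtraction, using the same toy setup as before ($f(z) = z^2$, Gaussian $q$) with a constant baseline $b = \mathbb{E}[f]$ estimated from the samples; NVIL's learned, input-dependent baselines generalize this idea:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, K = 1.5, 1.0, 50_000

z = rng.normal(mu, sigma, size=K)
score = (z - mu) / sigma**2
f = z**2

# Plain REINFORCE per-sample estimates of d/dmu E[f(z)].
plain = f * score

# Constant baseline b = E[f(z)], estimated from the same samples.
# Subtracting it is "free": E[score] = 0, so the mean is unchanged.
b = f.mean()
with_baseline = (f - b) * score

print(plain.mean(), with_baseline.mean())   # both near 2*mu = 3.0
print(plain.var(), with_baseline.var())     # baseline cuts the variance
```

The two estimators agree in expectation, but the baseline-subtracted one has visibly lower variance; an adaptive $b_\psi(\mathbf{x})$ that tracks the learning signal typically does better still.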
While NVIL provides significant improvements, it also has limitations:
- Complexity: The introduction of control variate networks and learned baselines increases the computational and implementation complexity. Training additional neural networks alongside the main model requires more resources.
- Hyperparameter Sensitivity: NVIL involves additional hyperparameters, such as those for the control variate networks, which can complicate the optimization process. Finding the right set of hyperparameters can be challenging and may require extensive tuning.
- Scalability: For very large and complex models, the effectiveness of control variates can diminish, as the approximation $c(z)$ might not capture all the variance, especially in high-dimensional spaces.
- Dependence on Quality of Control Variates: The effectiveness of variance reduction heavily relies on the quality of the control variate. If $c(z)$ poorly approximates the true expectation, the variance reduction may be minimal.
The Neural Variational Inference and Learning (NVIL) method significantly enhances the efficiency of variational learning for belief networks with discrete latent variables. By incorporating control variates, NVIL addresses the high variance in gradient estimates, a critical issue in traditional variational learning methods. This results in more stable and faster convergence, making NVIL a powerful approach for complex probabilistic modeling and Bayesian inference. However, the added complexity, hyperparameter sensitivity, and scalability issues necessitate the exploration of more robust and effective algorithms for learning discrete latent variables. In the following sections, we will investigate continuous relaxations of discrete random variables as a widely used alternative.
The term "continuous relaxation" refers to a technique for approximating discrete variables with continuous ones. Since we have discrete latent variables and cannot use the reparametrization trick directly, and the REINFORCE method exhibits high variance unless control variates are meticulously adjusted, continuous relaxation is a plausible choice that lets us utilize the reparametrization trick.
To get a continuous relaxation, a particular distribution is utilized for this purpose: the "Gumbel" distribution, which is useful in representing extreme events.
The CDF of the Gumbel distribution is:
$$F(x; \mu, \beta) = e^{-e^{-(x - \mu)/\beta}}$$
Again, we want to optimize the following objective function by maximizing it:
$$\mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})]$$
If we choose the Gumbel distribution to be standard ($\mu = 0$, $\beta = 1$), we can sample from it by inverting the CDF:
$$g = -\log(-\log(u)), \quad u \sim \text{Uniform}(0, 1)$$
Here are the examples of the Gumbel distribution PDF visualizations for a better understanding:
[Fig4]. Gumbel Distribution Probability Distribution Function Visualization (beta is replaced with sigma here)
Additional knowledge:
Note that the mean is not equal to $\mu$: it is $\mu + \beta\gamma$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.
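A short NumPy check of both facts, sampling via the inverse CDF and verifying the mean (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, beta, K = 2.0, 0.5, 1_000_000

# Inverse-CDF sampling: solving F(x) = exp(-exp(-(x - mu)/beta)) = u
# for x gives x = mu - beta * log(-log(u)).
u = rng.uniform(size=K)
g = mu - beta * np.log(-np.log(u))

# The mean is mu + beta * gamma, with gamma the Euler-Mascheroni constant.
gamma = 0.57721566
print(g.mean(), mu + beta * gamma)
```

With $\mu = 2$ and $\beta = 0.5$, the sample mean settles near $2.289$ rather than $2$, confirming the shift by $\beta\gamma$.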
Categorical distributions represent the discrete probability distribution of a random variable that takes one of finitely many distinct categories, with a specific probability assigned to each category.
We show a categorical distribution as $z \sim \text{Cat}(\pi_1, \ldots, \pi_k)$,
where $\pi_i$ is the probability of the $i$-th category and $\sum_{i=1}^{k} \pi_i = 1$.
Under these conditions, we can represent a sample $z$ as a one-hot vector of length $k$.
Now, we can apply a calculation known as the Gumbel-Max trick, which enables sampling from categorical variables, introducing randomness with Gumbel random variables:
$$z = \text{one\_hot}\left(\underset{i}{\arg\max}\ (g_i + \log \pi_i)\right)$$
Here, $g_1, \ldots, g_k$ are i.i.d. samples from the standard Gumbel distribution.
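A quick NumPy sanity check that the Gumbel-Max trick really samples from the intended categorical distribution (the probability vector below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])   # class probabilities
K = 200_000

# Gumbel-Max: argmax_i (g_i + log pi_i), with g_i ~ Gumbel(0, 1) i.i.d.
g = -np.log(-np.log(rng.uniform(size=(K, pi.size))))
samples = np.argmax(np.log(pi) + g, axis=1)

# The empirical class frequencies should recover pi.
freqs = np.bincount(samples, minlength=pi.size) / K
print(freqs)
```

The empirical frequencies match $\pi$ up to Monte Carlo noise, but the `argmax` blocks gradients, which is what the Gumbel-Softmax relaxation below addresses.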
Now, we can sample from a categorical distribution using the Gumbel-Max trick, but the sampling is not differentiable (since it uses the argmax function). To make it differentiable, we can use the Gumbel-Softmax trick, which replaces the argmax function with a softmax function and adds a temperature parameter $\tau$:
[Fig5]. Gumbel-Softmax formula
In the formula above, $\pi_i$ are the class probabilities, $g_i$ are i.i.d. standard Gumbel samples, and $\tau$ is the temperature parameter that controls how close the output is to a one-hot vector.
Using the Gumbel-Softmax sampler (which is differentiable with respect to the class probabilities $\pi$), gradients can flow through the sampling step via the reparametrization trick.
In the original Gumbel-Softmax paper, it is observed that the temperature parameter $\tau$ works best when annealed from a large value toward a small positive value over the course of training.
The effect of the temperature parameter on the Gumbel-Softmax output is shown below:
[Fig6]. Effect of the temperature parameter on the Gumbel-Softmax output
When the temperature parameter is high, the distribution is more uniform, and when it is low, the distribution is more peaked. More specifically, as $\tau \to 0$ the samples become one-hot vectors and the Gumbel-Softmax distribution converges to the original categorical distribution, while as $\tau \to \infty$ the samples approach the uniform distribution.
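The temperature effect can be demonstrated with a small NumPy implementation of the Gumbel-Softmax sampler (the class probabilities, temperatures, and sample counts below are arbitrary choices):

```python
import numpy as np

def gumbel_softmax(log_pi, tau, rng, n):
    """Draw n relaxed samples: softmax((log_pi + g) / tau), g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=(n, log_pi.size))))
    y = (log_pi + g) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))   # stable softmax
    return y / y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.2, 0.5, 0.3]))

# Low temperature: each sample is close to a one-hot vector.
hot = gumbel_softmax(log_pi, tau=0.1, rng=rng, n=5000)
# High temperature: each sample is close to the uniform distribution.
warm = gumbel_softmax(log_pi, tau=10.0, rng=rng, n=5000)

hot_mean = hot.max(axis=1).mean()     # near 1: peaked samples
warm_mean = warm.max(axis=1).mean()   # near 1/3: flat samples
print(hot_mean, warm_mean)
```

At $\tau = 0.1$ the average largest coordinate is close to 1 (nearly one-hot samples), while at $\tau = 10$ it drops toward $1/3$ (nearly uniform over three classes).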
To summarize what we have achieved by applying the Gumbel-Softmax trick:
We normally have the following objective:
$$\nabla_\phi \mathbb{E}_{q(\mathbf{z}; \phi)}[f(\mathbf{z})]$$
And now, we replace the discrete variable $\mathbf{z}$ with its continuous relaxation $\hat{\mathbf{z}}$ given by the Gumbel-Softmax.
Using the Gumbel-Softmax trick (where the randomness comes from fixed standard Gumbel noise $g$, so the sample $\hat{\mathbf{z}}(g; \phi, \tau)$ is a differentiable function of $\phi$), we can now apply the reparametrization trick:
$$\nabla_\phi \mathbb{E}_{g}[f(\hat{\mathbf{z}}(g; \phi, \tau))] = \mathbb{E}_{g}[\nabla_\phi f(\hat{\mathbf{z}}(g; \phi, \tau))]$$
Note that the relaxation is biased with respect to the original discrete objective: we trade the high variance of REINFORCE for a bias controlled by the temperature $\tau$.
In the realm of unsupervised learning, discovering rankings and matchings often necessitates the representation of data as permutations. A permutation $\mathbf{z} = (z_1, \ldots, z_n)$ is an ordering of $n$ items in which each item appears exactly once, so the sample space contains $n!$ elements. This combinatorial explosion makes direct enumeration infeasible for all but the smallest $n$.
The Plackett-Luce (PL) distribution offers a practical solution for modeling rankings in various fields, such as information retrieval and social choice theory. The PL model assigns each item $i$ a positive score $s_i$ and samples the first-ranked item with probability proportional to its score.
This process continues sequentially, sampling each subsequent item from those that remain, which gives the probability of a full permutation:
$$p(\mathbf{z}) = \prod_{i=1}^{n} \frac{s_{z_i}}{s_{z_i} + s_{z_{i+1}} + \cdots + s_{z_n}}$$
where $s_{z_i}$ denotes the (positive) score of the item placed at rank $i$.
To overcome the non-differentiability of the PL distribution, we can employ a reparameterization technique using Gumbel noise. The Gumbel-PL reparameterized sampler involves adding i.i.d. standard Gumbel noise $g_i$ to the log-scores $\log s_i$ and sorting the perturbed values in decreasing order.
The permutation $\mathbf{z} = \text{argsort}(\log \mathbf{s} + \mathbf{g})$ (with a descending sort) is exactly distributed according to the PL model. Since argsort is itself non-differentiable, it is combined in practice with differentiable relaxations of the sorting operator.
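The Gumbel-PL sampler is only a few lines of NumPy. The sketch below (with arbitrary item scores) draws permutations by perturb-and-sort and checks that the first-ranked item follows the PL top-choice probabilities $s_i / \sum_j s_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([3.0, 1.0, 2.0])   # positive item scores
K = 100_000

# Gumbel-PL: perturb the log-scores with i.i.d. standard Gumbel noise
# and sort in decreasing order to obtain a full permutation.
g = -np.log(-np.log(rng.uniform(size=(K, s.size))))
perms = np.argsort(-(np.log(s) + g), axis=1)

# Check: item i should be ranked first with probability s_i / sum(s),
# i.e. [0.5, 1/6, 1/3] here.
freqs = np.bincount(perms[:, 0], minlength=s.size) / K
print(freqs)
```

The same perturbed values that are sorted here are what a differentiable relaxation of sorting would consume, allowing gradients to reach the scores $s_i$.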
In summary, by reparameterizing the PL distribution with Gumbel noise and utilizing differentiable relaxations for sorting, we can effectively address the combinatorial challenges and leverage gradient-based methods for optimization in permutation-based models.
[Fig7]. Comparison of the Score function estimator (REINFORCE), Reparametrization trick, and other methods
In this topic summary, we examined various methods and challenges associated with discrete latent variables, stochastic optimization, and variational learning.
Discrete latent variables are crucial for accurately modeling real-world data structures that exhibit inherent discreteness. This discrete nature poses challenges for gradient-based optimization methods, necessitating specialized techniques for efficient learning since the direct application of the reparameterization trick is not feasible.
Stochastic optimization, exemplified by variational autoencoders (VAEs), provides scalable solutions for complex probabilistic models by optimizing the Evidence Lower Bound (ELBO).
The REINFORCE method, despite being effective for discrete latent variables and non-differentiable objectives, is limited by high variance in gradient estimates. To mitigate this issue, we explored Neural Variational Inference and Learning (NVIL), which uses control variates and learned baselines for stabilization.
We also discussed continuous relaxations using the Gumbel distribution, enabling gradient-based optimization techniques. The Gumbel-Softmax trick and the reparameterization of the Plackett-Luce (PL) distribution were presented as practical approaches for managing categorical and permutation-based models.
Our exploration highlights the necessity of developing efficient and scalable methods for discrete latent variable models to enhance their applicability to increasingly complex and high-dimensional data structures in future research.
[1] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/bf00992696
[2] Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). http://incompleteideas.net/book/the-book-2nd.html
[3] Maddison, C. J., Tarlow, D., & Minka, T. (2014, October 31). A* sampling. arXiv.org. https://arxiv.org/abs/1411.0030
[5] Gumbel, E. J. (1954). Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33.
[6] Jang, E., Gu, S., & Poole, B. (2022, July 21). Categorical Reparameterization with Gumbel-Softmax. OpenReview. https://openreview.net/forum?id=rkE3y85ee&noteId=S1LB3MLul
[7] Gadetsky, A., Struminsky, K., Robinson, C., Quadrianto, N., & Vetrov, D. (2020). Low-Variance Black-Box Gradient Estimates for the Plackett-Luce Distribution. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06), 10126–10135. https://doi.org/10.1609/aaai.v34i06.6572
[8] Oosterhuis, H. (2022). Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness (Extended Abstract). Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2022/743
[10] Mnih, A. & Gregor, K.. (2014). Neural Variational Inference and Learning in Belief Networks. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1791-1799 Available from https://proceedings.mlr.press/v32/mnih14.html.
[11] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). https://dl.acm.org/citation.cfm?id=1162264
[Fig 1]
DNA sequence data representation:
https://www.researchgate.net/figure/A-human-DNA-and-Part-of-DNA-sequence-28-29_fig1_341901570
[Fig 2]
Game state data representation:
https://medium.com/deepgamingai/game-level-design-with-reinforcement-learning-52b02bb94954
[Fig 3]
A network representation of social relationships among the 34 individuals in the karate club studied by Zachary:
Zachary’s Karate Club graph: Zachary W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33, 452-473.
[Fig 4]
Gumbel Distribution Probability Distribution Function Visualization:
https://www.researchgate.net/figure/Plot-of-the-Gumbel-distribution-for-various-m-and-s-values_fig1_318519731
[Fig 5,6]
Gumbel-Softmax formula:
Jang, E., Gu, S., & Poole, B. (2022, July 21). Categorical Reparameterization with Gumbel-Softmax. OpenReview. https://openreview.net/forum?id=rkE3y85ee&noteId=S1LB3MLul
[Fig 7]
Comparison of the Score function estimator (REINFORCE), Reparametrization trick and other methods:
https://gabrielhuang.gitbooks.io/machine-learning/content/reparametrization-trick.html
- For the content, slides from the CS236 lecture at Stanford University, prepared by Stefano Ermon and Aditya Grover, have been utilized.
- GPT-4o was used to strengthen the text in some places and to obtain equations (from images).

