High-variance gradient estimators make learning difficult in models with discrete latent variables. Past work has generally relied on control variates to lower the variance of the REINFORCE estimator. More recently, a continuous relaxation of the discrete variables has been used to provide low-variance but biased gradient estimates (Jang et al., 2016; Maddison et al., 2016).

In this work, we combine the two techniques through a novel control variate that produces low-variance, *unbiased* gradient estimates. We then introduce a modification to the continuous relaxation and show that the tightness of the relaxation can be adjusted online, removing it as a hyperparameter. On a number of benchmark generative modeling problems, we demonstrate state-of-the-art variance reduction, which generally leads to faster convergence to a better final log-likelihood.

Discrete latent variable models are ubiquitous in machine learning: mixture models, Markov decision processes in reinforcement learning (RL), generative models for structured prediction, and, recently, hard attention models (Mnih et al., 2014) and memory networks (Zaremba & Sutskever, 2015).

However, when the discrete latent variables cannot be marginalized out analytically, maximizing objectives over these models using REINFORCE (Williams, 1992) requires gradient estimates computed from samples, which is difficult because those estimates have high variance. Most approaches to reducing this variance have focused on developing clever control variates (Mnih & Gregor, 2014; Titsias & Lázaro-Gredilla, 2015; Gu et al., 2015; Mnih & Rezende, 2016).
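To make the control-variate idea concrete, here is a minimal numpy sketch (not code from this repo) of the REINFORCE estimator for a single Bernoulli variable, with and without a constant baseline. The objective `f` and the baseline value are illustrative choices, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grads(theta, f, n_samples, baseline=0.0):
    """Per-sample REINFORCE (score-function) gradient estimates of
    d/dtheta E_{b ~ Bernoulli(sigmoid(theta))}[f(b)], optionally
    centered with a constant baseline acting as a control variate."""
    p = 1.0 / (1.0 + np.exp(-theta))               # sigmoid(theta)
    b = (rng.random(n_samples) < p).astype(float)  # discrete samples
    score = b - p                                  # d/dtheta log p(b | theta)
    return (f(b) - baseline) * score

# Illustrative objective; a constant baseline near E[f(b)] shrinks the variance.
f = lambda b: (b - 0.499) ** 2

g_plain = reinforce_grads(0.0, f, 100_000)              # no control variate
g_cv = reinforce_grads(0.0, f, 100_000, baseline=0.25)  # constant baseline
```

Both estimators have the same expectation (the baseline multiplies a zero-mean score term), but the centered version has far lower variance.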

Recently, Jang et al. (2016) and Maddison et al. (2016) introduced a new distribution, the Gumbel-Softmax or Concrete distribution, that continuously relaxes discrete random variables. Replacing every discrete random variable in a model with a Concrete random variable results in a continuous model to which the reparameterization trick can be applied (Kingma & Welling, 2013; Rezende et al., 2014).

The resulting gradients are biased with respect to the discrete model, but they can still be used effectively to optimize large models. The tightness of the relaxation is controlled by a temperature hyperparameter.
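The effect of the temperature can be seen in a small numpy sketch of the binary Concrete relaxation (an illustration, not code from this repo): a relaxed sample is `sigmoid((logit + Logistic noise) / temperature)`, and lowering the temperature pushes samples toward the discrete set {0, 1}:

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_concrete_sample(logit, temperature, n):
    """Sample a binary Concrete / Gumbel-Softmax relaxation of
    Bernoulli(sigmoid(logit)): z = sigmoid((logit + Logistic noise) / t)."""
    u = rng.random(n)
    logistic_noise = np.log(u) - np.log1p(-u)  # Logistic(0, 1) noise
    x = (logit + logistic_noise) / temperature
    return 0.5 * (1.0 + np.tanh(0.5 * x))      # numerically stable sigmoid

z_loose = binary_concrete_sample(0.0, temperature=2.0, n=50_000)
z_tight = binary_concrete_sample(0.0, temperature=0.1, n=50_000)

def mean_gap(z):
    """Average distance of relaxed samples from the discrete set {0, 1}."""
    return np.minimum(z, 1.0 - z).mean()
```

A lower temperature gives a tighter relaxation (smaller `mean_gap`), at the cost of the gradient-variance issues discussed below.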

Because the REBAR estimator is based on differences of reparameterization gradients, it implicitly implements the recommendation from (Roeder et al., 2017).

When optimizing the relaxation temperature, we require the derivative of the gradient variance with respect to the temperature. Empirically, the temperature changes slowly relative to the model parameters, so the cost of this computation could be amortized over multiple parameter updates. We leave exploring these ideas to future work.

It would also be natural to explore extensions to the multi-sample case (e.g., VIMCO (Mnih & Rezende, 2016)), to leverage the layered structure in our models using Q-functions, and to apply this approach to reinforcement learning.

In the limit of zero temperature, the gradient estimate is unbiased, but its variance diverges, so in practice the temperature must be tuned to balance bias and variance.
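The divergence of the variance is easy to observe numerically. The sketch below (illustrative, not from this repo) draws per-sample reparameterization gradients `dz/dtheta` for the binary Concrete relaxation at two temperatures; the analytic derivative is `z * (1 - z) / temperature`, which spikes as the temperature shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def pathwise_grad_samples(theta, temperature, n):
    """Per-sample reparameterization gradients dz/dtheta for the binary
    Concrete relaxation z = sigmoid((theta + Logistic noise) / temperature)."""
    u = rng.random(n)
    noise = np.log(u) - np.log1p(-u)       # Logistic(0, 1) noise
    x = (theta + noise) / temperature
    z = 0.5 * (1.0 + np.tanh(0.5 * x))     # numerically stable sigmoid
    return z * (1.0 - z) / temperature     # analytic dz/dtheta

g_warm = pathwise_grad_samples(0.0, temperature=1.0, n=200_000)
g_cold = pathwise_grad_samples(0.0, temperature=0.02, n=200_000)
```

At the cold temperature most gradient samples are near zero, but rare samples are very large, so the empirical variance is orders of magnitude higher than at the warm temperature.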

## REBAR implementation in TensorFlow

### Quick Start:

**Requirements**:

- TensorFlow (see tensorflow.org for how to install)
- MNIST dataset
- Omniglot dataset

First, fill in the URLs to download the datasets from in the download_data.py script, like so:

```python
MNIST_URL = 'http://yann.lecun.com/exdb/mnist'
MNIST_BINARIZED_URL = 'http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist'
OMNIGLOT_URL = 'https://github.com/yburda/iwae/raw/master/datasets/OMNIGLOT/chardata.mat'
```

Then execute the script to download the data:

`python download_data.py`

Then run the model training script:

`python rebar_train.py --hparams="model=SBNDynamicRebar,learning_rate=0.0003,n_layer=2,task=sbn"`

and you should get something like the following:

```
Step 2084: [-231.026474 0.3711713 1. 1.06934261 1.07023323
1.02173257 1.02171052 1. 1. 1. 1. ]
-3.6465678215
Step 4168: [-156.86795044 0.3097114 1. 1.03964758 1.03936625
1.02627242 1.02629256 1. 1. 1. 1. ]
-4.42727231979
Step 6252: [-143.4650116 0.26153237 1. 1.03633797 1.03600132
1.02639604 1.02639794 1. 1. 1. 1. ]
-4.85577583313
Step 8336: [-137.65275574 0.22313026 1. 1.03467286 1.03428006
1.02336085 1.02335203 0.99999988 1. 0.99999988
1. ]
-4.95563364029
```

The first number in the list is the log-likelihood lower bound, and the number after the list is the log of the variance of the gradient estimator. The remaining numbers are for debugging.

We can also compare the variance of the different gradient estimators:

```
python rebar_train.py \
--hparams="model=SBNTrackGradVariances,learning_rate=0.0003,n_layer=2,task=omni"
```

and you should see something like:

```
Step 959: [ -2.60478699e+02 3.84281784e-01 6.31126612e-02 3.27319391e-02
6.13379292e-03 1.98278503e-04 1.96425783e-04 8.83973844e-04
8.70995224e-04 -inf]
('DynamicREBAR', -3.725339889526367)
('MuProp', -0.033569782972335815)
('NVIL', 2.7640280723571777)
('REBAR', -3.539274215698242)
('SimpleMuProp', -0.040744658559560776)
Step 1918: [ -2.06948471e+02 3.35904926e-01 5.20901568e-03 7.81541676e-05
2.06885766e-03 1.08521657e-04 1.07351625e-04 2.30646547e-04
2.26554010e-04 -8.22885323e+00]
('DynamicREBAR', -3.864381790161133)
('MuProp', -0.7183765172958374)
('NVIL', 2.266523599624634)
('REBAR', -3.662022113800049)
('SimpleMuProp', -0.7071359157562256)
```

where the tuples show the log of the variance of the gradient estimators.

## REBAR PyTorch implementation example

**Source**: https://github.com/pemami4911/REBAR-pytorch

I attempted to implement REBAR for the Sigmoid Belief Network model and the binarized MNIST benchmark. Below is the performance of my implementation compared to the authors' TensorFlow implementation. In this run, both models use one nonlinear stochastic layer, a fixed temperature of 0.5, and a fixed eta of 1.0 (the parameter that multiplies the Gumbel control variate and is normally optimized with a variance objective).
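To illustrate the role eta plays in general (a generic control-variate scaling, not REBAR's exact variance objective), here is a small numpy sketch: for a noisy unbiased estimate `g` and a zero-mean control variate `c`, the estimator `g - eta * c` stays unbiased for any eta, and the variance-minimizing scale is `Cov(g, c) / Var(c)`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: g is a noisy unbiased gradient estimate with E[g] = 1,
# c is a correlated control variate with known mean E[c] = 0.
noise = rng.standard_normal(100_000)
g = 1.0 + noise + 0.1 * rng.standard_normal(100_000)
c = noise

# Variance-minimizing scale for the control variate.
eta = np.cov(g, c)[0, 1] / c.var()
g_scaled = g - eta * c  # still unbiased, much lower variance
```

Fixing eta to 1.0 (as in the run above) skips this optimization and simply subtracts the control variate at full strength.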

Here are results for 1 nonlinear stochastic layer on binarized MNIST from the paper: