Mixture Density Networks

Mixture density networks (MDNs) (Bishop, 1994) are a class of models obtained by combining a conventional neural network with a mixture density model.
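
Concretely, an MDN models the conditional density \(p(y \mid x)\) as a mixture of \(K\) components whose parameters are all outputs of a single neural network:

\[
p(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\!\big(y \mid \mu_k(x), \sigma_k^2(x)\big),
\qquad \sum_{k=1}^{K} \pi_k(x) = 1 .
\]

Here the mixing weights \(\pi_k(x)\), means \(\mu_k(x)\), and standard deviations \(\sigma_k(x)\) are the network outputs (in the code below: logits, locs, and scales).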

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import inferpy as inf
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
import tensorflow_probability as tfp

from scipy import stats
from sklearn.model_selection import train_test_split


def plot_normal_mix(pis, mus, sigmas, ax, label='', comp=True):
    """Plot the mixture of Normals on the given axis ax.

    If comp=True, also plot each individual component of the mixture.
    """
    x = np.linspace(-10.5, 10.5, 250)
    final = np.zeros_like(x)
    for i, (weight_mix, mu_mix, sigma_mix) in enumerate(zip(pis, mus, sigmas)):
        temp = stats.norm.pdf(x, mu_mix, sigma_mix) * weight_mix
        final = final + temp
        if comp:
            ax.plot(x, temp, label='Normal ' + str(i))
    ax.plot(x, final, label='Mixture of Normals ' + label)
    ax.legend(fontsize=13)


def sample_from_mixture(x, pred_weights, pred_means, pred_std, amount):
    """Draws samples from mixture model.

    Returns 2 d array with input X and sample from prediction of mixture model.
    """
    samples = np.zeros((amount, 2))
    n_mix = len(pred_weights[0])
    to_choose_from = np.arange(n_mix)
    for j, (weights, means, std_devs) in enumerate(
                    zip(pred_weights, pred_means, pred_std)):
        index = np.random.choice(to_choose_from, p=weights)
        samples[j, 1] = np.random.normal(means[index], std_devs[index], size=1)
        samples[j, 0] = x[j]
        if j == amount - 1:
            break
    return samples
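
To sanity-check the plotting helper above, we can draw a hand-picked mixture. The weights, means, and standard deviations below are arbitrary illustrative values, not learned parameters:

# Illustrative only: plot a 3-component mixture with made-up parameters.
fig, ax = plt.subplots(figsize=(8, 6))
plot_normal_mix(pis=[0.2, 0.5, 0.3],
                mus=[-4.0, 0.0, 5.0],
                sigmas=[1.0, 2.0, 0.5],
                ax=ax, label='(example)')
plt.show()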

Data

We use the same toy data from David Ha’s blog post, where he explains MDNs. It is an inverse problem where for every input \(x_n\) there are multiple outputs \(y_n\).

def build_toy_dataset(N):
    y_data = np.random.uniform(-10.5, 10.5, N).astype(np.float32)
    r_data = np.random.normal(size=N).astype(np.float32)    # random noise
    x_data = np.sin(0.75 * y_data) * 7.0 + y_data * 0.5 + r_data * 1.0
    x_data = x_data.reshape((N, 1))
    return x_data, y_data

import random

tf.random.set_random_seed(42)
np.random.seed(42)
random.seed(42)

#inf.setseed(42)

N = 5000    # number of data points
D = 1    # number of features
K = 20    # number of mixture components

x_train, y_train = build_toy_dataset(N)


print("Size of features in training data: {}".format(x_train.shape))
print("Size of output in training data: {}".format(y_train.shape))
sns.regplot(x_train, y_train, fit_reg=False)
plt.show()
Size of features in training data: (5000, 1)
Size of output in training data: (5000,)
[Figure: scatter plot of the toy training data, x_train vs. y_train]

Fitting a Neural Network

We could try to fit a standard neural network to this data set. However, for each x value in the data there are multiple y values, which poses a problem for standard neural networks that predict a single output per input.

Let’s first define the neural network. We use tf.keras.layers to construct it. We specify a three-layer network: two hidden layers with 15 units each, followed by a single linear output unit.

nnetwork = tf.keras.Sequential([
    tf.keras.layers.Dense(15, activation=tf.nn.relu),
    tf.keras.layers.Dense(15, activation=tf.nn.relu),
    tf.keras.layers.Dense(1, activation=None),
])

The following code fits the neural network to the data:

lossfunc = lambda y_out, y: tf.nn.l2_loss(y_out-y)
nnetwork.compile(tf.train.AdamOptimizer(0.1), lossfunc)
nnetwork.fit(x=x_train, y=y_train, epochs=3000)
Epoch 1/3000
5000/5000 [==============================] - 0s 45us/sample - loss: 386.4314
Epoch 2/3000
5000/5000 [==============================] - 0s 24us/sample - loss: 360.6320
    [...]
Epoch 2997/3000
5000/5000 [==============================] - 0s 25us/sample - loss: 368.1469
Epoch 2998/3000
5000/5000 [==============================] - 0s 23us/sample - loss: 371.1811
Epoch 2999/3000
5000/5000 [==============================] - 0s 24us/sample - loss: 371.4650
Epoch 3000/3000
5000/5000 [==============================] - 0s 23us/sample - loss: 370.4930
<tensorflow.python.keras.callbacks.History at 0x135680198>
sess = tf.keras.backend.get_session()
x_test, _ = build_toy_dataset(200)
y_test = sess.run(nnetwork(x_test))

plt.figure(figsize=(8, 8))
plt.plot(x_train,y_train,'ro',x_test,y_test,'bo',alpha=0.3)
plt.show()
[Figure: training data (red) and neural network predictions on test inputs (blue)]

As the figure shows, the neural network is not able to fit this data: minimizing the squared error pushes its predictions towards the conditional mean of \(y\) given \(x\), which lies between the different branches of the data.

Mixture Density Network (MDN)

We use an MDN with a mixture of 20 normal distributions parameterized by a feedforward network. That is, the membership probabilities and the per-component means and standard deviations are all given by the output of a feedforward network.

We define our probabilistic model using InferPy constructs. Specifically, we use the MixtureGaussian distribution, whose parameters are provided by the feedforward network defined below.

def neural_network(X):
    """loc, scale, logits = NN(x; theta)"""
    # 2 hidden layers with 15 hidden units
    net = tf.keras.layers.Dense(15, activation=tf.nn.relu)(X)
    net = tf.keras.layers.Dense(15, activation=tf.nn.relu)(net)
    locs = tf.keras.layers.Dense(K, activation=None)(net)
    scales = tf.keras.layers.Dense(K, activation=tf.exp)(net)
    logits = tf.keras.layers.Dense(K, activation=None)(net)
    return locs, scales, logits


@inf.probmodel
def mdn():
    with inf.datamodel():
        x = inf.Normal(loc = tf.ones([D]), scale = 1.0, name="x")
        locs, scales, logits = neural_network(x)
        y = inf.MixtureGaussian(locs, scales, logits=logits, name="y")

m = mdn()

Note that we use the MixtureGaussian random variable. It collapses out the membership assignments for each data point and makes the model differentiable with respect to all its parameters. As input, it takes the probabilities (or logits) for each cluster assignment, together with the per-component loc and scale values.
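
For intuition, the density that MixtureGaussian represents, with the component assignments marginalized out, is an ordinary Gaussian mixture. A minimal sketch of an analogous object built directly with TensorFlow Probability, using placeholder parameters rather than the network outputs, looks like this:

# Illustrative sketch only, not part of the InferPy model: a K-component
# Gaussian mixture whose assignment variable is marginalized out by
# tfd.MixtureSameFamily, analogous to what MixtureGaussian provides for
# logits, locs and scales. The zeros/ones below are placeholder parameters.
tfd = tfp.distributions
mixture = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(logits=tf.zeros([K])),
    components_distribution=tfd.Normal(loc=tf.zeros([K]),
                                       scale=tf.ones([K])))
log_prob = mixture.log_prob(0.0)   # tensor: log sum_k pi_k N(0 | loc_k, scale_k)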

For more background on MDNs, take a look at Christopher Bonnett’s blog post or at Bishop (1994).

Inference

Next we train the MDN model. For details, see the InferPy documentation on Inference.

@inf.probmodel
def qmodel():
    return

VI = inf.inference.VI(qmodel(), epochs=4000)
m.fit({"y": y_train, "x":x_train}, VI)
0 epochs    129578.296875....................
200 epochs  113866.8046875....................
400 epochs  110405.765625....................
600 epochs  108311.9296875....................
800 epochs  107741.84375....................
1000 epochs 106996.3359375....................
1200 epochs 106747.328125....................
1400 epochs 106299.640625....................
1600 epochs 106157.328125....................
1800 epochs 106087.8125....................
2000 epochs 106019.1875....................
2200 epochs 105955.0703125....................
2400 epochs 105751.9765625....................
2600 epochs 105717.4609375....................
2800 epochs 105693.375....................
3000 epochs 105676.3984375....................
3200 epochs 105664.40625....................
3400 epochs 105655.578125....................
3600 epochs 105648.265625....................
3800 epochs 105639.09375....................

After training, we can see how the same network, embedded in a mixture model, is able to capture the multimodal structure of the training data.

X_test, y_test = build_toy_dataset(N)
y_pred = m.posterior_predictive(["y"], data = {"x": X_test}).sample()

plt.figure(figsize=(8, 8))
sns.regplot(X_test, y_test, fit_reg=False)
sns.regplot(X_test, y_pred, fit_reg=False)
plt.show()
[Figure: test data and posterior predictive samples from the MDN]

Acknowledgments

This tutorial is inspired by David Ha’s blog post and Edward’s tutorial.