Rayleigh Weight Initializations
Initializing the weights of complex-valued neural networks with Rayleigh distributions.
- Overview
- Background on initializations
- Complex networks
- Rayleigh Distribution
- Generating Rayleigh samples
- Adding phase information
- Matching He and Glorot variance criteria
- Putting it all together: complex pytorch initializer
- Conclusion
Overview
This is the first in a series of posts that will build complex-valued neural networks in pytorch with an eye towards Radio Frequency (RF) applications.
Part one introduces complex weight initializations using two pieces of information:
- Magnitude: the length of a vector
- Phase: where a vector is pointing
The magnitudes will be drawn from a Rayleigh distribution, which is a natural representation for RF data.
Next, we make sure the weights are good initial values for neural networks by having their variance meet certain criteria.
We then put everything together by creating complex weights as pytorch tensors.
To start, we give some background on weight initializations and why they are important.
Background on initializations
It is crucial to train neural networks from properly initialized weights. While initializations are now taken for granted, they were one of the key pieces that made it possible to train deep networks in the first place. The main insights about good weight initializations came from analyzing the variance of the weights: specifically, how that variance affects gradients during backpropagation.
Seminal work by He et al. and by Glorot and Bengio showed how the variance of the weights should meet certain criteria to keep gradients flowing smoothly. Here, "smoothly" means that during training the gradients neither disappear (go to 0) nor explode (grow without bound).
The best known initializations are now the defaults in popular deep learning libraries such as TensorFlow and pytorch. However, complex-valued deep networks are newer and not as well established.
Complex networks
Complex-valued networks have a long history; see Chapter 3 of this thesis for a great recap. The first modern and complete take on deep complex neural nets is likely Deep Complex Networks by Trabelsi et al.
This paper explored many fundamental building blocks for deep complex networks. It developed initializations, convolutions, activations, batch normalizations, and put them together into complex Residual Networks (ResNets).
Despite this fantastic work, the field stayed quiet at first. But there has been a recent spike in activity with follow-ups in medical imaging, radio frequency (RF) signal processing, physical optical networks, and even some quantum networks!
Next we take the first steps in making complex weights: creating their magnitudes.
Rayleigh Distribution
To best describe a Rayleigh distribution, imagine setting up a sensor that measures wind speed out in an open field. If we analyze the wind speed passing through this sensor in two directions, say North and East, then the magnitude of the wind's velocity will be Rayleigh distributed.
In general, a Rayleigh distribution arises from the magnitude of two random variables, i.e. the square root of the sum of their squares. To be Rayleigh distributed, the two variables must be uncorrelated, normally distributed, zero mean, and share the same standard deviation.
A more relevant example for our purposes is tuning in to a radio signal. Imagine we tune a radio to an empty spectrum region where all we hear is noise. If we record the real and imaginary components of this RF stream, then their magnitudes will be Rayleigh distributed. In other words, the magnitude of RF noise follows a Rayleigh distribution.
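As a quick sanity check of this definition, here is a short standalone sketch (our own illustration, not code from the post) that builds magnitudes from two zero-mean normal components and compares their sample mean against the known Rayleigh mean of $\sigma\sqrt{\pi/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
# two uncorrelated, zero-mean normal components with equal standard deviation,
# e.g. the real and imaginary parts of RF noise
north = rng.normal(0, sigma, size=100_000)
east = rng.normal(0, sigma, size=100_000)
magnitude = np.sqrt(north**2 + east**2)
# a Rayleigh(sigma) variable has mean sigma * sqrt(pi/2), about 1.2533 here
print(magnitude.mean())
```

The sample mean lands very close to the theoretical Rayleigh mean, as expected.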
Let's dive into the details. The equation below is the Probability Density Function (PDF) of a Rayleigh distribution.
$$f(x,\sigma) = \frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)}, \ \ x \geq 0$$
If this equation looks intimidating, we can instead code it up as a python function to make it much clearer:
import numpy as np

def rayleigh_pdf(x, sigma):
    "Finds Rayleigh PDF evaluated at `x`."
    p = (x / sigma**2) * np.exp(-x**2 / (2*sigma**2))
    return p
In the equation and code above sigma ($\sigma$) is the scale parameter. It is common in many probability distributions and usually controls how spread out or narrow a distribution is.
Let's start by setting $\sigma = 1$ to see the "basic" Rayleigh shape. We will then change sigma to see how it affects the distribution.
import matplotlib.pyplot as plt

sigma = 1
# calculate PDF on 100 equally spaced points between 0 and 5
points = np.linspace(0, 5, 100)
ray_pdf = rayleigh_pdf(points, sigma)
# plot the rayleigh pdf
fig,ax = plt.subplots(figsize=(8,7))
ax.plot(points, ray_pdf)
ax.set_xlabel('Sample Value', fontsize='x-large')
ax.set_ylabel('Probability Density', fontsize='x-large')
ax.set_title('Rayleigh PDF', fontsize='xx-large');
As we mentioned, the scale $\sigma$ changes the width or narrowness of the distribution. Let's both halve and double sigma to see what happens.
# setup plot
fig,ax = plt.subplots(figsize=(8,7))
ax.set_xlabel('Sample Value', fontsize='x-large')
ax.set_ylabel('Probability Density', fontsize='x-large')
ax.set_title('Rayleigh PDFs', fontsize='xx-large')
# plot each sigma in a different color
sigmas = [0.5, 1, 2]
colors = ['m', 'b', 'r']
for color,sig in zip(colors,sigmas):
    rpdf = rayleigh_pdf(points, sig)
    ax.plot(points, rpdf, c=color, label=f'σ: {sig}')
ax.legend();
The blue line in the plot above is the same PDF from the first plot where $\sigma = 1$.
We can see how $\sigma = 0.5$ pulls the distribution up and to the left, while $\sigma = 2$ squishes it down and to the right.
In other words, a smaller sigma makes our distribution narrower while a larger sigma makes it wider.
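One way to quantify this narrowing: the peak (mode) of a Rayleigh distribution sits exactly at $x = \sigma$. A small sketch confirms this numerically (with `rayleigh_pdf` re-defined here so the snippet runs standalone):

```python
import numpy as np

def rayleigh_pdf(x, sigma):
    "Finds Rayleigh PDF evaluated at `x`."
    return (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

# a fine grid of points so argmax locates the peak precisely
xs = np.linspace(0, 5, 10_001)
for sigma in (0.5, 1, 2):
    peak = xs[np.argmax(rayleigh_pdf(xs, sigma))]
    print(sigma, peak)  # the peak lands (approximately) at x = sigma
```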
Generating Rayleigh samples
Plotting the theoretical Rayleigh PDF above only shows what the distribution should look like. Now we need to actually generate the Rayleigh values.
We will use the RandomState class in the numpy library to generate Rayleigh samples. RandomState is a helpful class that can sample from just about every known distribution.
from numpy.random import RandomState
First we create a RandomState instance with a chosen seed of $0$.
seed = 0
rand = RandomState(seed)
This class can directly sample from a Rayleigh distribution. We use the sampling function RandomState.rayleigh, which accepts two parameters:
- scale: $\sigma$, with a default value of 1.
- size: the shape of the output array.
Let's start by drawing 1,000 Rayleigh samples with $\sigma = 1$.
sigma = 1
shape = 1000 # one dimensional vector with 1000 samples
ray_vals = rand.rayleigh(scale=sigma, size=shape)
How do we check that these samples are actually Rayleigh distributed? We can refer back to our PDF plots at the beginning, which tell us how Rayleigh samples should be "spread out". The easiest way to check if these samples are spread out as expected is with a histogram.
# plot histogram of 1000 drawn samples
plt.figure(figsize=(8,7))
plt.hist(ray_vals)
plt.xlabel('Sample Value', fontsize='x-large')
plt.ylabel('Counts', fontsize='x-large')
plt.title(f'Histogram of {shape:,} Rayleigh samples', fontsize='xx-large');
That looks pretty good! As we generate more samples it should get even closer to the earlier PDF plots. Let's make sure this happens by now drawing 10,000 samples and using more bins.
large_shape = 10000
many_ray_vals = rand.rayleigh(scale=sigma, size=large_shape)
plt.figure(figsize=(8,7))
plt.hist(many_ray_vals, bins=20);
plt.xlabel('Sample Value', fontsize='x-large')
plt.ylabel('Counts', fontsize='x-large')
plt.title(f'Histogram of {large_shape:,} Rayleigh samples', fontsize='xx-large');
Looking better. Let's compare this histogram against the theoretical Rayleigh PDF.
Note that we pass density=True to the histogram function below to make it an approximate PDF.
# compare theoretical vs. sampled Rayleigh PDFs
plt.figure(figsize=(8,7))
plt.hist(many_ray_vals, density=True, bins=20) # scales bar areas to sum to one, to mimic a pdf
plt.plot(points, ray_pdf, c='r', label='theoretical rayleigh pdf', linewidth=3)
plt.title('Sampled vs. Theoretical Rayleigh PDFs', fontsize='x-large')
plt.xlabel('Sample Value', fontsize='x-large')
plt.ylabel('Probability Density', fontsize='x-large')
plt.legend();
plt.legend();
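As an aside, here is a small standalone sketch (our own illustration) of what density=True actually does: the bar heights are rescaled so that the total bar area integrates to one, just like a PDF.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.rayleigh(scale=1, size=10_000)
# density=True rescales counts so that bar areas (height * width) sum to one
heights, edges = np.histogram(samples, bins=20, density=True)
print(np.sum(heights * np.diff(edges)))  # 1.0 (up to floating point)
```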
Adding phase information
With the Rayleigh values from above we now have the magnitude, or length, of the complex weights. But that is only one part of a complex number. We are still missing the phase, or angle, information. The phase tells us which direction in the complex plane a vector is pointing.
For our purposes it is enough to use random angles. Why? Many processes such as speech, images, and RF modulations encode information in phase. We do not want to bias the networks to any of these particular phase setups. By adding random uniform phase, it is like starting with many vectors pointing out in all directions. Then, during training, the network learns how to best orient and scale the weights for its task.
Adding this random uniform phase is straightforward. We pick uniform samples from $-\pi$ to $\pi$ radians, which maps to a full loop of the unit circle. This can be done with the same RandomState class from before.
# pick random directions along the unit circle
phase = rand.uniform(low=-np.pi, high=np.pi, size=ray_vals.shape)
With the phase we can split the Rayleigh magnitudes into real and imaginary components. We use the cosine of the phase for the real part, and the sine of the phase for the imaginary part.
real = ray_vals * np.cos(phase)
imag = ray_vals * np.sin(phase)
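Since $\cos^2 + \sin^2 = 1$, splitting this way preserves the magnitudes exactly. A small self-contained check, mirroring the steps above:

```python
import numpy as np
from numpy.random import RandomState

rand = RandomState(0)
# draw Rayleigh magnitudes and uniform random phases, as in the post
ray_vals = rand.rayleigh(scale=1, size=1000)
phase = rand.uniform(low=-np.pi, high=np.pi, size=ray_vals.shape)
# split into real and imaginary components
real = ray_vals * np.cos(phase)
imag = ray_vals * np.sin(phase)
# recombining the components recovers the original Rayleigh magnitudes
assert np.allclose(np.sqrt(real**2 + imag**2), ray_vals)
```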
Let's plot some of these values in the complex plane to see if we truly have randomly oriented vectors. We pick 100 random weights and expect to see vectors with different magnitudes spread out in all directions.
# plot a few complex weights
chosen_samples = range(100) # plot the first 100 samples
plt.figure(figsize=(8,7))
for idx in chosen_samples:
    # index into phase and magnitude variables
    angle,mag = phase[idx],ray_vals[idx]
    plt.polar([0,angle],[0,mag], marker='o') # plot them starting from the origin
Matching He and Glorot variance criteria
Even though we have real and imaginary components, they are not good initializations yet. The polar plot above gives some clues as to why (hint: look at the range of magnitudes in the vectors). Recall from the background on initializations: the key insight was that the variance of the weights should meet certain criteria. Matching these criteria helps the gradients flow well during backpropagation.
To get more specific, both the He and Glorot criteria are based on the incoming and outgoing connections in a network layer. The numbers of connections are typically called fan_in and fan_out, respectively.
The He criterion says that the variance of weights $W$ should be: $$\text{Var}(W) = \frac{2}{\text{fan\_in}}$$
The Glorot criterion says that the variance should be: $$\text{Var}(W) = \frac{2}{\text{fan\_in} + \text{fan\_out}}$$
Deep networks typically have hundreds or thousands of connections. In practice this means that the variance of the weights has to be very small. Now we can see why the values in the earlier polar plot are not good: their variance is clearly too large.
How can we generate Rayleigh samples that meet the He or Glorot variance criteria?
The complex nets paper from earlier includes a nice derivation for the variance of a complex Rayleigh distribution: $$\text{Var}(W) = 2\sigma^{2}$$
We can set the Rayleigh variance equal to the He and Glorot criteria and solve for sigma.
To meet the He criterion, sigma should be: $$\sigma_{\text{He}} = \frac{1}{\sqrt{\text{fan\_in}}}$$
To meet the Glorot criterion, sigma should be: $$\sigma_{\text{Glorot}} = \frac{1}{\sqrt{\text{fan\_in} + \text{fan\_out}}}$$
Let's take a step back. In the earlier sections we used a flat vector of complex weights as an example. It's as if we took a single series of wind velocity or RF sample measurements.
Since the He and Glorot criteria are defined specifically for network layers, we now switch to a simple one-layer network as an example. We arbitrarily choose a layer with 100 inputs and 50 outputs (fan_in = 100, fan_out = 50).
Plugging these fan_in and fan_out numbers into the sigma equations gives:
$$\sigma_{\text{He}} = \frac{1}{10}$$
$$\sigma_{\text{Glorot}} = \frac{1}{5\sqrt{6}}$$
By plugging in these sigmas into the random sampler, it will draw Rayleigh values that match the chosen variance criteria. We can then add phase information in the same way as before.
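We can verify this numerically with a standalone sketch (our own check, not from the post): draw complex weights with $\sigma_{\text{He}}$ and confirm that their total variance lands on the He target of $2/\text{fan\_in}$.

```python
import numpy as np
from numpy.random import RandomState

rand = RandomState(0)
fan_in, fan_out = 100, 50
sigma_he = 1 / np.sqrt(fan_in)                # 1/10
sigma_glorot = 1 / np.sqrt(fan_in + fan_out)  # 1/(5*sqrt(6))
# draw complex weights: Rayleigh magnitudes with uniform random phase
mags = rand.rayleigh(scale=sigma_he, size=100_000)
phase = rand.uniform(low=-np.pi, high=np.pi, size=mags.shape)
weights = mags * np.exp(1j * phase)
# total variance Var(real) + Var(imag) should match 2 * sigma^2 = 2 / fan_in = 0.02
print(np.var(weights.real) + np.var(weights.imag))
```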
Putting it all together: complex pytorch initializer
Here is a short recap of the previous sections:
- Drew a flat series of Rayleigh magnitudes.
- Picked a random phase component, then split the magnitudes into real and imaginary parts.
- Saw how to match the He and Glorot criteria with Rayleigh samples for a single network layer.
To make the above practical and usable, we need to automatically generate complex weights that:
- Match the He/Glorot variance criteria
- Are pytorch tensors
- Have the correct shape for the given network layer
We can do this in a python function.
One quick word about calculating fan_in and fan_out values. We saw the feedforward case with our single network layer. There the number of incoming connections was simply fan_in and the outgoing connections were fan_out.
However, the convolutional case is slightly more complicated. A convolutional layer has input and output feature maps which are roughly analogous to input and output units in feedforward layers. But we also have to factor in the kernel size. Pytorch has a nice convenience function to help.
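That convenience function is torch.nn.init._calculate_fan_in_and_fan_out (note the leading underscore: it is a private pytorch helper, so its location could change between versions). A quick sketch of what it returns for a convolutional layer:

```python
import torch

# conv layer with 100 input feature maps, 50 output feature maps, (3x3) kernel
conv = torch.nn.Conv2d(100, 50, kernel_size=3)
fan_in, fan_out = torch.nn.init._calculate_fan_in_and_fan_out(conv.weight)
# for conv layers the kernel area multiplies the feature map counts:
# fan_in  = in_channels  * 3 * 3 = 900
# fan_out = out_channels * 3 * 3 = 450
print(fan_in, fan_out)
```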
We can now package all of the code above, with some refactoring, into a function that automatically generates complex Rayleigh weights given an input pytorch module.
import torch

def get_complex_inits(module, seed=None, criterion='he'):
    "Creates real and imaginary Rayleigh weights for initialization."
    # create random number generator (RandomState seeds must fit in 32 bits)
    rand = RandomState(seed if seed is not None else torch.initial_seed() % 2**32)
    # get shape of the weights
    weight_size = module.weight.size()
    # find number of input and output connections
    fan_in,fan_out = torch.nn.init._calculate_fan_in_and_fan_out(module.weight)
    # find sigma to meet chosen variance criterion
    assert criterion in ('he','glorot')
    factor = fan_in if criterion == 'he' else fan_in + fan_out
    sigma = 1. / np.sqrt(factor)
    # draw rayleigh magnitude samples
    magnitude = rand.rayleigh(scale=sigma, size=weight_size)
    # draw uniform angle samples
    phase = rand.uniform(low=-np.pi, high=np.pi, size=magnitude.shape)
    # split magnitudes into real and imaginary components
    real = magnitude * np.cos(phase)
    imag = magnitude * np.sin(phase)
    # turn into float tensors and return
    real,imag = map(torch.from_numpy, [real,imag])
    return real,imag
We can test this on a single layer using pytorch's nn.Linear module.
m = torch.nn.Linear(100, 50)
real,imag = get_complex_inits(m)
Let's check if the weights are correctly distributed. Going back to our Rayleigh introduction, it is the magnitude that should be Rayleigh distributed.
# grab magnitude as a flat vector of numpy samples
magnitude = torch.sqrt(real**2 + imag**2).numpy().reshape(-1)
# pick points that cover sample range to compare with theoretical rayleigh pdf
points = np.linspace(0, magnitude.max(), 1000)
ray_pdf = rayleigh_pdf(points, sigma=1./np.sqrt(100))
# plot histogram of magnitudes vs. theoretical pdf
plt.figure(figsize=(8,7))
plt.title('nn.Linear vs. Theoretical Rayleigh PDFs', fontsize='x-large')
plt.xlabel('Sample Value', fontsize='x-large')
plt.ylabel('Probability Density', fontsize='x-large')
plt.hist(magnitude, density=True)
plt.plot(points, ray_pdf, c='r', linewidth=3);
Success!
What about a convolutional layer? Our main concern here is that both the tensor shape and fan_in/fan_out are handled correctly.
# make conv layer with 100 input features, 50 output features, and (3x3) kernel
m = torch.nn.Conv2d(100, 50, 3)
real,imag = get_complex_inits(m) # get the initial complex weights
# make sure the shape of weights is ok
print(f'Shapes of real and imaginary convolution tensors: {real.shape}, {imag.shape}')
Let's check if they are still Rayleigh distributed in the same way as before.
# grab magnitude as a flat vector of numpy samples
magnitude = torch.sqrt(real**2 + imag**2).numpy().reshape(-1)
# pick points that cover sample range to compare with theoretical rayleigh pdf
points = np.linspace(0, magnitude.max(), 1000)
# we need fan_in for the more complicated conv case
fan_in,fan_out = torch.nn.init._calculate_fan_in_and_fan_out(m.weight)
ray_pdf = rayleigh_pdf(points, sigma=1./np.sqrt(fan_in))
# plot histogram of magnitudes vs. theoretical pdf
plt.figure(figsize=(8,7))
plt.title('nn.Conv2d vs. Theoretical Rayleigh PDFs', fontsize='x-large')
plt.xlabel('Sample Value', fontsize='x-large')
plt.ylabel('Probability Density', fontsize='x-large')
plt.hist(magnitude, density=True)
plt.plot(points, ray_pdf, c='r', linewidth=3);
Conclusion
In this post we created Rayleigh initializations for complex-valued neural networks. We started with an overview of the Rayleigh distribution. We then used this distribution to generate the magnitudes of complex weights. Next we added phase information to randomly orient the vectors. After that, we made sure the weights matched certain variance criteria to be good initializations.
Lastly, we put all that work together into a python function that returns pytorch tensors. The function as-is is not quite ready for use in a fully fledged complex-valued setup, but we will get there.
Part two will look at another type of complex initialization based on (semi-)unitary matrices. After that we will proceed to build complex convolutions and activations.