Spectrograms are often used as images to train deep neural networks for audio tasks. By treating spectrograms as images, we can borrow from the many powerful ideas in image recognition with deep learning. A spectrogram, however, is fundamentally different than natural images as we will see below. That brings up the central question of this post: how should spectrograms be normalized during training?

This post assumes some familiarity with deep learning and signal processing concepts like the FFT. It is also a light introduction to the fastaudio library.

Transforming audio into two dimensions

Image classification is a challenging task that was previously done with expert, handcrafted features. Now, features are automatically learned from labeled data instead. The success of these learned features has completely shifted the paradigm of Computer Vision. We would ideally like to apply these same, proven techniques on audio tasks.

However, audio is treated like a one dimensional signal in most Machine Learning applications. That means raw audio is unusable with 2-D Convolutional Neural Networks (CNNs), which are the workhorses of modern image recognition. If we could somehow represent audio in two dimensions, like an image, then we could leverage the successful approaches in image classification.

Thankfully there are many ways of transforming audio into two dimensions. The most popular one is turning audio into a spectrogram. As an example, the image below shows the spectrogram of this violin recording taken from Wikipedia.

The spectrogram of a violin recording

The spectrogram is a 2-D signal representation in time and frequency, so we can use it with 2-D CNNs! But first it is crucial to preprocess and normalize the spectrograms. Neural networks have a much easier time learning when their inputs are normalized.

For natural images, normalization uses an estimated mean ($\mu$) and standard deviation ($\sigma$) as follows:

  • Subtract $\mu$ from the image values to give them a mean of $0$.
  • Divide the image values by $\sigma$ to give them a variance of $1$.

In math terms, if $x$ is our image then $x_{\text{norm}}$ is: $$x_{\text{norm}} = \frac{(x - \mu)}{\sigma}$$

Since spectrograms are fundamentally different than natural images, we should reevaluate if this same normalization makes sense.

Why spectrograms are not images and how to normalize them

Now we can describe what makes spectrograms different from natural images. We start with a high-level overview of images and their normalization, then do the same for spectrograms. A quick recap of how spectrograms are computed will further show how different they are from images. This recap naturally leads to a specific normalization for spectrogram features. Finally, we talk about Transfer Learning and why we avoid it in this post.

In an image, both axes (height and width) are in the spatial domain and at the same scale. Images are stored as integers in the range of [0, 255]. To normalize them we first divide all pixels by 255, the max possible value, to map them into the range [0, 1]. Then, we find the statistics that approximately center the data with a mean of $0$ and a variance of $1$. The three RGB channels in a color image are normalized separately. If an image is greyscale then we normalize its single channel instead.
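
As a rough sketch of the two steps above, the snippet below scales a tiny made-up RGB image into [0, 1] and then standardizes each channel separately. The pixel values and the helper name `standardize` are invented for illustration, and the statistics come from this single image rather than from a whole training set.

```python
from statistics import fmean, pstdev

# A made-up 2x2 RGB image stored channel-first: image[channel][row][col].
image = [
    [[0, 64], [128, 255]],     # red
    [[10, 20], [30, 40]],      # green
    [[200, 210], [220, 230]],  # blue
]

def standardize(channel):
    "Scale one channel to [0, 1], then give it zero mean and unit variance."
    pixels = [value / 255 for row in channel for value in row]
    mu, sigma = fmean(pixels), pstdev(pixels)
    return [(p - mu) / sigma for p in pixels]

normalized = [standardize(channel) for channel in image]

for name, channel in zip("RGB", normalized):
    print(f"{name}: mean={fmean(channel):+.3f}, std={pstdev(channel):.3f}")
```

Each channel ends up centered at zero with unit spread, even though the raw green and blue channels occupy very different parts of the [0, 255] range.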

The axes in a spectrogram are from different domains than the axes in an image. In a spectrogram, the horizontal axis represents time and the vertical axis represents frequency. Each of these quantities has its own scale. The frequency dimension is determined by the size of the FFT window. The time dimension is set by the total length of the signal, the size of the FFT window, and the hop size of the window. You can check the documentation of the torch.stft function for a breakdown of how each axis is determined.
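
To make the two scales concrete, here is a small sketch of how the spectrogram shape follows from these parameters. It assumes a one-sided STFT with centered (padded) frames, which is the default behavior of torch.stft for real-valued input. The helper name `spectrogram_shape` and the example numbers are illustrative.

```python
def spectrogram_shape(n_samples, n_fft, hop_length):
    "Return (frequency bins, time steps) for a one-sided, centered STFT."
    n_bins = n_fft // 2 + 1                 # frequencies from 0 Hz up to Nyquist
    n_frames = 1 + n_samples // hop_length  # one frame every `hop_length` samples
    return n_bins, n_frames

# One second of audio at 16 kHz with a 512-sample window and a hop of 256.
print(spectrogram_shape(n_samples=16_000, n_fft=512, hop_length=256))
```

Under these assumptions the window size only affects the frequency axis, while the clip length and hop size only affect the time axis.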

To be more specific, a spectrogram is actually the log of the power spectrum. Below we give a quick recap of how the spectrogram is computed to show how much it differs from images.

If $\text{x}$ is our input audio then the STFT returns the spectrum: $$\text{spectrum} = \text{STFT}(\text{x})$$ We are more interested in the energy or power of the signal, so we take the absolute value of the STFT and square it:
$$\text{power spectrum} = |\text{STFT}(\text{x})|^2$$ We cannot use the power spectrum as a feature because it has a few strong peaks and many small values. You can check this other fantastic post on spectrogram normalization to learn why this is a problem. Taking the log of the power spectrum spreads out the values and makes them better features. This becomes the spectrogram: $$\text{spectrogram} = \log(|\text{STFT}(\text{x})|^2)$$ The range of the log function is $-\infty$ to $+\infty$ which is clearly different than the integers from 0 to 255 in an image.
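
The effect of the log is easy to see on a single frame. The sketch below uses a naive DFT from the standard library instead of a real STFT, a made-up test signal with one loud and one quiet tone, and a small constant `eps` (an assumption, not part of the formulas above) to keep the log finite when a bin is empty.

```python
import cmath
import math

def one_sided_dft(frame):
    "Naive DFT of a real frame, keeping bins 0 .. n/2."
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]

n = 64
# A loud tone in bin 8 plus a tone 100x quieter (in amplitude) in bin 20.
frame = [math.sin(2 * math.pi * 8 * t / n) + 0.01 * math.sin(2 * math.pi * 20 * t / n)
         for t in range(n)]

power = [abs(c) ** 2 for c in one_sided_dft(frame)]
eps = 1e-10  # assumed floor that keeps log() finite for empty bins
log_power = [math.log(p + eps) for p in power]

print(f"power:     loud={power[8]:.1f}, quiet={power[20]:.4f}")
print(f"log power: loud={log_power[8]:.2f}, quiet={log_power[20]:.2f}")
```

In raw power the two tones differ by a factor of 10,000, so the quiet one is practically invisible next to the peak. After the log they sit roughly nine units apart, and the log values are negative wherever the power is below 1.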

A spectrogram transformation can also be thought of as a very simple "channelizer" in Digital Signal Processing (DSP) terms. That is a fancy way of saying that it splits the continuous frequency spectrum of a signal into discrete bins, or channels. For example, consider a spectrogram computed with a 512-point FFT from a signal sampled at 16 kHz. Each FFT bin, or channel, has a "bandwidth" of $$16 \ \text{kHz} \ \ / \ \ 512 \ \text{bins} = 31.25 \ \text{Hz per bin}$$ Because audio is real-valued, only the first 257 of these bins (from 0 Hz up to the 8 kHz Nyquist frequency) are unique, so those are the channels that end up in the spectrogram.
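
The arithmetic above can be checked in a few lines. The sketch treats 512 as the FFT size and also lists the bin center frequencies. For real-valued audio only the bins up to the Nyquist frequency are unique. The variable names are illustrative.

```python
sample_rate = 16_000  # Hz
n_fft = 512           # FFT size (assumed to be what "512 bins" refers to)

bin_width = sample_rate / n_fft   # spacing between neighboring bins, in Hz
n_unique_bins = n_fft // 2 + 1    # one-sided spectrum of a real signal
bin_centers = [k * bin_width for k in range(n_unique_bins)]

print(f"{bin_width} Hz per bin, {n_unique_bins} unique bins")
print(f"lowest bin: {bin_centers[0]} Hz, highest bin: {bin_centers[-1]} Hz")
```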

Spectrogram channels are very different from the image channels we are used to. So it raises the question: should we normalize the entire spectrogram "image" with a single, global value? Or should we normalize each spectrogram channel just like the channels in an image? In the rest of this post, we compare global and channel-based spectrogram normalizations on a real-world dataset to find which is better.
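
To make the two options concrete, here is a minimal sketch on a made-up spectrogram with three frequency bins and four time steps. The values and the helper names `normalize_global` and `normalize_per_channel` are invented for illustration. A real pipeline would use statistics gathered from the training set rather than from one example.

```python
from statistics import fmean, pstdev

# Toy log-power spectrogram: spec[bin][time_step].
spec = [
    [-2.0, -1.0, -2.0, -1.0],      # loud low-frequency bin
    [-9.0, -8.0, -9.5, -8.5],      # quieter mid bin
    [-20.0, -19.0, -21.0, -20.0],  # near-silent high bin
]

def normalize_global(spec):
    "One (mean, std) pair for the whole spectrogram."
    values = [v for row in spec for v in row]
    mu, sigma = fmean(values), pstdev(values)
    return [[(v - mu) / sigma for v in row] for row in spec]

def normalize_per_channel(spec):
    "A separate (mean, std) pair for every frequency bin."
    out = []
    for row in spec:
        mu, sigma = fmean(row), pstdev(row)
        out.append([(v - mu) / sigma for v in row])
    return out

global_norm = normalize_global(spec)
channel_norm = normalize_per_channel(spec)

# Global normalization keeps the level differences between bins...
print([round(fmean(row), 2) for row in global_norm])
# ...while per-channel normalization centers every bin at zero.
print([round(fmean(row), 2) for row in channel_norm])
```

The trade-off is visible even in this toy case: the global version preserves how loud each bin is relative to the others, and the per-channel version discards that in exchange for putting every bin on the same scale.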

A quick note on Transfer Learning

We also have to talk about Transfer Learning in the context of normalization. In Transfer Learning, it is best-practice to normalize the new dataset with the statistics from the old dataset. This makes sure that the new network inputs are at the same scale as the original inputs. Since most pretrained vision models were trained on ImageNet, we normalize any new inputs with ImageNet statistics.
However, we avoid Transfer Learning in this post and instead train an 18-layer xResNet from scratch. The reason is that pretrained image models operate at a completely different scale than spectrograms. And the main goal here is to learn our own scalings instead!
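
For reference, this is roughly what normalizing with ImageNet statistics looks like. The means and standard deviations below are the commonly published per-channel ImageNet values for inputs already scaled to [0, 1]. The example inputs and the helper name `imagenet_normalize` are made up.

```python
# Commonly published ImageNet statistics per RGB channel (inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def imagenet_normalize(pixel):
    "Normalize one (r, g, b) value triple with the ImageNet statistics."
    return tuple((value - mu) / sigma
                 for value, mu, sigma in zip(pixel, IMAGENET_MEAN, IMAGENET_STD))

# A mid-gray image pixel lands close to zero, as intended.
print(imagenet_normalize((0.5, 0.5, 0.5)))

# A log-power value such as -20 lands far outside that range.
print(imagenet_normalize((-20.0, -20.0, -20.0)))
```

The second result illustrates the scale mismatch: statistics tuned for pixel intensities do not center spectrogram values at all.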

Downloading a sample dataset

To keep things practical, we will apply these spectrogram normalization techniques to a sound classification challenge hosted by fastaudio. fastaudio is a community extension of the fastai library to make audio tasks with neural networks more accessible.
The challenge here is to classify sounds in the ESC-50 dataset, where ESC-50 stands for "Environmental Sound Classification with 50 classes". This dataset has many different types of sounds which show how varied audio spectrograms can be.

Many of the lines below are based on the fastaudio baseline results notebook.

The ESC-50 dataset

The first step is to download the data. ESC-50 is already included in fastaudio so we can grab it with untar_data.

from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *

# already in fastaudio, can download with fastai's `untar_data`
path = untar_data(URLs.ESC50)

The downloaded audio files are inside the aptly named audio folder. Below we use the ls method, a fastai addition to Python's pathlib.Path, to check the contents of this folder.

wavs = (path/"audio").ls()
wavs
(#2000) [Path('/home/titan2/2A/cck/fastai/data/master/audio/3-68630-A-40.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/5-260433-A-39.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/5-188796-A-45.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/1-57318-A-13.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/4-141365-A-18.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/1-9886-A-49.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/3-71964-C-4.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/5-201172-A-46.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/3-253084-E-2.wav'),Path('/home/titan2/2A/cck/fastai/data/master/audio/1-91359-A-11.wav')...]

The output of ls shows 2,000 audio files. But the filenames are not very descriptive, so how do we know what is actually in each one?
Thankfully, as with many datasets, the download includes a table with more information about the data (aka metadata).

# read the audio metadata and show the first few rows
df = pd.read_csv(path/"meta"/"esc50.csv")
df.head()
   filename           fold  target  category        esc10  src_file  take
0  1-100032-A-0.wav   1     0       dog             True   100032    A
1  1-100038-A-14.wav  1     14      chirping_birds  False  100038    A
2  1-100210-A-36.wav  1     36      vacuum_cleaner  False  100210    A
3  1-100210-B-36.wav  1     36      vacuum_cleaner  False  100210    B
4  1-101296-A-19.wav  1     19      thunderstorm    False  101296    A

The key information in this table is in the filename and category columns:

  • filename gives the name of a file inside the audio folder.
  • category tells us which class a file belongs to.
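
The file names are also less opaque than they first appear: in the rows above, each name matches the pattern fold-src_file-take-target.wav. Here is a small sketch that splits a name back into those fields. The helper name `parse_esc50_name` is hypothetical, and the pattern is inferred from the table shown here.

```python
def parse_esc50_name(filename):
    "Split an ESC-50 file name into its fold, source file, take, and target."
    stem = filename.rsplit(".", 1)[0]
    fold, src_file, take, target = stem.split("-")
    return {"fold": int(fold), "src_file": int(src_file),
            "take": take, "target": int(target)}

print(parse_esc50_name("1-100032-A-0.wav"))
print(parse_esc50_name("1-100210-B-36.wav"))
```

The name only carries the numeric target, so the metadata table is still needed to map it to a readable category.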

The last file in the data directory will be our working example for normalization. We can index into the metadata table above using this file's name to learn more about it.

# pick the row where "filename" matches the file's "name".
df.loc[df.filename == wavs[-1].name]
      filename           fold  target  category  esc10  src_file  take
1826  5-216213-A-13.wav  5     13      crickets  False  216213    A

This is a recording of crickets!
We can load this file with the AudioTensor class in fastaudio. Its create function reads the audio samples straight into a torch.Tensor.

# create an AudioTensor from a file path
sample = AudioTensor.create(wavs[-1])

An AudioTensor can plot and even play the audio with its show method.

print(f'Audio shape [channels, samples]: {sample.shape}')
sample.show()
Audio shape [channels, samples]: torch.Size([1, 220500])

Each "burst" in the plot above is a cricket chirp. There are three full chirps and the early starts of a fourth chirp.

Normalizing an audio waveform

The first step is normalizing the audio waveform itself. We give it a mean of zero and unit variance in the usual way:

$$\text{audio}_{\text{norm}} = \frac{\text{audio} - \operatorname{mean}(\text{audio})}{\operatorname{std}(\text{audio})}$$

# normalize the waveform
norm_sample = (sample - sample.mean()) / sample.std()

Let's check if the mean is roughly $0$ and the variance is roughly $1$:

# checking the mean
print(f'Original audio mean:   {sample.mean()}')
print(f'Normalized audio mean: {norm_sample.mean()}')
Original audio mean:   -1.781299215508625e-05
Normalized audio mean: 2.876160698495056e-10
# checking the variance
print(f'Original audio variance:   {sample.var()}')
print(f'Normalized audio variance: {norm_sample.var()}')
Original audio variance:   0.008586421608924866
Normalized audio variance: 1.0

Success! The waveform is normalized.

For convenience later on, we define the AudioNormalize transform to normalize waveforms in a fastai training loop.

class AudioNormalize(Transform):
    "Normalizes a single `AudioTensor`."
    def encodes(self, x:AudioTensor): return (x-x.mean()) / x.std()
# checking if the Transform normalized the waveform
wav_norm = AudioNormalize()
norm_sample = wav_norm(sample)
print(f'Audio mean after transform: {norm_sample.mean()}')
print(f'Audio variance after transform: {norm_sample.var()}')
Audio mean after transform: 2.876160698495056e-10
Audio variance after transform: 1.0
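
One practical caveat, sketched here in plain Python rather than with fastaudio: a completely silent clip has a standard deviation of zero, so the division above would produce NaN values in PyTorch. A common guard is to add a small constant to the denominator. The function name `safe_normalize` and the value of `eps` are assumptions for illustration.

```python
from statistics import fmean, pstdev

def safe_normalize(samples, eps=1e-8):
    "Zero-mean, unit-variance normalization that tolerates constant input."
    mu, sigma = fmean(samples), pstdev(samples)
    return [(s - mu) / (sigma + eps) for s in samples]

silence = [0.0] * 8
chirp = [0.0, 0.5, -0.5, 0.25, -0.25, 0.1, -0.1, 0.0]

print(safe_normalize(silence))        # stays all zeros instead of failing
print(pstdev(safe_normalize(chirp)))  # very close to 1
```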

Extracting spectrograms from audio

The next step is to extract a spectrogram from the normalized audio. We can do this with the AudioToSpec class in fastaudio. This class takes an AudioTensor as input and, as we might expect, returns an AudioSpectrogram.

# create a fastaudio Transform to convert audio into spectrograms
cfg = AudioConfig.BasicSpectrogram() # with default torchaudio parameters
audio2spec = AudioToSpec.from_cfg(cfg)

# extract the spectrogram
spec = audio2spec(norm_sample)

The show method of the AudioSpectrogram is a great, quick way to plot the spectrogram.

print(f'Spectrogram shape [channels, bins, time_steps]: {spec.shape}')
spec.show()
Spectrogram shape [channels, bins, time_steps]: torch.Size([1, 201, 1103])

The colorbar on the right, which shows the power in the signal, is especially helpful because matplotlib by default stretches the values in each plot across the full color range. Without this colorbar, it is impossible to know or even guess the specific values in a spectrogram plot.

Finding spectrogram normalization stats

To get the normalization stats, we have to step through the training set and find the mean and standard deviation of each mini-batch. Then we combine all the mini-batch statistics into a single pair of $(\mu, \sigma)$ normalization statistics. Note that normalization statistics must always come from the training set only: computing them on validation or test data would leak information about that data into training.

One small detail: if your training dataset is large enough it is not necessary to go through the whole set. Sampling 10% to 20% of the dataset can be enough for accurate statistics. However, since ESC-50 is small we find $(\mu, \sigma)$ from the whole set.

To accumulate these statistics over mini-batches we can borrow and slightly refactor a class from this very helpful post. The StatsRecorder class below tracks the mean and standard deviation across mini-batches.

class StatsRecorder:
    def __init__(self, red_dims=(0,2,3)):
        """Accumulates normalization statistics across mini-batches."""
        self.red_dims = red_dims # which mini-batch dimensions to average over
        self.nobservations = 0   # running number of observations

    def update(self, data):
        """
        data: mini-batch tensor, e.g. shape (batch, channels, bins, time_steps)
        """
        # initialize stats and dimensions on first batch
        if self.nobservations == 0:
            self.mean = data.mean(dim=self.red_dims, keepdim=True)
            self.std  = data.std(dim=self.red_dims, keepdim=True)
            self.nobservations = data.shape[0]
            self.ndimensions   = data.shape[1]
        else:
            if data.shape[1] != self.ndimensions:
                raise ValueError('Data dims do not match previous observations.')

            # find mean and std of the new mini-batch
            newmean = data.mean(dim=self.red_dims, keepdim=True)
            newstd  = data.std(dim=self.red_dims, keepdim=True)

            # number of observations seen so far (m) and in this batch (n)
            m = self.nobservations * 1.0
            n = data.shape[0]

            # update running statistics
            tmp = self.mean
            self.mean = m/(m+n)*tmp + n/(m+n)*newmean
            self.std  = m/(m+n)*self.std**2 + n/(m+n)*newstd**2 +\
                        m*n/(m+n)**2 * (tmp - newmean)**2
            self.std  = torch.sqrt(self.std)

            # update total number of seen samples
            self.nobservations += n

By default StatsRecorder reduces over the batch, height, and width dimensions, which leaves one pair of statistics per image channel (grayscale or RGB). The default red_dims of (0,2,3) might look familiar from normalization code in other Computer Vision tasks (it matches the default axes of Normalize in fastai).
To get one pair of statistics per spectrogram channel (frequency bin) instead, we only need to pass a different red_dims.
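
The update rule inside StatsRecorder can be checked on plain numbers. The sketch below applies the same formulas to two made-up mini-batches of scalars and compares the result with the statistics of the combined data. It uses the population standard deviation, for which the rule is exact. The function name `combine` is illustrative.

```python
import math
from statistics import fmean, pstdev

def combine(m, mean_a, std_a, n, mean_b, std_b):
    "Merge the (count, mean, std) of two batches into the stats of their union."
    total = m + n
    mean = m / total * mean_a + n / total * mean_b
    var = (m / total * std_a**2 + n / total * std_b**2
           + m * n / total**2 * (mean_a - mean_b)**2)
    return mean, math.sqrt(var)

batch_a = [1.0, 2.0, 3.0, 4.0]
batch_b = [10.0, 12.0, 14.0]

mean, std = combine(len(batch_a), fmean(batch_a), pstdev(batch_a),
                    len(batch_b), fmean(batch_b), pstdev(batch_b))

print(mean, fmean(batch_a + batch_b))
print(std, pstdev(batch_a + batch_b))
```

The last term accounts for the gap between the two batch means. Without it, averaging the per-batch variances alone would underestimate the overall spread. Note that torch's std applies Bessel's correction by default, so the running value in StatsRecorder is a close approximation rather than an exact match.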

Building the dataset loader

The setup below follows the fastaudio ESC-50 baseline to step through the training dataset. It is worth mentioning that the files in ESC-50 are sampled at 44.1 kHz, but fastaudio will resample them to 16 kHz by default. Downsampling like this risks throwing away some information. But keeping the higher sampling rate almost triples the "width" (aka time) of the spectrogram. This larger image takes up more memory on the GPU and limits our batch size and architecture choices. We keep this downsampling since it gives the spectrograms a very reasonable shape of [201, 401], compared with the much larger shape of [201, 1103] if we don't downsample.
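
A back-of-the-envelope calculation shows what the two shapes mean for memory. This sketch only counts the input tensors for one mini-batch and assumes single-channel float32 spectrograms with a batch size of 64. Activations inside the network grow with the input size as well. The helper name `batch_megabytes` is made up.

```python
def batch_megabytes(bins, time_steps, batch_size=64, bytes_per_value=4):
    "Size of one mini-batch of single-channel float32 spectrograms, in MB."
    return batch_size * bins * time_steps * bytes_per_value / 1e6

resampled = batch_megabytes(201, 401)   # 16 kHz audio
original = batch_megabytes(201, 1103)   # 44.1 kHz audio

print(f"16 kHz:   {resampled:.1f} MB per batch")
print(f"44.1 kHz: {original:.1f} MB per batch")
print(f"ratio:    {original / resampled:.2f}x")
```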

def CrossValidationSplitter(col='fold', fold=1):
    "Split `items` (supposed to be a dataframe) by fold in `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColSplitter only works when your items are a pandas DataFrame"
        col_values = o.iloc[:,col] if isinstance(col, int) else o[col]
        valid_idx = (col_values == fold).values.astype('bool')
        return IndexSplitter(mask2idxs(valid_idx))(o)
    return _inner
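
Stripped of the fastai helpers, the splitter boils down to the following. This is a plain-Python sketch with made-up fold labels, and `split_by_fold` is a hypothetical name.

```python
def split_by_fold(folds, valid_fold=1):
    "Return (train indices, validation indices) given one fold label per item."
    train_idx = [i for i, f in enumerate(folds) if f != valid_fold]
    valid_idx = [i for i, f in enumerate(folds) if f == valid_fold]
    return train_idx, valid_idx

folds = [1, 1, 2, 3, 4, 5, 5]  # made-up `fold` column
train_idx, valid_idx = split_by_fold(folds, valid_fold=1)
print(train_idx, valid_idx)
```

Every item whose fold matches the chosen one goes to validation, and everything else is used for training.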

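auds = DataBlock(blocks=(AudioBlock, CategoryBlock),
                 get_x=ColReader("filename", pref=path/"audio"),
                 get_y=ColReader("category"),               # class label from the metadata table
                 splitter=CrossValidationSplitter(fold=1),  # hold out fold 1 for validation
                 item_tfms = [AudioNormalize],
                 batch_tfms = [audio2spec])
dbunch = auds.dataloaders(df, bs=64)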