Normalizing spectrograms for Deep Learning

deep learning
spectrogram normalizations



August 20, 2022

How to normalize spectrograms

Scaling spectrograms for classification tasks with neural networks.

NOTE: under heavy construction


Spectrograms are often used as images to train deep neural networks for audio tasks. By treating spectrograms as images, we can borrow from the many powerful ideas in image recognition with deep learning. A spectrogram, however, is fundamentally different than natural images as we will see below. That brings up the central question of this post: how should spectrograms be normalized during training?

This post assumes some familiarity with deep learning and signal processing concepts like the FFT. It is also a light introduction to the fastaudio library.

import matplotlib.pyplot as plt
import numpy as np

Transforming audio into two dimensions

Image classification is a challenging task that was previously done with expert, handcrafted features. Now, features are automatically learned from labeled data instead. The success of these learned features has completely shifted the paradigm of Computer Vision. We would ideally like to apply these same, proven techniques on audio tasks.

However, audio is treated like a one dimensional signal in most Machine Learning applications. That means raw audio is unusable with 2-D Convolutional Neural Networks (CNNs), which are the workhorses of modern image recognition. If we could somehow represent audio in two dimensions, like an image, then we could leverage the successful approaches in image classification.

Thankfully there are many ways of transforming audio into two dimensions. The most popular one is turning audio into a spectrogram. As an example, the image below shows the spectrogram of this violin recording taken from Wikipedia.

The spectrogram of a violin recording

The spectrogram is a 2-D signal representation in time and frequency, so we can use it with 2-D CNNs! But first it is crucial to preprocess and normalize the spectrograms. Neural networks have a much easier time learning when their inputs are normalized.

For natural images, normalization uses an estimated mean (\(\mu\)) and standard deviation (\(\sigma\)) as follows:
- Subtract \(\mu\) from the image values to give them a mean of \(0\).
- Divide the image values by \(\sigma\) to give them a variance of \(1\).

In math terms, if \(x\) is our image then \(x_{\text{norm}}\) is: \[x_{\text{norm}} = \frac{(x - \mu)}{\sigma}\]

Since spectrograms are fundamentally different than natural images, we should reevaluate if this same normalization makes sense.

Why spectrograms are not images and how to normalize them

Now we can describe what makes spectrograms different from natural images. We start with a high-level overview of images and their normalization, then do the same for spectrograms. A quick recap of how spectrograms are computed will further show how different they are from images. This recap naturally leads to a specific normalization for spectrogram features. Finally, we talk about Transfer Learning and why we avoid it in this post.

In an image, both axes (height and width) are in the spatial domain and at the same scale. Images are stored as integers in the range of [0, 255]. To normalize them we first divide all pixels by 255, the max possible value, to map them into the range [0, 1]. Then, we find the statistics that approximately center the data with a mean of \(0\) and a variance of \(1\). The three RGB channels in a color image are normalized separately. If an image is greyscale then we normalize its single channel instead.
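The image recipe above can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the random batch and its `[batch, channels, height, width]` layout are made up for the example and do not come from any dataset in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mini-batch of 8 RGB images, 32x32, stored as integers in [0, 255]
images = rng.integers(0, 256, size=(8, 3, 32, 32), dtype=np.uint8)

# Step 1: divide by the max possible value to map pixels into [0, 1]
scaled = images.astype(np.float32) / 255.0

# Step 2: one mean and one standard deviation per colour channel
mu = scaled.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
sigma = scaled.std(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)

# Step 3: x_norm = (x - mu) / sigma, broadcast over batch, height and width
normed = (scaled - mu) / sigma

print(normed.mean(axis=(0, 2, 3)))  # roughly 0 for each channel
print(normed.std(axis=(0, 2, 3)))   # roughly 1 for each channel
```

In practice the statistics would be estimated once on the training set and then reused, rather than recomputed per batch as in this toy example.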

The axes in a spectrogram are from different domains than the axes in an image. In a spectrogram, the horizontal axis represents time and the vertical axis represents frequency. Each of these quantities has its own scale. The frequency dimension is determined by the size of the FFT window. The time dimension is set by the total length of the signal, the size of the FFT window, and the hop size of the window. You can check the documentation of the torch.stft function for a breakdown of how each axis is determined.

To be more specific, a spectrogram is actually the log of the power spectrum. Below we give a quick recap of how the spectrogram is computed to show how much it differs from images.

If \(\text{x}\) is our input audio then the STFT returns the spectrum:
\[\text{spectrum} = \text{STFT(x)}\]
We are more interested in the energy or power of the signal, so we take the absolute value of the STFT and square it:
\[\text{powerSpectrum} = |\text{STFT(x)}|^2\]
We cannot use the power spectrum as a feature because it has a few strong peaks and many small values. You can check this other fantastic post on spectrogram normalization to learn why this is a problem. Taking the log of the power spectrum spreads out the values and makes them better features. This becomes the spectrogram:
\[\text{spectrogram} = \log(|\text{STFT(x)}|^2)\]
The range of the log function is \(-\infty\) to \(+\infty\) which is clearly different than the integers from 0 to 255 in an image.
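To make the three steps concrete, here is a minimal NumPy sketch of the same pipeline on a synthetic tone. It is not how fastaudio or torchaudio compute spectrograms. The window type, `n_fft=400`, `hop=200`, the lack of padding, and the small constant added before the log are all assumptions chosen for the example.

```python
import numpy as np

def power_spectrogram(x, n_fft=400, hop=200):
    """|STFT(x)|^2 of a 1-D signal, returned as [bins, frames] (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)  # complex STFT, one row per frame
    return (np.abs(spectrum) ** 2).T

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
# one second of a 440 Hz tone with a little background noise
x = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(sample_rate)

power = power_spectrogram(x)
# the small constant guards against log(0) in perfectly silent bins
spectrogram = np.log(power + 1e-10)

print(power.shape)                            # (201, 79)
print(power.max() / np.median(power))         # a few peaks dwarf everything else
print(spectrogram.max() - spectrogram.min())  # the log compresses that spread
```

The ratio between the largest and the median power value shows the "few strong peaks, many small values" problem, and the last line shows how much narrower the spread becomes after the log.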

A spectrogram transformation can also be thought of as a very simple “channelizer” in Digital Signal Processing (DSP) terms. That is a fancy way of saying that it splits the continuous frequency spectrum of a signal into discrete bins, or channels. For example, consider taking a spectrogram with 512 bins from a signal sampled at 16 kHz. This spectrogram will have 512 channels where each channel has a “bandwidth” of \[16 \ \text{kHz} \ \ / \ \ 512 \ \text{bins} = 31.25 \ \text{Hz per bin}\]
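The bandwidth arithmetic can be checked with NumPy's FFT helpers. One detail worth hedging: for a real-valued signal only the non-negative frequencies are unique, so a one-sided transform keeps `n_fft // 2 + 1` bins between 0 Hz and the Nyquist frequency. The spacing between bins is unchanged.

```python
import numpy as np

sample_rate = 16_000  # Hz
n_fft = 512           # FFT size

# width of each frequency channel
bin_width = sample_rate / n_fft
print(bin_width)  # 31.25

# centre frequencies of the one-sided (real-input) FFT bins
freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)
print(len(freqs))                               # 257
print(np.isclose(freqs[1] - freqs[0], bin_width))
print(freqs[-1])                                # the Nyquist frequency, about 8 kHz
```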

Spectrogram channels are very different from the image channels we are used to. So it raises the question: should we normalize the entire spectrogram “image” with a single, global value? Or should we normalize each spectrogram channel just like the channels in an image? In the rest of this post, we compare global and channel-based spectrogram normalizations on a real-world dataset to find which is better.

A quick note on Transfer Learning

We also have to talk about Transfer Learning in the context of normalization. In Transfer Learning, it is best practice to normalize the new dataset with the statistics from the old dataset. This makes sure that the new network inputs are at the same scale as the original inputs. Since most pretrained vision models were trained on ImageNet, we normalize any new inputs with ImageNet statistics.
However, we avoid Transfer Learning in this post and instead train an 18-layer xResNet from scratch. The reason is that pretrained image models operate at a completely different scale than spectrograms. And the main goal here is to learn our own scalings instead!

Downloading a sample dataset

To keep things practical, we will apply these spectrogram normalization techniques to a sound classification challenge hosted by fastaudio. fastaudio is a community extension of the fastai library to make audio tasks with neural networks more accessible.
The challenge here is to classify sounds in the ESC-50 dataset, where ESC-50 stands for “Environment Sound Classification with 50 classes”. This dataset has many different types of sounds which show how varied audio spectrograms can be.

Many of the lines below are based on the fastaudio baseline results notebook.

The ESC-50 dataset

The first step is to download the data. ESC-50 is already included in fastaudio so we can grab it with untar_data.

# from import *
# from fastaudio.core.all import *
# from fastaudio.augment.all import *

# already in fastaudio, can download with fastai's `untar_data`
# path = untar_data(URLs.ESC50)

The downloaded audio files are inside the aptly named audio folder. Below we use the ls method, a fastai addition to python’s pathlib.Path, to check the contents of this folder.

# wavs = (path/"audio").ls()
# wavs

The output of ls shows 2,000 audio files. But the filenames are not very descriptive, so how do we know what is actually in each one?
Thankfully, as with many datasets, the download includes a table with more information about the data (aka metadata).

# # read the audio metadata and show the first few rows
# df = pd.read_csv(path/"meta"/"esc50.csv")
# df.head()

The key info from this table is in the filename and category columns.
filename gives the name of a file inside of the audio folder.
category tells us which class a file belongs to.

The last file in the data directory will be our working example for normalization. We can index into the metadata table above using this file’s name to learn more about it.

# # pick the row where "filename" matches the file's "name".
# df.loc[df.filename == wavs[-1].name]

This is a recording of crickets!
We can load this file with the AudioTensor class in fastaudio. Its create function reads the audio samples straight into a torch.Tensor.

# # create an AudioTensor from a file path
# sample = AudioTensor.create(wavs[-1])

An AudioTensor can plot and even play the audio with its show method.

# print(f'Audio shape [channels, samples]: {sample.shape}')

Each “burst” in the plot above is a cricket chirp. There are three full chirps and the early start of a fourth chirp.

Normalizing an audio waveform

The first step is normalizing the audio waveform itself. We give it a mean of zero and unit variance in the usual way:

\[\text{normedAudio} = \frac{\text{audio} - mean(\text{audio})}{std(\text{audio})} \]

# # normalize the waveform
# norm_sample = (sample - sample.mean()) / sample.std()

Let’s check if the mean is roughly \(0\) and the variance is roughly \(1\):

# # checking the mean
# print(f'Original audio mean:   {sample.mean()}')
# print(f'Normalized audio mean: {norm_sample.mean()}')
# # checking the variance
# print(f'Original audio variance:   {sample.var()}')
# print(f'Normalized audio variance: {norm_sample.var()}')

Success! The waveform is normalized.
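For readers without fastaudio installed, the same check can be run on a synthetic waveform with NumPy. The `eps` term is an addition that is not part of the code in this post. It is a common guard so that a completely silent clip does not cause a division by zero.

```python
import numpy as np

def normalize_waveform(audio, eps=1e-8):
    """Zero-mean, unit-variance scaling of a 1-D waveform."""
    return (audio - audio.mean()) / (audio.std() + eps)

rng = np.random.default_rng(0)
# stand-in for a five second recording at 16 kHz: a DC offset plus quiet noise
audio = 0.3 + 0.05 * rng.standard_normal(80_000)
normed = normalize_waveform(audio)

print(f"Original mean / variance:   {audio.mean():.4f} / {audio.var():.4f}")
print(f"Normalized mean / variance: {normed.mean():.4f} / {normed.var():.4f}")

# a silent clip stays at zero instead of turning into NaNs
silence = np.zeros(1000)
print(normalize_waveform(silence).max())  # 0.0
```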

For convenience later on, we define the AudioNormalize transform to normalize waveforms in a fastai training loop.

# class AudioNormalize(Transform):
#     "Normalizes a single `AudioTensor`."
#     def encodes(self, x:AudioTensor): return (x-x.mean()) / x.std()
# # checking if the Transform normalized the waveform
# wav_norm = AudioNormalize()
# norm_sample = wav_norm(sample)
# print(f'Audio mean after transform: {norm_sample.mean()}')
# print(f'Audio variance after transform: {norm_sample.var()}')

Extracting spectrograms from audio

The next step is to extract a spectrogram from the normalized audio. We can do this with the AudioToSpec class in fastaudio. This class takes an AudioTensor as input and, as we might expect, returns an AudioSpectrogram.

# # create a fastaudio Transform to convert audio into spectrograms
# cfg = AudioConfig.BasicSpectrogram() # with default torchaudio parameters
# audio2spec = AudioToSpec.from_cfg(cfg)

# # extract the spectrogram
# spec = audio2spec(norm_sample)

The show method of the AudioSpectrogram is a great, quick way to plot the spectrogram.

# print(f'Spectrogram shape [channels, bins, time_steps]: {spec.shape}')

The colorbar on the right showing the power in the signal is especially helpful since matplotlib always scales the values in a plot to the same color range. Without this colorbar, it is impossible to know or even guess the specific values in a spectrogram plot.

Finding spectrogram normalization stats

To get the normalization stats, we have to step through the training set and find the mean and standard deviation of each mini-batch. Then we combine all the mini-batch statistics into a single pair of \((\mu, \sigma)\) normalization statistics. Note that normalization statistics must always come from the training set. This is a crucial place to avoid data leakage.

One small detail: if your training dataset is large enough it is not necessary to go through the whole set. Sampling 10% to 20% of the dataset can be enough for accurate statistics. However, since ESC-50 is small we find \((\mu, \sigma)\) from the whole set.

To accumulate these statistics over mini-batches we can borrow and slightly refactor a class from this very helpful post. The StatsRecorder class below tracks the mean and standard deviation across mini-batches.

# class StatsRecorder:
#     def __init__(self, red_dims=(0,2,3)):
#         """Accumulates normalization statistics across mini-batches.
#         ref:
#         """
#         self.red_dims = red_dims # which mini-batch dimensions to average over
#         self.nobservations = 0   # running number of observations

#     def update(self, data):
#         """
#         data: torch.Tensor mini-batch, e.g. shape [batch, channels, bins, time_steps]
#         """
#         # initialize stats and dimensions on first batch
#         if self.nobservations == 0:
#             self.mean = data.mean(dim=self.red_dims, keepdim=True)
#             self.std  = data.std (dim=self.red_dims,keepdim=True)
#             self.nobservations = data.shape[0]
#             self.ndimensions   = data.shape[1]
#         else:
#             if data.shape[1] != self.ndimensions:
#                 raise ValueError('Data dims do not match previous observations.')
#             # find mean of new mini batch
#             newmean = data.mean(dim=self.red_dims, keepdim=True)
#             newstd  = data.std(dim=self.red_dims, keepdim=True)
#             # update number of observations
#             m = self.nobservations * 1.0
#             n = data.shape[0]

#             # update running statistics
#             tmp = self.mean
#             self.mean = m/(m+n)*tmp + n/(m+n)*newmean
#             self.std  = m/(m+n)*self.std**2 + n/(m+n)*newstd**2 +\
#                         m*n/(m+n)**2 * (tmp - newmean)**2
#             self.std  = torch.sqrt(self.std)
#             # update total number of seen samples
#             self.nobservations += n

By default StatsRecorder reduces over the batch, height, and width dimensions, which leaves one statistic per image channel (grayscale or RGB). The red_dims argument might look familiar from normalization code in other Computer Vision tasks (including the Normalize in fastai).
To get one statistic per spectrogram channel (frequency bin) instead, we only need to pass a different red_dims.
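The running update inside StatsRecorder is the standard rule for pooling the mean and variance of two groups of samples. The NumPy sketch below is a scalar version of that rule, with hypothetical names, checked against statistics computed on all the data at once. It uses the population variance (`ddof=0`), for which the identity is exact. `torch.std` applies Bessel's correction by default, so the recorder's result can differ very slightly.

```python
import numpy as np

def merge_stats(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Pool the mean and population variance of two groups of samples."""
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    # within-group variance plus the spread between the two group means
    var = (n_a * var_a + n_b * var_b) / n \
        + n_a * n_b / n**2 * (mean_a - mean_b) ** 2
    return mean, var, n

rng = np.random.default_rng(0)
data = rng.normal(loc=-20.0, scale=5.0, size=1000)
# three uneven "mini-batches" of 300, 150 and 550 samples
batches = np.array_split(data, [300, 450])

mean, var, n = batches[0].mean(), batches[0].var(), len(batches[0])
for b in batches[1:]:
    mean, var, n = merge_stats(mean, var, n, b.mean(), b.var(), len(b))

print(np.isclose(mean, data.mean()), np.isclose(var, data.var()))  # True True
```

The last term is what separates pooling from simply averaging the per-batch variances: it accounts for batches whose means differ from each other.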

Building the dataset loader

The setup below follows the fastaudio ESC-50 baseline to step through the training dataset. It is worth mentioning that the files in ESC-50 are sampled at 44.1 kHz, but fastaudio will resample them to 16 kHz by default. Downsampling like this risks throwing away some information. But keeping the higher sampling rate almost triples the “width” (aka time) of the spectrogram. This larger image would take up more GPU memory and limit our batch size and architecture choices. We keep this downsampling since it gives the spectrograms a very reasonable shape of [201, 401], compared with the much larger shape of [201, 1103] if we don’t downsample.

# def CrossValidationSplitter(col='fold', fold=1):
#     "Split `items` (supposed to be a dataframe) by fold in `col`"
#     def _inner(o):
#         assert isinstance(o, pd.DataFrame), "ColSplitter only works when your items are a pandas DataFrame"
#         col_values = o.iloc[:,col] if isinstance(col, int) else o[col]
#         valid_idx = (col_values == fold).values.astype('bool')
#         return IndexSplitter(mask2idxs(valid_idx))(o)
#     return _inner

# auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
#                  get_x=ColReader("filename", pref=path/"audio"), 
#                  splitter=CrossValidationSplitter(fold=1),
#                  item_tfms = [AudioNormalize],
#                  batch_tfms = [audio2spec],
#                  get_y=ColReader("category"))
# dbunch = auds.dataloaders(df, bs=64)
# dbunch.show_batch(figsize=(7,7))
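The two spectrogram shapes quoted above can be reproduced with a little arithmetic. The sketch assumes an FFT size of 400, a hop of 200 samples, and frames centered by padding the signal. These are my understanding of the torchaudio defaults and they do reproduce the numbers in this post, but treat them as assumptions rather than a statement about the exact fastaudio configuration.

```python
def spectrogram_shape(n_samples, n_fft=400, hop_length=200):
    """[bins, frames] of a one-sided STFT with centered (padded) frames."""
    n_bins = n_fft // 2 + 1               # unique frequencies of a real signal
    n_frames = 1 + n_samples // hop_length
    return [n_bins, n_frames]

clip_seconds = 5  # every ESC-50 clip is five seconds long

print(spectrogram_shape(clip_seconds * 16_000))  # [201, 401]
print(spectrogram_shape(clip_seconds * 44_100))  # [201, 1103]
```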

Calculating the statistics

Next we make two recorders: one for global statistics and the other for channel-based statistics. Then we step through the training dataset to find both sets of stats.

# # create recorders
# global_stats  = StatsRecorder()
# channel_stats = StatsRecorder(red_dims=(0,1,3))

# # step through the training dataset
# with torch.no_grad():
#     for idx,(x,y) in enumerate(iter(dbunch.train)):
#         # update normalization statistics
#         global_stats.update(x)
#         channel_stats.update(x)
# # parse out both sets of stats
# global_mean,global_std = global_stats.mean,global_stats.std
# channel_mean,channel_std = channel_stats.mean,channel_stats.std

We can check the shape of the statistics to make sure they are correct. For the global statistics, we expect a shape of: [1,1,1,1]. With spectrogram channel normalizations, we expect one value per spectrogram bin for a shape of [1,1,201,1].

# print(f'Shape of global mean: {global_mean.shape}')
# print(f'Shape of global standard dev: {global_std.shape}')
# print(f'Shape of channel mean: {channel_mean.shape}')
# print(f'Shape of channel standard dev: {channel_std.shape}')
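The effect of the two reduction choices is easy to see on a synthetic batch. The NumPy sketch below uses a made-up batch in which every frequency bin sits at a different level. The `normalize` helper is hypothetical and only mirrors the axes passed as red_dims.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mini-batch: [batch, image channels, frequency bins, time steps]
batch = rng.normal(size=(4, 1, 201, 401))
# give every frequency bin its own offset, loosely mimicking a log spectrogram
batch += np.linspace(-40.0, 0.0, 201).reshape(1, 1, 201, 1)

def normalize(x, axes):
    """Normalize `x` with statistics reduced over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / std, mean

global_normed, global_mean = normalize(batch, axes=(0, 2, 3))
channel_normed, channel_mean = normalize(batch, axes=(0, 1, 3))

print(global_mean.shape)   # (1, 1, 1, 1)
print(channel_mean.shape)  # (1, 1, 201, 1)

# mean level of each frequency bin after normalization
bins_global = global_normed.mean(axis=(0, 1, 3))
bins_channel = channel_normed.mean(axis=(0, 1, 3))
print(bins_global.max() - bins_global.min())    # bins still sit at different levels
print(bins_channel.max() - bins_channel.min())  # every bin is centered at zero
```

Global normalization keeps the relative level of each bin, while channel normalization removes it. Which behavior is preferable depends on whether those level differences carry useful information for the task.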

Training with normalizations

Now for the moment of truth. We train with the two different spectrogram normalizations and measure their impact. For this we again follow the fastaudio baseline and train each type of normalization for 20 epochs. The final score is the averaged accuracy of five runs.

# epochs = 20
# num_runs = 5

Transforms to normalize mini-batches

We need to extend the fastai Normalize class in order to use the spectrogram normalization statistics. The reason is type dispatch. fastai normalization uses ImageNet statistics due to the focus on transfer learning with color images. But this ImageNet normalization is only applied on RGB images of the TensorImage class, while AudioSpectrogram subclasses the different TensorImageBase. The solution is to define encodes and decodes for TensorImageBase instead.

# class SpecNormalize(Normalize):
#     "Normalize/denorm batch of `TensorImage`"
#     def encodes(self, x:TensorImageBase): return (x-self.mean) / self.std
#     def decodes(self, x:TensorImageBase):
#         f = to_cpu if x.device.type=='cpu' else noop
#         return (x*f(self.std) + f(self.mean))
# # make global and channel normalizers
# GlobalSpecNorm  = SpecNormalize(global_mean,  global_std,  axes=(0,2,3))
# ChannelSpecNorm = SpecNormalize(channel_mean, channel_std, axes=(0,1,3))
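The type dispatch issue can be illustrated with `functools.singledispatch` from the standard library. This is only an analogy: fastai has its own dispatch mechanism, and the three classes below are empty stand-ins that borrow the fastai and fastaudio names. The point is that a method registered for one subclass is not picked up by a sibling subclass, while a method registered for the shared base class is.

```python
from functools import singledispatch

# empty stand-ins for the real tensor types
class TensorImageBase: ...
class TensorImage(TensorImageBase): ...       # RGB images
class AudioSpectrogram(TensorImageBase): ...  # spectrograms

@singledispatch
def encodes(x):
    return "unchanged"

@encodes.register
def _(x: TensorImage):
    return "normalized"

print(encodes(TensorImage()))       # normalized
print(encodes(AudioSpectrogram()))  # unchanged, spectrograms are skipped

@singledispatch
def spec_encodes(x):
    return "unchanged"

@spec_encodes.register
def _(x: TensorImageBase):
    return "normalized"

print(spec_encodes(AudioSpectrogram()))  # normalized, via the base class
```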

Training helpers

To avoid repeating ourselves, the helper functions below build the dataloaders and run the training loops.
The get_dls function makes it clear which normalization is being applied. The train_loops function repeats training runs a given number of times.

# def get_dls(bs=64, item_tfms=[], batch_tfms=[]):
#     "Get dataloaders with given `bs` and batch/item tfms."
#     auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
#                      get_x=ColReader("filename", pref=path/"audio"), 
#                      splitter=CrossValidationSplitter(fold=1),
#                      item_tfms=item_tfms,   # for waveform normalization
#                      batch_tfms=batch_tfms, # for spectrogram normalization
#                      get_y=ColReader("category"))
#     dls = auds.dataloaders(df, bs=bs)
#     return dls

# def make_xresnet_grayscale(model, n_in=1):
#     "Modifies xresnet `model` for single-channel images." 
#     model[0][0].in_channels = n_in
#     # average weights across input channels to reduce dimension
#     model[0][0].weight = torch.nn.parameter.Parameter(model[0][0].weight.mean(1, keepdim=True))

# def train_loops(dls, name, num_runs=num_runs, epochs=epochs, num_cls=50):
#     "Runs `num_runs` training loops with `dls` for given `epochs`."
#     accuracies = []
#     for i in range(num_runs):
#         # make new grayscale xresnet
#         model = xresnet18(pretrained=False, n_out=num_cls)
#         make_xresnet_grayscale(model, n_in=1)
#         # get learner for this run
#         learn = Learner(dls, model, metrics=[accuracy])
#         # train network and track accuracy
#         learn.fit_one_cycle(epochs)
#         accuracies.append(learn.recorder.values[-1][-1])
#     print(f'Average accuracy for "{name}": {sum(accuracies) / num_runs}')

Baseline performance

Before getting carried away with normalization, we have to first set a baseline without normalizations. This allows us to evaluate the impact of normalization later on. Otherwise there is no way to know if normalization helps at all.

# # data without normalization
# dls = get_dls(batch_tfms=[audio2spec])
# # run training loops
# train_loops(dls, name='No Norm')

Performance with global normalization

Next we normalize each audio waveform and the spectrograms with global, scalar statistics.

# # data with waveform and global normalization
# dls = get_dls(item_tfms=[AudioNormalize],
#               batch_tfms=[audio2spec, GlobalSpecNorm])
# # run training loops
# train_loops(dls, name='Global Norm')

Performance with channel normalization

Finally, we normalize each audio waveform and the spectrograms with channel-based statistics.

# # get data with waveform and channel normalization
# dls = get_dls(item_tfms=[AudioNormalize],
#               batch_tfms=[audio2spec, ChannelSpecNorm])
# # run training loops
# train_loops(dls, name='Channel Norm')


The results are:

| Normalization | Average Accuracy |
|---------------|------------------|
| None          | 0.7110           |
| Global        | 0.7315           |
| Channel       | 0.7144           |

I ran the cells above several times to make sure these patterns held. Overall, global normalization gives a gain of about two percentage points over the baseline (0.7315 vs. 0.7110). Channel-based normalization shows a much smaller benefit (0.7144). While these gains are a good starting point, there are several possible explanations for why they are modest, and each one points us towards other approaches.

For starters, the spectrograms in ESC-50 are very different both within and across classes. In other words the activity in each spectrogram channel changes a lot from sample to sample. A global statistic likely fares better under these unpredictable conditions. If all the audio came from a similar source, like speech, then the per-channel normalization might fare better.

We also process the entire five second files at once, which is a large analysis window by audio standards. This large window means that each sample looks exactly the same in every epoch. If we used a smaller analysis window, say 2 seconds, we could randomly “crop” many spectrogram regions from a single example as a kind of data augmentation. The risk here is grabbing a silent region without any information but still giving it a class label (though an energy threshold can prevent this). Cropping with a smaller analysis window is one way to expose the networks to more samples and variability.
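A random crop with an energy check could look like the NumPy sketch below. Everything here is hypothetical: the function name, the use of the mean log-power as the "energy", the threshold value, and the retry limit are illustrative choices and not part of fastaudio.

```python
import numpy as np

def random_time_crop(spec, crop_frames, rng, min_energy=None, max_tries=10):
    """Randomly crop `crop_frames` time steps from a [bins, frames] spectrogram.

    If `min_energy` is given, retry up to `max_tries` times until the crop's
    mean value reaches the threshold; the last attempt is returned regardless.
    """
    n_frames = spec.shape[1]
    crop = spec[:, :crop_frames]
    for _ in range(max_tries):
        start = rng.integers(0, n_frames - crop_frames + 1)
        crop = spec[:, start:start + crop_frames]
        if min_energy is None or crop.mean() >= min_energy:
            break
    return crop

rng = np.random.default_rng(0)
# toy log spectrogram: quiet (-20) everywhere except a loud event (0)
spec = np.full((201, 401), -20.0)
spec[:, 100:200] = 0.0

# 160 frames is about 2 seconds at 16 kHz with a hop of 200 samples
crop = random_time_crop(spec, crop_frames=160, rng=rng,
                        min_energy=-15.0, max_tries=100)
print(crop.shape)   # (201, 160)
print(crop.mean())  # at or above the threshold when a loud enough crop was found
```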

Using the entire waveform at once also means that the waveform statistics need to model a very long-term relationship. Going back to the cricket recording example: we would not expect good normalization statistics for the chirps to be the same as good statistics for the pauses in between chirps. To counter this it is possible to do a “short-time” normalization. Here we pick a sliding window, often much smaller than the whole waveform, and only normalize the data inside as it steps through the waveform. This “short-time” normalization can be applied with or without the global waveform normalization.
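As a rough sketch of the idea, the NumPy code below normalizes a waveform in non-overlapping blocks, which is the simplest form of a short-time normalization. A true sliding window with overlap would need extra care when recombining the windows. The window length and the synthetic "chirp" and "pause" signals are assumptions made for the example.

```python
import numpy as np

def short_time_normalize(audio, win=1600, eps=1e-8):
    """Normalize each block of `win` samples with its own mean and std."""
    out = np.empty(len(audio), dtype=np.float64)
    for start in range(0, len(audio), win):
        seg = audio[start:start + win]
        out[start:start + win] = (seg - seg.mean()) / (seg.std() + eps)
    return out

rng = np.random.default_rng(0)
loud = rng.standard_normal(8000)          # stand-in for a chirp
quiet = 0.01 * rng.standard_normal(8000)  # stand-in for a pause
audio = np.concatenate([loud, quiet])

# 1600 samples is 100 ms at 16 kHz
normed = short_time_normalize(audio, win=1600)

print(audio[:1600].std(), audio[-1600:].std())    # very different scales
print(normed[:1600].std(), normed[-1600:].std())  # both roughly 1
```

One side effect to keep in mind is that the quiet pauses are amplified to the same scale as the chirps, so background noise becomes as prominent as the signal.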

Furthermore, the spectrogram is a high-dimensional feature with 201 frequency bins. It is common in audio tasks to reduce this dimension by combining nearby bins. This is done with something called “filterbanks” which usually operate at the Mel frequency scale. This tutorial is one of my favorites and gives an incredibly clear description of Mel frequency and the filterbank process. There are other options such as Gammatone filterbanks as well. While this might seem like an expert handcrafted feature, there is good reason for using filterbanks in audio tasks. If we feed in a raw spectrogram, the early convolutional layers tend to learn something like a filterbank anyway! So directly feeding a filterbank into the network lets it focus on more complicated relationships. As a bonus, the channel-based normalization discussed here also works on filterbank features.
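To show what combining nearby bins looks like, here is a minimal NumPy sketch of a triangular Mel filterbank. It uses one common form of the Mel formula, \(m = 2595 \log_{10}(1 + f/700)\), and assumed values of 40 filters, an FFT size of 400, and a 16 kHz sampling rate. Library implementations differ in details such as filter normalization, so this is an illustration rather than a drop-in replacement.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=400, sample_rate=16_000):
    """Triangular Mel filters with shape [n_mels, n_fft // 2 + 1]."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_bins)
    # n_mels + 2 edge frequencies, evenly spaced on the Mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                  hz_to_mel(sample_rate / 2), n_mels + 2))
    fbank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)    # left slope of the triangle
        falling = (hi - fft_freqs) / (hi - center)   # right slope of the triangle
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

fbank = mel_filterbank()
rng = np.random.default_rng(0)
power = rng.random((201, 50))             # stand-in power spectrogram [bins, frames]
mel_spec = np.log(fbank @ power + 1e-10)  # combine bins first, then take the log

print(fbank.shape, mel_spec.shape)  # (40, 201) (40, 50)
```

The 201 frequency bins collapse into 40 Mel channels, and a per-channel normalization would then compute one mean and standard deviation for each of those 40 channels.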

We are also training a powerful 18-layer model from scratch with only 1600 images. While deep learning can handle datasets this small, it is usually only through Transfer Learning. But, we stayed away from Transfer Learning because pretrained networks are tightly coupled to their original dataset’s normalization statistics. And the main idea here was to learn our own spectrogram scalings. It is possible that a smaller, simpler network will perform better. Looking at the training logs above, it seems the validation loss was still decreasing. So we’d still have to train for longer to check if the network is actually overfitting and a simpler model is needed.

Lastly, there is no data augmentation even though it is almost the de facto standard when training CNNs. It is possible to use image augmentations (flips, rotations, etc.) even though they do not technically make sense on a spectrogram. It might be better to use augmentations directly inspired by signal processing like SpecAugment. By the way, SpecAugment is already included in fastaudio, along with many other waveform and spectrogram augmentations.
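The core of SpecAugment is masking out random bands of frequencies and time steps. The NumPy sketch below is a simplified take on that idea with hypothetical parameter names. It leaves out the time warping from the original method and is not the fastaudio implementation.

```python
import numpy as np

def mask_spectrogram(spec, rng, max_freq_width=20, max_time_width=40, fill=None):
    """Copy a [bins, frames] spectrogram and mask one frequency band and one time band."""
    out = spec.copy()
    fill = out.mean() if fill is None else fill
    n_bins, n_frames = out.shape

    # frequency mask: a horizontal band of random height and position
    f = rng.integers(0, max_freq_width + 1)
    f0 = rng.integers(0, n_bins - f + 1)
    out[f0:f0 + f, :] = fill

    # time mask: a vertical band of random width and position
    t = rng.integers(0, max_time_width + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out

rng = np.random.default_rng(0)
spec = rng.normal(size=(201, 401))  # stand-in for a normalized spectrogram
masked = mask_spectrogram(spec, rng)

print(masked.shape)            # (201, 401)
print((masked != spec).sum())  # number of masked values
```

Filling with the mean keeps the masked regions at a neutral level, which pairs naturally with spectrograms that were already normalized to zero mean.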

To recap, there are many good reasons why normalization only helped a little on the ESC-50 dataset. The points above described some possible next steps to increase performance.


In this post we saw how spectrograms are fundamentally different than natural images. We then explored two ways of normalizing spectrograms when training neural networks: global normalization and channel-based normalization.

Next we implemented these two normalization techniques and tested them against an unnormalized baseline on the ESC-50 dataset. Both normalizations showed a gain in performance, with global normalization outperforming channel-based normalization. We then offered some next steps that could further boost performance.

In the end, the choice of spectrogram normalization will depend on how the system is used. For example, if the system will be deployed in an environment similar to the training environment, then normalizing by spectrogram channels makes more sense. This is because the training statistics will be a good match for the similar patterns and distributions in the deployed environment. However, it is critical to monitor the system in this environment and update the statistics as needed to avoid shifting out of domain.

If the system will instead be used in a completely different environment, of which you have no knowledge, then the global statistics could be a better fit. While not as technically sound, the model will (hopefully) be less surprised by radically new activity across the channels.

To recap, there is no one universally correct way to normalize spectrograms for every audio task. Like many aspects of deep learning, the final choice will be experimental and based on the specifics of both the problem and domain.

I hope this post gave you an idea of how to normalize spectrograms. Even more so, I hope that it gave you new ideas to try out. The ESC-50 dataset is a great playground for any new ideas. Happy experimenting!
