Learn

The learn module includes classes that make it possible to define processing graphs whose leaves are trained machine learning models.

While much of zounds.soundfile, zounds.spectral, and zounds.timeseries focuses on processing nodes that can be composed into a graph to extract features from a single piece of audio, the learn module focuses on defining graphs that produce features or trained models from an entire corpus of audio.

PyTorch Modules

class zounds.learn.FilterBank(samplerate, kernel_size, scale, scaling_factors, normalize_filters=True, a_weighting=True)[source]

A torch module that convolves a 1D input signal with a bank of Morlet filters.

Parameters:
  • samplerate (SampleRate) – the samplerate of the input signal
  • kernel_size (int) – the length in samples of each filter
  • scale (FrequencyScale) – a scale whose center frequencies determine the fundamental frequency of each filter
  • scaling_factors (float or list of float) – scaling factors for each band, which determine the time-frequency resolution tradeoff. The value(s) should fall between 0 and 1, with smaller values achieving better frequency resolution and larger values better time resolution
  • normalize_filters (bool) – When true, ensure that each filter in the bank has unit norm
  • a_weighting (bool) – When true, apply a perceptually-motivated weighting of the filters
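
A minimal usage sketch for FilterBank follows; the scale, the scaling factor, and the (batch, 1, n_samples) input shape are illustrative assumptions rather than part of the documented signature.

import torch
import zounds

samplerate = zounds.SR11025()

# 128 bands spaced geometrically between 20Hz and 5kHz (illustrative values)
scale = zounds.GeometricScale(
    start_center_hz=20,
    stop_center_hz=5000,
    bandwidth_ratio=0.05,
    n_bands=128)

bank = zounds.learn.FilterBank(
    samplerate=samplerate,
    kernel_size=512,
    scale=scale,
    scaling_factors=0.25,
    normalize_filters=True,
    a_weighting=True)

# one second of random audio, assuming the module accepts batches of
# 1D signals shaped (batch, 1, n_samples)
signal = torch.randn(4, 1, 11025)
features = bank(signal)  # roughly (batch, n_bands, n_samples)
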
class zounds.learn.SincLayer(scale, taps, samplerate)[source]

A layer as described in the paper “Speaker Recognition from Raw Waveform with SincNet”.

From the paper’s abstract: “This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, that learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.”

https://arxiv.org/abs/1808.00158

Parameters:
  • scale (FrequencyScale) – A scale defining the initial bandpass filters
  • taps (int) – The length of the filter in samples
  • samplerate (SampleRate) – The sampling rate of incoming samples
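
A sketch of using SincLayer as the first layer of a small raw-waveform network; the scale, the assumption that the layer emits one output channel per band, and the layers that follow it are all illustrative.

import torch
from torch import nn
import zounds

samplerate = zounds.SR11025()
scale = zounds.MelScale(zounds.FrequencyBand(20, 5000), 128)

network = nn.Sequential(
    # learnable band-pass filters initialized from the scale
    zounds.learn.SincLayer(scale, taps=251, samplerate=samplerate),
    nn.ReLU(),
    # assumes SincLayer produces one channel per band in the scale
    nn.Conv1d(128, 64, kernel_size=3, stride=2),
    nn.ReLU())

batch = torch.randn(8, 1, 11025)  # assuming (batch, 1, n_samples) input
output = network(batch)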

The Basics

class zounds.learn.PreprocessingPipeline(needs=None)[source]

A PreprocessingPipeline is a node in the graph that can be connected to one or more Preprocessor nodes, whose output it will assemble into a re-usable pipeline.

Parameters:needs (list or tuple of Node) – the Preprocessor nodes on whose output this pipeline depends

Here’s an example of a learning pipeline that will first find the feature-wise mean and standard deviation of a dataset, and will then learn K-Means clusters from the dataset. This will result in a re-usable pipeline that can use statistics from the original dataset to normalize new examples, assign them to a cluster, and finally, reconstruct them.

import featureflow as ff
import zounds
from random import choice

samplerate = zounds.SR44100()
STFT = zounds.stft(resample_to=samplerate)


@zounds.simple_in_memory_settings
class Sound(STFT):
    bark = zounds.ArrayWithUnitsFeature(
        zounds.BarkBands,
        samplerate=samplerate,
        needs=STFT.fft,
        store=True)


@zounds.simple_in_memory_settings
class ExamplePipeline(ff.BaseModel):
    docs = ff.PickleFeature(
        ff.IteratorNode,
        needs=None)

    shuffled = ff.PickleFeature(
        zounds.ShuffledSamples,
        nsamples=100,
        needs=docs,
        store=False)

    meanstd = ff.PickleFeature(
        zounds.MeanStdNormalization,
        needs=shuffled,
        store=False)

    kmeans = ff.PickleFeature(
        zounds.KMeans,
        needs=meanstd,
        centroids=32)

    pipeline = ff.PickleFeature(
        zounds.PreprocessingPipeline,
        needs=(meanstd, kmeans),
        store=True)

# apply the Sound processing graph to individual audio files
for metadata in zounds.InternetArchive('TheR.H.SFXLibrary'):
    print('processing {url}'.format(url=metadata.request.url))
    Sound.process(meta=metadata)

# apply the ExamplePipeline processing graph to the entire corpus of audio
_id = ExamplePipeline.process(docs=(snd.bark for snd in Sound))
learned = ExamplePipeline(_id)

snd = choice(list(Sound))
result = learned.pipeline.transform(snd.bark)
print(result.data)  # the assigned centroids for each frame
inverted = result.inverse_transform()
print(inverted)  # the reconstructed frames
class zounds.learn.Pipeline(preprocess_results)[source]
class zounds.learn.Preprocessor(needs=None)[source]

Preprocessor is the common base class for nodes in a processing graph that produce PreprocessResult instances, which in turn become part of a Pipeline.

Parameters:needs (Node) – previous processing node(s) on which this one depends for its data
class zounds.learn.PreprocessResult(data, op, inversion_data=None, inverse=None, name=None)[source]

A PreprocessResult is the output of a Preprocessor node, and can participate in a Pipeline.

Parameters:
  • data – the data on which the node in the graph was originally trained
  • op (Op) – a callable that can transform data
  • inversion_data – data extracted in the forward pass of the model, that can be used to invert the result
  • inverse (Op) – a callable that, given the output of op and inversion_data, can invert the result
class zounds.learn.PipelineResult(data, processors, inversion_data, wrap_data)[source]

Custom Losses

class zounds.learn.PerceptualLoss(scale, samplerate, frequency_window=<window ndarray>, basis_size=512, lap=2, log_factor=100, frequency_weighting=None, cosine_similarity=True)[source]

PerceptualLoss computes loss/distance in a feature space that roughly approximates the early stages of the human auditory processing pipeline, rather than computing a raw sample-wise loss. It decomposes a 1D (audio) signal into frequency bands using an FIR filter bank whose center frequencies are determined by a user-defined scale, performs half-wave rectification, places amplitudes on a log scale, and finally, optionally, applies a re-weighting of the frequency bands.

Parameters:
  • scale (FrequencyScale) – a scale defining frequencies at which the FIR filters will be centered
  • samplerate (SampleRate) – samplerate needed to construct the FIR filter bank
  • frequency_window (ndarray) – window determining how narrow or wide filter responses should be
  • basis_size (int) – The kernel size, or number of “taps” for each filter
  • lap (int) – The filter stride
  • log_factor (int) – How much compression should be applied in the log amplitude stage
  • frequency_weighting (FrequencyWeighting) – an optional frequency weighting to be applied after log amplitude scaling
  • cosine_similarity (bool) – If True, compute the cosine similarity between spectrograms, otherwise, compute the mean squared error
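
A construction sketch for PerceptualLoss follows; the scale, the input shapes, and the assumption that the loss is called as loss(input, target) on (batch, 1, n_samples) tensors are illustrative rather than documented.

import torch
import zounds

samplerate = zounds.SR11025()
scale = zounds.MelScale(zounds.FrequencyBand(20, 5000), 64)

loss = zounds.learn.PerceptualLoss(
    scale,
    samplerate,
    basis_size=512,
    lap=2,
    log_factor=100,
    frequency_weighting=zounds.AWeighting(),
    cosine_similarity=True)

# compare a reconstruction against a target in the perceptual feature space
target = torch.randn(4, 1, 8192)
recon = torch.randn(4, 1, 8192, requires_grad=True)
error = loss(recon, target)
error.backward()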

Data Preparation

class zounds.learn.UnitNorm(needs=None)[source]
class zounds.learn.MuLawCompressed(needs=None)[source]
class zounds.learn.MeanStdNormalization(needs=None)[source]
class zounds.learn.InstanceScaling(max_value=1, needs=None)[source]
class zounds.learn.Weighted(weighting, needs=None)[source]
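
These nodes follow the same pattern as MeanStdNormalization in the K-Means example above. Here is a sketch of a pipeline that mu-law compresses shuffled samples and then scales each example independently; the surrounding graph mirrors the earlier example and is an assumption, not a prescribed recipe.

import featureflow as ff
import zounds


@zounds.simple_in_memory_settings
class AudioPrepPipeline(ff.BaseModel):
    docs = ff.PickleFeature(
        ff.IteratorNode,
        needs=None)

    shuffled = ff.PickleFeature(
        zounds.ShuffledSamples,
        nsamples=100,
        needs=docs,
        store=False)

    mu_law = ff.PickleFeature(
        zounds.MuLawCompressed,
        needs=shuffled,
        store=False)

    scaled = ff.PickleFeature(
        zounds.InstanceScaling,
        needs=mu_law,
        store=False)

    pipeline = ff.PickleFeature(
        zounds.PreprocessingPipeline,
        needs=(mu_law, scaled),
        store=True)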

Sampling

class zounds.learn.ShuffledSamples(nsamples=None, multiplexed=False, dtype=None, needs=None)[source]

Machine Learning Models

class zounds.learn.KMeans(centroids=None, needs=None)[source]
class zounds.learn.SklearnModel(model=None, needs=None)[source]
class zounds.learn.PyTorchNetwork(trainer=None, post_training_func=None, needs=None, training_set_prep=None, chunksize=None)[source]
class zounds.learn.PyTorchGan(apply_network='generator', trainer=None, needs=None)[source]
class zounds.learn.PyTorchAutoEncoder(trainer=None, needs=None)[source]
class zounds.learn.SupervisedTrainer(model, loss, optimizer, epochs, batch_size, holdout_percent=0.0, data_preprocessor=<function SupervisedTrainer.<lambda>>, label_preprocessor=<function SupervisedTrainer.<lambda>>, checkpoint_epochs=1)[source]
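
A construction sketch for SupervisedTrainer using the documented parameters; the network is a toy model, and whether the optimizer argument is an optimizer instance or a factory receiving the model is an assumption to verify against your zounds version.

from torch import nn
from torch.optim import Adam
import zounds

# a toy fully-connected network; purely illustrative
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 64))

trainer = zounds.learn.SupervisedTrainer(
    model=model,
    loss=nn.MSELoss(),
    # assumed here to be a factory that receives the model; pass an
    # optimizer instance instead if that is what your version expects
    optimizer=lambda model: Adam(model.parameters(), lr=1e-4),
    epochs=10,
    batch_size=32,
    holdout_percent=0.1)

A trainer like this would then typically be passed as the trainer argument of a PyTorchNetwork node inside a PreprocessingPipeline, analogous to the K-Means example above.
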
class zounds.learn.TripletEmbeddingTrainer(network, epochs, batch_size, anchor_slice, deformations=None, checkpoint_epochs=1)[source]

Learn an embedding by applying a triplet loss to anchor examples, positive examples (deformed or temporally adjacent versions of the anchor), and negative examples.

Parameters:
  • network (nn.Module) – the neural network to train
  • epochs (int) – the desired number of passes over the entire dataset
  • batch_size (int) – the number of examples in each minibatch
  • anchor_slice (slice) – the slice of each minibatch example that constitutes the anchor. Because choosing a segment temporally adjacent to the anchor is one way to derive a positive example, minibatch examples are generally longer (in time) than the examples fed to the network, so that adjacent segments can be drawn from them
  • deformations (callable) – a collection of other deformations or transformations that can be applied to anchor examples to derive positive examples. These callables should take two arguments: the anchor examples from the minibatch, as well as the “wider” minibatch examples that include temporally adjacent events
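
A construction sketch for TripletEmbeddingTrainer using the documented parameters; the network, the deformation, and the assumption that minibatch examples are 16384 samples long (with the first 8192 samples acting as the anchor) are all illustrative.

import torch
from torch import nn
import zounds

# a toy embedding network operating on raw samples; purely illustrative
network = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=16, stride=8),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 32))


# an example deformation: derive a positive example by adding a little noise
# to the anchor; it receives both the anchor and the wider minibatch examples
def add_noise(anchor, wider_examples):
    return anchor + 0.01 * torch.randn_like(anchor)


trainer = zounds.learn.TripletEmbeddingTrainer(
    network=network,
    epochs=25,
    batch_size=32,
    anchor_slice=slice(0, 8192),
    deformations=[add_noise])
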
class zounds.learn.WassersteinGanTrainer(network, latent_dimension, n_critic_iterations, epochs, batch_size, preprocess_minibatch=None, kwargs_factory=None, debug_gradient=False, checkpoint_epochs=1)[source]
Parameters:
  • network (nn.Module) – the network to train
  • latent_dimension (tuple) – A tuple that defines the shape of the latent dimension (noise) that is the generator’s input
  • n_critic_iterations (int) – The number of minibatches the critic sees for every minibatch the generator sees
  • epochs – The total number of passes over the training set
  • batch_size – The size of a minibatch
  • preprocess_minibatch (function) – a function that takes the current epoch and a minibatch, and mutates the minibatch
  • kwargs_factory (callable) – a callable that takes the current epoch and produces keyword arguments to pass to the generator and discriminator
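
A construction sketch for WassersteinGanTrainer using only the documented parameters. The toy network below bundles a generator and a critic as sub-modules, which is an assumption (suggested by PyTorchGan's apply_network='generator' default) about how the trainer addresses the two networks; treat it as the shape of the call rather than a working GAN.

from torch import nn
import zounds

latent_dim = 128


class ToyGan(nn.Module):
    def __init__(self):
        super(ToyGan, self).__init__()
        # maps latent noise vectors to fake "samples"
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 512))
        # scores samples as real or fake
        self.discriminator = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))


trainer = zounds.learn.WassersteinGanTrainer(
    network=ToyGan(),
    latent_dimension=(latent_dim,),
    n_critic_iterations=5,
    epochs=100,
    batch_size=32)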

Hashing

class zounds.learn.SimHash(bits=None, packbits=False, needs=None)[source]

Hash feature vectors by computing which side of each of N hyperplanes those features fall on.

Parameters:
  • bits (int) – The number of hyperplanes, and hence, the number of bits in the resulting hash
  • packbits (bool) – Should the result be bit-packed?
  • needs (Preprocessor) – the processing node on which this node relies for its data
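
The underlying idea is random-projection hashing. The sketch below is a plain NumPy illustration of that idea, not the zounds node's internal implementation.

import numpy as np


def simhash(features, n_bits=64, seed=0):
    """Assign each feature vector a hash based on which side of n_bits
    random hyperplanes it falls."""
    rng = np.random.RandomState(seed)
    # one random hyperplane (normal vector) per bit
    planes = rng.normal(0, 1, (features.shape[1], n_bits))
    # the sign of the projection onto each normal determines each bit
    return (features @ planes) >= 0


features = np.random.normal(0, 1, (10, 128))
bits = simhash(features)             # shape (10, 64), dtype bool
packed = np.packbits(bits, axis=-1)  # analogous to packbits=True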

Learned Models in Audio Processing Graphs

class zounds.learn.Learned(learned=None, version=None, wrapper=None, pipeline_func=None, needs=None, dtype=None)[source]
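
Continuing the K-Means example above, here is a sketch of how a stored pipeline might be wired back into an audio processing graph via a Learned node; the WithCodes class and the decision to store the feature are assumptions layered on top of that example.

import zounds

# `learned` is the trained ExamplePipeline instance from the K-Means example
learned = ExamplePipeline(_id)


# assuming the subclass inherits Sound's storage settings
class WithCodes(Sound):
    # run each document's bark bands through the stored pipeline and
    # persist the resulting cluster assignments alongside the raw features
    bark_kmeans = zounds.ArrayWithUnitsFeature(
        zounds.learn.Learned,
        learned=learned,
        needs=Sound.bark,
        store=True)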