Learn

The learn module includes classes that make it possible to define processing graphs whose leaves are trained machine learning models.

While much of zounds.soundfile, zounds.spectral, and zounds.timeseries focuses on processing nodes that can be composed into a graph to extract features from a single piece of audio, the learn module focuses on defining graphs that produce features or trained models from an entire corpus of audio.

PyTorch Modules

class zounds.learn.FilterBank(samplerate, kernel_size, scale, scaling_factors, normalize_filters=True, a_weighting=True)[source]

A torch module that convolves a 1D input signal with a bank of Morlet filters.

Parameters:
  • samplerate (SampleRate) – the samplerate of the input signal
  • kernel_size (int) – the length in samples of each filter
  • scale (FrequencyScale) – a scale whose center frequencies determine the fundamental frequency of each filter
  • scaling_factors (float or list of float) – scaling factors for each band, which determine the time-frequency resolution tradeoff. The value(s) should fall between 0 and 1, with smaller values achieving better frequency resolution and larger values better time resolution
  • normalize_filters (bool) – When true, ensure that each filter in the bank has unit norm
  • a_weighting (bool) – When true, apply a perceptually-motivated weighting of the filters
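
A minimal usage sketch for FilterBank follows; the scale, the scaling factor, and the (batch, 1, n_samples) input shape are illustrative assumptions rather than part of the documented signature.

import torch
import zounds

samplerate = zounds.SR11025()

# 128 bands spaced geometrically between 20Hz and 5kHz (illustrative values)
scale = zounds.GeometricScale(
    start_center_hz=20,
    stop_center_hz=5000,
    bandwidth_ratio=0.05,
    n_bands=128)

bank = zounds.learn.FilterBank(
    samplerate=samplerate,
    kernel_size=512,
    scale=scale,
    scaling_factors=0.25,
    normalize_filters=True,
    a_weighting=True)

# one second of random audio, assuming the module accepts batches of
# 1D signals shaped (batch, 1, n_samples)
signal = torch.randn(4, 1, 11025)
features = bank(signal)  # roughly (batch, n_bands, n_samples)
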
class zounds.learn.SincLayer(scale, taps, samplerate)[source]

A layer as described in the paper “Speaker Recognition from Raw Waveform with SincNet”.

From the paper’s abstract: “This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, that learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.”

https://arxiv.org/abs/1808.00158

Parameters:
  • scale (FrequencyScale) – A scale defining the initial bandpass filters
  • taps (int) – The length of the filter in samples
  • samplerate (SampleRate) – The sampling rate of incoming samples
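
A sketch of using SincLayer as the first layer of a small raw-waveform network; the scale, the assumption that the layer emits one output channel per band, and the layers that follow it are all illustrative.

import torch
from torch import nn
import zounds

samplerate = zounds.SR11025()
scale = zounds.MelScale(zounds.FrequencyBand(20, 5000), 128)

network = nn.Sequential(
    # learnable band-pass filters initialized from the scale
    zounds.learn.SincLayer(scale, taps=251, samplerate=samplerate),
    nn.ReLU(),
    # assumes SincLayer produces one channel per band in the scale
    nn.Conv1d(128, 64, kernel_size=3, stride=2),
    nn.ReLU())

batch = torch.randn(8, 1, 11025)  # assuming (batch, 1, n_samples) input
output = network(batch)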

The Basics

class zounds.learn.PreprocessingPipeline(needs=None)[source]

A PreprocessingPipeline is a node in the graph that can be connected to one or more Preprocessor nodes, whose output it will assemble into a re-usable pipeline.

Parameters:needs (list or tuple of Node) – the Preprocessor nodes on whose output this pipeline depends

Here’s an example of a learning pipeline that will first find the feature-wise mean and standard deviation of a dataset, and will then learn K-Means clusters from the dataset. This will result in a re-usable pipeline that can use statistics from the original dataset to normalize new examples, assign them to a cluster, and finally, reconstruct them.

import featureflow as ff
import zounds
from random import choice

samplerate = zounds.SR44100()
STFT = zounds.stft(resample_to=samplerate)


@zounds.simple_in_memory_settings
class Sound(STFT):
    bark = zounds.ArrayWithUnitsFeature(
        zounds.BarkBands,
        samplerate=samplerate,
        needs=STFT.fft,
        store=True)


@zounds.simple_in_memory_settings
class ExamplePipeline(ff.BaseModel):
    docs = ff.PickleFeature(
        ff.IteratorNode,
        needs=None)

    shuffled = ff.PickleFeature(
        zounds.ShuffledSamples,
        nsamples=100,
        needs=docs,
        store=False)

    meanstd = ff.PickleFeature(
        zounds.MeanStdNormalization,
        needs=shuffled,
        store=False)

    kmeans = ff.PickleFeature(
        zounds.KMeans,
        needs=meanstd,
        centroids=32)

    pipeline = ff.PickleFeature(
        zounds.PreprocessingPipeline,
        needs=(meanstd, kmeans),
        store=True)

# apply the Sound processing graph to individual audio files
for metadata in zounds.InternetArchive('TheR.H.SFXLibrary'):
    print('processing {url}'.format(url=metadata.request.url))
    Sound.process(meta=metadata)

# apply the ExamplePipeline processing graph to the entire corpus of audio
_id = ExamplePipeline.process(docs=(snd.bark for snd in Sound))
learned = ExamplePipeline(_id)

snd = choice(list(Sound))
result = learned.pipeline.transform(snd.bark)
print(result.data)  # the assigned centroids for each frame
inverted = result.inverse_transform()
print(inverted)  # the reconstructed frames
class zounds.learn.Pipeline(preprocess_results)[source]
class zounds.learn.Preprocessor(needs=None)[source]

Preprocessor is the common base class for nodes in a processing graph that produce PreprocessResult instances, which in turn become part of a Pipeline.

Parameters:needs (Node) – previous processing node(s) on which this one depends for its data
class zounds.learn.PreprocessResult(data, op, inversion_data=None, inverse=None, name=None)[source]

A PreprocessResult is the output of a Preprocessor node, and can participate in a Pipeline.

Parameters:
  • data – the data on which the node in the graph was originally trained
  • op (Op) – a callable that can transform data
  • inversion_data – data extracted in the forward pass of the model, that can be used to invert the result
  • inverse (Op) – a callable that, given the output of op and inversion_data, can invert the result
class zounds.learn.PipelineResult(data, processors, inversion_data, wrap_data)[source]

Custom Losses

class zounds.learn.PerceptualLoss(scale, samplerate, frequency_window=<window ndarray>, basis_size=512, lap=2, log_factor=100, frequency_weighting=None, cosine_similarity=True)[source]

PerceptualLoss computes loss/distance in a feature space that roughly approximates the early stages of the human auditory processing pipeline, rather than computing a raw sample-wise loss. It decomposes a 1D (audio) signal into frequency bands using an FIR filter bank whose center frequencies are determined by a user-defined scale, performs half-wave rectification, places amplitudes on a log scale, and finally, optionally, applies a re-weighting of the frequency bands.

Parameters:
  • scale (FrequencyScale) – a scale defining frequencies at which the FIR filters will be centered
  • samplerate (SampleRate) – samplerate needed to construct the FIR filter bank
  • frequency_window (ndarray) – window determining how narrow or wide filter responses should be
  • basis_size (int) – The kernel size, or number of “taps” for each filter
  • lap (int) – The filter stride
  • log_factor (int) – How much compression should be applied in the log amplitude stage
  • frequency_weighting (FrequencyWeighting) – an optional frequency weighting to be applied after log amplitude scaling
  • cosine_similarity (bool) – If True, compute the cosine similarity between spectrograms, otherwise, compute the mean squared error
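
A construction sketch for PerceptualLoss follows; the scale, the input shapes, and the assumption that the loss is called as loss(input, target) on (batch, 1, n_samples) tensors are illustrative rather than documented.

import torch
import zounds

samplerate = zounds.SR11025()
scale = zounds.MelScale(zounds.FrequencyBand(20, 5000), 64)

loss = zounds.learn.PerceptualLoss(
    scale,
    samplerate,
    basis_size=512,
    lap=2,
    log_factor=100,
    frequency_weighting=zounds.AWeighting(),
    cosine_similarity=True)

# compare a reconstruction against a target in the perceptual feature space
target = torch.randn(4, 1, 8192)
recon = torch.randn(4, 1, 8192, requires_grad=True)
error = loss(recon, target)
error.backward()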

Data Preparation

class zounds.learn.UnitNorm(needs=None)[source]
class zounds.learn.MuLawCompressed(needs=None)[source]
class zounds.learn.MeanStdNormalization(needs=None)[source]
class zounds.learn.InstanceScaling(max_value=1, needs=None)[source]
class zounds.learn.Weighted(weighting, needs=None)[source]
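
These nodes follow the same pattern as MeanStdNormalization in the K-Means example above. Here is a sketch of a pipeline that mu-law compresses shuffled samples and then scales each example independently; the surrounding graph mirrors the earlier example and is an assumption, not a prescribed recipe.

import featureflow as ff
import zounds


@zounds.simple_in_memory_settings
class AudioPrepPipeline(ff.BaseModel):
    docs = ff.PickleFeature(
        ff.IteratorNode,
        needs=None)

    shuffled = ff.PickleFeature(
        zounds.ShuffledSamples,
        nsamples=100,
        needs=docs,
        store=False)

    mu_law = ff.PickleFeature(
        zounds.MuLawCompressed,
        needs=shuffled,
        store=False)

    scaled = ff.PickleFeature(
        zounds.InstanceScaling,
        needs=mu_law,
        store=False)

    pipeline = ff.PickleFeature(
        zounds.PreprocessingPipeline,
        needs=(mu_law, scaled),
        store=True)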

Sampling

class zounds.learn.ShuffledSamples(nsamples=None, multiplexed=False, dtype=None, needs=None)[source]

Machine Learning Models

class zounds.learn.KMeans(centroids=None, needs=None)[source]
class zounds.learn.SklearnModel(model=None, needs=None)[source]
class zounds.learn.PyTorchNetwork(trainer=None, post_training_func=None, needs=None, training_set_prep=None, chunksize=None)[source]
class zounds.learn.PyTorchGan(apply_network='generator', trainer=None, needs=None)[source]
class zounds.learn.PyTorchAutoEncoder(trainer=None, needs=None)[source]
class zounds.learn.SupervisedTrainer(model, loss, optimizer, epochs, batch_size, holdout_percent=0.0, data_preprocessor=<function SupervisedTrainer.<lambda>>, label_preprocessor=<function SupervisedTrainer.<lambda>>, checkpoint_epochs=1)[source]
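
A construction sketch for SupervisedTrainer using the documented parameters; the network is a toy model, and whether the optimizer argument is an optimizer instance or a factory receiving the model is an assumption to verify against your zounds version.

from torch import nn
from torch.optim import Adam
import zounds

# a toy fully-connected network; purely illustrative
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 64))

trainer = zounds.learn.SupervisedTrainer(
    model=model,
    loss=nn.MSELoss(),
    # assumed here to be a factory that receives the model; pass an
    # optimizer instance instead if that is what your version expects
    optimizer=lambda model: Adam(model.parameters(), lr=1e-4),
    epochs=10,
    batch_size=32,
    holdout_percent=0.1)

A trainer like this would then typically be passed as the trainer argument of a PyTorchNetwork node inside a PreprocessingPipeline, analogous to the K-Means example above.
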
class zounds.learn.TripletEmbeddingTrainer(network, epochs, batch_size, anchor_slice, deformations=None, checkpoint_epochs=1)[source]

Learn an embedding by applying a triplet loss to anchor examples, positive examples (deformed or temporally adjacent versions of the anchor), and negative examples.

Parameters:
  • network (nn.Module) – the neural network to train
  • epochs (int) – the desired number of passes over the entire dataset
  • batch_size (int) – the number of examples in each minibatch
  • anchor_slice (slice) – the slice of each minibatch example that constitutes the anchor. Because choosing a segment temporally adjacent to the anchor is one way to derive a positive example, minibatch examples are generally longer (in time) than the examples fed to the network, so that adjacent segments can be drawn from them
  • deformations (callable) – a collection of other deformations or transformations that can be applied to anchor examples to derive positive examples. These callables should take two arguments: the anchor examples from the minibatch, as well as the “wider” minibatch examples that include temporally adjacent events
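
A construction sketch for TripletEmbeddingTrainer using the documented parameters; the network, the deformation, and the assumption that minibatch examples are 16384 samples long (with the first 8192 samples acting as the anchor) are all illustrative.

import torch
from torch import nn
import zounds

# a toy embedding network operating on raw samples; purely illustrative
network = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=16, stride=8),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 32))


# an example deformation: derive a positive example by adding a little noise
# to the anchor; it receives both the anchor and the wider minibatch examples
def add_noise(anchor, wider_examples):
    return anchor + 0.01 * torch.randn_like(anchor)


trainer = zounds.learn.TripletEmbeddingTrainer(
    network=network,
    epochs=25,
    batch_size=32,
    anchor_slice=slice(0, 8192),
    deformations=[add_noise])
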
class zounds.learn.WassersteinGanTrainer(network, latent_dimension, n_critic_iterations, epochs, batch_size, preprocess_minibatch=None, kwargs_factory=None, debug_gradient=False, checkpoint_epochs=1)[source]
Parameters:
  • network (nn.Module) – the network to train
  • latent_dimension (tuple) – A tuple that defines the shape of the latent dimension (noise) that is the generator’s input
  • n_critic_iterations (int) – The number of minibatches the critic sees for every minibatch the generator sees
  • epochs – The total number of passes over the training set
  • batch_size – The size of a minibatch
  • preprocess_minibatch (function) – a function that takes the current epoch and a minibatch, and mutates the minibatch
  • kwargs_factory (callable) – a callable that takes the current epoch and produces keyword arguments to pass to the generator and discriminator
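
A construction sketch for WassersteinGanTrainer using only the documented parameters. The toy network below bundles a generator and a critic as sub-modules, which is an assumption (suggested by PyTorchGan's apply_network='generator' default) about how the trainer addresses the two networks; treat it as the shape of the call rather than a working GAN.

from torch import nn
import zounds

latent_dim = 128


class ToyGan(nn.Module):
    def __init__(self):
        super(ToyGan, self).__init__()
        # maps latent noise vectors to fake "samples"
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 512))
        # scores samples as real or fake
        self.discriminator = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))


trainer = zounds.learn.WassersteinGanTrainer(
    network=ToyGan(),
    latent_dimension=(latent_dim,),
    n_critic_iterations=5,
    epochs=100,
    batch_size=32)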

Hashing

class zounds.learn.SimHash(bits=None, packbits=False, needs=None)[source]

Hash feature vectors by computing which side of each of N hyperplanes those features fall on.

Parameters:
  • bits (int) – The number of hyperplanes, and hence, the number of bits in the resulting hash
  • packbits (bool) – Should the result be bit-packed?
  • needs (Preprocessor) – the processing node on which this node relies for its data
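
The underlying idea is random-projection hashing. The sketch below is a plain NumPy illustration of that idea, not the zounds node's internal implementation.

import numpy as np


def simhash(features, n_bits=64, seed=0):
    """Assign each feature vector a hash based on which side of n_bits
    random hyperplanes it falls."""
    rng = np.random.RandomState(seed)
    # one random hyperplane (normal vector) per bit
    planes = rng.normal(0, 1, (features.shape[1], n_bits))
    # the sign of the projection onto each normal determines each bit
    return (features @ planes) >= 0


features = np.random.normal(0, 1, (10, 128))
bits = simhash(features)             # shape (10, 64), dtype bool
packed = np.packbits(bits, axis=-1)  # analogous to packbits=True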

Learned Models in Audio Processing Graphs

class zounds.learn.Learned(learned=None, version=None, wrapper=None, pipeline_func=None, needs=None, dtype=None)[source]
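
Continuing the K-Means example above, here is a sketch of how a stored pipeline might be wired back into an audio processing graph via a Learned node; the WithCodes class and the decision to store the feature are assumptions layered on top of that example.

import zounds

# `learned` is the trained ExamplePipeline instance from the K-Means example
learned = ExamplePipeline(_id)


# assuming the subclass inherits Sound's storage settings
class WithCodes(Sound):
    # run each document's bark bands through the stored pipeline and
    # persist the resulting cluster assignments alongside the raw features
    bark_kmeans = zounds.ArrayWithUnitsFeature(
        zounds.learn.Learned,
        learned=learned,
        needs=Sound.bark,
        store=True)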